Show HN: NanoEuler – GPT-2 scale model in pure C/CUDA from scratch
Key takeaways
- A GPT-2-class language model built entirely from scratch in C/CUDA: no PyTorch, no autograd, no ML libraries.
- This is a research/educational artifact, built in public.
- A residual block computes x = x + f(x); read it as one step of numerical integration.
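To make the last takeaway concrete, here is a minimal C sketch (not code from the repo) of why x = x + f(x) looks like an integration step. A forward-Euler step for dx/dt = f(x) is x + h*f(x), and a residual update is the same expression with h = 1. The names `decay_field`, `euler_step`, and `residual_block` are hypothetical, and `decay_field` (f(x) = -x) is only a stand-in for a learned block.

```c
#include <assert.h>
#include <stddef.h>

/* Signature of the function f in dx/dt = f(x): reads x, writes f(x) to out. */
typedef void (*field_fn)(const double *x, double *out, size_t n);

/* Hypothetical stand-in for a learned block: f(x) = -x (exponential decay). */
void decay_field(const double *x, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = -x[i];
    }
}

/* One forward-Euler step of size h, in place: x <- x + h * f(x).
 * scratch must hold n doubles and receives f(x). */
void euler_step(double *x, double *scratch, size_t n, double h, field_fn f) {
    assert(x != NULL && scratch != NULL && f != NULL);
    f(x, scratch, n);
    for (size_t i = 0; i < n; i++) {
        x[i] += h * scratch[i];
    }
}

/* A residual update x <- x + f(x) is the same step with h fixed at 1. */
void residual_block(double *x, double *scratch, size_t n, field_fn f) {
    euler_step(x, scratch, n, 1.0, f);
}
```

Under this reading, stacking residual blocks corresponds to taking several Euler steps in a row, each with its own f.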
A GPT-2-class language model built entirely from scratch in C/CUDA: no PyTorch, no autograd, no ML libraries. The forward and backward passes are written and verified by hand, and the whole training pipeline lives in this repo: a hand-written byte-level BPE tokenizer, pretraining on a books + web corpus, and supervised fine-tuning into a chat model (RLHF/DPO planned). It runs on CPU (libm + OpenMP) for a small showcase model, and a full from-scratch CUDA engine (cuBLAS matmuls, a hand-written FlashAttention, validated against a CPU reference by a full-model gradient check) trains a ~116M-parameter model on a single RTX 4070.
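The idea behind verifying a hand-written backward pass with a gradient check can be sketched on a toy model. The code below is an illustration, not the repo's `make check`: it compares hand-derived gradients against central finite differences in double precision and reports the worst relative error. Everything in it is an assumption made for the example: the names `toy_loss`, `toy_backward`, and `grad_check`, the two-parameter model, and the rational activation y = z / (1 + z*z), chosen so the sketch needs no math library.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARAMS 2  /* toy model parameters: p[0] = w, p[1] = b */

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Forward pass: z = w*x + b, y = z / (1 + z^2), loss = 0.5 * (y - t)^2. */
double toy_loss(const double *p, double x, double t) {
    double z = p[0] * x + p[1];
    double y = z / (1.0 + z * z);
    return 0.5 * (y - t) * (y - t);
}

/* Hand-written backward pass: fills grad with dloss/dw and dloss/db. */
void toy_backward(const double *p, double x, double t, double *grad) {
    double z = p[0] * x + p[1];
    double d = 1.0 + z * z;
    double y = z / d;
    double dy_dz = (1.0 - z * z) / (d * d);  /* derivative of z / (1 + z^2) */
    double dl_dz = (y - t) * dy_dz;          /* chain rule through the loss */
    grad[0] = dl_dz * x;
    grad[1] = dl_dz;
}

/* Returns the largest relative error between the analytic gradient and a
 * central-difference estimate. p is perturbed in place and then restored. */
double grad_check(double *p, double x, double t, double eps) {
    assert(p != NULL && eps > 0.0);
    double analytic[N_PARAMS];
    toy_backward(p, x, t, analytic);

    double worst = 0.0;
    for (size_t i = 0; i < N_PARAMS; i++) {
        double saved = p[i];
        p[i] = saved + eps;
        double loss_plus = toy_loss(p, x, t);
        p[i] = saved - eps;
        double loss_minus = toy_loss(p, x, t);
        p[i] = saved;

        double numeric = (loss_plus - loss_minus) / (2.0 * eps);
        double scale = abs_d(numeric) + abs_d(analytic[i]);
        if (scale < 1e-12) scale = 1e-12;  /* avoid dividing by zero */
        double rel = abs_d(numeric - analytic[i]) / scale;
        if (rel > worst) worst = rel;
    }
    return worst;
}
```

A call such as `grad_check(p, 1.5, 0.2, 1e-5)` with `p = {0.3, -0.1}` should return a very small relative error if the backward pass is right. A full-model check would apply the same comparison to every parameter tensor.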
Status & honesty. This is a research/educational artifact, built in public. At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant: the chat model demonstrates that the pretrain→SFT pipeline works end to end, but it is not a useful chatbot. The point of the project is the from-scratch engineering and the complete, understandable training pipeline.
    make check              # verify the backward pass (gradient check, double precision)
    make                    # build the training binary
    ./nanoeuler train       # train the small showcase model (~0.76M params)
    ./nanoeuler train big   # train the larger model (~10M params; meant for a GPU)
    ./nanoeuler chat        # REPL: type a prompt, the model continues it

Why "Euler"?
A residual block computes