LLM from scratch: the paper trail
Starting point
I want to build an LLM from scratch, following Sebastian Raschka's Build a Large Language Model (From Scratch) but reimplementing everything in Rust with Candle. The exercise is about learning how an LLM actually works. I also want to map the intellectual lineage: where did each piece of the modern transformer come from, and what's the thread that connects them?
The deep timeline
Most explanations start with "Attention Is All You Need" (2017) as if the transformer appeared from nowhere. It didn't. Nearly every component has a separate origin story, and the progression tells you something about why each piece exists.
Information theory and statistical language models (1948--1951)
Shannon's "A Mathematical Theory of Communication" (1948) introduced the idea of modelling language as a stochastic process. His follow-up, "Prediction and Entropy of Printed English" (1950), estimated the entropy of English text and built simple n-gram models to do it. The bigram model (predict the next word from the current word alone) is Shannon's construction. Every LLM is still doing what Shannon described: estimating the probability distribution over next tokens given context.
- Shannon, C.E. "A Mathematical Theory of Communication." Bell System Technical Journal, 1948.
- Shannon, C.E. "Prediction and Entropy of Printed English." Bell System Technical Journal, 1950.
Backpropagation (1986)
The algorithm that makes training neural networks practical. Rumelhart, Hinton, and Williams didn't invent the chain rule, but they showed how to apply it efficiently to multi-layer networks. Without it, nothing after 1986 in this timeline happens.
- Rumelhart, D.E., Hinton, G.E., Williams, R.J. "Learning representations by back-propagating errors." Nature, 1986.
BPE as a compression algorithm (1994)
Philip Gage published Byte Pair Encoding as a data compression technique in C Users Journal. The idea: start with individual bytes, iteratively merge the most frequent adjacent pair into a new symbol, repeat. It sat in the compression literature for over twenty years before anyone thought to use it for tokenisation.
- Gage, P. "A New Algorithm for Data Compression." C Users Journal, 1994.
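The core loop is small enough to sketch directly. Here is one merge step, unoptimised (function names are mine; real implementations record merges as vocabulary rules rather than rewriting the whole sequence):

```rust
use std::collections::HashMap;

/// One BPE training step: find the most frequent adjacent pair
/// and merge every occurrence into a single new symbol.
fn merge_most_frequent(tokens: Vec<String>) -> Vec<String> {
    // Count adjacent pairs.
    let mut counts: HashMap<(String, String), u32> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts.entry((pair[0].clone(), pair[1].clone())).or_default() += 1;
    }
    // Pick the most frequent pair (ties broken arbitrarily here).
    let Some(((a, b), _)) = counts.into_iter().max_by_key(|(_, n)| *n) else {
        return tokens;
    };
    // Replace every occurrence of the pair with the merged symbol.
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
            out.push(format!("{a}{b}"));
            i += 2;
        } else {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}

fn main() {
    // Start from individual characters; repeated merges grow a vocabulary.
    let mut toks: Vec<String> = "abababc".chars().map(String::from).collect();
    for _ in 0..2 {
        toks = merge_most_frequent(toks);
    }
    println!("{toks:?}"); // ["abab", "ab", "c"]
}
```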
LSTMs and the vanishing gradient problem (1997)
Hochreiter and Schmidhuber introduced Long Short-Term Memory networks to solve the vanishing gradient problem in recurrent neural networks. RNNs in theory can handle sequences of any length, but in practice gradients either vanish or explode over long sequences. LSTMs added gating mechanisms (forget gate, input gate, output gate) that let information persist across many timesteps. This was the dominant approach to sequence modelling for nearly two decades, and understanding why it was eventually superseded by attention is part of the story.
- Hochreiter, S., Schmidhuber, J. "Long Short-Term Memory." Neural Computation, 1997.
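For reference, the LSTM cell as it's usually written today (σ is the logistic sigmoid, ⊙ elementwise multiplication; strictly, the forget gate f_t was a slightly later addition by Gers et al., but it's standard in every modern formulation). The additive update to the cell state c_t is the point: gradients can flow through it without repeatedly passing through squashing nonlinearities.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```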
Word2vec and the embedding insight (2013)
Mikolov et al. showed that you could train a simple neural network to predict words from context (or context from words) and the learned weight matrix would capture semantic relationships as vector arithmetic. "king - man + woman = queen" became the famous example. The key insight for the LLM story: words can be represented as dense vectors in a continuous space, and proximity in that space reflects meaning. This is the conceptual ancestor of every embedding layer in every transformer.
Two architectures: Skip-gram (predict context from word) and CBOW (predict word from context).
- Mikolov, T. et al. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781, 2013.
- Mikolov, T. et al. "Distributed Representations of Words and Phrases and their Compositionality." NeurIPS, 2013.
GloVe (2014)
Pennington et al. at Stanford took a different approach: instead of predicting context words, train on the global co-occurrence matrix directly. The result was similar quality embeddings with a cleaner theoretical foundation (the loss function directly optimises for the log of co-occurrence probabilities). GloVe embeddings were the standard pre-trained word vectors before contextual embeddings (ELMo, BERT) took over.
- Pennington, J., Socher, R., Manning, C.D. "GloVe: Global Vectors for Word Representation." EMNLP, 2014.
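The objective, for co-occurrence counts X_ij, word and context vectors w_i and w̃_j, biases b, and a weighting function f that downweights rare pairs:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```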
Attention for sequence-to-sequence (2014)
Bahdanau et al. introduced attention as a mechanism for neural machine translation. The problem: encoder-decoder models compressed an entire input sentence into a single fixed-length vector, which was a bottleneck for long sentences. Their solution: let the decoder "attend" to different parts of the encoder's output at each step, weighting by relevance. This is the direct ancestor of transformer attention, though the mechanism was additive (a small neural network computing alignment scores) rather than the dot-product attention that came later.
- Bahdanau, D., Cho, K., Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR, 2015 (arXiv 2014).
Luong et al. (2015) simplified this to dot-product attention, which is closer to what the transformer uses.
- Luong, M.-T., Pham, H., Manning, C.D. "Effective Approaches to Attention-based Neural Machine Translation." EMNLP, 2015.
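Side by side, the two scoring functions (notation simplified from the papers; s is a decoder state, h_j an encoder state):

```latex
e_{ij} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h_j) \quad \text{(additive, Bahdanau)}
\qquad\qquad
e_{ij} = s_i^{\top} h_j \quad \text{(dot-product, Luong)}
```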
Residual connections (2015)
He et al. introduced skip connections (residual connections) in ResNet for image classification. The idea: instead of learning the full transformation, learn the residual (the difference from identity). This solved the degradation problem in very deep networks and enabled training networks with 100+ layers. The transformer uses residual connections around every sub-layer (attention and feed-forward). Without them, gradients vanish in deep transformers just as they did in deep CNNs.
- He, K. et al. "Deep Residual Learning for Image Recognition." CVPR, 2016 (arXiv 2015).
Layer normalisation (2016)
Ba, Kiros, and Hinton proposed normalising across features within a single training example, rather than across the batch (as in batch normalisation). For sequence models this matters: batch normalisation computes statistics across different training examples, which breaks for variable-length sequences and single-example inference. Layer norm computes statistics within each token independently, which works regardless of batch size.
- Ba, J.L., Kiros, J.R., Hinton, G.E. "Layer Normalization." arXiv:1607.06450, 2016.
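The computation itself is a few lines per token. A minimal sketch, with gamma and beta as the learned scale and shift:

```rust
/// Layer normalisation over one token's feature vector: the statistics
/// come from this example's features alone, never from the batch.
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    x.iter()
        .zip(gamma.iter().zip(beta))
        .map(|(v, (g, b))| g * (v - mean) / (var + eps).sqrt() + b)
        .collect()
}

fn main() {
    let x = [2.0, 4.0, 6.0, 8.0];
    let (gamma, beta) = ([1.0; 4], [0.0; 4]); // identity scale and shift
    println!("{:?}", layer_norm(&x, &gamma, &beta, 1e-5));
    // Zero-mean, unit-variance output: roughly [-1.34, -0.45, 0.45, 1.34]
}
```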
BPE for NLP (2016)
Sennrich, Haddow, and Birch adapted Gage's compression algorithm for subword tokenisation in neural machine translation. Start with characters, merge the most frequent pair, repeat until vocabulary reaches a target size. This was the breakthrough that solved the open-vocabulary problem: you don't need a fixed word list, rare words decompose into known subword units, and the vocabulary stays manageable. GPT-2 uses a byte-level variant of this. It's the tokenisation scheme I'll implement from scratch.
- Sennrich, R., Haddow, B., Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016.
GELU activation (2016)
Hendrycks and Gimpel proposed the Gaussian Error Linear Unit as a smoother alternative to ReLU. The key property: unlike ReLU, GELU doesn't have a hard zero threshold, so neurons don't "die" permanently. It became the default activation in transformers (GPT-2, BERT, and most successors use it). The practical difference from ReLU is modest, but it's measurable in training stability.
- Hendrycks, D., Gimpel, K. "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415, 2016.
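GPT-2 uses the tanh approximation rather than the exact Gaussian-CDF form; a direct transcription:

```rust
use std::f32::consts::PI;

/// GELU, tanh approximation (the variant GPT-2 and BERT use):
/// gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn gelu(x: f32) -> f32 {
    0.5 * x * (1.0 + ((2.0 / PI).sqrt() * (x + 0.044715 * x.powi(3))).tanh())
}

fn main() {
    for x in [-2.0f32, -0.5, 0.0, 0.5, 2.0] {
        // Unlike ReLU, small negative inputs still produce non-zero output,
        // so there is no hard cutoff where units stop learning.
        println!("gelu({x:>4}) = {:.4}", gelu(x));
    }
}
```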
The Transformer (2017)
Vaswani et al., "Attention Is All You Need." The paper that synthesised everything above into a single architecture: multi-head scaled dot-product attention, positional encoding (sinusoidal), residual connections, layer normalisation, position-wise feed-forward networks. It dispenses with recurrence and convolution entirely. The title is accurate: attention is the only novel mechanism, but the architectural choices around it are what make it work.
Key details that matter for implementation:
- The scaling factor 1/sqrt(d_k) prevents dot products from growing large in high dimensions, which would push softmax into near-one-hot regions where gradients vanish. This was a practical fix, not a theoretical necessity.
- Multi-head attention: 8 heads of 64 dimensions each rather than one head of 512. Different heads learn different relationship types. This is empirically true and has been confirmed by subsequent interpretability research.
- The original paper uses Post-Norm (layer norm after the residual add). This matters later.

- Vaswani, A. et al. "Attention Is All You Need." NeurIPS, 2017.
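For reference, the equation everything hangs off:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```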
GPT-1 (2018)
Radford et al. at OpenAI showed that a transformer decoder trained with language modelling (predict the next token) on a large corpus could then be fine-tuned on downstream tasks with minimal task-specific architecture. The "pre-train then fine-tune" paradigm. 12 layers, 768 dimensions, 117M parameters. Trained on BookCorpus.
Notable shift from the original transformer: GPT uses only the decoder half (causal/autoregressive attention), not the encoder-decoder structure. This is because generation is inherently left-to-right.
- Radford, A. et al. "Improving Language Understanding by Generative Pre-Training." OpenAI, 2018.
BERT (2018)
Devlin et al. went the other direction: use the transformer encoder with bidirectional attention. Masked language modelling (predict masked tokens from both left and right context) instead of autoregressive prediction. BERT dominated NLP benchmarks for years and is still widely used for classification and retrieval tasks. Important to the LLM story as a contrast: the causal mask in GPT-style models is what makes generation possible, while BERT's bidirectional attention is why it can't generate text token-by-token.
- Devlin, J. et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019 (arXiv 2018).
Unigram LM tokenisation (2018)
Kudo proposed an alternative to BPE: start with a large vocabulary and iteratively remove tokens that contribute least to the likelihood of the training corpus. Conceptually the reverse of BPE (which builds up). Used in SentencePiece and by T5, among others. Worth noting because BPE isn't the only viable approach, even if it's the dominant one.
- Kudo, T. "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." ACL, 2018.
GPT-2 (2019)
Radford et al. scaled up: 48 layers, 1600 dimensions, 1.5B parameters (the XL variant). Trained on WebText (40GB of web pages filtered by Reddit upvotes). The key claim: "Language Models are Unsupervised Multitask Learners" -- a sufficiently large language model can perform tasks zero-shot without any fine-tuning.
Architectural changes from GPT-1:
- Pre-Norm (layer norm before attention and FFN, not after). This is the change that tripped me up during implementation.
- Larger vocabulary (50,257 tokens via byte-level BPE).
- Context window of 1024 tokens.
GPT-2 Small (124M parameters) is the model I'm reimplementing. It's the right scale: small enough to train on a single GPU, large enough to demonstrate real capabilities.
- Radford, A. et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019.
AdamW (2019)
Loshchilov and Hutter identified a subtle flaw in how weight decay was implemented in Adam: L2 regularisation added to the loss is not equivalent to weight decay once adaptive learning rates are involved. Adam folds the decay term into the gradient before the adaptive scaling, so parameters with a large gradient history receive less regularisation. AdamW applies weight decay directly to the weights, decoupled from the gradient update. The fix is simple but the impact on training stability is real. AdamW is now the default optimiser for transformer training.
- Loshchilov, I., Hutter, F. "Decoupled Weight Decay Regularization." ICLR, 2019.
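Written out (θ parameters, η learning rate, λ decay coefficient, m̂ and v̂ Adam's bias-corrected moment estimates), the difference is where λθ enters:

```latex
\text{Adam + L2:} \quad g_t = \nabla f(\theta_t) + \lambda \theta_t, \qquad
\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\\[4pt]
\text{AdamW:} \quad
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)
```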
GPT-3 and in-context learning (2020)
Brown et al. scaled to 175B parameters and demonstrated that large language models can learn new tasks from a few examples in the prompt (few-shot learning) without any gradient updates. The "in-context learning" phenomenon. This is the paper that shifted the field from "fine-tune for each task" to "prompt engineering."
Not directly relevant to the implementation (you can't train GPT-3 on a single GPU), but essential context for understanding why LLMs matter.
- Brown, T.B. et al. "Language Models are Few-Shot Learners." NeurIPS, 2020.
Nucleus sampling (2020)
Holtzman et al. identified the "neural text degeneration" problem: greedy and beam search produce repetitive, bland text. Top-k sampling is better but the fixed k is arbitrary. Nucleus sampling (top-p) dynamically adjusts: include the smallest set of tokens whose cumulative probability exceeds p. When the model is confident, this might be 3 tokens; when uncertain, 200. Most production LLM deployments use nucleus sampling.
- Holtzman, A. et al. "The Curious Case of Neural Text Degeneration." ICLR, 2020.
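A minimal sketch of the selection logic (the caller supplies the uniform draw so the function stays dependency-free):

```rust
/// Nucleus (top-p) sampling: keep the smallest set of tokens whose
/// cumulative probability reaches p, renormalise, then sample.
/// `u` is a uniform draw in [0, 1) supplied by the caller.
fn sample_top_p(probs: &[f32], p: f32, u: f32) -> usize {
    // Sort token indices by probability, descending.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Truncate to the nucleus: smallest prefix with cumulative prob >= p.
    let mut cum = 0.0;
    let mut nucleus = Vec::new();
    for &i in &idx {
        nucleus.push(i);
        cum += probs[i];
        if cum >= p {
            break;
        }
    }

    // Renormalise within the nucleus and sample with the uniform draw.
    let mut acc = 0.0;
    for &i in &nucleus {
        acc += probs[i] / cum;
        if u < acc {
            return i;
        }
    }
    *nucleus.last().unwrap() // numerical fallback
}

fn main() {
    // A confident distribution: the nucleus at p = 0.9 is just two tokens.
    let probs = [0.6, 0.3, 0.05, 0.03, 0.02];
    println!("sampled token: {}", sample_top_p(&probs, 0.9, 0.7));
}
```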
Pre-Norm vs Post-Norm analysis (2020)
Xiong et al. formally analysed why Pre-Norm (GPT-2 style) trains more stably than Post-Norm (original Transformer). In Post-Norm, the expected gradient magnitude varies across layers, requiring careful learning rate warmup. Pre-Norm keeps gradient magnitudes more uniform. This paper explains the practical discovery that GPT-2 made: Pre-Norm just works better, especially for deeper models. Nearly every modern LLM uses Pre-Norm.
- Xiong, R. et al. "On Layer Normalization in the Transformer Architecture." ICML, 2020.
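The difference is just where the normalisation sits relative to the residual add. A shape-free sketch of the two wirings (`sublayer` stands in for attention or the FFN, `norm` for a real LayerNorm; names are mine):

```rust
fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Post-Norm (original Transformer): residual add first, then normalise.
/// The normalisation sits on the residual path itself.
fn post_norm_block(
    x: &[f32],
    sublayer: impl Fn(&[f32]) -> Vec<f32>,
    norm: impl Fn(&[f32]) -> Vec<f32>,
) -> Vec<f32> {
    norm(&add(x, &sublayer(x)))
}

/// Pre-Norm (GPT-2): normalise the sub-layer's input; the residual path
/// stays a clean identity from embedding to logits, so gradients flow freely.
fn pre_norm_block(
    x: &[f32],
    sublayer: impl Fn(&[f32]) -> Vec<f32>,
    norm: impl Fn(&[f32]) -> Vec<f32>,
) -> Vec<f32> {
    add(x, &sublayer(&norm(x)))
}

fn main() {
    let x = vec![1.0, 2.0];
    let f = |v: &[f32]| v.iter().map(|a| a * 2.0).collect::<Vec<_>>();
    let id = |v: &[f32]| v.to_vec(); // identity stand-in for LayerNorm
    println!("{:?} vs {:?}", post_norm_block(&x, &f, &id), pre_norm_block(&x, &f, &id));
}
```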
SwiGLU (2020)
Shazeer proposed replacing the standard FFN (two linear layers with GELU) with a gated linear unit variant using Swish activation. The gate multiplies a non-linear transformation of the input with a linear transformation, giving the network more control over information flow. Used in PaLM, Llama, and many recent models. I won't implement this in the series (GPT-2 uses standard FFN), but it shows where the field has gone since.
- Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202, 2020.
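The FFN variant from the paper, where Swish(x) = x · σ(x) and ⊙ is elementwise multiplication:

```latex
\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = \left( \mathrm{Swish}(x W_1) \odot x V \right) W_2,
\qquad \mathrm{Swish}(x) = x \cdot \sigma(x)
```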
InstructGPT and RLHF (2022)
Ouyang et al. described the three-stage pipeline: supervised fine-tuning on human demonstrations, reward model training on human preference data, and reinforcement learning (PPO) against the reward model. The headline result: a 1.3B parameter model fine-tuned with RLHF was preferred by humans over the 175B base GPT-3. This is the paper that proved alignment isn't just about scale.
The three-model dance (policy, reward, reference) and the KL penalty to prevent reward hacking are the conceptually tricky parts. This is what made ChatGPT possible.
- Ouyang, L. et al. "Training language models to follow instructions with human feedback." NeurIPS, 2022.
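The RL stage optimises roughly this objective (omitting InstructGPT's auxiliary pretraining-mix term): maximise the learned reward while the KL penalty keeps the policy π_θ close to the supervised reference π_ref:

```latex
\max_{\theta} \;\; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
```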
DPO (2023)
Rafailov et al. showed you can skip the reward model entirely. Given pairs of preferred and rejected responses, DPO optimises the policy directly using a supervised-learning-style loss. The insight: the reward model and RL objective can be algebraically collapsed into a single loss function over preference pairs. Dramatically simpler to implement and more stable to train than PPO-based RLHF.
- Rafailov, R. et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023.
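The resulting loss, with y_w the preferred response, y_l the rejected one, σ the logistic function, and β controlling the strength of the implicit KL constraint:

```latex
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```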
Speculative decoding (2023)
Two independent papers arrived at the same idea: use a small, fast "draft" model to generate several candidate tokens, then verify them in parallel with the large model. Tokens the large model agrees with are accepted for free; rejected tokens get resampled. This doesn't change the output distribution but can give 2-3x speedups.
- Leviathan, Y., Kalman, M., Matias, Y. "Fast Inference from Transformers via Speculative Decoding." ICML, 2023.
- Chen, C. et al. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv:2302.01318, 2023.
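The acceptance rule that preserves the exact output distribution, with q the draft model's distribution and p the target model's: accept each drafted token with probability min(1, p/q); on rejection, resample from the leftover probability mass:

```latex
P(\text{accept } x) = \min\!\left(1, \frac{p(x)}{q(x)}\right),
\qquad
x' \sim \frac{\max\!\big(0,\, p(\cdot) - q(\cdot)\big)}{\sum_{v} \max\!\big(0,\, p(v) - q(v)\big)}
\quad \text{on rejection}
```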
Min-p sampling (2023--2024)
A community-driven proposal for a simpler alternative to top-p. Instead of accumulating probabilities from the top, set a floor: include any token whose probability is at least p times the highest probability. Adaptive like top-p but without the sorting overhead and with more intuitive behaviour at the extremes. Adopted by llama.cpp and several inference frameworks.
- Nguyen, M. et al. "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs." arXiv, 2024.
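The filter is one comparison per token, no sort. A sketch:

```rust
/// Min-p filtering: keep any token whose probability is at least
/// `p` times the single highest probability. No sorting required.
fn min_p_filter(probs: &[f32], p: f32) -> Vec<usize> {
    let max = probs.iter().cloned().fold(f32::MIN, f32::max);
    (0..probs.len()).filter(|&i| probs[i] >= p * max).collect()
}

fn main() {
    let probs = [0.5, 0.2, 0.15, 0.1, 0.05];
    // With p = 0.3 the floor is 0.15, so tokens 0, 1, and 2 survive.
    println!("{:?}", min_p_filter(&probs, 0.3));
}
```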
GRPO and verifiable rewards (2024--2025)
DeepSeek introduced Group Relative Policy Optimisation for training reasoning models. The key shift: for tasks with verifiable answers (maths, code), you don't need human preferences or a learned reward model. Generate a group of responses, score them with a verifiable reward (did the code pass tests?), and optimise relative to the group's performance. No critic network needed. This is what powers DeepSeek-R1.
- DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, 2025.
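The group-relative advantage that replaces the critic, as described in the DeepSeek papers: sample G responses for a prompt, score each with the verifiable reward r_i, and normalise within the group:

```latex
A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}
```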
Common threads
The recurring pattern: practical fix first, theory later. The scaling factor in attention was discovered by experiment before anyone proved why it was necessary. Pre-Norm vs Post-Norm was a "this works better" observation in GPT-2 that wasn't formally analysed until 2020. AdamW fixed a bug that had been hiding in plain sight. The field moves by building and observing, then explaining.
Two lineages converging. The attention mechanism came from machine translation (Bahdanau). Residual connections came from computer vision (He et al.). Layer norm came from recurrent networks. BPE came from data compression. The transformer is a chimera. Each component has its own history and its own set of problems it was solving. Understanding those original problems is part of understanding why the transformer works.
The post-training revolution. Looking at the timeline, the architecture stabilised around 2019 (GPT-2). Almost everything since then has been about scale, training data, and post-training alignment. The fundamental building blocks haven't changed much. SwiGLU and rotary embeddings are refinements, not reinventions. The real action has been in RLHF, DPO, GRPO, and the question of how to make a base model actually useful.
Tokenisation is still unsettled. BPE won the default position in 2016 and hasn't been displaced, but the field keeps returning to it. Unigram LM, byte-level approaches, Grapheme Pair Encoding for better multilingual handling. The weird artefacts of BPE (inconsistent number tokenisation, leading spaces as part of tokens, mid-morpheme splits) are well-known irritations that nobody has cleanly solved.
The Rust ML ecosystem
For the implementation side, the relevant projects:
- Candle (Hugging Face) --- Pure Rust ML framework, the primary tool for this series. Tensor operations, automatic differentiation, GPU support via CUDA and Metal. Actively maintained.
- tch-rs --- Rust bindings to PyTorch's C++ backend (libtorch). More mature than Candle but carries the weight of PyTorch's complexity. Good for production, less good for "understand every layer" learning.
- Burn --- Another pure Rust framework with a focus on portability across backends. More abstracted than Candle.
- dfdx --- Experimental Rust framework attempting compile-time tensor shape checking via the type system. The promise: catch shape mismatches at compile time, not runtime. The reality: Rust's type system isn't quite expressive enough yet (would need dependent types). Interesting to mention, probably not practical to use.
The book
Raschka, S. Build a Large Language Model (From Scratch). Manning, 2024. The reference implementation is Python/PyTorch. The series follows the same conceptual arc but reimplements in Rust/Candle. The value of the book is the pedagogical sequencing: it builds up from tokenisation through the full architecture to training and inference, each chapter depending on the previous.
Things to read next
- Elhage, N. et al. "A Mathematical Framework for Transformer Circuits." Anthropic, 2021. --- Interpretability research showing that individual attention heads and MLP neurons encode identifiable computations. Relevant to the earlier claim that different heads learn different relationship types.
- Olsson, C. et al. "In-context Learning and Induction Heads." Anthropic, 2022. --- How specific attention head patterns (induction heads) give rise to in-context learning.
- Karpathy, A. "Let's build GPT: from scratch, in code, spelled out." YouTube, 2023. --- The best existing walkthrough of building a transformer from scratch, in Python. Sets the bar for the series.
- Clark, K. et al. "What Does BERT Look At? An Analysis of BERT's Attention." ACL Workshop, 2019. --- Early work showing that specific attention heads track syntactic relationships (subject-verb agreement, coreference).