Building a Language Transformer Step by Step

After months of reading about transformers and LLMs, I finally decided to build one from scratch. Not by copy-pasting code, but by incrementally adding each architectural component and measuring its impact. The result was a character-level name generator trained on 32,033 names, and the journey taught me more than any paper or tutorial could.

Preparation: Standing on the Shoulders of Giants

Before diving into code, I spent time building intuition through two excellent resources:

“Build a Large Language Model (From Scratch)” by Sebastian Raschka was my theoretical foundation. The book walks through every component of a transformer with clear explanations and diagrams. Reading it gave me a mental model of how attention, embeddings, and layer normalization fit together — knowledge that proved essential when debugging my own implementation.

Andrej Karpathy’s YouTube series (Neural Networks: Zero to Hero) was equally valuable. His “Let’s build GPT” video demystified the architecture by building it live on screen. Watching someone think through the design decisions — why we use residual connections, how attention matrices work, what LayerNorm actually does — made the concepts stick in a way that reading alone couldn’t. His makemore repository became the dataset and benchmark for my experiments.

With this foundation, I was ready to build.

The Experiment

I incrementally built a character-level transformer for name generation, adding one architectural improvement at a time. All models were trained with batch size 32, the AdamW optimizer, and per-name padding with masked loss.
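
Roughly, the character-level setup looks like this. It is a simplified sketch, assuming the makemore names.txt file (one lowercase name per line); the special-token layout shown ('.' as start/end marker plus a PAD token) is illustrative rather than exact.

    import torch

    # Simplified vocabulary sketch: 26 lowercase letters, '.' as a start/end
    # marker, plus a PAD token. The exact special-token layout is illustrative.
    words = open("names.txt").read().splitlines()

    chars = sorted(set("".join(words)))               # the 26 lowercase letters
    stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a' -> 1, ..., 'z' -> 26
    stoi["."] = 0                                     # start/end marker
    PAD = len(stoi)                                   # index 27, used for padding
    itos = {i: ch for ch, i in stoi.items()}

    def encode(name: str) -> torch.Tensor:
        """Map a name to character indices, wrapped in '.' delimiters."""
        return torch.tensor([stoi[c] for c in "." + name + "."])

    print(encode("emma"))   # tensor([0, 5, 13, 13, 1, 0])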

Results - Architecture Comparison (5,000 steps)

Config       N_EMBD  Heads  Layers  Params  Train Loss  Test Loss
baseline         32      1       1   2,908        2.35       2.35
double embd      64      1       1   8,860        2.34       2.34
2 heads          32      2       1   5,948        2.25       2.23
4 layers         32      2       4  18,332        2.00       2.04
+ MLP            32      2       4  51,740        1.97       2.02
+ LayerNorm      32      2       4  52,252        1.96       1.99
+ RoPE           32      2       4  52,252        1.94       1.98
+ GELU           32      2       4  52,252        1.94       1.94

Results - Scaling Up

Config                                   Steps  Train Loss  Test Loss  Notes
N_EMBD=32, 2 heads                       5,000        1.94       1.94  Baseline final model
N_EMBD=64, 4 heads                       5,000        1.84       1.92  Matches makemore architecture
N_EMBD=64, 4 heads + dropout             5,000        1.95       2.00  Dropout slows convergence
N_EMBD=64, 4 heads + dropout            20,000        1.75       1.85  Longer training helps
+ LR schedule, weight decay, grad clip  20,000        1.72       1.86  Training improvements

Makemore’s default transformer achieves ~1.92 test loss with N_EMBD=64, 4 heads, 4 layers.

Generated Names

Sample outputs from the final model (N_EMBD=64, 4 heads, 20k steps with all training improvements):

kaelynn, aileigh, elyce, yadi, ovani, derella, nyailee, ranyah, niaa, sett
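
Generation itself is the usual autoregressive sampling loop. A minimal sketch, assuming a model that maps a (B, T) tensor of character indices to (B, T, vocab) logits, plus the stoi/itos tables from the setup sketch above:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def sample_name(model, stoi, itos, block_size: int, max_len: int = 20) -> str:
        """Sample one name character by character. `model` is assumed to return
        (B, T, vocab) logits for a (B, T) tensor of character indices."""
        ids = [stoi["."]]                                  # start token
        for _ in range(max_len):
            x = torch.tensor(ids[-block_size:]).unsqueeze(0)
            logits = model(x)[:, -1, :]                    # logits for the next character
            probs = F.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, num_samples=1).item()
            if nxt == stoi["."]:                           # end marker terminates the name
                break
            ids.append(nxt)
        return "".join(itos[i] for i in ids[1:])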

Key Findings

Depth beats width

Doubling embedding size from 32 to 64 (3x params) gave almost no improvement (2.35 -> 2.34). Adding a second attention head with fewer total params (5,948 vs 8,860) dropped loss by 0.12. Stacking 4 layers was the single biggest improvement, dropping test loss from 2.23 to 2.04. The model benefits far more from multiple layers of processing than from wider representations at a single layer.
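
To make "heads" concrete: each head attends over the sequence in its own lower-dimensional subspace, and the head outputs are concatenated and projected back. A compact causal multi-head attention sketch in the spirit of Karpathy's GPT video, without the RoPE and dropout that come later:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiHeadSelfAttention(nn.Module):
        """Causal multi-head self-attention (simplified sketch)."""

        def __init__(self, n_embd: int, n_head: int, block_size: int):
            super().__init__()
            assert n_embd % n_head == 0
            self.n_head = n_head
            self.qkv = nn.Linear(n_embd, 3 * n_embd)   # query, key, value in one matmul
            self.proj = nn.Linear(n_embd, n_embd)      # output projection
            # lower-triangular mask: position t may only attend to positions <= t
            mask = torch.tril(torch.ones(block_size, block_size))
            self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, T, C = x.shape
            q, k, v = self.qkv(x).split(C, dim=2)
            # reshape to (B, n_head, T, head_dim), e.g. head_dim = 32 / 2 = 16
            q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)   # scaled dot-product
            att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            out = att @ v                                           # (B, n_head, T, head_dim)
            out = out.transpose(1, 2).contiguous().view(B, T, C)    # concatenate heads
            return self.proj(out)

Adding a head splits the same embedding into more, smaller subspaces that attend independently; stacking layers repeats the whole attend-then-transform cycle.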

Data handling matters most

Before adding per-name padding, my best model achieved 2.36 test loss. After switching to per-name padding with masked loss (same architecture), it dropped to 1.94. This was a larger improvement than all architectural changes combined. The reason: without padding, the model wasted capacity trying to predict across name boundaries, an impossible task that added noise to every gradient update.
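
Concretely, the fix looks something like this (a sketch reusing the stoi and PAD indices from the setup sketch above; PyTorch's ignore_index handles the masking):

    import torch
    import torch.nn.functional as F

    def make_batch(names, stoi, pad, block_size):
        """Pad each encoded name to block_size and build next-character targets.
        Assumes block_size >= longest name + 1."""
        x = torch.full((len(names), block_size), pad, dtype=torch.long)
        y = torch.full((len(names), block_size), pad, dtype=torch.long)
        for i, name in enumerate(names):
            ids = [stoi["."]] + [stoi[c] for c in name] + [stoi["."]]
            x[i, : len(ids) - 1] = torch.tensor(ids[:-1])   # inputs: all but the last char
            y[i, : len(ids) - 1] = torch.tensor(ids[1:])    # targets: the next char
        return x, y

    def masked_loss(logits, y, pad):
        """Cross-entropy over all positions; ignore_index drops the PAD positions."""
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1),
                               ignore_index=pad)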

MLP adds capacity but needs regularization

Adding the feed-forward network (MLP) to each layer roughly tripled the parameter count (18k -> 52k) but only modestly improved results. It also slightly widened the train-test gap (2.00/2.04 -> 1.97/2.02), hinting at mild overfitting. The MLP lets the model transform representations nonlinearly after attention gathers information, but at this small scale the effect is limited.
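
For reference, the MLP here is the standard position-wise feed-forward block: expand the embedding 4x, apply a nonlinearity, project back. A sketch (at this step the activation was ReLU, with GELU swapped in later):

    import torch.nn as nn

    class MLP(nn.Module):
        """Position-wise feed-forward block: expand 4x, nonlinearity, project back
        (e.g. 64 -> 256 -> 64 in the final model)."""

        def __init__(self, n_embd: int, dropout: float = 0.0):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),                        # later swapped for nn.GELU()
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)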

LayerNorm and RoPE help incrementally

LayerNorm stabilized training and closed the train-test gap slightly. RoPE (Rotary Position Embeddings) gave the model awareness of character positions without adding any parameters. Neither was dramatic at this scale, but both are essential for larger models — LayerNorm enables training deep networks, and RoPE enables generalization to longer sequences.
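
RoPE works by rotating each pair of query/key dimensions by an angle proportional to the token's position, so relative offsets show up directly in the attention dot products, with no learned parameters. A minimal sketch (the interleaved pairing shown is one common convention, not necessarily the exact one in my code):

    import torch

    def apply_rope(x: torch.Tensor) -> torch.Tensor:
        """Rotary position embeddings (sketch). x: (B, n_head, T, head_dim);
        applied to queries and keys before the attention dot product."""
        B, H, T, D = x.shape
        assert D % 2 == 0
        # one rotation frequency per pair of dimensions
        theta = 10000.0 ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # (D/2,)
        pos = torch.arange(T, dtype=torch.float32)                            # (T,)
        angles = pos[:, None] * theta[None, :]                                # (T, D/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]          # split into even/odd pairs
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin         # rotate each pair by its angle
        out[..., 1::2] = x1 * sin + x2 * cos
        return out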

GELU vs ReLU is negligible at small scale

Switching from ReLU to GELU activation in the MLP had no measurable effect. The smoother gradient flow matters more when networks are deeper and wider.

Scaling up helps significantly

Doubling N_EMBD to 64 and using 4 heads (matching makemore’s architecture) dropped test loss from 1.94 to 1.92 at 5k steps. With longer training (20k steps), the model reached 1.85 test loss — surpassing makemore’s default.

Dropout trades speed for generalization

Adding 20% dropout slowed convergence: at 5k steps it actually hurt test loss (1.92 -> 2.00). But it prevents overfitting during longer training runs, allowing the model to keep improving past the point where it would otherwise plateau.

Training improvements compound

Learning rate scheduling (warmup + cosine decay), weight decay (0.01), and gradient clipping (max_norm=1.0) together produced smoother training curves. The cosine decay keeps the learning rate from staying too high late in training, when the model is making finer adjustments; weight decay acts as a regularizer; and gradient clipping prevents instability from occasional large gradients.
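
Wired together, the three pieces look roughly like this. A sketch only: model and get_batch are placeholders for the actual model and data loader, and the constants mirror the training config listed in the next section.

    import math
    import torch

    MAX_STEPS, WARMUP, MAX_LR, MIN_LR = 20_000, 200, 1e-3, 1e-4

    # `model` and `get_batch` are placeholders, not defined in this sketch
    optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR, weight_decay=0.01)

    def lr_at(step: int) -> float:
        if step < WARMUP:                                   # linear warmup
            return MAX_LR * (step + 1) / WARMUP
        progress = (step - WARMUP) / (MAX_STEPS - WARMUP)   # cosine decay to MIN_LR
        return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

    for step in range(MAX_STEPS):
        for group in optimizer.param_groups:
            group["lr"] = lr_at(step)
        x, y = get_batch("train")
        loss = model(x, y)                                  # assumed to return the masked loss
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()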

Architecture Summary

The final model is a proper transformer decoder:

Input tokens
    -> Token Embedding (28 vocab -> 64 dim)
    -> 4x Transformer Blocks:
        -> LayerNorm -> Multi-Head Attention (4 heads, RoPE, dropout) -> Residual
        -> LayerNorm -> MLP (64 -> 256 -> 64, GELU, dropout) -> Residual
    -> Linear (64 -> 28 vocab)
    -> Cross-entropy loss (masked on PAD tokens)
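
In code, each of the 4 blocks is the standard pre-norm residual pattern, reusing the attention and MLP sketches from earlier (in the real model the attention also applies RoPE and dropout, which those sketches omit):

    import torch.nn as nn

    class Block(nn.Module):
        """One transformer block: pre-LayerNorm, attention, MLP, residual connections."""

        def __init__(self, n_embd: int, n_head: int, block_size: int, dropout: float):
            super().__init__()
            self.ln1 = nn.LayerNorm(n_embd)
            self.attn = MultiHeadSelfAttention(n_embd, n_head, block_size)  # sketch above
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = MLP(n_embd, dropout)                                 # sketch above

        def forward(self, x):
            x = x + self.attn(self.ln1(x))   # attention sublayer + residual
            x = x + self.mlp(self.ln2(x))    # feed-forward sublayer + residual
            return x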

Training config:

  • 20,000 steps
  • Batch size 32
  • AdamW optimizer with weight decay 0.01
  • Learning rate: warmup to 1e-3 over 200 steps, cosine decay to 1e-4
  • Gradient clipping: max_norm=1.0
  • Dropout: 0.2

What the Loss Means

Cross Entropy Loss

A loss of 1.86 means the model assigns ~15.6% probability on average to the correct next character (e^(-1.86)). Random guessing over 27 characters would give ~3.7% (loss = 3.30). Perfect prediction is impossible because many positions are genuinely ambiguous — after “ma”, the next character could be r, d, k, x, t, and many others.
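
The conversion is just exponentiating the negative loss:

    import math

    for loss in (3.30, 2.35, 1.86):
        print(f"loss {loss:.2f} -> avg probability on the correct char ~{math.exp(-loss):.1%}")
    # loss 3.30 -> avg probability on the correct char ~3.7%
    # loss 2.35 -> avg probability on the correct char ~9.5%
    # loss 1.86 -> avg probability on the correct char ~15.6%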

Progress through this project:

  • Start: 2.35 test loss (~9.5% confidence)
  • Final: 1.86 test loss (~15.6% confidence)
  • Improvement: ~1.6x more confident on the correct character

Conclusion

Building a transformer incrementally taught me that the magic isn’t in any single component — it’s in how they work together. Data preprocessing had the biggest impact. Depth mattered more than width. And the “modern” improvements (LayerNorm, RoPE, GELU) are less about dramatic gains and more about enabling scale.