Yi's Blog

Make, Observe, and Analyze

HRM Explained: A 27M Parameter Model That Reasons Without Chain-of-Thought

What if you could build a model that solves complex Sudoku puzzles, navigates mazes, and tackles abstract reasoning — all with just 27 million parameters and 1,000 training examples? No pre-training on massive datasets, no Chain-of-Thought prompting, no language at all. That’s the claim behind the Hierarchical Reasoning Model (HRM) from Sapient Intelligence.

In this post, I’ll walk through how HRM actually works by tracing the code and architecture step by step. I’ll also cover the important follow-up critiques that question some of these claims.

The Big Idea

Current LLMs reason by writing out their thinking step by step (Chain-of-Thought). This works, but it’s slow, requires huge models, and needs lots of training data. HRM takes a completely different approach: it reasons in latent space — inside the model’s hidden states — through iterative refinement.

The core insight is borrowed from neuroscience: the human brain processes information hierarchically, with slow abstract planning and fast detailed computation happening at different timescales. HRM mimics this with two transformer modules that talk to each other.

The Two-Level Architecture

HRM has two recurrent transformer modules:

H-level (High-level planner) — 4 transformer layers, responsible for slow, abstract reasoning. Think of it as the part that asks: “What strategy should I use?”

L-level (Low-level executor) — 4 transformer layers, responsible for fast, detailed computation. This handles: “What goes in this specific cell?”

They interact in a nested loop:

For each H-cycle (2x):
    For each L-cycle (2x):
        z_L = L_level(z_L, z_H + input_embeddings)
    z_H = H_level(z_H, z_L)

The L-level refines its understanding using the H-level’s guidance plus the raw input. Then the H-level updates its plan based on what L found. Both use non-causal attention — every position can see every other position simultaneously.

One important detail: both modules are ReasoningModule wrappers that add the injection to the hidden state before running through their transformer layers:

def forward(self, hidden_states, input_injection, **kwargs):
    hidden_states = hidden_states + input_injection   # inject
    for layer in self.layers:
        hidden_states = layer(hidden_states=hidden_states, **kwargs)
    return hidden_states

So L doesn’t replace its state — it adds z_H + input to its existing state, then processes. Same for H adding z_L.

Adaptive Computation Time (ACT): The Outer Loop

The H/L cycles above describe what happens within a single step. But HRM can take multiple steps, deciding dynamically how long to think. This is the Adaptive Computation Time (ACT) wrapper.

Each call to model.forward(carry, batch) is one ACT step. The training/evaluation loop calls it repeatedly:

# Evaluation loop
while True:
    carry, _, metrics, preds, all_finish = model(carry, batch)
    if all_finish:
        break

The model can take up to 16 ACT steps (configurable). At each step, it decides: halt or continue?

Here’s how the two levels of looping connect:

ACT Step 1  ──→  H/L cycles (2x2) inside  ──→  logits + Q-values
                                                   │
                                            Q says "continue"
                                                   ↓
ACT Step 2  ──→  H/L cycles (2x2) inside  ──→  logits + Q-values
                 (carry from step 1                │
                  flows in)                  Q says "continue"
                                                   ↓
ACT Step 3  ──→  H/L cycles (2x2) inside  ──→  logits + Q-values
                                                   │
                                            Q says "HALT"
                                                   ↓
                                            Final answer used

With 16 ACT steps, each containing 2 H-cycles x 2 L-cycles, the model can perform up to 64 L-passes + 32 H-passes — massive computational depth from a tiny model, because the same weights are reused every time.

z_H and z_L: The Model’s Working Memory

So what exactly are z_H and z_L? They’re hidden state tensors — the model’s evolving “thoughts” at each level.

Let’s make this concrete with a Sudoku example. A 9x9 puzzle gets flattened into 81 integers:

inputs = [5, 3, 0, 0, 7, 0, 0, 0, 0, 6, 0, 0, ...]   # 81 integers: cell 1, cell 2, ..., cell 81

Each integer gets embedded into a 512-dimensional vector. Then a puzzle embedding (more on this later) is prepended as position 0. So the final sequence has 82 positions:

position 0:  puzzle embedding    ← 512-dim vector
position 1:  cell 1 embedding   ← 512-dim vector
position 2:  cell 2 embedding   ← 512-dim vector
...
position 81: cell 81 embedding  ← 512-dim vector
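
To make this concrete in code, here's a rough sketch of how the 82-position embedded input could be assembled. The module names (token_emb, puzzle_emb) and the vocabulary size are my assumptions, not the repo's exact code:

import torch
import torch.nn as nn

hidden_size = 512
token_emb = nn.Embedding(11, hidden_size)              # digits 0-9 plus a padding id (assumed vocab)
puzzle_emb = nn.Embedding(1, hidden_size)              # Sudoku: a single shared puzzle id

inputs = torch.randint(0, 10, (1, 81))                 # (batch, 81) flattened Sudoku cells
puzzle_ids = torch.zeros(1, dtype=torch.long)          # puzzle_identifiers = 0 for every Sudoku example

cells = token_emb(inputs)                              # (1, 81, 512)
prefix = puzzle_emb(puzzle_ids).unsqueeze(1)           # (1, 1, 512), becomes position 0
input_embeddings = torch.cat([prefix, cells], dim=1)   # (1, 82, 512)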

Both z_H and z_L have this same shape: (batch_size, 82, 512). Each position holds a 512-dimensional vector representing the model’s current “thoughts” about that cell.

When a sequence starts fresh, both are initialized to learned vectors H_init and L_init, broadcast across all positions. The model starts with the same state at every position and must differentiate them through the input injection and attention.

After each ACT step, both are detached (gradients cut) and stored in a carry dataclass. The next step picks up where the last left off — but no gradients flow backward between steps. This is what makes the whole thing memory-feasible.
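
A minimal sketch of that carry, with illustrative field names (not necessarily the repo's):

from dataclasses import dataclass
import torch

@dataclass
class InnerCarry:
    z_H: torch.Tensor      # (batch, 82, 512) high-level state
    z_L: torch.Tensor      # (batch, 82, 512) low-level state

z_H = torch.randn(1, 82, 512, requires_grad=True)      # stand-ins for the real states
z_L = torch.randn(1, 82, 512, requires_grad=True)

# At the end of each ACT step the states are detached, so no gradients
# flow backward across step boundaries:
carry = InnerCarry(z_H=z_H.detach(), z_L=z_L.detach())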

Position 0 is special. Since it holds the puzzle embedding (not a cell value), it acts as a global summary token. Through non-causal attention, it sees all 81 cells. The Q-head reads z_H[:, 0] specifically to make the halt/continue decision:

q_logits = self.q_head(z_H[:, 0])   # position 0 → halt decision

And the final answer is read from the remaining positions:

output = self.lm_head(z_H)[:, puzzle_emb_len:]   # positions 1-81 → predictions

Puzzle Embeddings: Per-Puzzle Identity

Not all puzzle types need this, and the difference is revealing.

Sudoku: every puzzle follows the same rule (fill digits 1-9, no repeats in row/column/box). So puzzle_identifiers = 0 for every example. One universal algorithm.

ARC: every puzzle has a different rule. Puzzle 42 might be “rotate the shape 90°”, puzzle 137 might be “fill enclosed regions with blue”. The model needs to know which puzzle it’s solving.

For ARC, the dataset builder assigns each puzzle a unique integer ID (1 through ~960). The model has a learnable embedding table:

puzzle_emb: shape (961, 512)

Row 0:   [0, 0, ..., 0]            ← blank (unused)
Row 1:   [0.12, -0.34, ..., 0.56]  ← learned embedding for puzzle 1
Row 2:   [-0.78, 0.91, ..., 0.23]  ← learned embedding for puzzle 2
...

Each embedding starts at zero and is trained via SignSGD — a simple optimizer that only uses the sign of the gradient:

w = w * (1 - lr * weight_decay) - lr * sign(gradient)

Every weight goes up by lr or down by lr, regardless of gradient magnitude. Why not Adam? Because puzzle embeddings are extremely sparse — with ~960 puzzles and a batch of 768, most rows get no gradient on any given step. Adam would approximate SignSGD anyway for such sparse updates, but SignSGD is simpler and needs zero optimizer state (no momentum, no second moment to track).

The puzzle embedding is trained with a separate optimizer at 100x the learning rate of the main model (0.01 vs 0.0001) and 10x the weight decay (1.0 vs 0.1). It updates rarely, so it needs to move fast when it does.
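
For illustration, the SignSGD update above can be written as a small in-place step function (a sketch, not the repo's optimizer class):

import torch

@torch.no_grad()
def signsgd_step(param: torch.Tensor, lr: float = 0.01, weight_decay: float = 1.0):
    """Apply one SignSGD update with decoupled weight decay, in place."""
    if param.grad is None:
        return                                  # nothing to do if no gradient this step
    param.mul_(1 - lr * weight_decay)           # decoupled weight decay
    param.add_(param.grad.sign(), alpha=-lr)    # move by ±lr based only on the gradient's sign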

The Q-Learning Halting Mechanism

How does the model decide when to stop thinking? Through two Q-values produced by a tiny linear head:

self.q_head = CastedLinear(hidden_size, 2, bias=True)   # 512 → 2 numbers

It reads z_H[:, 0] (the summary token) and outputs:

  • q_halt: “how confident am I that my current answer is correct?”
  • q_continue: “how confident am I that continuing will lead to a correct answer?”

If q_halt > q_continue, the model halts.
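
Combined with the hard cap on ACT steps, the per-example decision is roughly the following (a sketch with stand-in values; the exploration term described later is omitted):

import torch

max_steps = 16
steps = torch.tensor([3, 16])                   # per-example ACT step counters (stand-ins)
q_halt_logits = torch.tensor([0.8, -1.2])       # stand-in Q-head outputs
q_continue_logits = torch.tensor([0.5, 0.3])

is_last_step = steps >= max_steps               # hard cap on thinking time
halted = is_last_step | (q_halt_logits > q_continue_logits)
print(halted)                                   # tensor([True, True]): first halts by choice, second by the cap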

Training q_halt: supervised from ground truth

seq_is_correct = (number_of_correct_cells == total_cells)        # True or False, per sequence
q_halt_loss = F.binary_cross_entropy_with_logits(q_halt_logits, seq_is_correct.float())

Simple. Did you get every cell right? Push q_halt toward 1. Wrong? Push toward 0.

Training q_continue: bootstrapping from the future

This is the trickier part. There’s no ground truth for “will continuing help?” So the model peeks ahead — it runs one more forward pass from the current carry state:

next_q_halt, next_q_continue = self.inner(new_inner_carry, new_current_data)[-1]

The target for q_continue at step t is: the best outcome achievable from step t+1 onward.

target = torch.sigmoid(
    torch.where(is_last_step,
                next_q_halt,                                   # forced to halt next step
                torch.maximum(next_q_halt, next_q_continue))   # best option at next step
)

This is the Bellman equation from reinforcement learning. If at the next step, halting gives 82% confidence and continuing gives 69%, then the value of continuing now is 82% (you’d halt next step). The target follows whichever future path leads to the best outcome.

The bootstrapping cold start

At the beginning of training, both Q-values are meaningless. The Q-head is initialized with bias = -5, so sigmoid(-5) ≈ 0.007 — the model believes there’s a 0.7% chance of being correct for everything. Since q_halt ≈ q_continue, nobody halts early; everything runs to the maximum 16 steps.
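
As a small illustration of that starting point (the zero weight initialization is my assumption; the bias = -5 comes from the description above):

import torch
import torch.nn as nn

q_head = nn.Linear(512, 2, bias=True)           # stand-in for CastedLinear(hidden_size, 2)
nn.init.zeros_(q_head.weight)                   # assumed: no initial preference from the state
nn.init.constant_(q_head.bias, -5.0)            # bias = -5, as described above

print(torch.sigmoid(q_head.bias.detach()))      # tensor([0.0067, 0.0067]) ≈ 0.7% for both Q-values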

The chain reaction goes:

  1. lm_loss gradually teaches the model to produce correct answers
  2. q_halt starts learning which answers are correct (grounded in truth)
  3. Once q_halt is meaningful at step 16, q_continue at step 15 gets a real target
  4. That propagates backward: step 14, 13, 12…
  5. Eventually the model learns to halt early for easy puzzles, run longer for hard ones

Exploration

Without exploration, the Q-head can get stuck — if it always halts at step 3, it never discovers that step 8 would give the right answer. So 10% of the time, each batch item gets a random minimum number of steps it must run before halting is allowed:

min_halt_steps = (rand() < 0.1) * randint(2, max_steps + 1)
halted = halted & (steps >= min_halt_steps)

This ensures the model occasionally sees deeper computation and can update its estimates.

Training: Two Optimizers, One Loss

Each training step:

  1. Forward pass — puzzle embeddings copied to local buffer, flow through L/H cycles, produce logits + Q-values
  2. Single backward pass — gradients flow through everything
  3. Two optimizers step:
    • SignSGD for puzzle embeddings (lr=0.01, weight_decay=1.0)
    • Adam for all transformer weights (lr=0.0001, weight_decay=0.1)

The total loss combines three terms:

total_loss = lm_loss + 0.5 * (q_halt_loss + q_continue_loss)

All three losses backpropagate through the entire model. The Q-losses aren’t just training the Q-head — they shape the representations in z_H and z_L throughout, forcing the model to develop internal representations of “how solved is this puzzle.”
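
To make the two-optimizer setup concrete, here's a minimal runnable sketch with toy stand-in modules; the real code pairs SignSGD with the puzzle embeddings, so the plain SGD group below is only a placeholder:

import torch
import torch.nn as nn

# Tiny stand-ins for the real modules:
puzzle_emb = nn.Embedding(961, 512)
body = nn.Linear(512, 512)

# Two optimizers over disjoint parameter groups (lr / weight_decay values from above;
# torch has no built-in SignSGD, so plain SGD stands in for it here):
puzzle_opt = torch.optim.SGD(puzzle_emb.parameters(), lr=0.01, weight_decay=1.0)
main_opt = torch.optim.Adam(body.parameters(), lr=1e-4, weight_decay=0.1)

# One forward pass, one combined loss, one backward pass, then both optimizers step:
h = body(puzzle_emb(torch.tensor([42])))
lm_loss = h.pow(2).mean()                        # stand-ins for the real losses
q_halt_loss = q_continue_loss = h.abs().mean()
total_loss = lm_loss + 0.5 * (q_halt_loss + q_continue_loss)

puzzle_opt.zero_grad(); main_opt.zero_grad()
total_loss.backward()
puzzle_opt.step(); main_opt.step()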

The gradient efficiency trick

Within each ACT step, only the final H/L cycle computes gradients. All earlier cycles run in torch.no_grad():

with torch.no_grad():
    # Run all but the final L/H updates (H_cycles * L_cycles - 1 of them) without gradients
    for H_step in range(H_cycles):
        for L_step in range(L_cycles):
            if not (H_step == H_cycles - 1 and L_step == L_cycles - 1):
                z_L = L_level(z_L, z_H + input_embeddings)
        if H_step != H_cycles - 1:
            z_H = H_level(z_H, z_L)

# Only this final refinement carries gradients:
z_L = L_level(z_L, z_H + input_embeddings)
z_H = H_level(z_H, z_L)

The hidden states carry forward information from the no-grad iterations, but only the final refinement receives gradients. This dramatically reduces memory usage, since activations for the earlier cycles never need to be stored for backpropagation.

Limitations: No Branching, No Backtracking

HRM’s computation is a single linear path:

carry → step 1 → step 2 → step 3 → ... → answer

As humans, when we solve puzzles, we do something different:

  • “What if this cell is 5?” → follow implications → contradiction → backtrack
  • “OK, what if it’s 7?” → follow implications → works → keep going

That’s tree search — branching, evaluating, backtracking. HRM can’t do this. If step 2 goes down a wrong path, step 3 builds on that wrong foundation.

The non-causal attention can partially compensate by processing all positions simultaneously (like parallel constraint propagation rather than sequential hypothesis testing). But for tasks that fundamentally require exploring multiple hypotheses — like playing Go, where you need to simulate opponent responses many moves ahead — HRM’s single-path architecture won’t work.

Task type          What's needed                           HRM works?
Sudoku             Constraint propagation                  Yes
Maze               Path finding                            Yes
ARC                Pattern recognition + rule inference    Partially
Go / Chess         Multi-step adversarial tree search      No
Theorem proving    Hypothesis testing + backtracking       No

The Follow-Up Critiques

Two important independent analyses appeared after HRM’s release, and they paint a different picture than the original paper.

ARC Prize Team Analysis

The ARC Prize team verified HRM’s results and ran ablation studies. Their key findings:

The hierarchy barely matters. A regular transformer with the same parameter count came within ~5 percentage points of HRM without any hyperparameter tuning. The H/L architectural split isn’t the secret sauce.

The refinement loop is the real driver. Performance jumped +13 percentage points from zero to one refinement iteration. This is the ACT outer loop — but any recurrent architecture could benefit from iterative refinement.

Puzzle embeddings limit generalization. Since each puzzle gets a learned embedding by ID, the model can only work on puzzles it has seen during training. This makes HRM closer to “test-time training” (memorizing each puzzle’s pattern) than genuine reasoning that generalizes to novel puzzles.

Ge, Liao & Poggio Analysis (arXiv 2510.00355)

Researchers from MIT published “Hierarchical Reasoning Models: Perspectives and Misconceptions” with further findings:

A flat model works equally well. An 8-layer L-only model (no H module at all) achieved similar performance and trained faster (1h 48m vs 4h 21m).

The one-step gradient trick isn’t novel. The no-grad warmup + 1-step gradient pattern is mathematically equivalent to how diffusion models and Latent Consistency Models train. It’s a known technique.

ACT doesn’t help at inference. Running for the maximum number of steps always gives the best results. The learned halting policy is never actually useful — the code itself always runs to halt_max_steps during evaluation.

Is it even recurrent? Since only the last cycle has gradients and the carry is detached between ACT steps, the paper questions whether HRM is truly recurrent or just a very deep feedforward model.

What’s Genuinely Interesting

Despite the critiques, HRM points toward ideas worth taking seriously:

Latent-space reasoning works. Instead of generating tokens to “think” (Chain-of-Thought), you can reason inside hidden states. This is fundamentally faster — no autoregressive token generation — and the ARC results show it’s viable even at 27M parameters.

Iterative refinement is powerful. Running the same model multiple times with carried state is a simple idea with outsized impact. The +13pp jump from zero to one refinement iteration shows this clearly.

Small models can do complex reasoning. With the right architecture and training setup, you don’t need billions of parameters for tasks like Sudoku and maze solving. The computational depth comes from recurrence, not model size.

The specific hierarchical architecture may not be essential, and the puzzle embeddings are a significant limitation. But the broader research direction — compact models that reason through iterative latent computation — is one worth watching.