Chain of Thought

Information-Optimal Chain of Thought

Most CoT advice quietly assumes the bottleneck is thinking: if the model fails, it must not have reasoned enough, so we give it more tokens and hope. But a lot of the failures don't look like "couldn't solve it." They look like something else entirely.

The Canonical Example

The Strawberry Problem

A tiny example that captures both failure modes so cleanly it's hard to unsee.

How many "r"s are in "strawberry"?

strawberry
Short chain:
Let me count: s-t-r-a-w-b-e-r-r-y. That's 3.

Long chain:
Let me check each letter. s(0), t(0), r(1), a(0), w(0), b(0), e(0), r(2), r(3), y(0). Running total: 3.
Wait, let me recount to be sure. Position 3: r(1). Position 8: r(2). Position 9: r(3). Double-checking... 1+1+1 = 3. Or was it 2? Let me see: 2, no 3.
Systematic scan: s-no, t-no, r-YES(1), a-no, w-no, b-no, e-no, r-YES(2), r-YES(3), y-no. Sum so far: 3. Verify: first r at position 3, count=1. Second r at position 8, count=2. Third r at position 9, count=3. Total mentions of counts: 1,2,3,1,2,3. Final tally shows 2... or 3?

Competing sufficient statistics at final token

  • WORD = 3 — the actual count of "r" in "strawberry". This is the target statistic we want the model to select.
  • TOTAL = 3 — the count of all numbers mentioned across the chain. Grows as reasoning adds "1, 2, 3" repeatedly.
  • SUFFIX = 3 — what the most recent tokens emphasize. Attention recency bias makes this a strong competitor.

Model output: 3. Here all three statistics happen to agree, so the answer comes out right.

The task is trivial. The failure isn't.

As chains get longer, TOTAL and SUFFIX become stronger competitors. The model can locally track correct steps, yet the final answer drifts because it selects the wrong statistic at readout.
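To make the competition concrete, here is a toy sketch that extracts WORD, TOTAL, and SUFFIX from a chain transcript. The function name, the digit regex, and the suffix window are all illustrative assumptions, not a canonical definition of these statistics:

```python
import re

def competing_statistics(word: str, letter: str, chain: str, suffix_window: int = 30):
    """Extract the three statistics that compete at readout (toy definitions)."""
    # WORD: the target statistic -- the actual letter count in the word.
    word_stat = word.count(letter)
    # TOTAL: how many numbers the chain mentions; grows as the chain recounts.
    total_stat = len(re.findall(r"\d+", chain))
    # SUFFIX: the last number mentioned in a trailing window of the chain,
    # standing in for what recency-biased attention would emphasize.
    tail = " ".join(chain.split()[-suffix_window:])
    tail_mentions = re.findall(r"\d+", tail)
    suffix_stat = int(tail_mentions[-1]) if tail_mentions else None
    return {"WORD": word_stat, "TOTAL": total_stat, "SUFFIX": suffix_stat}

short_chain = "Let me count: s-t-r-a-w-b-e-r-r-y. That's 3."
long_chain = ("Position 3: r(1). Position 8: r(2). Position 9: r(3). "
              "Double-checking... 1+1+1 = 3. Or was it 2? Let me see: 2, no 3.")
```

Running it on both chains shows WORD staying fixed while TOTAL grows with chain length, which is exactly the competition described above.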

The Real Problem

Two failure modes that longer chains won't fix

The model does generate the right intermediate state... and then the final token picks a nearby, biased competitor anyway.

The same prompt, with the same information, flips its conclusion just because the order or position changed.

These aren't reasoning failures. They're selection failures at the final token.

The Common Assumption

"The model failed, so it needs more reasoning tokens. Let's make the chain longer."

The Actual Fix

Make the final selection robust. The model often knows the answer; it just picks the wrong statistic at readout.

The Core Mechanism

Three statistics compete at readout

The key isn't "right vs wrong." It's which quantity wins the final token selection.

  • WORD (target) — count of "r" in "strawberry"; the intended answer.
  • TOTAL (competitor) — count accumulated across the entire chain; an artifact the chain creates.
  • SUFFIX (competitor) — a recency-window count; what the last few steps emphasize.
[Chart: competitor strength of WORD, TOTAL, and SUFFIX as chain length grows, with a competition zone at long chains.]

As the chain runs long, TOTAL and SUFFIX become stronger competitors.

That's why "think longer" can reduce reliability: you're not just adding compute, you're increasing competition at the decision point.

The fix is revealing: when you force the model to bind the output to the WORD statistic (instead of drifting to TOTAL/SUFFIX), performance becomes perfect across tested chain lengths.

"

Once you see selection failure as the dominant failure mode, the "optimal CoT length" knob stops being the star of the show.

"
Operational Moves

Five solutions that actually work

Each targets a specific failure mode. Pick the lever that matches your problem.

01

Make the right answer easy to select: checkpointing

When the last step is where things go wrong, longer free-form reasoning adds more competitors.

[Diagram: without a checkpoint, the final step must choose among scattered candidates (3, 5, 2, ?); with checkpoints c1 → c2 → c3, it simply copies the answer 3.]
Operational move
  • Maintain a tiny STATE record (bindings / totals / chosen option)
  • Rewrite it once near the end (or periodically)
  • Final answer must be a direct copy of FINAL_STATE.answer

This is exactly the kind of intervention that recovers binding failures in the procedural setting.
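A minimal sketch of the move, assuming a hypothetical `call_model(prompt) -> str` hook; the `FINAL_STATE:` line format is an illustrative convention, not a fixed protocol:

```python
import json

def answer_with_checkpoint(call_model, question: str) -> str:
    """Ask for a running STATE and copy the final answer out of it verbatim."""
    prompt = (
        f"{question}\n"
        "While reasoning, maintain a small STATE (bindings, totals, chosen option).\n"
        "Just before answering, rewrite it once as a single line:\n"
        'FINAL_STATE: {"bindings": ..., "answer": ...}\n'
        "Write nothing after that line."
    )
    reply = call_model(prompt)
    # The answer is a direct copy of FINAL_STATE.answer -- we never read the
    # answer out of the model's free-form prose.
    state_lines = [ln for ln in reply.splitlines() if ln.startswith("FINAL_STATE:")]
    state = json.loads(state_lines[-1][len("FINAL_STATE:"):].strip())
    return str(state["answer"])
```

The design point: the long free-form chain can drift all it wants, because the readout is a mechanical copy from the last checkpoint rather than a fresh selection among competitors.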

02

Don't "debug CoT": diagnose the output failure type

There are two qualitatively different output failures. They need different fixes.

Doesn't commit to an answer

Won't pick a format or candidate set

Fix: Tighten the schema (force a choice, constrained decoding, "must be one of...")

Commits, but picks wrong

Makes a choice, but it's the wrong candidate

Fix: Shorten evidence path (checkpoint), reduce clutter, make final selection trivial

You can separate these with simple next-token logprob margins (gate vs value margin).
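A sketch of that triage, assuming you can read next-token logprobs at the answer position. The function name, labels, and the 1.0 threshold are illustrative assumptions:

```python
import math

def diagnose_output_failure(next_token_logprobs: dict, candidates: list) -> str:
    """Triage the two output failures with next-token logprob margins.

    next_token_logprobs: token -> logprob at the final answer position.
    candidates: the valid answer tokens (e.g. ["2", "3", "4"]).
    """
    cand = sorted((next_token_logprobs.get(c, -math.inf) for c in candidates),
                  reverse=True)
    best_other = max((lp for t, lp in next_token_logprobs.items()
                      if t not in candidates), default=-math.inf)
    gate_margin = cand[0] - best_other              # commits to *some* candidate?
    value_margin = cand[0] - (cand[1] if len(cand) > 1 else -math.inf)
    if gate_margin < 0.0:
        return "wont_commit"       # fix: tighten schema / constrained decoding
    if value_margin < 1.0:         # illustrative threshold
        return "picks_wrong_risk"  # fix: checkpoint, declutter, trivialize selection
    return "confident"
```

A low gate margin means the model prefers non-answer tokens (format problem); a low value margin means it commits but the candidates are nearly tied (selection problem).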

03

When ordering/position drives variance: spend compute on width

If answers aren't stable under harmless reorderings, deeper single-run thinking is the wrong spend.

DEPTH 2 one shot vs WIDTH 3 3 2 3 majority vote A B C fragile robust
Operational move
  • Generate K variants (permute exchangeable blocks / jitter position)
  • Aggregate (vote or mean candidate logprob)

This targets the "single realization" instability directly.
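A sketch of the width spend, with hypothetical `call_model` and `make_variants` hooks standing in for your generation and permutation machinery:

```python
from collections import Counter

def vote_over_width(call_model, make_variants, prompt: str, k: int = 5) -> str:
    """Spend compute on width: K reordered variants, then a majority vote."""
    # make_variants should permute exchangeable blocks / jitter positions
    # without changing the information content of the prompt.
    variants = make_variants(prompt, k)
    answers = [call_model(v) for v in variants]
    # Aggregate across realizations instead of trusting any single one.
    return Counter(answers).most_common(1)[0][0]
```

Mean candidate logprob works as the aggregator too, when logprobs are available; the vote version only needs final answers.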

04

Treat "optimal length" as a starting guess, then pick the smallest stable k

The scaling result is useful as a prior, but you don't want "largest budget you can afford."

k ~ n · ln(1/ε)   (theoretical scaling)
What you actually want
  • A tiny sweep (k = 0, k ≈ k*/2, k ≈ k*, where k* is the scaling-law prior)
  • Choose the smallest k where the answer stops drifting
  • If drift persists, switch levers: checkpoint or widen, not just "more k"
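The sweep can be sketched as follows, where `answer_at(k)` is a hypothetical hook that runs a chain with reasoning budget k and returns the final answer:

```python
import math

def smallest_stable_k(answer_at, n: int, eps: float = 0.05) -> int:
    """Use k* ~ n*ln(1/eps) as a prior, then keep the smallest k that agrees.

    answer_at(k): runs a chain with budget k and returns the final answer.
    """
    k_star = max(1, round(n * math.log(1 / eps)))
    sweep = [0, k_star // 2, k_star]
    answers = {k: answer_at(k) for k in sweep}
    # Smallest budget whose answer already matches the full-budget answer.
    for k in sweep:
        if answers[k] == answers[k_star]:
            return k
    return k_star
```

If even k* has not stabilized the answer, this function is the wrong lever: switch to checkpointing or width rather than raising k further.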
05

If you care about faithful traces: scrub-audit them

A nice-looking chain isn't evidence. The prose can sound justified without being supported.

Operational move
  • Remove cited evidence (keep structure)
  • Re-score whether each step is still supported
  • Flag steps that remain confident without evidence

This catches "sounds justified" without relying on trusting the prose.
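A sketch of the scrub-audit loop, assuming a hypothetical `score_support` scorer (e.g. a judge model returning a support score in [0, 1]) and an illustrative step schema:

```python
def scrub_audit(score_support, steps):
    """Flag steps that still read as justified after their evidence is scrubbed.

    score_support(text) -> float in [0, 1]: how supported the step sounds.
    steps: [{"text": ..., "evidence": [...]}, ...] (illustrative schema).
    """
    flagged = []
    for step in steps:
        scrubbed = step["text"]
        for ev in step["evidence"]:
            # Remove the cited evidence but keep the sentence structure.
            scrubbed = scrubbed.replace(ev, "[SCRUBBED]")
        if score_support(scrubbed) > 0.8:
            # Confident without evidence: "sounds justified" but isn't supported.
            flagged.append(step["text"])
    return flagged
```

A step whose support score barely moves when its evidence disappears was never leaning on that evidence, no matter how good the prose sounds.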

Putting It Together

The combined planner you actually run

1

Set an initial depth from the scaling prior

2

Check stability + margins (gate vs value)

3

Pick the lever that matches the failure:

"Won't answer" Schema
"Picks wrong value" Checkpoint
"Order sensitive" Width
4

Optionally audit faithfulness via scrub checks

Depth helps when you truly need more computation.

But the fastest wins come from making the final selection robust, and from not betting everything on one ordering.

Ready to build more reliable reasoning chains?

Stop assuming longer is better. Start diagnosing the actual failure mode.