SDS 978: A Post-Transformer Architecture Crushes Sudoku (Transformers Solve ~0%)

A game millions of people solve over morning coffee is exposing a fundamental weakness in today’s most powerful AI models. In this Five-Minute Friday, Jon Krohn breaks down Pathway’s new Sudoku Extreme benchmark, roughly 250,000 of the hardest Sudoku puzzles available and why leading LLMs like o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet scored effectively zero percent, while Pathway’s post-transformer BDH architecture achieved 97.4% accuracy at a fraction of the cost. Listen to the episode to find out what BDH is doing differently, why Sudoku performance matters far beyond puzzles, and what this means for the future of AI reasoning.

Interested in sponsoring a Super Data Science Podcast episode? Email natalie@superdatascience.com for sponsorship information.

In this Five-Minute Friday, Jon Krohn explores why Sudoku, a constraint-satisfaction problem that requires search, backtracking, and holding multiple possibilities in parallel is proving to be a revealing test for AI architectures. Pathway, the company behind the post-transformer BDH (Baby Dragon Hatchling) architecture, recently published the Sudoku Extreme benchmark: roughly 250,000 of the hardest Sudoku puzzles available. The results are striking: BDH solved them with 97.4% accuracy, while the leading transformer-based LLMs, including o3-mini, DeepSeek-R1, and Claude 3.7 Sonnet, scored effectively zero percent.

Jon explains why transformers struggle with this kind of task: they process information token by token with a latent space constrained to roughly a thousand floating-point values per token, locking in each decision as text is generated. BDH takes a fundamentally different approach. As a native reasoning model, it maintains a much larger internal “latent reasoning space” that doesn’t require verbalizing every thought as text. Key technical features include sparse positive activations (only about 5% of neurons firing at any time, far more biologically plausible than dense transformer activation), a state-based architecture similar to Mamba that maintains and updates internal state rather than relying on attention, and continual learning that allows BDH to pick up the rules of a new game in as little as 20 minutes and improve through repeated play.

Importantly, BDH achieves its 97.4% accuracy at roughly 10x lower cost than the leading LLMs achieve their near-zero scores, since it reasons internally rather than burning GPU cycles on long chains of text. Jon argues that Sudoku performance is really a proxy for constraint-satisfaction problems more broadly, the kind found in medicine, law, operations, and planning and that Pathway’s concept of “generative strategy” (creatively proposing what should be done, rather than merely remembering what has been done) represents an exciting frontier. While BDH has so far been demonstrated at about a billion-parameter scale, the zero-versus-97.4% gap represents a categorical difference that suggests alternative architectures can address real limitations of transformers.

ITEMS MENTIONED IN THIS PODCAST:

DID YOU ENJOY THE PODCAST?

Download The Transcript

Podcasts SDS 978: A Post-Transformer Architecture Crushes Sudoku (Transformers Solve ~0%)

Podcast Transcript

Share on

Related Podcasts

May 12, 2026

May 8, 2026

May 5, 2026

Podcasts SDS 978: A Post-Transformer Architecture Crushes Sudoku (Transformers Solve ~0%)

Share