Why I Keep Thinking About Grokking
2026.04.20

A few years ago a paper dropped with a result so strange it looked like a bug. You train a small transformer on modular arithmetic — say a + b mod 97. Within a few hundred steps, training loss crashes to zero. The model has memorized. Test loss, meanwhile, sits flat at chance. Any reasonable person stops here, calls it overfit, and goes home.
Except they didn’t. They kept training. And after something like a hundred thousand more steps of no visible progress, test loss suddenly collapsed to near zero. The model generalized, long after the moment everyone would have called the experiment over. The paper is Power et al. 2022, and the authors called the phenomenon grokking.
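The setup really is that small. A minimal sketch of the task, assuming a modulus of 97 as in the paper — the exact train fraction (here 50%) and the shuffling seed are my assumptions, not the paper's; Power et al. swept over data fractions:

```python
import random

p = 97  # modulus for the a + b mod p task

# Every possible (a, b) pair with its label — the whole "world" is p*p examples.
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

random.seed(0)
random.shuffle(pairs)
split = len(pairs) // 2  # assumed 50% train fraction; the paper varies this
train, test = pairs[:split], pairs[split:]
```

The model sees only `train`; grokking is the moment accuracy on `test` — pairs it has literally never seen — snaps from chance to near perfect.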
The reason it sticks with me is not the specific result. It’s that every metric we had pointed in the wrong direction. If you’d used early stopping — the textbook move — you’d have stopped with a memorizing network and never discovered that the generalizing one was a hundred thousand steps away along the same loss curve, in the same training run, on the same weights.
Later, Neel Nanda and collaborators cracked open one of these networks and reverse-engineered what was actually happening during the long flat region. For modular addition, the network was quietly building a representation that solves the whole task with discrete Fourier transforms — rotating vectors on a unit circle indexed by residues, and adding angles. The memorizing solution and the Fourier solution coexist in the weights for most of training. Weight decay slowly sands down the memorizing circuit, and when it’s gone, the Fourier circuit is already there waiting. The phase transition you see on the test loss is the moment the second circuit becomes load-bearing.
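The rotation trick is easy to verify by hand. This is not the network's actual learned circuit — just a sketch of the underlying identity it exploits: map each residue to an angle on the unit circle, and adding angles is addition mod p:

```python
import math

p = 97

def to_angle(a: int) -> float:
    # Map residue a to the angle 2*pi*a/p on the unit circle.
    return 2 * math.pi * a / p

def from_angle(theta: float) -> int:
    # Read a residue back off the circle by rounding to the nearest
    # multiple of 2*pi/p; the mod wraps angles past a full turn.
    return round(theta / (2 * math.pi) * p) % p

def mod_add_via_rotation(a: int, b: int) -> int:
    # Composing two rotations implements (a + b) mod p.
    return from_angle(to_angle(a) + to_angle(b))
```

Because rotation composition is exactly modular addition, a circuit built from these Fourier features generalizes to every pair, not just the memorized ones.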
I keep coming back to a few things about this.
The metric lied. Test loss was flat. Train loss was zero. From the outside nothing was happening. Inside, a whole new algorithm was being constructed. A lot of learning — in models, in people — looks like this: long stretches where the instruments read flat, and then a phase transition. If your only instrument is a loss curve you will misread every one of those stretches.
Memorization isn’t the opposite of understanding. It’s a substrate. The grokking network had to get the training set right first, by brute force; only then was there enough capacity and gradient signal to build a compressed solution on top. The two modes aren’t rivals — one is scaffolding for the other.
This shape shows up in finance, too. I worked on ML price prediction last summer. You can fit a model beautifully on historical data, ship it, watch it outperform for months, then get run over by a regime that wasn’t in the training set. From the outside it looks like the model “broke.” From the inside, it never generalized — it had memorized a regime, and you just hadn’t caught it yet. Whatever the domain, the question is the same: has the model built something that compresses the world, or just photographed it?
I don’t think grokking is the final picture of how learning works. But the shape of the result — that the interesting part of training can be invisible to any metric you’d naturally log — has stayed with me longer than most papers I’ve read. It’s the cleanest empirical argument I know for why mechanistic interpretability is worth doing. You can’t align what you can’t see, and you can’t see anything if your only window is the loss.