Tiny LLM Benchmark: Where a Ternary Model Beats a Vanilla Transformer — and Where It Loses

Most model pages show you the benchmark they win and quietly drop the rest. This one shows both directions, because the place where Atome loses is the single most useful thing we know about the architecture. All numbers below come from the public training logs in the repository and are reproducible on a CPU.

The setup

Both models are trained on TinyStories with a byte-level tokenizer. The vanilla baseline is a standard pre-norm FP32 decoder-only transformer whose dimensions were brute-force searched to land within a handful of parameters of the Atome target (parameter-fair) or within the same flash budget (flash-fair). This is a like-for-like comparison, not a strawman: the baseline is the standard architecture used across public tiny-LM work.
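To make "parameter-fair" concrete, here is a minimal sketch of what such a brute-force search can look like. Everything in it is an assumption for illustration: the parameter formula (learned position table, tied output head, biases everywhere), the 128-byte context, the search ranges, and the function names are ours, not the configuration used in the repository.

```python
from itertools import product


def gpt_param_count(d_model, n_layers, d_ff, vocab=256, ctx=128):
    """Parameter count of a hypothetical pre-norm decoder-only transformer.

    Assumes a byte-level vocabulary, learned positions and a tied output head.
    """
    embed = vocab * d_model + ctx * d_model      # token table + position table
    attn = 4 * d_model * d_model + 4 * d_model   # Q, K, V, output projections + biases
    mlp = 2 * d_model * d_ff + d_ff + d_model    # two linear layers + biases
    norms = 4 * d_model                          # two LayerNorms (gain + bias) per block
    final_norm = 2 * d_model
    return embed + n_layers * (attn + mlp + norms) + final_norm


def closest_config(target, widths=range(8, 129, 4), depths=range(1, 9)):
    """Exhaustively search (d_model, n_layers, d_ff) for the count nearest `target`."""
    best = None
    for d_model, n_layers in product(widths, depths):
        for d_ff in range(d_model, 4 * d_model + 1):
            gap = abs(gpt_param_count(d_model, n_layers, d_ff) - target)
            if best is None or gap < best[0]:
                best = (gap, d_model, n_layers, d_ff)
    return best


gap, d_model, n_layers, d_ff = closest_config(60_800)
print(f"d_model={d_model} n_layers={n_layers} d_ff={d_ff} misses the target by {gap}")
```

The search is cheap because the count is a closed-form function of three integers, so the baseline can be sized to the target before any training happens.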

The win: 60K parameters (the MCU regime)

Model	Params	Disk	Perplexity (lower is better)
Atome 3-pathway, ternary	60,800	15.1 KB	6.31
Vanilla GPT FP32 (param-fair)	60,808	237.5 KB	8.12
Vanilla GPT FP32 (flash-fair)	5,968	23.3 KB	13.10

At a matched parameter count, Atome reaches 6.31 perplexity versus 8.12 for the FP32 transformer — about 22% lower — while using roughly 16× less disk. At a matched flash budget, where the float model can only afford about 6K parameters, the gap widens to 6.31 versus 13.10, about 52% lower. The win also survives three seeds in a separate shorter run (mean perplexity 7.77 versus 9.82), so it is not a single lucky seed.
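The percentages above are simple enough to check by hand. A short sketch of the arithmetic, assuming the table's KB means 1,024 bytes and that an FP32 checkpoint costs 4 bytes per parameter with no file overhead (the helper names are ours, for illustration only):

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


def fp32_kib(params: int) -> float:
    """Raw FP32 weight size in KiB: 4 bytes per parameter."""
    return params * 4 / 1024


# Perplexities and the Atome disk size are copied from the 60K table.
ATOME_PPL, PARAM_FAIR_PPL, FLASH_FAIR_PPL = 6.31, 8.12, 13.10
ATOME_DISK_KIB = 15.1

param_fair_gain = relative_reduction(ATOME_PPL, PARAM_FAIR_PPL)
flash_fair_gain = relative_reduction(ATOME_PPL, FLASH_FAIR_PPL)
disk_ratio = fp32_kib(60_808) / ATOME_DISK_KIB

print(f"parameter-fair: {param_fair_gain:.1%} lower perplexity")
print(f"flash-fair:     {flash_fair_gain:.1%} lower perplexity")
print(f"param-fair FP32 size: {fp32_kib(60_808):.1f} KiB, {disk_ratio:.1f}x Atome")
print(f"flash-fair FP32 size: {fp32_kib(5_968):.1f} KiB")
```

Both FP32 disk figures in the table fall straight out of the 4-bytes-per-parameter rule, which is why the flash-fair baseline can only afford about 6K parameters.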

The reversal: 944K parameters

Model	Params	Val loss	Perplexity
Atome 3-pathway, ternary	944,640	1.0545	2.87
Vanilla GPT FP32 (param-fair)	950,608	0.9337	2.54

Scale the same recipe to about 944K parameters and the result flips. The FP32 baseline reaches 0.9337 validation loss and 2.54 perplexity, beating Atome's 1.0545 and 2.87 by roughly 11% — same corpus, same validation split, same seed. We publish this prominently because hiding it would be dishonest and, frankly, less useful.
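The two columns in that table are the same measurement in different units. Assuming the validation loss is mean cross-entropy in nats per byte (the figures are consistent with that, though the convention is our inference), perplexity is just its exponential:

```python
import math


def perplexity(val_loss: float) -> float:
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(val_loss)


# Validation losses copied from the 944K table.
atome_ppl = perplexity(1.0545)
vanilla_ppl = perplexity(0.9337)

# Share of Atome's perplexity that the FP32 baseline removes.
gap = (atome_ppl - vanilla_ppl) / atome_ppl

print(f"Atome {atome_ppl:.2f}, vanilla {vanilla_ppl:.2f}, gap {gap:.1%}")
```

Measured against Atome's own perplexity the gap is a little over 11%, which is the figure quoted here.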

Why the reversal is the important number

The flip tells you what the architecture is for. The three-pathway ternary block is an inductive bias that substitutes for capacity when capacity is scarce, and gets in the way once you have enough of it. So Atome's bet is the small, microcontroller-class regime, not “tiny ternary beats everything.” These two runs bracket the crossover without pinning it down: it lies somewhere between 60K and 944K parameters. Near the 60K end, choose Atome; by roughly a million parameters, choose a plain transformer if your hardware can afford one.

What the benchmark does not claim

The 60K and 944K headlines are single-seed; only the shorter 1500-step run is replicated across three seeds.
All microcontroller figures are QEMU Cortex-M3 measurements, not physical silicon throughput.
This is TinyStories, a narrow corpus; it is a controlled comparison, not a general-quality claim.

What the ablations tell us

Beyond the head-to-head, the public three-seed run includes ablations that remove one pathway at a time, and they explain where the win comes from. With all three pathways the mean perplexity is 7.77; remove the local convolution and it rises to 8.99, remove the state-space model and it is 8.05, remove the sparse attention and it is 7.93. Every pathway contributes, and the local convolution, the cheapest of the three, matters most at this scale. That is a useful design signal: the routed combination is not redundant decoration, it is doing real work. The pattern is also consistent with the router leaning on the convolution for the short-range structure that dominates byte-level text, although the ablations alone do not show what the router has learned.
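One way to read the ablations is as a cost per removed pathway. A small sketch using only the mean perplexities quoted above (the dictionary keys and function name are ours):

```python
FULL_PPL = 7.77  # all three pathways, three-seed mean

# Mean perplexity with one pathway removed.
ABLATED_PPL = {
    "local convolution": 8.99,
    "state-space model": 8.05,
    "sparse attention": 7.93,
}


def ablation_costs(full: float, ablated: dict) -> list:
    """Return (pathway, absolute increase, relative increase), largest cost first."""
    rows = [(name, ppl - full, (ppl - full) / full) for name, ppl in ablated.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, delta, rel in ablation_costs(FULL_PPL, ABLATED_PPL):
    print(f"without {name}: +{delta:.2f} perplexity ({rel:.1%})")
```

Removing the convolution costs roughly four times as much as removing the state-space model, and the sparse attention costs least. These are differences between three-seed means, so the smaller gaps deserve more caution than the large one.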

Reading perplexity without fooling yourself

Perplexity is a proxy, not a product metric. A 22% perplexity reduction at 60K parameters is a meaningful architectural signal under a controlled comparison, but it does not by itself mean the model writes good prose. At this scale neither model does, because TinyStories at kilobyte scale is a structure benchmark, not a fluency one. The honest way to read these numbers is comparative: under identical training, the same corpus and the same seed, the ternary routed block extracts more from a fixed parameter or flash budget than a plain transformer does at 60K parameters, and by 944K parameters the plain model has pulled ahead. The benchmark tells you about the architecture's efficiency in a regime, not about chatbot quality.

If you want to reproduce any of this, the repository ships the trained checkpoints and the step-by-step training logs, and the 60K comparison runs in about half an hour on a laptop CPU. We would rather you check the numbers than trust them.

How to run the comparison yourself

The strongest thing we can say about these numbers is that you do not have to take our word for them. The repository bundles the trained checkpoints for both the Atome and the vanilla baseline, the exact training configuration, and the step-by-step logs behind every figure on this page, so the comparison is auditable rather than asserted. The 60K parameter-fair and flash-fair sweep reproduces in roughly half an hour on an ordinary laptop CPU, with no GPU required, and the larger 944K reversal is documented with its own logs. We publish results this way on purpose: a benchmark you can regenerate is a claim you can trust, and one you cannot is just a number on a slide. If you find a discrepancy, that is a bug report we want, not a result we will defend.

Bottom line

The two-directional benchmark is the honest one: a ternary routed block beats a parameter-matched FP32 transformer by about 22% at 60K parameters and loses to it by about 11% at 944K, with the ablations showing every pathway contributes. The reversal is the useful part, because it tells you where the architecture belongs, the microcontroller-class regime well below a million parameters, and where it does not. Read perplexity as a comparative efficiency signal, not a fluency claim, and reproduce the numbers from the bundled checkpoints and logs rather than trusting them. That is what a benchmark is for.

Frequently asked questions

Does a ternary LLM beat a normal transformer?

At very small scale, yes: Atome beats a parameter-matched FP32 transformer by ~22% in perplexity at 60K parameters. At ~944K parameters the float model wins by ~11%. It depends on scale.

Are these tiny-LLM benchmark numbers reproducible?

Yes. The trained checkpoints and step-by-step training logs ship in the public repository, and the 60K comparison reproduces in about 30 minutes on a CPU.
