What Is a Ternary LLM? BitNet 1.58-bit Weights Explained

The Atome 944K checkpoint ships as a 271 KB binary. The same model stored as 32-bit floats is 3.7 MB. That roughly 14× shrink — about 20× at the smaller engine-default configuration — comes almost entirely from one decision: ternary weights. If you have heard the phrase “1.58-bit LLM” and wondered what it actually means, this is the short, practical version.

Three values per weight

In a normal neural network each weight is a 32-bit floating-point number. In a ternary network each weight is one of just three values: −α, 0, or +α, where α is a single scaling factor shared across a whole tensor. That is the BitNet b1.58 idea. Because three states carry log₂(3) ≈ 1.58 bits of information, the scheme is called “1.58-bit.” Packed efficiently in base-3 — the Atome ATOME01 format stores four trits per byte — the model collapses to a small fraction of its floating-point size.

Why it is fast, not just small

Size is only half the story. Ternary weights also remove the multiplications from the matrix multiply: multiplying an activation by −1, 0, or +1 is a sign flip, a skip, or a copy — no floating-point multiply at all. On a microcontroller with no floating-point unit and no SIMD, that is an enormous saving. The expensive inner loop of inference becomes additions and sign changes, which a tiny Cortex-M core can do quickly and predictably.

What it costs

Quantization is never free. Three levels per weight is a genuine capacity limit, and the honest result is that the ternary inductive bias substitutes for capacity at small scale and constrains it at large scale. At about 60K parameters Atome's routed-ternary block beats a parameter-matched FP32 transformer by roughly 22% in perplexity. But scale the same recipe to about 944K parameters and a plain float model wins by roughly 11%. Ternary is the right tool when flash and RAM are the binding constraint — not when peak accuracy at scale is the goal.

Ternary vs other quantization

Most edge quantization stops at 8-bit (int8), which gives a 4× size reduction and keeps multiplies. Ternary goes much further on size and removes multiplies entirely, at the cost of more aggressive rounding. Atome also ships an optional power-of-three “spectrum” quantizer (levels {0, ±1, ±3, ±9}·α, about 2.81 bits) that recovers a little accuracy when the extra precision is worth the bytes — in the public three-seed run it edged plain ternary, 7.40 versus 7.77 perplexity. The point is that quantization is a dial, and ternary sits at the extreme end where memory matters most.

All of this is reproducible from the public repository: the packing code, the trained checkpoints, and the step-by-step training logs are bundled so every number above can be checked, not just trusted.

How training keeps ternary weights usable

If you simply rounded a trained float model to three levels, you would destroy it — the rounding error per weight is enormous. Ternary models avoid this by being trained ternary from the start, using a straight-through estimator: the forward pass quantizes weights to {−α, 0, +α}, while the backward pass updates a hidden full-precision shadow copy as if the quantization were the identity. Over training, the network learns weights that are robust to being snapped to three levels, because it has only ever seen the quantized values in its own forward pass. The scaling factor α per tensor absorbs the overall magnitude, so the three levels carry the sign and sparsity while α carries the scale.

The sparsity bonus

The middle level — zero — is not just a value; it is free compression and free computation. A weight that is exactly zero contributes nothing to the output, so it can be skipped entirely in the matrix multiply. Trained ternary networks tend to push a large fraction of weights to zero, which means a meaningful share of the multiply-accumulate work simply disappears. On a microcontroller, where every cycle and every microamp counts, skipping zero-weighted connections is a direct, measurable saving, on top of the smaller flash footprint.

None of this is magic, and the trade-off from the main section still holds: three levels limit capacity, which is why ternary shines at small scale and is overtaken by float models when there is room for one. But it explains why a ternary model is not merely a compressed float model — it is a different object, trained differently, with computation and memory characteristics that suit a microcontroller far better than a rounded-down GPU model ever could.

Where ternary fits in the quantization landscape

It helps to place ternary on the same map as the quantization schemes you may already know. Full 32-bit floating point is the training default and the most accurate, but also the largest and the most expensive to compute. Eight-bit integer quantization, the workhorse of mobile and embedded inference, cuts size by four and keeps integer multiplies, which is a comfortable middle ground when you have a few megabytes to spend. Ternary sits at the far end of the spectrum: roughly twenty times smaller than float, with the multiplies removed entirely, at the cost of the most aggressive rounding. The right choice depends on where your binding constraint is. If you have megabytes, eight-bit is often the pragmatic pick. If your constraint is kilobytes of microcontroller SRAM and flash, ternary is the technique that actually fits, and Atome is built around that end of the dial rather than trying to be optimal everywhere.

Bottom line

A ternary LLM stores each weight as one of three values — about 1.58 bits instead of 32 — which is what lets a trained model shrink from megabytes to hundreds of kilobytes and drop the multiplies from inference. It is trained ternary from the start, not rounded after the fact, and the zero level buys both compression and skipped computation. The trade-off is real: three levels help at small scale and constrain at large scale, so ternary is the right tool when flash and RAM are the binding constraint. That is precisely the regime Atome is built for.

Frequently asked questions

What does 1.58-bit mean in an LLM?

It means each weight takes one of three values (−α, 0, +α). Three states carry log₂(3) ≈ 1.58 bits of information, hence “1.58-bit,” as introduced by BitNet b1.58.

Do ternary weights make a model less accurate?

At very small scale the ternary structure can actually help — Atome beats a matched FP32 model by ~22% at 60K parameters. At larger scale it hurts: a float model wins by ~11% at 944K. It is a memory-versus-accuracy trade-off.

← All posts Source & data on GitHub