Running an LLM on ESP32 and STM32 with a Bit-Exact C Engine

Plenty of projects claim they “ported the model to C.” Far fewer test the one thing that matters: does the C output match the Python reference, token for token? Atome makes that a test rather than a promise, and that guarantee is what turns a microcontroller demo into something you can actually certify and ship.

One engine, two chips

The Atome inference engine is plain C99 with a fixed-shape block: a LayerNorm, a ternary depthwise convolution, a diagonal state-space model, a top-k attention, and a router, in that order. It targets the ESP32-S3 (dual-core Xtensa LX7 with on-chip SRAM) and STM32 / Cortex-M parts from one source tree, because it makes no assumptions about an FPU, a cache, or an operating system. The same source compiles for a $2 STM32F103 and a $5 ESP32-S3; only the compile-time configuration changes.
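To make "only the compile-time configuration changes" concrete, here is a minimal sketch of what such a configuration could look like. Every name in it (ATOME_D_MODEL, ATOME_SEQ_MAX, ATOME_SRAM_BUDGET, atome_scratch_t) is hypothetical, and the buffer layout is an assumption for illustration, not the repository's actual header.

```c
#include <stddef.h>

/* Hypothetical compile-time knobs; override with -D on the compiler command line. */
#ifndef ATOME_D_MODEL
#define ATOME_D_MODEL 64                 /* channel width of the block */
#endif
#ifndef ATOME_SEQ_MAX
#define ATOME_SEQ_MAX 32                 /* longest sequence kept in memory */
#endif
#ifndef ATOME_SRAM_BUDGET
#define ATOME_SRAM_BUDGET (20u * 1024u)  /* bytes available, e.g. a 20 KB part */
#endif

/* Assumed layout: all working memory lives in one static struct,
 * so the footprint is known at build time and nothing is allocated at run time. */
typedef struct {
    float x[ATOME_D_MODEL];                       /* current activation vector */
    float ssm_state[ATOME_D_MODEL];               /* diagonal state, one value per channel */
    float history[ATOME_SEQ_MAX][ATOME_D_MODEL];  /* past activations for attention */
} atome_scratch_t;

/* C99 has no static_assert; a negative array size stops the build
 * when the scratch memory would not fit the chosen budget. */
typedef char atome_fits_in_sram[
    (sizeof(atome_scratch_t) <= ATOME_SRAM_BUDGET) ? 1 : -1];

static atome_scratch_t g_scratch;

/* Report the static footprint in bytes. */
size_t atome_scratch_bytes(void)
{
    return sizeof g_scratch;
}
```

With these defaults the struct occupies 8,704 bytes. Building with a larger width or sequence length than the budget allows fails at compile time instead of on the board.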

The parity contract

Because the PyTorch model and the C engine are built to the same fixed block shape, the test suite can require that they produce identical numbers. End-to-end Python-to-Cortex-M3 parity under QEMU is max |Δ| = 3.7×10⁻⁷, a few multiples of single-precision machine epsilon (about 1.2×10⁻⁷). Multi-token generation parity is exact: 48 of 48 tokens on the 60K model and 16 of 16 on the 944K model. The full suite is 146 tests, green at HEAD.
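The max |Δ| figure is the largest element-wise difference between the reference output and the engine output. A minimal sketch of that measurement follows; the function names and the idea of passing a tolerance are illustrative assumptions, not the repository's test harness.

```c
#include <stddef.h>

/* Largest absolute element-wise difference between two equal-length vectors. */
float max_abs_delta(const float *ref, const float *out, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = ref[i] - out[i];
        if (d < 0.0f) {
            d = -d;            /* absolute value without pulling in libm */
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}

/* Returns 1 when every element agrees to within tol, 0 otherwise. */
int parity_ok(const float *ref, const float *out, size_t n, float tol)
{
    return max_abs_delta(ref, out, n) <= tol;
}
```

In practice `ref` would hold values dumped from the PyTorch run and `out` the values produced by the C build. A tolerance of zero turns the check into a strict equality test.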

Why bit-exactness matters for shipping

If the chip can silently diverge from your reference implementation, you cannot certify the device's behavior. Bit-exact parity means the behavior you validated in Python is the behavior that runs on the board — which is exactly what regulated domains (medical, industrial, automotive) require, where “probably the same” is not an acceptable answer. It also makes debugging tractable: a mismatch is a located bug, not a floating-point mystery to chase across two languages.

Memory and flash, per chip

The configuration you can run depends on the part's SRAM. From the repository's measured table: an STM32F103 (20 KB SRAM) runs the small classifier configs; an RP2040 (264 KB) runs the 64-dimension story model at about 104 KB of RAM; the 944K “prod” configuration needs a 512 KB part such as an STM32F7 or an ESP32-S3. Flash is rarely the limit — the packed weights plus the engine are tens to hundreds of kilobytes, well within a typical 512 KB to 4 MB flash.
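The fit check described above amounts to a table lookup. The sketch below uses only the SRAM sizes quoted in this post; the `part_t` type, the function name, and the `reserve_kb` headroom parameter are hypothetical additions for illustration.

```c
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned sram_kb;
} part_t;

/* SRAM sizes as quoted in the text, listed from smallest to largest. */
static const part_t PARTS[] = {
    { "STM32F103", 20  },
    { "RP2040",    264 },
    { "STM32F7",   512 },
    { "ESP32-S3",  512 },
};

/* First listed part whose SRAM covers the model plus a reserve for
 * stack and application code, or NULL when nothing in the table fits. */
const part_t *smallest_part_for(unsigned need_kb, unsigned reserve_kb)
{
    for (size_t i = 0; i < sizeof PARTS / sizeof PARTS[0]; ++i) {
        if (PARTS[i].sram_kb >= need_kb + reserve_kb) {
            return &PARTS[i];
        }
    }
    return NULL;
}
```

For the 64-dimension story model at about 104 KB, the lookup lands on the RP2040, matching the table. How much reserve to leave depends on the rest of your firmware.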

One honest boundary

The parity guarantee holds between the Python reference and the C engine. The in-browser playground on this site runs a separate JavaScript reimplementation of the forward pass — same weights, but not covered by the bit-exact guarantee, since floating-point order can differ in JavaScript. And all Cortex-M3 numbers are measured under QEMU; we have not yet flashed a physical board and measured joules per token. When we do, we will publish it with the same candor as everything else.

From PyTorch to a flashable blob

The path from a trained model to something a microcontroller runs is short and explicit. You train in PyTorch, export the ternary weights to the ATOME01 packed format — four trits per byte in base-3 — and the C engine loads that blob directly with no conversion at boot. Because the engine's block structure mirrors the PyTorch module exactly, there is no translation layer that could introduce a discrepancy; the same operations happen in the same order. That is what makes the bit-exact parity guarantee possible in the first place: the two implementations are not approximations of each other, they are the same computation expressed twice.
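One way to store four trits per byte in base-3 is to treat the byte as a four-digit ternary number between 0 and 80. The sketch below assumes that layout, with the first weight in the least significant digit and digits 0, 1, 2 standing for weights -1, 0, +1. The real ATOME01 digit order and mapping may differ, and the function names are illustrative.

```c
#include <stdint.h>

/* Pack four weights in {-1, 0, +1} into one byte as a base-3 number (0..80).
 * Assumed layout: w[0] is the least significant ternary digit. */
uint8_t pack4_trits(const int8_t w[4])
{
    uint8_t b = 0;
    for (int i = 3; i >= 0; --i) {
        b = (uint8_t)(b * 3 + (w[i] + 1));   /* map -1,0,+1 to digits 0,1,2 */
    }
    return b;
}

/* Inverse of pack4_trits. Returns 0 on success, or -1 when the byte is
 * above 80 and therefore not a valid four-digit base-3 value. */
int unpack4_trits(uint8_t b, int8_t w[4])
{
    if (b > 80) {
        return -1;
    }
    for (int i = 0; i < 4; ++i) {
        w[i] = (int8_t)((int)(b % 3) - 1);   /* digit 0,1,2 back to -1,0,+1 */
        b = (uint8_t)(b / 3);
    }
    return 0;
}
```

Under these assumptions the weights {-1, 0, +1, +1} pack to the byte value 75 (0 + 1·3 + 2·9 + 2·27). Rejecting bytes above 80 gives the loader a cheap integrity check on the blob.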

Choosing between ESP32 and STM32

The two families suit different products. STM32 parts span a huge range, from the 20 KB Blue Pill that runs only the smallest classifier configs to the 512 KB STM32F7 that hosts the full 944K model, so they are ideal when you want to pick exactly the right cost and size point. The ESP32-S3 brings 512 KB of SRAM plus built-in wireless, which is useful when you want an on-device model for the privacy-sensitive, low-latency path and optional connectivity for everything else. Either way the engine code is the same; you choose the chip by the configuration your task needs and the peripherals your product wants, then confirm the fit against the measured RAM table before committing.

Testing the port before you trust it

Bit-exact parity is only worth anything if you actually run the checks, so the workflow ends with verification rather than assumption. The repository ships a parity test that compares the C engine's output against the PyTorch reference, a multi-token test that confirms generation stays identical across a sequence, and a QEMU test that runs the Cortex-M3 build itself; all of it is part of the 146-test suite that must pass before a change is considered done. When you bring the engine up on a new part, the right move is to reproduce these checks in your environment first, confirm the numbers match, and only then build features on top. A mismatch at that stage is a located bug you can fix; a mismatch discovered in the field, after you assumed the port was faithful, is far more expensive. The discipline is simple: verify parity on the chip before you trust the chip.
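A multi-token check is most useful when it reports where the sequences diverge, since that position is the "located bug." A minimal sketch, assuming token IDs are stored as 32-bit integers; the function name is hypothetical and not part of the repository's suite.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare a generated token sequence against the reference.
 * Returns the index of the first differing token, or n when all n match. */
size_t first_token_mismatch(const int32_t *ref, const int32_t *got, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (ref[i] != got[i]) {
            return i;
        }
    }
    return n;
}
```

A result equal to `n` is the "48 of 48" case. Any smaller value tells you which decoding step to inspect first.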

Bottom line

The same portable C99 engine runs on an ESP32 and across the STM32 range because it assumes no FPU, cache or operating system; you choose the chip by the configuration your task needs and confirm the fit against the measured RAM table. The export path from PyTorch to a flashable ATOME01 blob is direct, and the bit-exact parity tests are what let you trust that the chip behaves exactly like your reference. Bring the engine up, reproduce the parity checks in your environment, and only then build features on top. Verify parity on the chip before you trust the chip.

Frequently asked questions

Can an ESP32 run a language model?

Yes — an ESP32-S3 has 512 KB of SRAM, enough for Atome's larger configurations, and the heap-free C99 engine runs without an OS. Smaller configs also run on STM32 parts down to about 20 KB of SRAM.

What does bit-exact Python-to-C parity mean?

It means the C engine reproduces the PyTorch reference, and tests verify it: end-to-end numeric outputs agree to within max |Δ| = 3.7×10⁻⁷ on Cortex-M3 under QEMU, and multi-token generation matches token for token. The behavior you validate is the behavior that ships.
