How to Run an LLM on a Microcontroller (What Actually Fits in 256 KB)

If you have ever asked an AI assistant “what language model can I run on a microcontroller?”, you have probably been handed a list of names — TinyLlama, Phi, Gemma, sometimes Karpathy's Stories260K — that sound small but do not come close to fitting on a real microcontroller. The confusion is understandable: the word “tiny” means something completely different to a GPU engineer than it does to an embedded engineer. This article walks through the actual memory arithmetic so you can tell, for any given chip, whether a language model will run on it at all.

“Tiny” for a GPU is enormous for an MCU

A 1.1-billion-parameter model such as TinyLlama, quantized to 4 bits, is roughly 550 MB of weights. A “small” 2-billion-parameter model is about a gigabyte at 4 bits and twice that at 8. Even Stories260K, a deliberately minimal 260K-parameter FP32 transformer, is about a megabyte of weights alone (260K parameters × 4 bytes), before you add activations and a KV cache. Those numbers are perfectly reasonable on a phone or a Raspberry Pi. They are absurd on a microcontroller.

A typical microcontroller has kilobytes, not megabytes. An STM32F103 “Blue Pill” has 20 KB of SRAM and 128 KB of flash. A Raspberry Pi Pico (RP2040) has 264 KB of SRAM. An ESP32-S3 has 512 KB of on-chip SRAM. The gap between a 550 MB model and a 20 KB budget is more than four orders of magnitude — and no amount of quantization closes a 10,000× gap.
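The arithmetic behind these figures is easy to reproduce. The following is a minimal sketch, not part of any toolchain: the function name `weight_bytes` is illustrative, 1 MB is taken as 10^6 bytes and 1 KB as 1024 bytes, and only weight storage is counted.

```python
def weight_bytes(params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, ignoring activations and caches."""
    return params * bits_per_weight / 8

KB = 1024  # assumption: chip memory quoted in units of 1024 bytes

tinyllama_4bit = weight_bytes(1_100_000_000, 4)   # 1.1B parameters at 4 bits
stories260k_fp32 = weight_bytes(260_000, 32)      # 260K parameters at FP32
blue_pill_sram = 20 * KB                          # STM32F103: 20 KB of SRAM

print(f"TinyLlama @ 4 bit: {tinyllama_4bit / 1e6:.0f} MB")
print(f"Stories260K FP32:  {stories260k_fp32 / 1e6:.2f} MB")
print(f"Gap vs 20 KB SRAM: {tinyllama_4bit / blue_pill_sram:,.0f}x")
```

Swapping in the parameter count and bit width of any candidate model gives a first go/no-go answer before looking at anything else.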

RAM is the constraint, not parameter count

On a desktop or a server you can stream weights from disk and page them in and out. On a bare-metal microcontroller you cannot: the weights and your application code have to fit in flash, and the activations, the attention cache and the stack all have to coexist in a single, fixed pool of SRAM. There is no operating system to swap to, no disk to page from. This is why so many “edge LLM” demos quietly run on a Raspberry Pi, a full Linux computer with gigabytes of RAM, rather than on the chip itself. A Raspberry Pi is not a microcontroller, and the distinction matters enormously for power, cost and form factor.
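To see how quickly working memory adds up, here is a rough sketch of the attention-cache term alone. It uses the generic formula for a standard transformer (one key and one value vector per layer per position). The context length, the one-byte value width and the 32 KB "everything else" figure are assumptions chosen for illustration, not measurements of any particular engine.

```python
def kv_cache_bytes(layers: int, d_model: int, seq_len: int, bytes_per_value: int) -> int:
    """Keys plus values for every layer and position: 2 * layers * seq_len * d_model values."""
    return 2 * layers * seq_len * d_model * bytes_per_value

def fits_in_sram(sram_bytes: int, *usage_bytes: int) -> bool:
    """True if all the listed consumers fit in the fixed SRAM pool together."""
    return sum(usage_bytes) <= sram_bytes

KB = 1024

# Hypothetical shape: 4 layers, d_model = 64, 128-position context, 1 byte per value.
cache = kv_cache_bytes(layers=4, d_model=64, seq_len=128, bytes_per_value=1)
other = 32 * KB  # assumed activations, stack and application buffers

print(cache // KB, "KB of cache")
print("fits in 20 KB: ", fits_in_sram(20 * KB, cache, other))
print("fits in 264 KB:", fits_in_sram(264 * KB, cache, other))
```

The cache grows linearly with context length, so on a small part the context window is as much a budget decision as the layer count.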

The three techniques that make a model fit

Atome lm is built backwards from the microcontroller constraint, and it relies on three design decisions that together collapse the memory budget. First, ternary weights: every weight is one of three values (−α, 0, +α), about 1.58 bits each instead of 32, after the BitNet b1.58 recipe. Second, a byte-level tokenizer: the vocabulary is just the 256 byte values, so there is no embedding table or vocabulary file to ship. Third, a fixed-shape, heap-free C99 engine: all working memory lives in static buffers sized at compile time, so there is no allocator and no fragmentation at run time.
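The first two ideas can be illustrated on the host side in a few lines. This is a sketch under stated assumptions: the base-3 packing of five ternary values per byte (1.6 bits per weight) is one common way to store ternary weights and is not claimed to be Atome's on-disk format, and the names `tokenize`, `pack_ternary` and `unpack_ternary` are hypothetical.

```python
def tokenize(text: str) -> list[int]:
    """Byte-level tokenization: the token IDs are simply the UTF-8 bytes (0..255)."""
    return list(text.encode("utf-8"))

def pack_ternary(weights: list[int]) -> bytes:
    """Pack values from {-1, 0, +1} five per byte as base-3 digits (3**5 = 243 <= 256)."""
    packed = bytearray()
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)   # map -1, 0, +1 to digits 0, 1, 2
        packed.append(value)
    return bytes(packed)

def unpack_ternary(packed: bytes, count: int) -> list[int]:
    """Inverse of pack_ternary; `count` trims the padding in the last byte."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:count]

signs = [-1, 0, 1, 1, 0, -1, 1]
assert unpack_ternary(pack_ternary(signs), len(signs)) == signs
print(tokenize("on"), len(pack_ternary(signs)), "bytes for", len(signs), "weights")
```

At inference time each unpacked sign would be multiplied by the scale α, so a matrix-vector product reduces to additions and subtractions followed by one scaling step.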

What actually fits — measured numbers

These are not projections. The Atome repository ships a RAM/flash table generated from a real Cortex-M3 build running under QEMU (MPS2-AN385), measuring flash as .text + .data + model and RAM as .bss + measured stack high-water:

Config	d_model / layers	Flash	Peak RAM	Fits RP2040 (264 KB)?
nano / classifier	16 / 2	41.9 KB	14.5 KB	✓
byte_small	32 / 2	52.1 KB	27.5 KB	✓
tinystories	64 / 4	79.4 KB	104.1 KB	✓
mid	128 / 4	143.4 KB	205.1 KB	✓
prod_1m	256 / 8	579.6 KB	411.6 KB	✗ (RAM)

The small configurations fit comfortably on cents-class parts; the smallest classifier build needs about 14 KB of RAM and runs on a 20 KB STM32F103. The largest 944K-parameter “prod” configuration exceeds the RP2040's 264 KB SRAM and needs a 512 KB part such as an STM32F7 or an ESP32-S3. In other words, “runs on a microcontroller” is regime-dependent, and the honest answer is a table, not a slogan.
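The table lends itself to a mechanical check. The sketch below copies the measured flash and RAM figures from the table and filters them against a chip's budget. The helper name `configs_that_fit`, the optional headroom parameters and the 2 MB flash figure for a Pico board are assumptions made for illustration.

```python
# (flash_kb, peak_ram_kb), copied from the measured table above.
CONFIGS = {
    "nano / classifier": (41.9, 14.5),
    "byte_small": (52.1, 27.5),
    "tinystories": (79.4, 104.1),
    "mid": (143.4, 205.1),
    "prod_1m": (579.6, 411.6),
}

def configs_that_fit(sram_kb: float, flash_kb: float,
                     app_ram_kb: float = 0.0, app_flash_kb: float = 0.0) -> list[str]:
    """Return the configurations whose RAM and flash both fit, after reserving app headroom."""
    return [
        name
        for name, (flash, ram) in CONFIGS.items()
        if ram + app_ram_kb <= sram_kb and flash + app_flash_kb <= flash_kb
    ]

print(configs_that_fit(sram_kb=20, flash_kb=128))    # STM32F103 figures from this article
print(configs_that_fit(sram_kb=264, flash_kb=2048))  # RP2040 on a Pico board (2 MB flash assumed)
```

Passing non-zero `app_ram_kb` and `app_flash_kb` reserves room for the rest of the firmware, which is usually what decides borderline cases.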

The honest caveat about quality

Fitting on a microcontroller is not the same as matching GPT. At kilobyte scale, a model is fluent only inside a narrow domain you train it on — command parsing, a single FAQ, a device's log grammar. Atome's advantage is real but bounded: at around 60K parameters it beats a parameter-matched FP32 transformer by about 22% in perplexity, but once you scale past roughly a million parameters a plain float model wins again. If your hardware has gigabytes, use a bigger model. If it has kilobytes, the popular recommendations simply will not load — and that is the entire point of building for this regime.

A worked example: a command router on a Blue Pill

Suppose you want a $2 STM32F103 to recognize a handful of spoken commands transcribed to text — “lights on”, “open door”, “read temperature” — and route each to the right handler, entirely offline. You do not need a billion-parameter model for this; you need a small classifier that maps short byte strings to one of a few intents. Atome's nano/classifier configuration fits this in about 14 KB of RAM and roughly 42 KB of flash, leaving the rest of the chip for your application. The model is trained narrow, on your specific command grammar, and because it ships inside the firmware it answers identically on every unit and never depends on a network.
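The routing logic around the classifier is small enough to model on the host before writing any firmware. In the sketch below everything is hypothetical: `classify_stub` is a lookup table standing in for the trained model, and the intent names are invented for the example. Only the shape of the flow (bytes in, intent index out, handler chosen) is the point.

```python
INTENTS = ["lights_on", "open_door", "read_temperature"]

def classify_stub(command: bytes) -> int:
    """Stand-in for the model: returns an intent index, or -1 if nothing matches.
    A real build would run the trained classifier here instead of a lookup."""
    table = {b"lights on": 0, b"open door": 1, b"read temperature": 2}
    return table.get(command, -1)

def route(text: str, classify=classify_stub) -> str:
    """Normalize the text, classify its bytes, and map the index to an intent name."""
    index = classify(text.strip().lower().encode("utf-8"))
    if 0 <= index < len(INTENTS):
        return INTENTS[index]
    return "unknown"

for phrase in ["Lights on", "open door ", "make coffee"]:
    print(phrase.strip(), "->", route(phrase))
```

Keeping an explicit "unknown" path matters on a device with no network fallback: a narrow model should reject out-of-grammar input rather than guess.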

Why streaming weights does not save you

A common objection is: “can't I just stream the weights from external flash or an SD card?” For the weights themselves, partly. But inference also needs working memory for activations, the attention cache and intermediate buffers, and that working set must live in SRAM. Streaming a 550 MB model through 20 KB of RAM is not a memory trick: every generated token means re-reading essentially all of the weights from external storage, in tens of thousands of buffer-sized chunks, which is far too slow and power-hungry for a real product. The reason Atome fits is not clever paging; it is that the whole model and its working set are small enough to sit in the part's own flash and SRAM at once, with no external storage in the loop.
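The cost of streaming can be estimated with two divisions. This sketch assumes a dense model in which every weight is read once per generated token, and a sustained read rate of 10 MB/s from external storage. That bandwidth is a hypothetical round number, not a measurement of any specific flash or SD card.

```python
def chunks_per_token(model_bytes: int, buffer_bytes: int) -> int:
    """Number of buffer-sized reads needed to pass all weights through RAM once."""
    return -(-model_bytes // buffer_bytes)  # ceiling division

def seconds_per_token(model_bytes: int, read_bytes_per_s: float) -> float:
    """Lower bound on latency per token from storage bandwidth alone."""
    return model_bytes / read_bytes_per_s

model_bytes = 550_000_000   # 1.1B parameters at 4 bits
buffer_bytes = 20 * 1024    # even if all 20 KB of SRAM were the streaming buffer

print(chunks_per_token(model_bytes, buffer_bytes), "reads per token")
print(seconds_per_token(model_bytes, 10_000_000), "seconds per token at 10 MB/s")
```

Even with the compute cost ignored entirely, the result is on the order of a minute per token under these assumptions.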

The practical takeaway is to size the task to the chip from the start. Decide what the device must understand, train a narrow model for exactly that, pick the smallest configuration that clears the accuracy bar, and check it against the measured RAM and flash table before you commit to hardware. That order — task, then model, then chip — is what makes on-device language models ship instead of stall.

Frequently asked questions

Can you really run a large language model on an Arduino or ESP32?

You can run a small byte-level language model on an ESP32 (and the smallest configs on an STM32), but not a multi-billion-parameter LLM. The model has to fit the chip's SRAM and flash; Atome's measured configs range from about 14 KB to 412 KB of RAM.

How much RAM do I need to run a language model on a microcontroller?

From about 14 KB for a small classifier configuration up to about 412 KB for the 944K-parameter model. RAM, not parameter count, is the binding constraint — see Atome's measured RAM_TABLE.md.
