Best Tiny LLM for a $2 MCU: TinyLlama vs llama2.c vs TinyMaix vs Atome

“Small model” is a slippery phrase, and most comparison articles dodge the one question that matters for embedded work: does it actually fit the chip's memory? This post puts four well-known options side by side against real microcontroller budgets. The numbers for Atome come from its public measured table; the others are taken from each project's own sizing and from the hardware they were designed for.

The contenders

Stack	Smallest realistic footprint	Target hardware	Fits a $2 MCU?
TinyLlama 1.1B (4-bit)	~550 MB	GPU / phone	✗
llama2.c (Stories260K, FP32)	~1 MB+	Pi / desktop	✗ on MCU SRAM
TinyMaix / TFLite-Micro	tens–hundreds KB	MCU (vision/keyword)	✓ — but not LLMs
Atome lm	~42 KB flash / ~14 KB RAM	MCU (byte LM)	✓

Reading the table honestly

TinyMaix is excellent engineering — but it runs convolutional networks and keyword spotters, not autoregressive language models. If your task is wake-word detection or image classification, it is the right tool and Atome is not. llama2.c is a beautiful single-file transformer, but the checkpoints people actually run need a megabyte or more of working memory, which means a Raspberry Pi rather than an STM32. TinyLlama is, despite the name, a GPU-class model: wonderful on a phone, impossible on a Cortex-M.

Atome lm is the narrow case that fills the gap the others leave: an actual byte-level language model whose ternary weights pack into tens of kilobytes of flash and whose heap-free engine fits the SRAM of a cents-class part. It is not better than TinyLlama at language; it is the only one of the four that runs on the chip at all.
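To make "ternary weights pack into tens of kilobytes" concrete, here is a minimal sketch of one way to store weights from {-1, 0, +1} at 2 bits each, four per byte. The 2-bit codes, the byte layout and the function names are assumptions for illustration; the post does not describe Atome's actual on-flash format.

```python
# Sketch: pack ternary weights (-1, 0, +1) at 2 bits each, four per byte.
# The code table and layout are illustrative assumptions, not Atome's format.

CODE = {0: 0b00, 1: 0b01, -1: 0b10}
DECODE = {bits: weight for weight, bits in CODE.items()}


def pack_ternary(weights):
    """Pack a list of -1/0/+1 weights into bytes, four weights per byte."""
    out = bytearray((len(weights) + 3) // 4)  # round up to whole bytes
    for i, w in enumerate(weights):
        out[i // 4] |= CODE[w] << (2 * (i % 4))
    return bytes(out)


def unpack_ternary(packed, count):
    """Recover the first `count` weights from a packed byte string."""
    return [
        DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(count)
    ]


weights = [1, 0, -1, -1, 0, 1, 1]
packed = pack_ternary(weights)
assert unpack_ternary(packed, len(weights)) == weights
print(len(packed))  # 7 weights occupy 2 bytes
```

At this density, N weights take about N/4 bytes. Denser encodings are possible (five ternary values fit in one byte, since 3^5 = 243 is below 256), at the cost of a slower decode.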

Where Atome is not the answer

Two honest disqualifiers. If you have a Raspberry Pi, an NPU or a phone, you have megabytes to gigabytes: use a larger model and get better quality. Atome's niche is specifically the place those do not reach. And at scale Atome loses: when you grow it to about 944K parameters, a vanilla FP32 transformer reaches roughly 11% lower perplexity. Atome's bet is deliberately the sub-1M, microcontroller-class regime, not open-domain quality.

How to choose

Need vision or keyword spotting on an MCU? Use TinyMaix / TFLite-Micro.
Have a Raspberry Pi or a phone? Run llama2.c or a quantized small LLM and enjoy the extra quality.
Need an actual byte language model on a bare-metal Cortex-M, offline, heap-free? That is Atome's lane.

The right comparison is never “which tiny model is best” in the abstract — it is “which one fits my chip and my task.” For a $2 microcontroller running a narrow language task with no network, Atome is currently the only one of these four that loads at all.
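The "does it fit my chip" question can be sketched as a simple check of flash and RAM needs against a chip's budget. The footprints below are the figures from the table above; the chip budget (64 KB flash, 20 KB SRAM) is a hypothetical low-cost part, not a specific product, and the check ignores whatever the rest of your firmware already uses.

```python
# Sketch: which language-model stacks fit a given MCU memory budget?
KB = 1024
MB = 1024 * KB

# (flash need, RAM need) taken from the comparison table. Where the table
# gives a single figure, it is treated as a RAM need and flash is left at 0,
# which is optimistic for that stack.
NEEDS = {
    "TinyLlama 1.1B (4-bit)": (0, 550 * MB),
    "llama2.c (Stories260K, FP32)": (0, 1 * MB),
    "Atome lm": (42 * KB, 14 * KB),
}


def fits(need, chip):
    """True if both the flash and RAM needs fit the chip's (flash, RAM)."""
    need_flash, need_ram = need
    chip_flash, chip_ram = chip
    return need_flash <= chip_flash and need_ram <= chip_ram


# Hypothetical budget for illustration: 64 KB flash, 20 KB SRAM.
chip = (64 * KB, 20 * KB)
fitting = [name for name, need in NEEDS.items() if fits(need, chip)]
print(fitting)  # ['Atome lm']
```

TinyMaix is left out because it is not a language model; in practice you would also subtract your application's own flash and RAM use from the chip budget before running the check.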

Why parameter count is a misleading spec

Comparison tables love parameter counts because they are a single number, but for embedded work they hide the thing that decides feasibility. Two models with the same parameter count can have wildly different memory footprints depending on weight precision, tokenizer size, sequence length and cache strategy. A 1M-parameter FP32 model is 4 MB of weights; a 1M-parameter ternary model stored at 2 bits per weight is about 250 KB. The vocabulary matters too: a model with a 32,000-token BPE vocabulary ships a large embedding table, while Atome's byte tokenizer has just 256 entries and no separate vocabulary file. When you compare tiny LLMs, compare bytes in RAM and flash, not parameters.
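The arithmetic behind this is short enough to sketch. The helper names below are made up for illustration, the embedding width of 64 and the 4-layer, 256-token shape are hypothetical, and the cache formula assumes standard multi-head attention that stores one key and one value vector per layer per position.

```python
# Sketch: bytes, not parameters. All model shapes here are hypothetical.

def weight_bytes(params, bits_per_weight):
    """Storage for `params` weights at an integer number of bits each."""
    return (params * bits_per_weight + 7) // 8  # round up to whole bytes


def embedding_bytes(vocab_size, dim, bits_per_weight):
    """Storage for a vocab_size x dim embedding table."""
    return weight_bytes(vocab_size * dim, bits_per_weight)


def kv_cache_bytes(layers, seq_len, dim, bytes_per_value):
    """Key and value vectors cached for every layer and position."""
    return 2 * layers * seq_len * dim * bytes_per_value


print(weight_bytes(1_000_000, 32))      # 4000000 bytes: 1M params in FP32
print(weight_bytes(1_000_000, 2))       # 250000 bytes: 1M params at 2 bits
print(embedding_bytes(32_000, 64, 32))  # 8192000 bytes: 32,000-token vocab
print(embedding_bytes(256, 64, 32))     # 65536 bytes: 256-entry byte vocab
print(kv_cache_bytes(4, 256, 64, 4))    # 524288 bytes of FP32 cache
```

With the same hypothetical width, the 32,000-token embedding table alone is 125 times larger than the byte-level one, and the cache grows linearly with sequence length, which is why the same parameter count can land on either side of an MCU's memory limit.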

The role each tool plays

TinyLlama and other 1B-class models are for phones, single-board computers and GPUs — anywhere with gigabytes. They are not in the microcontroller conversation.
llama2.c is a superb teaching implementation and runs nicely on a Raspberry Pi; on a bare MCU its usual checkpoints overflow SRAM.
TinyMaix and TFLite-Micro own the MCU space for vision and keyword spotting, but they are not language models.
Atome is the byte-level language model for the bare-metal MCU niche — narrow tasks, offline, heap-free.

Seen this way, the four are not really competitors; they occupy different rungs of the same ladder. The mistake the comparison is meant to correct is treating a phone-class model as if it were an embedded one. Once you put each on the rung where it belongs, choosing is easy: start from your hardware's memory and your task's breadth, and the right tool is usually obvious.

Total cost of ownership, not just model size

When you compare these stacks for a real product, model size is only one line in a longer budget. A cloud-backed approach adds recurring inference fees, a connectivity requirement, and a privacy and compliance burden that grows with the sensitivity of the data. A Raspberry Pi-class single-board computer adds unit cost, power draw, boot time and a full operating system to maintain and secure. A bare-metal microcontroller running Atome adds essentially none of these: the model is part of the firmware you were already shipping, there is no per-inference cost, no network dependency, and no operating system to patch. For a high-volume device, those recurring and per-unit savings often dwarf any difference in raw model quality, which is why the right comparison weighs the whole system, not just the benchmark.

Bottom line

For a $2 microcontroller running a narrow language task with no network, Atome is currently the only one of these four stacks that loads at all — not because it is the smartest model, but because it is the one designed for that memory budget. TinyLlama belongs on phones, llama2.c on a Raspberry Pi, and TinyMaix on vision and keyword tasks. Score the candidates by bytes in RAM and flash rather than parameter count, weigh the whole system cost rather than the benchmark alone, and the choice for a bare-metal MCU usually makes itself.

Frequently asked questions

What is the best LLM to run on an ESP32 or STM32?

For a true language model that fits MCU memory, Atome lm is purpose-built for it; for vision or keyword tasks use TinyMaix. Large models like TinyLlama do not fit microcontroller SRAM.

Is TinyLlama small enough for a microcontroller?

No. TinyLlama has 1.1 billion parameters, about 550 MB even at 4-bit, which is thousands of times larger than the tens to hundreds of kilobytes of SRAM on a typical microcontroller. It targets phones and GPUs.
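As a rough check on the scale of that gap, the sketch below divides TinyLlama's 4-bit weight size by a few SRAM sizes. The SRAM figures are illustrative round numbers rather than specific parts, and the calculation counts weights only.

```python
# Sketch: how many times larger are TinyLlama's 4-bit weights than MCU SRAM?
PARAMS = 1_100_000_000
tinyllama_bytes = PARAMS * 4 // 8  # 4 bits per weight -> 550,000,000 bytes


def times_larger(sram_kib):
    """Whole-number ratio of the weight size to an SRAM size given in KiB."""
    return tinyllama_bytes // (sram_kib * 1024)


for sram_kib in (20, 256, 1024):  # illustrative SRAM sizes
    print(sram_kib, "KiB SRAM:", times_larger(sram_kib), "x too large")
```

Even against a full megabyte of SRAM the weights are hundreds of times too large, before counting activations or the rest of the firmware.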
