Docker AI Ecosystem 01: Local LLMs with Docker Model Runner

For a decade Docker did one thing: package and run apps. Lately it has grown a second job: running AI models and the agents that drive them. This first part of the series gets a model talking on your own machine, points your code at it with a one-line change, and takes the dev loop off the cloud meter. Every command was run on a real box and the output pasted in.

I was skeptical too: the last thing the world needs is another "AI-powered" sticker on a tool that was already fine. So I installed all of it on a plain Linux box and ran it until I understood it. This series is what I found, and it goes from simple to complex.

No prior AI knowledge assumed, only that you have run a container or two.

1. From Shipping Apps to Shipping Inference

Docker's whole pitch was always "build, ship, run": wrap an app with everything it needs so it runs the same on your laptop, in CI, and in production. That killed "but it works on my machine," and for years Docker just grew rings around it: Compose, a bundled Kubernetes, Scout.

Then large language models showed up and the laptop became a place you run models, not just apps. For a while the only option was renting someone else's GPU and paying per token, which is fine for production but grating for a developer: the bill, the latency, the fact that your prompt now lives on a server you do not own, and no working on a plane. Meanwhile consumer hardware got fast enough and quantized open models small enough that running one locally went from weekend science project to a Tuesday.

Docker's bet: running a model should feel exactly like running a container. Pull an artifact, run it, point your app at it. Annoyingly, it turns out to work. The new AI pieces are Model Runner (run models locally, this post), plus agents and microVM Sandboxes (Parts 02–03). They reuse what you already know (registries, images, ports, volumes, limits), so the learning curve is short.

2. What Is Docker Model Runner?

Docker Model Runner (DMR) runs language models locally using the container mental model:

  • Models are OCI artifacts — the same packaging standard your images use. They live on Docker Hub (the ai/ namespace) and Hugging Face, and you pull, version, and cache them exactly like images.
  • There is a real engine behind it: llama.cpp by default, on CPU or GPU.
  • It speaks OpenAI — an OpenAI-compatible HTTP API, so any tool that already talks to OpenAI works against it unchanged.

3. Run Your First Model

On Linux with Docker Engine (the CLI-only "CE" install), Model Runner is a plugin package from Docker's own repository:

sudo apt-get install docker-model-plugin

If docker model reports "unknown command" afterwards, your Docker came from the distro's repo rather than Docker's; reinstall Docker Engine from Docker's official repository first.

On Docker Engine you also need the runner backend, set up as a managed container with one command:

docker model install-runner          # the runner is then a container; *-runner subcommands manage it

There is no GPU requirement. I ran all of this on a 16-core desktop CPU with no graphics card, and a small model answered in well under a second. But the GPU win is real, and I measured it on a separate box with an RTX 3090. A real 1B model (ai/gemma3:1B-Q4_K_M) went from 53 tok/s on CPU to 352 on the GPU, about 6.6×; the tiny ai/smollm2:360M went from 124 to 585, about 4.7× (the bigger model gets the bigger speedup, as you'd expect). llama.cpp offloads every layer onto the card. The CUDA log says it plainly (found 1 CUDA devices … offloaded 33/33 layers to GPU), and nvidia-smi caught 93% utilisation mid-generation. The CUDA runner image is docker/model-runner:latest-cuda.
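The speedup factors follow directly from the four throughput numbers quoted above. A quick check in Python, using only those measurements (the `speedup` helper is just a name I picked for the division):

```python
# Throughput measurements quoted in the text, in tokens per second.
measurements = {
    "ai/gemma3:1B-Q4_K_M": {"cpu": 53, "gpu": 352},
    "ai/smollm2:360M": {"cpu": 124, "gpu": 585},
}


def speedup(cpu_tok_s: float, gpu_tok_s: float) -> float:
    """Return how many times faster the GPU run was than the CPU run."""
    return gpu_tok_s / cpu_tok_s


speedups = {name: speedup(m["cpu"], m["gpu"]) for name, m in measurements.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# ai/gemma3:1B-Q4_K_M: 6.6x
# ai/smollm2:360M: 4.7x
```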

A gotcha worth the warning, because it cost me three confused runs: the runner container runs as a non-root user, so on a host where /dev/nvidia* is root:video 0660 (Gentoo, for one) it gets "NVML: Insufficient Permissions" and silently falls back to CPU. You get tok/s identical to a machine with no GPU at all, with no error in sight. Run the runner as root (or make the device nodes accessible), and make sure the NVIDIA container toolkit grants the compute capability (not just utility/nvidia-smi). Then it offloads.

Now pull a model the way you pull an image. Go small for a first run:

docker model pull ai/smollm2:360M-Q4_K_M
docker model list
docker model run ai/smollm2:360M-Q4_K_M "In one sentence, what is Docker?"
Docker is a tool for automating the creation, shipping, and running of applications and systems from source code.

A real answer from a model on your CPU. Housekeeping you will use: docker model ps (loaded in memory), docker model df (disk), docker model rm <model>. Models load into RAM only on request and unload when idle, so a pulled-but-unused model costs disk, not memory.

You are not limited to Docker Hub either. DMR pulls GGUF files straight from Hugging Face with an hf.co/ prefix:

docker model pull hf.co/unsloth/SmolLM2-135M-Instruct-GGUF

I pulled and ran that one while writing this; it lands in docker model ls and runs like any Hub model.

Picking a model. Watch RAM and quantization. The Q4_K_M in the tag names the quantization, that is, how hard the weights are compressed; Q4 variants are the local sweet spot. Parameter count drives quality and RAM (a Q4 model wants a bit more RAM than its file size while loaded). On CPU, smaller and more heavily quantized is snappier. Good small-to-mid picks: ai/qwen2.5, ai/gemma3, ai/llama3.2.
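If you script model selection, it helps to pull the size and quantization out of a reference. A minimal sketch, assuming the `<repository>:<parameters>-<quantization>` tag layout seen in this post; `parse_model_ref` and `ModelRef` are hypothetical helper names, not part of any Docker tooling, and references with a registry port (`host:5000/...`) are out of scope:

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    repository: str              # e.g. "ai/smollm2"
    parameters: Optional[str]    # e.g. "360M", or None if the tag does not say
    quantization: Optional[str]  # e.g. "Q4_K_M", or None if the tag does not say


def parse_model_ref(ref: str) -> ModelRef:
    """Split a reference like 'ai/smollm2:360M-Q4_K_M' into its parts.

    Assumes the '<repository>:<parameters>-<quantization>' layout used in this
    post. Anything else (no tag, or a tag without a dash) yields None for the
    parts that cannot be identified.
    """
    repository, colon, tag = ref.partition(":")
    if not colon:
        return ModelRef(repository, None, None)
    parameters, dash, quantization = tag.partition("-")
    if not dash:
        return ModelRef(repository, None, None)
    return ModelRef(repository, parameters, quantization)


print(parse_model_ref("ai/smollm2:360M-Q4_K_M"))
```

A reference without a tag, such as the hf.co/ one above, comes back with both optional fields set to None rather than a guess.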

4. Talk to It Over the API

The real prize is the HTTP API. The runner listens on the host at port 12434 with an OpenAI-compatible surface:

curl -s http://localhost:12434/engines/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"ai/smollm2:360M-Q4_K_M","messages":[{"role":"user","content":"Reply with exactly: API works"}],"max_tokens":20}'

Back comes the familiar OpenAI shape with usage and, helpfully, timing: "timings":{"predicted_per_second":123.63} is tokens per second, here on a CPU. There is a GET /engines/v1/models endpoint too.
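The same call works from Python with nothing but the standard library. This is a sketch, not an official client: the function names are mine, and `summarize` assumes the response carries the standard OpenAI `choices` list plus the `timings` object shown above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:12434/engines/v1"  # host endpoint from the curl example


def build_chat_request(model: str, prompt: str, max_tokens: int = 20) -> urllib.request.Request:
    """Build the same POST the curl example sends, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> tuple:
    """Return (reply text, tokens per second or None if no timings were sent)."""
    reply = response["choices"][0]["message"]["content"]
    tok_per_s = response.get("timings", {}).get("predicted_per_second")
    return reply, tok_per_s


def chat(model: str, prompt: str) -> tuple:
    """Send the request to a running Model Runner and summarize the answer."""
    with urllib.request.urlopen(build_chat_request(model, prompt), timeout=60) as resp:
        return summarize(json.load(resp))
```

With the runner up, `chat("ai/smollm2:360M-Q4_K_M", "Reply with exactly: API works")` returns the reply and the throughput figure in one tuple.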

4.1 Reaching the runner from another container

This trips up everybody once. Inside a container, localhost is the container, not your machine, so localhost:12434 hits nothing. Two ways that work on plain Docker Engine, both tested:

# A) via the host's published port
docker run --rm --add-host host.docker.internal:host-gateway curlimages/curl \
-s http://host.docker.internal:12434/engines/v1/models

# B) the real "container hostname" way: same user-defined network, call it by name
docker network create dmrnet
docker network connect dmrnet docker-model-runner
docker run --rm --network dmrnet curlimages/curl \
-s http://docker-model-runner:12434/engines/v1/models

User-defined networks give DNS by container name for free; the default bridge does not (so a bare docker-model-runner fails there). Doing this by hand gets old fast, which is exactly the chore Compose removes.
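Application code that may run either on the host or in a container can probe for whichever address answers. A rough sketch, assuming the three hostnames used above and the GET /engines/v1/models endpoint; `first_reachable` and `responds` are hypothetical helper names:

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# Candidate base URLs, in the order discussed above.
CANDIDATES = (
    "http://localhost:12434/engines/v1",             # process running on the host
    "http://host.docker.internal:12434/engines/v1",  # option A: via the host gateway
    "http://docker-model-runner:12434/engines/v1",   # option B: shared user-defined network
)


def responds(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or a non-2xx status.
        return False


def first_reachable(
    candidates: Iterable[str] = CANDIDATES,
    probe: Callable[[str], bool] = responds,
) -> Optional[str]:
    """Return the first candidate the probe accepts, or None if none answer."""
    for url in candidates:
        if probe(url):
            return url
    return None
```

The `probe` parameter exists so the selection logic can be tested without a network; in real use the default does the HTTP check.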

5. Wire It Into an App with Compose

Compose understands models natively, so there are no hard-coded endpoints. A top-level models: block declares the model, a service references it, and Compose injects the address and name as environment variables. The short form is the least typing: list the model and you get <NAME>_URL and <NAME>_MODEL:

services:
  app:
    image: my-app
    models:
      - smollm  # injects SMOLLM_URL and SMOLLM_MODEL

models:
  smollm:
    model: ai/smollm2:360M-Q4_K_M

I confirmed that injects exactly SMOLLM_URL/SMOLLM_MODEL. Use the longer endpoint_var/model_var mapping when you want to choose the variable names yourself (handy in the next section). The models: element needs Docker Compose 2.38.0+.
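Inside the service, the app only has to read those two variables. A minimal sketch, assuming the variable names are the model's Compose name uppercased with `_URL` and `_MODEL` appended, as in the SMOLLM example; `injected_endpoint` is a hypothetical helper, and the URL in the usage lines is a placeholder rather than what Compose actually injects:

```python
import os
from typing import Mapping, NamedTuple


class ModelEndpoint(NamedTuple):
    url: str    # base URL injected by Compose
    model: str  # model reference injected by Compose


def injected_endpoint(name: str, env: Mapping[str, str] = os.environ) -> ModelEndpoint:
    """Read <NAME>_URL and <NAME>_MODEL for a model listed in the service's models:.

    Assumes the short-form naming from the text: the Compose model name,
    uppercased, with _URL and _MODEL appended.
    """
    prefix = name.upper()
    try:
        return ModelEndpoint(url=env[f"{prefix}_URL"], model=env[f"{prefix}_MODEL"])
    except KeyError as missing:
        raise RuntimeError(
            f"{missing.args[0]} is not set; is '{name}' listed under the service's models:?"
        ) from None


# Usage with a stand-in environment (placeholder URL, not a real injected value).
example_env = {
    "SMOLLM_URL": "http://runner.example/engines/v1",
    "SMOLLM_MODEL": "ai/smollm2:360M-Q4_K_M",
}
print(injected_endpoint("smollm", example_env))
```

Failing loudly on a missing variable is a deliberate choice here: a silent fallback to some default endpoint is exactly the hard-coding the models: block is meant to remove.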

6. Ditching Cloud APIs: Cost-Free Dev

The cheapest token is the one you never send to a vendor. Cloud LLM APIs charge per token. That is fair in production, but during development you run the same prompt hundreds of times, and CI runs them again on every push, so you end up paying customer prices for your own debugging. Move that loop local.

Where local wins: the inner dev loop, CI/tests (deterministic, no key, no rate limits), privacy-sensitive data, and offline/air-gapped work (a model pulled once works in an isolated network like a mirrored package repo). Where it doesn't: raw quality (a small local model is not a frontier model), huge context, and throughput at scale. The grown-up answer is hybrid: local for the loop, cloud for production, chosen by config.

And the switch can be zero code. The OpenAI SDK already reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment when you build the client with no arguments, so map Compose's injected variables onto those names:

    models:
      smollm:
        endpoint_var: OPENAI_BASE_URL
        model_var: OPENAI_MODEL

import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_BASE_URL + OPENAI_API_KEY from the env
model = os.environ["OPENAI_MODEL"]

I ran exactly this against the local runner and it answered, with no base_url and no key in the code. Now dev/CI sets OPENAI_BASE_URL=http://localhost:12434/engines/v1 and a local ai/... model (plus any non-empty placeholder key, because the SDK insists one exists); production leaves OPENAI_BASE_URL unset to hit the real endpoint with a real key. The application code is byte-for-byte identical in every environment.
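Because the whole switch lives in the environment, a misconfigured environment is the main way this goes wrong. A small preflight check catches it before the first request; this is my own sketch (`preflight` and `target` are hypothetical names), and it deliberately does not import the SDK:

```python
import os
from typing import List, Mapping


def preflight(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of configuration problems; an empty list means good to go."""
    problems = []
    if not env.get("OPENAI_API_KEY"):
        # The SDK refuses to build a client without a key, even for a local runner.
        problems.append("OPENAI_API_KEY is empty or unset (a placeholder is enough locally)")
    if not env.get("OPENAI_MODEL"):
        # The snippet above reads os.environ["OPENAI_MODEL"] directly.
        problems.append("OPENAI_MODEL is empty or unset")
    return problems


def target(env: Mapping[str, str] = os.environ) -> str:
    """Describe where requests will go under this environment."""
    return env.get("OPENAI_BASE_URL") or "the SDK's default endpoint"


# Usage with a stand-in dev environment.
dev_env = {
    "OPENAI_BASE_URL": "http://localhost:12434/engines/v1",
    "OPENAI_API_KEY": "placeholder",
    "OPENAI_MODEL": "ai/smollm2:360M-Q4_K_M",
}
print(target(dev_env), preflight(dev_env))
```

Run it at startup or as the first CI step; an empty list means the unchanged application code will find everything it needs.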

The marginal cost of a development token drops to roughly zero. That matters because development tokens are the spend with the least to show for it: none of them served a customer.

Summary

  • Model Runner makes a local LLM as easy as pull then run, no CUDA weekend.
  • Models are OCI artifacts from Docker Hub's ai/ namespace or Hugging Face (hf.co/).
  • Everything is reachable over an OpenAI-compatible API on port 12434; Compose can provision and inject it.
  • Point any OpenAI SDK at it with a one-line base_url change, or zero code via OPENAI_BASE_URL.
  • Local inference wins the inner loop, CI, privacy, and offline; cloud still wins frontier quality and scale, so run hybrid.

A note on method: everything here was run on a real machine (Docker CE on Ubuntu and a desktop CPU), plus a separate RTX 3090 box for the GPU numbers. The outputs are real, typos and all.

Next

You have models running locally and answering over a standard API. The harder, more interesting problem is running the agents that drive them: programs that execute shell commands and edit files for you. Part 02 explains why a plain container is not a strong enough wall for that.

