Datacenter on Your Desk
I ran a 397-billion-parameter model at 6.1 tokens per second on a $2,500 AMD Strix Halo desktop. Nobody had published benchmarks at this scale on consumer hardware. Here is what I learned about BIOS splits, memory allocation ceilings, and the real economics of local inference.
Last Tuesday night I watched a 397-billion-parameter model generate text on a $2,500 box sitting on my desk in Nashville. Not a cloud endpoint. Not an API call. Not a rented GPU. A desktop computer I bought with a credit card, running a model that was supposed to require a datacenter.
Six point one tokens per second. Not fast by cloud standards. But fast enough to hold a conversation. Fast enough to run inference for a seven-agent system. Fast enough to change the economics of everything I have been building for the last six months.
The Hardware Nobody Is Benchmarking
The machine is called Gesha -- named after the rarest coffee varietal, because my wife names all our machines after coffee strains with a cannabis whisper. It is a Framework Desktop with an AMD Ryzen AI Max+ 395, the Strix Halo APU. RDNA 3.5 graphics, 40 compute units, Radeon 8060S. The spec that matters: 128GB of LPDDR5X unified memory. CPU and GPU share one memory bus, one pool. No PCIe bottleneck. No VRAM wall.
When I started researching Strix Halo benchmarks, the largest published models I could find topped out between 70 and 120 billion parameters. Level1Techs, TweakTown, Hardware Corner, llm-tracker.info -- they all stopped there. Nobody had published results for anything larger. Not 235B. Certainly not 397B.
So I did what operators do. I stopped reading benchmarks and started running them.
The Model Ladder
I built a systematic benchmark ladder. Start small, prove the stack works, scale up until something breaks. Here is what Gesha produced:
- Qwen3 8B Q4_K_M -- 5.5 GB on GPU, 37.0 tok/s generation, all 33 layers on GPU. This is the baseline. Fast, boring, expected.
- Qwen3 32B Q4_K_M -- 20 GB on GPU, 10.4 tok/s generation, all 65 layers on GPU. Still comfortably fast. Still fully offloaded to GPU. Still boring.
- Qwen3-235B-A22B Q2_K -- this is where it gets interesting. 56.5 GB on GPU plus 24.2 GB on CPU. 9.2 tok/s generation with a 66-of-95 GPU layer split. A 235-billion-parameter model running at nearly ten tokens per second on a desktop.
- Qwen3.5-397B-A17B IQ2_XXS -- 397 billion parameters. 17 billion active per token (mixture-of-experts). 115 GB across four GGUF shards. 33 of 61 layers on GPU. 6.1 tokens per second.
That last number is the one that matters. 397 billion parameters on consumer hardware. Not a cluster. Not a rack. A box you can carry with one hand.
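As a sanity check on those footprints: a quantized GGUF weighs roughly parameters times effective bits-per-weight, divided by eight. The bits-per-weight figures below are back-calculated from the measured file sizes, not official quant specs -- a rough sketch:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model: params * bpw / 8 bits per byte.

    Ignores metadata and tensors kept at higher precision, so treat the
    result as an estimate, not a spec.
    """
    return params_billion * bits_per_weight / 8


# Effective bpw back-calculated from the sizes measured in this post.
# Mixed quants keep some tensors at higher precision, so effective bpw
# sits above the nominal ~2.06 of IQ2_XXS or ~2.6 of Q2_K.
print(gguf_size_gb(397, 2.32))  # ~115 GB, matches the four-shard total
print(gguf_size_gb(235, 2.75))  # ~80.8 GB, close to the 56.5 + 24.2 observed
```

Useful for predicting whether the next rung of the ladder fits in 128GB before you download 115 GB of shards.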
The BIOS Discovery That Changed Everything
Here is the part that cost me a full day and taught me something nobody documents.
AMD Strix Halo has unified memory architecture -- UMA. The CPU and GPU share the same physical RAM. But the BIOS lets you carve out how much of that 128GB pool is reserved as dedicated VRAM versus system RAM. And that split determines which inference backends work.
At 96GB VRAM / 32GB system RAM (the default config):
- Ollama with ROCm 6 works beautifully. 10.4 tok/s on 235B, 66 of 95 GPU layers. ROCm 6 allocates from the VRAM carveout and uses driver-level lazy paging.
- HIP SDK 7.1 (compiled llama.cpp) crashes. hipMalloc pins physical pages from the system RAM pool, and 32GB is not enough for a 56GB model buffer.
At 2GB VRAM / 126GB system RAM (after I changed the BIOS):
- HIP SDK 7.1 works. 235B at 9.1 tok/s. 397B at 6.1 tok/s. hipMalloc now has 126GB of system RAM to allocate from.
- Ollama breaks completely. Falls to CPU-only at 2.0 tok/s. Cannot allocate from a 2GB VRAM carveout.
- Vulkan is completely broken. Cannot allocate even 572MB. AMD Windows Vulkan driver needs VRAM carveout for device-local memory.
The key insight: HIP allocates from system RAM. Vulkan and Ollama allocate from the VRAM carveout. They are mutually exclusive optimization targets on the same hardware, controlled by a single BIOS setting.
Nobody tells you this. It is not in the AMD documentation. It is not in any forum post I found. I learned it by rebooting twelve times and watching allocation errors.
The Windows Ceiling
HIP SDK 7.1 on Windows has a hard limit: roughly 60GB per hipMalloc call. I tested this exhaustively. With 126GB of system RAM free, 59GB allocations succeed. 64GB allocations either hang or OOM. This is not a RAM limitation. It is an allocation ceiling in the HIP Windows driver.
For 397B, this means 33 of 61 GPU layers is the maximum. That is where the 6.1 tok/s number comes from. Push to 35 layers and performance drops to 4.5 tok/s -- memory pressure. Push to 38 layers and you OOM.
For 235B, the ceiling caps you at 70 of 95 GPU layers. Still good -- 9.1 tok/s. But not what the hardware is capable of.
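The layer math behind those splits can be sketched as follows. The non-layer overhead (embeddings, output head) is my estimate, not a measured value, and real layers are not perfectly uniform -- treat this as a back-of-envelope check, not a formula:

```python
import math


def max_gpu_layers(total_gb: float, n_layers: int, ceiling_gb: float = 60.0,
                   non_layer_gb: float = 5.0) -> int:
    """Estimate how many layers fit under the ~60GB hipMalloc ceiling.

    Assumes uniform layer sizes and that roughly non_layer_gb of the file
    (embeddings, output head) lives outside the repeating layers -- that
    overhead figure is an assumption, not a measurement.
    """
    per_layer = (total_gb - non_layer_gb) / n_layers
    return math.floor(ceiling_gb / per_layer)


print(max_gpu_layers(115, 61))  # 33 -- matches the observed 33-of-61 split
```

For 235B the same arithmetic lands a few layers above the observed 70, because KV cache and compute buffers also eat into the allocation budget.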
The Optimization Sweep
I ran 15 configurations on the 235B model to find the production-optimal settings. Zero crashes across the full sweep. Here are the results that matter:
The winner: context 2048, batch size 512, 9.9 tok/s average. The defaults were nearly as good at 9.8 tok/s. Flash attention on or off, thread counts from 12 to 16 -- all within the margin of error.
The performance killer: 24 threads. Dropped to 2.8 tok/s. Three and a half times slower than 12 threads. CPU oversubscription on Strix Halo is catastrophic. The instinct to throw all cores at the problem is exactly wrong.
The production config: num_ctx 2048-4096, num_batch 512, num_thread 12-16, never 24. flash_attention optional -- it does not measurably help on this architecture.
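That config translates into an options builder like this. The parameter names follow Ollama's API options; the guard encodes the 24-thread finding rather than any documented limit:

```python
def production_options(num_thread: int = 12) -> dict:
    """Production config from the 15-run sweep on 235B: ctx 2048-4096,
    batch 512, 12-16 threads. Higher thread counts are rejected because
    24 threads dropped throughput from 9.9 to 2.8 tok/s.
    """
    if not 1 <= num_thread <= 16:
        raise ValueError("CPU oversubscription kills Strix Halo: keep threads <= 16")
    return {
        "num_ctx": 2048,     # 2048-4096 were within margin of each other
        "num_batch": 512,
        "num_thread": num_thread,
        # flash attention left at default: no measurable gain on this APU
    }


opts = production_options(12)
print(opts["num_batch"])  # 512
```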
Why This Changes the Math
I run a multi-agent system called Groundswell. Seven agents, 22 scheduled tasks, operating on a $500/month budget. The single largest cost pressure is inference. Every agent call that hits a cloud API has a per-token cost. Every per-token cost compounds across thousands of daily invocations.
Gesha changes this equation. A 235B model at 9.9 tok/s is fast enough for agent routing decisions, content scoring, policy checks -- all the tasks that do not need frontier-model reasoning but do need to be cheap and fast. A 397B model at 6.1 tok/s is slow but usable for complex reasoning tasks that currently go to Claude Opus at API rates.
The math: a $2,500 one-time hardware investment replaces a recurring monthly inference bill. Not all of it. Frontier models still win for the hardest reasoning tasks. But for the 80% of agent calls that are classification, scoring, routing, and template-based generation -- local inference on Strix Halo is not a hobby project. It is a production alternative.
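The payback arithmetic is one division. The monthly offset below is an illustrative assumption -- plug in whatever share of your own cloud bill local inference actually absorbs:

```python
def payback_months(hardware_cost: float, monthly_cloud_offset: float) -> float:
    """Months until a one-time hardware cost beats the recurring API spend
    it displaces. The offset is an assumption, not a measured figure."""
    return hardware_cost / monthly_cloud_offset


# If local inference absorbed $400 of a $500/month budget (illustrative):
print(payback_months(2500, 400))  # 6.25 -- pays for itself inside 7 months
```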
The Linux Frontier
Everything I have described so far is on Windows. The 60GB hipMalloc ceiling is a Windows limitation. On Linux with ROCm 6.4, kernel 6.13+, and amdgpu.gttsize set to 117760, the GTT (Graphics Translation Table) should expand to roughly 115GB. That means all 61 layers of the 397B model on GPU. No CPU spillover. No memory pressure.
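The unit math behind that figure, for the record -- amdgpu.gttsize is specified in megabytes (MiB):

```python
def gtt_gib(gttsize_mib: int) -> float:
    """Convert the amdgpu.gttsize kernel parameter (MiB) to GiB."""
    return gttsize_mib / 1024


gtt = gtt_gib(117760)
print(gtt)                      # 115.0 GiB of GTT
print(gtt * 1024**3 >= 115e9)   # True: all four 397B shards would fit
```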
My projection: 10-12 tok/s on 397B. Double the Windows number. On the same $2,500 hardware.
Ubuntu 24.04 is the next build. When those numbers are confirmed, I will publish them the same way I published the Windows results -- with terminal screenshots, exact commands, and reproducible steps. The GitHub repo is public: thebeedubya/autoresearch.
What the Inference Community Validated
I posted the 397B results to r/LocalLLaMA on Reddit and to X. The Reddit community -- the single most technically demanding audience for local inference benchmarks -- engaged. Not with skepticism. With curiosity. Because nobody had published Strix Halo numbers at this scale before.
The X posts tagged @ggerganov (llama.cpp creator, 54K followers), @AMD (1.3M followers), and the local inference community. Real terminal screenshots. Real numbers. No generated graphics, no marketing polish. Just the receipt.
The Operator Lesson
Everyone in the AI industry talks about inference costs like they are an immutable law of physics. API pricing goes down 10% per quarter and we are supposed to be grateful. Meanwhile, a $2,500 AMD desktop is running 397 billion parameters in a home office.
The gap between cloud inference pricing and local inference capability is not closing. It is inverting. The hardware exists today -- not in a research lab, not behind a waitlist, not at enterprise pricing. On a desk. Ordered with a credit card. Running models that were supposed to require an H100 cluster.
I am not suggesting everyone should build their own inference stack. Most teams should not. But every operator should understand that the economics of AI inference are not fixed. They are a choice. And the range of choices just got dramatically wider.
Not because some breakthrough happened in model architecture. Because a $2,500 APU with unified memory and a well-compiled inference engine can do what a datacenter GPU did eighteen months ago.
The most expensive assumption in AI right now is that you need expensive hardware.