Running Code AI Locally: An Engineering Reality Check

Over the last couple of days, my LinkedIn feed has been flooded with euphoric posts about “Code AI” and “local coding assistants”. Screenshots of terminals, bold claims about productivity exploding, and the familiar undertone that if you are not running an LLM locally via Ollama, OpenCode, or Copilot, you are already falling behind.

I know that not only engineers read my blog. A fair number of managers and tech leads do as well, often just to get a polite rant about trends and what is actually happening behind the screenshots. So let me be clear early on: this is not a tool comparison, not a benchmark, and not a sales pitch. It is a practical experiment, run on my own machine, against my own code, under constraints that resemble real work.

I have learned to be careful with waves of excitement like this. I do not really believe in tools until they work on my own machine, with my own code, under my own constraints. So instead of resharing or debating, I decided to test it properly. I took a fully vibe-coded application from last week and used it as a guinea pig for cleanup, refactoring, and structural improvements.

The short version is simple: local code assistants do work.
The long version is more nuanced, and much less LinkedIn-ready.

The Setup That Sounds Perfect on Paper

I started with the setup that sounds most attractive in theory: a local Ollama backend combined with OpenCode or GitHub Copilot, editing code directly in the IDE, no cloud dependency, full control. Ollama is the de facto standard for running local models, and Copilot is deeply integrated into VS Code. This makes the setup both realistic and easy to reproduce.

On paper, this is the dream. In practice, the first hours were rough.

The initial problems had very little to do with model intelligence and a lot to do with fundamentals.

Context Windows Are Not a Detail

Context windows were the first silent killer. Many models default to a 4k-token context, which is fine for chat but completely insufficient once an assistant starts reading files, planning changes, and applying edits. The symptoms were confusing at first. Agents announced that they were analyzing the codebase and then stopped. Tool calls failed in odd ways. Requests restarted or hung indefinitely.

Only after digging through Ollama logs did it become obvious that prompts were being aggressively truncated. Extending the context length to 32k was not an optimization. It was a prerequisite.
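For anyone reproducing this: in Ollama, the context length can be raised per model with a Modelfile. The base model below is just an example; substitute whatever you actually run.

```
# Modelfile: raise the context window from the default to 32k tokens
FROM qwen2.5-coder:14b
PARAMETER num_ctx 32768
```

Build it once with `ollama create qwen2.5-coder-32k -f Modelfile` and point your assistant at the new model name. The larger window costs memory, which is exactly why the hardware section below matters.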

This is one of those details that never shows up in screenshots, but it defines whether the system works at all.

Model Behavior Matters More Than Model Names

I tested several models along the way. Some were fast and responsive for chat, but unreliable in agent-style workflows. Tool calls broke, Copilot complained about missing choices, and the overall experience felt brittle. Others were much stronger when it came to actual code quality, refactoring suggestions, and understanding Java and Vaadin patterns, but they were also far more eager to behave like full agents.

That eagerness turned out to be a double-edged sword. Broad scanning, long planning phases, and overthinking tasks that should have been trivial were common failure modes.

At some point, it became obvious that a lot of the slowness people conveniently omit in their “productivity booster” posts is self-inflicted.

Vague Prompts Create Expensive Work

A vague prompt like “Let’s refactor this class” can easily explode into dozens of backend calls, minutes of processing, and in one case, hours of waiting until a request finally timed out. This was not running on a lightweight machine either. I ran these tests on a Ryzen 9 9900X, 128 GB of RAM, and a GeForce RTX 4080 Super. This is not a toy setup.
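To make "vague" concrete: the difference between an expensive request and a cheap one is mostly scoping. The file name here is hypothetical, but the pattern is not.

```text
Vague:  "Let's refactor this class."
Scoped: "In OrderListView.java only: extract the magic numbers into
         named constants and split the constructor into smaller setup
         methods. Do not read or modify any other files."
```

The scoped version removes the model's incentive to scan the repository, plan broadly, and overthink, which is where most of the wall-clock time went.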

Once GPU acceleration inside Docker was configured correctly, things improved dramatically. That single change cut request times from 15–30 minutes down to seconds or a few minutes. Switching from agent mode to edit mode for single-file refactors was another major breakthrough.
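For completeness, the Docker side of that fix is standard NVIDIA Container Toolkit wiring. A docker-compose sketch for an Ollama container might look like this; the image tag and port are the defaults, so adjust to taste.

```
# docker-compose.yml: expose the host GPU to the Ollama container.
# Requires the NVIDIA Container Toolkit installed on the host.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"          # default Ollama API port
    volumes:
      - ollama:/root/.ollama   # keep downloaded models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```

Without this block, the container silently falls back to CPU inference, which is exactly the failure mode that produced the 15–30 minute requests.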

Edit mode is much closer to what most developers actually want in daily work. Take this file, clean it up, apply changes, stop. No repo-wide archaeology.

When It Clicks, It Really Clicks

With the right setup and some discipline, the experience became genuinely impressive. I refactored a large Vaadin view, extracted magic numbers into constants, split complex constructors into setup methods, and improved readability without changing behavior. I then tackled four nearly identical dialogs, identified duplicated logic, extracted a shared base class, and reused it cleanly.
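The dialog refactor is worth sketching, because it is the kind of change these assistants handle well. The shape below is illustrative, not the actual project code: four near-identical dialogs collapse into one shared base class that owns the common save/validate flow, and each concrete dialog only contributes its own rules.

```java
import java.util.function.Consumer;

// Hypothetical sketch of the refactor: the shared base class owns the
// save flow that used to be duplicated across four dialogs.
abstract class AbstractEditDialog<T> {
    private final Consumer<T> onSave;

    protected AbstractEditDialog(Consumer<T> onSave) {
        this.onSave = onSave;
    }

    // Common flow: validate, then hand the item to the caller's callback.
    public void save(T item) {
        validate(item);
        onSave.accept(item);
    }

    // Each concrete dialog contributes only its own validation rules.
    protected abstract void validate(T item);
}

class CustomerDialog extends AbstractEditDialog<String> {
    CustomerDialog(Consumer<String> onSave) {
        super(onSave);
    }

    @Override
    protected void validate(String name) {
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("customer name must not be blank");
        }
    }
}
```

In the real view these would extend a Vaadin `Dialog` and bind forms, but the structural move is the same, and it is exactly the kind of mechanical, well-scoped change a local model can apply reliably.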

The assistant applied real changes, not just suggestions. At this point, it stopped feeling like a toy.

But it also did not turn into magic.

Even with everything tuned, some refactors were still slow. Not unusably slow, but slow enough that you would never confuse this with autocomplete. Some requests took seconds. Others took a minute or two. And honestly, if a refactoring takes three minutes, I am often faster with keyboard shortcuts and muscle memory. That matters in real engineering work.

A Note for the Apple Silicon Crowd

Before this turns into a hardware debate, let me be clear: I know Apple Silicon is popular, and I know many developers are very happy with it. For mobile work, battery life, and general development tasks, a MacBook Pro is a solid machine.

That said, I do not particularly like Apple hardware, and I am not pretending otherwise.

More importantly, this experiment stresses a very specific class of workloads: long-running local inference, large context windows, repeated agent calls, and Dockerized GPU workloads. This is exactly where a desktop-class system with a dedicated GPU and large amounts of RAM has structural advantages. A Ryzen 9 with 128 GB of RAM and an RTX 4080 Super is built for sustained throughput. It does not throttle under load, it does not share GPU memory with the rest of the system, and it does not optimize for portability.

Apple’s unified memory model is elegant, but context-heavy models consume memory aggressively. Once you move beyond small demos, those limits become visible quickly. That does not mean local assistants do not work on a MacBook. It simply means that if things feel slow or fragile on a high-end desktop setup, they will not magically improve on a laptop.

This is not about brand preference. It is about workload fit.

Local Versus Cloud, Without the Mythology

Out of curiosity, I ran the same project initialization task against a cloud model, Claude Haiku 4.5, and against a local Qwen3-Coder model. The contrast was striking. The cloud model produced a structured, confident project guide in a minute or two, while my local setup struggled for over an hour and came back with generic boilerplate.

That does not mean the cloud output was perfect or fully verified. But it highlighted an important reality: cloud models are still significantly better at summarization and large-scale reasoning under time pressure. Local models shine when you give them concrete code and a tightly defined task.

Many posts blur this distinction. They showcase local setups doing things that, in practice, either require very careful prompt engineering or simply work better in the cloud. Claiming massive productivity gains on a modern MacBook without mentioning context limits, GPU usage, or prompt discipline is, at best, incomplete.

The Actual Takeaway

My takeaway from this experiment is fairly simple, but it benefits from being explicit.

Local code assistants are real and useful. They are already capable of supporting serious editing and refactoring work. They are not gimmicks.

For well-scoped tasks, such as refactoring a single class, cleaning up a file, or extracting duplicated logic, they are fast enough on proper hardware and can save a decent amount of cloud cost. In those situations, running locally makes sense. The latency is acceptable, the feedback loop is tight, and you stay in control of your code and your data.

Once the scope grows, the trade-off shifts. If you spend more time carefully rephrasing a prompt to limit file access and constrain behavior than the task itself would take, it is often more efficient to run a larger cloud model and spend a few cents. Large-scope reasoning, summarization, and broad architectural guidance are still areas where cloud models simply perform better under time pressure.

Local assistants also do not magically solve problems. If a prompt is vague because you do not actually know what you want to change, the output will be vague as well. They do not replace understanding. They amplify whatever clarity or confusion you bring to the task.

And that brings me to the core point.

Engineering is still decision-making under constraints, not typing speed. Sometimes I am faster with keyboard shortcuts than a model is at reasoning. Sometimes the model helps me avoid tedious work. Knowing the difference is part of the job.

I will keep using local assistants, especially for cleanup, refactoring, and structural improvements. But I will remain skeptical when I see posts promising effortless productivity gains with a screenshot and a few emojis. The reality is more interesting than the hype, and far more educational if you actually run the experiment yourself.
