Private AI Coding: Run Secure, Local LLMs with OpenCode and Docker

In the era of AI-assisted development, a quiet anxiety is spreading among engineering teams. We love the productivity boost of tools like GitHub Copilot or Cursor, but we hate the nagging question: Where is my code going?

For industries with strict compliance requirements, or for any developer who values privacy, sending proprietary source code to a cloud-hosted LLM is a non-starter. The solution isn’t to abandon AI, but to bring it home.

By combining OpenCode (an open-source coding assistant) with Docker Model Runner (DMR), you can build a powerful, fully private AI coding environment that runs entirely on your local infrastructure. No data leaves your machine. No monthly API bills. Just pure, secure code generation.

This guide will walk you through the technical setup, from configuring the runner to implementing a local RAG (Retrieval-Augmented Generation) system for context-aware coding.

The Architecture: Privacy by Design

The setup relies on two core components:

  1. Docker Model Runner (DMR): A lightweight containerized service that hosts open-weight models (like Qwen or Mistral) and exposes an OpenAI-compatible API.
  2. OpenCode: The developer interface (CLI or IDE extension) that interacts with the model.

Unlike cloud-hosted assistants, OpenCode in this setup sends prompts and context only to your local DMR endpoint.

Step 1: Setting up Docker Model Runner

First, we need to spin up the inference engine. Docker’s new AI tools make this incredibly simple compared to managing raw Python environments.

1. Enable Model Runner. If you are using Docker Desktop, enable the Model Runner feature and ensure the TCP port is exposed so OpenCode can talk to it.

   docker desktop enable model-runner --tcp
   # The API will be available at http://localhost:12434/v1
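
A quick way to confirm the endpoint is reachable is to ask it for the list of models it serves. This is a hedged check that assumes the base URL above and the standard OpenAI-compatible /v1/models route:

   curl http://localhost:12434/v1/models
   # Returns a JSON list of locally available models (empty until you pull one)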

2. Pull a Coding-Optimized Model. General-purpose models like Llama 3 are okay, but for coding you want a model fine-tuned on large code corpora. We recommend Qwen2.5-Coder (3B or 7B parameters) for a balance of speed and accuracy on consumer hardware.

   # Pull the 3B parameter version (runs fast even on a MacBook Air)
   docker model pull hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

3. Verify the Model. Ensure the model is loaded and ready to serve requests.

   docker model ls
   # Output should list the qwen model
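
For an end-to-end check, you can send a minimal chat completion straight to the API. This is a hedged example that assumes the endpoint and model tag shown above:

   curl http://localhost:12434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m",
       "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
     }'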

Step 2: Configuring OpenCode for Local Inference

Now we tell OpenCode to ignore the cloud and talk to our local Docker container.

Create or edit your project configuration file opencode.json in the root of your repository. We will define a custom provider that points to our local endpoint.

   {
     "$schema": "https://opencode.ai/schema",
     "provider": {
       "dmr": {
         "npm": "@ai-sdk/openai-compatible",
         "name": "Docker Model Runner",
         "options": {
           "baseURL": "http://localhost:12434/v1"
         },
         "models": {
           "qwen-coder-local": {
             "name": "hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m"
           }
         }
       }
     }
   }

Now, when you run opencode, you can select qwen-coder-local. Any code you highlight or any query you type travels over localhost, never over the internet.
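
If you prefer not to pick the model every session, OpenCode also lets you set a default model in provider/model form. A minimal sketch, assuming the top-level model field from OpenCode's configuration schema, merged into the same opencode.json:

   {
     "model": "dmr/qwen-coder-local"
   }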

Step 3: Making Small Models Smarter with Local RAG

A 3B parameter model is “small” compared to GPT-4. It doesn’t know your private library’s documentation. To fix this without retraining, we use RAG (Retrieval Augmented Generation).

We can build a specialized “Expert Agent” using LangChain.js and Docker Agentic Compose. This agent will ingest your local documentation (e.g., a snippets.md file) into a vector store and use it to ground the LLM’s answers.

The Agent Configuration (docker-compose.yml): We define a service that has access to our local docs and the model runner.

   services:
     golang-expert:
       build:
         context: .
         dockerfile: Dockerfile
       environment:
         # Point to the local Docker Model Runner (host gateway)
         MODEL_RUNNER_BASE: http://host.docker.internal:12434/v1
         MODEL_RUNNER_LLM_CHAT: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m
         MODEL_RUNNER_LLM_EMB: ai/embeddinggemma:latest
         CONTENT_PATH: /app/data
       volumes:
         - ./data:/app/data # Mount your local docs here
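
The compose file references a Dockerfile that is not shown here. A minimal sketch, assuming a Node.js agent whose entry point is a hypothetical agent.js:

   FROM node:22-slim
   WORKDIR /app
   COPY package*.json ./
   RUN npm install
   COPY . .
   # agent.js is a placeholder for your agent's entry point
   CMD ["node", "agent.js"]

You can then start the agent next to your local model runner with docker compose up --build.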

The RAG Logic (Simplified Node.js): Inside the agent, we split your documentation into chunks, create embeddings using a small local model (like embeddinggemma), and store them in memory.

   // Imports (ES module): Node built-ins plus the LangChain.js packages
   import fs from "node:fs";
   import path from "node:path";
   import { MemoryVectorStore } from "langchain/vectorstores/memory";
   import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";

   // Both clients point at the local Docker Model Runner; the API key is required
   // by the client but ignored by the local endpoint
   const embeddings = new OpenAIEmbeddings({
     apiKey: "not-needed",
     model: process.env.MODEL_RUNNER_LLM_EMB,
     configuration: { baseURL: process.env.MODEL_RUNNER_BASE }
   });
   const chatModel = new ChatOpenAI({
     apiKey: "not-needed",
     model: process.env.MODEL_RUNNER_LLM_CHAT,
     configuration: { baseURL: process.env.MODEL_RUNNER_BASE }
   });

   // 1. Create embeddings for your docs and keep them in an in-memory vector store
   const store = new MemoryVectorStore(embeddings);

   // Load your local docs (e.g. snippets.md mounted into CONTENT_PATH by compose)
   const content = fs.readFileSync(path.join(process.env.CONTENT_PATH, "snippets.md"), "utf8");
   const chunks = splitMarkdownBySections(content); // Custom helper returning Document objects

   for (const chunk of chunks) {
     await store.addDocuments([chunk]);
   }

   // 2. Query loop: when the user asks a question, find relevant code snippets
   const userQuery = "How do I implement a stream agent in Nova?";
   const relevantDocs = await store.similaritySearch(userQuery, 2);

   // 3. Construct the prompt, grounding the model in the retrieved snippets
   const prompt = `
   System: You are an expert coding assistant. Use these snippets:
   ${relevantDocs.map((d) => d.pageContent).join("\n")}

   User: ${userQuery}
   `;

   // 4. Generate the answer with the local chat model
   const response = await chatModel.invoke(prompt);
   console.log(response.content);

Why This Matters

By implementing this architecture, you achieve three critical security goals:

  1. Data Sovereignty: Your proprietary code and internal documentation never leave your controlled environment.
  2. Supply Chain Security: You are using signed, hardened container images for your models, reducing the risk of pulling compromised model weights from public hubs.
  3. Cost Predictability: You can run millions of inference tokens for free (hardware costs only), allowing you to deploy agents for every developer without budget fears.

This is the future of secure development: AI that lives where your code lives.

To further enhance your cloud security and implement Zero Trust, contact me via my LinkedIn profile or at [email protected].

Frequently Asked Questions (FAQ)

What hardware do I need to run these models?

For small coding models (3B-7B parameters), a modern laptop is sufficient. A MacBook Pro with an M-series chip (M1/M2/M3) works exceptionally well due to unified memory. You do not need a dedicated NVIDIA GPU cluster for personal use.

Can I use this with VS Code?

Yes. By pointing the OpenCode configuration at your local Docker Model Runner, you can use OpenCode from the VS Code integrated terminal or as a standalone CLI alongside your IDE.

Is the local model as good as GPT-4?

No, a 3B parameter model is not as "smart" as a trillion-parameter cloud model. However, for specific coding tasks (autocompletion, explaining snippets, generating boilerplate), it is surprisingly effective, especially when augmented with RAG (local context); a small model grounded in your documentation often outperforms a generic "smarter" model that lacks knowledge of your specific codebase.

How do I update the model?

Updating is as simple as running a `docker model pull` command with the new model tag. Docker Model Runner handles the versioning and local caching for you.
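
For example, to move to the 7B variant mentioned earlier (assuming the corresponding GGUF tag is published on Hugging Face):

   docker model pull hf.co/qwen/qwen2.5-coder-7b-instruct-gguf:q4_k_m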

Is this compliant with SOC2/HIPAA?

Running AI locally can simplify compliance because data never crosses a third-party trust boundary. However, you still need to ensure your local environment (workstation security, access controls) meets your compliance standards.

William OGOU

Need help implementing Zero Trust strategy or securing your cloud infrastructure? I help organizations build resilient, compliance-ready security architectures.