Using Ollama With Controller

Controller supports running models locally using Ollama. This provides privacy, offline access, and potentially lower costs, but requires more setup and a powerful computer.

Website: https://ollama.com/


Setting up Ollama

  1. Download and Install Ollama: Download the Ollama installer for your operating system from the Ollama website and follow the installation instructions. Then make sure Ollama is running:

    ollama serve
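
    To confirm the server is reachable, you can query its version endpoint (this assumes the default port, 11434):

    curl http://localhost:11434/api/version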
  2. Download a Model: Ollama supports many different models. You can find a list of available models on the Ollama website. Some recommended models for coding tasks include:

    • codellama:7b-code (good starting point, smaller)
    • codellama:13b-code (better quality, larger)
    • codellama:34b-code (even better quality, very large)
    • qwen2.5-coder:32b
    • mistral:7b-instruct (good general-purpose model)
    • deepseek-coder:6.7b-base (good for coding tasks)
    • llama3:8b-instruct-q5_1 (good for general tasks)

    To download a model, open your terminal and run:

    ollama pull <model_name>

    For example:

    ollama pull qwen2.5-coder:32b
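
    You can confirm the download completed with:

    ollama list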
  3. Configure the Model: Set the model's context window in Ollama and save the result as a new model.

    Default Context Behavior

    Controller automatically defers to the Modelfile's num_ctx setting by default. When you use a model with Ollama, Controller reads the model's configured context window and uses it automatically. You don't need to configure context size in Controller settings; it respects what's defined in your Ollama model.
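
    To see what a model currently has configured, you can inspect it with ollama show (qwen2.5-coder:32b here is just an example tag):

    ollama show qwen2.5-coder:32b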

    Option A: Interactive Configuration

    Load the model (we will use qwen2.5-coder:32b as an example):

    ollama run qwen2.5-coder:32b

    Change context size parameter:

    /set parameter num_ctx 32768

    Save the model with a new name:

    /save your_model_name
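
    You can then verify the saved parameters by printing the new model's Modelfile:

    ollama show your_model_name --modelfile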

    Option B: Using a Modelfile (Recommended)

    Create a Modelfile with your desired configuration:

    # Example Modelfile for reduced context
    FROM qwen2.5-coder:32b

    # Set context window to 32K tokens (reduced from default)
    PARAMETER num_ctx 32768

    # Optional: Adjust temperature for more consistent output
    PARAMETER temperature 0.7

    # Optional: Set repeat penalty
    PARAMETER repeat_penalty 1.1

    Then create your custom model:

    ollama create qwen-32k -f Modelfile
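
    As a quick check, confirm the parameter took effect:

    ollama show qwen-32k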

    Override Context Window

    If you need to override the model's default context window:

    • Permanently: Save a new model version with your desired num_ctx using either method above
    • Controller behavior: Controller automatically uses whatever num_ctx is configured in your Ollama model
    • Memory considerations: Reducing num_ctx helps prevent out-of-memory errors on limited hardware
  4. Configure Controller:

    • Open the Controller sidebar.
    • Click the settings gear icon.
    • Select "ollama" as the API Provider.
    • Enter the model tag or saved name from the previous step (e.g., your_model_name).
    • (Optional) Configure the base URL if you're running Ollama on a different machine. The default is http://localhost:11434.
    • (Optional) Enter an API Key if your Ollama server requires authentication.
    • (Advanced) Controller uses Ollama's native API by default for the "ollama" provider. An OpenAI-compatible /v1 handler also exists but isn't required for typical setups.
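
    If you want to sanity-check the endpoint Controller will use, you can hit Ollama's native chat API directly (this assumes the default address and a model named your_model_name):

    curl http://localhost:11434/api/chat -d '{"model": "your_model_name", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'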

Tips and Notes

  • Resource Requirements: Running large language models locally can be resource-intensive. Make sure your computer meets the minimum requirements for the model you choose.
  • Model Selection: Experiment with different models to find the one that best suits your needs.
  • Offline Use: Once you've downloaded a model, you can use Controller offline with that model.
  • Token Tracking: Controller tracks token usage for models run via Ollama, helping you monitor consumption.
  • Ollama Documentation: Refer to the Ollama documentation for more information on installing, configuring, and using Ollama.

Troubleshooting

Out of Memory (OOM) on First Request

Symptoms

  • First request from Controller fails with an out-of-memory error
  • GPU/CPU memory usage spikes when the model first loads
  • Works after you manually start the model in Ollama

Cause

If no model instance is running, Ollama spins one up on demand. During that cold start it may allocate a larger context window than expected. The larger context window increases memory usage and can exceed available VRAM or RAM. This is an Ollama startup behavior, not a Controller bug.

Fixes

  1. Preload the model

    ollama run <model-name>

    Keep it running, then issue the request from Controller.
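
    You can confirm the model is loaded, and see how much memory it occupies, with:

    ollama ps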

  2. Pin the context window (num_ctx)

    • Option A — interactive session, then save:
      # inside `ollama run <base-model>`
      /set parameter num_ctx 32768
      /save <your_model_name>
    • Option B — Modelfile (recommended for reproducibility):
      FROM <base-model>
      PARAMETER num_ctx 32768
      # Adjust based on your available memory:
      # 16384 for ~8GB VRAM
      # 32768 for ~16GB VRAM
      # 65536 for ~24GB+ VRAM
      Then create the model:
      ollama create <your_model_name> -f Modelfile
  3. Ensure the model's context window is pinned

    Save your Ollama model with an appropriate num_ctx (via /set + /save, or preferably a Modelfile). Controller automatically detects and uses the model's configured num_ctx; there is no manual context size setting in Controller for the Ollama provider.

  4. Use smaller variants

    If GPU memory is limited, use a smaller quant (e.g., q4 instead of q5) or a smaller parameter size (e.g., 7B/13B instead of 32B).
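
    For example, pulling a smaller variant of the same family (assuming your chosen family publishes one):

    ollama pull qwen2.5-coder:7b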

  5. Restart after an OOM

    List any loaded models and stop the affected one before retrying:

    ollama ps
    ollama stop <model-name>

Quick checklist

  • Model is running before Controller request
  • num_ctx pinned (Modelfile or /set + /save)
  • Model saved with appropriate num_ctx (Controller uses this automatically)
  • Model fits available VRAM/RAM
  • No leftover Ollama processes