Using Ollama With Controller

Controller supports running models locally using Ollama. This provides privacy, offline access, and potentially lower costs, but requires more setup and a powerful computer.

Website: https://ollama.com/


Setting up Ollama

  1. Download and Install Ollama: Download the Ollama installer for your operating system from the Ollama website and follow the installation instructions. Then make sure Ollama is running:

    ollama serve
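
    To confirm the server is reachable, you can query its version endpoint (this assumes the default port, 11434):

    curl http://localhost:11434/api/version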
  2. Download a Model: Ollama supports many different models. You can find a list of available models on the Ollama website. Some recommended models for coding tasks include:

    • codellama:7b-code (good starting point, smaller)
    • codellama:13b-code (better quality, larger)
    • codellama:34b-code (even better quality, very large)
    • qwen2.5-coder:32b
    • mistral:7b-instruct (good general-purpose model)
    • deepseek-coder:6.7b-base (good for coding tasks)
    • llama3:8b-instruct-q5_1 (good for general tasks)

    To download a model, open your terminal and run:

    ollama pull <model_name>

    For example:

    ollama pull qwen2.5-coder:32b
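
    You can confirm the download completed with:

    ollama list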
  3. Configure the Model: Set the model's context window in Ollama and save the result as a new model.

    Default Context Behavior

    Controller automatically defers to the Modelfile's num_ctx setting by default. When you use a model with Ollama, Controller reads the model's configured context window and uses it automatically. You don't need to configure context size in Controller settings; it respects what's defined in your Ollama model.
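
    To see what a model currently has configured, you can inspect it with ollama show (qwen2.5-coder:32b here is just an example tag):

    ollama show qwen2.5-coder:32b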

    Option A: Interactive Configuration

    Load the model (we will use qwen2.5-coder:32b as an example):

    ollama run qwen2.5-coder:32b

    Change context size parameter:

    /set parameter num_ctx 32768

    Save the model with a new name:

    /save your_model_name
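
    You can then verify the saved parameters by printing the new model's Modelfile:

    ollama show your_model_name --modelfile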

    Option B: Using a Modelfile (Recommended)

    Create a Modelfile with your desired configuration:

    # Example Modelfile for reduced context
    FROM qwen2.5-coder:32b

    # Set context window to 32K tokens (reduced from default)
    PARAMETER num_ctx 32768

    # Optional: Adjust temperature for more consistent output
    PARAMETER temperature 0.7

    # Optional: Set repeat penalty
    PARAMETER repeat_penalty 1.1

    Then create your custom model:

    ollama create qwen-32k -f Modelfile
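
    As a quick check, confirm the parameter took effect:

    ollama show qwen-32k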

    Override Context Window

    If you need to override the model's default context window:

    • Permanently: Save a new model version with your desired num_ctx using either method above
    • Controller behavior: Controller automatically uses whatever num_ctx is configured in your Ollama model
    • Memory considerations: Reducing num_ctx helps prevent out-of-memory errors on limited hardware
  4. Configure Controller:

    • Open the Controller sidebar.
    • Click the settings gear icon.
    • Select "ollama" as the API Provider.
    • Enter the model tag or saved name from the previous step (e.g., your_model_name).
    • (Optional) Configure the base URL if you're running Ollama on a different machine. The default is http://localhost:11434.
    • (Optional) Enter an API Key if your Ollama server requires authentication.
    • (Advanced) Controller uses Ollama's native API by default for the "ollama" provider. An OpenAI-compatible /v1 handler also exists but isn't required for typical setups.
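
    If you want to sanity-check the endpoint Controller will use, you can hit Ollama's native chat API directly (this assumes the default address and a model named your_model_name):

    curl http://localhost:11434/api/chat -d '{"model": "your_model_name", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'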

Tips and Notes

  • Resource Requirements: Running large language models locally can be resource-intensive. Make sure your computer meets the minimum requirements for the model you choose.
  • Model Selection: Experiment with different models to find the one that best suits your needs.
  • Offline Use: Once you've downloaded a model, you can use Controller offline with that model.
  • Token Tracking: Controller tracks token usage for models run via Ollama, helping you monitor consumption.
  • Ollama Documentation: Refer to the Ollama documentation for more information on installing, configuring, and using Ollama.

Troubleshooting

Out of Memory (OOM) on First Request

Symptoms

  • First request from Controller fails with an out-of-memory error
  • GPU/CPU memory usage spikes when the model first loads
  • Works after you manually start the model in Ollama

Cause

If no model instance is running, Ollama spins one up on demand. During that cold start it may allocate a larger context window than expected. The larger context window increases memory usage and can exceed available VRAM or RAM. This is an Ollama startup behavior, not a Controller bug.

Fixes

  1. Preload the model

    ollama run <model-name>

    Keep it running, then issue the request from Controller.
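
    You can confirm the model is loaded, and see how much memory it occupies, with:

    ollama ps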

  2. Pin the context window (num_ctx)

    • Option A — interactive session, then save:
      # inside `ollama run <base-model>`
      /set parameter num_ctx 32768
      /save <your_model_name>
    • Option B — Modelfile (recommended for reproducibility):
      FROM <base-model>
      PARAMETER num_ctx 32768
      # Adjust based on your available memory:
      # 16384 for ~8GB VRAM
      # 32768 for ~16GB VRAM
      # 65536 for ~24GB+ VRAM
      Then create the model:
      ollama create <your_model_name> -f Modelfile
  3. Ensure the model's context window is pinned

    Save your Ollama model with an appropriate num_ctx (via /set + /save, or preferably a Modelfile). Controller automatically detects and uses the model's configured num_ctx; there is no manual context size setting in Controller for the Ollama provider.

  4. Use smaller variants

    If GPU memory is limited, use a smaller quant (e.g., q4 instead of q5) or a smaller parameter size (e.g., 7B/13B instead of 32B).
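
    For example, pulling a smaller variant of the same family (assuming your chosen family publishes one):

    ollama pull qwen2.5-coder:7b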

  5. Restart after an OOM

    List any loaded models and stop the affected one before retrying:

    ollama ps
    ollama stop <model-name>

Quick checklist

  • Model is running before Controller request
  • num_ctx pinned (Modelfile or /set + /save)
  • Model saved with appropriate num_ctx (Controller uses this automatically)
  • Model fits available VRAM/RAM
  • No leftover Ollama processes