
Ollama (Local LLM Runner)

Ollama runs open-source LLMs locally inside your environment — Llama 3, Mistral, Phi-3, CodeLlama — with no API key required and no data leaving the sandbox.

Aithroyz deploys Ollama on a dedicated VM and exposes it at an HTTPS subdomain. Open WebUI and Flowise are pre-wired to the Ollama internal IP when deployed in the same plan, so models pulled into Ollama appear in both UIs automatically.

Access

API URL: https://ollama.<env-name>.ops.aithroyz.com
OpenAI-compatible: https://ollama.<env-name>.ops.aithroyz.com/v1
Auth: No API key required. Access is restricted to users authenticated via Google SSO at the gateway.
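
To confirm the server is reachable before pulling anything, hit the version endpoint (a quick sanity check; /api/version is part of the standard Ollama API, and <env-name> is a placeholder for your environment name):

# Check that Ollama is up and report its version
curl https://ollama.<env-name>.ops.aithroyz.com/api/version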

Pulling a model

Ollama starts with no models installed. Pull a model via the REST API or from Open WebUI's admin panel:

# Pull a model via the Ollama API
curl -X POST https://ollama.<env-name>.ops.aithroyz.com/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3:8b"}'

# Pull a smaller coding model
curl -X POST https://ollama.<env-name>.ops.aithroyz.com/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "codellama:7b-instruct"}'

# List installed models
curl https://ollama.<env-name>.ops.aithroyz.com/api/tags

ℹ Model downloads can take several minutes depending on size. A 7B model is roughly 4–5 GB. The pull endpoint streams progress as newline-delimited JSON until the download completes.
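
To watch that progress from a terminal, you can stream the status lines as they arrive. A minimal sketch, assuming curl and jq are installed on your workstation:

# Stream pull progress; each NDJSON line carries a status and, while
# downloading, completed/total byte counts
curl -sN -X POST https://ollama.<env-name>.ops.aithroyz.com/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3:8b"}' \
  | jq -r '.status + (if .completed and .total then " (" + ((.completed / .total * 100) | floor | tostring) + "%)" else "" end)'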

Chatting via the API

Ollama exposes an OpenAI-compatible endpoint at /v1, so any OpenAI SDK or tool works without modification:

# Chat completions — OpenAI-compatible format
curl -X POST https://ollama.<env-name>.ops.aithroyz.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the OSI model in two sentences."}
    ]
  }'

# Streaming response
curl -X POST https://ollama.<env-name>.ops.aithroyz.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3:8b", "messages": [...], "stream": true}'

Model sizing guide

Choose your environment VM size based on the models you want to run. Larger models produce higher quality outputs but require more RAM and are slower on CPU-only instances:

e2-standard-2 (8 GB RAM)
7B parameter models — Llama 3 8B, Mistral 7B, Phi-3 Mini. Good quality, ~3–8 tokens/sec on CPU.
e2-standard-4 (16 GB RAM)
13B parameter models — Llama 2 13B, CodeLlama 13B. Better reasoning, ~2–4 tokens/sec on CPU.
e2-standard-8 (32 GB RAM)
30B+ parameter models — Llama 3 70B (Q4 quantized), Mixtral 8x7B. Near-GPT-3.5 quality; CPU inference only.

ℹ Aithroyz sandbox environments run on CPU-only GCE instances; GPU instances are not currently supported. For latency-sensitive workloads that need faster inference, route through the LLM Gateway to hosted models (Claude, GPT-4) instead.
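
To check how much memory a loaded model actually occupies, list the running models (recent Ollama releases expose this at /api/ps; treat the exact response fields as version-dependent):

# List loaded models and their in-memory size
curl https://ollama.<env-name>.ops.aithroyz.com/api/ps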

Tips

Model storage
Downloaded models are stored at /root/.ollama/models on the Ollama VM and persist across restarts. Destroying and re-provisioning the environment clears them.
Use :instruct variants
For chat use cases, prefer instruct-tuned variants (e.g. llama3:8b-instruct, mistral:7b-instruct) over base models. Base models are for text completion only.
Embeddings
Ollama also serves embeddings via /api/embeddings or /v1/embeddings. Use nomic-embed-text or mxbai-embed-large for document indexing in Qdrant or Flowise (see the example after these tips).
Open WebUI integration
If Open WebUI is in the same plan, models pulled into Ollama appear in the Open WebUI model selector within 30 seconds — no configuration needed.
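
A minimal embeddings request against the native endpoint, assuming you have already pulled nomic-embed-text (the response contains a single "embedding" vector):

# Generate an embedding with the native API
curl -X POST https://ollama.<env-name>.ops.aithroyz.com/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nomic-embed-text", "prompt": "What is lateral movement in an attack chain?"}'
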
Related Articles

Open WebUI (AI Chat)
Flowise (LangChain Visual Builder)
LLM Gateway