What Ollama Logs Teach You About Running LLMs Locally
Running a large language model locally starts simple enough: pull a model, send a request, get a response. Performance tuning is where it gets interesting.
When models run behind a cloud API, there is not much to do when something goes wrong except wait or click retry. Running models locally changes that entirely. Ollama exposes rich server logs that describe exactly what the inference engine is doing on every request. Even without prior knowledge of how LLM inference works, you can paste those logs into the model and ask questions. The answers build a working understanding of what is actually happening under the hood, one slow request at a time.
That feedback loop, running a slow request, capturing the Ollama server output, and pasting it back as a new prompt, turned a single evening of troubleshooting into a detailed education in transformer inference internals. What follows is what that conversation actually taught.
Reading What the Logs Say
Ollama writes server logs by default; no special flags are required. On macOS, tail them with tail -f ~/.ollama/logs/server.log, or watch the output if you run ollama serve directly in a terminal. What you see depends heavily on which backend is running the model.
Both GGUF and MLX models produce a mix of structured time=... level=INFO lines from Ollama's Go scheduler alongside raw output from the inference runner itself. The content differs significantly between backends.
GGUF models run through llama.cpp, which emits slot and srv prefixed lines directly to the log. On the first request with an empty cache you see slot selection by LRU and the sampler chain, followed by prompt processing progress and tokens-per-second throughput:
On a subsequent request where the prompt prefix is stable, the slot selection switches from LRU to LCP similarity scoring:
MLX models emit structured Go log lines from both Ollama's scheduler and the MLX runner subprocess, with a different progress format:
The MLX logs tell you whether the prefix cache was hit and how fast the prompt is being processed, but they do not expose the sampler chain, the LCP similarity score, or the per-checkpoint validation that makes GGUF logs so useful for diagnosing cache invalidation. If you want those diagnostics, you need a GGUF model.
For GGUF models, the sim_best value is the Longest Common Prefix similarity score between the current prompt and the cached prompt state. A score near 1.0 means the engine reused the KV cache for almost the entire prompt, processing only the new tokens appended since the last request. A score near 0.4 means the prompt diverged from the cached state after roughly 40% of its tokens, so the engine has to reprocess at least the remaining 60%, and the whole prompt if no usable checkpoint exists before the divergence point.
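As a rough illustration of what that score measures, the sketch below computes a longest-common-prefix ratio over two token lists. It assumes the score is the shared prefix length divided by the length of the new prompt; the function names and the integer token IDs are made up for the example, and the real engine works on its own tokenization.

```python
def common_prefix_len(cached_tokens, prompt_tokens):
    """Count the leading tokens that are identical in both sequences."""
    shared = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        shared += 1
    return shared


def lcp_similarity(cached_tokens, prompt_tokens):
    """Shared-prefix length as a fraction of the new prompt (assumed definition)."""
    if not prompt_tokens:
        return 0.0
    return common_prefix_len(cached_tokens, prompt_tokens) / len(prompt_tokens)


# Hypothetical token IDs: the new prompt appends two tokens to the cached one.
cached = list(range(100))
appended = cached + [500, 501]
# Same length, but one token changed at position 40.
edited = cached[:40] + [999] + cached[41:]

print(lcp_similarity(cached, appended))  # high: only the tail is new
print(lcp_similarity(cached, edited))    # low: everything from position 40 on is stale
```

A single changed token in the middle of the prompt is enough to drop the score, because only the tokens before the first difference count.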
In one troubleshooting session, the score dropped from 0.995 to 0.444 between requests, triggering a log line that was not subtle:
That reprocessing cost 28 seconds for a 40,000-token prompt. Pasting the log output back into the model produced an immediate, accurate diagnosis: the KV cache checkpoints had been invalidated by Sliding Window Attention. The model read its own performance telemetry and explained why it had been slow.
This works because these logs are structured text describing real system state. The model can reason about cache similarity scores, checkpoint positions, and window boundaries just as well as it reasons about code.
KV Cache, Prefix Caching, and Where Time Goes
The KV cache stores the attention key and value tensors for every token that has already been processed. When the next request begins with the same prefix, the engine skips recomputing those tokens and picks up from the cached state. For long conversations, or workflows where a large system prompt precedes every user message, this can reduce prompt processing time from tens of seconds to near zero.
Two things destroy prefix cache reuse. The first is prompt instability: changing any token invalidates the cached state for every token after it. A GitHub Copilot session with a long list of MCP tool definitions injects dynamic workspace context alongside those static definitions, and if that dynamic content shifts between turns, the divergence point moves earlier and a larger portion of the cache is invalidated. The second cause is architectural: Sliding Window Attention, covered in the section on model architecture below.
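The effect of prompt instability can be sketched with a toy example. Everything here is illustrative: whitespace splitting stands in for real tokenization, and the tool definitions and workspace strings are invented placeholders. The point is only that the same dynamic content costs far more cache when it sits ahead of the static content.

```python
def reusable_prefix(previous, current):
    """Number of leading tokens two prompts share (whitespace tokens as a stand-in)."""
    shared = 0
    for a, b in zip(previous.split(), current.split()):
        if a != b:
            break
        shared += 1
    return shared


# Placeholder content, not real Copilot or MCP payloads.
static_tools = " ".join(f"tool_{i} description" for i in range(50))  # 100 tokens
turn_1_context = "open_file main.py cursor 10"
turn_2_context = "open_file utils.py cursor 42"

# Dynamic context first: the prompts diverge almost immediately.
dynamic_first = reusable_prefix(
    f"{turn_1_context} {static_tools}", f"{turn_2_context} {static_tools}"
)
# Static definitions first: the whole tool list stays cacheable.
static_first = reusable_prefix(
    f"{static_tools} {turn_1_context}", f"{static_tools} {turn_2_context}"
)

print(dynamic_first, static_first)
```

With the dynamic context in front, only one token is reusable; with the static definitions in front, all 100 tool-definition tokens plus the first context token survive between turns.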
Flash Attention addresses a related but distinct problem. Standard attention materializes an N x N score matrix in memory, where N is sequence length. At 32,000 tokens, that matrix holds about a billion entries per attention head: roughly 2 GB at 16-bit precision, or 4 GB at 32-bit. Flash Attention 2 (Dao, 2023) eliminates the full matrix by processing attention in small tiles that fit in the processor's fast on-chip memory, reducing peak memory from O(N²) to O(N) while producing identical results. On Apple Silicon, where the CPU and GPU share unified memory over a high-bandwidth on-die fabric, enabling Flash Attention reduces contention on that shared bus during long-context requests. Enable it by setting OLLAMA_FLASH_ATTENTION=1 before starting Ollama.
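The reason tiling can produce identical results is a streaming form of softmax that keeps a running maximum and a running sum instead of the full row of scores. The sketch below shows that idea for a single query row with scalar values, in plain Python. It is not Flash Attention itself (there is no GPU kernel, no batching, no multiple heads), and the function names and tile size are chosen for the example.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: materialize every softmax weight for one query row."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_softmax_weighted_sum(scores, values, tile=4):
    """Streaming version: only one tile of scores is live at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    running_out = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale what was accumulated under the old maximum.
        # On the first tile this is exp(-inf), which is 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, -1.0, 3.0, 2.0, 0.0, 1.5, -2.5, 4.0, 0.25]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(softmax_weighted_sum(scores, values))
print(tiled_softmax_weighted_sum(scores, values, tile=4))
```

Both calls agree to within floating-point rounding, while the tiled version never holds more than one tile of scores plus three running scalars.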
KV cache quantization, available once Flash Attention is enabled, compresses the cached tensors from f16 to q8_0 or q4_0, roughly halving or quartering cache memory at minimal quality cost. On a 128 GB machine running a 70B model at Q8, keeping the KV cache at f16 and the context at 32,000 tokens uses around 90 GB total. Quantizing the KV cache frees room for larger contexts or multiple loaded models without touching model weight precision.
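To get a feel for the numbers, here is a back-of-the-envelope estimator for KV cache size. The model shape is an assumption, not something taken from the logs: 80 layers, 8 KV heads, and a head dimension of 128, which matches a Llama-3-style 70B model with grouped-query attention. The per-block byte counts reflect the GGML layouts, where q8_0 and q4_0 store a 2-byte scale for every 32 elements.

```python
# Bytes used per block of 32 cached elements in each format.
# f16: 32 * 2 bytes. q8_0: 32 * 1 byte + 2-byte scale. q4_0: 32 * 0.5 byte + 2-byte scale.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, cache_type="f16"):
    """Bytes needed to cache keys and values for `tokens` positions."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys and values
    return elements * BLOCK_BYTES[cache_type] // BLOCK_SIZE


# Assumed model shape (see the note above); 32,000 tokens of context.
for cache_type in BLOCK_BYTES:
    size = kv_cache_bytes(32_000, layers=80, kv_heads=8, head_dim=128,
                          cache_type=cache_type)
    print(f"{cache_type}: {size / 1e9:.2f} GB")
```

Under those assumptions the cache comes to about 10.5 GB at f16, 5.6 GB at q8_0, and 2.9 GB at q4_0, which is where the "roughly halving or quartering" figure comes from.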
Model Architecture and the SWA Trade-Off
Once the logs implicated Sliding Window Attention, the natural next question was which models avoid the problem entirely, and that question opened up a tour of modern attention mechanisms.
SWA restricts each token's attention to a fixed window of recent tokens rather than the full context history. Qwen3 uses a hybrid design where some layers apply full attention and others apply SWA. The efficiency argument is real: at 40,000 tokens, full attention requires a 40K x 40K matrix per layer, while an SWA layer with a 4,096-token window requires a 40K x 4K matrix, roughly ten times smaller. Models achieve long context windows partly through this trade-off.
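The window rule and the size comparison can be sketched in a few lines. The helper names are invented for the example, and the entry count follows the simplified rows-times-columns picture used in the paragraph above rather than any particular kernel's memory layout.

```python
def attended_positions(i, window=None):
    """Positions a causal token at index i may attend to.

    window=None means full attention; otherwise only the last `window`
    positions (including i itself) are visible.
    """
    start = 0 if window is None else max(0, i - window + 1)
    return range(start, i + 1)


def score_entries(n_tokens, window=None):
    """Score-matrix entries per layer, counted as rows x columns."""
    columns = n_tokens if window is None else min(window, n_tokens)
    return n_tokens * columns


print(list(attended_positions(9)))            # full attention: positions 0..9
print(list(attended_positions(9, window=4)))  # SWA: positions 6..9 only

full = score_entries(40_000)
swa = score_entries(40_000, window=4_096)
print(f"full / SWA = {full / swa:.2f}")
```

The ratio works out to 40,000 / 4,096, or about 9.77, which is the "roughly ten times smaller" figure.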
The cache invalidation consequence is that checkpoints saved at positions outside the current SWA window cannot be reused. In the troubleshooting session, six cached checkpoints spanning positions 35,617 to 40,542 were checked against a divergence point at position 18,014, and all six failed. The SWA window had moved past the cached positions, making them structurally invalid regardless of content similarity. Even if the checkpoint content matched perfectly, the architecture could not use it.
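The numbers from that session can be checked with a small sketch. The usability rule below is a deliberate simplification and an assumption, not llama.cpp's actual logic: it only says that state saved after the divergence point already contains tokens the new prompt no longer has. The similarity estimate additionally assumes the prompt length was close to the latest checkpoint position.

```python
DIVERGENCE = 18_014           # first position where the new prompt differed
EARLIEST_CHECKPOINT = 35_617  # earliest of the six cached checkpoints
LATEST_CHECKPOINT = 40_542    # latest of the six cached checkpoints


def checkpoint_usable(checkpoint_pos, divergence_pos):
    """Simplified rule (assumption): only state saved at or before the
    divergence point can be reused for the new prompt."""
    return checkpoint_pos <= divergence_pos


# If the earliest checkpoint is unusable, every later one is too.
print(checkpoint_usable(EARLIEST_CHECKPOINT, DIVERGENCE))

# Cross-check: shared prefix over approximate prompt length.
estimated_similarity = DIVERGENCE / LATEST_CHECKPOINT
print(round(estimated_similarity, 3))
```

The estimate lands at 0.444, in line with the similarity score the logs reported for that request.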
Gemma 4 uses alternating local and global attention layers with SWA windows of 512 to 1,024 tokens. Smaller windows mean cheaper local layers but earlier checkpoint invalidation; global layers provide full-context anchoring. The long-context cache behavior differs from Qwen3 but sits in the same category.
Models with full attention throughout, Llama 3.3 70B and DeepSeek R1 distilled onto the Llama architecture among them, produce predictable checkpoint reuse at any context length. The similarity score in the logs stays near 0.995 as long as the prompt prefix is stable. The trade-off is that full attention at long contexts costs more compute during prompt processing, which is exactly why SWA exists. DeepSeek's full-scale V3 and R1 models use Multi-head Latent Attention (MLA) instead, compressing the KV cache by projecting keys and values into a low-dimensional latent space rather than windowing them, but those 671B models need far more memory than a 128 GB machine provides.
The practical takeaway for long-context workflows is to check ollama show --modelinfo for a sliding_window field before committing to a model. Its presence, and its value, tells you what cache invalidation behavior to expect at scale.
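The same check can be scripted against a running server. This sketch assumes Ollama is listening on its default port 11434 and that the /api/show response carries a model_info object whose keys are prefixed by architecture, so it matches on any key ending in sliding_window rather than a fixed name. The function names are invented for the example.

```python
import json
import sys
import urllib.request


def find_sliding_window(model_info):
    """Return every metadata entry whose key ends in 'sliding_window'."""
    return {k: v for k, v in model_info.items() if k.endswith("sliding_window")}


def fetch_model_info(model, host="http://localhost:11434"):
    """POST to /api/show and return the model_info object (empty if absent)."""
    request = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("model_info", {})


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_swa.py <model-name>
    found = find_sliding_window(fetch_model_info(sys.argv[1]))
    print(found or "no sliding_window key found")
```

An empty result suggests full attention throughout, though it is worth confirming against the model card, since metadata conventions vary between architectures.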
Sampler Settings and Repetition Loops
One Qwen3 session produced infinite output repetition. The logs showed the cause before any guessing was required:
A repeat penalty of 1.0 is the multiplicative identity: it changes nothing. Temperature at 0.1 collapses the token probability distribution so the model almost deterministically picks the highest-probability token. Once the model begins repeating a sequence, continuing that sequence becomes the highest-probability choice at every step, the low temperature locks the choice in, and the narrow top_k pool offers no escape.
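How sharply a low temperature concentrates probability is easy to see numerically. The sketch below applies temperature scaling before a softmax; the three logits are hypothetical values picked for the illustration, not figures from the session.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize with a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 1.0]  # hypothetical: three closely ranked candidate tokens
for temperature in (1.0, 0.3, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At temperature 1.0 the leading token gets only about half the probability mass; at 0.1 it takes more than 99%, so a token that is slightly ahead wins almost every time.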
The fix is a Modelfile with corrected sampler parameters:
Save that as qwen3.6-optimized, then build and run it:
Raising repeat_penalty to 1.15 lowers the logit of every token that appeared in the recent window: in llama.cpp's sampler, positive logits are divided by 1.15 and negative ones are multiplied by it. The penalty is applied once per token rather than once per occurrence, but that is enough for the tokens that make up a loop to lose ground to close alternatives. Widening top_k to 40 and adding top_p 0.9 as nucleus sampling give the model more viable exits. Temperature at 0.3 is still low enough for coherent output but high enough to allow some stochasticity. The repeat_last_n 128 window catches longer patterns than the default 64.
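A minimal sketch of the repeat penalty shows why 1.0 does nothing and 1.15 can break a tie. It follows the divide-if-positive, multiply-if-negative rule used by llama.cpp-style samplers, but it is a simplification: the function name, the four-token vocabulary, and the logits are invented for the example.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.15, last_n=128):
    """Penalize every token ID seen in the last `last_n` generated tokens.

    Positive logits are divided by the penalty and non-positive ones are
    multiplied by it, so a penalized token always becomes less likely.
    Each token is penalized once, however often it occurred.
    """
    window = set(recent_tokens[-last_n:]) if last_n > 0 else set()
    penalized = list(logits)
    for token_id in window:
        if penalized[token_id] > 0:
            penalized[token_id] /= penalty
        else:
            penalized[token_id] *= penalty
    return penalized


logits = [3.0, 2.8, -1.0, 0.5]  # hypothetical logits for a 4-token vocabulary
history = [0, 0, 0, 2]          # token 0 has been repeating

print(apply_repeat_penalty(logits, history, penalty=1.0))   # unchanged
print(apply_repeat_penalty(logits, history, penalty=1.15))  # token 1 now leads
```

With the penalty at 1.15, the repeated token's logit drops from 3.0 to about 2.61, just below the 2.8 of its nearest competitor.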
What We Actually Changed
Two changes covered most of the ground. First, enable Flash Attention. If you start Ollama via ollama serve, export the variable in the same shell first: run export OLLAMA_FLASH_ATTENTION=1, then ollama serve.
If you run Ollama as a macOS app, use launchctl instead so the variable is set before the app launches: launchctl setenv OLLAMA_FLASH_ATTENTION 1.
Then restart the Ollama app.
Second, the Modelfile above replaces the default sampler configuration, sets a large context window, and pins all layers to the GPU. Together these address the three main issues: memory bus pressure at long contexts, output repetition loops, and the model being unloaded between requests.
Further Reading
- Flash Attention 2, Tri Dao (July 2023): the memory complexity derivation and benchmark results
- llama.cpp KV cache reuse tutorial: prefix caching mechanics and session persistence
- KV cache quantization in Ollama (smcleod.net): the integration history and configuration details
- Qwen3 Technical Report, Qwen Team (May 2025): the hybrid attention architecture specification
- Production-Grade Local LLM Inference on Apple Silicon (November 2025): comparative benchmarks across MLX, llama.cpp, and MPS

