Trimming context vs understanding context

We had a debate on the team about context management. Worth writing down because every team building with LLMs hits this wall.

The problem: Context windows fill up. What do you cut?

Approach A (simple truncation): Trim tool outputs from older turns. Keep all user/assistant messages. Fast, cheap, predictable.
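A minimal sketch of Approach A, assuming OpenAI-style message dicts with `"role"` and `"content"` keys (the message shape and the `keep_last_n` parameter are illustrative, not our exact implementation):

```python
def truncate_tool_outputs(messages, keep_last_n=2):
    """Drop tool messages except the most recent `keep_last_n`.

    All user/assistant messages are kept untouched.
    """
    # Indices of every tool message, oldest first.
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    keep = set(tool_indices[-keep_last_n:]) if keep_last_n > 0 else set()
    return [
        m for i, m in enumerate(messages)
        if m["role"] != "tool" or i in keep
    ]
```

No model call, no judgment: whatever is oldest goes first, relevant or not.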

Approach B (semantic extraction): Run a small model (4o-mini) on tool outputs to extract only the sentences that actually answer the query. Preserve the original wording and structure (no paraphrasing), then trim as usual.
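A sketch of Approach B. Here `call_model` is a placeholder for whatever small-model client you use (e.g. a 4o-mini wrapper), and the prompt wording is an assumption, not our exact prompt:

```python
# Illustrative extraction prompt; the phrasing is an assumption.
EXTRACT_PROMPT = (
    "Given the user's question and a tool output, return only the "
    "sentences from the tool output that help answer the question. "
    "Copy sentences verbatim; do not paraphrase or summarize.\n\n"
    "Question: {question}\n\nTool output:\n{output}"
)

def extract_relevant(question, tool_output, call_model):
    """Ask a small model to keep only the query-relevant sentences.

    `call_model` takes a prompt string and returns the model's text.
    """
    prompt = EXTRACT_PROMPT.format(question=question, output=tool_output)
    return call_model(prompt)
```

Keeping sentences verbatim (rather than summarizing) matters: a paraphrase from the small model can silently drop or distort details the main LLM needs.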

Trade-off: extra model call = latency + cost. Benefit: main LLM sees much smaller, higher-signal context.


The insight that stuck with me:

Truncation asks: “What’s oldest?” Semantic extraction asks: “What’s relevant?”

For voice agents especially, context fills fast. Being selective about what you keep matters more than how much you keep.

We landed on semantic extraction for RAG outputs specifically: they're long, and most of the text is irrelevant to the actual question. Short outputs skip the extra call entirely.
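One way to wire up that last rule, as a sketch: only pay for the extraction call when the output is long enough to be worth compressing. The 1,000-character threshold is an illustrative number, not our actual cutoff, and `call_model` is again a placeholder for your extraction step.

```python
def compress_tool_output(question, tool_output, call_model, min_chars=1000):
    """Run semantic extraction only on long tool outputs.

    Short outputs are returned as-is, skipping the extra
    latency and cost of the small-model call.
    """
    if len(tool_output) < min_chars:
        return tool_output
    return call_model(question, tool_output)
```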