That message pinged in our engineering Slack channel, with others chiming in that they were also hitting their usage limits, and it perfectly captured my weekend. I had been leaning hard on Claude Code for my daily development - from the Mixpanel MCP to the Figma MCP to debugging on-call production bugs - but instead of speeding me up, I was suddenly hitting a wall. Rate limits were throttling my flow, and it wasn't just a temporary API glitch - my token burn was completely out of control.
Here is the war story of how a quick Slack thread triggered a deep dive into my prompt payloads, and how I finally got my Claude Code usage under control.
The Diagnosis: Flying Blind
When you're moving fast with AI coding tools, it's easy to ignore the context window until it breaks. I realized Claude Code was likely stuffing massive amounts of redundant context into my prompts, but to fix the leak, I needed to know exactly where the pipe was bursting.
I didn't have the granular, real-time visibility I needed to see exactly what was being sent with each query.
Gaining Visibility: The Token Tracker
You can't fix what you can't measure. I spun up a quick Python script to track, log, and analyze my token payloads locally.
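The script itself is simple. Here is a minimal sketch of the approach - note that the JSONL location and the usage field names are assumptions about Claude Code's local session logs, so verify them against what you actually find under ~/.claude/projects/:

```python
import json
from collections import defaultdict
from pathlib import Path

# Assumed field names for Claude Code's local JSONL session logs --
# verify against the actual files under ~/.claude/projects/.
USAGE_KEYS = ("input_tokens", "cache_read_input_tokens",
              "cache_creation_input_tokens", "output_tokens")

def summarize_usage(root: Path) -> dict:
    """Sum token usage per session log found under `root`."""
    totals = defaultdict(int)
    for log in root.rglob("*.jsonl"):
        for line in log.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip partial or corrupt lines
            msg = record.get("message")
            usage = msg.get("usage", {}) if isinstance(msg, dict) else {}
            totals[log.name] += sum(usage.get(k, 0) for k in USAGE_KEYS)
    return dict(totals)

if __name__ == "__main__":
    root = Path.home() / ".claude" / "projects"
    if root.exists():
        ranked = sorted(summarize_usage(root).items(), key=lambda kv: -kv[1])
        for session, tokens in ranked[:3]:  # top 3 costliest sessions
            print(f"{session}: {tokens:,} tokens")
```

Summing cache-read and cache-creation tokens alongside plain input tokens matters here: cached context is cheaper per token, but it still counts against rate limits, and it's exactly where the bloat hides.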
Running this gave me the exact breakdown I needed, and the findings were jarring. I didn't just have one problem; I had two distinct token drains.
First, conversation cache compounding. As a session grew, the token payloads snowballed. My logs revealed a very specific session shape: the crossover point between maintaining useful context and wasting tokens lands around turns 6-8:
| Phase | Per-fix cost (tokens) |
|---|---|
| Fix 1–5 (early session) | ~280K |
| Fix 6–7 (same session) | ~735K |
A single code fix would start at a reasonable 280K tokens but rapidly bloat to 735K tokens per prompt as the conversation history dragged along. The actionable fix here requires strict session discipline: cap your tasks at 4-5 fixes, then either run /compact or start a completely fresh session.
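The mechanics behind that bloat are worth making explicit. A toy model (the growth constant is illustrative, chosen to roughly match my measurements, not something I extracted from the logs) shows why per-fix cost climbs linearly even though each fix is the same size - every turn re-sends the entire conversation history:

```python
# Toy model: each fix re-sends the whole conversation history, so the
# input cost of fix N grows linearly with N even though the new work
# per fix is constant. Constants are illustrative, not measured.
BASE_CONTEXT = 280_000   # system prompt, CLAUDE.md, files in play
PER_FIX_GROWTH = 76_000  # history each completed fix drags forward

def per_fix_cost(fix_number: int) -> int:
    """Input tokens for fix number `fix_number` (1-indexed)."""
    return BASE_CONTEXT + (fix_number - 1) * PER_FIX_GROWTH

for n in (1, 3, 5, 7):
    print(n, per_fix_cost(n))  # fix 7 lands at 736,000, near the measured ~735K
```

The model also makes the /compact math obvious: resetting at fix 5 keeps every fix near the 280K floor, while letting the session run to fix 7 more than doubles the per-fix cost.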
But the second problem was the real "wait, what?" moment: the subagent clone army.
My logs showed Claude spawning up to 13 identical subagents for a single task. Each subagent was eating 66K to 84K tokens apiece. That's over 1M tokens before Claude wrote a single line of code. It wasn't just fetching context; it was re-spawning the exact same agent, with the exact same massive context window, over and over again. That one session hit 20.5 million tokens.
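You can spot this pattern in your own logs without reading megabytes of JSON: hash each subagent's context payload and count duplicates. The payload-extraction step is left out here (it depends on your log format), but the detection itself is a few lines:

```python
import hashlib
from collections import Counter

def duplicate_context_report(payloads: list) -> Counter:
    """Count identical context payloads by hash. Repeated hashes mean
    subagents are being re-spawned with the same context over and over."""
    return Counter(
        hashlib.sha256(p.encode()).hexdigest()[:12] for p in payloads
    )

# 13 subagents all sharing one identical context blob:
report = duplicate_context_report(["<big shared context>"] * 13)
print(report.most_common(1))  # one hash with a count of 13
```

If the top hash count is greater than 1, you're paying for the same context multiple times in a single task.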
The Fix: We Already Had One
I needed a way to give Claude Code the lay of the land without sending over the entire codebase or letting it mindlessly spawn subagents to blindly grep the repo. I started brainstorming ways to generate a lightweight, structural map of the project.
Then it hit me: we already had one.
I turned to Kartograph - a local codebase memory layer we built for exactly this kind of problem. Kartograph gives Claude a local, offline map of your codebase. Ask it anything in natural language - it finds the right code instantly, across every file, every project. Fully local. Nothing leaves your machine.
But there was a catch. Wiring up Kartograph as an MCP (Model Context Protocol) tool wasn't enough on its own. If you just plug it in, Claude will still happily spawn subagents to do its own digging.
The critical missing piece was the behavioral instruction. I had to explicitly update the CLAUDE.md file in the repository root to instruct Claude Code how to behave. By adding a hard rule to use the Kartograph MCP for architecture queries instead of spawning subagents to search the repo, the tool actually started working as intended. The model navigated the architecture intelligently, and I only passed the specific files it actually needed to modify.
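The rule I added looked roughly like this - the wording below is a reconstruction, and the exact tool name depends on what your MCP config exposes:

```markdown
## Architecture Queries
- For any "where is X / how does X work" question, query the Kartograph MCP first
- Never spawn Explore or Plan subagents to search the repo
- Only read the specific files Kartograph identifies as needing modification
```

The point isn't the exact phrasing; it's that the prohibition on subagents has to be explicit, or Claude falls back to its default exploration behavior no matter what tools are available.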
The Results: By the Numbers
I rolled out the Kartograph integration, locked down the CLAUDE.md instructions, and ran the token tracker.
The proof was in the session data first. That same trigger — a PR checkout plus a bug report — went from 20.5M tokens and 14 subagents to 2.4M tokens and zero subagents. Same PR. Same fixes. Just a Claude that knew where to look instead of spawning clones to rediscover the codebase from scratch.
The week of March 25th was my baseline, before any of this. The weekly numbers told the cleaner story:
- Week of Mar 25 (before): 460M tokens
- Current week (after): 212M tokens
Same three repos. Same daily development work. No change in what I was asking Claude to do — only how it was allowed to find the answers.
What You Can Do Right Now
You don't need Kartograph to start cutting your token usage today. Here's the action plan, in order of impact:
Step 1 - Measure first.
You can't fix what you can't see. Run the token tracker script against your ~/.claude/projects/ directory. Look at your top 3 costliest sessions. If subagent count is high and per-session tokens are in the millions, you have the same problem I had.
```shell
SINCE_DAYS=7 python3 claude_token_usage.py
```
Step 2 - Write a CLAUDE.md for every repo.
This is free and takes 10 minutes. A CLAUDE.md at the repo root is loaded at the start of every Claude Code session. Use it to tell Claude what it should and shouldn't do - which directories matter, which tools to prefer, what not to grep blindly. Without it, Claude explores your entire codebase from scratch every single session.
```markdown
## Code Lookup Rules
- Never spawn Explore or Plan subagents for code lookups
- Use Grep with specific patterns before reading any file
- Read only the lines you need - never an entire file
- For PR reviews: search for the referenced class before reading the diff
```

Step 3 - Manage your session shape.
Context always cascades. Every turn re-reads the full conversation history. A fix that costs 280K tokens at turn 3 costs 735K tokens at turn 7 — same code, just a longer session. The discipline that works:
- Cap sessions at 4-5 focused tasks, then start fresh
- Use /compact mid-session when a task is done and you're moving to something new
- One question per session is the cheapest shape - don't chain unrelated tasks
If you want to enforce this automatically without thinking about it, there's an env var for that. An Anthropic engineer shared this directly as a workaround for context quality degradation:
```shell
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000
```

Set this in your shell profile. It forces Claude Code to auto-compact at 400K tokens instead of letting context drift toward the 1M limit. Their internal finding: quality degrades past 200K even though 1M is available. This env var enforces the discipline so you don't have to.
Step 4 - Give Claude a codebase index.
Grep and Explore subagents are Claude's fallback when it has no map. A semantic index — whether Kartograph or another tool — means Claude queries for exactly what it needs instead of scanning everything. We're working on releasing Kartograph as a standalone MCP tool — local, offline, no API key, works across all your projects simultaneously. If you want early access, reach out to me on X.
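To make the idea concrete without Kartograph's internals (which aren't covered here), this is the general shape of a structural map, sketched with Python's stdlib ast module: a one-time pass builds a symbol-to-file index, and every subsequent lookup is a dictionary hit instead of a repo-wide grep.

```python
import ast
from pathlib import Path

def build_symbol_index(root: Path) -> dict:
    """Map each top-level class/function name to the file defining it.
    A crude stand-in for a real codebase index like Kartograph."""
    index = {}
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue  # skip files that don't parse
        for node in tree.body:  # top-level definitions only
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                index[node.name] = str(path)
    return index

# A lookup is now O(1) on a prebuilt map, e.g.:
#   build_symbol_index(Path("src"))["SomeClass"]
# instead of spawning a subagent to scan every file.
```

A real index does far more (call graphs, semantic search, cross-project scope), but even this crude version captures the economics: you pay the scanning cost once at index time, not once per question per subagent.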
Bonus - Cut output tokens too.
There's a whole other layer we haven't touched: Claude's responses are verbose by default. Caveman is a Claude Code skill that makes Claude respond in compressed, caveman-style prose — same technical accuracy, ~75% fewer output tokens. It also compresses your memory files to cut ~45% of input tokens per session. It won't fix a 20M token session, but once your session shape is under control, it's a clean final layer of savings.
The full stack, in order of impact:
| Layer | Tool | What it targets | Savings |
|---|---|---|---|
| Codebase exploration | Kartograph | Subagent clone explosion | 71x per code lookup |
| Session discipline | /compact + AUTO_COMPACT_WINDOW | Conversation cache spiral | 2.5x per-fix cost at turn 6+ |
| Output verbosity | Caveman | Claude's response length | ~75% output tokens |
| Command output | RTK / similar | Shell command noise | ~89% command output |
Attack them in that order. The first two are where the real money is.
The 200K context window is a ceiling, not a target. The goal is to give Claude exactly what it needs - nothing more.