ai can read 18,000 tokens of config on every message — wasting tokens before your prompt even starts

memorytune compresses your Claude Code config — CLAUDE.md, memory files, skill descriptions — so you spend tokens on work, not overhead.

a typical session
type: full workday
duration: 4+ hours
model: opus 4.6
plan: max 5
effort: max
workday — max 5 plan, opus 4.6, max effort, 8am to 6pm
[chart: cumulative tokens (0–1.5M) across the day, tuned session vs standard config]
tuned session: server audit → network fix → visual upgrade → layout tuning → deploy + QA → still working at 6pm
standard config: hit the usage limit, max plan locked out 10am–2pm and again 4–6pm
standard config: 4 hours of work, 6 hours locked out. tuned config: full day, never hit the limit — both on max effort
cumulative impact — tokens remaining after each technique
baseline: 100%
+ hidden state: 82% (-18%)
+ mem dedup: 67% (-15%)
+ naked code: 55% (-12%)
+ fork loop: 47% (-8%)
+ code switch: 42% (-5%)
+ url inject: 35% (-7%)
6 techniques applied. 65% fewer tokens. 98.2% fidelity. same output quality.
compression fidelity — 112 question a/b test
instruction following: 28/28 pass
code generation: 23/24 (1 edge case)
architectural decisions: 20/20 pass
debugging + triage: 22/22 pass
context retention: 17/18 (1 edge case)
total: 110/112 — 98.2% fidelity, 2 documented edge cases

compressed notation
what it actually looks like
ai reads tokens, not grammar. every heading, bullet point, and adjective is overhead — charged on every message. these are real config blocks, compressed with 98.2% measured fidelity.
naked code — the only reader is the machine.
before — 440 tokens
## Code Documentation
- Every function needs a docstring with description, args, returns, and examples
- Add inline comments above complex blocks explaining the reasoning, not the what
- README sections for each module with architecture overview and data flow
- Type annotations on all function signatures and class attributes
- Changelog entries for every modification
after — 42 tokens
docs:none—ai reads source directly
types:yes,skip obvious
no readme,no changelog,no docstrings
code IS the context
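a quick way to sanity-check a compression pass before trusting the savings. this sketch uses the common ~4 characters per token heuristic — not Claude's actual tokenizer, so treat the numbers as rough estimates, and re-check against a real token count if the decision matters.

```python
# rough before/after check for a compressed config block.
# assumption: ~4 chars per token — a heuristic, not the real tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

before = (
    "## Code Documentation\n"
    "- Every function needs a docstring with description, args, returns, and examples\n"
    "- Type annotations on all function signatures and class attributes\n"
)
after = "docs:none—ai reads source directly\ntypes:yes,skip obvious\n"

saving = 1 - estimate_tokens(after) / estimate_tokens(before)
print(f"estimated reduction: {saving:.0%}")
```

run it on each block as you compress — if the estimated reduction is small, the block probably wasn't the overhead worth attacking.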
mem dedup — save what matters. derive the rest.
before — 1,203 tokens
## Memory System
- Save important user preferences to memory
- Memory files go in the .claude/memory/ dir
- Include frontmatter with name, description, and type fields
- Update MEMORY.md index when saving memories
- Types: user, feedback, project, reference
- Don't save things derivable from code
- Don't save git history or debugging solutions
- Check for existing memory before creating new
after — 156 tokens
mem:save→.claude/memory/ w/ frontmatter(name,desc,type)
types:user|feedback|project|reference
update MEMORY.md index on save
skip:code-derivable,git-history,debug-fixes
dedup:check existing before new
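the dedup rule above, sketched as code to make the behavior concrete. the directory, frontmatter fields, and type list come from the config block; the function names and file layout are illustrative — Claude Code itself manages memories, this just encodes the "check existing before new" rule.

```python
from pathlib import Path

MEMORY_DIR = Path(".claude/memory")  # dir named in the config block above
TYPES = {"user", "feedback", "project", "reference"}

def find_existing(name: str):
    """dedup: check existing memory files before creating a new one."""
    if not MEMORY_DIR.exists():
        return None
    for f in MEMORY_DIR.glob("*.md"):
        head = f.read_text().splitlines()[:6]  # frontmatter only
        if f"name: {name}" in (line.strip() for line in head):
            return f
    return None

def save_memory(name: str, desc: str, mem_type: str, body: str) -> Path:
    assert mem_type in TYPES, f"unknown type: {mem_type}"
    existing = find_existing(name)
    if existing:
        return existing  # reuse, don't duplicate
    MEMORY_DIR.mkdir(parents=True, exist_ok=True)
    path = MEMORY_DIR / f"{name}.md"
    path.write_text(
        f"---\nname: {name}\ndescription: {desc}\ntype: {mem_type}\n---\n{body}\n"
    )
    return path
```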
code switch — write for the reader. the reader is a tokenizer.
before — 380 tokens
## Response Behavior
Please keep your responses concise and focused on the task at hand. Do not include unnecessary preamble, summaries, or pleasantries. When you reference code, always include the file path and line number so the user can navigate directly to the relevant section. If you encounter an error, explain what went wrong and suggest a fix rather than just showing the error message.
after — 38 tokens
resp:concise,task-focused,no filler
code ref→filepath:line always
error→explain+fix,not just dump
fork loop — when stuck, fork. don't loop.
without fork loop
ssh connection refused
→ retry with -v flag → connection refused
→ try port 22 explicitly → connection refused
→ try username@ip instead → connection refused
→ check firewall... same error
→ retry original command → connection refused
→ ask user for help
6 attempts, same wall
fork loop
ssh connection refused ×2 → fork:
agent A keeps ssh debug
agent B checks routing + firewall
→ B finds: no internet forwarding to host
→ B fixes route, ssh connects
solved in 2 steps, not 6
2 attempts trigger fork
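the trigger condition is the whole trick: fork when the same error comes back twice, not after a fixed retry budget. a minimal sketch — `attempt` and `fork` are stand-ins for agent actions, not a real api.

```python
# fork-loop rule as code: retry a step, but once the SAME error
# repeats max_repeats times, stop looping and fork to a second path.
def run_with_fork(attempt, fork, max_repeats=2):
    last, repeats = None, 0
    while True:
        try:
            return attempt()
        except RuntimeError as err:
            repeats = repeats + 1 if str(err) == last else 1
            last = str(err)
            if repeats >= max_repeats:
                return fork(last)  # agent B: check routing + firewall
```

note the counter resets when the error *changes* — a new error means the loop is still making progress, so only a repeated wall triggers the fork.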
hidden state — transformer runs. SSM thinks.
before — 410 tokens
## Working Memory
- Before each response, mentally review all prior context to maintain continuity
- Keep a running summary of decisions made, files changed, and approaches tried
- When starting a new task, check if similar work was done earlier in the session
- Compress old context when approaching limits — preserve decisions, drop details
- Carry architectural understanding forward between messages, never start cold
after — 39 tokens
mem:compressed state,not full replay
decisions+changes→persist,details→drop
similar prior work→reuse,don't redo
architecture→carry forward always
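"decisions persist, details drop" is a filter, and a filter is easy to sketch. the event shapes below are hypothetical — real transcripts are whatever your agent logs — but the keep/drop split is the rule from the block above.

```python
# hidden-state rule as code: between messages, persist decisions and
# file changes; drop tool output, retries, and chatter (re-derivable).
def compact(transcript, state=None):
    state = state or {"decisions": [], "files_changed": set()}
    for event in transcript:
        if event["kind"] == "decision":
            state["decisions"].append(event["summary"])
        elif event["kind"] == "edit":
            state["files_changed"].add(event["path"])
        # every other event kind is dropped on purpose
    return state
```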
url inject — skip the form. drop the value.
before — 520 tokens
> "go to the project settings"
1. navigate to dashboard.example.com
2. click "Projects" in the sidebar
3. find the project named "api-v2"
4. click the gear icon
5. dialog: "Save changes?" → click OK
6. scroll to "Webhooks" section
7. type the new URL into the field
8. click "Save"
after — 18 tokens
navigate→dashboard.example.com/projects/api-v2/settings#webhooks
inject URL value→save
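the compression here is replacing eight ui steps with one constructed deep link. a tiny sketch — the route shape mirrors the example above and is an assumption; every dashboard has its own url scheme, so map yours once and reuse it.

```python
# url inject as code: build the deep link once instead of replaying
# navigation steps. host/project/section names follow the example above.
def settings_url(host: str, project: str, section: str) -> str:
    return f"https://{host}/projects/{project}/settings#{section}"

print(settings_url("dashboard.example.com", "api-v2", "webhooks"))
```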

session log
what happened in 4 hours without hitting the usage limit
0:00
CODEBASE AUDIT
Full repository scan across 3 services. 140 files analyzed, dependency tree mapped, 51 issues flagged.
0:45
API INTEGRATION
Payment provider connected from scratch. Endpoint mapping, webhook validation, error handling, retry logic. Tested end-to-end with live sandbox.
1:30
FRONTEND REBUILD
Dashboard rewritten from wireframes. 12 components, responsive grid, real-time data binding.
2:30
DATA MIGRATION
Schema redesigned for scale. Migration script with rollback safety. 50K records moved, integrity verified, zero downtime.
3:30
DEPLOY + MONITOR
CI/CD pipeline assembled. Automated test suite, staging deploy, production cutover.
5 major tasks. max effort the entire session. 1 context compaction. 0 usage limits hit.

questions
does compressed notation actually work?
112 questions tested across instruction following, code generation, architectural reasoning, debugging, and context retention. identical agents, one verbose, one compressed. 110 passed, 2 edge cases: one complex regex generation lost a capture group, one long-session context recall drifted on variable names. both fixed by backing off compression on those blocks. ai tokenizes into subwords, not grammar — the grammar is overhead paid on every message.
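the shape of that a/b run, sketched so you can reproduce it on your own config. `ask` is a stand-in for whatever runs a prompt against an agent plus a config — not a real api — and `grade` is your per-task pass check (the real test graded per-task, not exact string match).

```python
# a/b fidelity harness sketch: same questions, verbose vs compressed
# config, count agreements. `ask` and `grade` are caller-supplied stubs.
def ab_fidelity(ask, verbose_cfg, compressed_cfg, questions, grade):
    passed = sum(
        grade(ask(verbose_cfg, q), ask(compressed_cfg, q)) for q in questions
    )
    return passed, len(questions), passed / len(questions)
```

any question that fails points at the exact config block to back off — which is how the two edge cases above were fixed.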
will this break my setup?
no. ai reads the same instructions — just fewer tokens to express them. test against the original after compressing. if comprehension drops on any task, you went too far on that section — back it off.
what tools does this apply to?
anything ai reads before responding. Claude Code, Cursor, Windsurf, Aider, custom agents. if ai spends tokens on config before doing your work, compression helps.
how long does compression take?
typical CLAUDE.md: 30-60 minutes following the patterns above. memory files and skill descriptions add time. the notation examples on this page cover the core patterns.
what about effort levels?
claude code has four effort levels: low, medium, high, and max. higher effort means deeper reasoning but burns tokens faster. this session ran entirely on max. dropping to high or medium for straightforward tasks can stretch the same token budget further — another lever alongside compression.
who built this?
justin at marow.ai. solo engineer, former platform team lead. i built this after watching my own token budget burn on config overhead every day. the patterns came from 6 months of daily Claude Code sessions on max effort. [email protected] — happy to talk methodology.
research
adaptive quantization blueprinting
memorytune handles the prompt layer. this paper tackles the model layer — what if inference precision adapted per query?
Abstract. Current quantization methods apply fixed precision determined offline. The Quantization Blueprint Engine (QBE) analyzes each query at runtime and generates a per-layer precision blueprint: weight bit-widths, activation precisions, and KV-cache parameters across every transformer layer. QBE integrates JIT kernel synthesis, blueprint caching with semantic fingerprinting, FP4/FP6/FP8 tensor core execution, SLA-driven routing, mid-generation refresh, and type-aware quantization for hybrid transformer-SSM architectures.