Claude Opus 4.7 supports a 1,000,000-token context window. The number gets all the airtime. The behavior change underneath it gets less attention โ and that's the part that matters.
The old playbook for long-context models was "stuff more stuff in there." That playbook does not scale to 1M tokens. At full capacity a single query can cost roughly $7.50 in input alone (Opus pricing of $7.50/MTok in), or ~$15 with output. Run that ten times an hour on an ill-considered workflow and you've spent a car payment by lunch.
The new playbook is: treat the 1M window as a persistent in-memory cache, not a one-shot prompt. Once you internalize that, the patterns become obvious.
The mental model: cache, not prompt
Anthropic's prompt cache lets you keep up to 90% of a 1M-token prompt warm across calls for a 5-minute TTL. The cached portion costs 10% of the standard input rate. So a "1M context" workflow isn't "send 1M tokens every time" โ it's "send 1M tokens once, then ask 50 questions against that warm cache at one-tenth the price."
Concretely: a 900k-token cached prefix costs about $0.68 per follow-up turn instead of $6.75. That's the actual unlock. Once you see it, you'll catch yourself architecting around it.
8 patterns where 1M wins
1. Full-codebase audits
A medium codebase is 200kโ600k tokens. With 1M context you can hold the entire thing โ source files, tests, docs, dependency configs โ and run 30+ follow-up questions against the same cached snapshot. "Where is auth handled?" "Find every place we call this deprecated API." "What's the blast radius if I rename this function?" Each follow-up costs pennies because the codebase is cached.
We use this on Sunday nights for the Monday pulse: load the repo, ask twelve diagnostic questions, write the answers into our standup notes. Total cost: about $1.20.
2. Whole-book or whole-spec editing
A 300-page book is roughly 110k tokens. A serious technical spec runs 200k+. Load the whole document and ask Claude to find inconsistencies across chapters โ the kind of cross-reference checks a human editor flat-out cannot do at scale. Then iterate revisions against the cached version.
3. Long meeting transcripts
A four-hour board meeting transcript runs 60kโ80k tokens. Load it, then ask: who committed to what by when, where did the discussion drift, what decision got made on Topic X, who pushed back and why. The cross-temporal pattern recognition is where 4.7's new tokenizer and reasoning model legitimately outperforms its predecessors.
4. Multi-document synthesis
Twelve research papers. Six competitor product pages. Forty Stripe disputes. Load them all and ask for the synthesis. The 1M window means you don't pre-summarize each document (which loses fidelity), you just ask the question against the raw set.
5. Architecture review with full repo + diagrams
Opus 4.7's vision also jumped to 3.75MP (2576px max input). Combine that with the 1M context and you can drop in your full repo plus the system architecture diagram, the database schema PDF, and the API spec โ and have Claude reason about all of them together. This used to require a multi-step RAG pipeline; now it's one prompt.
6. Cross-version diffs (refactors, migrations)
Load main, load your branch, ask: "Find every behavioral change between these two versions, ignoring formatting." Or load v1.0 and v2.0 of a library and ask Claude to write the migration guide. Tokens used: 400kโ700k. Worth every cent.
7. Long-running agent memory replay
Opus 4.7 ships a file-system memory tool. Agents jot notes to disk between sessions. When a session resumes, you can load the full notes history (often 100k+ tokens after a few weeks) into the prompt and the agent picks up exactly where it left off. This is the foundation of the "dreaming" feature in Managed Agents โ we cover that in a separate guide.
8. Customer support replay
Load a customer's full ticket history โ chat logs, order history, support emails, churn signals โ and ask Claude to draft the response. The output is dramatically better than the standard "look up customer in CRM, paste into prompt" workflow because Claude sees the pattern, not just the most recent ticket.
3 patterns where 1M loses
Not every workflow benefits. Three classes where chunking + RAG still beats long context:
1. Needle-in-a-haystack lookups
"Find the customer record where email = jane@example.com" against 800k tokens of CSV data. Long-context models still degrade meaningfully on these โ recall drops to 70-85% depending on where the needle sits in the haystack. A SQL query or a vector lookup is faster, cheaper, and 100% accurate.
2. Very small queries with no shared context
If each query is unrelated to the last, there's no cache to hit. You're paying full input rate every time. A 50k context running 200 unrelated queries costs the same as 200 separate 50k prompts. Just send the small prompts.
3. Workflows that need fresh data each call
If your "context" is "the current state of the database" and that state changes between every call, caching does nothing. Use the API normally with a tool-use loop, not the long-context cache.
Budgeting: how to not burn $40 per query
Three rules we follow:
- Cache the static stuff first. System prompt, codebase, reference docs โ anything that doesn't change between turns goes first in your prompt, before any variable content. The cache only hits a prefix.
- Keep the variable tail small. User question, current diff, fresh data โ all of it concentrated in the last 5-10k tokens. If your variable content keeps growing, the cache will keep invalidating.
- Pre-budget per workflow. Before deploying a 1M-context workflow, calculate: input cost ร queries ร cache hit rate. We require any workflow over $1/call to have an explicit ROI justification.
code-reviewer skill loads ~120k tokens of cached prefix (CLAUDE.md, codebase summaries, recent PRs) and a ~5k variable tail (the diff). Per-review cost: ~$0.18. Without caching: ~$1.20. Six-fold cheaper, identical output.
Does 1M context kill RAG?
Short answer: no, but it kills bad RAG.
Pre-2026 RAG pipelines had four moving parts: chunker, embedder, vector store, retriever. Each was a place to lose information. Bad chunking split a function across two embeddings. Bad embedding lost the semantic link between "auth" and "session." Bad retrieval surfaced the wrong 5 chunks. The compounding error was often worse than just stuffing the raw context in.
With 1M context, you can skip RAG entirely for any corpus that fits. If your knowledge base is <800k tokens, dump it in. Cache it. Query against the cache.
RAG still wins when:
- Your corpus is larger than 1M tokens (most enterprise wikis).
- The corpus updates faster than the 5-minute cache TTL.
- You have strict latency requirements and can't afford the first cache-miss query.
- You need source citation โ Claude with raw 1M context still hallucinates citations occasionally.
For everything else, the 1M window is the simpler, cheaper, more accurate path.
The takeaway
1M context is not "use more tokens." It's "use the same tokens fifty times for one-tenth the price." Architect for cache hits. Cache the static prefix, keep the variable tail tiny, batch your queries inside the 5-minute window.
Workflows that follow this pattern become 6ร cheaper than the same workflow on a 200k-window model. Workflows that don't follow it become 6ร more expensive. There's no middle ground.
If you want the 25 recipes we use in production โ pre-tested, with cache layouts and budget math worked out โ the 1M Context Cookbook is $9 one-time. Otherwise, the framework above is yours. Go build.