Claude API Prompt Caching: Cut Input Costs by 90%
If you're building multi-turn conversation apps or processing large contexts with the Claude API, resending the full system prompt and message history on every request adds up fast. Prompt Caching is Anthropic's solution — cache hits reduce input token costs to 1/10th of the base price.
This guide covers the mechanics, both usage patterns, pricing math, and common pitfalls.
How Prompt Caching Works
On each API request, the system checks whether your prompt prefix is already cached:
- Check cache: From the start of your prompt to the designated cache breakpoint, look for an existing cache entry
- Hit → reuse: If matched, skip reprocessing that portion and use the cached version
- Miss → write: Process the full prompt and write the prefix into cache for subsequent requests
Cache follows a hierarchy: tools → system → messages. Changes at any level invalidate that level and everything after it.
Key insight: caching is prefix-based. Only content that's continuous and unchanged from the very start can be cached. Inserting changes in the middle invalidates everything downstream.
Supported Models & Minimum Token Requirements
All active Claude models support Prompt Caching:
| Model | Minimum Cacheable Tokens |
|---|---|
| Claude Opus 4.7 / 4.6 / 4.5, Haiku 4.5 | 4,096 tokens |
| Claude Sonnet 4.6 / 4.5, Opus 4.1 | 1,024 tokens |
Prompts shorter than the minimum won't be cached even if marked with cache_control.
Two Usage Patterns
Automatic Caching (Recommended for Multi-Turn)
Add cache_control at the top level of your request. The system automatically places the cache breakpoint on the last cacheable block and moves it forward as the conversation grows:
Cache behavior across turns:
| Request | Cache Behavior |
|---|---|
| Turn 1 | Everything written to cache |
| Turn 2 | Previous content read from cache; new turns written |
| Turn 3 | All history read from cache; only new message written |
Automatic caching is the simplest approach — one line of config handles multi-turn cache optimization without manual breakpoint management.
Manual Breakpoints (Fine-Grained Control)
Place cache_control on specific content blocks to precisely control what gets cached. Up to 4 breakpoints per request:
Best for prompts where some sections (like coding standards) never change and others (like the current question) always change.
Cache Duration (TTL)
| Type | TTL | Write Price | Use Case |
|---|---|---|---|
| Default | 5 minutes | 1.25x base input | Fast-paced multi-turn conversations |
| Extended | 1 hour | 2x base input | Users may return after longer gaps |
The 5-minute TTL means: if a user waits more than 5 minutes between messages, the next request triggers a cache write (1.25x price) instead of a cache read (0.1x price). For low-frequency interactions, consider 1-hour TTL.
Pricing Breakdown
Using Claude Sonnet 4.6 as an example (per million tokens):
| Operation | Price | vs. Base Input |
|---|---|---|
| Base input | $3 | 1x |
| 5-min cache write | $3.75 | 1.25x |
| 1-hour cache write | $6 | 2x |
| Cache read | $0.30 | 0.1x |
| Output | $15 | — |
The math: first write costs 25% more, but every subsequent read saves 90%. If the same cache is read 2+ times, total cost is lower than no caching at all.
Monitoring Cache Performance
Every API response includes three key metrics in the usage field:
Total input tokens = cache_read + cache_creation + input
In steady-state conversations, cache_read_input_tokens should far exceed cache_creation_input_tokens. If you're constantly writing instead of reading, check whether content that shouldn't change is changing.
Cache Pre-Warming
For latency-sensitive applications, send an empty request to "warm" the cache before real traffic:
Common Pitfalls & Best Practices
Pitfall: Breakpoint on Changing Content
Best Practices Checklist
- Start with automatic caching for multi-turn conversations — one line of config
- Put stable content first: system prompt → reference docs → tool definitions → history → current question
- Verify cache hits: check that
cache_read_input_tokens > 0 - Use 1-hour TTL for interactions with >5 minute gaps
- Don't frequently add/remove tools — the tools array is first in the cache hierarchy; changes invalidate everything downstream
Full Example: RAG Application with Caching
Summary
| Scenario | Recommended Approach |
|---|---|
| Multi-turn conversations | Top-level cache_control (automatic) |
| Large static reference + varying questions | Manual breakpoints |
| Fixed tool definitions + system prompt | Manual breakpoint at end of system |
| Low-frequency interactions (>5 min gaps) | 1-hour TTL |
| First-response latency sensitive | Pre-warming request |
Prompt Caching offers the best ROI of any Claude API optimization. Minimal configuration (as little as one line), significant savings (up to 90%), and zero impact on output quality. If your application has repeated prompt prefixes, enable it now.