If you're building multi-turn conversation apps or processing large contexts with the Claude API, resending the full system prompt and message history on every request adds up fast. Prompt Caching is Anthropic's solution — cache hits reduce input token costs to 1/10th of the base price.

This guide covers the mechanics, both usage patterns, pricing math, and common pitfalls.

How Prompt Caching Works

On each API request, the system checks whether your prompt prefix is already cached:

Check cache: From the start of your prompt to the designated cache breakpoint, look for an existing cache entry
Hit → reuse: If matched, skip reprocessing that portion and use the cached version
Miss → write: Process the full prompt and write the prefix into cache for subsequent requests

Cache follows a hierarchy: tools → system → messages. Changes at any level invalidate that level and everything after it.

Key insight: caching is prefix-based. Only content that's continuous and unchanged from the very start can be cached. Inserting changes in the middle invalidates everything downstream.

Supported Models & Minimum Token Requirements

All active Claude models support Prompt Caching:

Model	Minimum Cacheable Tokens
Claude Opus 4.7 / 4.6 / 4.5, Haiku 4.5	4,096 tokens
Claude Sonnet 4.6 / 4.5, Opus 4.1	1,024 tokens

Prompts shorter than the minimum won't be cached even if marked with cache_control.

Two Usage Patterns

Automatic Caching (Recommended for Multi-Turn)

Add cache_control at the top level of your request. The system automatically places the cache breakpoint on the last cacheable block and moves it forward as the conversation grows:

Cache behavior across turns:

Request	Cache Behavior
Turn 1	Everything written to cache
Turn 2	Previous content read from cache; new turns written
Turn 3	All history read from cache; only new message written

Automatic caching is the simplest approach — one line of config handles multi-turn cache optimization without manual breakpoint management.

Manual Breakpoints (Fine-Grained Control)

Place cache_control on specific content blocks to precisely control what gets cached. Up to 4 breakpoints per request:

Best for prompts where some sections (like coding standards) never change and others (like the current question) always change.

Cache Duration (TTL)

Type	TTL	Write Price	Use Case
Default	5 minutes	1.25x base input	Fast-paced multi-turn conversations
Extended	1 hour	2x base input	Users may return after longer gaps

The 5-minute TTL means: if a user waits more than 5 minutes between messages, the next request triggers a cache write (1.25x price) instead of a cache read (0.1x price). For low-frequency interactions, consider 1-hour TTL.

Pricing Breakdown

Using Claude Sonnet 4.6 as an example (per million tokens):

Operation	Price	vs. Base Input
Base input	$3	1x
5-min cache write	$3.75	1.25x
1-hour cache write	$6	2x
Cache read	$0.30	0.1x
Output	$15	—

The math: first write costs 25% more, but every subsequent read saves 90%. If the same cache is read 2+ times, total cost is lower than no caching at all.

Monitoring Cache Performance

Every API response includes three key metrics in the usage field:

Total input tokens = cache_read + cache_creation + input

In steady-state conversations, cache_read_input_tokens should far exceed cache_creation_input_tokens. If you're constantly writing instead of reading, check whether content that shouldn't change is changing.

Cache Pre-Warming

For latency-sensitive applications, send an empty request to "warm" the cache before real traffic:

Common Pitfalls & Best Practices

Pitfall: Breakpoint on Changing Content

Best Practices Checklist

Start with automatic caching for multi-turn conversations — one line of config
Put stable content first: system prompt → reference docs → tool definitions → history → current question
Verify cache hits: check that cache_read_input_tokens > 0
Use 1-hour TTL for interactions with >5 minute gaps
Don't frequently add/remove tools — the tools array is first in the cache hierarchy; changes invalidate everything downstream

Full Example: RAG Application with Caching

Summary

Scenario	Recommended Approach
Multi-turn conversations	Top-level `cache_control` (automatic)
Large static reference + varying questions	Manual breakpoints
Fixed tool definitions + system prompt	Manual breakpoint at end of system
Low-frequency interactions (>5 min gaps)	1-hour TTL
First-response latency sensitive	Pre-warming request

Prompt Caching offers the best ROI of any Claude API optimization. Minimal configuration (as little as one line), significant savings (up to 90%), and zero impact on output quality. If your application has repeated prompt prefixes, enable it now.

This guide covers the mechanics, both usage patterns, pricing math, and common pitfalls.

How Prompt Caching Works

On each API request, the system checks whether your prompt prefix is already cached:

Check cache: From the start of your prompt to the designated cache breakpoint, look for an existing cache entry
Hit → reuse: If matched, skip reprocessing that portion and use the cached version
Miss → write: Process the full prompt and write the prefix into cache for subsequent requests

Cache follows a hierarchy: tools → system → messages. Changes at any level invalidate that level and everything after it.

Key insight: caching is prefix-based. Only content that's continuous and unchanged from the very start can be cached. Inserting changes in the middle invalidates everything downstream.

Supported Models & Minimum Token Requirements

All active Claude models support Prompt Caching:

Model	Minimum Cacheable Tokens
Claude Opus 4.7 / 4.6 / 4.5, Haiku 4.5	4,096 tokens
Claude Sonnet 4.6 / 4.5, Opus 4.1	1,024 tokens

Prompts shorter than the minimum won't be cached even if marked with cache_control.

Two Usage Patterns

Automatic Caching (Recommended for Multi-Turn)

Add cache_control at the top level of your request. The system automatically places the cache breakpoint on the last cacheable block and moves it forward as the conversation grows:

Cache behavior across turns:

Request	Cache Behavior
Turn 1	Everything written to cache
Turn 2	Previous content read from cache; new turns written
Turn 3	All history read from cache; only new message written

Automatic caching is the simplest approach — one line of config handles multi-turn cache optimization without manual breakpoint management.

Manual Breakpoints (Fine-Grained Control)

Place cache_control on specific content blocks to precisely control what gets cached. Up to 4 breakpoints per request:

Best for prompts where some sections (like coding standards) never change and others (like the current question) always change.

Cache Duration (TTL)

Type	TTL	Write Price	Use Case
Default	5 minutes	1.25x base input	Fast-paced multi-turn conversations
Extended	1 hour	2x base input	Users may return after longer gaps

Pricing Breakdown

Using Claude Sonnet 4.6 as an example (per million tokens):

Operation	Price	vs. Base Input
Base input	$3	1x
5-min cache write	$3.75	1.25x
1-hour cache write	$6	2x
Cache read	$0.30	0.1x
Output	$15	—

The math: first write costs 25% more, but every subsequent read saves 90%. If the same cache is read 2+ times, total cost is lower than no caching at all.

Monitoring Cache Performance

Every API response includes three key metrics in the usage field:

Total input tokens = cache_read + cache_creation + input

Cache Pre-Warming

For latency-sensitive applications, send an empty request to "warm" the cache before real traffic:

Common Pitfalls & Best Practices

Pitfall: Breakpoint on Changing Content

Best Practices Checklist

Start with automatic caching for multi-turn conversations — one line of config
Put stable content first: system prompt → reference docs → tool definitions → history → current question
Verify cache hits: check that cache_read_input_tokens > 0
Use 1-hour TTL for interactions with >5 minute gaps
Don't frequently add/remove tools — the tools array is first in the cache hierarchy; changes invalidate everything downstream

Full Example: RAG Application with Caching

Summary

Scenario	Recommended Approach
Multi-turn conversations	Top-level `cache_control` (automatic)
Large static reference + varying questions	Manual breakpoints
Fixed tool definitions + system prompt	Manual breakpoint at end of system
Low-frequency interactions (>5 min gaps)	1-hour TTL
First-response latency sensitive	Pre-warming request

Claude API Prompt Caching: Cut Input Costs by 90%

How Prompt Caching Works

Supported Models & Minimum Token Requirements

Two Usage Patterns

Automatic Caching (Recommended for Multi-Turn)

Manual Breakpoints (Fine-Grained Control)

Cache Duration (TTL)

Pricing Breakdown

Monitoring Cache Performance

Cache Pre-Warming

Common Pitfalls & Best Practices

Pitfall: Breakpoint on Changing Content

Best Practices Checklist

Full Example: RAG Application with Caching

Summary

Claude API Prompt Caching: Cut Input Costs by 90%

How Prompt Caching Works

Supported Models & Minimum Token Requirements

Two Usage Patterns

Automatic Caching (Recommended for Multi-Turn)

Manual Breakpoints (Fine-Grained Control)

Cache Duration (TTL)

Pricing Breakdown

Monitoring Cache Performance

Cache Pre-Warming

Common Pitfalls & Best Practices

Pitfall: Breakpoint on Changing Content

Best Practices Checklist

Full Example: RAG Application with Caching

Summary