Technical · Intermediate · 20 minutes

How to Use Prompt Caching to Save Money on AI

Ryan Wanner

AI Systems Instructor • Real Estate Technologist

Quick Answer: Identify the parts of your prompts that stay the same across requests (Context Cards, system prompts, role definitions), structure your API calls so these parts are cached, and let the API reuse them instead of reprocessing. Anthropic's prompt caching can reduce costs by up to 90% on cached tokens.

If you're using AI APIs for real estate automation, you're probably paying full price on every request. Prompt caching changes that. When you send the same system prompt (your Context Card) with different user messages, caching lets the API reuse the processed system prompt instead of reprocessing it every time. The result: up to 90% cost reduction on the cached portion and faster response times. This guide shows you how to identify cacheable prompts, implement caching, and measure your savings.

What You'll Need

Tools Needed

Anthropic API account or OpenAI API account, Python 3.8+, code editor

Step-by-Step Instructions

1

Understand How Prompt Caching Works

Every time you send an API request, the model processes every token in your prompt. Your 500-token Context Card gets processed from scratch whether it's your first request or your thousandth. Prompt caching tells the API: 'I've sent this system prompt before. Reuse the processed version.' The API stores the processed prompt and skips reprocessing on subsequent calls. You pay a small caching fee on the first request, then dramatically reduced costs on every cached hit after that. For real estate agents running batch operations (50 listing descriptions with the same system prompt), this adds up fast.

Tip: Think of prompt caching like a GPS that remembers your home address. Instead of re-calculating the route from scratch every time, it starts from a saved point. Your Context Card is the saved point. Each new task just adds the destination.

2

Identify Your Cacheable Content

Cacheable content is anything that stays the same across multiple requests. In real estate AI workflows, this includes: your Context Card (brand voice, market context, style preferences), your role definition ('You are a luxury real estate copywriter in Nashville'), output format instructions ('Use short sentences, under 150 words, lead with the strongest feature'), and reference examples (few-shot examples of your best work). The non-cacheable part is the specific task data: property details, lead information, market stats that change. Separate these clearly in your prompts.

Tip: Your cacheable content should be in the system prompt. Your variable content should be in the user message. This clean separation maximizes cache hit rates. Most real estate workflows have 60-80% cacheable content.
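That separation can be sketched in a few lines. The Context Card text and the helper name below are placeholders, not from a real workflow:

```python
# Static, cacheable content: identical on every request in the batch.
CONTEXT_CARD = (
    "You are a luxury real estate copywriter in Nashville, TN.\n"
    "Style: short sentences, under 150 words, lead with the strongest feature."
)

def build_request(property_details: str) -> dict:
    """Everything reusable goes in `system`; everything per-task goes in the user message."""
    return {
        "system": CONTEXT_CARD,  # static -> cache target
        "messages": [
            {"role": "user", "content": f"Write a listing description: {property_details}"}
        ],  # dynamic -> never cached
    }

req = build_request("3BR/2BA ranch, Donelson, 1800 sqft")
```

If you find yourself editing the system prompt per property, that content belongs in the user message instead.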

3

Structure Your API Calls for Maximum Cache Hits

Anthropic's API supports explicit cache control. In your system prompt, add cache_control markers that tell the API which content to cache. The system prompt (your Context Card) is the primary cache target. Keep it identical across requests—even small changes invalidate the cache. For batch operations like processing 50 listing descriptions, your system prompt stays the same for all 50. Only the user message (property details) changes. This means 49 out of 50 requests hit the cache and cost a fraction of the first request.

Tip: Cache breakpoints work best at the end of your system prompt. Put your Context Card and all static instructions in the system message with a cache marker at the end. Everything in the user message remains dynamic and uncached.
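In code, the system message becomes a list of content blocks, with the cache marker on the last static block. The prompt text below is a placeholder:

```python
# System message as a list of content blocks. The `cache_control` field on the
# final block tells the API to cache everything up to and including that block.
system_blocks = [
    {
        "type": "text",
        "text": "Context Card + role definition + format rules + few-shot examples",
        "cache_control": {"type": "ephemeral"},  # breakpoint at the end of the static prefix
    }
]
```

Everything before the breakpoint must be byte-identical across requests for the cache to hit; the user message that follows can vary freely.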

4

Implement Caching in Your Code

For Anthropic's API, add cache_control to your system prompt block. The implementation is straightforward: structure your messages so the system prompt includes your full Context Card with a cache_control breakpoint, and the user message contains only the variable task. On the first request, you pay a small premium (25% extra on tokens written to the cache). On every subsequent request within the cache lifetime (5 minutes, refreshed each time the cache is read), cached tokens cost 90% less. For a 50-item batch operation, that's 49 requests at a 90% discount on the system prompt.

Tip: Anthropic's ephemeral cache has a 5-minute lifetime that refreshes each time the cache is read, so you only lose the cache if more than 5 minutes pass between calls. For 50 items with a 1-second delay between calls, the batch completes in under 2 minutes, comfortably inside the window.
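The arithmetic behind those discounts is easy to sketch. This assumes Sonnet-class input pricing ($3 per million tokens base, 1.25x for cache writes, 0.1x for cache reads); check current pricing before relying on exact figures:

```python
BASE = 3.00 / 1_000_000                 # $ per input token (assumed base rate)
WRITE, READ = 1.25 * BASE, 0.10 * BASE  # cache-write premium, cache-read discount

def system_prompt_cost(tokens: int, n_requests: int, cached: bool) -> float:
    """Cost of processing the system prompt across a batch."""
    if not cached:
        return tokens * BASE * n_requests
    # First request writes the cache; the remaining requests read it.
    return tokens * WRITE + (n_requests - 1) * tokens * READ

cached_cost = system_prompt_cost(500, 50, cached=True)    # 500-token Context Card
uncached_cost = system_prompt_cost(500, 50, cached=False)
savings = 1 - cached_cost / uncached_cost                 # ~88% on system-prompt tokens
```

The savings apply only to the repeated prefix; user messages and output tokens are billed at standard rates either way.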

5

Measure Your Savings and Optimize

Track three metrics: cache hit rate (percentage of requests that use cached prompts), cost per request (before and after caching), and total monthly spend. Anthropic's API returns cache hit information in every response. Build a simple logging function that records whether each request was a cache hit and the token costs. After one month, compare your cached costs to what you would have paid without caching. Most real estate API workflows see 40-70% total cost reduction with proper caching. Apply the OODA Loop: observe your cache hit rates, identify requests that miss the cache (usually from inconsistent system prompts), fix the inconsistency, and measure again.

Tip: If your cache hit rate is below 80%, you likely have inconsistencies in your system prompts. Even an extra space or changed word invalidates the cache. Use a single source of truth for each Context Card and load it programmatically rather than copying and pasting.
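A minimal version of that logging function might look like this, assuming the response shape from Anthropic's Python SDK (the CSV path is arbitrary):

```python
import csv
from datetime import datetime, timezone

def log_cache_usage(usage, path="cache_log.csv"):
    """Append one row per request; a nonzero cache_read count means a cache hit."""
    hit = usage.cache_read_input_tokens > 0
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            usage.input_tokens,
            usage.cache_creation_input_tokens,
            usage.cache_read_input_tokens,
            "hit" if hit else "miss",
        ])
    return hit
```

Call `log_cache_usage(message.usage)` after each request; at month's end, the CSV gives you your hit rate and token counts to compare against uncached pricing.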

Real-World Example

See It in Action

Prompt
# Python example - Anthropic API with Prompt Caching
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# This system prompt gets cached after first request
system_prompt = """
You are a luxury real estate copywriter specializing in Nashville, TN.
Market context: Median $445K, 2.8 months inventory, 24 avg DOM.

Style guidelines:
- Short, punchy sentences
- Lead with strongest visual feature
- Under 150 words
- End with location context
- Never use: stunning, nestled, welcome to, boasts

Examples of my voice:
[3 example listing descriptions would go here]
"""

# Process a batch of listings with the same cached system prompt
properties = [
    "4BR/3BA colonial, Franklin, 3200 sqft, pool, renovated kitchen",
    "2BR/2BA condo, Germantown, 1100 sqft, rooftop access, modern finishes",
    "3BR/2BA ranch, Donelson, 1800 sqft, hardwood floors, fenced yard",
]

for prop in properties:
    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=300,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": f"Write listing description: {prop}"}]
    )
    
    # Check cache usage
    print(f"Cache read tokens: {message.usage.cache_read_input_tokens}")
    print(f"Cache creation tokens: {message.usage.cache_creation_input_tokens}")
    print(message.content[0].text)
    print("---")
AI Output
BATCH PROCESSING WITH CACHING: COST COMPARISON

Request 1 (cache creation):
- System prompt: 450 tokens (cache write: 450 x $3.75/M = $0.00169)
- User message: 25 tokens (standard: 25 x $3.00/M = $0.000075)
- Output: 180 tokens ($15.00/M = $0.0027)
- Total: $0.00446

Request 2 (cache hit):
- System prompt: 450 tokens (cache read: 450 x $0.30/M = $0.000135)
- User message: 22 tokens (standard: $0.000066)
- Output: 165 tokens ($0.002475)
- Total: $0.00268

Request 3 (cache hit):
- System prompt: 450 tokens (cache read: $0.000135)
- User message: 28 tokens (standard: $0.000084)
- Output: 172 tokens ($0.00258)
- Total: $0.00280

BATCH TOTAL (3 listings): $0.00994
WITHOUT CACHING (3 listings): $0.01203
SAVINGS: ~17% on 3 items

At 50 items: ~$0.14 with caching vs. ~$0.20 without = ~30% savings
At scale with larger system prompts (2,000+ tokens): savings reach 60-70%
Cache hit rate: 66.7% (2 of 3 requests)
System prompt processing: 1 time instead of 3

Pro Tips

1

Keep your Context Cards identical across all requests in a batch. Load them from a single source file rather than retyping or pasting them. Even a typo fix between requests creates a new cache entry instead of hitting the existing one.

2

Larger system prompts benefit more from caching. A 2,000-token Context Card with few-shot examples saves 10x more per cached request than a 200-token role prompt. Put your most detailed, stable content in the cached system prompt.

3

Keep the gap between calls inside the cache window (5 minutes for Anthropic, refreshed on each read). If a pause between items exceeds the window, the cache expires and costs revert to full price for later items.

4

Use prompt caching alongside model selection for maximum cost optimization. Cache your Context Card with Claude Haiku for routine tasks. The combination of caching + cheapest model produces AI output at pennies per request.

Common Mistakes to Avoid

Modifying the system prompt slightly between requests, invalidating the cache

Fix: Load your system prompt from a single source file. Any change—even adding a period or fixing a typo—creates a new cache entry. Treat the cached prompt as immutable during a batch operation.

Putting variable content (property details, lead names) in the system prompt instead of the user message

Fix: Separate static from dynamic content. System prompt = Context Card (static). User message = task and data (dynamic). Variable content in the system prompt guarantees cache misses on every request.

Not tracking cache performance, assuming it's working without verification

Fix: Log cache_read_input_tokens and cache_creation_input_tokens from every response. If cache_read is always 0, your caching isn't working. Debug by checking system prompt consistency across requests.
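One way to enforce the consistency these fixes call for is a hash guard that fails loudly if the system prompt drifts mid-batch. The function names and prompt text below are illustrative:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash of the exact bytes the API will see as the cache prefix."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

SYSTEM_PROMPT = "Context Card + static instructions..."  # placeholder content
EXPECTED = fingerprint(SYSTEM_PROMPT)

def check_before_send(prompt: str) -> None:
    """Raise before the API call if the prompt would miss the cache."""
    if fingerprint(prompt) != EXPECTED:
        raise ValueError("System prompt changed mid-batch: guaranteed cache miss")
```

Even a single added space produces a different fingerprint, which mirrors how the cache behaves: any byte-level change to the prefix creates a new cache entry.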

Frequently Asked Questions

What is prompt caching?
Prompt caching is a feature offered by AI API providers that stores the processed version of your prompt so it doesn't need to be reprocessed on every request. When you send the same system prompt repeatedly (like your Context Card), the API recognizes it and reuses the cached version. This reduces costs (up to 90% on cached tokens) and improves response speed. It's like preloading your GPS with a saved starting point instead of entering the full address every time.
Does prompt caching affect output quality?
No. Cached prompts produce identical results to non-cached prompts. The model processes the cached content exactly the same way—it just skips the redundant processing step. Your listing descriptions, market analyses, and lead responses will be the same quality whether the system prompt was cached or freshly processed. Caching is purely an efficiency optimization with no impact on output.
Which AI APIs support prompt caching?
Anthropic's Claude API offers explicit prompt caching with cache_control markers. OpenAI's API implements automatic prompt prefix caching on longer prompts. The implementation details differ, but the concept is the same: repeated system prompts get processed once and reused. Anthropic's caching gives you more control over what gets cached. OpenAI's caching happens automatically but with less granular control. For real estate batch operations, both effectively reduce costs.
How much can I actually save with prompt caching?
Savings depend on your system prompt size and batch volume. For a typical real estate setup with a 500-token Context Card processing 50 items: without caching, you process 25,000 system prompt tokens at full price. With caching, you write 500 tokens to the cache once (at a 25% premium) and cache-read 24,500 tokens at a 90% discount, which works out to over 80% savings on input token costs for that batch. Larger system prompts with more few-shot examples save even more. Monthly, most real estate API users save $10-$30, which adds up to significant annual savings.


Learn Advanced AI Techniques Live

Stop guessing with AI. Join The Architect workshop to master the frameworks behind every guide on this site.