What Is Prompt Caching and Why Should You Care?
Every time you send a prompt to an AI model through the API, the model processes every single token from scratch. Your system instructions, your Context Card, your formatting rules — all of it gets recomputed on every request. That's wasteful when 80% of your prompt is identical across hundreds of API calls.
Prompt caching solves this. It stores the pre-computed results of your static prompt prefix — the part that doesn't change between requests — so the model can skip reprocessing it. Instead of paying full price for 2,000 tokens of system instructions on every call, you pay full price once and get a 50-90% discount on those tokens for every subsequent request.
Think of it like a real estate transaction coordinator who memorizes your standard checklists. The first time, they read every line. After that, they already know the routine and jump straight to what's unique about each deal. Same quality, fraction of the time.
This matters because AI API costs scale with volume. An agent running 50 listing descriptions through the API with a detailed Context Card might spend $3-5 in tokens. With caching enabled, that same batch drops to $0.50-1.50. Multiply that across a team of 20 agents running daily operations, and the savings are substantial.
According to NAR's 2025 Technology Survey, 68% of Realtors now use AI tools. Most use the chat interface. But the power users — the ones running bulk operations, automating lead responses, and building custom workflows — are the ones who benefit most from understanding caching.
How Prompt Caching Works (Without the Computer Science Degree)
The Technical Reality, Simplified
When you send a prompt to an AI model, the model converts your text into tokens and processes them through its neural network. This processing takes compute time and costs money. Prompt caching works by saving the intermediate computational state after processing your static prefix — everything before the dynamic part of your prompt.
Here's the sequence:
First request: You send your full prompt — system instructions + Context Card + the specific listing details. The model processes everything, generates a response, and caches the computation for the static prefix (your system instructions and Context Card).
Every subsequent request: You send the same system instructions + Context Card + different listing details. The model recognizes the cached prefix, skips reprocessing it, and only computes the new, dynamic portion. Faster response. Lower cost.
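The two-step sequence above can be sketched in code. This is a minimal illustration of how a request body is structured so the static prefix is marked cacheable, following the shape of Anthropic's Messages API (`system` blocks with a `cache_control` marker); the prompt text, model name, and token counts are placeholder assumptions, not real values from this article.

```python
# Sketch: structure an API request so the static prefix (system
# instructions + Context Card) is cacheable and only the listing
# details change per call. All prompt text here is illustrative.

SYSTEM_PROMPT = "You are a luxury real estate copywriter..."  # static
CONTEXT_CARD = "Agent: ...\nMarket: ...\nTone: ..."           # static

def build_request(listing_details: str) -> dict:
    """Build a request body in the Anthropic Messages API shape.
    The static prefix carries a cache_control breakpoint; the user
    message holds the dynamic, per-listing portion."""
    return {
        "model": "claude-sonnet-4-20250514",  # example model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT + "\n\n" + CONTEXT_CARD,
                # Everything up to this breakpoint is cacheable:
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {"role": "user", "content": listing_details},  # dynamic
        ],
    }

# Two requests share a character-identical prefix, so the second
# one can hit the cache; only the user message differs:
req_a = build_request("3 bed / 2 bath ranch, 1,850 sqft, Arcadia")
req_b = build_request("5 bed / 6 bath contemporary, Paradise Valley")
assert req_a["system"] == req_b["system"]
```

The key design point: the `system` block never changes between calls, which is exactly the "identical, character for character" prefix the cache requires.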
What Counts as a Cacheable Prefix?
The prefix must be identical, character for character, from the beginning of your prompt. You can't cache a paragraph in the middle. It works front-to-back: system prompt first, then any static instructions, then your variable content at the end. This is why well-structured prompts with a consistent system message and a variable user message are ideal for caching.
For real estate workflows, the pattern is natural. Your Context Card sits at the top (cacheable), your formatting instructions follow (cacheable), and the property-specific details go at the end (dynamic). The architecture of good prompting and efficient caching are the same thing.
Cache Lifetime and Expiration
Caches don't last forever. Anthropic's cache has a 5-minute TTL (time to live) that refreshes with each hit. OpenAI's caching is automatic and lasts 5-10 minutes of inactivity. Google's Gemini API offers explicit cache creation with configurable TTLs. If you're running a batch of 50 listings in sequence, the cache stays warm the entire time. If you run one prompt, wait an hour, and run another, you'll pay full price again.
Prompt Caching Across Providers: Cost and Feature Comparison
| Feature | Anthropic (Claude) | OpenAI (GPT-4o) | Google (Gemini) |
|---|---|---|---|
| Cache discount | 90% off cached input tokens | 50% off cached input tokens | 75% off cached input tokens |
| Setup required | Explicit — add cache_control breakpoints | Automatic — no code changes | Explicit — create cached content via API |
| Min prefix length | 1,024 tokens (Sonnet/Opus) | 1,024 tokens | 32,768 tokens |
| Cache TTL | 5 min (refreshes on each hit) | 5-10 min automatic | Configurable (minutes to hours) |
| Latency improvement | Up to 85% faster (TTFT) | Up to 80% faster | Significant for large contexts |
| Cache write cost | 25% premium on first write | No additional cost | Varies by model |
| Best for real estate | Bulk operations with detailed Context Cards | Any repeated prompt pattern | Large document analysis (CMAs, leases) |
Prompt caching features and pricing across the three leading AI providers. Anthropic offers the deepest discount (90%) but requires explicit setup. OpenAI's caching is automatic. Google requires the largest minimum prefix but offers configurable cache duration.
Real Estate Use Cases: Where Caching Saves the Most
1. Bulk Listing Descriptions
You have 15 new listings to write descriptions for. Your prompt includes a 1,200-token Context Card (your brand voice, market area, formatting rules) plus property-specific details for each listing. Without caching, you pay full price for that Context Card 15 times. With caching, you pay once and get the discount on the other 14. At Anthropic's 90% discount, those 14 cached calls cost roughly what one uncached call would.
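The savings claim above can be checked with back-of-the-envelope arithmetic. This sketch uses the article's token counts plus Anthropic-style rates (25% write premium, 90% read discount); the per-token price is an assumed placeholder, so treat the dollar figures as relative, not quoted pricing:

```python
# Back-of-the-envelope cost for the 15-listing batch: 1,200-token
# Context Card (static) + ~200 tokens of property details per listing.
# PRICE is an illustrative assumption, not a published rate.

PREFIX_TOKENS = 1200       # Context Card + instructions (cacheable)
LISTING_TOKENS = 200       # per-property details (dynamic)
PRICE = 3.00 / 1_000_000   # assumed $3 per million input tokens
N = 15

uncached = N * (PREFIX_TOKENS + LISTING_TOKENS) * PRICE

cached = (
    PREFIX_TOKENS * 1.25 * PRICE               # call 1: 25% write premium
    + (N - 1) * PREFIX_TOKENS * 0.10 * PRICE   # 14 reads at 90% off
    + N * LISTING_TOKENS * PRICE               # dynamic part, full price
)

# The 14 cached reads together cost 14 * 0.10 = 1.4x one uncached
# prefix -- "roughly what one uncached call would," as stated above.
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```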
2. Lead Response Automation
You're using the API to auto-draft responses to incoming leads. Every response uses the same system prompt: your tone, your market expertise, your call-to-action preferences. New leads come in throughout the day — each one triggers the same prefix with different lead details appended. Caching keeps your prefix warm as long as leads keep flowing.
3. Market Report Generation
Monthly market updates for different neighborhoods, all using the same report structure and formatting instructions. The template is cached; only the neighborhood data and stats change. This is one of the 5 Essentials — consistent output from consistent input.
4. ChatGPT Projects and Claude Projects (Caching You Already Use)
Here's something most agents don't realize: if you use ChatGPT Projects or Claude Projects, you're already benefiting from a form of caching. When you set Project Instructions in ChatGPT or a System Prompt in a Claude Project, those instructions get prepended to every conversation in that project. The platform handles the caching behind the scenes — your project instructions don't get fully reprocessed every time you start a new chat.
This is why setting up project instructions isn't just about consistency — it's about efficiency. The platform is optimized for exactly this pattern: static context up front, dynamic conversation below.
Example: Cacheable Prompt Structure for Listing Descriptions
# SYSTEM PROMPT (cached — same for every listing)
You are a luxury real estate copywriter for the Phoenix metro area.
Brand voice: Confident, knowledgeable, never salesy.
Format: 150-200 words. Lead with the lifestyle, then features.
Always include: neighborhood context, architectural style, one emotional hook.
Never use: "boasts," "nestled," "stunning."

# CONTEXT CARD (cached — same for every listing)
Agent: Sarah Chen, Russ Lyon Sotheby's International Realty
Market: Paradise Valley, Scottsdale, Arcadia
Specialty: Luxury homes $1M+
Tone: Sophisticated but approachable

# PROPERTY DETAILS (dynamic — changes per listing)
Address: 6234 E Cactus Wren Rd, Paradise Valley, AZ 85253
Beds/Baths: 5/6
SqFt: 7,200
Lot: 1.1 acres
Style: Contemporary desert modern
Key features: Infinity pool, mountain views, chef's kitchen, home theater, 4-car garage
List price: $4,250,000
When Caching Helps vs. When It Doesn't
Prompt caching isn't a universal cost saver. It shines in specific scenarios and does nothing in others. Understanding the difference keeps you from over-engineering your workflow.
Caching Works Best When:
You're making multiple API calls with the same prefix in a short window. Batch listing descriptions, sequential lead responses, iterating on a market report — any workflow where the same system prompt hits the API repeatedly within minutes. The more calls, the bigger the savings.
Your static prefix is large relative to your dynamic content. A 1,500-token Context Card with a 200-token property description means ~88% of your prompt is cacheable. A 100-token system prompt with a 2,000-token document to analyze means only ~5% is cacheable — minimal savings.
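The two scenarios above reduce to one ratio: prefix tokens over total prompt tokens. A quick check of the article's numbers:

```python
# Cacheable fraction = static prefix / total prompt size.

def cacheable_fraction(prefix_tokens: int, dynamic_tokens: int) -> float:
    return prefix_tokens / (prefix_tokens + dynamic_tokens)

big_prefix = cacheable_fraction(1500, 200)    # Context Card workflow
small_prefix = cacheable_fraction(100, 2000)  # document-analysis workflow

assert round(big_prefix, 2) == 0.88   # ~88% of the prompt is cacheable
assert round(small_prefix, 2) == 0.05  # only ~5% -- minimal savings
```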
You're using the API directly. If you're building automations with the OpenAI API or the Anthropic API, caching is available and often automatic.
Caching Doesn't Help When:
You're using the chat interface for one-off conversations. If you open ChatGPT, ask one question, and close it, there's nothing to cache. Caching requires repeated, similar requests.
Your prompts vary completely each time. If every request has a different system prompt, there's no reusable prefix. Caching requires consistency in the front portion of your prompt.
Your requests are hours apart. Cache TTLs are typically 5-10 minutes. If you run one prompt at 9am and another at 2pm, the cache has expired. Batch your operations for maximum benefit.
The connection to tokens and costs is direct: caching reduces the effective token count you're billed for. Understanding how tokens work helps you estimate exactly how much you'll save.
Prompt Caching Quick-Start Checklist
- ☑ Structure your prompts correctly: Static system instructions first, then Context Card, then dynamic content last. The cacheable portion must be at the front.
- ☑ For Anthropic: Add `cache_control: {"type": "ephemeral"}` breakpoints in your API messages to mark where caching should apply.
- ☑ For OpenAI: No changes needed. Caching is automatic for prompts with 1,024+ token prefixes. Just structure your prompts well.
- ☑ Batch your operations: Run similar tasks in sequence, not spread across the day. Cache TTLs are 5-10 minutes — keep the cache warm.
- ☑ Monitor your usage: Check `cached_tokens` in the API response to verify caching is active. If it shows zero, your prefix may be too short or may have changed between requests.
- ☑ Build a master Context Card: Use AI Acceleration's Context Card framework to create a consistent, reusable system prompt — it's the ideal cacheable prefix.
- ☑ Calculate your break-even: Anthropic charges a 25% write premium on first cache creation, but a single cached read already puts you ahead: two calls cost 1.25× + 0.10× = 1.35× the prefix price, versus 2× uncached. After that, every cached read saves 90%.
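The monitoring step in the checklist can be sketched as a small helper. The response dicts below are fabricated examples of the shape OpenAI returns (`usage.prompt_tokens_details.cached_tokens`), included only to show where to look; they are not real API output, and Anthropic reports the equivalent under `usage.cache_read_input_tokens`.

```python
# Sketch: verify caching is active by reading the usage block of an
# OpenAI-style response. A zero here means the prefix was too short
# or changed between requests.

def cached_token_count(response: dict) -> int:
    """Pull the cached-token count out of an OpenAI-style response."""
    usage = response.get("usage", {})
    return usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)

# Fabricated response fragments, for illustration only:
warm = {"usage": {"prompt_tokens": 1400,
                  "prompt_tokens_details": {"cached_tokens": 1280}}}
cold = {"usage": {"prompt_tokens": 1400,
                  "prompt_tokens_details": {"cached_tokens": 0}}}

assert cached_token_count(warm) == 1280  # cache hit -- savings applied
assert cached_token_count(cold) == 0     # no caching -- check your prefix
```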