
Prompt Caching Explained: Save Money on AI Without Losing Quality

Ryan Wanner

AI Systems Instructor • Real Estate Technologist

If you're running AI prompts through the API — bulk listing descriptions, lead responses, market reports — you're probably paying full price every single time. Prompt caching fixes that. It stores your reusable instructions so the model skips reprocessing them, cutting costs by up to 90% and making responses noticeably faster.

What Is Prompt Caching and Why Should You Care?

Every time you send a prompt to an AI model through the API, the model processes every single token from scratch. Your system instructions, your Context Card, your formatting rules — all of it gets recomputed on every request. That's wasteful when 80% of your prompt is identical across hundreds of API calls.

Prompt caching solves this. It stores the pre-computed results of your static prompt prefix — the part that doesn't change between requests — so the model can skip reprocessing it. Instead of paying full price for 2,000 tokens of system instructions on every call, you pay full price once and get a 50-90% discount on those tokens for every subsequent request.

Think of it like a real estate transaction coordinator who memorizes your standard checklists. The first time, they read every line. After that, they already know the routine and jump straight to what's unique about each deal. Same quality, fraction of the time.

This matters because AI API costs scale with volume. An agent running 50 listing descriptions through the API with a detailed Context Card might spend $3-5 in tokens. With caching enabled, that same batch drops to $0.50-1.50. Multiply that across a team of 20 agents running daily operations, and the savings are substantial.
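
The arithmetic behind those estimates is easy to sanity-check yourself. The sketch below models input-token cost only, with placeholder rates; real pricing varies by provider and model, and the dollar figures above also include output tokens and heavier prompts:

```python
# Rough cost model for a batch of API calls sharing one cached prefix.
# The per-million-token price, discount, and write premium are illustrative
# placeholders -- check your provider's current pricing page.

def batch_cost(calls, prefix_tokens, dynamic_tokens,
               price_per_mtok=3.00, cache_discount=0.90, write_premium=0.25):
    """Estimate input-token cost in USD for a batch sharing one static prefix."""
    rate = price_per_mtok / 1_000_000
    uncached = calls * (prefix_tokens + dynamic_tokens) * rate
    # First call writes the cache (prefix billed at a premium);
    # the remaining calls read the prefix at the cached discount.
    first = (prefix_tokens * (1 + write_premium) + dynamic_tokens) * rate
    rest = (calls - 1) * (prefix_tokens * (1 - cache_discount) + dynamic_tokens) * rate
    return uncached, first + rest

no_cache, with_cache = batch_cost(calls=50, prefix_tokens=1500, dynamic_tokens=300)
print(f"without caching: ${no_cache:.2f}, with caching: ${with_cache:.2f}")
```

Plug in your own prefix size and call volume; the savings grow with both.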

According to NAR's 2025 Technology Survey, 68% of Realtors now use AI tools. Most use the chat interface. But the power users — the ones running bulk operations, automating lead responses, and building custom workflows — are the ones who benefit most from understanding caching.

How Prompt Caching Works (Without the Computer Science Degree)

The Technical Reality, Simplified

When you send a prompt to an AI model, the model converts your text into tokens and processes them through its neural network. This processing takes compute time and costs money. Prompt caching works by saving the intermediate computational state after processing your static prefix — everything before the dynamic part of your prompt.

Here's the sequence:

First request: You send your full prompt — system instructions + Context Card + the specific listing details. The model processes everything, generates a response, and caches the computation for the static prefix (your system instructions and Context Card).

Every subsequent request: You send the same system instructions + Context Card + different listing details. The model recognizes the cached prefix, skips reprocessing it, and only computes the new, dynamic portion. Faster response. Lower cost.

What Counts as a Cacheable Prefix?

The prefix must be identical, character for character, from the beginning of your prompt. You can't cache a paragraph in the middle. It works front-to-back: system prompt first, then any static instructions, then your variable content at the end. This is why well-structured prompts with a consistent system message and a variable user message are ideal for caching.

For real estate workflows, the pattern is natural. Your Context Card sits at the top (cacheable), your formatting instructions follow (cacheable), and the property-specific details go at the end (dynamic). The architecture of good prompting and efficient caching are the same thing.
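
Because reuse depends on a character-for-character identical prefix, it's worth checking that your templated prompts actually share one. A small sanity check (the prompt strings here are illustrative):

```python
# Quick check that two prompts share a byte-identical leading span --
# even one changed character (a date, a stray space) ends the cacheable prefix.

import os

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the identical leading span of two prompt strings."""
    return len(os.path.commonprefix([a, b]))

prompt_a = "SYSTEM: brand voice rules...\nListing: 123 Main St"
prompt_b = "SYSTEM: brand voice rules...\nListing: 456 Oak Ave"
print(shared_prefix_len(prompt_a, prompt_b))  # span shared before the listings diverge
```

If the shared span is shorter than you expected, something in your "static" section is quietly varying between requests.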

Cache Lifetime and Expiration

Caches don't last forever. Anthropic's cache has a 5-minute TTL (time to live) that refreshes with each hit. OpenAI's caching is automatic and typically expires after 5-10 minutes of inactivity. Google's Gemini API offers explicit cache creation with configurable TTLs. If you're running a batch of 50 listings in sequence, the cache stays warm the entire time. If you run one prompt, wait an hour, and run another, you'll pay full price again.

Prompt Caching Across Providers: Cost and Feature Comparison

| Feature | Anthropic (Claude) | OpenAI (GPT-4o) | Google (Gemini) |
| --- | --- | --- | --- |
| Cache discount | 90% off cached input tokens | 50% off cached input tokens | 75% off cached input tokens |
| Setup required | Explicit — add cache_control breakpoints | Automatic — no code changes | Explicit — create cached content via API |
| Min prefix length | 1,024 tokens (Sonnet/Opus) | 1,024 tokens | 32,768 tokens |
| Cache TTL | 5 min (refreshes on each hit) | 5-10 min of inactivity | Configurable (minutes to hours) |
| Latency improvement | Up to 85% faster (time to first token) | Up to 80% faster | Significant for large contexts |
| Cache write cost | 25% premium on first write | No additional cost | Varies by model |
| Best for real estate | Bulk operations with detailed Context Cards | Any repeated prompt pattern | Large document analysis (CMAs, leases) |

Prompt caching features and pricing across the three leading AI providers. Anthropic offers the deepest discount (90%) but requires explicit setup. OpenAI's caching is automatic. Google requires the largest minimum prefix but offers configurable cache duration.

Real Estate Use Cases: Where Caching Saves the Most

1. Bulk Listing Descriptions

You have 15 new listings to write descriptions for. Your prompt includes a 1,200-token Context Card (your brand voice, market area, formatting rules) plus property-specific details for each listing. Without caching, you pay full price for that Context Card 15 times. With caching, you pay once and get the discount on the other 14. At Anthropic's 90% discount, those 14 cached calls cost roughly what one uncached call would.
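
A quick back-of-the-envelope check on that claim, counting prefix tokens only (the 25% write premium and 90% read discount are Anthropic's published rates; the 1,200-token prefix is the example's):

```python
# Prefix-token billing for the 15-listing batch: with a 90% discount,
# the 14 cached reads together cost about 1.4 uncached reads' worth.

prefix_tokens = 1200
uncached_prefix_cost = 15 * prefix_tokens            # token-units billed with no caching
cached_prefix_cost = (1.25 * prefix_tokens           # first call: 25% write premium
                      + 14 * 0.10 * prefix_tokens)   # 14 reads at 90% off
print(cached_prefix_cost / uncached_prefix_cost)     # fraction of the uncached prefix bill
```

That works out to roughly 18% of the uncached prefix cost, an ~82% saving on the repeated portion of the prompt.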

2. Lead Response Automation

You're using the API to auto-draft responses to incoming leads. Every response uses the same system prompt: your tone, your market expertise, your call-to-action preferences. New leads come in throughout the day — each one triggers the same prefix with different lead details appended. Caching keeps your prefix warm as long as leads keep flowing.

3. Market Report Generation

Monthly market updates for different neighborhoods, all using the same report structure and formatting instructions. The template is cached; only the neighborhood data and stats change. This is one of the 5 Essentials — consistent output from consistent input.

4. ChatGPT Projects and Claude Projects (Caching You Already Use)

Here's something most agents don't realize: if you use ChatGPT Projects or Claude Projects, you're already benefiting from a form of caching. When you set Project Instructions in ChatGPT or a System Prompt in a Claude Project, those instructions get prepended to every conversation in that project. The platform handles the caching behind the scenes — your project instructions don't get fully reprocessed every time you start a new chat.

This is why setting up project instructions isn't just about consistency — it's about efficiency. The platform is optimized for exactly this pattern: static context up front, dynamic conversation below.

Example: Cacheable Prompt Structure for Listing Descriptions

Prompt
# SYSTEM PROMPT (cached — same for every listing)
You are a luxury real estate copywriter for the Phoenix metro area.
Brand voice: Confident, knowledgeable, never salesy.
Format: 150-200 words. Lead with the lifestyle, then features.
Always include: neighborhood context, architectural style, one
emotional hook. Never use: "boasts," "nestled," "stunning."

# CONTEXT CARD (cached — same for every listing)
Agent: Sarah Chen, Russ Lyon Sotheby's International Realty
Market: Paradise Valley, Scottsdale, Arcadia
Specialty: Luxury homes $1M+
Tone: Sophisticated but approachable

# PROPERTY DETAILS (dynamic — changes per listing)
Address: 6234 E Cactus Wren Rd, Paradise Valley, AZ 85253
Beds/Baths: 5/6
SqFt: 7,200
Lot: 1.1 acres
Style: Contemporary desert modern
Key features: Infinity pool, mountain views, chef's kitchen,
  home theater, 4-car garage
List price: $4,250,000
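
Here is how that structure might map onto an Anthropic-style Messages API request body, with a cache_control breakpoint closing the static prefix. This is a sketch: the model id is a placeholder, the prefix is abbreviated, and you should verify the current cache_control syntax against Anthropic's prompt caching docs.

```python
# Sketch: the listing-description prompt above as an Anthropic Messages API
# request body. cache_control marks the end of the static, cacheable prefix.
# The model id is a placeholder, not a real identifier.

STATIC_PREFIX = """You are a luxury real estate copywriter for the Phoenix metro area.
Brand voice: Confident, knowledgeable, never salesy.
...
Agent: Sarah Chen, Russ Lyon Sotheby's International Realty"""

def listing_request(property_details: str) -> dict:
    return {
        "model": "claude-sonnet-placeholder",   # placeholder model id
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": STATIC_PREFIX,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }],
        "messages": [
            {"role": "user", "content": property_details},  # dynamic tail per listing
        ],
    }

req = listing_request("Address: 6234 E Cactus Wren Rd...\nBeds/Baths: 5/6")
```

Because only the user message changes per listing, every request after the first reads the system block from cache.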

When Caching Helps vs. When It Doesn't

Prompt caching isn't a universal cost saver. It shines in specific scenarios and does nothing in others. Understanding the difference keeps you from over-engineering your workflow.

Caching Works Best When:

You're making multiple API calls with the same prefix in a short window. Batch listing descriptions, sequential lead responses, iterating on a market report — any workflow where the same system prompt hits the API repeatedly within minutes. The more calls, the bigger the savings.

Your static prefix is large relative to your dynamic content. A 1,500-token Context Card with a 200-token property description means ~88% of your prompt is cacheable. A 100-token system prompt with a 2,000-token document to analyze means only ~5% is cacheable — minimal savings.

You're using the API directly. If you're building automations with the OpenAI API or the Anthropic API, caching is available and often automatic.

Caching Doesn't Help When:

You're using the chat interface for one-off conversations. If you open ChatGPT, ask one question, and close it, there's nothing to cache. Caching requires repeated, similar requests.

Your prompts vary completely each time. If every request has a different system prompt, there's no reusable prefix. Caching requires consistency in the front portion of your prompt.

Your requests are hours apart. Cache TTLs are typically 5-10 minutes. If you run one prompt at 9am and another at 2pm, the cache has expired. Batch your operations for maximum benefit.

The connection to tokens and costs is direct: caching reduces the effective token count you're billed for. Understanding how tokens work helps you estimate exactly how much you'll save.

Prompt Caching Quick-Start Checklist

  • Structure your prompts correctly: Static system instructions first, then Context Card, then dynamic content last. The cacheable portion must be at the front.
  • For Anthropic: Add cache_control: {"type": "ephemeral"} breakpoints in your API messages to mark where caching should apply.
  • For OpenAI: No changes needed. Caching is automatic for prompts with 1,024+ token prefixes. Just structure your prompts well.
  • Batch your operations: Run similar tasks in sequence, not spread across the day. Cache TTLs are 5-10 minutes — keep the cache warm.
  • Monitor your usage: Check cached_tokens in the API response to verify caching is active. If it shows zero, your prefix may be too short or changed between requests.
  • Build a master Context Card: Use AI Acceleration's Context Card framework to create a consistent, reusable system prompt — it's the ideal cacheable prefix.
  • Calculate your break-even: Anthropic charges a 25% write premium on the first cache creation, but a single 90%-off cached read more than recovers it. Every call after the first write is net savings.
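
To make the "monitor your usage" step concrete, here is a small helper that reads the cached-token count out of a usage block. The field names follow OpenAI's chat completions response shape (prompt_tokens_details.cached_tokens); Anthropic reports cache_read_input_tokens instead, so adjust for your provider. The usage dict below is mocked, not a real response:

```python
# Verifying that caching is active by inspecting the usage block an API
# response returns. Field names assume OpenAI's chat completions shape.

def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens served from cache (0.0 means no caching)."""
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

# Mocked usage block standing in for a real API response:
usage = {"prompt_tokens": 1800, "prompt_tokens_details": {"cached_tokens": 1536}}
print(f"{cache_hit_ratio(usage):.0%} of the prompt was served from cache")
```

If the ratio stays at zero across a batch, your prefix is probably below the minimum length or isn't byte-identical between requests.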

Sources

  1. Anthropic — Prompt Caching documentation (cache_control, TTL, pricing)
  2. OpenAI — Prompt Caching guide (automatic caching, 50% discount)
  3. Google — Gemini API context caching documentation
  4. Anthropic — Claude API pricing (cached token rates)
  5. OpenAI — API pricing (cached input token rates)
  6. NAR — 68% of Realtors use AI tools (2025 Technology Survey)

Frequently Asked Questions

What is prompt caching in AI?
Prompt caching is a feature offered by AI API providers (Anthropic, OpenAI, Google) that stores the pre-computed results of your static prompt prefix so it doesn't need to be reprocessed on every API call. When you send the same system instructions repeatedly — like a Context Card for listing descriptions — caching means the model skips reprocessing that text and only computes the new, dynamic portion. This reduces costs by 50-90% depending on the provider and speeds up response times significantly.
How much money does prompt caching actually save?
Savings depend on your provider and usage volume. Anthropic offers a 90% discount on cached input tokens, OpenAI offers 50%, and Google offers approximately 75%. For a real estate team making 100 API calls per day with a 1,500-token Context Card prefix, that translates to roughly $45-90/month savings with Anthropic or $20-45/month with OpenAI. The larger your cacheable prefix and the more calls you make, the greater the savings.
Do I need to do anything special to enable prompt caching?
It depends on the provider. OpenAI's prompt caching is automatic — any prompt with a prefix of 1,024+ tokens gets cached with no code changes. Anthropic requires you to explicitly add cache_control breakpoints in your API messages to mark where caching applies. Google's Gemini API requires creating cached content objects via a separate API call. For ChatGPT and Claude chat interfaces, caching of project instructions happens automatically behind the scenes.
Does prompt caching affect the quality of AI responses?
No. Prompt caching is purely an optimization for cost and speed. The model produces the exact same output whether the prefix is processed fresh or loaded from cache. The cached computation is mathematically identical to reprocessing — the model is just skipping redundant work. You get the same listing descriptions, the same lead responses, the same quality. Just faster and cheaper.
How long does a prompt cache last before it expires?
Cache TTLs (time to live) vary by provider. Anthropic's cache lasts 5 minutes but refreshes every time it's used — so continuous usage keeps it warm indefinitely. OpenAI's automatic cache persists for 5-10 minutes of inactivity. Google's Gemini allows you to set custom TTLs when creating cached content. The practical implication: batch your similar operations together rather than spreading them across the day.
Can I use prompt caching with ChatGPT or Claude's chat interface?
Not directly — prompt caching is an API feature. However, if you use ChatGPT Projects or Claude Projects, the platform applies a form of caching to your Project Instructions automatically. Those static instructions get prepended to every conversation and the platform optimizes their processing behind the scenes. For explicit control over caching behavior and the full cost savings, you need to use the API directly.
What's the minimum prompt size needed for caching to work?
Anthropic requires a minimum of 1,024 tokens for the cacheable prefix (for Claude Sonnet and Opus models). OpenAI also requires at least 1,024 tokens. Google's Gemini has a much higher minimum of 32,768 tokens, making it suited for caching large documents rather than short system prompts. A typical real estate Context Card with detailed brand voice, market info, and formatting rules easily exceeds 1,024 tokens.
