A support agent I audited last quarter was answering the same three onboarding questions roughly 600 times a day, paying for a fresh embedding lookup plus a full model call every single time. The answers were identical. The bill was not. Caching should have cut the model spend by more than half, except the team had wired a naive exact-match cache that almost never hit, because users phrased the same question forty different ways. That gap, between a cache that technically exists and a cache that actually earns its keep, is where most agent knowledge bases quietly bleed money and latency. Picking the right caching library matters less than understanding which of four caches you are even trying to build.

An agent knowledge base has four cache layers, not one

The mistake I see most often is treating "caching" as a single decision. In a retrieval-augmented agent, there are at least four distinct things worth caching, and they have different economics, failure modes, and invalidation rules.

  • Prompt cache. Reuse of the static prefix of a prompt (system instructions, tool schemas, few-shot examples) so the model does not re-process the same tokens. This is mostly a provider-side feature now, and it is the safest cache you will ever run because the model still generates a fresh answer.
  • Embedding cache. Storing the vector for a document chunk or a query so you never re-encode identical text. Cheap, boring, and almost risk-free. If you are re-embedding the same knowledge base on every ingest, you are lighting money on fire.
  • Semantic response cache. Returning a previously generated answer when a new query is semantically close to an old one. This is where the real savings live, and also where the real danger lives.
  • Tool and result cache. Memoizing the output of a deterministic tool call (a SQL query, an API lookup, a computed aggregation) for some TTL. Useful, but freshness-sensitive.
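
To make the embedding-cache layer concrete, here is a minimal sketch that keys each vector by a hash of the exact text. It is an illustration under assumptions, not any library's implementation: `EmbeddingCache` and `fake_embed` are hypothetical names, the store is an in-memory dict rather than something persisted, and the encoder is a placeholder for a real embedding model.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


class EmbeddingCache:
    """Stores one vector per distinct text, keyed by a hash of the exact text."""

    def __init__(self, embed_fn: Callable[[str], Vector]) -> None:
        self._embed_fn = embed_fn              # the expensive encoder call
        self._store: Dict[str, Vector] = {}    # in-memory here; persist it in practice
        self.misses = 0                        # counts real encoder calls

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Vector:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text: str) -> Vector:
    # Placeholder encoder (assumption): stands in for a real embedding model.
    return [float(len(text)), float(text.count(" "))]


embedding_cache = EmbeddingCache(fake_embed)
chunks = ["Reset your password from Settings.", "Billing runs monthly."]
for _ in range(3):  # simulate three re-ingests of the same, unchanged chunks
    vectors = [embedding_cache.get(chunk) for chunk in chunks]

print(embedding_cache.misses)  # 2: each distinct chunk is encoded once
```

If you switch embedding models, the model name would also need to be part of the key, since vectors from different models are not interchangeable.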

The Spheron write-up on semantic caching for LLM inference is a decent tour of the prompt-cache and semantic-cache mechanics if you want the low-level view. The AWS Database team also has a practical breakdown of caching for LLM cost and latency that treats the response cache as an infrastructure concern rather than a framework toggle, which is the right mental model once you are past a prototype.

The key takeaway: before you choose a library, decide which layer you are building. A tool that is great for the embedding cache can be a liability as your semantic response cache.

The library shortlist, and what each is actually good at

Here is my honest read on the common options, based on what breaks under load rather than what the README promises.

Redis (with RedisVL or LangCache). Redis is the default answer for a shared, low-latency cache that multiple agent workers hit at once. It gives you an in-memory key-value store for exact matches and, through RedisVL and the managed LangCache layer, vector similarity search for semantic caching. The reason to reach for Redis is not novelty but operational maturity: TTLs, eviction policies, clustering, and observability that already exist in your stack. If you are running production traffic across more than one process, a shared Redis cache beats an in-process cache that each worker has to warm separately. The cost is that you are now running infrastructure and tuning a similarity threshold.

GPTCache. GPTCache is the fastest path to a working semantic cache in a Python-native stack. It bundles an embedding function, a vector store, and a similarity evaluator behind one interface, and it plugs into LangChain and LlamaIndex directly. If you want a semantic response cache running this afternoon without standing up separate services, this is the one. The tradeoff is that its default configuration is permissive, and the components it ships with are aimed at getting you started, not at multi-tenant production. Treat the defaults as a demo, not a deployment.

LangChain built-in caches. LangChain ships exact-match caches (in-memory, SQLite, Redis-backed) that intercept model calls transparently. These are genuinely useful during development, where you are running the same prompts repeatedly and want to stop paying for identical calls. The in-memory and SQLite variants are a one-line win for local iteration. Do not confuse them with a production semantic cache; the exact-match variants miss on paraphrasing, and the pattern is best for deterministic, repeated prompts.
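
The paraphrase miss is easy to reproduce with a hand-rolled exact-match cache. The sketch below is not LangChain's implementation; `exact_key` is a hypothetical helper, the cached answer is placeholder text, and the only normalization assumed is lowercasing and whitespace collapsing.

```python
import hashlib
from typing import Dict


def exact_key(prompt: str) -> str:
    # Light normalization (assumption): lowercase and collapse whitespace.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


exact_cache: Dict[str, str] = {}
exact_cache[exact_key("How do I reset my password?")] = "placeholder answer"

same_text_hit = exact_key("  how do I reset my   password? ") in exact_cache
paraphrase_hit = exact_key("I forgot my password, how can I change it?") in exact_cache

print(same_text_hit)    # True: same words, different casing and spacing
print(paraphrase_hit)   # False: same intent, different words
```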

MongoDB Atlas semantic cache. If your documents and application state already live in MongoDB Atlas, its vector search can double as a semantic cache, which removes a moving part. The pitch is consolidation: one system for documents, embeddings, and cached answers. The PingCAP comparison of databases for AI agents is a fair place to weigh this against Redis and dedicated vector stores when you are choosing the substrate rather than a bolt-on cache. Pick this when your gravity is already in the database, not when you are optimizing raw cache latency.

In-process LRU or SQLite. For a prototype, a plain functools.lru_cache or a SQLite table is completely fine and arguably correct. Zero infrastructure, trivial to reason about, easy to throw away. The moment you scale past one process or need TTLs and semantic matching, you will outgrow it, but shipping an LRU on day one to validate that caching even helps is smarter than architecting a distributed cache for traffic you do not have yet.
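
A day-one prototype along those lines might look like the sketch below. `call_model` is a hypothetical stand-in for the paid model call; the useful part is `cache_info()`, which gives you hit and miss counts to check whether caching helps at all.

```python
from functools import lru_cache
from typing import List

model_calls: List[str] = []  # records every real (uncached) model call


def call_model(question: str) -> str:
    # Hypothetical stand-in for the real, paid model call.
    model_calls.append(question)
    return f"answer to: {question}"


@lru_cache(maxsize=1024)
def cached_answer(question: str) -> str:
    return call_model(question)


for _ in range(5):
    cached_answer("How do I invite a teammate?")

stats = cached_answer.cache_info()
print(len(model_calls), stats.hits, stats.misses)  # 1 4 1
```

Note that `lru_cache` has no TTL and is per-process, which is exactly the limit described above.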

For a broader survey of the current semantic-cache tooling, Maxim AI's roundup of semantic caching solutions covers the newer entrants. And if you want the systems-thinking framing on where caching sits among evals, retrieval, and guardrails, Eugene Yan's patterns for building LLM systems is the reference I hand to engineers who are trying to see the whole board.

A decision matrix that starts from traffic, not tools

Choose the cache by the shape of your problem, then map to a library. These five dimensions decide almost everything.

| Dimension | Low pressure | High pressure | What it pushes you toward |
|---|---|---|---|
| Traffic pattern | Unique, long-tail queries | Repeated, clustered queries | High repetition justifies a semantic response cache; long-tail favors embedding cache only |
| Freshness risk | Answer stable for weeks | Answer changes hourly | High risk means short TTLs, tool-result cache over response cache |
| Multi-tenant permissions | Single tenant, public data | Many tenants, filtered data | Permissioned data forces tenant-scoped keys, rules out global response cache |
| Invalidation needs | Rarely changes | Source docs churn | Frequent churn needs event-driven invalidation tied to ingest |
| Latency budget | Seconds acceptable | Sub-100ms required | Tight budgets favor in-memory Redis over a database-backed cache |

Read this as a filter, not a scorecard. A high-traffic, single-tenant FAQ agent with stable answers is the textbook case for a semantic response cache in Redis or GPTCache, and you will see a real drop in cost and p95 latency. A multi-tenant agent over permission-filtered documents that change daily is the case where an aggressive response cache is a liability, and you should cache embeddings and tool results while leaving generation live.
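
One way to read the matrix as a filter is to write it down as a small function. This is one possible encoding of the rules above, not a definitive policy: the `Workload` fields, the `recommend` function, and the exact rule for when a response cache counts as risky are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    repeated_queries: bool      # traffic pattern: clustered rather than long-tail
    answers_change_fast: bool   # freshness risk
    permission_filtered: bool   # multi-tenant, filtered data
    docs_churn: bool            # invalidation needs
    tight_latency: bool         # latency budget


def recommend(w: Workload) -> List[str]:
    plan = ["embedding cache"]  # low-risk in every case
    # Assumed rule: fast-changing answers, or permissioned data that churns,
    # make a response cache a liability.
    response_cache_risky = w.answers_change_fast or (w.permission_filtered and w.docs_churn)
    if w.repeated_queries and not response_cache_risky:
        scope = "tenant-scoped" if w.permission_filtered else "global"
        plan.append(f"semantic response cache ({scope} keys)")
    elif response_cache_risky:
        plan.append("tool-result cache with short TTLs, generation left live")
    if w.docs_churn:
        plan.append("event-driven invalidation tied to ingest")
    if w.tight_latency:
        plan.append("in-memory store such as Redis")
    return plan


faq_agent = Workload(repeated_queries=True, answers_change_fast=False,
                     permission_filtered=False, docs_churn=False, tight_latency=True)
multi_tenant = Workload(repeated_queries=True, answers_change_fast=False,
                        permission_filtered=True, docs_churn=True, tight_latency=False)

print(recommend(faq_agent))
print(recommend(multi_tenant))
```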

The decision that matters most is the intersection of freshness risk and permissions. Those two together determine whether a wrong cache hit is a minor annoyance or an incident.

The failure modes that will bite you in production

Caching failures are sneaky because the system looks faster and cheaper right up until it serves the wrong thing. Here are the ones that actually cause pages.

Wrong semantic hits from a loose threshold. Semantic caches return a stored answer when the new query is "close enough." Set the similarity threshold too low and "How do I cancel my plan?" returns the cached answer for "How do I change my plan?" These failures are invisible in aggregate metrics and only show up in complaints. Tune the threshold on real query logs, start conservative, and log every hit with its similarity score so you can audit borderline matches.
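
The cancel-versus-change failure can be reproduced in a few lines. The sketch below is a toy, not a production cache: `toy_embed` uses bag-of-words counts as a stand-in for a real embedding model, `SemanticCache` is a hypothetical class with a linear scan instead of a vector index, the cached answer is placeholder text, and the 0.8 and 0.9 thresholds only mean something for this toy embedding.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    # Assumption: bag-of-words counts stand in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[str, Counter, str]] = []   # (query, vector, answer)
        self.hit_log: List[Tuple[str, str, float]] = []     # (new query, matched query, score)

    def put(self, query: str, answer: str) -> None:
        self.entries.append((query, toy_embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        vec = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[1]), default=None)
        if best is None:
            return None
        score = cosine(vec, best[1])
        if score < self.threshold:
            return None  # miss: fall through to the model
        # Log every hit with its score so borderline matches can be audited.
        self.hit_log.append((query, best[0], round(score, 3)))
        return best[2]


loose = SemanticCache(threshold=0.8)
strict = SemanticCache(threshold=0.9)
for cache in (loose, strict):
    cache.put("How do I change my plan?", "placeholder: change-plan answer")

query = "How do I cancel my plan?"
loose_result = loose.get(query)    # wrong hit: the change-plan answer comes back
strict_result = strict.get(query)  # None: the stricter threshold rejects it
print(loose_result, strict_result, loose.hit_log)
```

The two questions share five of six words, so the toy similarity is about 0.83. That logged score is what lets you spot the bad match later.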

Permission leakage. This is the one that ends up in a security review. If your response cache is keyed only by query text, a cached answer generated for a user who can see a document gets served to a user who cannot. Any cache over permission-filtered retrieval must include the permission set (tenant, role, access group) in the key. Never cache a personalized or filtered response in a global namespace.
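
A minimal sketch of a permission-scoped key, assuming the permission set can be expressed as a tenant ID plus a set of access groups. `scoped_cache_key` and its parameters are hypothetical names; the point is only that the scope goes into the key alongside the query.

```python
import hashlib
import json
from typing import Iterable


def scoped_cache_key(tenant_id: str, access_groups: Iterable[str], query: str) -> str:
    # Sort and de-duplicate groups so the same permission set always yields the same key.
    groups = sorted(set(access_groups))
    normalized_query = " ".join(query.lower().split())
    # JSON keeps the three parts unambiguous even if a value contains a separator.
    raw = json.dumps([tenant_id, groups, normalized_query])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


question = "What is our refund policy?"
key_acme_staff = scoped_cache_key("acme", ["support", "billing"], question)
key_acme_staff_reordered = scoped_cache_key("acme", ["billing", "support"], question)
key_acme_limited = scoped_cache_key("acme", ["support"], question)
key_other_tenant = scoped_cache_key("globex", ["support", "billing"], question)

print(key_acme_staff == key_acme_staff_reordered)  # True: same permission set
print(key_acme_staff == key_acme_limited)          # False: different access groups
print(key_acme_staff == key_other_tenant)          # False: different tenant
```

In a vector-backed semantic cache there is no single hash key to look up, so the same scope would typically be enforced as a filter on the similarity search instead.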

Stale answers after source documents change. The cache returns yesterday's answer after someone updated the underlying policy doc this morning. This is not a threshold problem; it is an invalidation problem. Tie cache invalidation to your ingest pipeline: when a document changes, purge or version the cache entries derived from it. A cache with no invalidation story is a bug with a TTL.
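
One way to tie invalidation to ingest is to record, for each cached answer, which source documents it was derived from. The sketch below assumes the retrieval step can supply those document IDs; `InvalidatingCache` and `on_document_changed` are hypothetical names, and the store is an in-memory dict for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set


class InvalidatingCache:
    def __init__(self) -> None:
        self._answers: Dict[str, str] = {}
        # Reverse index: source document ID -> cache keys derived from it.
        self._keys_by_doc: Dict[str, Set[str]] = defaultdict(set)

    def put(self, key: str, answer: str, source_doc_ids: Iterable[str]) -> None:
        self._answers[key] = answer
        for doc_id in source_doc_ids:
            self._keys_by_doc[doc_id].add(key)

    def get(self, key: str) -> Optional[str]:
        return self._answers.get(key)

    def on_document_changed(self, doc_id: str) -> int:
        """Called by the ingest pipeline; returns how many entries were purged."""
        removed = 0
        for key in self._keys_by_doc.pop(doc_id, set()):
            if self._answers.pop(key, None) is not None:
                removed += 1
        return removed


doc_cache = InvalidatingCache()
doc_cache.put("q:refunds", "answer from policy doc", ["policy"])
doc_cache.put("q:refund-faq", "answer from policy and faq", ["policy", "faq"])
doc_cache.put("q:hours", "answer from faq", ["faq"])

purged = doc_cache.on_document_changed("policy")
print(purged)                          # 2
print(doc_cache.get("q:refunds"))      # None: derived from the changed doc
print(doc_cache.get("q:hours"))        # still cached: its source did not change
```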

Caching personalized responses as if they were shared. An answer that includes the user's account name, plan tier, or ticket history is not cacheable across users, full stop. Split your traffic: cache the generic, knowledge-base-derived portion, and generate the personalized wrapper live. Teams that skip this either leak data or cache nothing useful.
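
The split can be sketched as follows. Everything here is illustrative: `answer_with_split` and `generate_generic` are hypothetical names, the cached text is a placeholder, and a string template stands in for whatever live step produces the personalized wrapper in a real system.

```python
from typing import Callable, Dict, List


def answer_with_split(
    user_name: str,
    plan_tier: str,
    question: str,
    shared_cache: Dict[str, str],
    generate_generic: Callable[[str], str],
) -> str:
    # Only the generic, knowledge-base-derived part is ever stored.
    if question not in shared_cache:
        shared_cache[question] = generate_generic(question)
    generic = shared_cache[question]
    # The personalized wrapper is built per request and never cached.
    return f"Hi {user_name} (plan: {plan_tier}). {generic}"


generic_calls: List[str] = []


def generate_generic(question: str) -> str:
    # Hypothetical stand-in for the model call over shared knowledge.
    generic_calls.append(question)
    return "placeholder: generic export instructions"


shared: Dict[str, str] = {}
first = answer_with_split("Ada", "Pro", "How do I export data?", shared, generate_generic)
second = answer_with_split("Lin", "Free", "How do I export data?", shared, generate_generic)

print(first)
print(second)
print(len(generic_calls))  # 1: the generic part was generated once and reused
```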

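Cache stampede on expiry or cold start. When a popular entry expires, or the cache starts empty, every concurrent request misses at once and hammers the model in parallel. A shared cache such as Redis gives you one place to fix this, with request coalescing or a short-lived lock (for example, SET with NX and an expiry) so a single worker recomputes the entry while the others wait. Per-worker in-process LRUs cannot coordinate across processes, so each worker still pays for its own miss. If you have real concurrency, this is a reason to prefer a shared cache with stampede protection over per-worker caches.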
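
The coalescing idea is easiest to see inside a single process. The sketch below uses one lock per key so that only the first of many concurrent requests recomputes the entry. It is an in-process illustration only: `SingleFlightCache` and `slow_model_call` are hypothetical names, and across multiple workers the lock would have to live in the shared cache instead.

```python
import threading
import time
from typing import Callable, Dict, List


class SingleFlightCache:
    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check after acquiring the lock: another thread may have filled it.
            if key not in self._values:
                self._values[key] = compute()
            return self._values[key]


expensive_calls: List[int] = []


def slow_model_call() -> str:
    # Hypothetical stand-in for a slow, paid model call.
    expensive_calls.append(1)
    time.sleep(0.05)
    return "cached answer"


flight_cache = SingleFlightCache()
threads = [
    threading.Thread(target=flight_cache.get_or_compute, args=("faq:reset", slow_model_call))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(expensive_calls))  # 1: twenty concurrent misses, one model call
```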

The pattern across all five: a cache miss costs you money, but a wrong cache hit costs you trust. Design for the second failure, not the first.

What I would actually build

For a production agent knowledge base with meaningful traffic, this is the layout I default to, and it maps cleanly to the four layers.

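  • Let the model provider handle the prompt cache for static prompt prefixes. It is close to free performance (providers price cache writes and reads differently, so check the rates) and carries no correctness risk, because the model still generates a fresh answer.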
  • Run an embedding cache keyed by a hash of the exact chunk text, persisted so re-ingests do not re-encode unchanged content. This is the highest-return, lowest-risk cache you can add.
  • Add a semantic response cache in Redis (RedisVL or LangCache) with keys scoped by tenant and permission group, a threshold tuned on your own logs, and hit logging with similarity scores. Only cache answers derived from shared, non-personalized knowledge.
  • Memoize deterministic tool results with short, freshness-appropriate TTLs, and invalidate them on the events that change the underlying data.
  • Wire cache invalidation into the ingest pipeline so a document update purges the response and tool entries that depended on it. This is the piece most teams skip and later regret.
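
For the tool-result layer, a short-TTL memoizer with explicit invalidation could look like the sketch below. `TTLCache`, `lookup_open_tickets`, the key format, and the 60-second TTL are all hypothetical; the clock is injectable only so the example can simulate time passing without sleeping.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class TTLCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_call(self, key: str, fn: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh
        value = fn()                 # missing or expired: run the tool again
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Call this from the event that changes the underlying data.
        self._entries.pop(key, None)


fake_now = [0.0]                     # simulated clock for the example
tool_calls: List[int] = []


def lookup_open_tickets() -> int:
    # Hypothetical deterministic tool call (for example, a SQL count).
    tool_calls.append(1)
    return 7


tool_cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
tool_cache.get_or_call("open_tickets:acme", lookup_open_tickets)   # miss
tool_cache.get_or_call("open_tickets:acme", lookup_open_tickets)   # hit
fake_now[0] = 61.0
tool_cache.get_or_call("open_tickets:acme", lookup_open_tickets)   # expired, miss
tool_cache.invalidate("open_tickets:acme")
tool_cache.get_or_call("open_tickets:acme", lookup_open_tickets)   # invalidated, miss

print(len(tool_calls))  # 3
```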

Start smaller than this if caching is still unproven for your workload. Ship an in-process LRU or GPTCache first, measure your actual hit rate against real traffic, and graduate to shared Redis infrastructure only once the data says caching helps. Instrument hit rate, cost saved, and wrong-hit reports from day one, because a cache you cannot observe is a cache you cannot trust. This connects directly to the broader work of keeping agent latency budgets predictable and controlling the per-day cost of a production agent; caching is one of the few levers that improves both at once. It also sits alongside the larger question of agent memory beyond RAG and the patterns for building an agentic knowledge base, where the cache is one component of a system, not the system itself.
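
The three numbers worth instrumenting fit in a very small structure. This is a sketch with hypothetical names (`CacheStats`, `cost_per_model_call`), and the 0.5 cost figure is an arbitrary example value, not a real price; the actual per-call cost has to come from your own billing data.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    cost_per_model_call: float   # assumption: supplied from your own billing data
    hits: int = 0
    misses: int = 0
    wrong_hit_reports: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def report_wrong_hit(self) -> None:
        self.wrong_hit_reports += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def cost_saved(self) -> float:
        # Each hit is one model call that was not paid for.
        return self.hits * self.cost_per_model_call


cache_stats = CacheStats(cost_per_model_call=0.5)  # arbitrary example value
for was_hit in (True, True, False, True):
    cache_stats.record(was_hit)
cache_stats.report_wrong_hit()

print(cache_stats.hit_rate, cache_stats.cost_saved, cache_stats.wrong_hit_reports)  # 0.75 1.5 1
```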

How OpenNash Can Help

Most caching decisions go wrong not because a team picked the wrong library, but because they cached the wrong layer for their traffic and skipped invalidation. OpenNash builds production AI agents where caching is designed alongside the retrieval, permissions, and guardrails, not bolted on after the bill spikes. Our process maps to the same decision framework in this post: we audit your traffic pattern and freshness risk, design the cache keys and invalidation rules around your permission model, build and tune the thresholds against your real query logs, and hand off a system you own with the observability to catch wrong hits before your users do.

If your agent is re-paying for identical answers, or you are nervous that a semantic cache might serve one tenant's data to another, that is exactly the kind of failure mode we design against. Book a call to map this caching architecture to your workflow.

Pick your first cache by looking at one day of your query logs: count how many questions are near-duplicates, and how often the answers would go stale. If duplicates are high and staleness is low, add a semantic response cache this week. If not, cache embeddings, fix your invalidation story, and leave generation live until the traffic tells you otherwise.
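
A rough first pass at that count needs nothing beyond the standard library. The sketch below uses `difflib.SequenceMatcher` character similarity as a crude proxy, which catches casing, spacing, and punctuation variants but not true paraphrases, so treat the result as a lower bound. `near_duplicate_rate`, the 0.9 threshold, and the sample log are assumptions for illustration, and the pairwise comparison is only practical for small samples.

```python
from difflib import SequenceMatcher
from typing import List


def near_duplicate_rate(queries: List[str], threshold: float = 0.9) -> float:
    """Fraction of queries that closely match an earlier query in the log."""
    seen: List[str] = []
    duplicates = 0
    for query in queries:
        normalized = " ".join(query.lower().split())
        if any(SequenceMatcher(None, normalized, s).ratio() >= threshold for s in seen):
            duplicates += 1
        else:
            seen.append(normalized)
    return duplicates / len(queries) if queries else 0.0


sample_log = [
    "How do I reset my password?",
    "how do i reset my password",
    "What is your refund policy?",
    "How do I reset my password??",
]
rate = near_duplicate_rate(sample_log)
print(rate)  # 0.5: two of the four queries repeat an earlier one
```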