AI: Anthropic rolls out AI Memory Caching. RTZ #450
...so that AIs can remember more, re-process less, and be faster
From the very first post on AI: Reset to Zero, titled ‘AI: Memory of a Goldfish’, I’ve emphasized the urgent need to augment AI chatbots, applications and services with more memory, in all its forms. That was 450 daily posts ago today.
As surprising as it is, almost two years into the OpenAI ‘ChatGPT moment’, most of the major LLM AI systems we use still have the ‘memory of a goldfish’. They have to be reminded of who we are and what we’re asking about, over and over again.
In those discussions, the emphasis was on memory that remembers the end user’s needs, today and tomorrow, so as to better serve them with more Data, AI ‘Reasoning’, ‘Smart Agents’, and deeper ‘Agentic’ capabilities as usage grows.
But a different type of memory is also needed: more robust systems that ‘remember’ previous AI prompts coming through LLM AI ‘APIs’ (Application Programming Interfaces). These are often ‘machine to machine’ (m2m) LLM AI queries, fired in massive sequences to answer an end user’s request. Better memory on this front is urgently needed as well.
And that’s what Anthropic, the second largest LLM AI company, seems to be delivering with “Prompt Caching with Claude”, complete with examples:
“Prompt caching, which enables developers to cache frequently used context between API calls, is now available on the Anthropic API. With prompt caching, customers can provide Claude with more background knowledge and example outputs—all while reducing costs by up to 90% and latency by up to 85% for long prompts. Prompt caching is available today in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.”
“When to use prompt caching”
“Prompt caching can be effective in situations where you want to send a large amount of prompt context once and then refer to that information repeatedly in subsequent requests, including:”
“Conversational agents: Reduce cost and latency for extended conversations, especially those with long instructions or uploaded documents.”
“Coding assistants: Improve autocomplete and codebase Q&A by keeping a summarized version of the codebase in the prompt.”
“Large document processing: Incorporate complete long-form material including images in your prompt without increasing response latency.”
“Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude's responses. Developers often include a few examples in their prompt, but with prompt caching you can get even better performance by including dozens of diverse examples of high quality outputs.”
“Agentic search and tool use: Enhance performance for scenarios involving multiple rounds of tool calls and iterative changes, where each step typically requires a new API call.”
“Talk to books, papers, documentation, podcast transcripts, and other long-form content: Bring any knowledge base alive by embedding the entire document(s) into the prompt, and letting users ask it questions.”
“Early customers have seen substantial speed and cost improvements with prompt caching for a variety of use cases—from including a full knowledge base to 100-shot examples to including each turn of a conversation in their prompt.”
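For developers curious what this looks like in practice, here is a minimal sketch of a prompt-caching call using Anthropic’s Python SDK, based on their public beta documentation. The beta header, `cache_control` field, model string, and file name below are assumptions drawn from that announcement and may change as the API evolves.

```python
# Minimal sketch of Anthropic's prompt caching beta, based on their public
# docs at launch. Requires `pip install anthropic` and ANTHROPIC_API_KEY set.
import anthropic

client = anthropic.Anthropic()

# A long, reusable context block -- e.g. an entire book, transcript, or
# codebase summary that stays the same across many requests.
with open("knowledge_base.txt") as f:
    long_context = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Beta feature flag from the announcement; may change as the API evolves.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": "Answer questions using only the document below.",
        },
        {
            "type": "text",
            "text": long_context,
            # Marks this block as cacheable so later calls can reuse it
            # instead of re-processing the whole document.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What are the key takeaways?"}],
)
print(response.content[0].text)
```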
The immediate advantages of better caching (memory), of course, are a reduction in variable AI compute costs of up to 90%, along with far lower latency (up to 85%), meaning faster responses to queries.
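To see roughly where the ‘up to 90%’ figure comes from, here is a back-of-the-envelope sketch. It assumes Anthropic’s announced beta pricing, where cache writes bill at about 1.25x the base input-token rate and cache reads at about 0.1x; the token counts and rates below are illustrative, not quoted from the announcement.

```python
# Back-of-the-envelope savings for a 100K-token context reused across 10 calls,
# assuming cache writes bill at ~1.25x and cache reads at ~0.1x the base
# input-token rate (Anthropic's announced beta pricing -- verify current rates).
base_rate = 3.00 / 1_000_000   # ~$3 per million input tokens (Claude 3.5 Sonnet at launch)
context_tokens = 100_000
calls = 10

without_cache = calls * context_tokens * base_rate
with_cache = (context_tokens * base_rate * 1.25                   # first call writes the cache
              + (calls - 1) * context_tokens * base_rate * 0.10)  # later calls read it

print(f"without caching: ${without_cache:.2f}")                  # about $3.00
print(f"with caching:    ${with_cache:.2f}")                     # about $0.65
print(f"saving:          {1 - with_cache / without_cache:.0%}")  # ~78%; approaches 90% as reuse grows
```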
“What does this mean?”
“Here’s the deal, serious use with LLMs involves sending huge long prompts that don’t change for most of your requests. It could be in the form of:”
“Tons of examples for the LLM to reference for your task.”
“Long chats to keep the context of what you said earlier.”
“Working with long files for question answering.”
“And that makes a dent in your bank. But with caching, you can save the common part of your prompts for 5 mins. If you make another API call with largely the same prompt, Anthropic will use the saved part and let you run the call at a much lower cost. Not only that, it makes getting the outputs faster too.”
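Continuing the earlier sketch, a largely identical call sent again within that cache window should be billed mostly as cheap cache reads. The usage fields referenced below (`cache_creation_input_tokens`, `cache_read_input_tokens`) are the ones described in the beta docs; treat the exact names as assumptions that may differ in current SDK versions.

```python
# Re-sending essentially the same prompt within the ~5-minute cache lifetime
# (same client, system blocks, and long_context as in the earlier sketch).
followup = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text", "text": "Answer questions using only the document below."},
        {"type": "text", "text": long_context,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Now summarize the document in three bullets."}],
)

usage = followup.usage
# On the first call, the beta usage block reports tokens written to the cache;
# on a repeat within the window, those tokens come back as much cheaper reads.
print("cache writes:", getattr(usage, "cache_creation_input_tokens", None))
print("cache reads: ", getattr(usage, "cache_read_input_tokens", None))
print("fresh input: ", usage.input_tokens)
```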
Of course, this new caching capability will be emulated by the other LLM AI companies like OpenAI, Google, et al., and become what I call AI ‘Table Stakes’ in this AI Tech Wave.
But it’s important to flag the addition of more memory to AI systems, be it for ‘m2m’ AI computing or directly for end users (b2c). The more memory, the better on the way to more ‘reasoned’ AGI. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)