A 2023 paper from researchers at Yale University and Google explained that, by saving prompts on the inference server, developers can “significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.”
“It is becoming expensive to use closed-source LLMs when the usage goes high,” noted Andy Thurai, VP and principal analyst at Constellation Research. “Many enterprises and developers are facing sticker shock, especially if they have to repeatedly use the same prompts to get the same or similar responses from the LLMs; they still charge the same amount for every round trip. This is especially true when multiple users enter the same (or somewhat similar) prompt looking for similar answers many times a day.”
Use cases for prompt caching
Anthropic cited several use cases where prompt caching can be helpful, including conversational agents, coding assistants, processing of large documents, and allowing users to query cached long-form content such as books, papers, or transcripts. It could also be used to share instructions, procedures, and examples that fine-tune Claude’s responses, or to improve performance when multiple rounds of tool calls and iterative changes require repeated API calls.
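As an illustration, the document-query case works by marking a long, stable block of context for caching so that follow-up requests reuse it instead of re-sending and re-processing it each time. The sketch below assumes the anthropic Python SDK and its cache_control field on system content blocks; the file name, helper function, and model string are illustrative, and older SDK versions gated prompt caching behind a beta header.

```python
# A minimal sketch of prompt caching with the `anthropic` Python SDK.
# Assumes ANTHROPIC_API_KEY is set; exact parameter names and cache
# behavior may differ by SDK version (early versions required an
# "anthropic-beta: prompt-caching-2024-07-31" header).
import anthropic

client = anthropic.Anthropic()

# A long, reusable context (e.g., a transcript) that many queries share.
# Marking it with cache_control asks the API to cache the processed
# prompt prefix so later calls skip re-processing it.
long_document = open("transcript.txt").read()  # illustrative file name

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You answer questions about the attached transcript.",
            },
            {
                "type": "text",
                "text": long_document,
                # "ephemeral" is the cache type Anthropic documents for
                # prompt caching; everything up to this block is cached.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

# The first call pays the full cost of processing the long document;
# subsequent calls within the cache lifetime reuse the cached prefix,
# cutting time-to-first-token and the cost of the cached input tokens.
print(ask("Summarize the main argument in three sentences."))
print(ask("List the speakers and their roles."))
```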