Anthropic has upped the ante for how much information a large language model (LLM) can consume at once, announcing on Tuesday that its just-released Claude 2.1 has a context window of 200,000 tokens. That’s roughly the equivalent of 150,000 words or more than 500 printed pages of information, Anthropic said.
The latest Claude version is also more accurate than its predecessor, has a lower price, and includes beta tool use, the company said in its announcement.
The new model powers Anthropic’s Claude generative AI chatbot, so both free and paying users can take advantage of most of Claude 2.1’s improvements. However, the 200,000 token context window is for paying Pro users, while free users still have a 100,000 token limit — significantly higher than GPT-3.5’s 16,000.
Claude 2.1’s beta tool feature will allow developers to integrate APIs and defined functions with the Claude model, similar to what’s been available in OpenAI’s models.
Claude’s previous 100,000 token context window had been significantly ahead of OpenAI in that metric until last month, when OpenAI announced a preview version of GPT-4 Turbo with a 128,000 token context window. However, only ChatGPT Plus customers with $20/month subscriptions can access that model in chatbot form. (Developers can pay per usage for access to the GPT-4 API.)
While a large context window (the amount of data a model can process at one time) looks compelling if you have a large document or other information, it’s not clear that LLMs can process large amounts of data as well as they handle information in smaller chunks. Greg Kamradt, an AI practitioner and entrepreneur who’s been tracking this issue, has run what he calls “needle in a haystack” analysis to see if tiny pieces of info within a large document are actually found when the LLM is queried. He repeats the test, placing a random statement at various positions in a large document that is fed into the LLM, and then queries the model about that statement.
“At 200K tokens (nearly 470 pages), Claude 2.1 was able to recall facts at some document depths,” he posted on X (formerly Twitter), noting that he had been granted early access to Claude 2.1. “Starting at ~90K tokens, performance of recall at the bottom of the document started to get increasingly worse.” GPT-4 did not have perfect recall at its largest context either.
Running the tests on Claude 2.1 cost about $1,000 in API calls (Anthropic offered credits so he could run the same tests he had done on GPT-4).
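For readers who want to see the shape of such a test, here is a minimal Python sketch of the idea. It is not Kamradt’s actual code: the function names, the `ask_model` callable and the `toy_model` stand-in are all hypothetical, and document size is measured in words rather than tokens to keep the example self-contained.

```python
def insert_needle(haystack_words, needle, depth):
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    cut = round(len(haystack_words) * depth)
    return " ".join(haystack_words[:cut] + [needle] + haystack_words[cut:])


def run_needle_test(ask_model, haystack, needle, question, expected,
                    context_sizes, depths):
    """Return {(size, depth): True/False} for whether the fact was recalled.

    `ask_model(document, question)` is a hypothetical callable that wraps
    whatever LLM API is being tested and returns the answer as a string.
    """
    words = haystack.split()
    results = {}
    for size in context_sizes:
        for depth in depths:
            document = insert_needle(words[:size], needle, depth)
            answer = ask_model(document, question)
            results[(size, depth)] = expected.lower() in answer.lower()
    return results


def toy_model(document, question):
    """Stand-in for a real LLM: it only 'reads' the first 60 words."""
    visible = " ".join(document.split()[:60])
    return "7421" if "7421" in visible else "I don't know"


results = run_needle_test(
    ask_model=toy_model,
    haystack="filler " * 200,
    needle="The secret code is 7421.",
    question="What is the secret code?",
    expected="7421",
    context_sizes=[40, 100],
    depths=[0.0, 0.5, 1.0],
)
```

Swapping `toy_model` for a function that calls a real model, and plotting `results` as a grid of context size against depth, would give a rough picture of where recall starts to fail.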
His conclusions: How you craft your prompts matters, don’t assume information will always be retrieved, and smaller inputs will yield better results.
In fact, many developers seeking to query information from large amounts of data create applications that split that data into smaller pieces in order to improve retrieval results, even if the context window would allow more.
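A rough illustration of that chunking approach, in Python: the sketch below splits text into overlapping word-based chunks and ranks them by simple keyword overlap with a query. The function names and default sizes are assumptions made for illustration, and production applications typically rank chunks with embeddings and a vector store rather than keyword matching.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def top_chunks(chunks, query, k=3):
    """Return the k chunks sharing the most words with the query.

    Naive keyword overlap (no stemming or punctuation handling), used here
    only as a stand-in for a real relevance ranking.
    """
    query_terms = set(query.lower().split())

    def score(chunk):
        return len(query_terms & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:k]
```

Only the top-ranked chunks are then sent to the model along with the question, which keeps each prompt far smaller than the full context window would permit.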
Looking at the new model’s accuracy, in tests with what Anthropic called “a large set of complex, factual questions that probe known weaknesses in current models,” the company said Claude 2.1 showed a twofold decrease in false statements compared with the previous version. The current model is more likely to say it doesn’t know instead of “hallucinating” or making something up, according to the Anthropic announcement. The company also cited “meaningful improvements” in comprehension and summarization.