Large language models (LLMs) with extremely long context windows (LCW) have become a focal point of discussion, particularly for enterprise adoption of GenAI [1,2,3,4]. Long context window LLMs such as Gemini, GPT-4o, Llama-3.1-405B, and Mixtral have further fueled this interest.
With the emergence of LCW models came the development of academic benchmarks to quantify the effectiveness of these large systems. The popular benchmarks designed for LCW evaluation are as follows:
In this article, we conduct a comparative analysis between LCW models, utilizing in-context learning, and retrieval-augmented generation (RAG) systems paired with smaller context window models. This comparison spans both ends of the AI architectural size spectrum and is applied to two variants of the Needle in a Haystack (NIAH) benchmark.
Our major contributions in this study are twofold:
The Needle in a Haystack benchmark, as the name suggests, is a synthetic document in which a “needle” (i.e. a relevant piece of information) is hidden within a large “haystack” (i.e. a large corpus of text). Models (or systems of models) are evaluated with a query that aims to retrieve that particular needle.
The NIAH benchmark is highly relevant for Enterprise AI use cases that involve accessing and analyzing specific pieces of information from a large corpus of text and documents within enterprises.
We evaluate the two systems (RAG and LCW models) on two variants of the needle-in-a-haystack benchmark, namely:
To understand the effectiveness of these long context window LLMs and RAG systems, we evaluated their performance on the biographies (NIAH) benchmark.
The biographies benchmark expects the model (or system of models) to retrieve the needle (i.e. natural language text pertaining to a particular individual being asked about) from the haystack and perform reasoning on it in order to report the following pieces of information, when available.
The RAG systems evaluated are:
On the other hand, the LCW model evaluated is GPT-4, which has a maximum context window of 32k tokens (yellow curve). Because the maximum context window of the GPT-4 system used in this analysis is 32k tokens, we are unable to perform in-context learning to retrieve the needle from documents larger than 32k tokens.
As demonstrated in Figure 2, we find that RAG systems coupled with smaller context window models (blue and red curves) are significantly more accurate than the LCW model (yellow curve). Additionally, it is noteworthy that the performance of the RAG systems is almost constant as the size of the haystack is varied from 2k to 2M tokens, while the LCW model shows a sharp drop in accuracy as the haystack size increases. Because the LCW model tested here (GPT-4 32k) is limited to a 32k-token context window, in-context learning could not be evaluated on larger documents. The minor difference between the two RAG systems on the benchmark could be a product of forced JSON creation, which is a well-known source of hallucination and degradation[10].
Although RAG systems are far more accurate than LCW models on this benchmark, we identified a couple of reasons why the RAG systems do not reach 100% accuracy on the task across different haystack sizes:
In addition to testing the model (or system of models) performance on retrieving a single needle (similar in genre to the haystack), we would like to subject the system to discovering and analyzing multiple hidden needles from a haystack. In this dataset, the hidden needles are distributed across the haystack and are very different in genre from the haystack, with the goal of evaluating if the system can retrieve distributed hidden needles from haystacks with a lot of distracting text.
The haystack is composed of a collection of Paul Graham’s essays, while the 3 needles inserted in the haystack are:
The retrieval question that different systems are subjected to, for retrieving all the different needles from the haystack, is: “What are the secret ingredients needed to build the perfect pizza?”
For this multi-needle retrieval analysis, we compare two systems:
As shown in Figure 3, we find that the accuracy of in-context learning using GPT-4o falls from a perfect 100% as the document size (haystack size) increases. We also note that some haystack sizes are more vulnerable than others. For instance, GPT-4o discovers only 1 of the 3 inserted needles (~33%) when they are inserted at different depths in a haystack of 65k tokens, while it does much better and retrieves all needles placed at different depths in a haystack of 100k tokens. The accuracy of retrieving needles is zero for haystacks larger than 128k tokens because GPT-4o has a maximum context window of 128k tokens, which prevents in-context learning on larger documents (or haystacks).
On the other hand, we observe that the Yurts RAG (end-to-end) system is able to retrieve all the needles across the different document (haystack) sizes. We get a perfect retrieval because the retrieval question is very straightforward and does not require any complex reasoning from the coupled LLM (here, Llama-3-8b-instruct).
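The per-haystack accuracy discussed here (for example, 1 of 3 needles, or ~33%) can be read as the fraction of inserted needles that appear in the system's answer. The following is a minimal sketch of such a score, assuming simple case-insensitive substring matching; the article does not state how answers were actually graded, and the needle strings below are hypothetical placeholders rather than the needles used in the benchmark.

```python
def needle_recall(answer: str, needles: list[str]) -> float:
    """Fraction of needles mentioned in the answer (case-insensitive substring match)."""
    if not needles:
        raise ValueError("needles must not be empty")
    text = answer.lower()
    found = sum(1 for needle in needles if needle.lower() in text)
    return found / len(needles)


# Hypothetical placeholder needles, for illustration only.
needles = ["ingredient_a", "ingredient_b", "ingredient_c"]

# An answer that mentions one of the three needles scores 1/3.
print(needle_recall("The secret ingredients include ingredient_b.", needles))
```

Substring matching is the simplest possible grader; an evaluation that accepts paraphrases would need a more tolerant comparison.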
To further examine the relationship between the accuracy of needle retrieval with the depth of needle insertion into the haystack, we insert the needle into multiple depths and pose the retrieval question to the system being evaluated.
We evaluated the mean accuracy over 3 runs for GPT-4o on the same task, with needles inserted at different depths within the haystack. We find (Figure 4) that GPT-4o is typically successful in retrieving the relevant needles when they are placed at the start or the end of the document, but struggles when they are placed in the middle. On the other hand, the Yurts RAG system (Figure 5) is able to retrieve the relevant needles irrespective of their position within the haystack - a necessity when working with large volumes of documents within enterprises.
Having demonstrated that RAG systems coupled with smaller context window models are more accurate than in-context learning with very long context window models, we now highlight the compute requirements and costs associated with each system (Table 1).
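The depth experiment can be sketched as follows. This is an illustrative outline rather than the harness used for the figures: whitespace-separated words stand in for tokens, and the function names (`insert_needle`, `depth_sweep`, `mean_accuracy_by_depth`) are assumptions introduced here.

```python
from statistics import mean


def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end).

    Whitespace-separated words are used as a stand-in for tokens (an assumption).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    words = haystack.split()
    index = int(depth * len(words))
    return " ".join(words[:index] + needle.split() + words[index:])


def depth_sweep(haystack: str, needle: str, depths: list[float]) -> dict[float, str]:
    """Build one haystack variant per insertion depth."""
    return {depth: insert_needle(haystack, needle, depth) for depth in depths}


def mean_accuracy_by_depth(scores_by_depth: dict[float, list[float]]) -> dict[float, float]:
    """Average the per-run scores recorded at each depth."""
    return {depth: mean(scores) for depth, scores in scores_by_depth.items()}


variants = depth_sweep("a b c d", "NEEDLE", [0.0, 0.5, 1.0])
print(variants[0.5])  # a b NEEDLE c d
```

Each variant would be passed to the system under test together with the retrieval question, and the per-run scores at each depth averaged with `mean_accuracy_by_depth`.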
Yurts RAG (end-to-end) combined with Llama-3-8B-instruct requires up to 2 A10 GPUs for single-user operations and can scale to support 50 concurrent users with 4 A10 GPUs. While the RAG approach involves additional retrieval components, such as ElasticSearch, Vector databases, and smaller neural networks, we emphasize GPU requirements as they continue to be the primary cost driver.
On the other hand, long context window models can require a minimum of 40 A10 GPUs for inference (for a single user). As the number of parameters and the architecture of GPT-4o are not public knowledge, we detail the GPU requirements for an open-source model that has been tuned to give it a very long context window. The model we investigate is Gradient AI’s Llama-3-8B-Instruct-Gradient-1048k. In pilot experiments with this model, we find that a naive Hugging Face implementation for inference (without any optimizations) requires close to 80GB of GPU memory (i.e. a minimum of 4 A10 GPUs at 24GB each) for a 16k-token context window. We estimate that utilizing the entire 1M context window would require close to 1000GB of GPU memory (roughly 40 A10 GPUs) - for a single user. We believe that using a vLLM implementation for inference could enable more concurrency on the same 40 A10 GPU resource.
*A10 is a commercial GPU with a maximum GPU memory of 24GB. We use the cost ($1.624/hr) of renting GPUs on demand via AWS cloud for our cost evaluation.
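The GPU counts and hourly costs follow from simple arithmetic on the two figures in the footnote (24GB per A10, $1.624 per GPU-hour). The sketch below reproduces that arithmetic; it counts only the GPUs needed to hold the stated memory footprint and assumes one on-demand price per GPU, so the entries in Table 1 may have been derived with additional considerations.

```python
import math

A10_MEMORY_GB = 24        # per-GPU memory, from the footnote
A10_USD_PER_HOUR = 1.624  # on-demand AWS price per GPU, from the footnote


def gpus_needed(memory_gb: float) -> int:
    """Smallest number of A10 GPUs whose combined memory covers memory_gb."""
    return math.ceil(memory_gb / A10_MEMORY_GB)


def hourly_cost(num_gpus: int) -> float:
    """On-demand cost in USD per hour for num_gpus A10 GPUs."""
    return round(num_gpus * A10_USD_PER_HOUR, 3)


print(gpus_needed(80))    # 4  (16k-token context, naive implementation)
print(gpus_needed(1000))  # 42 (estimated full 1M-token context, i.e. roughly 40)
print(hourly_cost(2), hourly_cost(4), hourly_cost(40))  # 3.248 6.496 64.96
```

At these prices, the 40-GPU long-context configuration costs roughly 20 times as much per hour as the 2-GPU single-user RAG configuration.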
As some of the longer context window models are available via API and are charged on a per-token basis, we would like to demonstrate that applying a RAG filter to the entire document corpus before using long context window models is both efficient and cost-effective. On the Biographies benchmark (described above), RAG consistently extracted very short pieces of text from the large document (haystack) to feed to the LLM, ensuring that the model can perform in-context learning accurately and cost-effectively on the shorter context.
As Figure 6 shows, despite the increasing haystack size (from 2k to 2M tokens), the effective context window (i.e. the RAG-filtered context fed to the smaller context window LLM) is typically in the range of 250 to 350 tokens. As recent work[11] has demonstrated that open-source models perform best on reasoning and analysis tasks when provided with shorter contexts (or prompts), we believe that RAG as an efficient filter coupled with an LLM would make for a highly performant system.
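To make the idea of RAG as a filter concrete, here is a toy sketch that splits a document into fixed-size chunks, ranks them by keyword overlap with the query, and passes only the best chunk to the LLM. This is not the Yurts RAG pipeline: the keyword-overlap scoring, the 50-word chunk size, and the example text (including the name "Jane Doe") are all assumptions chosen for illustration, and production retrievers typically combine lexical and embedding-based search.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens with punctuation removed."""
    return re.findall(r"\w+", text.lower())


def chunk_words(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> int:
    """Number of distinct query terms that also occur in the chunk."""
    return len(set(tokenize(query)) & set(tokenize(chunk)))


def rag_filter(query: str, document: str, top_k: int = 1, chunk_size: int = 50) -> str:
    """Return the top_k highest-scoring chunks as the context for the LLM."""
    chunks = chunk_words(document, chunk_size)
    ranked = sorted(chunks, key=lambda chunk: overlap_score(query, chunk), reverse=True)
    return "\n".join(ranked[:top_k])


# Illustrative haystack: repeated filler text with one hypothetical needle in the middle.
filler = "the quick brown fox jumps over the lazy dog " * 200
needle = "Jane Doe works as a cartographer in Lisbon. "
document = filler + needle + filler

context = rag_filter("Where does Jane Doe work?", document)
print(len(document.split()), "->", len(context.split()), "words")  # 3608 -> 50 words
```

The context handed to the LLM stays at one chunk regardless of how much filler surrounds the needle, which is the behavior the effective context window measurements describe.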
Conclusions:
We have demonstrated that RAG systems are far more accurate than LCW models on popular academic benchmarks (such as the Needle in a Haystack variants) developed to test the effectiveness of long context window models.
Moreover, we have shown that RAG systems can easily scale to large document corpora, as exemplified with documents containing 2 million tokens, without any degradation in performance or accuracy.
In addition to performance improvements, we note that RAG systems require far fewer compute resources than long context window models - making the former a strong candidate for enterprise adoption of Generative AI.
Future work:
Although we’ve shown that long context window models are both less accurate and less cost-effective than their RAG counterparts (specifically the Yurts RAG system), we encourage the community to build better academic benchmarks that truly highlight the unique abilities of LCW models, which cannot be matched by smaller systems. One example would be tasks and benchmarks that genuinely require very long contexts to complete user workflows.