Testing Markdown

Alessia Pandolfo · April 9, 2026 · 7 min read

LLMs have one fundamental drawback: they are frozen in time, and this is by design. They don't know what happened yesterday or five seconds ago, and they know nothing about our private data. They only have the information that was in their training data.

This raises a question: how do we get an LLM to use information that wasn't in its training data? The answer is context injection, and there are two ways to do it. One is RAG (Retrieval-Augmented Generation), and the other is long context.

What is RAG? Imagine we have documents containing our company policies, and we want an LLM to answer user queries based on the information in those documents. We break the documents into chunks, pass them through an embedding model, and create vector embeddings: numerical representations of data such as text, images, or audio, converted into lists of numbers (vectors) that capture their semantic meaning and context. Finally, we save these embeddings in a vector database such as pgvector or Redis. Whenever we want to search for information, we embed the user query into a vector embedding of its own and perform semantic search over the stored data. Once we get the matching chunks, we pass them to the LLM to use in its response to the user.
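The pipeline above can be sketched in a few lines. This is a toy illustration, not a production setup: the word-count "embedding" stands in for a real embedding model, and the in-memory list stands in for a vector database such as pgvector or Redis.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words vector.
    # In practice you would call an actual embedder here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Chunk the policy document (here: one chunk per sentence).
document = (
    "Employees accrue 20 vacation days per year. "
    "Remote work requires manager approval. "
    "Expense reports are due within 30 days."
)
chunks = [s.strip() + "." for s in document.split(".") if s.strip()]

# 2. Index each chunk: embed it and store it (a real system would
#    write these vectors to a vector database instead of a list).
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. At query time, embed the user query and rank chunks by similarity.
query = "How many vacation days do I get?"
query_vec = embed(query)
best_chunk = max(index, key=lambda item: cosine(query_vec, item[1]))[0]

# 4. The retrieved chunk is what gets injected into the LLM's prompt.
print(best_chunk)  # → "Employees accrue 20 vacation days per year."
```

A real system differs mainly in scale and quality, not shape: the embedding model captures meaning rather than word overlap, and the database makes nearest-neighbor search fast over millions of chunks.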


What is long context? The second method of context injection is called long context, and it can be considered a brute-force approach: we take all the documents we have, pass them alongside our prompt, and let the model's attention mechanism do the heavy lifting of finding an answer.
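In code, the long-context approach is almost trivially simple compared to the RAG pipeline. The `call_llm` function below is a placeholder assumed for illustration; in practice it would be a call to any large-context model API.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for a real model API call; swap in your
    # provider's SDK. Assumed here purely for illustration.
    return f"(model answer based on {len(prompt)} prompt characters)"

documents = [
    "Policy A: Employees accrue 20 vacation days per year.",
    "Policy B: Remote work requires manager approval.",
]
question = "How many vacation days do employees get?"

# No chunking, no embeddings, no vector database: concatenate
# everything and let the model's attention find the answer.
prompt = "\n\n".join(documents) + f"\n\nQuestion: {question}"
answer = call_llm(prompt)
print(answer)
```

The entire "retrieval" strategy is one string concatenation, which is exactly the appeal of this approach.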

For a long time after the first LLMs appeared, long context wasn't much of an option because models had tiny context windows. The earliest models had a context window of around 4k tokens, which meant we could not simply lift our documents (novels, a corporate knowledge base, etc.) and send them to an LLM alongside our prompt; RAG was the only feasible option. Current LLMs, however, have large context windows: models such as Gemini 3 and Claude Opus 4.6 support up to 1 million tokens. This means we can fit most (if not all) of our large documents directly into the LLM's context.

The big question: if we can fit our entire knowledge base into a context window, is there really a need to keep using RAG, given the extra complexity it adds to our system? (Think of the extra steps of embedding user prompts and performing semantic search, or the risk that the search retrieves inaccurate data from the vector database.) At the very least, long context is a much simpler approach.

Why is long context winning? While RAG is powerful, it is inherently complex. Long context (the brute-force approach) offers three distinct advantages:

1. It collapses the infrastructure. Using RAG means we need a chunking strategy for our data (sliding window, fixed size, or recursive), an embedding model, and a vector database to store the embeddings. That is a lot of moving parts that can easily break. Long context needs none of that stack: we just feed our data directly to the LLM.

2. RAG introduces a flaw called the retrieval lottery. When we perform semantic search in the vector database, the relevant data may not be retrieved at all. This is a silent failure: the data existed in the documents, but the LLM never saw it because the retrieval step failed. The retrieval lottery cannot affect long context because there is no retrieval step; the LLM gets to see everything.

3. The "whole book" problem. By design, RAG is a surgical tool, built to find specific, highly relevant "needles" within a massive haystack of data. This granular approach becomes a liability when the answer to a query depends not on a single fact but on the narrative arc of the entire haystack. Imagine asking an AI to summarize the character development of a protagonist in a 500-page novel. A RAG-based system will search the database and pull out twenty scattered pages where the character's name appears: perhaps a scene where they are crying on page 12 and a scene where they are laughing on page 480. The problem is that the AI never sees the 468 pages in between. It lacks the connective tissue: the slow-burn transformation, the subtle plot twists, and the emotional context that explain why the character changed. Because the RAG pipeline feeds the model only disconnected snippets, the LLM is forced to guess what happened in the gaps. This often leads to hallucinations, where the model confidently invents a story to bridge the fragments it was given.
When we use long context, we don't encounter this problem because the model gets to see everything.

Is RAG dead? While long context is simpler for the majority of tasks, we still need RAG. Here are some cases where it remains important:

1. The re-reading tax. Imagine we have a 700-page manual (roughly 350k tokens). If we pass the whole document in context, the model has to process it for every user query, which is inefficient and expensive. (Context caching can help, but what if we are dealing with dynamic data?) With RAG, the manual is indexed once, and only a small subset is retrieved for each query.

2. The needle-in-a-haystack problem. There is a general assumption that if data is in the context window, the model will use it, but that is often not the case. If we pass 500k tokens into a model's context and ask about information buried deep inside, the model is likely to hallucinate based on the surrounding text. With RAG, we give the model only the chunks it needs, which removes all the surrounding noise: we get rid of the "hay" and present the model with the "needles" only.

3. The infinite dataset. While a one-million-token context window sounds big, enterprise data often cannot fit in it; enterprise data lakes usually hold terabytes of data. In such cases, RAG is the most practical approach.

So where does this leave us? If you are summarizing a book or preparing for an exam from a set of unit notes, long context is probably the best approach because of its simplicity. However, if you are dealing with an enterprise knowledge base containing a huge amount of data, RAG might be the better choice.
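The re-reading tax described above is easy to quantify with back-of-the-envelope arithmetic. The token counts come from the manual example in the text; the per-token price and query volume are illustrative assumptions, not any specific provider's pricing.

```python
# Illustrative numbers only: the price and query volume below are
# assumptions for the sake of the comparison.
PRICE_PER_MILLION_INPUT_TOKENS = 1.25  # assumed, in dollars
MANUAL_TOKENS = 350_000    # the 700-page manual from the text
RETRIEVED_TOKENS = 2_000   # a few relevant chunks per query (assumed)
QUERIES_PER_DAY = 1_000    # assumed query volume

def daily_cost(tokens_per_query: int) -> float:
    # Input-token cost of serving all queries for one day.
    return tokens_per_query * QUERIES_PER_DAY * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

long_context_cost = daily_cost(MANUAL_TOKENS)   # re-reads the whole manual every query
rag_cost = daily_cost(RETRIEVED_TOKENS)         # sends only the retrieved chunks

print(f"long context: ${long_context_cost:.2f}/day")  # → long context: $437.50/day
print(f"rag:          ${rag_cost:.2f}/day")           # → rag:          $2.50/day
```

Whatever the real prices, the ratio is what matters: sending 350k tokens instead of 2k per query makes each query roughly 175 times more expensive, and context caching only partially offsets this when the underlying data changes often.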


RAG vs Long Context


Written by OJIAMBO PATRICK