In “Retrieval-augmented generation, step by step,” we walked through a very simple RAG example. Our little application augmented a large language model (LLM) with our own documents, enabling the language model to answer questions about our own content. That example used an embedding model from OpenAI, which meant we had to send our content to OpenAI’s servers—a potential data privacy violation, depending on the application. We also used OpenAI’s public LLM.
This time we will build a fully local version of a retrieval-augmented generation system, using a local embedding model and a local LLM. As in the previous article, we’ll use LangChain to stitch together the various components of our application. Instead of FAISS (Facebook AI Similarity Search), we’ll use SQLite-vss to store our vector data. SQLite-vss is our familiar friend SQLite with an extension that makes it capable of similarity search.
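Under the hood, SQLite-vss is just a loadable SQLite extension. The LangChain integration we use below handles all of the SQLite plumbing for us, but if you are curious, here is a minimal sketch of loading the extension directly; it assumes the sqlite-vss Python package is installed (pip install sqlite-vss), and the version check is only there to confirm the extension loaded.

# Minimal sketch of loading the sqlite-vss extension by hand (assumes
# `pip install sqlite-vss`); the LangChain SQLiteVSS wrapper used later
# takes care of this for us.
import sqlite3
import sqlite_vss

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)   # allow SQLite to load extensions
sqlite_vss.load(conn)              # load the sqlite-vss extension modules
conn.enable_load_extension(False)

# Confirm the extension is available by asking for its version
(version,) = conn.execute("select vss_version()").fetchone()
print(version)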
Remember that similarity search for text matches on meaning (or semantics) using embeddings, which are numerical representations of words or phrases in a vector space. The shorter the distance between two embeddings in that vector space, the closer the two words or phrases are in meaning. Therefore, before we can find the passages of our own documents that are relevant to a question, we first need to convert those documents to embeddings and store them where they can be searched.
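To make that notion of distance concrete, here is a small sketch, separate from the application itself, that embeds three phrases with the same all-MiniLM-L6-v2 model we will use later and compares them with cosine similarity (a higher score means closer in meaning). The phrases are my own illustrative choices; the util.cos_sim helper comes from the sentence-transformers package.

# Illustration of semantic similarity with sentence-transformers
# (assumes `pip install sentence-transformers`)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode([
    "The president praised the Speaker of the House.",
    "Nancy Pelosi was congratulated.",
    "The weather is sunny today.",
])

# Phrases that are closer in meaning get a higher cosine similarity score
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low

In the application itself, we never compute these scores by hand; the vector store does the comparison for us.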
We save the embeddings in the local vector store and then integrate that vector store with our LLM. We’ll use Llama 2 as our LLM, running it locally with an app called Ollama, which is available for macOS, Linux, and Windows (the latter in preview). You can read about installing Ollama in this InfoWorld article.
Here is the list of components we will need to build a simple, fully local RAG system:
- A document corpus. Here we will use just one document, the text of President Biden’s February 7, 2023, State of the Union Address. You can download this text at the link below.
- A loader for the document. This code will extract text from the document and pre-process it into chunks for generating an embedding.
- An embedding model. This model takes the pre-processed document chunks as input and outputs embeddings (i.e., numerical vectors that represent the document chunks).
- A local vector data store with an index for searching.
- An LLM tuned for following instructions and running on your own machine. This machine could be a desktop, a laptop, or a VM in the cloud. In my example it is a Llama 2 model running on Ollama on my Mac.
- A chat template for asking questions. This template gives the LLM a framework for responding in a format that human beings will understand. (A minimal sketch of such a template follows this list.)
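On that last point: the query code below pulls a ready-made RAG prompt for Llama 2 (rlm/rag-prompt-llama) from LangChain Hub. If you want to see what such a template looks like, here is a minimal hand-rolled sketch using LangChain's PromptTemplate; the wording is my own and is only meant to illustrate the shape of the prompt.

# Minimal, hand-rolled RAG prompt template (illustrative only; the query code
# below uses the community prompt rlm/rag-prompt-llama from LangChain Hub)
from langchain.prompts import PromptTemplate

rag_prompt = PromptTemplate.from_template(
    "Use the following context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know.\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

print(rag_prompt.format(context="<retrieved chunks>", question="<user question>"))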
Now for the code, with some more explanation in the comments.
Fully local RAG example—retrieval code
# LocalRAG.py
# LangChain is a framework and toolkit for interacting with LLMs programmatically
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import SQLiteVSS
from langchain.document_loaders import TextLoader

# Load the document using a LangChain text loader
loader = TextLoader("./sotu2023.txt")
documents = loader.load()

# Split the document into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
texts = [doc.page_content for doc in docs]

# Use the sentence transformer package with the all-MiniLM-L6-v2 embedding model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the text embeddings in SQLiteVSS in a table named state_union
db = SQLiteVSS.from_texts(
    texts=texts,
    embedding=embedding_function,
    table="state_union",
    db_file="/tmp/vss.db",
)

# First, we will do a simple retrieval using similarity search

# Query
question = "What did the president say about Nancy Pelosi?"
data = db.similarity_search(question)

# Print results
print(data[0].page_content)
Fully local RAG example—retrieval output
Mr. Speaker. Madam Vice President. Our First Lady and Second Gentleman.
Members of Congress and the Cabinet. Leaders of our military.
Mr. Chief Justice, Associate Justices, and retired Justices of the Supreme Court.
And you, my fellow Americans.
I start tonight by congratulating the members of the 118th Congress and the new Speaker of the House, Kevin McCarthy.
Mr. Speaker, I look forward to working together.
I also want to congratulate the new leader of the House Democrats and the first Black House Minority Leader in history, Hakeem Jeffries.
Congratulations to the longest serving Senate Leader in history, Mitch McConnell.
And congratulations to Chuck Schumer for another term as Senate Majority Leader, this time with an even bigger majority.
And I want to give special recognition to someone who I think will be considered the greatest Speaker in the history of this country, Nancy Pelosi.
Note that the result is a literal chunk of text from the document that is relevant to the query. It is what the similarity search of the vector database returns, and it is not, by itself, an answer to the query. The last line of the output happens to contain the answer; the rest of the output is context for that answer.
Note that chunks of your documents are all you will get from a raw similarity search on a vector database. Often you will get more than one chunk, depending on how broad or narrow your question is. Because our example question was rather narrow, and because there is only one mention of Nancy Pelosi in the text, we received just one chunk back.
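If you want to see more of what the retrieval step can return, you can ask the vector store for several chunks at once; the sketch below uses the standard k parameter of LangChain's similarity_search to request the top three matches (the choice of three is arbitrary).

# Retrieve the top three matching chunks instead of just the best one
top_chunks = db.similarity_search(question, k=3)

for i, chunk in enumerate(top_chunks, start=1):
    print(f"--- chunk {i} ---")
    print(chunk.page_content[:200])  # first 200 characters of each chunk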
Now we will use the LLM to ingest the chunk of text that came from the similarity search and generate a compact answer to the query.
Before you can run the following code, Ollama must be installed and the llama2:7b model downloaded. Note that on macOS and Linux, Ollama stores its models in the .ollama subdirectory of the user’s home directory.
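It can also be worth a quick sanity check that LangChain can reach the local model before wiring it into the chain. The short sketch below, which is not part of the original example, simply sends one prompt to llama2:7b and prints the response.

# Quick sanity check that the local llama2:7b model is reachable through Ollama
from langchain.llms import Ollama

llm = Ollama(model="llama2:7b")
print(llm("Answer in one sentence: what is retrieval-augmented generation?"))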
Fully local RAG example—query code
# LLM
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = Ollama(
    model="llama2:7b",
    verbose=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)

# QA chain
from langchain.chains import RetrievalQA
from langchain import hub

# LangChain Hub is a repository of LangChain prompts shared by the community
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-llama")

qa_chain = RetrievalQA.from_chain_type(
    llm,
    # We create a retriever to interact with the db using an augmented context
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

result = qa_chain({"query": question})
Fully local RAG example—query output
In the retrieved context, President Biden refers to Nancy Pelosi as
“someone who I think will be considered the greatest Speaker in the history of this country.”
This suggests that the President has a high opinion of Pelosi’s leadership skills and accomplishments as Speaker of the House.
Note the difference in the output of the two snippets. The first is a literal chunk of text from the document relevant to the query; the second is a distilled answer to the query. In the first case we are not using the LLM at all, just the vector store to retrieve a chunk of text from the document. Only in the second case does the LLM come into play, generating a compact answer to the query.
To use RAG in practical applications, you will need to import multiple document types, such as PDF, DOCX, RTF, XLSX, and PPTX. Both LangChain and LlamaIndex (another popular framework for building LLM applications) provide specialized loaders for a variety of document types.
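As a taste of what that looks like, here is a hedged sketch using two of LangChain's loaders. PyPDFLoader requires the pypdf package and Docx2txtLoader requires docx2txt; the file names are placeholders, and loaders for other formats follow the same load-then-split pattern.

# Loading other document types with LangChain loaders
# (assumes `pip install pypdf docx2txt`; file names are placeholders)
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader

pdf_docs = PyPDFLoader("./report.pdf").load()      # typically one Document per page
docx_docs = Docx2txtLoader("./notes.docx").load()  # typically one Document per file

# The resulting Document objects can be split, embedded, and stored exactly as before
all_docs = pdf_docs + docx_docs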
In addition, you may want to explore other vector stores besides FAISS and SQLite-vss. Like large language models and other areas of generative AI, the vector database space is rapidly evolving. We’ll dive into other options along all of these fronts in future articles here.
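In the meantime, here is a hedged sketch of what swapping the vector store can look like: the same texts and embedding_function from the retrieval code above, stored in Chroma rather than SQLite-vss. Chroma has its own LangChain integration and requires the chromadb package; the persist directory below is a placeholder of mine.

# Swapping the vector store: same texts and embedding function, stored in Chroma
# (assumes `pip install chromadb`; the persist directory is a placeholder)
from langchain.vectorstores import Chroma

chroma_db = Chroma.from_texts(
    texts=texts,
    embedding=embedding_function,
    persist_directory="/tmp/chroma",
)

print(chroma_db.similarity_search(question)[0].page_content)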