By combining Retrieval Augmented Generation (RAG) with Generative AI, you can create a more intelligent and context-aware AI solution that generates more accurate text responses. I've previously written about building a generative AI application using C#, Phi-3, and the ONNX Runtime. This article takes the foundation covered in that article and adds the integration of Retrieval Augmented Generation (RAG) into the design.
If you’re unfamiliar with building generative AI applications using C# and ONNX Runtime, then I recommend you go read the “Build a Generative AI App in C# with Phi-3 and ONNX” article, then come back here.
What is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation (RAG) is a design pattern that can significantly enhance the capabilities of Generative AI applications built with large language models (LLMs). RAG does this by incorporating information retrieval into the application before the LLM is used to generate a response. This technique enables LLMs to provide more accurate, reliable, and contextually relevant responses by sourcing data from external sources, in addition to what the LLM was trained on.
The basic workflow of RAG is this:
Query Input: The user submits a query prompt to be processed.
Information Retrieval: The app searches an external data source for relevant information. Often this involves a vector database, but other search techniques can be used.
Data Augmentation: The app augments the user's prompt with the relevant data that was retrieved.
Response Generation: The app then calls the LLM with the augmented prompt to generate a more accurate response (see the sketch below).
Diagram: How Does RAG Work?
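To make these steps concrete, here's a minimal sketch of the workflow in C#. The vectorDatabase object matches the Build5Nines.SharpVector usage shown later in this article, but the llm object and its Generate() method are hypothetical placeholders for whatever model runtime you use (such as the ONNX Runtime GenAI loop from the previous article):

// Minimal RAG workflow sketch (llm.Generate() is a hypothetical placeholder)
// requires: using System.Linq;
string GenerateWithRag(string userPrompt)
{
    // 1. Information Retrieval: search the external data source for relevant text
    var searchResult = vectorDatabase.Search(userPrompt, pageCount: 5);

    // 2. Data Augmentation: prepend the retrieved text to the user's prompt
    var context = string.Join("\n\n", searchResult.Texts.Select(m => m.Text));
    var augmentedPrompt = context + "\n\n" + userPrompt;

    // 3. Response Generation: call the LLM with the augmented prompt
    return llm.Generate(augmentedPrompt);
}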
Existing implementations of Generative AI, like Microsoft Copilot and OpenAI ChatGPT, implement RAG in the core of their architecture. These are larger AI systems that offer more elaborate AI Agent architectures, but at the core is still RAG and Generative AI using LLMs.
Retrieval Augmented Generation using Vector Database
The most common method of implementing RAG is through the use of a Vector Database. A Vector Database converts text data into numerical vectors using techniques such as word embeddings and transformer models. The Vector Database can then be searched with a text query; it finds the vectors (and thus text) within the database that are most similar to the query, based on a distance metric computed with a mathematical algorithm such as cosine similarity.
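For illustration, cosine similarity measures the angle between two embedding vectors, producing a score where values near 1.0 indicate the two texts are semantically similar. A simple implementation might look like the following generic sketch (this is the textbook formula, not necessarily how any given vector database implements it internally):

// Cosine similarity: (A · B) / (|A| * |B|), ranging from -1.0 to 1.0
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];   // dot product
        magA += a[i] * a[i];  // squared magnitude of vector A
        magB += b[i] * b[i];  // squared magnitude of vector B
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}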
Load the Vector Database
The first step in implementing RAG within the Generative AI app written in C# is to add code that runs when the app starts to load some text data (or documents) into a vector database. In the following example, we'll use the Build5Nines.SharpVector in-memory vector database to load a folder of documents:
// Initialize the Vector Database
var vectorDatabase = new Build5Nines.SharpVector.BasicMemoryVectorDatabase();

// Load Documents into Vector Database
var loader = new Build5Nines.SharpVector.Data.TextDataLoader<int, string>(vectorDatabase);
foreach (var filename in Directory.GetFiles("./documents")) // folder path is an example
{
    var document = File.ReadAllText(filename);
    loader.AddDocument(document, new TextChunkingOptions<string>
    {
        Method = TextChunkingMethod.Paragraph,
        RetrieveMetadata = (chunk) => {
            // add JSON to the metadata containing the source filename
            return "{ \"filename\": \"" + filename + "\" }";
        }
    });
}
This example uses Text Chunking to break the loaded documents into individual paragraphs, which are loaded into the vector database. As a result, each of the search results returned later will be an individual paragraph. Also, the RetrieveMetadata lambda expression is used to set the Metadata for each text chunk to a JSON string containing the original filename of the document the paragraph came from. This can be used later to load the full document when building the augmented context used for AI generation.
More Information: You can find more examples and explanations of how to use the Build5Nines.SharpVector library in the "Perform Vector Database Similarity Search in .NET Apps using Build5Nines.SharpVector" article.
There are, of course, more robust vector database services, such as Azure Cosmos DB, Azure AI Search, and PostgreSQL with pgvector. For most enterprise applications you'll likely use one of those, but for smaller applications and learning exercises the Build5Nines.SharpVector library works great. Plus, Build5Nines.SharpVector is open source!
Use Prompt to Search Vector Database
In order to perform the Retrieval and Augmentation of the user's prompt with additional context and data, you'll need to actually search the Vector Database. The following is an example using the Build5Nines.SharpVector in-memory database that was previously loaded:
var searchResult = vectorDatabase.Search(
    userPrompt,      // User Prompt
    pageCount: 5,    // Return the first 5 similarity matches
    threshold: 0.3f  // Set vector comparison threshold (range is 0.0 to 1.0)
);
The following arguments are being passed into the .Search() method:
userPrompt: This is the user prompt or text that we want to perform a vector search for.
pageCount: The maximum number of results to return from the search query.
threshold: The threshold limits the search results to only the vector matches that meet this minimum similarity score. Values range from 0.0 to 1.0 and indicate how strong a match the result is. A result with a score of 0.0 is not a match, 0.1 is a weak match, and 0.8 would be a very strong match.
When performing the vector search, the threshold may need to be adjusted to fine tune the results. Set it too high and the search may return no results; set it too low and it may return too many weak matches.
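One hypothetical way to handle an overly strict threshold is to relax it when a search comes back empty. In this sketch, the starting threshold and step size are assumptions you'd tune for your own data (the .Any() call requires using System.Linq;):

// Relax the threshold if no results are returned (values are assumptions)
var threshold = 0.5f;
var searchResult = vectorDatabase.Search(userPrompt, pageCount: 5, threshold: threshold);
while (!searchResult.Texts.Any() && threshold > 0.1f)
{
    threshold -= 0.1f; // lower the bar and search again
    searchResult = vectorDatabase.Search(userPrompt, pageCount: 5, threshold: threshold);
}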
Build AI Prompt Context
Once the vector database has been searched and relevant text data has been retrieved, the full context for the AI prompt can be built before sending it to the LLM to generate a response. The Phi-3 SLM supports a prompt format that enables you to specify the System and User Prompts. This can be used to build a full prompt to pass to the AI that includes the context retrieved from the vector database in addition to the user's prompt. After the full prompt is built, it can be passed to the LLM to generate a response.
Phi-3 Prompting Format
When using Microsoft Phi-3 to generate a response, the prompt can use the following format that enables you to specify a System Prompt and the User Prompt:
var fullPrompt = $"<|system|>{sysPrompt}<|end|><|user|>{userPrompt}<|end|><|assistant|>";
This prompt formatting supported by Phi-3 uses the <|system|>, <|user|> and <|assistant|> markers to denote the start of the system and user prompts, as well as the AI "assistant" generated response. These markers can also be used to build a full conversation history when sending multiple user prompts to the AI. We'll leave the full conversation implementation for a future article.
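For example, a second turn could be assembled by appending the previous exchange ahead of the new user prompt. This is just a sketch based on the marker format above; the variable names are illustrative:

// Sketch: a two-turn conversation using the Phi-3 marker format
var conversation = $"<|system|>{sysPrompt}<|end|>"
    + $"<|user|>{firstUserPrompt}<|end|>"
    + $"<|assistant|>{firstResponse}<|end|>"
    + $"<|user|>{secondUserPrompt}<|end|><|assistant|>";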
System Prompt
The System Prompt could be left blank, or you could specify one to direct how the AI should respond. Here's an example System Prompt that could be used:
var sysPrompt = "You are a knowledgeable and friendly assistant made by Build5Nines named Jarvis. Answer the following question as clearly and concisely as possible, providing any relevant information and examples.";
Full Prompt with Augmented Context
The Phi-3 prompt format doesn't natively support a separate section for additional context, as is needed for a RAG implementation. The context data from the vector search still needs to be added to the AI prompt in order to complete the RAG implementation. Since Phi-3 doesn't support it natively, the User Prompt section can be used to inject the context data immediately before the user's prompt.
Here’s a sample format for building the full prompt with the RAG context included:
var fullPrompt = $"<|system|>{sysPrompt}<|end|><|user|>{ragContext}\n\n{userPrompt}<|end|><|assistant|>";
As you can see, the RAG context data (via the ragContext variable) is added immediately before the user prompt (via the userPrompt variable), with a double line break separating the two. This essentially makes the additional context part of the User Prompt, so the AI has the data it needs to generate a better, more accurate response.
When using the previous vector search code with Build5Nines.SharpVector, you can use the following code to build the RAG context:
var ragContext = string.Empty;
foreach (var match in searchResult.Texts)
{
    // Load just the Text from the Vector Search result
    ragContext += match.Text + "\n\n";

    // Alternatively, load the full file via the Metadata JSON 'filename'
    // (requires using System.Text.Json; and using System.Dynamic;)
    /*
    dynamic jsonData = JsonSerializer.Deserialize<ExpandoObject>(
        match.Metadata
    );
    var filename = jsonData.filename.ToString();
    var textdata = File.ReadAllText(filename);
    ragContext += textdata + "\n\n";
    */
}
Depending on the LLM used, the maximum supported prompt size (measured in tokens) will vary. The Phi-3 SLMs support a relatively small token count, so you'll likely need to limit the RAG context to a small amount of data. With an LLM like OpenAI's GPT-4, the context can be much larger.
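A simple way to stay within a small model's limit is to cap the context size before building the full prompt. This sketch uses a character count as a rough stand-in for a true token count (the budget value is an assumption; ideally you'd measure tokens with the model's tokenizer):

// Cap the RAG context (character length is only a rough proxy for tokens)
const int maxContextChars = 4000; // assumed budget; tune for your model
if (ragContext.Length > maxContextChars)
{
    ragContext = ragContext.Substring(0, maxContextChars);
}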
Full Code Sample: The Build5Nines.SharpVector project contains a full RAG + Generative AI sample using C# and Phi-3.
Summary
In this article, we built upon the foundation of building a Generative AI app using C#, Phi-3, and ONNX Runtime by adding an implementation of Retrieval Augmented Generation (RAG). This enables the generative AI solution to achieve increased accuracy and relevance in the responses generated for user prompts. This technique can be used to integrate generative AI with enterprise data or any other additional context needed to customize the solution to meet your business requirements.