Building a RAG system in C# with Semantic Kernel
Introduction
If you’ve tried using an LLM to answer questions about your own data — company docs, product specs, internal knowledge bases — you’ve probably noticed that it either hallucinates or just says “I don’t have information about that.” That’s because the model only knows what it was trained on.
RAG (Retrieval-Augmented Generation) fixes this. Instead of fine-tuning a model on your data, you retrieve relevant chunks of your documents at query time and pass them to the LLM as context. The model then generates answers grounded in your actual data.
In this post, I’ll walk you through building a complete RAG pipeline in C# using Semantic Kernel.
How RAG works
The flow is straightforward:
- Ingest: Split your documents into chunks, generate embeddings for each chunk, store them in a vector database
- Query: When a user asks a question, generate an embedding for the query, search the vector database for similar chunks
- Generate: Pass the retrieved chunks as context to the LLM along with the user’s question
That’s it. The magic is in the embeddings — they capture the semantic meaning of text as vectors, so you can find relevant content even when the exact words don’t match.
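To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors and their labels are invented for illustration (real embeddings come from the model and have far more dimensions), and `CosineSimilarity` is a hypothetical helper, not part of Semantic Kernel.

```csharp
using System;
using System.Linq;

// Toy 3-dimensional "embeddings". These numbers are made up for illustration.
var documents = new (string Text, float[] Vector)[]
{
    ("How to reset your password", new[] { 0.9f, 0.1f, 0.0f }),
    ("Supported operating systems", new[] { 0.1f, 0.9f, 0.1f }),
    ("Recovering a locked account", new[] { 0.8f, 0.2f, 0.1f }),
};

// Pretend this is the embedding of "I forgot my login credentials".
var query = new[] { 0.85f, 0.15f, 0.05f };

// Rank documents by similarity to the query, most similar first.
var ranked = documents
    .Select(d => (d.Text, Score: CosineSimilarity(query, d.Vector)))
    .OrderByDescending(r => r.Score)
    .ToList();

foreach (var (text, score) in ranked)
{
    Console.WriteLine($"{score:F3}  {text}");
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Higher means more similar.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```

The query shares no words with the password documents, yet they score far higher than the operating-systems one because their vectors point in nearly the same direction. A vector database does the same comparison for you, at scale.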
Prerequisites
dotnet add package Microsoft.SemanticKernel
dotnet add package Microsoft.SemanticKernel.Connectors.AzureOpenAI
dotnet add package Microsoft.Extensions.VectorData.Abstractions
dotnet add package Microsoft.SemanticKernel.Connectors.InMemory
For production, you’d swap the in-memory store for Azure AI Search, Qdrant, Pinecone, or any other supported vector database. But in-memory is perfect for learning and prototyping.
Setting up the kernel
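using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.AzureOpenAI;
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Connectors.InMemory;
using Microsoft.SemanticKernel.Embeddings;
// `config` is assumed to be an IConfiguration instance holding your Azure OpenAI settings.
// Depending on your Semantic Kernel version, the embedding APIs may be marked experimental;
// if the compiler reports an SKEXP diagnostic (e.g. SKEXP0001, SKEXP0010), suppress the one it names.
var builder = Kernel.CreateBuilder();
// Chat model: generates the final answer.
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",
    endpoint: config["AzureOpenAI:Endpoint"],
    apiKey: config["AzureOpenAI:ApiKey"]);
// Embedding model: turns text into vectors.
builder.AddAzureOpenAITextEmbeddingGeneration(
    deploymentName: "text-embedding-3-small",
    endpoint: config["AzureOpenAI:Endpoint"],
    apiKey: config["AzureOpenAI:ApiKey"]);
var kernel = builder.Build();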
We need two models: one for chat completion (answering questions) and one for generating embeddings (turning text into vectors).
Defining the data model
We need a class to represent our document chunks in the vector store:
using Microsoft.Extensions.VectorData;
public class DocumentChunk
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = Guid.NewGuid().ToString();
    [VectorStoreRecordData]
    public string Content { get; set; } = string.Empty;
    [VectorStoreRecordData]
    public string Source { get; set; } = string.Empty;
    [VectorStoreRecordData]
    public int ChunkIndex { get; set; }
    [VectorStoreRecordVector(1536)]
    public ReadOnlyMemory<float> Embedding { get; set; }
}
The VectorStoreRecordVector(1536) attribute tells the vector store the dimension of our embeddings. The text-embedding-3-small model produces 1536-dimensional vectors.
Chunking documents
Before we can create embeddings, we need to split our documents into manageable chunks. Here’s a simple text splitter:
public static class TextChunker
{
    // Sizes are measured in characters, not tokens.
    public static List<string> SplitText(string text, int maxChunkSize = 500, int overlap = 50)
    {
        var chunks = new List<string>();
        var paragraphs = text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        var currentChunk = new System.Text.StringBuilder();
        foreach (var paragraph in paragraphs)
        {
            // Flush the current chunk if adding this paragraph would exceed the limit.
            // A single paragraph longer than maxChunkSize is kept whole, not split.
            if (currentChunk.Length + paragraph.Length > maxChunkSize && currentChunk.Length > 0)
            {
                // Trim first so the overlap is taken from real text, not trailing newlines.
                var finished = currentChunk.ToString().Trim();
                chunks.Add(finished);
                currentChunk.Clear();
                // Carry the tail of the finished chunk into the next one as overlap.
                if (finished.Length > overlap)
                {
                    currentChunk.Append(finished[^overlap..]);
                    currentChunk.Append(' ');
                }
            }
            currentChunk.Append(paragraph);
            currentChunk.Append("\n\n");
        }
        if (currentChunk.Length > 0)
        {
            chunks.Add(currentChunk.ToString().Trim());
        }
        return chunks;
    }
}
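The overlap is important: it copies the tail of each chunk into the start of the next, so context at the boundary between chunks isn’t lost entirely. Keep the limits of this simple splitter in mind, though. It measures sizes in characters rather than tokens, the overlap is a raw character slice that can start mid-word, and a paragraph longer than maxChunkSize is kept as a single oversized chunk.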
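To see overlap in isolation, here is a minimal sketch of a sliding window over words. It is a different, simpler technique than the paragraph-based splitter above, and `SplitWithOverlap` and its parameters are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

var sample = "one two three four five six seven eight nine ten";
var windows = SplitWithOverlap(sample, maxWords: 4, overlapWords: 1);

foreach (var window in windows)
{
    Console.WriteLine(window);
}

// Splits text into windows of at most maxWords words. Each window after the
// first starts with the last overlapWords words of the previous window.
static List<string> SplitWithOverlap(string text, int maxWords, int overlapWords)
{
    if (maxWords <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxWords));
    if (overlapWords < 0 || overlapWords >= maxWords)
        throw new ArgumentOutOfRangeException(nameof(overlapWords));

    var words = text.Split(new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    var result = new List<string>();
    int step = maxWords - overlapWords;

    for (int start = 0; start < words.Length; start += step)
    {
        int count = Math.Min(maxWords, words.Length - start);
        result.Add(string.Join(" ", words, start, count));
        if (start + count >= words.Length) break; // this window reached the end
    }
    return result;
}
```

Because the window advances by `maxWords - overlapWords`, the last word of each window is repeated as the first word of the next. Splitting on words also avoids the mid-word cuts that a raw character slice can produce.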
Ingesting documents
Now let’s put it all together to ingest documents into our vector store:
var vectorStore = new InMemoryVectorStore();
var collection = vectorStore.GetCollection<string, DocumentChunk>("documents");
await collection.CreateCollectionIfNotExistsAsync();
var embeddingService = kernel.GetRequiredService<ITextEmbeddingGenerationService>();
async Task IngestDocument(string content, string source)
{
    var chunks = TextChunker.SplitText(content);
    for (int i = 0; i < chunks.Count; i++)
    {
        // One embedding call per chunk; the vector is stored alongside the text and metadata.
        var embedding = await embeddingService.GenerateEmbeddingAsync(chunks[i]);
        var chunk = new DocumentChunk
        {
            Content = chunks[i],
            Source = source,
            ChunkIndex = i,
            Embedding = embedding
        };
        await collection.UpsertAsync(chunk);
    }
    Console.WriteLine($"✅ Ingested {chunks.Count} chunks from {source}");
}
// Ingest some documents
var doc1 = await File.ReadAllTextAsync("docs/product-guide.md");
var doc2 = await File.ReadAllTextAsync("docs/faq.md");
var doc3 = await File.ReadAllTextAsync("docs/troubleshooting.md");
await IngestDocument(doc1, "product-guide.md");
await IngestDocument(doc2, "faq.md");
await IngestDocument(doc3, "troubleshooting.md");
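One thing to watch with this ingestion code: DocumentChunk.Id defaults to a fresh GUID, so ingesting the same file twice stores every chunk twice instead of overwriting it. A possible fix is to derive the ID from the source name and chunk index. The sketch below assumes that approach; `MakeChunkId` is a hypothetical helper, and hashing to a GUID-shaped string is just one way to get a stable key.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

var first = MakeChunkId("faq.md", 0);
var again = MakeChunkId("faq.md", 0);
var other = MakeChunkId("faq.md", 1);

Console.WriteLine($"Same chunk, same ID: {first == again}");
Console.WriteLine($"Different chunk, same ID: {first == other}");

// Derives a stable ID from the source name and chunk index (hypothetical helper).
// The same inputs always produce the same ID, so an upsert replaces the old record.
static string MakeChunkId(string source, int chunkIndex)
{
    var hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{source}#{chunkIndex}"));
    // Use the first 16 bytes of the hash as a GUID-shaped string.
    return new Guid(hash.AsSpan(0, 16)).ToString();
}
```

In IngestDocument you would then set `Id = MakeChunkId(source, i)` in the object initializer. This does not clean up stale records: if a document shrinks and produces fewer chunks, the old higher-index chunks stay in the store until you delete them.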
Searching for relevant chunks
When a user asks a question, we generate an embedding for their query and search for similar chunks:
async Task<List<DocumentChunk>> SearchAsync(string query, int topK = 3)
{
    // Embed the query with the same model used for the chunks.
    var queryEmbedding = await embeddingService.GenerateEmbeddingAsync(query);
    var searchResults = await collection.VectorizedSearchAsync(
        queryEmbedding,
        new VectorSearchOptions { Top = topK });
    // Collect the matching records from the async result stream.
    var results = new List<DocumentChunk>();
    await foreach (var result in searchResults.Results)
    {
        results.Add(result.Record);
    }
    return results;
}
Generating answers with context
Now the RAG part — we take the retrieved chunks and include them as context in our prompt:
using Microsoft.SemanticKernel.ChatCompletion;
var chatService = kernel.GetRequiredService<IChatCompletionService>();
async Task<string> AskAsync(string question)
{
    // Step 1: Retrieve relevant chunks
    var relevantChunks = await SearchAsync(question);
    // Step 2: Build context from chunks
    var context = string.Join("\n\n---\n\n",
        relevantChunks.Select(c => $"[Source: {c.Source}]\n{c.Content}"));
    // Step 3: Generate answer with context
    var history = new ChatHistory();
    history.AddSystemMessage($$"""
        You are a helpful assistant that answers questions based on the provided context.
        Use ONLY the information from the context to answer. If the context doesn't contain
        enough information to answer the question, say "I don't have enough information
        to answer that question."
        Do not make up information. Always cite the source document when possible.
        Context:
        {{context}}
        """);
    history.AddUserMessage(question);
    var response = await chatService.GetChatMessageContentAsync(history);
    return response.Content ?? "No response generated.";
}
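AskAsync joins every retrieved chunk into the prompt, which is fine for three short chunks but can grow quickly with a larger topK or bigger chunks. As a sketch of one way to cap it, the helper below adds chunks in ranked order until a size budget is reached. `BuildContext` and `maxChars` are hypothetical names, and counting characters is only a rough stand-in for a real token budget.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Sample retrieved chunks in ranked order (invented content for illustration).
var retrieved = new List<(string Source, string Content)>
{
    ("faq.md", "To reset your password, open Settings and choose Security."),
    ("product-guide.md", "Passwords must be at least 12 characters long."),
    ("troubleshooting.md", "If the reset email does not arrive, check your spam folder."),
};

var budgeted = BuildContext(retrieved, maxChars: 200);
Console.WriteLine(budgeted);

// Appends chunks in ranked order and stops before the budget would be exceeded.
// maxChars is a rough proxy for a token budget (an assumption of this sketch).
static string BuildContext(IEnumerable<(string Source, string Content)> chunks, int maxChars)
{
    const string separator = "\n\n---\n\n";
    var sb = new StringBuilder();

    foreach (var (source, content) in chunks)
    {
        var entry = $"[Source: {source}]\n{content}";
        var needed = entry.Length + (sb.Length > 0 ? separator.Length : 0);
        if (sb.Length + needed > maxChars) break;

        if (sb.Length > 0) sb.Append(separator);
        sb.Append(entry);
    }
    return sb.ToString();
}
```

Stopping at the first chunk that does not fit keeps the highest-ranked chunks intact rather than truncating one mid-sentence. The output uses the same "[Source: ...]" layout and separator as AskAsync, so it could replace the string.Join call there.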
Using it
// Ask questions about your documents
var answer1 = await AskAsync("How do I reset my password?");
Console.WriteLine($"Q: How do I reset my password?\nA: {answer1}\n");
var answer2 = await AskAsync("What are the system requirements?");
Console.WriteLine($"Q: What are the system requirements?\nA: {answer2}\n");
var answer3 = await AskAsync("What's the capital of France?");
Console.WriteLine($"Q: What's the capital of France?\nA: {answer3}\n");
// Should respond with "I don't have enough information" since it's not in the docs
Moving to production
The in-memory vector store is great for prototyping, but for production you’ll want a persistent vector database. Semantic Kernel has connectors for several options:
# Azure AI Search
dotnet add package Microsoft.SemanticKernel.Connectors.AzureAISearch
# Qdrant
dotnet add package Microsoft.SemanticKernel.Connectors.Qdrant
# Redis
dotnet add package Microsoft.SemanticKernel.Connectors.Redis
Swapping is straightforward since they all implement the same IVectorStore interface:
// Instead of InMemoryVectorStore, use:
using Azure;
using Microsoft.SemanticKernel.Connectors.AzureAISearch;
var vectorStore = new AzureAISearchVectorStore(
    new Azure.Search.Documents.Indexes.SearchIndexClient(
        new Uri(config["AzureAISearch:Endpoint"]),
        new AzureKeyCredential(config["AzureAISearch:ApiKey"])));
Everything else stays the same. That’s the beauty of the abstraction.
Tips from building RAG systems
A few things I’ve learned the hard way:
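- Chunk size matters a lot. Too small and you lose context. Too large and you waste tokens on irrelevant content. Start with 500-800 tokens and adjust based on your data. Note that the TextChunker above counts characters, not tokens, so its 500-character default is much smaller than that (very roughly 125 tokens of English text, assuming about 4 characters per token).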
- Overlap prevents boundary issues. A 50-100 token overlap between chunks is usually enough.
- Retrieve more than you think. Start with topK = 5 and reduce if you’re getting too much noise. It’s better to have extra context than to miss the relevant chunk.
- System prompts are crucial. Be very explicit about using only the provided context. Without that instruction, the model will happily hallucinate “based on its training data.”
- Track sources. Always store metadata with your chunks so you can cite where the answer came from. Users trust answers more when they can verify the source.
- Re-rank if needed. Vector similarity isn’t perfect. For critical applications, add a re-ranking step using a cross-encoder model to improve precision.
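The "retrieve more, then re-rank" idea from the tips above can be sketched as follows. This shows only the shape of the step: `Rerank` and `KeywordOverlap` are hypothetical helpers, and the keyword-overlap scorer is a naive stand-in, not a cross-encoder. In a real system you would pass a function that calls a re-ranking model instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Candidates as they might come back from a generous vector search (invented text).
var candidates = new List<string>
{
    "Our office is closed on public holidays.",
    "To reset your password, open Settings and choose Security.",
    "Password reset emails can take a few minutes to arrive.",
};

var top = Rerank("how do I reset my password", candidates, KeywordOverlap, keep: 2);
foreach (var item in top)
{
    Console.WriteLine(item);
}

// Re-scores every candidate against the question and keeps the best `keep` results.
// `score` is pluggable so a model-based scorer can replace the stand-in below.
static List<string> Rerank(
    string question, IEnumerable<string> chunks, Func<string, string, double> score, int keep)
{
    return chunks
        .OrderByDescending(c => score(question, c))
        .Take(keep)
        .ToList();
}

// Naive stand-in scorer: fraction of question words that also occur in the chunk.
static double KeywordOverlap(string question, string chunk)
{
    var separators = new[] { ' ', ',', '.', '?', '!' };
    var questionWords = question.ToLowerInvariant()
        .Split(separators, StringSplitOptions.RemoveEmptyEntries)
        .Distinct()
        .ToArray();
    var chunkWords = new HashSet<string>(
        chunk.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries));

    if (questionWords.Length == 0) return 0;
    return questionWords.Count(w => chunkWords.Contains(w)) / (double)questionWords.Length;
}
```

The pattern is to call the vector search with a larger topK, run the results through Rerank, and send only the survivors to the LLM.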
Conclusion
RAG is one of the most practical patterns in AI right now. It lets you build AI-powered Q&A systems over your own data without fine-tuning, and Semantic Kernel makes it surprisingly clean in C#. Start with the in-memory store, get your chunking and prompts right, then swap in a real vector database when you’re ready for production.
Happy coding!