How To Use LLMs: Retrieval-Augmented Generation (RAG Systems)

Sascha · Yesterday at 12:59

RAG (Retrieval-Augmented Generation) is one of the most practical ways developers are applying LLMs today.

Large Language Models (LLMs) are very good at writing and reasoning in natural language. But used naively, they come with three practical limits:

Hallucinations: LLMs can make things up because they predict plausible-sounding text from learned patterns rather than looking facts up.
Outdated knowledge: An LLM's knowledge is frozen at training time, so it doesn't know about events after its last update.
Limited context window: LLMs can't fit huge knowledge bases, like a company wiki or long PDFs, into their limited prompt window, so they miss crucial details.

Retrieval-Augmented Generation (RAG) addresses these problems by pairing an LLM with a search layer.

Let's unpack that...

Retrieval

Information retrieval means finding relevant data within large datasets based on a user's query.

Key Components of Information Retrieval

Indexing: Indexing builds a well-organized catalog of the information by breaking documents down into words or phrases, which makes the collection easy to search.
Querying: Querying searches the indexed data for entries that match the query input.
Ranking: Ranking sorts the search results by relevance so that the most relevant documents appear at the top.
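
The three components above can be sketched in a few lines of plain Python. This is only a toy illustration, not how a production search engine works: the corpus, the document IDs, the whitespace tokenizer, and the count-based score are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy corpus; the document IDs and texts are made up for illustration.
DOCS = {
    "d1": "RAG pairs an LLM with a search layer",
    "d2": "An index makes search over documents fast",
    "d3": "LLMs predict the next word",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Indexing: map each term to the documents containing it, with counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Querying and ranking: score each document by its counts of the query terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_index(DOCS)
results = search(index, "search documents")
print(results)
```

Here d2 ranks first because it contains both query terms, while d1 contains only one.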

Types of Retrieval Systems

Boolean Retrieval Model: This uses the Boolean operators AND, OR, and NOT to match documents with queries. It gives precise control over the search and is best for strict, non-negotiable requirements.
Probabilistic Retrieval Model: This ranks documents by the estimated probability that they are relevant to the user's query. It works best when there is historical data from which those relevance statistics can be estimated.
Vector Space Model: This represents documents and queries as vectors, with each dimension representing a unique term from the vocabulary, and ranks documents by their similarity to the query. It is best for large datasets and partial-match queries.
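
To make the difference concrete, here is a minimal sketch of a Boolean filter next to a vector space ranking. The three documents, the function names, and the use of raw term counts as vector weights are assumptions for illustration; real systems usually weight terms (for example with TF-IDF) and use proper tokenization.

```python
import math
from collections import Counter

# Hypothetical mini-corpus for the example.
DOCS = {
    "d1": "refund policy for damaged items",
    "d2": "shipping policy and delivery times",
    "d3": "how to reset your password",
}

def terms(text):
    return text.lower().split()

def boolean_match(docs, required, excluded=()):
    """Boolean model: keep documents with ALL required terms and NONE of the excluded ones."""
    hits = []
    for doc_id, text in docs.items():
        words = set(terms(text))
        if all(t in words for t in required) and not any(t in words for t in excluded):
            hits.append(doc_id)
    return hits

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_rank(docs, query):
    """Vector space model: rank every document by cosine similarity to the query."""
    q = Counter(terms(query))
    scored = [(doc_id, cosine(q, Counter(terms(text)))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda item: -item[1])

# "policy AND NOT shipping": an exact yes/no filter.
print(boolean_match(DOCS, required=["policy"], excluded=["shipping"]))
# Partial matches still get a score, so d2 is ranked below d1 rather than dropped.
print(vector_rank(DOCS, "refund policy"))
```

The Boolean filter either returns a document or it does not, while the vector ranking gives partial matches a lower score instead of discarding them.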

Text Generation

Behind text generation are neural networks called language models. These models don't just memorize words; they learn language patterns, structure, and context to predict the next word. To get correct and relevant responses, we also need good prompting skills.

Some of the model's parameters can also be tuned to get better responses. These parameters control the behavior of the text generation process, influencing the quality and diversity of the output.

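temperature: This adjusts the randomness of the generated text. Lower values give focused, predictable output; higher values give more creative output.
top-k sampling: This restricts the choices for the next word to the k most likely options, reducing randomness.
top-p sampling: This restricts the choices to the smallest set of words whose cumulative probability reaches p.
repetition penalty: This lowers the score of words that have already appeared, reducing repetitive phrases and making responses more diverse and human-like.
sampling mode: When enabled, the next word is drawn at random from the predicted distribution instead of always taking the most likely one, creating more varied and creative text.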
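
A small sketch may help show how these knobs interact. It is written in plain Python over a made-up four-word vocabulary; the function names, the example scores, and the order in which the filters are applied are assumptions for illustration, and real libraries differ in the details. The repetition penalty follows one common formulation (divide positive scores, multiply negative ones).

```python
import math
import random

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None,
                            repetition_penalty=1.0, previous=()):
    """Turn raw scores (logits) into a filtered probability distribution."""
    scores = dict(logits)
    # Repetition penalty: make tokens that were already generated less attractive.
    for tok in set(previous):
        if tok in scores:
            s = scores[tok]
            scores[tok] = s / repetition_penalty if s > 0 else s * repetition_penalty
    # Temperature (must be > 0): below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: s / temperature for tok, s in scores.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()), key=lambda x: -x[1])
    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in ranked)
    return {tok: p / norm for tok, p in ranked}

def pick(dist, do_sample=True, rng=random):
    """Sampling mode on: draw at random. Off: always take the most likely token."""
    if not do_sample:
        return max(dist, key=dist.get)
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the next word.
LOGITS = {"cat": 2.0, "dog": 1.5, "car": 0.5, "sky": -1.0}
greedy = pick(next_token_distribution(LOGITS), do_sample=False)
top2 = next_token_distribution(LOGITS, top_k=2)
print(greedy, sorted(top2))
```

With sampling turned off the most likely word always wins; with `top_k=2` only "cat" and "dog" remain as candidates, and their probabilities are rescaled to sum to 1.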

Retrieval-Augmented Generation (RAG)

Traditional generation models struggle with accuracy and relevance.
Retrieval models struggle to generate sensible text.

RAG is a hybrid approach that improves text generation by drawing on information from a large document corpus, which leads to more accurate responses.
It improves Large Language Models (LLMs) by combining two processes:

Retrieval means searching and pulling in relevant information from external sources like a knowledge base, database, PDFs, or vector database.
Generation means using an LLM to take that retrieved information and generate a fluent, natural-language answer.

In simple terms:

Retrieval finds the facts

Generation writes the answer

RAG = LLM + Search.

How RAG Works Step by Step

Data collection: Collect every source the system needs to know: PDFs, web pages, Notion/Confluence pages, database rows, customer-support transcripts, product specs, research papers, etc.
Chunking: Large documents must be split into smaller pieces (chunks) that fit into embedding and model context windows.
Embedding: Convert each chunk to a fixed-size vector that captures its semantics. These vectors let us find similar text using math.
Storage (vector DB / index): Store the vectors in a vector database or nearest-neighbor index: FAISS, HNSWlib, Pinecone, Weaviate, Milvus, etc.
Input Query (user asks a question): The user submits a query (question, instruction). Usually the query is embedded using the same embedding model as the chunks.
Retrieve (similarity search & reranking): Find the top-k chunks most similar to the query vector. Typical k is between 3 and 20 depending on chunk size and task.
Augment (prepare prompt + context): Take the retrieved chunks and add them to the LLM prompt in a controlled way so the LLM can use them as evidence.
Generate (LLM produces the final answer): The LLM synthesizes the retrieved context + the input query (question) and produces a grounded, well-written response.
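
The steps above can be sketched end to end without any external library. Everything here is a simplified stand-in: the `ToyEmbedder` class counts vocabulary words instead of calling a real embedding model, the "vector store" is a Python list, and the two documents, the chunk sizes, and the prompt wording are assumptions for illustration. The final generation step is left as a comment because it depends on the LLM you use.

```python
import math

def chunk(text, size=12, overlap=4):
    """Chunking: word windows of `size`, each sharing `overlap` words with the previous one."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

def tokens(text):
    return [w.strip(".,?!:;").lower() for w in text.split() if w.strip(".,?!:;")]

class ToyEmbedder:
    """Stand-in for a real embedding model: one dimension per vocabulary word."""
    def __init__(self, texts):
        vocab = sorted({t for text in texts for t in tokens(text)})
        self.position = {word: i for i, word in enumerate(vocab)}

    def embed(self, text):
        vec = [0.0] * len(self.position)
        for t in tokens(text):
            if t in self.position:  # words outside the vocabulary are ignored
                vec[self.position[t]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

def retrieve(query_vec, store, k=2):
    """Retrieve: top-k chunks by cosine similarity (vectors are already normalized)."""
    scored = [(sum(a * b for a, b in zip(query_vec, vec)), text) for text, vec in store]
    scored.sort(key=lambda item: -item[0])
    return [text for _, text in scored[:k]]

def build_prompt(question, contexts):
    """Augment: place the retrieved chunks in the prompt as numbered evidence."""
    context_block = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )

# Data collection (made-up documents).
documents = [
    "Refunds are available within 30 days of purchase for unused items. "
    "Contact support with your order number to start a refund.",
    "Standard shipping takes five business days within the country.",
]
chunks = [c for doc in documents for c in chunk(doc)]      # Chunking
embedder = ToyEmbedder(chunks)
store = [(c, embedder.embed(c)) for c in chunks]           # Embedding + storage

question = "How do I start a refund?"                      # Input query
retrieved = retrieve(embedder.embed(question), store)      # Retrieve
prompt = build_prompt(question, retrieved)                 # Augment
print(prompt)
# Generate: send `prompt` to the LLM of your choice.
```

Note that the query is embedded with the same embedder as the chunks, which is what makes the similarity scores comparable.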

Practical Implementation Using LangChain and OpenAI

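# Import paths assume the LangChain 0.2/0.3 package layout, where integrations
# live in langchain_community and langchain_openai. Older releases exposed them
# under langchain.vectorstores, langchain.embeddings, and langchain.llms.
# Requires the faiss-cpu package and the OPENAI_API_KEY environment variable.
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings

# 1. Prepare documents
docs = ["LLMs are powerful", "RAG helps with private data"]

# 2. Create embeddings
embeddings = OpenAIEmbeddings()

# 3. Create vector store
vectorstore = FAISS.from_texts(docs, embeddings)

# 4. Build RAG chain
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
)

# 5. Ask a question
question = "What is RAG?"

# 6. Execute and print result (invoke replaces the deprecated run method)
result = qa.invoke({"query": question})
print(result["result"])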

Some Real-World Use Cases of RAG

Customer Support Bots

Problem: Traditional chatbots struggle when users ask detailed questions about niche company policies, product manuals, or troubleshooting steps. They either hand off to human agents or give generic, unhelpful responses.

RAG Solution:

Store company knowledge (FAQs, documentation, troubleshooting guides) in a vector database.
When a customer asks a question, the system retrieves the most relevant sections and feeds them into the LLM.
The LLM then crafts a response tailored to the customer’s question, grounded in the company’s own documents.

Other use cases:

Medical assistants retrieving recent research.
Legal advisors searching law databases.
Personalized learning assistants fetching textbooks.

Some advanced topics to improve RAG systems

RAG systems use external knowledge during response generation, retrieving relevant data from larger datasets.
LongRAG (which preserves context by using larger token segments) and LightRAG (which uses graph-based retrieval) extend the original RAG architecture to address context fragmentation and inefficient handling of long contexts.

Summary

At its core, Retrieval-Augmented Generation (RAG) is about combining two complementary strengths:

Retrieval handles the facts. It pulls in the most relevant, up-to-date, and domain-specific information.
Generation handles the language. It takes those facts and turns them into clear, human-like answers.

By joining these two pieces, RAG transforms LLMs from general-purpose text generators into practical, reliable, and customizable assistants that can work with your unique data, stay current, and reduce hallucinations.

This makes RAG one of the most important building blocks in applied AI today.

The next post will go one step further and cover:

AI Agents – LLMs that can take actions, not just generate answers.
Agentic RAG – where retrieval becomes part of a larger reasoning-and-action pipeline.
RAGAS (Retrieval-Augmented Generation Assessment) – tools and techniques for evaluating the quality and reliability of RAG systems.

Stay tuned and happy coding!!!
