RAG (Retrieval-Augmented Generation) Workflow
Import Required Libraries
This step imports every library the workflow needs: dataset loading, text splitting, embeddings, the retrieval index, and evaluation metrics. A sketch of the corresponding import statements follows the list below.
dotenv: For loading environment variables (e.g., API keys).
load_diabetes: Provides the diabetes dataset from scikit-learn.
LangChain libraries: Tools for text splitting, embeddings, and setting up a question-answering (QA) pipeline.
FAISS: A library for efficient similarity search and clustering of dense vectors.
Ragas: For evaluating retrieval-based question-answering systems.
Dataset: From the datasets library, useful for organizing data for evaluation.
userdata: Used for securely retrieving sensitive data in Google Colab.
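A minimal sketch of the import block these descriptions imply. The module paths assume the current split langchain-openai / langchain-community packages; adjust them if you are on an older monolithic langchain release.
from dotenv import load_dotenv
from sklearn.datasets import load_diabetes
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from ragas import evaluate
from datasets import Dataset
from google.colab import userdata  # Colab-only; omit outside Colab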
1. Ground Truth - Source of Truth
The foundation of the system lies in the source data. In this example:
diabetes = load_diabetes()
raw_text = diabetes.DESCR
- load_diabetes(): Fetches the diabetes dataset from sklearn.datasets.
- diabetes.DESCR: Contains a detailed description of the dataset (variables, data characteristics).
- raw_text: Represents the "Ground Truth" that the RAG system will reference for its operations.
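Before the embedding calls below will work, the OpenAI key must be in the environment. A hedged sketch, assuming the key lives either in a local .env file or under a Colab secret named OPENAI_API_KEY (the secret name is hypothetical; use whatever name you registered):
import os
load_dotenv()  # picks up OPENAI_API_KEY from a local .env file, if present
# In Colab, pull it from the secrets manager instead:
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")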
2. Retrieval - Finding Relevant Chunks
The raw text is split into smaller chunks and indexed in a FAISS vector store; at query time, the chunks most relevant to the query are retrieved (a quick sanity check follows the bullets below).
# Split into chunks (simulate document retrieval)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_text(raw_text)
# Create Embeddings & Build FAISS Index
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002") # Recommended model
docsearch = FAISS.from_texts(texts, embeddings)
- RecursiveCharacterTextSplitter: Splits the large raw_text into smaller, overlapping segments (texts).
- OpenAIEmbeddings: Converts each text chunk into a numerical vector representation.
- FAISS.from_texts(texts, embeddings): Builds an index of these embeddings for efficient retrieval.
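As a sanity check of the retrieval step, you can query the index directly. similarity_search is the standard LangChain vector-store method; the query string and k=2 are just illustrative choices:
# Retrieve the two chunks closest to an example query
hits = docsearch.similarity_search("What variables does the diabetes dataset contain?", k=2)
for doc in hits:
    print(doc.page_content[:120])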
3. Augmentation - Adding Context to the Query
The RetrievalQA chain in LangChain manages the augmentation step.
- It takes the user's query and uses the docsearch index to find relevant documents.
- These documents are passed to the LLM along with the original query, providing contextual information.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,  # helpful for debugging
)
- RetrievalQA.from_chain_type sets up the RAG pipeline.
- retriever=docsearch.as_retriever() connects the FAISS index to the QA chain so it can fetch the documents most relevant to each query.
- chain_type="stuff" concatenates ("stuffs") all retrieved documents into a single prompt together with the query before sending it to the LLM.
- return_source_documents=True returns the retrieved chunks alongside the answer, which enables traceability.
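By default the retriever returns LangChain's standard number of nearest chunks; to control how many are stuffed into the prompt, search_kwargs is the documented knob (the value 3 here is just an example):
# Limit the prompt to the top-3 retrieved chunks;
# pass this as retriever=... when building the chain above
retriever = docsearch.as_retriever(search_kwargs={"k": 3})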
4. Generation - Producing the Answer
The LLM generates a response by combining each query with the context retrieved in the previous step.
queries = ["What is the diabetes dataset about?"]  # example queries; extend as needed
answers, contexts = [], []

for query in queries:
    result = qa_chain.invoke({"query": query})
    answers.append(result["result"])
    # Extract retrieved docs for Ragas evaluation
    retrieved_docs = result.get("source_documents", [])
    contexts.append([doc.page_content for doc in retrieved_docs])
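The Ragas evaluation promised by the imports can now run over the collected answers and contexts. A minimal sketch, assuming the classic Ragas column schema ("question", "answer", "contexts"); faithfulness and answer_relevancy are Ragas metrics that need no hand-written ground truths, but check the column names your installed version expects:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_dataset = Dataset.from_dict({
    "question": queries,
    "answer": answers,
    "contexts": contexts,
})
scores = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy])
print(scores)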
5. Traceability - Explaining the Source
The system ensures traceability by showing the origin of the retrieved information, helping users understand the response's basis.
retrieved_docs = result.get("source_documents", [])
for i, doc in enumerate(retrieved_docs):
    print(f"Source {i + 1}: {doc.page_content[:200]}")