Building Production-Ready RAG Applications with LangChain and Pinecone
Retrieval-Augmented Generation (RAG) has become the go-to architecture for building intelligent applications that need to answer questions based on your own data. In this guide, I'll walk you through building a production-ready RAG system that goes beyond the basic tutorials.
Why RAG?
Traditional LLMs have a knowledge cutoff date and can't access your proprietary data. RAG solves this by:
- Grounding responses in your data - fewer hallucinations, since answers are tied to retrieved context
- Cost efficiency - context is retrieved at query time instead of baked in through fine-tuning
- Real-time updates - Your knowledge base stays current
- Source attribution - Users can verify information
Architecture Overview
A production RAG system consists of several key components:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Vector store with Pinecone
vectorstore = Pinecone.from_existing_index(
    index_name="my-knowledge-base",
    embedding=embeddings,
)

# LLM for generation
llm = ChatOpenAI(model="gpt-4", temperature=0)

# RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)
Document Processing Pipeline
The quality of your RAG system depends heavily on how you process documents:
1. Chunking Strategy
Don't just split at a fixed character count. Use a splitter that respects document structure, falling back through progressively smaller separators:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # Important for context continuity
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)

chunks = text_splitter.split_documents(documents)
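To see what `chunk_overlap` buys you, here is a minimal character-window chunker — an illustrative stand-in, not LangChain's actual algorithm. Consecutive chunks share the overlap region, so a sentence cut at a chunk boundary still appears whole in at least one chunk:

```python
# Minimal sliding-window chunker: each chunk shares `overlap` characters
# with the previous one. Illustrative only -- RecursiveCharacterTextSplitter
# additionally prefers splitting at the separators listed above.

def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, overlap=4)
# The tail of each chunk repeats as the head of the next.
```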
2. Metadata Enrichment
Add rich metadata for better retrieval:
from datetime import datetime

for chunk in chunks:
    chunk.metadata.update({
        "source": document.source,  # the parent document being chunked
        "page": chunk.metadata.get("page", 0),
        "doc_type": classify_document(chunk),  # your own classifier function
        "created_at": datetime.now().isoformat(),
    })
3. Embedding Generation
Use the right embedding model for your domain:
- OpenAI text-embedding-ada-002: general purpose, great for most use cases
- Cohere multilingual: For multi-language support
- Custom fine-tuned: For specialized domains
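Whichever model you choose, retrieval ultimately ranks chunks by vector similarity between the query embedding and each chunk embedding — most commonly cosine similarity. A stdlib sketch of the metric itself:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction -> 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal -> 0.0
```

In practice the vector database computes this at scale; the point is that chunk ranking is driven entirely by these scores, which is why embedding model choice matters.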
Vector Database Best Practices
Indexing Strategies
import pinecone

# Initialize with metadata filtering support
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("my-knowledge-base")

# Upsert with metadata
index.upsert(vectors=[
    {
        "id": f"doc_{i}",
        "values": embedding,
        "metadata": metadata,
    }
    for i, (embedding, metadata) in enumerate(zip(embeddings, metadatas))
])
Hybrid Search
Combine semantic and keyword search for best results:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Keyword retriever
keyword_retriever = BM25Retriever.from_documents(documents)
keyword_retriever.k = 10

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, keyword_retriever],
    weights=[0.7, 0.3],
)
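EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion: each document scores by its rank in each list, weighted per retriever. A simplified pure-Python stand-in (hypothetical doc IDs; `c=60` is the conventional RRF constant):

```python
# Simplified weighted Reciprocal Rank Fusion. A document ranked highly by
# either retriever gets a large score; weights bias toward one retriever.

def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic ranking
keyword = ["doc_b", "doc_d", "doc_a"]    # hypothetical keyword ranking
fused = weighted_rrf([semantic, keyword], weights=[0.7, 0.3])
```

Note how the fused list surfaces documents that either retriever found, which is exactly why hybrid search catches exact-keyword queries that pure semantic search misses.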
Prompt Engineering for RAG
The prompt template is crucial for quality outputs:
from langchain.prompts import PromptTemplate
template = """You are an AI assistant helping users find accurate information.
Use the following pieces of context to answer the question. If you don't know
the answer based on the context, say so - don't make up information.
Context:
{context}
Question: {question}
Instructions:
1. Answer based only on the provided context
2. Cite the source document when possible
3. If the context doesn't contain the answer, say "I don't have enough information"
4. Be specific and detailed in your response
Answer:"""
PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"],
)
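To wire this prompt into the chain, RetrievalQA accepts it via `chain_type_kwargs={"prompt": PROMPT}`. It's also worth checking what the model actually receives once the retriever fills in `{context}` — plain `str.format` on a trimmed template (both the template and the context below are illustrative) shows the substitution:

```python
# What the LLM sees after substitution. A shortened template and a
# hand-written context stand in for the real PROMPT and retrieved chunks.

demo_template = (
    "Use the following pieces of context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

filled = demo_template.format(
    context="Pinecone is a managed vector database.",
    question="What is Pinecone?",
)
print(filled)
```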
Performance Optimization
Caching Strategies
import langchain
import redis
from langchain.cache import RedisCache

# Cache LLM responses so repeated questions skip the API call
redis_client = redis.Redis(host='localhost', port=6379)
langchain.llm_cache = RedisCache(redis_client)
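Under the hood, an LLM cache keys responses by the prompt (plus model settings) and returns the stored response on a hit. A dict-backed sketch of the same idea, with a fake generation step standing in for the real API call:

```python
# Dict-backed sketch of an LLM response cache: identical (prompt, model)
# pairs skip the "API call". The f-string response is a stand-in for a
# real model call.

class SimpleLLMCache:
    def __init__(self):
        self._store: dict[tuple[str, str], str] = {}
        self.misses = 0  # counts actual "API calls" made

    def lookup(self, prompt: str, model: str):
        return self._store.get((prompt, model))

    def update(self, prompt: str, model: str, response: str):
        self._store[(prompt, model)] = response

cache = SimpleLLMCache()

def cached_generate(prompt: str, model: str = "gpt-4") -> str:
    hit = cache.lookup(prompt, model)
    if hit is not None:
        return hit  # cache hit: no API call
    cache.misses += 1
    response = f"response-to:{prompt}"  # stand-in for the real LLM call
    cache.update(prompt, model, response)
    return response
```

This is also why exact-match caches only help with repeated identical queries; semantically similar but differently worded questions still miss.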
Async Processing
Handle multiple queries efficiently:
import asyncio

async def process_queries(queries):
    # ainvoke is the chain's async entry point; gather runs the queries concurrently
    tasks = [qa_chain.ainvoke({"query": q}) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
Monitoring and Evaluation
Track key metrics:
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    response = qa_chain.invoke({"query": question})
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
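`cb.total_cost` is just token counts multiplied by per-token prices, so you can reproduce the arithmetic yourself for budgeting. The prices below are illustrative assumptions, not current OpenAI pricing — check the pricing page for real numbers:

```python
# Reproduce the cost calculation by hand. Prices are assumed for
# illustration (USD per 1K tokens), not actual OpenAI rates.

PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] + \
           (completion_tokens / 1000) * PRICE_PER_1K["completion"]

cost = estimate_cost(prompt_tokens=1500, completion_tokens=500)
# 1.5 * 0.03 + 0.5 * 0.06 = 0.075
```

Note that in RAG, prompt tokens dominate (the retrieved context is large), which is why retrieving fewer, better chunks directly cuts cost.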
Real-World Challenges
Handling Multi-Document Questions
Some questions require information from multiple sources:
# Use map-reduce for complex queries
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(
    llm=llm,
    chain_type="map_reduce",
    return_intermediate_steps=True,
)
result = chain({"input_documents": docs, "question": question})
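The map_reduce shape is easy to see in plain Python: one "map" call per document extracts what's relevant to the question, then a single "reduce" call combines the partial answers. The two functions below are stubs standing in for LLM calls:

```python
# Shape of a map-reduce QA chain. map_step and reduce_step are stubs:
# a real chain would prompt the LLM in both places.

def map_step(doc: str, question: str) -> str:
    # Stub: keep the document only if it mentions the question's subject.
    return doc if "Pinecone" in doc else ""

def reduce_step(partials: list[str]) -> str:
    # Stub: concatenate the non-empty partial answers.
    return " ".join(p for p in partials if p)

docs = [
    "Pinecone is a vector database.",
    "The weather was sunny.",
    "Pinecone supports metadata filters.",
]
answer = reduce_step([map_step(d, "What is Pinecone?") for d in docs])
```

The trade-off: map_reduce makes one LLM call per document plus one to combine, so it answers cross-document questions at a multiple of the cost of a single "stuff" call.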
Dealing with Contradictions
When sources disagree:
template = """The following sources contain information about the question,
but they may contradict each other. Analyze each source and explain:
1. What each source says
2. Points of agreement
3. Points of disagreement
4. Your best assessment based on source credibility
Sources:
{context}
Question: {question}
Analysis:"""
Conclusion
Building production RAG systems requires attention to:
- Document processing quality - Chunking, metadata, and embeddings
- Retrieval strategy - Hybrid search, reranking, and filtering
- Prompt engineering - Clear instructions and context formatting
- Performance - Caching, async processing, and cost optimization
- Monitoring - Track accuracy, latency, and costs
The examples in this guide give you a solid foundation, but remember to iterate based on your specific use case and user feedback.
Happy building! 🚀