PDF Q&A Assistant
Build an intelligent document assistant that can answer questions about any PDF using RAG (Retrieval-Augmented Generation) powered by your choice of LLM provider.
Document Intelligence at Scale
Extract insights from PDFs using advanced RAG techniques
Use Cases
Contract Analysis
Quickly find specific clauses, terms, and conditions in legal documents
Research Papers
Extract key findings, methodologies, and conclusions from academic papers
Documentation
Create intelligent help systems for technical documentation
Key Features
- Smart Chunking: Intelligently splits documents for optimal context retrieval
- Vector Search: Uses embeddings for semantic similarity matching
- Context-Aware Answers: Provides relevant excerpts with answers
- Multi-Provider Support: Works with any LLM provider supported by AISuite (see the sketch below)
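Switching providers is just a different model string on the same call. A minimal sketch, assuming these illustrative model identifiers (check each provider's current model names):

import aisuite as ai

client = ai.Client()
messages = [{"role": "user", "content": "Summarize this paragraph: ..."}]

# Same request, different providers; only the model string changes
for model in ["openai:gpt-4o", "anthropic:claude-3-5-sonnet-20240620"]:
    response = client.chat.completions.create(model=model, messages=messages)
    print(f"{model}: {response.choices[0].message.content}")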
RAG Architecture
PDF Upload ---> Text Extraction ---> Chunking Strategy
                                            |
                                            v
User Question ---> Vector Search <--- Embeddings Creation
                        |
                        v
                 Relevant Chunks
                        |
                        v
    Answer <--- AISuite Client ---> LLM Provider
Implementation
pdf_qa.py
import aisuite as ai
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import streamlit as st


class PDFQuestionAnswering:
    def __init__(self, provider="openai:gpt-4"):
        self.client = ai.Client()
        self.provider = provider
        self.embeddings = OpenAIEmbeddings()
        self.vector_store = None

    def process_pdf(self, pdf_file):
        """Extract and chunk PDF text."""
        reader = PdfReader(pdf_file)
        text = ""
        for page in reader.pages:
            # extract_text() can return None for image-only pages
            text += page.extract_text() or ""

        # Split text into overlapping chunks: ~1000 characters balances
        # context per chunk against retrieval precision, and the 200-character
        # overlap preserves continuity across chunk boundaries
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)

        # Embed the chunks and index them for similarity search
        self.vector_store = FAISS.from_texts(chunks, self.embeddings)
        return len(chunks)

    def answer_question(self, question):
        """Answer a question based on the PDF content."""
        if not self.vector_store:
            return "Please upload a PDF first."

        # Retrieve the four chunks most similar to the question
        relevant_docs = self.vector_store.similarity_search(question, k=4)
        context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # Ground the prompt in the retrieved context
        prompt = f"""Based on the following context from the document,
please answer the question. If the answer is not in the context,
say so clearly.

Context:
{context}

Question: {question}

Answer:"""

        # Get the response through AISuite's unified API
        response = self.client.chat.completions.create(
            model=self.provider,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content


# Streamlit UI
st.title("PDF Q&A Assistant")
st.write("Upload a PDF and ask questions about its content")

# Initialize the QA system once per session
@st.cache_resource
def init_qa_system():
    return PDFQuestionAnswering()

qa_system = init_qa_system()

# File upload
uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")

if uploaded_file:
    with st.spinner("Processing PDF..."):
        chunk_count = qa_system.process_pdf(uploaded_file)
    st.success(f"PDF processed! Created {chunk_count} text chunks.")

    # Question input
    question = st.text_input("Ask a question about the document:")

    if question:
        with st.spinner("Generating answer..."):
            answer = qa_system.answer_question(question)
        st.write("### Answer:")
        st.write(answer)
AISuite Integration Points
- LLM Flexibility: Switch between providers for cost/quality optimization
- Unified API: Same code works with GPT-4, Claude, or Gemini
- Context Management: Handles large documents through chunking
- Error Handling: Gracefully handles provider failures (see the fallback sketch below)
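A minimal sketch of that fallback behavior; the function name and provider list are illustrative, not part of the example above:

def answer_with_fallback(client, messages,
                         providers=("openai:gpt-4", "anthropic:claude-3-5-sonnet-20240620")):
    """Try each provider in order; return the first successful answer."""
    last_error = None
    for model in providers:
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except Exception as exc:  # rate limit, outage, auth failure, etc.
            last_error = exc
    raise RuntimeError(f"All providers failed: {last_error}")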
Try It Out
Quick Start
1. Clone the repository:
git clone https://github.com/andrewyng/aisuite.git
cd aisuite/examples
2. Install dependencies:
pip install aisuite PyPDF2 langchain langchain-community openai faiss-cpu streamlit
3. Set up API keys:
export OPENAI_API_KEY="your-key" # For embeddings and LLM
4. Run the application:
streamlit run pdf_qa.py
Extend It
Add Source Citations
Include page numbers and excerpts in answers
# Track the source page of each chunk by splitting page by page
chunks_with_metadata = [
    {"text": chunk, "page": page_num}
    for page_num, page in enumerate(reader.pages, start=1)
    for chunk in text_splitter.split_text(page.extract_text() or "")
]
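FAISS can store that metadata alongside each chunk, so answers can cite pages. A minimal sketch building on the snippet above:

# Index chunks together with their page metadata
vector_store = FAISS.from_texts(
    [c["text"] for c in chunks_with_metadata],
    embeddings,
    metadatas=[{"page": c["page"]} for c in chunks_with_metadata],
)

# Retrieved documents now carry the page number for citation
docs = vector_store.similarity_search(question, k=4)
context = "\n\n".join(
    f"[page {doc.metadata['page']}] {doc.page_content}" for doc in docs
)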
Multi-Document Support
Query across multiple PDFs simultaneously
# Combine multiple PDFs into one searchable store
for pdf in pdf_files:
    chunks = process_pdf(pdf)  # assumed here to return the chunk list, not a count
    vector_store.add_texts(chunks)
Persistent Storage
Save processed documents for faster retrieval
# Save the vector store to disk
vector_store.save_local("./db")
# Load it later (recent LangChain versions require the explicit opt-in flag)
vector_store = FAISS.load_local(
    "./db", embeddings, allow_dangerous_deserialization=True
)
Advanced Retrieval
Implement hybrid search with BM25 + embeddings
# Hybrid retrieval (bm25_search, vector_search, merge_results are placeholders)
bm25_results = bm25_search(query)
vector_results = vector_search(query)
combined = merge_results(...)
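One concrete way to implement that merge is reciprocal rank fusion over a BM25 index and the FAISS store. A minimal sketch, assuming the rank_bm25 package (not among this example's dependencies) and the chunks list produced in process_pdf:

from rank_bm25 import BM25Okapi

# Keyword index over the same chunks that back the FAISS store
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])

def hybrid_search(query, k=4):
    # Rank chunks by BM25 keyword score
    scores = bm25.get_scores(query.lower().split())
    bm25_ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k * 2]

    # Rank chunks by embedding similarity, mapped back to chunk indices
    # (assumes chunk texts are unique within the document)
    vector_docs = vector_store.similarity_search(query, k=k * 2)
    vector_ranked = [chunks.index(doc.page_content) for doc in vector_docs]

    # Reciprocal rank fusion: reward chunks that rank highly in either list
    fused = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, idx in enumerate(ranked):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    return [chunks[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]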