PDF Q&A Assistant
Build an intelligent document assistant that can answer questions about any PDF using RAG (Retrieval-Augmented Generation) powered by your choice of LLM provider.
Document Intelligence at Scale
Extract insights from PDFs using advanced RAG techniques
Use Cases
Contract Analysis
Quickly find specific clauses, terms, and conditions in legal documents
Research Papers
Extract key findings, methodologies, and conclusions from academic papers
Documentation
Create intelligent help systems for technical documentation
Key Features
- Smart Chunking: Intelligently splits documents for optimal context retrieval
- Vector Search: Uses embeddings for semantic similarity matching
- Context-Aware Answers: Provides relevant excerpts with answers
- Multi-Provider Support: Works with any LLM provider supported by AISuite (see the sketch below)
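Switching providers is just a different model string on the same call. A minimal sketch, assuming these illustrative model identifiers (check each provider's current model names):

import aisuite as ai

client = ai.Client()
messages = [{"role": "user", "content": "Summarize this paragraph: ..."}]

# Same request, different providers; only the model string changes
for model in ["openai:gpt-4o", "anthropic:claude-3-5-sonnet-20240620"]:
    response = client.chat.completions.create(model=model, messages=messages)
    print(f"{model}: {response.choices[0].message.content}")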
RAG Architecture
PDF Upload ---> Text Extraction ---> Chunking Strategy
                                            |
                                            v
User Question ---> Vector Search <--- Embeddings Creation
                        |
                        v
                 Relevant Chunks
                        |
                        v
    Answer <--- AISuite Client ---> LLM Provider
Implementation
pdf_qa.py
import aisuite as ai
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import streamlit as st


class PDFQuestionAnswering:
    def __init__(self, provider="openai:gpt-4"):
        self.client = ai.Client()
        self.provider = provider
        self.embeddings = OpenAIEmbeddings()
        self.vector_store = None

    def process_pdf(self, pdf_file):
        """Extract and chunk PDF text."""
        reader = PdfReader(pdf_file)
        text = ""
        for page in reader.pages:
            # extract_text() can return None for image-only pages
            text += page.extract_text() or ""

        # Split text into overlapping chunks: ~1000 characters balances
        # context per chunk against retrieval precision, and the 200-character
        # overlap preserves continuity across chunk boundaries
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)

        # Embed the chunks and index them for similarity search
        self.vector_store = FAISS.from_texts(chunks, self.embeddings)
        return len(chunks)

    def answer_question(self, question):
        """Answer a question based on the PDF content."""
        if not self.vector_store:
            return "Please upload a PDF first."

        # Retrieve the four chunks most similar to the question
        relevant_docs = self.vector_store.similarity_search(question, k=4)
        context = "\n\n".join([doc.page_content for doc in relevant_docs])

        # Ground the prompt in the retrieved context
        prompt = f"""Based on the following context from the document,
please answer the question. If the answer is not in the context,
say so clearly.

Context:
{context}

Question: {question}

Answer:"""

        # Get the response through AISuite's unified API
        response = self.client.chat.completions.create(
            model=self.provider,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ]
        )
        return response.choices[0].message.content


# Streamlit UI
st.title("PDF Q&A Assistant")
st.write("Upload a PDF and ask questions about its content")

# Initialize the QA system once per session
@st.cache_resource
def init_qa_system():
    return PDFQuestionAnswering()

qa_system = init_qa_system()

# File upload
uploaded_file = st.file_uploader("Choose a PDF file", type="pdf")

if uploaded_file:
    with st.spinner("Processing PDF..."):
        chunk_count = qa_system.process_pdf(uploaded_file)
    st.success(f"PDF processed! Created {chunk_count} text chunks.")

    # Question input
    question = st.text_input("Ask a question about the document:")

    if question:
        with st.spinner("Generating answer..."):
            answer = qa_system.answer_question(question)
        st.write("### Answer:")
        st.write(answer)
AISuite Integration Points
- LLM Flexibility: Switch between providers for cost/quality optimization
- Unified API: Same code works with GPT-4, Claude, or Gemini
- Context Management: Handles large documents through chunking
- Error Handling: Gracefully handles provider failures (see the fallback sketch below)
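A minimal sketch of that fallback behavior; the function name and provider list are illustrative, not part of the example above:

def answer_with_fallback(client, messages,
                         providers=("openai:gpt-4", "anthropic:claude-3-5-sonnet-20240620")):
    """Try each provider in order; return the first successful answer."""
    last_error = None
    for model in providers:
        try:
            response = client.chat.completions.create(model=model, messages=messages)
            return response.choices[0].message.content
        except Exception as exc:  # rate limit, outage, auth failure, etc.
            last_error = exc
    raise RuntimeError(f"All providers failed: {last_error}")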
Try It Out
Quick Start
1. Clone the repository:
git clone https://github.com/andrewyng/aisuite.git
cd aisuite/examples
2. Install dependencies:
pip install aisuite PyPDF2 langchain langchain-community openai faiss-cpu streamlit
3. Set up API keys:
export OPENAI_API_KEY="your-key" # For embeddings and LLM
4. Run the application:
streamlit run pdf_qa.py
Extend It
Add Source Citations
Include page numbers and excerpts in answers
# Track the source page of each chunk by splitting page by page
chunks_with_metadata = [
    {"text": chunk, "page": page_num}
    for page_num, page in enumerate(reader.pages, start=1)
    for chunk in text_splitter.split_text(page.extract_text() or "")
]
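FAISS can store that metadata alongside each chunk, so answers can cite pages. A minimal sketch building on the snippet above:

# Index chunks together with their page metadata
vector_store = FAISS.from_texts(
    [c["text"] for c in chunks_with_metadata],
    embeddings,
    metadatas=[{"page": c["page"]} for c in chunks_with_metadata],
)

# Retrieved documents now carry the page number for citation
docs = vector_store.similarity_search(question, k=4)
context = "\n\n".join(
    f"[page {doc.metadata['page']}] {doc.page_content}" for doc in docs
)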
Multi-Document Support
Query across multiple PDFs simultaneously
# Combine multiple PDFs into one searchable store
for pdf in pdf_files:
    chunks = process_pdf(pdf)  # assumed here to return the chunk list, not a count
    vector_store.add_texts(chunks)
Persistent Storage
Save processed documents for faster retrieval
# Save the vector store to disk
vector_store.save_local("./db")
# Load it later (recent LangChain versions require the explicit opt-in flag)
vector_store = FAISS.load_local(
    "./db", embeddings, allow_dangerous_deserialization=True
)
Advanced Retrieval
Implement hybrid search with BM25 + embeddings
# Hybrid retrieval (bm25_search, vector_search, merge_results are placeholders)
bm25_results = bm25_search(query)
vector_results = vector_search(query)
combined = merge_results(...)
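One concrete way to implement that merge is reciprocal rank fusion over a BM25 index and the FAISS store. A minimal sketch, assuming the rank_bm25 package (not among this example's dependencies) and the chunks list produced in process_pdf:

from rank_bm25 import BM25Okapi

# Keyword index over the same chunks that back the FAISS store
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])

def hybrid_search(query, k=4):
    # Rank chunks by BM25 keyword score
    scores = bm25.get_scores(query.lower().split())
    bm25_ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k * 2]

    # Rank chunks by embedding similarity, mapped back to chunk indices
    # (assumes chunk texts are unique within the document)
    vector_docs = vector_store.similarity_search(query, k=k * 2)
    vector_ranked = [chunks.index(doc.page_content) for doc in vector_docs]

    # Reciprocal rank fusion: reward chunks that rank highly in either list
    fused = {}
    for ranked in (bm25_ranked, vector_ranked):
        for rank, idx in enumerate(ranked):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (60 + rank)
    return [chunks[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]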