Agentic RAG System
Document Intelligence Platform with Medical CSV Support
Overview
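Users upload PDFs or CSVs. Medical CSVs are scrubbed of Protected Health Information (PHI) using regex patterns and spaCy, document text is indexed in Pinecone for vector search, and GPT-4o-mini answers questions with source citations.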
Problem
Organizations cannot efficiently query their private documents. Medical CSV files also contain Protected Health Information (PHI) that must be removed before processing.
Solution
Users upload PDFs or CSVs through a React interface. For CSVs, regex patterns and spaCy scan for PHI (names, IDs, dates) and replace each match with [REDACTED]. The text is then split into overlapping 1000-character chunks. OpenAI's text-embedding-3-small model generates a 1536-dimension vector for each chunk, and the vectors are stored in Pinecone. A user's question is embedded with the same model, and the system retrieves the 5 most similar chunks by cosine similarity. GPT-4o-mini then answers using only the retrieved context and cites its sources.
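The redaction, chunking, and retrieval steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual code: the regex patterns, the 200-character overlap, and the function names `redact`, `chunk_text`, and `top_k` are all hypothetical. Name detection would need spaCy's named-entity recognition and is left out here, and in the real system Pinecone performs the similarity search rather than an in-memory loop.

```python
import math
import re

# Illustrative PHI patterns only (assumptions, not the project's real rules).
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-style identifiers
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # dates such as 03/14/2021
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),        # ISO dates such as 2021-03-14
]


def redact(text: str) -> str:
    """Replace every regex match with the [REDACTED] placeholder."""
    for pattern in PHI_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of `size` characters that share `overlap` characters.

    The overlap value is an assumption; the source only says chunks overlap.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += size - overlap
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

In this sketch a CSV cell would pass through `redact` before `chunk_text`, so no unredacted text reaches the embedding step. The indices returned by `top_k` identify the chunks that would be placed in the GPT-4o-mini prompt as context.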
Technologies
- FastAPI 0.104
- React 18.2
- OpenAI API (GPT-4o-mini, text-embedding-3-small)
- Pinecone
- PyPDF2
- pandas
- spaCy 3.7
- Tailwind CSS
- Docker
Results
Q&A over documents of 1,000+ pages. Automatic PHI redaction for medical CSVs. Retrieval latency below 300 milliseconds.
