Information Retrieval · Python · Flask · Dash · Live on HuggingFace
Beyond Keyword Search — Hybrid RAG Benchmarked on Live arXiv Papers
A full-stack benchmarking system that fetches real arXiv papers, indexes them with BM25 and dense vector retrieval, fuses results using Reciprocal Rank Fusion, and evaluates answer quality live. No API key required — runs entirely on open-source models.
2025 · Mohammad Noorchenarboo · Dynamic corpus via arXiv API · 2 ML models (MiniLM + Flan-T5)
The system ingests real arXiv papers on-demand via the official API, splits abstracts into overlapping chunks using three configurable strategies, embeds each chunk with a sentence transformer, and builds both a FAISS inner-product index and a BM25 sparse index simultaneously. At query time, both indices retrieve candidates independently and Reciprocal Rank Fusion merges the ranked lists, consistently outperforming either method alone. Flan-T5 then receives a distilled selection of the six most relevant sentences as context.
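The overlapping-chunk idea behind the fixed-size strategy can be sketched in a few lines. This is an illustrative sketch, not the project's code: the function name, chunk size, and overlap below are assumed defaults.

```python
def chunk_fixed(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into fixed-size word chunks, with each chunk sharing
    `overlap` words with its predecessor so boundary context is kept."""
    words = text.split()
    step = size - overlap          # advance by size-minus-overlap each step
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break                  # final chunk reached the end of the text
    return chunks

# A 120-word mock abstract produces three chunks of up to 50 words each,
# with adjacent chunks sharing a 10-word overlap.
abstract = " ".join(f"w{i}" for i in range(120))
chunks = chunk_fixed(abstract, size=50, overlap=10)
```

The sentence-window and semantic strategies differ only in where the boundaries fall; the overlap principle is the same.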
📡 arXiv API (fetch papers by topic) → ✂️ Chunking (3 strategies) → 🧠 Embeddings (MiniLM-L6-v2) → 📊 Indices (FAISS + BM25) → 🔀 RRF Fusion (hybrid rank) → 💬 Flan-T5 (answer gen)
Feature Set
Dashboard Modules
📥 Ingest
Paper Ingestion
Search arXiv by keyword and fetch papers on-demand. Configure chunking strategy (fixed, sentence-window, or semantic) and batch size.
API: /api/fetch-arxiv
Status: ✅ Live
🔍 Ask & Compare
Unified Query Page
Ask a question and get everything on one page: AI-generated answer, live metrics (relevance · faithfulness · diversity), radar + bar charts, and 3-column passage comparison across all three methods simultaneously.
Methods run: All 3 at once
Status: ✅ Live
💡 Domain Chips
Quick-Start Topics
Pre-built topic chips (📊 Data Science · 📣 Marketing · 🏥 Medical) fill the arXiv search box in one click. After loading, the Ask tab surfaces 2 contextual question chips tailored to the selected domain.
Domains: 3 built-in
Status: ✅ Live
🔌 REST API
Programmatic Access
Full JSON API: /api/fetch-arxiv, /api/query, /api/compare. All dashboard features accessible without the UI.
Endpoints: 3 routes
Status: ✅ Live
🧠 Fine-tune
Embedding Fine-Tuning
Domain-specific contrastive fine-tuning of the sentence-transformer on the ingested corpus using hard-negative mining to improve retrieval precision.
Method: Contrastive / triplet loss
Status: 🗓️ Planned
ML Stack
Models & Methods
Every component runs locally inside the Docker container with no external API calls. The sentence transformer provides 384-dimensional L2-normalised embeddings, making cosine similarity equivalent to inner product — which is exactly what FAISS IndexFlatIP computes. The pairing is intentional: normalising once at index time makes every similarity query a simple dot product.
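The cosine-equals-inner-product claim is easy to verify directly. The sketch below uses NumPy with random mock vectors standing in for MiniLM embeddings; the inner product it computes is the same score `faiss.IndexFlatIP` would return.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 384)).astype("float32")    # mock corpus embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # L2-normalise once, at index time

query = rng.normal(size=(1, 384)).astype("float32")
query /= np.linalg.norm(query)

# With unit-norm vectors the inner product (what faiss.IndexFlatIP scores)
# is identical to cosine similarity, so no per-query normalisation is needed.
inner = (emb @ query.T).ravel()
cosine = inner / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
```

Because the normalisation happens once at index time, every subsequent query is a single matrix-vector dot product.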
Every component — RRF, chunker, context distillation — is implemented from scratch. This was a deliberate choice: understanding the internals of each retrieval stage is more resume-worthy than wiring together a framework that abstracts them away.
Interactive Explorer
Retrieval Method Comparison
Select a retrieval method to see representative metrics from a live run on arXiv papers about transformer architectures.
Metrics computed on arXiv CS.CL papers. Context Relevance = cosine similarity. Diversity = 1 − mean pairwise similarity. Faithfulness = answer token overlap with context.
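Under the definitions above, the three metrics reduce to a few lines of plain Python. This is an illustrative sketch, not the project's implementation: the function names and the pure-Python cosine helper are assumptions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def context_relevance(query_vec, chunk_vecs) -> float:
    # mean cosine similarity between the query and the retrieved chunks
    return sum(cosine(query_vec, c) for c in chunk_vecs) / len(chunk_vecs)

def diversity(chunk_vecs) -> float:
    # 1 − mean pairwise similarity across the retrieved set
    pairs = [(i, j) for i in range(len(chunk_vecs))
                    for j in range(i + 1, len(chunk_vecs))]
    mean_sim = sum(cosine(chunk_vecs[i], chunk_vecs[j]) for i, j in pairs) / len(pairs)
    return 1 - mean_sim

def faithfulness(answer: str, context: str) -> float:
    # fraction of answer tokens that also appear in the context
    ans, ctx = answer.lower().split(), set(context.lower().split())
    return sum(t in ctx for t in ans) / len(ans)
```

An answer whose every token appears in the context scores a faithfulness of 1.0; identical retrieved chunks score a diversity of 0.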
Performance Snapshot
Evaluation Results
Method Comparison
Per-Query Relevance
Model Faithfulness
Grouped bar chart showing Context Relevance and Diversity for each retrieval method. Hybrid RRF leads on both axes — 0.47 relevance vs 0.41 for BM25 and 0.51 diversity vs 0.38 for BM25.
Per-query context relevance across 4 test queries. Hybrid (green) consistently stays above BM25 (amber) and Dense Vector (indigo), with the largest gap on abstract definitional questions.
Answer faithfulness by generator model and question type. Flan-T5 XL reaches 0.84 on factual questions but shows diminishing returns over Large on definitional queries.
Design Decisions
Key Engineering Choices
🔀
Hybrid beats both
RRF fuses BM25's keyword precision with the vector index's semantic recall. The formula 1/(60 + rank) comes from the original RRF paper, where k = 60 was chosen empirically on TREC collections and remains the canonical default; the project implements it verbatim.
✂️
Distil, don't truncate
Instead of truncating the context to fit the model's input, sentence-level embedding similarity selects the 6 most relevant sentences. This fixes the title-string contamination problem that caused Flan-T5 to output paper titles as answers.
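The selection step can be sketched as follows. In the real system each sentence's similarity to the query comes from the MiniLM embedder; here the scores are mocked and the function name is an assumption.

```python
def top_k_sentences(sentences: list[str], sims: list[float], k: int = 6) -> str:
    """Keep the k sentences most similar to the query,
    then restore original reading order before joining."""
    ranked = sorted(range(len(sentences)), key=lambda i: sims[i], reverse=True)
    keep = sorted(ranked[:k])          # document order, not similarity order
    return " ".join(sentences[i] for i in keep)

sentences = [f"sent{i}" for i in range(10)]
sims = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.05, 0.5]  # mock similarity scores
context = top_k_sentences(sentences, sims, k=6)
```

Restoring document order matters: the generator sees coherent prose rather than sentences shuffled by score, and low-relevance title strings simply never make it into the context.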
🌙
CSS vars, not Python
Dark/light theming is handled entirely by CSS custom properties toggled via a clientside callback — no Python re-renders on theme switch. Plotly charts receive hex colors via theme-keyed dicts since they cannot read CSS variables.
At a Glance
Quick read
What it is: A hybrid RAG platform — load arXiv papers by domain, ask questions, get answers with live comparison across BM25, dense vector, and hybrid fusion.
Tech: Flask · Plotly Dash · FAISS · BM25 · Flan-T5.
Deploy: Docker on HuggingFace Spaces free tier.
UI: 2-tab design — Load and Ask & Compare.