Open-source RAG (retrieval-augmented generation) infrastructure that indexes arXiv at scale, enabling researchers and AI agents to retrieve high-signal academic literature without hallucinations.
LLMs hallucinate when asked about cutting-edge research. We solve this with a rigorously engineered, citation-filtered retrieval pipeline.
We exclude papers with 5 or fewer citations, which removes the ~70% of preprints that are uncited or barely cited. What remains is a peer-validated, dense knowledge base optimized for retrieval quality.
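As a minimal sketch of the citation threshold described above: the `Paper` record and its field names are hypothetical and used only for illustration. The only detail taken from the text is the cutoff itself (5 or fewer citations are dropped).

```python
from dataclasses import dataclass
from typing import Iterable, List

# Papers with 5 or fewer citations are excluded, so 6 is the lowest count kept.
MIN_CITATIONS = 6


@dataclass
class Paper:
    # Hypothetical record layout; the real ingestion schema may differ.
    arxiv_id: str
    title: str
    citation_count: int


def passes_citation_filter(paper: Paper, min_citations: int = MIN_CITATIONS) -> bool:
    """Return True when the paper has enough citations to be indexed."""
    return paper.citation_count >= min_citations


def filter_papers(papers: Iterable[Paper]) -> List[Paper]:
    """Keep only the papers that clear the citation threshold."""
    return [p for p in papers if passes_citation_filter(p)]


candidates = [
    Paper("a", "Uncited preprint", 0),
    Paper("b", "Exactly at the cutoff", 5),
    Paper("c", "Just above the cutoff", 6),
    Paper("d", "Well-cited paper", 120),
]
kept = filter_papers(candidates)
```

Here `kept` contains only papers `c` and `d`; the paper with exactly 5 citations is dropped because the rule excludes "5 or fewer".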
arXiv publishes ~1,000 new papers daily. Our automated pipeline continuously ingests, parses, embeds, and indexes new publications as they appear.
Autonomous AI agents can independently query and synthesize literature through our API, accelerating automated scientific problem-solving and discovery workflows.
Every response is grounded in real, cited academic work. No fabricated references, no outdated training cutoffs — just verified scientific knowledge.
A multi-stage pipeline combining hybrid vector search, intelligent query routing, and LLM-powered decomposition.
A fast NLP-based router classifies each incoming query as Direct, Decompose, or HyDE (hypothetical document embeddings) in under 1 ms.
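To make the routing idea concrete, here is a rough rule-based sketch. The cue lists and the thresholds are assumptions chosen for illustration, not the project's actual routing logic; only the three route names come from the text. Regex matching like this avoids any model call, which is one plausible way to stay within a sub-millisecond budget.

```python
import re
from enum import Enum


class Route(Enum):
    DIRECT = "direct"        # send the query to retrieval unchanged
    DECOMPOSE = "decompose"  # split into sub-queries and extract filters
    HYDE = "hyde"            # generate a hypothetical answer, then embed it


# Hypothetical cue lists; a real router would likely use richer features.
_FILTER_CUES = re.compile(
    r"\b(since|before|after|between|published in|by author|(19|20)\d{2})\b", re.I
)
_MULTI_PART_CUES = re.compile(
    r"\b(compare|versus|vs\.?|and also|as well as)\b", re.I
)
_OPEN_QUESTION_CUES = re.compile(r"^(how|why|what|explain)\b", re.I)


def route_query(query: str) -> Route:
    """Classify a query using cheap lexical cues only."""
    text = query.strip()
    # Multi-part questions or explicit metadata constraints need decomposition.
    if _MULTI_PART_CUES.search(text) or _FILTER_CUES.search(text):
        return Route.DECOMPOSE
    # Longer open-ended questions tend to share little vocabulary with papers,
    # so they are routed to HyDE in this sketch (an assumption).
    if _OPEN_QUESTION_CUES.search(text) and len(text.split()) >= 6:
        return Route.HYDE
    return Route.DIRECT


keyword_route = route_query("BERT fine-tuning")
filtered_route = route_query("Compare LoRA and full fine-tuning in papers since 2022")
open_route = route_query("Why do large language models struggle with long-context reasoning?")
```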
Complex queries are split into atomic sub-queries with metadata filters extracted via structured JSON.
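The structured JSON step could look roughly like the sketch below. The JSON layout (`sub_queries`, `text`, `filters`) and the filter keys (`year_gte`, `category`) are hypothetical; the text only states that sub-queries and metadata filters are extracted as structured JSON. The example string stands in for an LLM response.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SubQuery:
    text: str
    filters: Dict[str, Any] = field(default_factory=dict)


def parse_decomposition(raw_json: str) -> List[SubQuery]:
    """Turn the model's JSON output into validated sub-queries."""
    payload = json.loads(raw_json)
    sub_queries = []
    for item in payload.get("sub_queries", []):
        text = str(item.get("text", "")).strip()
        if not text:
            continue  # ignore empty sub-queries rather than searching for ""
        # Drop filters the model left as null so they are not applied.
        raw_filters = item.get("filters") or {}
        filters = {k: v for k, v in raw_filters.items() if v is not None}
        sub_queries.append(SubQuery(text=text, filters=filters))
    return sub_queries


# Stand-in for an LLM response to:
# "Compare LoRA and full fine-tuning in papers since 2022"
example_llm_output = """
{
  "sub_queries": [
    {"text": "LoRA fine-tuning efficiency",
     "filters": {"year_gte": 2022, "category": "cs.CL"}},
    {"text": "full fine-tuning efficiency",
     "filters": {"year_gte": 2022, "category": null}}
  ]
}
"""

sub_queries = parse_decomposition(example_llm_output)
```

Each resulting `SubQuery` can then be retrieved independently, with its own filters attached.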
Dense embeddings (BGE) and sparse retrieval (BM25) are fused using Reciprocal Rank Fusion on Qdrant.
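Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below illustrates that formula in plain Python; in the pipeline the fusion runs inside Qdrant, and the constant k = 60 used here is the commonly cited default, not a value confirmed by the text.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


dense_hits = ["p1", "p2", "p3"]   # e.g. ranked by BGE embedding similarity
sparse_hits = ["p3", "p1", "p4"]  # e.g. ranked by BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
# fused_ids == ["p1", "p3", "p2", "p4"]
```

Documents that appear in both lists (`p1`, `p3`) outrank those found by only one retriever, which is the behaviour that makes RRF useful for combining dense and sparse results without calibrating their raw scores.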
Metadata constraints are applied natively at the database level via Qdrant Prefetch filters.
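As an illustration of filtering inside prefetch, the sketch below builds a request body in the shape accepted by Qdrant's Query API (`POST /collections/{collection_name}/points/query`). The named vectors `dense` and `sparse` and the payload fields `year` and `category` are assumptions about the collection schema, and the helper `build_hybrid_query` is hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_query(
    dense_vector: List[float],
    sparse_indices: List[int],
    sparse_values: List[float],
    min_year: Optional[int] = None,
    category: Optional[str] = None,
    prefetch_limit: int = 50,
    limit: int = 10,
) -> Dict[str, Any]:
    """Build a hybrid-search body with the filter applied in each prefetch."""
    must: List[Dict[str, Any]] = []
    if min_year is not None:
        must.append({"key": "year", "range": {"gte": min_year}})
    if category is not None:
        must.append({"key": "category", "match": {"value": category}})
    query_filter = {"must": must} if must else None

    def prefetch(query: Any, using: str) -> Dict[str, Any]:
        branch: Dict[str, Any] = {
            "query": query,
            "using": using,          # assumed named-vector name
            "limit": prefetch_limit,
        }
        if query_filter is not None:
            # Filtering here restricts candidates before fusion, in the database.
            branch["filter"] = query_filter
        return branch

    return {
        "prefetch": [
            prefetch(dense_vector, "dense"),
            prefetch({"indices": sparse_indices, "values": sparse_values}, "sparse"),
        ],
        "query": {"fusion": "rrf"},  # fuse both candidate lists server-side
        "limit": limit,
    }


body = build_hybrid_query([0.1, 0.2], [3, 17], [0.5, 0.9], min_year=2022)
unfiltered = build_hybrid_query([0.1], [1], [1.0])
```

Placing the filter on each prefetch branch means both the dense and the sparse candidate sets already satisfy the metadata constraints before they are fused, rather than being trimmed afterwards in application code.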
Where we are today, and where we are headed.
Hybrid search (dense + sparse), Reciprocal Rank Fusion, Qdrant vector store integration, and end-to-end ingestion pipeline built and tested.
Intelligent query routing (Direct / Decompose / HyDE), LLM-based query decomposition, and metadata filter extraction via structured JSON output.
Large-scale ingestion of 250,000+ citation-filtered CS papers from arXiv using GPU-accelerated batch embedding with a chunk-pooling architecture.
FastAPI-based streaming endpoint exposing the full retrieval pipeline, so developers and AI agents worldwide can query peer-validated scientific literature on demand.
Expand ingestion to 915,000+ high-impact papers across all arXiv domains — Physics, Mathematics, Biology, Statistics, and more.