ArXiv Scholar — Open-Source Agentic Research Infrastructure

Why This Exists

Bridging the Knowledge Gap

LLMs hallucinate when asked about cutting-edge research. We solve this with a rigorously engineered, citation-filtered retrieval pipeline.

🎯

High-Signal Filtering

We exclude papers with 5 or fewer citations, which removes roughly 70% of preprints, most of them uncited or rarely cited. What remains is a peer-validated, dense knowledge base optimized for retrieval quality.

🔄

Live Synchronization

arXiv publishes ~1,000 new papers daily. Our automated pipeline continuously ingests, parses, embeds, and indexes new publications as they appear.

🤖

Agentic Research

Autonomous AI agents can independently query and synthesize literature through our API, accelerating automated scientific problem-solving and discovery workflows.

🛡️

Hallucination Prevention

Every response is grounded in real, cited academic work. No fabricated references and no reliance on an outdated training cutoff, just verified scientific knowledge.

Under the Hood

Retrieval Architecture

A multi-stage pipeline combining hybrid vector search, intelligent query routing, and LLM-powered decomposition.

1. Query Routing

A fast NLP-based router classifies each incoming query as Direct, Decompose, or HyDE (Hypothetical Document Embeddings) in under 1 ms.
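
As a rough illustration of how a router can stay this cheap, the sketch below classifies queries with plain regular expressions and a length check, with no model call. The cue words, the four-word threshold, and the `route_query` name are assumptions made for this example, not the project's actual classifier.

```python
import re
from enum import Enum


class Route(Enum):
    DIRECT = "direct"
    DECOMPOSE = "decompose"
    HYDE = "hyde"


# Hypothetical cues: comparisons and metadata constraints suggest a multi-part query.
_MULTI_PART = re.compile(r"\b(compare|versus|vs\.?|and also|as well as|between)\b", re.I)
_METADATA = re.compile(r"\b(since|before|after|published in|by author)\b|\b(19|20)\d{2}\b", re.I)


def route_query(query: str) -> Route:
    """Classify a query with cheap heuristics (illustrative rules only)."""
    if _MULTI_PART.search(query) or _METADATA.search(query):
        return Route.DECOMPOSE
    # Short, vague queries carry little signal for direct embedding search,
    # so they are expanded through a hypothetical answer document (HyDE).
    if len(query.split()) <= 4:
        return Route.HYDE
    return Route.DIRECT
```

A rule-based design like this trades accuracy for latency; a small trained classifier could fill the same role if the heuristics prove too coarse.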

2. LLM Decomposition

Complex queries are split into atomic sub-queries with metadata filters extracted via structured JSON.
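
To make the structured output concrete, here is a minimal sketch of parsing a decomposition response into sub-queries with attached filters. The JSON schema (`sub_queries`, `query`, `filters`, `year_gte`) and the `parse_decomposition` helper are hypothetical, chosen only to illustrate the idea.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SubQuery:
    text: str
    filters: dict = field(default_factory=dict)


def parse_decomposition(raw: str) -> list[SubQuery]:
    """Convert the LLM's JSON output (assumed schema) into SubQuery objects."""
    payload = json.loads(raw)
    sub_queries = []
    for item in payload["sub_queries"]:
        text = item["query"].strip()
        if not text:
            continue  # skip empty sub-queries rather than searching for nothing
        sub_queries.append(SubQuery(text=text, filters=item.get("filters", {})))
    return sub_queries


# Example output an LLM might return for:
# "Compare LoRA and prefix tuning in papers published since 2023"
example = """
{
  "sub_queries": [
    {"query": "LoRA low-rank adaptation fine-tuning", "filters": {"year_gte": 2023}},
    {"query": "prefix tuning parameter-efficient fine-tuning", "filters": {"year_gte": 2023}}
  ]
}
"""
parsed = parse_decomposition(example)
```

Each resulting `SubQuery` can then be searched independently, with its filters passed along to the retrieval stage.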

3. Hybrid Search

Dense embeddings (BGE) and sparse retrieval (BM25) are fused using Reciprocal Rank Fusion on Qdrant.
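
Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranked list it appears in. The sketch below shows the formula in plain Python for clarity; in this pipeline the fusion runs inside Qdrant, and the constant `k = 60` and the paper IDs are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


dense_hits = ["paper_a", "paper_b", "paper_c"]   # e.g. from BGE embeddings
sparse_hits = ["paper_b", "paper_d", "paper_a"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers, which is why it suits hybrid search.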

4. Filtered Results

Metadata constraints are applied natively at the database level via Qdrant Prefetch filters.
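
As an illustration, the sketch below builds a request body for Qdrant's Query API (`POST /collections/{collection_name}/points/query`) in which the same filter is attached to both prefetch branches before fusion. The vector names `dense` and `sparse`, the `year` payload field, and the `build_hybrid_query` helper are assumptions for this example, not the project's actual schema.

```python
import json


def build_hybrid_query(dense_vector, sparse_indices, sparse_values, min_year, limit=10):
    """Build a hybrid query body with one filter applied to both prefetch branches."""
    year_filter = {"must": [{"key": "year", "range": {"gte": min_year}}]}
    return {
        "prefetch": [
            # Each branch over-fetches candidates so fusion has enough to rank.
            {"query": dense_vector, "using": "dense", "filter": year_filter, "limit": limit * 5},
            {
                "query": {"indices": sparse_indices, "values": sparse_values},
                "using": "sparse",
                "filter": year_filter,
                "limit": limit * 5,
            },
        ],
        "query": {"fusion": "rrf"},  # fuse the two candidate lists server-side
        "limit": limit,
        "with_payload": True,
    }


body = build_hybrid_query([0.1, 0.2, 0.3], [1, 42], [0.5, 0.8], min_year=2023)
request_json = json.dumps(body, indent=2)
```

Filtering inside each prefetch means candidates that fail the constraint never reach the fusion step, so the final `limit` is not eaten up by results that would be discarded afterwards.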

Progress

Project Roadmap

Where we are today, and where we are headed.

Core RAG Pipeline (Complete)

May 2026

Hybrid search (dense + sparse), Reciprocal Rank Fusion, Qdrant vector store integration, and end-to-end ingestion pipeline built and tested.

Advanced Query Orchestrator (Complete)

May 2026

Intelligent query routing (Direct / Decompose / HyDE), LLM-based query decomposition, and metadata filter extraction via structured JSON output.

Initial Ingestion Run (In Progress)

June 2026

Large-scale ingestion of 250,000+ citation-filtered CS papers from arXiv, using GPU-accelerated batch embedding with a chunk-pooling architecture.

Public API Release (Coming Soon)

Q3 2026

A FastAPI-based streaming endpoint exposing the full retrieval pipeline. It will let developers and AI agents worldwide query peer-validated scientific literature on demand.

Full arXiv Coverage (Planned)

Q4 2026

Expand ingestion to 915,000+ high-impact papers across all arXiv domains — Physics, Mathematics, Biology, Statistics, and more.

The People

Built By

Ayush Dubey

Co-founder & Engineer

Trinetra Devkatte