
NanoIndex
Open-source Agentic Harness for Long Documents
Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.
Quickstart
Open source. Self-hosted. Interactive setup walks you through API keys and your first document.
$ pip install nanoindex
from nanoindex import NanoIndex
ni = NanoIndex()
tree = ni.index("report.pdf")
answer = ni.ask("What was the revenue?", tree)
print(answer.content)
Tested on real documents,
not synthetic datasets.
Retrieval Accuracy
149 out of 150 questions had the correct data retrieved from the tree
99.3%
FinanceBench (Agentic)
84 SEC filings · 150 questions · Claude Sonnet 4.6
95%
DocBench Legal
51 court filings · 54 avg pages
96%
FinanceBench accuracy comparison
From PDF to cited answers
in three steps.
01
Index the document.
Upload a PDF. A single API call extracts sections, tables, entities, relationships, and bounding boxes. Builds a tree with 200-500 nodes, 8+ levels deep.
02
Agent navigates the tree.
The LLM reads the outline, reasons about which branches to explore, drills into relevant sections, follows cross-references through the entity graph.
03
Get cited answers.
Precise answers with pixel-level citations. Every claim points to the exact bounding box on the page. Verifiable and traceable.
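The three steps above can be pictured with a toy example. This is a minimal sketch, not NanoIndex's implementation: `Node`, `score`, and `navigate` are hypothetical names, and the keyword-overlap `score` function is only a stand-in for the LLM's reasoning about which branch to explore.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One section in a toy document tree (hypothetical structure)."""
    title: str
    text: str = ""
    page: Optional[int] = None
    children: List["Node"] = field(default_factory=list)


def tokens(s: str) -> set:
    """Lowercased alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def score(question: str, node: Node) -> int:
    """Stand-in for the LLM: word overlap with this node plus its best child."""
    own = len(tokens(question) & tokens(node.title + " " + node.text))
    best_child = max((score(question, c) for c in node.children), default=0)
    return own + best_child


def navigate(question: str, root: Node) -> Tuple[List[str], Node]:
    """Walk from the root to a leaf, picking the best-scoring child each level."""
    path, node = [root.title], root
    while node.children:
        node = max(node.children, key=lambda c: score(question, c))
        path.append(node.title)
    return path, node


# A tiny hand-built tree; a real index would have hundreds of nodes.
root = Node("10-K", children=[
    Node("PART I", children=[
        Node("Item 1", children=[
            Node("Business", text="Overview of divisions and products", page=3),
        ]),
    ]),
    Node("PART II", children=[
        Node("Item 7", children=[
            Node("Results of Operations",
                 text="Net revenue and operating income for the year", page=40),
        ]),
    ]),
])

path, leaf = navigate("What was the revenue?", root)
print(" > ".join(path))   # 10-K > PART II > Item 7 > Results of Operations
print("cited page:", leaf.page)
```

The point of the sketch is the shape of the loop: the navigator only ever looks at the children of the current node, so the cost of a query depends on the depth of the tree, not on the length of the document.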
Everything you need to query
long documents accurately.
Document Trees
Navigable tree from the document's actual structure. PART I > Item 1 > Business > Divisions. Not arbitrary token chunks.
Entity Graphs
Companies, metrics, legal references, and their relationships. Built from a single API call, not a separate NER pipeline.
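One way to picture an entity graph is as a set of (subject, relation, object) edges. The sketch below assumes a hypothetical `EntityGraph` class; it is not NanoIndex's data model. The `has_value` relation and the dollar figures reuse the Amazon example shown elsewhere on this page, while `derived_from` is an invented relation name for illustration.

```python
from collections import defaultdict


class EntityGraph:
    """Toy directed graph of labeled edges (hypothetical, for illustration)."""

    def __init__(self):
        self._out = defaultdict(list)  # subject -> [(relation, object), ...]

    def add(self, subject, relation, obj):
        self._out[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        """Objects linked from `subject`, optionally filtered by relation."""
        return [o for r, o in self._out.get(subject, [])
                if relation is None or r == relation]


graph = EntityGraph()
graph.add("Net Revenue", "has_value", "$280.5B")
graph.add("Operating Income", "has_value", "$14.5B")
graph.add("Operating Margin", "derived_from", "Operating Income")
graph.add("Operating Margin", "derived_from", "Net Revenue")

# To answer a margin question, follow derived_from edges, then read the values.
for metric in graph.neighbors("Operating Margin", "derived_from"):
    print(metric, graph.neighbors(metric, "has_value"))
```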
Pixel Citations
Bounding box coordinates on the exact page region. Draw a rectangle showing where the answer came from.
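As a rough illustration of what drawing that rectangle involves, the sketch below converts a bounding box into pixel coordinates for a rendered page image. It assumes the box is stored as normalized `(x0, y0, x1, y1)` fractions of the page with the origin at the top-left; that coordinate convention and the `to_pixel_rect` helper are assumptions, not NanoIndex's documented format.

```python
def to_pixel_rect(bbox, width_px, height_px):
    """Scale a normalized (x0, y0, x1, y1) box to integer pixel coordinates."""
    x0, y0, x1, y1 = bbox
    if not (0 <= x0 <= x1 <= 1 and 0 <= y0 <= y1 <= 1):
        raise ValueError(f"bbox must be normalized and ordered, got {bbox!r}")
    return (
        round(x0 * width_px),
        round(y0 * height_px),
        round(x1 * width_px),
        round(y1 * height_px),
    )


# A citation covering the middle band of a page rendered at 1000 x 2000 px.
rect = to_pixel_rect((0.1, 0.25, 0.9, 0.5), width_px=1000, height_px=2000)
print(rect)  # (100, 500, 900, 1000)
```

The resulting tuple can be handed to any drawing routine that accepts left, top, right, bottom coordinates, for example Pillow's `ImageDraw.rectangle`.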
Agentic Retrieval
LLM agent navigates the tree like a human analyst. Scans the outline, drills in, cross-references, and verifies.
Cross-References
Section 15.2 references Section 2.5(b)? A graph edge is created automatically. The agent follows it during retrieval.
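The idea of turning a textual reference into a graph edge can be sketched with a regular expression. This is an illustrative approximation only: the `SECTION_REF` pattern and the `cross_reference_edges` helper are hypothetical, and a real extractor would need to handle many more citation styles.

```python
import re

# Matches "Section 2.5(b)", "Section 15.2", "Section 7" and captures the id.
SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*(?:\([a-z]\))?)")


def cross_reference_edges(sections):
    """Return (source_id, target_id) pairs for every section that cites another."""
    edges = []
    for source_id, text in sections.items():
        for target_id in SECTION_REF.findall(text):
            if target_id != source_id:  # ignore self-references
                edges.append((source_id, target_id))
    return edges


# Invented contract text, keyed by section id.
sections = {
    "15.2": "Termination is subject to the notice period in Section 2.5(b).",
    "2.5(b)": "Either party may give 30 days' notice.",
}

print(cross_reference_edges(sections))  # [('15.2', '2.5(b)')]
```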
Best Extraction Quality
Powered by Nanonets OCR-3, ranked #1 on idp-leaderboard.org. State-of-the-art document understanding for any PDF.
Vector similarity measures word overlap,
not how information connects.
| | Traditional RAG | NanoIndex |
|---|---|---|
| Document structure | Destroyed by chunking | Preserved as navigable tree |
| Cross-references | Lost across chunk boundaries | Resolved as graph edges |
| Financial tables | Split mid-row | Extracted with headers and rows |
| Multi-section queries | Hope retrieval finds all chunks | Agent navigates to each section |
| Citations | Page number at best | Pixel-level bounding boxes |
| Setup complexity | Vector DB + embedding model + chunk tuning | pip install nanoindex |
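The "split mid-row" failure in the table above is easy to reproduce. The sketch below uses a toy fixed-size character chunker (`fixed_size_chunks` is a hypothetical helper, not any particular RAG library's splitter) on a two-row table built from the Amazon figures quoted on this page, then contrasts it with a split on row boundaries.

```python
def fixed_size_chunks(text, size):
    """Cut text every `size` characters, ignoring its structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


table = (
    "Metric | FY2019\n"
    "Net Revenue | $280.5B\n"
    "Operating Income | $14.5B\n"
)

chunks = fixed_size_chunks(table, 30)
for chunk in chunks:
    print(repr(chunk))
# The first chunk ends with the label "Net Revenue | " and the second
# chunk starts with its value "$280.5B", so neither holds the full row.

# Splitting on row boundaries keeps each label next to its value.
rows = table.splitlines()
print(rows)
```

With character chunks, a query that retrieves only the first chunk sees the label without the number; with row-level units, the label and value always travel together.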
Works across domains.
Finance, legal, healthcare, insurance, research papers.
from nanoindex import NanoIndex
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", financial_doc=True)
tree = ni.index("amazon_10k.pdf")
# Tree: PART I > Item 1 > Business > Divisions
# Graph: "Net Revenue" --has_value--> "$280.5B"
answer = ni.ask(
    "What was the operating margin in FY2019?",
    tree,
)
# "Operating margin was 5.2% ($14.5B / $280.5B)"
# Citations: [Results of Operations, p.40-41]
Frequently asked questions
Stop chunking.
Start understanding.
Open source. Self-hosted. Three lines of Python. No vector database. No embedding model. No chunk size tuning.
$ pip install nanoindex