NanoIndex

Open-source Agentic Harness for Long Documents

Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.

Quickstart

Open source. Self-hosted. Interactive setup walks you through API keys and your first document.

$ pip install nanoindex

from nanoindex import NanoIndex

ni = NanoIndex()
tree = ni.index("report.pdf")
answer = ni.ask("What was the revenue?", tree)
print(answer.content)
Benchmarks

Tested on real documents,
not synthetic datasets.

Retrieval Accuracy

The correct data was retrieved from the tree for 149 of 150 questions

99.3%

FinanceBench (Agentic)

84 SEC filings · 150 questions · Claude Sonnet 4.6

95%

97% on SEC 10-K · 89% in fast mode (2 LLM calls)

DocBench Legal

51 court filings · 54 avg pages

96%

FinanceBench accuracy comparison

Chunk + embed: 65%
Chunk + reranker: 78%
NanoIndex (fast): 89%
NanoIndex (agentic): 95%
How it works

From PDF to cited answers
in three steps.

01

Index the document.

Upload a PDF. A single API call extracts sections, tables, entities, relationships, and bounding boxes. The result is a tree of 200-500 nodes, 8+ levels deep.
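As a rough illustration (not NanoIndex's actual internals), the tree described above can be modeled as nested nodes built from an extracted heading outline:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    level: int
    children: list = field(default_factory=list)

def build_tree(outline):
    """Build a nested tree from (level, title) pairs, e.g. extracted headings."""
    root = Node("root", 0)
    stack = [root]  # path from root to the most recent node
    for level, title in outline:
        node = Node(title, level)
        # Pop back up until we find this heading's parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root

outline = [(1, "PART I"), (2, "Item 1"), (3, "Business"), (4, "Divisions")]
tree = build_tree(outline)
```

This yields the `PART I > Item 1 > Business > Divisions` nesting shown later on this page; the real index also attaches tables, entities, and bounding boxes to each node.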

02

Agent navigates the tree.

The LLM reads the outline, reasons about which branches to explore, drills into relevant sections, follows cross-references through the entity graph.
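A minimal sketch of that drill-down loop, with a plain predicate standing in for the LLM's "is this branch worth exploring?" judgment (the tree shape and titles here are hypothetical):

```python
tree = {
    "title": "10-K",
    "children": [
        {"title": "PART I", "children": [
            {"title": "Item 1 - Business", "children": []},
        ]},
        {"title": "PART II", "children": [
            {"title": "Item 7 - MD&A", "children": [
                {"title": "Results of Operations", "children": []},
            ]},
        ]},
    ],
}

def navigate(node, is_relevant, visited=None):
    """Drill into branches the model flags as relevant; skip the rest."""
    if visited is None:
        visited = []
    for child in node["children"]:
        if is_relevant(child["title"]):
            visited.append(child["title"])
            navigate(child, is_relevant, visited)
    return visited

# Stub for the LLM's relevance reasoning on a margin question.
relevant = lambda title: any(k in title for k in ("PART II", "MD&A", "Results"))
path = navigate(tree, relevant)
# path == ["PART II", "Item 7 - MD&A", "Results of Operations"]
```

The agent never touches PART I: irrelevant branches are pruned by reasoning over titles, not by embedding similarity.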

03

Get cited answers.

Precise answers with pixel-level citations. Every claim points to the exact bounding box on the page. Verifiable and traceable.

Features

Everything you need to query
long documents accurately.

Document Trees

Navigable tree from the document's actual structure. PART I > Item 1 > Business > Divisions. Not arbitrary token chunks.

Entity Graphs

Companies, metrics, legal references, and their relationships. Built from a single API call, not a separate NER pipeline.
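One way to picture the graph (an illustrative sketch, not NanoIndex's storage format) is as subject-relation-object triples like the `"Net Revenue" --has_value--> "$280.5B"` edge shown in the use-case example below:

```python
triples = [
    ("Net Revenue", "has_value", "$280.5B"),
    ("Net Revenue", "reported_in", "FY2019"),
    ("Amazon", "reports", "Net Revenue"),
]

def neighbors(graph, entity):
    """All (relation, target) edges leaving an entity."""
    return [(rel, obj) for subj, rel, obj in graph if subj == entity]

edges = neighbors(triples, "Net Revenue")
# edges == [("has_value", "$280.5B"), ("reported_in", "FY2019")]
```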

Pixel Citations

Bounding box coordinates on the exact page region. Draw a rectangle showing where the answer came from.
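To make the highlight-rectangle idea concrete, here is a hedged sketch (the `Citation` shape and DPI handling are assumptions, not NanoIndex's API): PDF coordinates are in points at 72 per inch, so drawing over a page rendered at some DPI just means scaling the box.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in PDF points (72 per inch)

def to_pixels(cite, dpi=150):
    """Scale a PDF-point bbox to pixel coordinates at a render DPI,
    e.g. to draw a highlight rectangle on the rendered page image."""
    s = dpi / 72
    x0, y0, x1, y1 = cite.bbox
    return (round(x0 * s), round(y0 * s), round(x1 * s), round(y1 * s))

c = Citation(page=40, bbox=(72.0, 144.0, 288.0, 180.0))
px = to_pixels(c)
# px == (150, 300, 600, 375)
```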

Agentic Retrieval

LLM agent navigates the tree like a human analyst. Scans the outline, drills in, cross-references, and verifies.

Cross-References

Section 15.2 references Section 2.5(b)? A graph edge is created automatically. The agent follows it during retrieval.
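A toy version of that edge creation (illustrative only; the real extraction happens during indexing, and this regex is an assumption) could scan a section's text for references:

```python
import re

# Matches e.g. "Section 2.5(b)", "Section 15.2", "Section 7".
SECTION_REF = re.compile(r"Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?")

def extract_edges(section_id, text):
    """Create a graph edge for every section reference found in `text`."""
    return [(section_id, "references", m) for m in SECTION_REF.findall(text)]

edges = extract_edges("15.2", "Payments are subject to the cap in Section 2.5(b).")
# edges == [("15.2", "references", "Section 2.5(b)")]
```

During retrieval the agent can follow such an edge directly instead of hoping both sections land in the same retrieved chunk set.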

Best Extraction Quality

Powered by Nanonets OCR-3, ranked #1 on idp-leaderboard.org. State-of-the-art document understanding for any PDF.

Why not chunk-and-embed?

Vector similarity measures word overlap,
not how information connects.

Traditional RAG vs NanoIndex

Document structure: destroyed by chunking → preserved as navigable tree
Cross-references: lost across chunk boundaries → resolved as graph edges
Financial tables: split mid-row → extracted with headers and rows
Multi-section queries: hope retrieval finds all chunks → agent navigates to each section
Citations: page number at best → pixel-level bounding boxes
Setup complexity: vector DB + embedding model + chunk tuning → pip install nanoindex
Use cases

Works across domains.

Finance, legal, healthcare, insurance, research papers.

from nanoindex import NanoIndex

ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", financial_doc=True)
tree = ni.index("amazon_10k.pdf")

# Tree: PART I > Item 1 > Business > Divisions
# Graph: "Net Revenue" --has_value--> "$280.5B"

answer = ni.ask(
    "What was the operating margin in FY2019?",
    tree
)
# "Operating margin was 5.2% ($14.5B / $280.5B)"
# Citations: [Results of Operations, p.40-41]

Stop chunking.
Start understanding.

Open source. Self-hosted. Three lines of Python. No vector database. No embedding model. No chunk size tuning.

$ pip install nanoindex