NanoIndex

Open-source Agentic Harness for Long Documents

Self-validating trees. Entity graphs. Karpathy-inspired LLM wikis. Cited answers down to the pixel.

Quickstart

Open source. Self-hosted. Interactive setup walks you through API keys and your first document.

$ pip install nanoindex

from nanoindex import NanoIndex

ni = NanoIndex()
tree = ni.index("report.pdf")
answer = ni.ask("What was the revenue?", tree)
print(answer.content)
Benchmarks

Tested on real documents,
not synthetic datasets.

Retrieval Accuracy

The correct data was retrieved from the tree for 149 of 150 questions

99.3%

FinanceBench (Agentic)

84 SEC filings · 150 questions · Claude Sonnet 4.6

95%

97% on SEC 10-K · 89% in fast mode (2 LLM calls)

DocBench Legal

51 court filings · 54 avg pages

96%

FinanceBench accuracy comparison

Chunk + embed: 65%
Chunk + reranker: 78%
NanoIndex (fast): 89%
NanoIndex (agentic): 95%
How it works

From PDF to cited answers
in three steps.

01

Index the document.

Upload a PDF. A single API call extracts sections, tables, entities, relationships, and bounding boxes. The result is a tree of 200-500 nodes, 8+ levels deep.
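As a rough illustration (not NanoIndex's actual internals), the tree described above can be modeled as nested nodes built from an extracted heading outline:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    level: int
    children: list = field(default_factory=list)

def build_tree(outline):
    """Build a nested tree from (level, title) pairs, e.g. extracted headings."""
    root = Node("root", 0)
    stack = [root]  # path from root to the most recent node
    for level, title in outline:
        node = Node(title, level)
        # Pop back up until we find this heading's parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root

outline = [(1, "PART I"), (2, "Item 1"), (3, "Business"), (4, "Divisions")]
tree = build_tree(outline)
```

This yields the `PART I > Item 1 > Business > Divisions` nesting shown later on this page; the real index also attaches tables, entities, and bounding boxes to each node.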

02

Agent navigates the tree.

The LLM reads the outline, reasons about which branches to explore, drills into relevant sections, follows cross-references through the entity graph.
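A minimal sketch of that drill-down loop, with a plain predicate standing in for the LLM's "is this branch worth exploring?" judgment (the tree shape and titles here are hypothetical):

```python
tree = {
    "title": "10-K",
    "children": [
        {"title": "PART I", "children": [
            {"title": "Item 1 - Business", "children": []},
        ]},
        {"title": "PART II", "children": [
            {"title": "Item 7 - MD&A", "children": [
                {"title": "Results of Operations", "children": []},
            ]},
        ]},
    ],
}

def navigate(node, is_relevant, visited=None):
    """Drill into branches the model flags as relevant; skip the rest."""
    if visited is None:
        visited = []
    for child in node["children"]:
        if is_relevant(child["title"]):
            visited.append(child["title"])
            navigate(child, is_relevant, visited)
    return visited

# Stub for the LLM's relevance reasoning on a margin question.
relevant = lambda title: any(k in title for k in ("PART II", "MD&A", "Results"))
path = navigate(tree, relevant)
# path == ["PART II", "Item 7 - MD&A", "Results of Operations"]
```

The agent never touches PART I: irrelevant branches are pruned by reasoning over titles, not by embedding similarity.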

03

Get cited answers.

Precise answers with pixel-level citations. Every claim points to the exact bounding box on the page. Verifiable and traceable.

Features

Everything you need to query
long documents accurately.

Document Trees

Navigable tree from the document's actual structure. PART I > Item 1 > Business > Divisions. Not arbitrary token chunks.

Entity Graphs

Companies, metrics, legal references, and their relationships. Built from a single API call, not a separate NER pipeline.
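One way to picture the graph (an illustrative sketch, not NanoIndex's storage format) is as subject-relation-object triples like the `"Net Revenue" --has_value--> "$280.5B"` edge shown in the use-case example below:

```python
triples = [
    ("Net Revenue", "has_value", "$280.5B"),
    ("Net Revenue", "reported_in", "FY2019"),
    ("Amazon", "reports", "Net Revenue"),
]

def neighbors(graph, entity):
    """All (relation, target) edges leaving an entity."""
    return [(rel, obj) for subj, rel, obj in graph if subj == entity]

edges = neighbors(triples, "Net Revenue")
# edges == [("has_value", "$280.5B"), ("reported_in", "FY2019")]
```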

Pixel Citations

Bounding box coordinates on the exact page region. Draw a rectangle showing where the answer came from.
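To make the highlight-rectangle idea concrete, here is a hedged sketch (the `Citation` shape and DPI handling are assumptions, not NanoIndex's API): PDF coordinates are in points at 72 per inch, so drawing over a page rendered at some DPI just means scaling the box.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in PDF points (72 per inch)

def to_pixels(cite, dpi=150):
    """Scale a PDF-point bbox to pixel coordinates at a render DPI,
    e.g. to draw a highlight rectangle on the rendered page image."""
    s = dpi / 72
    x0, y0, x1, y1 = cite.bbox
    return (round(x0 * s), round(y0 * s), round(x1 * s), round(y1 * s))

c = Citation(page=40, bbox=(72.0, 144.0, 288.0, 180.0))
px = to_pixels(c)
# px == (150, 300, 600, 375)
```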

Agentic Retrieval

LLM agent navigates the tree like a human analyst. Scans the outline, drills in, cross-references, and verifies.

Cross-References

Section 15.2 references Section 2.5(b)? A graph edge is created automatically. The agent follows it during retrieval.
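A toy version of that edge creation (illustrative only; the real extraction happens during indexing, and this regex is an assumption) could scan a section's text for references:

```python
import re

# Matches e.g. "Section 2.5(b)", "Section 15.2", "Section 7".
SECTION_REF = re.compile(r"Section\s+\d+(?:\.\d+)*(?:\([a-z]\))?")

def extract_edges(section_id, text):
    """Create a graph edge for every section reference found in `text`."""
    return [(section_id, "references", m) for m in SECTION_REF.findall(text)]

edges = extract_edges("15.2", "Payments are subject to the cap in Section 2.5(b).")
# edges == [("15.2", "references", "Section 2.5(b)")]
```

During retrieval the agent can follow such an edge directly instead of hoping both sections land in the same retrieved chunk set.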

Best Extraction Quality

Powered by Nanonets OCR-3, ranked #1 on idp-leaderboard.org. State-of-the-art document understanding for any PDF.

Why not chunk-and-embed?

Vector similarity measures word overlap,
not how information connects.

Traditional RAG vs NanoIndex

Document structure: destroyed by chunking → preserved as navigable tree
Cross-references: lost across chunk boundaries → resolved as graph edges
Financial tables: split mid-row → extracted with headers and rows
Multi-section queries: hope retrieval finds all chunks → agent navigates to each section
Citations: page number at best → pixel-level bounding boxes
Setup complexity: vector DB + embedding model + chunk tuning → pip install nanoindex
Use cases

Works across domains.

Finance, legal, healthcare, insurance, research papers.

from nanoindex import NanoIndex

ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", financial_doc=True)
tree = ni.index("amazon_10k.pdf")

# Tree: PART I > Item 1 > Business > Divisions
# Graph: "Net Revenue" --has_value--> "$280.5B"

answer = ni.ask(
    "What was the operating margin in FY2019?",
    tree
)
# "Operating margin was 5.2% ($14.5B / $280.5B)"
# Citations: [Results of Operations, p.40-41]

Stop chunking.
Start understanding.

Open source. Self-hosted. Three lines of Python. No vector database. No embedding model. No chunk size tuning.

$ pip install nanoindex