Documentation
Everything you need to index, query, and build on top of NanoIndex.
Installation
pip install nanoindex
Requires Python 3.10+. Installs PyMuPDF, httpx, pydantic, networkx, and spaCy as dependencies.
You need two API keys:
- NANONETS_API_KEY for document extraction (free, 10K pages at docstrange.nanonets.com)
- LLM API key for answering questions (OpenAI, Anthropic, or Google)
export NANONETS_API_KEY=your_key_here
export GOOGLE_API_KEY=your_key_here  # or ANTHROPIC_API_KEY or OPENAI_API_KEY
Quickstart
from nanoindex import NanoIndex
ni = NanoIndex() # auto-detects API keys from env
tree = ni.index("annual_report.pdf") # builds tree + graph
answer = ni.ask("What was the revenue?", tree)
print(answer.content) # "Revenue was $280.5B in FY2019"
print(answer.citations) # [Citation(title="PART II", pages=[17,18])]
That's it. Three lines from PDF to cited answer.
LLM Configuration
NanoIndex supports any OpenAI-compatible LLM. Pass the provider and model name:
# Anthropic
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")
# OpenAI
ni = NanoIndex(llm="openai:gpt-5.4")
# Google
ni = NanoIndex(llm="google:gemini-2.5-flash")
# Local models via Ollama
ni = NanoIndex(llm="ollama:llama3")
If no llm= is passed, NanoIndex auto-detects from environment variables in this order: ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GROQ_API_KEY.
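The priority order above can be sketched as follows. This is a hypothetical illustration of how such detection might work, assuming a simple first-match scan; `detect_provider` and `PROVIDER_PRIORITY` are not part of the NanoIndex API.

```python
import os

# Check environment variables in the documented priority order and
# return the first provider whose key is set.
PROVIDER_PRIORITY = [
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("OPENAI_API_KEY", "openai"),
    ("GOOGLE_API_KEY", "google"),
    ("GROQ_API_KEY", "groq"),
]

def detect_provider(env=os.environ):
    for var, provider in PROVIDER_PRIORITY:
        if env.get(var):
            return provider
    return None  # no key found; caller must pass llm= explicitly
```

Note that priority, not presence count, decides: if both ANTHROPIC_API_KEY and OPENAI_API_KEY are set, Anthropic wins.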
Indexing Documents
Indexing extracts the document structure, builds a tree, generates summaries, and constructs an entity graph.
tree = ni.index("report.pdf")
# With options
tree = ni.index(
    "report.pdf",
    add_summaries=True,        # generate per-node summaries (default: True)
    add_doc_description=False, # generate document-level description
)
What happens during indexing:
- Nanonets OCR-3 extracts sections, tables, entities, bounding boxes (single API call)
- Tree builder creates a navigable hierarchy (PART I > Item 1 > Business > Divisions)
- SEC filing enforcement normalizes 10-K/10-Q structure
- Entity graph built from extracted relationships
- Summaries generated for nodes without API-provided summaries
How Splitting Works
NanoIndex never splits by fixed token count. Instead, it uses a 4-tier cascade that preserves document meaning:
- Heading split (zero LLM cost). If a node's text contains markdown headings (#, ##), split on those boundaries. Each sub-heading becomes a child node.
- LLM title split. Ask the LLM to identify 3-8 logical subsection boundaries in the text. Split at those points.
- Proposition extraction. For dense text with no structure, extract atomic factual claims. Each proposition is a self-contained statement that can stand alone.
- Paragraph chunking (last resort). Split at paragraph boundaries to respect token limits.
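The first tier is cheap enough to show in full. Below is a minimal sketch of a heading split, assuming markdown-style `#` headings; `split_on_headings` is an illustrative helper, not the actual NanoIndex splitter, which also builds child nodes and tracks page ranges.

```python
import re

def split_on_headings(text: str) -> list[str]:
    # Split immediately before any line starting with 1-6 '#' characters,
    # so each sub-heading opens a new chunk (zero LLM calls needed).
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]
```

If no headings are found, the text passes through unchanged and the cascade falls to the next tier.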
Example of proposition extraction on a legal paragraph:
# Original text:
# "The vendor warrants all goods are free from defects under
# Consumer Protection Act Section 84(a). Liability covers both
# direct and consequential damages, capped at the contract value
# defined in Clause 5(b). The vendor may be exempted under
# Section 85 if the defect was caused by buyer negligence."

# NanoIndex extracts 4 propositions:
# P1: "Vendor warrants goods free from defects under CPA Section 84(a)"
# P2: "Liability covers direct and consequential damages"
# P3: "Liability capped at contract value defined in Clause 5(b)"
# P4: "Vendor exempt under Section 85 if defect caused by buyer negligence"

# Each is independently queryable:
# "What is the liability cap?" -> retrieves P3
# "Can liability be exempted?" -> retrieves P4
This means every piece of the tree is meaningful on its own, not an arbitrary 512-token fragment that starts mid-sentence.
Querying
# Simple question
answer = ni.ask("What was the revenue?", tree)
# With PDF for vision mode
answer = ni.ask("What was the revenue?", tree, pdf_path="report.pdf")
# Access the answer
print(answer.content) # the answer text
print(answer.citations) # list of Citation objects
print(answer.mode) # "text", "fast", "agentic_vision", etc.
Query Architecture
When you call ni.ask(), here is what happens internally:
- Query decomposition. The LLM breaks the question into required data points. "What was the EBITDA margin?" becomes: need operating income (from income statement) + D&A (from cash flow statement) + revenue (from income statement).
- Tree navigation. The agent reads the full tree outline (all node titles + summaries) and selects which sections to read. It reasons: "EBITDA needs Item 7 (MD&A) and Item 8 (Financial Statements)."
- Multi-round retrieval. The agent reads the selected sections. If it needs more context, it requests additional sections. This continues for 2-5 rounds until the agent has enough information.
- Sufficiency check. For financial queries, the system verifies all required data points from step 1 are covered. If "D&A" was needed but not found, it triggers targeted retrieval.
- Answer generation. The LLM generates an answer from the gathered context, with step-by-step reasoning for calculations.
- Calculation verification. If the answer contains math, a separate verification step checks the arithmetic against source data.
- Self-evaluation. If the answer looks like a refusal ("cannot determine"), the system retries with expanded context or falls back to keyword search.
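The sufficiency check in step 4 reduces to a set difference between what decomposition demanded and what retrieval found. A minimal sketch, with hypothetical names (`missing_data_points` is not the NanoIndex internal API):

```python
def missing_data_points(required: list[str], gathered: dict[str, float]) -> list[str]:
    # Anything decomposition asked for that retrieval has not yet found
    # becomes a target for another focused retrieval round.
    return [name for name in required if name not in gathered]

# EBITDA margin needs three inputs; D&A has not been retrieved yet.
required = ["operating_income", "depreciation_amortization", "revenue"]
gathered = {"operating_income": 64.0, "revenue": 280.5}
```

An empty result ends retrieval and moves the pipeline to answer generation; a non-empty one triggers the targeted retrieval described above.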
In agentic_vision mode, steps 3-5 also include rendered PDF page images alongside the extracted text. The LLM can read tables, charts, and visual layouts directly from the page.
In fast mode, steps 2-4 are replaced by entity graph traversal (find entities matching the query, expand via relationships, LLM picks from candidates). This uses only 2 LLM calls instead of 5-11.
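Fast mode's graph traversal can be sketched as a one-hop expansion over relationship edges. This is illustrative only, with assumed data shapes (entities as name strings, relationships as `(source, type, target)` tuples); the real NanoIndex graph objects carry richer fields.

```python
def expand_candidates(query_terms, entities, relationships):
    # Seed with entities whose name matches any query term.
    seeds = {e for e in entities if any(t.lower() in e.lower() for t in query_terms)}
    # Expand one hop in both directions along relationship edges.
    expanded = set(seeds)
    for src, _rel_type, dst in relationships:
        if src in seeds:
            expanded.add(dst)
        if dst in seeds:
            expanded.add(src)
    return expanded
```

The resulting candidate set is small enough to hand to the LLM in a single call, which is why fast mode needs only two calls in total.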
Query Modes
Different modes trade off speed vs accuracy. The key difference between agentic and agentic_vision: agentic mode sends extracted text to the LLM, while agentic_vision also renders the actual PDF pages as images and sends them alongside the text. This lets the LLM read tables, charts, and layouts directly from the page rather than relying on OCR text alone.
| Mode | LLM Calls | Best For |
|---|---|---|
| fast | 2 | Simple lookups, low latency |
| agentic | 5-11 | Complex multi-step, highest accuracy |
| agentic_vision | 5-11 | Tables, charts, scanned docs |
| agentic_graph | 3-6 | Entity-centric queries |
answer = ni.ask("What was the revenue?", tree, mode="fast")
answer = ni.ask("Calculate EBITDA margin", tree, mode="agentic")
answer = ni.ask("Read the table on page 40", tree, mode="agentic_vision", pdf_path="report.pdf")
Tree Structure
The tree is the core data structure. You can save it, reload it, and walk its nodes:
from nanoindex.utils.tree_ops import save_tree, load_tree, iter_nodes
# Save and reload (no re-indexing needed)
save_tree(tree, "my_tree.json")
tree = load_tree("my_tree.json")
# Explore nodes
for node in iter_nodes(tree.structure):
    print(f"[{node.node_id}] {node.title}")
    print(f"  Pages {node.start_index}-{node.end_index}")
    print(f"  Summary: {node.summary}")
    print(f"  Children: {len(node.nodes)}")
    print(f"  Bounding boxes: {len(node.bounding_boxes)}")
Each TreeNode has: title, node_id, start_index, end_index, level, summary, text, bounding_boxes, tables, and child nodes.
Entity Graph
The entity graph connects companies, people, metrics, and their relationships. Built automatically from API-extracted entities during indexing.
# Access the graph after indexing
graph = ni._graphs.get(tree.doc_name)
# Entities
for entity in graph.entities[:5]:
    print(f"{entity.name} ({entity.entity_type})")
    print(f"  Found in nodes: {entity.source_node_ids}")
# Relationships
for rel in graph.relationships[:5]:
    print(f"{rel.source} --{rel.keywords}--> {rel.target}")
Relationship types include: has_value, grew_by, subsidiary_of, holds_title, operates_in, governed_by, references, and more.
Cross-References
Legal contracts and SEC filings are full of internal references: "Section 15.2 limits liability to the contract value as defined in Section 2.5(b)." NanoIndex detects these automatically and creates graph edges between the referencing and referenced nodes.
Supported patterns:
Section 3.1, Section 2.5(b), Clause 5, Article 19, Item 1A, Item 7, Note 5, Exhibit 4.6, Part I, Schedule A
When the agent encounters a section that references another, it can follow the graph edge to retrieve the referenced content automatically. No manual linking required.
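Detection of the patterns above is essentially a keyword-plus-identifier match. A minimal sketch, assuming a single regex over the section text (`find_references` is hypothetical; the real detector also resolves each match to a target node and creates the graph edge):

```python
import re

# Match a reference keyword followed by a numeric, dotted, lettered,
# Roman-numeral, or single-letter identifier, e.g. "Section 2.5(b)",
# "Item 1A", "Part I", "Schedule A".
REF_PATTERN = re.compile(
    r"\b(Section|Clause|Article|Item|Note|Exhibit|Part|Schedule)\s+"
    r"([0-9]+[A-Z]?(?:\.[0-9]+)*(?:\([a-z]\))?|[IVX]+|[A-Z])"
)

def find_references(text: str) -> list[str]:
    return [f"{kind} {num}" for kind, num in REF_PATTERN.findall(text)]
```

Each extracted reference string would then be matched against node titles to add the corresponding edge.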
Citations
Every answer includes citations with page numbers and optional bounding boxes.
answer = ni.ask("What was the revenue?", tree)
for citation in answer.citations:
    print(f"Section: {citation.title}")
    print(f"Pages: {citation.pages}")
    print(f"Node ID: {citation.node_id}")
    # Pixel-level bounding boxes (when available)
    for bb in citation.bounding_boxes:
        print(f"  Page {bb.page}: ({bb.x}, {bb.y}) {bb.width}x{bb.height}")
Domain Configuration
NanoIndex supports domain-specific extraction with custom entity types:
# Financial documents (auto-detected for 10-K/10-Q)
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", financial_doc=True)
# Custom entity types for any domain
ni = NanoIndex(
    llm="google:gemini-2.5-flash",
    custom_entity_types=["Drug", "Diagnosis", "Procedure", "Dosage", "ClinicalTrial"],
    custom_relationship_types=["treats", "has_endpoint", "dosed_at"],
)
Built-in domains: sec_10k, sec_10q, earnings, financial, legal, legal_contract, medical, insurance, generic.
Offline Mode
Use PyMuPDF for fully local text extraction with no external API:
ni = NanoIndex(
    parser="pymupdf",    # local extraction, no API key needed
    llm="ollama:llama3", # local LLM
)
tree = ni.index("report.pdf") # works fully offline
Offline mode uses PyMuPDF for text extraction and any local LLM for answering. No Nanonets API key required. Tree building, graph construction, and querying all work locally.
Knowledge Base
Add multiple documents to a persistent knowledge base. Query across all of them.
from nanoindex import KnowledgeBase
kb = KnowledgeBase("./my_kb", llm="anthropic:claude-sonnet-4-6")
# Add documents
kb.add("q1_report.pdf")
kb.add("q2_report.pdf")
kb.add("contract.pdf")
# Query across all documents
answer = kb.ask("Compare Q1 and Q2 revenue")
# Check status
print(kb.status()) # documents, concepts, queries
The knowledge base persists trees, graphs, and embeddings to disk. Browse the generated wiki in Obsidian.
CLI
# Index a document
nanoindex index report.pdf --output tree.json
# Ask a question
nanoindex ask tree.json "What was the revenue?" --llm-model claude-sonnet-4-6
# Launch the interactive visualizer
nanoindex viz tree.json
# Get help
nanoindex --help
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| llm | auto-detect | LLM provider and model |
| parser | "nanonets" | "nanonets" or "pymupdf" |
| financial_doc | False | Enable financial-specific prompts |
| build_graph | True | Build entity graph during indexing |
| add_summaries | True | Generate per-node summaries |
| max_node_tokens | 20,000 | Split nodes larger than this |
| confidence_threshold | 0.7 | Min bbox confidence to keep |
| custom_entity_types | None | Custom entity types for extraction |
| custom_relationship_types | None | Custom relationship types |