Documentation

Everything you need to index, query, and build on top of NanoIndex.

Installation

pip install nanoindex

Requires Python 3.10+. Installs PyMuPDF, httpx, pydantic, networkx, and spaCy as dependencies.

You need two API keys:

  • NANONETS_API_KEY for document extraction (free, 10K pages at docstrange.nanonets.com)
  • LLM API key for answering questions (OpenAI, Anthropic, or Google)
export NANONETS_API_KEY=your_key_here
export GOOGLE_API_KEY=your_key_here  # or ANTHROPIC_API_KEY or OPENAI_API_KEY

Quickstart

from nanoindex import NanoIndex

ni = NanoIndex()                              # auto-detects API keys from env
tree = ni.index("annual_report.pdf")          # builds tree + graph
answer = ni.ask("What was the revenue?", tree)
print(answer.content)                         # "Revenue was $280.5B in FY2019"
print(answer.citations)                       # [Citation(title="PART II", pages=[17,18])]

That's it. Three lines from PDF to cited answer.

LLM Configuration

NanoIndex supports any OpenAI-compatible LLM. Pass the provider and model name:

# Anthropic
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6")

# OpenAI
ni = NanoIndex(llm="openai:gpt-5.4")

# Google
ni = NanoIndex(llm="google:gemini-2.5-flash")

# Local models via Ollama
ni = NanoIndex(llm="ollama:llama3")

If no llm= is passed, NanoIndex auto-detects from environment variables in this order: ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, GROQ_API_KEY.
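The detection order can be pictured as a first-match scan over environment variables. This is an illustrative sketch (the function name and error are not part of the NanoIndex API), but the variable-to-provider order mirrors the one documented above:

```python
import os

# First env var found wins, in the documented priority order.
PROVIDER_ENV_ORDER = [
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("OPENAI_API_KEY", "openai"),
    ("GOOGLE_API_KEY", "google"),
    ("GROQ_API_KEY", "groq"),
]

def detect_provider(env=os.environ):
    """Sketch of provider auto-detection: return the first provider
    whose API key is present in the environment."""
    for var, provider in PROVIDER_ENV_ORDER:
        if env.get(var):
            return provider
    raise RuntimeError("No LLM API key found in environment")
```

With both OPENAI_API_KEY and GOOGLE_API_KEY set, this picks "openai", because it comes earlier in the order.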

Indexing Documents

Indexing extracts the document structure, builds a tree, generates summaries, and constructs an entity graph.

tree = ni.index("report.pdf")

# With options
tree = ni.index(
    "report.pdf",
    add_summaries=True,        # generate per-node summaries (default: True)
    add_doc_description=False, # skip the document-level description
)

What happens during indexing:

  1. Nanonets OCR-3 extracts sections, tables, entities, bounding boxes (single API call)
  2. Tree builder creates a navigable hierarchy (PART I > Item 1 > Business > Divisions)
  3. SEC filing enforcement normalizes 10-K/10-Q structure
  4. Entity graph built from extracted relationships
  5. Summaries generated for nodes without API-provided summaries

How Splitting Works

NanoIndex never splits by fixed token count. Instead, it uses a 4-tier cascade that preserves document meaning:

  1. Heading split (zero LLM cost). If a node's text contains markdown headings (#, ##), split on those boundaries. Each sub-heading becomes a child node.
  2. LLM title split. Ask the LLM to identify 3-8 logical subsection boundaries in the text. Split at those points.
  3. Proposition extraction. For dense text with no structure, extract atomic factual claims. Each proposition is a self-contained statement that can stand alone.
  4. Paragraph chunking (last resort). Split at paragraph boundaries to respect token limits.
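Tier 1 can be approximated in a few lines: split wherever a markdown heading starts a line, and make each heading's span a child node. A sketch, not the actual splitter:

```python
import re

def heading_split(text):
    """Tier-1 sketch: split text at markdown heading boundaries.

    Returns a list of (title, body) pairs, or [] when the text has
    no headings, signalling a fall-through to the next tier.
    """
    # Match lines like "# Title" or "## Title" at the start of a line.
    matches = list(re.finditer(r"(?m)^(#{1,6})\s+(.+)$", text))
    if not matches:
        return []
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        sections.append((m.group(2).strip(), body))
    return sections
```

For example, `heading_split("# A\none\n## B\ntwo")` yields `[("A", "one"), ("B", "two")]`, and a text with no headings falls through to tier 2.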

Example of proposition extraction on a legal paragraph:

# Original text:
# "The vendor warrants all goods are free from defects under
#  Consumer Protection Act Section 84(a). Liability covers both
#  direct and consequential damages, capped at the contract value
#  defined in Clause 5(b). The vendor may be exempted under
#  Section 85 if the defect was caused by buyer negligence."

# NanoIndex extracts 4 propositions:
# P1: "Vendor warrants goods free from defects under CPA Section 84(a)"
# P2: "Liability covers direct and consequential damages"
# P3: "Liability capped at contract value defined in Clause 5(b)"
# P4: "Vendor exempt under Section 85 if defect caused by buyer negligence"

# Each is independently queryable:
# "What is the liability cap?" -> retrieves P3
# "Can liability be exempted?" -> retrieves P4

This means every piece of the tree is meaningful on its own, not an arbitrary 512-token fragment that starts mid-sentence.
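The payoff of proposition-level nodes can be seen with a toy retriever that scores each proposition by keyword overlap with the query. This is illustrative only (NanoIndex's actual retrieval is LLM-driven tree navigation, not bag-of-words), but it shows why self-contained statements are independently queryable:

```python
STOPWORDS = {"what", "is", "the", "a", "an", "can", "be", "which", "was"}

def tokens(text):
    return [w.strip("?.,()").lower() for w in text.split() if w.strip("?.,()")]

def score(query, proposition):
    """Count query words that (prefix-)match a proposition word,
    so "cap" also hits "capped" and "exemption" hits "exempt"."""
    q = [w for w in tokens(query) if w not in STOPWORDS]
    p = tokens(proposition)
    return sum(1 for qw in q if any(pw.startswith(qw) or qw.startswith(pw) for pw in p))

# The propositions extracted in the example above:
propositions = {
    "P1": "Vendor warrants goods free from defects under CPA Section 84(a)",
    "P2": "Liability covers direct and consequential damages",
    "P3": "Liability capped at contract value defined in Clause 5(b)",
    "P4": "Vendor exempt under Section 85 if defect caused by buyer negligence",
}

def best(query):
    return max(propositions, key=lambda pid: score(query, propositions[pid]))
```

Here `best("What is the liability cap?")` lands on P3: because each proposition carries one complete claim, a simple lexical score is enough to separate them.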

Querying

# Simple question
answer = ni.ask("What was the revenue?", tree)

# With PDF for vision mode
answer = ni.ask("What was the revenue?", tree, pdf_path="report.pdf")

# Access the answer
print(answer.content)     # the answer text
print(answer.citations)   # list of Citation objects
print(answer.mode)        # "text", "fast", "agentic_vision", etc.

Query Architecture

When you call ni.ask(), here is what happens internally:

  1. Query decomposition. The LLM breaks the question into required data points. "What was the EBITDA margin?" becomes: need operating income (from income statement) + D&A (from cash flow statement) + revenue (from income statement).
  2. Tree navigation. The agent reads the full tree outline (all node titles + summaries) and selects which sections to read. It reasons: "EBITDA needs Item 7 (MD&A) and Item 8 (Financial Statements)."
  3. Multi-round retrieval. The agent reads the selected sections. If it needs more context, it requests additional sections. This continues for 2-5 rounds until the agent has enough information.
  4. Sufficiency check. For financial queries, the system verifies all required data points from step 1 are covered. If "D&A" was needed but not found, it triggers targeted retrieval.
  5. Answer generation. The LLM generates an answer from the gathered context, with step-by-step reasoning for calculations.
  6. Calculation verification. If the answer contains math, a separate verification step checks the arithmetic against source data.
  7. Self-evaluation. If the answer looks like a refusal ("cannot determine"), the system retries with expanded context or falls back to keyword search.

In agentic_vision mode, steps 3-5 also include rendered PDF page images alongside the extracted text. The LLM can read tables, charts, and visual layouts directly from the page.

In fast mode, steps 2-4 are replaced by entity graph traversal (find entities matching the query, expand via relationships, LLM picks from candidates). This uses only 2 LLM calls instead of 5-11.
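Fast mode's traversal can be pictured with a plain adjacency map: match entities against the query, expand one hop via relationships, and hand the candidates to the LLM. A stdlib sketch with illustrative data (NanoIndex builds its graph on networkx, and the final pick is the second LLM call, not code):

```python
# entity -> list of (relationship, target) edges; illustrative data only
GRAPH = {
    "Revenue": [("has_value", "$280.5B"), ("grew_by", "2.8%")],
    "Apple Inc.": [("operates_in", "Consumer Electronics")],
}

def fast_candidates(query, graph):
    """Sketch of fast mode's retrieval: substring-match entities
    against the query (the real matcher is LLM call 1), then expand
    one hop via relationships; LLM call 2 picks the answer."""
    q = query.lower()
    matched = [e for e in graph if e.lower() in q]
    expanded = []
    for entity in matched:
        for rel, target in graph[entity]:
            expanded.append((entity, rel, target))
    return expanded
```

For "What was the revenue?" this surfaces both Revenue edges as candidates, and the second LLM call selects `has_value -> $280.5B`.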

Query Modes

Different modes trade off speed vs accuracy. The key difference between agentic and agentic_vision: agentic mode sends extracted text to the LLM, while agentic_vision also renders the actual PDF pages as images and sends them alongside the text. This lets the LLM read tables, charts, and layouts directly from the page rather than relying on OCR text alone.

Mode            LLM Calls   Best For
fast            2           Simple lookups, low latency
agentic         5-11        Complex multi-step, highest accuracy
agentic_vision  5-11        Tables, charts, scanned docs
agentic_graph   3-6         Entity-centric queries

answer = ni.ask("What was the revenue?", tree, mode="fast")
answer = ni.ask("Calculate EBITDA margin", tree, mode="agentic")
answer = ni.ask("Read the table on page 40", tree, mode="agentic_vision", pdf_path="report.pdf")

Tree Structure

The tree is the core data structure. Save it, reload it, and walk its nodes:

from nanoindex.utils.tree_ops import save_tree, load_tree, iter_nodes

# Save and reload (no re-indexing needed)
save_tree(tree, "my_tree.json")
tree = load_tree("my_tree.json")

# Explore nodes
for node in iter_nodes(tree.structure):
    print(f"[{node.node_id}] {node.title}")
    print(f"  Pages {node.start_index}-{node.end_index}")
    print(f"  Summary: {node.summary}")
    print(f"  Children: {len(node.nodes)}")
    print(f"  Bounding boxes: {len(node.bounding_boxes)}")

Each TreeNode has: title, node_id, start_index, end_index, level, summary, text, bounding_boxes, tables, and child nodes.
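If you build or post-process trees yourself, the node shape can be approximated with a dataclass. The field names follow the list above, but this is a sketch, not the real nanoindex class:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """Approximation of a NanoIndex tree node (see field list above)."""
    title: str
    node_id: str
    start_index: int = 0          # first page covered by this node
    end_index: int = 0            # last page covered by this node
    level: int = 0                # depth in the hierarchy
    summary: str = ""
    text: str = ""
    bounding_boxes: list = field(default_factory=list)
    tables: list = field(default_factory=list)
    nodes: list = field(default_factory=list)   # child TreeNodes

def iter_nodes(root):
    """Depth-first traversal: yield root, then every descendant."""
    yield root
    for child in root.nodes:
        yield from iter_nodes(child)
```

Building `PART I > Item 1 / Item 1A` and iterating yields the titles in document order: `["PART I", "Item 1", "Item 1A"]`.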

Entity Graph

The entity graph connects companies, people, metrics, and their relationships. Built automatically from API-extracted entities during indexing.

# Access the graph after indexing
graph = ni._graphs.get(tree.doc_name)

# Entities
for entity in graph.entities[:5]:
    print(f"{entity.name} ({entity.entity_type})")
    print(f"  Found in nodes: {entity.source_node_ids}")

# Relationships
for rel in graph.relationships[:5]:
    print(f"{rel.source} --{rel.keywords}--> {rel.target}")

Relationship types include: has_value, grew_by, subsidiary_of, holds_title, operates_in, governed_by, references, and more.

Cross-References

Legal contracts and SEC filings are full of internal references: "Section 15.2 limits liability to the contract value as defined in Section 2.5(b)." NanoIndex detects these automatically and creates graph edges between the referencing and referenced nodes.

Supported patterns:

  • Section 3.1, Section 2.5(b)
  • Clause 5, Article 19
  • Item 1A, Item 7
  • Note 5, Exhibit 4.6
  • Part I, Schedule A

When the agent encounters a section that references another, it can follow the graph edge to retrieve the referenced content automatically. No manual linking required.
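A regex along these lines covers the listed patterns. A sketch only; the real detector may normalize and handle more variants:

```python
import re

# Matches "Section 2.5(b)", "Clause 5", "Article 19", "Item 1A",
# "Note 5", "Exhibit 4.6", "Part I", "Schedule A".
REF_PATTERN = re.compile(
    r"\b(Section|Clause|Article|Item|Note|Exhibit|Schedule|Part)\s+"
    r"([0-9]+[A-Za-z]?(?:\.[0-9]+)*(?:\([a-z]\))?|[IVXLC]+\b|[A-Z]\b)"
)

def find_references(text):
    """Return every cross-reference mention found in the text."""
    return [m.group(0) for m in REF_PATTERN.finditer(text)]
```

Run on the sentence above, it finds both "Section 15.2" and "Section 2.5(b)"; each hit would become a graph edge from the referencing node to the referenced one.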

Citations

Every answer includes citations with page numbers and optional bounding boxes.

answer = ni.ask("What was the revenue?", tree)

for citation in answer.citations:
    print(f"Section: {citation.title}")
    print(f"Pages: {citation.pages}")
    print(f"Node ID: {citation.node_id}")

    # Pixel-level bounding boxes (when available)
    for bb in citation.bounding_boxes:
        print(f"  Page {bb.page}: ({bb.x}, {bb.y}) {bb.width}x{bb.height}")

Domain Configuration

NanoIndex supports domain-specific extraction with custom entity types:

# Financial documents (auto-detected for 10-K/10-Q)
ni = NanoIndex(llm="anthropic:claude-sonnet-4-6", financial_doc=True)

# Custom entity types for any domain
ni = NanoIndex(
    llm="google:gemini-2.5-flash",
    custom_entity_types=["Drug", "Diagnosis", "Procedure", "Dosage", "ClinicalTrial"],
    custom_relationship_types=["treats", "has_endpoint", "dosed_at"],
)

Built-in domains: sec_10k, sec_10q, earnings, financial, legal, legal_contract, medical, insurance, generic.

Offline Mode

Use PyMuPDF for fully local text extraction with no external API:

ni = NanoIndex(
    parser="pymupdf",           # local extraction, no API key needed
    llm="ollama:llama3",        # local LLM
)
tree = ni.index("report.pdf")  # works fully offline

Offline mode uses PyMuPDF for text extraction and any local LLM for answering. No Nanonets API key required. Tree building, graph construction, and querying all work locally.

Knowledge Base

Add multiple documents to a persistent knowledge base. Query across all of them.

from nanoindex import KnowledgeBase

kb = KnowledgeBase("./my_kb", llm="anthropic:claude-sonnet-4-6")

# Add documents
kb.add("q1_report.pdf")
kb.add("q2_report.pdf")
kb.add("contract.pdf")

# Query across all documents
answer = kb.ask("Compare Q1 and Q2 revenue")

# Check status
print(kb.status())  # documents, concepts, queries

The knowledge base persists trees, graphs, and embeddings to disk. Browse the generated wiki in Obsidian.

CLI

# Index a document
nanoindex index report.pdf --output tree.json

# Ask a question
nanoindex ask tree.json "What was the revenue?" --llm-model claude-sonnet-4-6

# Launch the interactive visualizer
nanoindex viz tree.json

# Get help
nanoindex --help

Configuration Reference

Parameter                   Default        Description
llm                         auto-detect    LLM provider and model
parser                      "nanonets"     "nanonets" or "pymupdf"
financial_doc               False          Enable financial-specific prompts
build_graph                 True           Build entity graph during indexing
add_summaries               True           Generate per-node summaries
max_node_tokens             20,000         Split nodes larger than this
confidence_threshold        0.7            Min bbox confidence to keep
custom_entity_types         None           Custom entity types for extraction
custom_relationship_types   None           Custom relationship types