Retrieval · Ingestion

Unstructured.io

Parsing PDFs, HTML, DOCX, images, and tables into clean chunks ready for embedding

25+ file types
7 sections
Python-first SDK
Contents
  1. Why parsing is hard
  2. Element types
  3. Partition functions
  4. Chunking strategies
  5. Cloud vs local
  6. Integration
  7. References
01 — The Problem

Why Document Parsing Is Hard

Most documents are messy. PDFs embed text without structure — you get character positions and font info, but no indication of what is a title, table, or paragraph. HTML mixes markup with content. DOCX files are XML. Images contain text that needs OCR. And tables are notoriously difficult: cell detection, header inference, and value alignment all fail silently on complex layouts.

Naive extraction (splitting on whitespace, regex-based chunking) fails catastrophically: tables get flattened into gibberish, multi-column layouts become interleaved nonsense, and semantic boundaries disappear. When you feed that garbage to an LLM, retrieval quality collapses.
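To make the failure concrete, here is a toy sketch (ours, not Unstructured code) of fixed-size character chunking cutting straight through a small table:

```python
# Fixed-size character chunking ignores structure entirely.
raw = (
    "Quarterly results\n"
    "Region,Q1,Q2\n"
    "EMEA,1.2,1.4\n"
    "APAC,0.9,1.1\n"
)

def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for c in naive_chunks(raw, 20):
    print(repr(c))
# The header word "Region" is sliced across two chunks, so no chunk
# contains it intact and the data rows lose their column labels.
```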

Common Failure Modes

🔲 PDF Layout Loss

  • Two-column layouts interleave lines
  • Sidebars and footnotes appear out of order
  • Headings separate from content

📊 Table Destruction

  • Headers unlinked from data rows
  • Cell values scrambled by column spanning
  • Merged cells create orphan values

🖼️ Image Blindness

  • Text in images never extracted
  • Charts and diagrams treated as void
  • OCR errors corrupt extracted text

🏷️ Semantic Loss

  • No distinction between title and body
  • List structure becomes flat text
  • Code blocks indistinguishable from prose
💡 Why parsing matters: Unstructured.io isn't magic, but it addresses these failure modes systematically. Parsing quality sets the ceiling for RAG quality: garbage parsing ruins even the best retriever.
02 — Semantics

Unstructured Element Types

Unstructured parses documents into semantic elements — not just raw text. Each element has a type, text content, and metadata (font size, position, page number). Understanding element types lets you build smart chunking and retrieval strategies.

Element Type  | Description                   | Example                         | RAG Use
Title         | Top-level document heading    | "Annual Report 2024"            | Add to chunk metadata; high relevance for matching
Heading       | Section header (h2, h3, etc.) | "Financial Performance"         | Mark section boundaries; use for hierarchy
NarrativeText | Body paragraph                | "Revenue grew by 15%..."        | Main content; embed and retrieve
ListItem      | Bulleted or numbered item     | "• Focus on customer retention" | Keep together with siblings; preserve order
Table         | Structured data grid          | CSV-like rows and columns       | Convert to markdown; chunk per row or keep whole table
Image         | Raster or vector graphic      | Embedded PNG or JPG             | OCR for text; optional caption extraction
CodeBlock     | Source code snippet           | Python, SQL, JavaScript         | Preserve indentation; don't split; embed as-is
PageBreak     | Explicit page boundary        | (metadata only)                 | Track document location; include in chunk metadata
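As a sketch of how these types can drive downstream handling, a minimal dispatch table might look like this (the policy names are ours, purely illustrative):

```python
# Map element categories to a downstream handling policy (illustrative).
POLICY = {
    "Title": "attach_as_metadata",
    "Heading": "mark_section_boundary",
    "NarrativeText": "embed",
    "ListItem": "group_with_siblings",
    "Table": "convert_to_markdown",
    "Image": "run_ocr",
    "CodeBlock": "embed_whole",
    "PageBreak": "attach_as_metadata",
}

def handling_for(category: str) -> str:
    """Unknown categories fall back to plain embedding."""
    return POLICY.get(category, "embed")

print(handling_for("Table"))          # → convert_to_markdown
print(handling_for("FigureCaption"))  # → embed
```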

Accessing Element Metadata

Each element object carries rich metadata: text, type, element_id, page_number, bbox (bounding box), and language. Use this to build context-aware chunks — for example, always include the parent heading with narrative text, or keep a code block with its preceding comment.

# Elements carry type, text, and rich metadata
from unstructured.documents.elements import (
    ElementMetadata,
    NarrativeText,
    Title,
)

title = Title(text="Quarterly Report")
para = NarrativeText(
    text="Revenue: $5.2M",
    metadata=ElementMetadata(page_number=3, languages=["en"]),
)

# Use the element type to build smarter chunking
for elem in (title, para):
    if isinstance(elem, Title):
        chunk_importance = "high"   # weight titles more heavily in embeddings
    else:
        chunk_importance = "normal"
⚠️ Element types are not perfect: Classification is heuristic-based. Some headings might be labeled as NarrativeText. Always validate with sampled documents, especially on novel formats.
03 — Core API

Partition Functions

Unstructured provides partition functions for each document type. Each function reads a file, applies format-specific parsing, and returns a list of element objects. The API is consistent: every partition_* function takes a file path and optional parameters and returns a List[Element].

Main Partition Functions

1. partition_pdf() — PDF & scanned docs

Extracts text, layout, and tables from PDFs. Handles both digital PDFs (with embedded text) and scanned PDFs (with images). Two strategies: fast extracts text directly; hi_res uses layout analysis for better structure.

  • strategy="fast": Quick, suitable for clean digital PDFs
  • strategy="hi_res": Slower, handles multi-column and complex layouts
  • infer_table_structure=True: Extract table cell relationships
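With infer_table_structure=True, Table elements expose an HTML rendering of the table via metadata.text_as_html. A stdlib-only sketch for turning that HTML into markdown (the converter is ours, not part of Unstructured):

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Collect rows and cells from a simple HTML <table>."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_data(self, data):
        self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append(" ".join(filter(None, self.cell)))
        elif tag == "tr":
            self.rows.append(self.row)

def table_html_to_markdown(html_str: str) -> str:
    """Render the first-row-as-header table as a markdown pipe table."""
    parser = TableToMarkdown()
    parser.feed(html_str)
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

html = "<table><tr><th>Region</th><th>Q1</th></tr><tr><td>EMEA</td><td>1.2</td></tr></table>"
print(table_html_to_markdown(html))
```

Markdown renderings of tables tend to embed and retrieve far better than flattened cell text.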
2. partition_html() — Web pages

Parses HTML into semantic elements. Strips markup, identifies headings and paragraphs, extracts tables from <table> tags. Optional include/exclude selectors to keep or ignore specific DOM subtrees.

  • include_metadata=True: Preserve CSS classes and IDs
  • skip_headers_footers=True: Skip nav/footer regions
  • Dynamically loaded content must be rendered first (e.g., with Playwright) before partitioning
3. partition_docx() — Microsoft Word

Reads DOCX (Office Open XML) directly. Preserves styles, heading levels, table structure, and embedded images. DOCX is structured, so this is one of the most reliable partitioners.

  • Respects document outline and heading hierarchy
  • Extracts image captions and alt-text
  • Preserves table cell relationships
4. partition_image() — OCR

Extracts text from image files (PNG, JPG, etc.) using Tesseract or cloud OCR. Optional layout analysis. For scanned PDFs, this is called internally.

  • strategy="hi_res": Layout-aware OCR
  • ocr_languages=["en", "fr"]: Multi-language support
  • Integrates with cloud OCR (Azure, Google) for better quality

Partition Example

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html
from unstructured.partition.docx import partition_docx

# Fast extraction
elements = partition_pdf("report.pdf", strategy="fast")

# High-resolution layout analysis
elements = partition_pdf(
    "complex_layout.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

# HTML parsing
elements = partition_html("article.html", skip_headers_footers=True)

# DOCX with styles preserved
elements = partition_docx("document.docx")

# Print element categories
for elem in elements[:10]:
    print(f"{elem.category:15s} | {elem.text[:60]}")

Strategy Comparison: Fast vs Hi-Res

Fast: Extracts text stream. Best for single-column documents with clean formatting. ~100ms per page.
Hi-Res: Uses layout detection (Detectron2) to understand columns, footers, sidebars. Significantly slower (~5-10s per page) but captures structure. Choose hi_res for multi-column PDFs, magazine layouts, or documents with sidebars.

💡 Start with fast, measure quality: Test both strategies on your documents. Fast often works well and is 50-100× faster. Only upgrade to hi_res if quality assessment shows clear gains.
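That rule of thumb can be written down as a tiny helper (the function name and inputs are ours; detecting the inputs, e.g. whether a PDF has an embedded text layer, is left to your pipeline):

```python
def choose_pdf_strategy(has_text_layer: bool, complex_layout: bool) -> str:
    """Start with "fast"; escalate to "hi_res" only when the document
    demands layout analysis (scans, multi-column, sidebars)."""
    if not has_text_layer:
        return "hi_res"  # scanned pages need OCR plus layout detection
    if complex_layout:
        return "hi_res"  # multi-column or magazine-style layouts
    return "fast"

print(choose_pdf_strategy(has_text_layer=True, complex_layout=False))   # → fast
print(choose_pdf_strategy(has_text_layer=False, complex_layout=False))  # → hi_res
```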
04 — Token Sizing

Chunking with Unstructured

Parsing is step one; chunking is step two. Raw elements are too granular (each paragraph is one element). Chunking combines elements into fixed-size, semantically coherent chunks suitable for embeddings and retrieval.

Built-in Chunking Functions

chunk_by_title() groups elements under each heading, respecting heading hierarchy. chunk_elements() combines elements by character count until reaching max_characters. Both preserve element metadata.

from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# `elements` comes from a prior partition_* call

# Chunk by heading hierarchy
chunks = chunk_by_title(elements, combine_text_under_n_chars=500)
# Result: each chunk is the content under one heading

# Chunk by character budget
chunks = chunk_elements(elements, max_characters=512, new_after_n_chars=400)
# Result: combine elements up to 512 chars; prefer a new chunk after 400

# Access chunk text and metadata
for chunk in chunks[:3]:
    print(f"Words: {len(chunk.text.split())}")
    print(f"Category: {chunk.category}")
    print(f"Text: {chunk.text[:100]}...")
    print()

Chunking Best Practices

✂️ Size Matters

  • 512 chars ≈ 100–130 tokens (English text)
  • For sparse retrieval: 256–512 chars
  • For dense (vector): 256–1024 chars
  • Adjust based on embedding context window
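Character counts map to tokens only approximately. A quick sanity-check helper (the 4-characters-per-token ratio is a rough English-text assumption, not a tokenizer):

```python
def estimated_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; use a real tokenizer for exact budgets."""
    return round(n_chars / chars_per_token)

print(estimated_tokens(512))   # → 128
print(estimated_tokens(1024))  # → 256
```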

🔀 Semantic Boundaries

  • Chunk by heading when possible
  • Don't split tables or code blocks
  • Include context (parent heading) in metadata
  • Preserve list structure

📦 Overlapping Windows

  • Use 10–20% overlap for continuity
  • Helps bridge context between chunks
  • Trade-off: more embeddings, better recall

🏷️ Rich Metadata

  • Add document title, file path
  • Track page numbers, headings
  • Tag sensitive data (PII, confidential)
  • Include chunk index for reranking
⚠️ Beware chunk explosion: With overlap, a 100-page document can generate 10,000+ chunks. Embedding costs balloon. Set aggressive max_characters and test carefully.
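To see how overlap inflates chunk counts (and with them embedding cost), here is a quick sliding-window estimate (the page and character figures are illustrative):

```python
import math

def chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Chunks produced by a sliding window with the given overlap."""
    stride = chunk_size - overlap
    return max(1, math.ceil((total_chars - overlap) / stride))

# 100 pages at ~3,000 characters per page
total = 100 * 3_000
print(chunk_count(total, 512, 0))    # no overlap
print(chunk_count(total, 512, 102))  # ~20% overlap: roughly 25% more chunks
```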
05 — Deployment

Cloud API vs Local Processing

Unstructured.io offers two deployment modes: the Cloud API (managed service) and the Local SDK (self-hosted). Cloud is simpler but sends documents off-premise; Local gives full control but requires your own infrastructure.

Aspect        | Cloud API                              | Local SDK
Setup         | API key, HTTP requests                 | pip install; GPU optional
Latency       | 100–500 ms per document                | 10–100 ms per document (with GPU)
Cost          | $0.001–0.01 per page                   | One-time model download; GPU rental if scaling
Privacy       | Documents sent to Unstructured servers | Fully on-premise
OCR Quality   | Cloud OCR (Azure, Google)              | Tesseract (lower quality)
Scaling       | Unstructured manages scaling           | You manage queue, workers, GPU
Model Updates | Automatic                              | Manual (pip upgrade)

Cloud API Example

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="your-api-key")

# Upload the file to the hosted API for parsing
with open("document.pdf", "rb") as f:
    req = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(content=f.read(), file_name="document.pdf"),
            strategy=shared.Strategy.HI_RES,
        )
    )

res = client.general.partition(request=req)

# Response elements are plain dicts
for elem in res.elements:
    print(f"{elem['type']}: {elem['text'][:80]}")

Local SDK Example

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# All processing happens locally
elements = partition_pdf("document.pdf", strategy="hi_res")

# Chunking is local too
chunks = chunk_by_title(elements, combine_text_under_n_chars=500)

# Hand off to your vector database
# (embedder and store_in_db are placeholders for your own code)
for chunk in chunks:
    embedding = embedder.embed(chunk.text)
    store_in_db(chunk.text, embedding, metadata=chunk.metadata.to_dict())
💡 Hybrid approach: For very large pipelines, use Cloud API for heavy lifting (tables, scanned PDFs), fall back to Local SDK for simple docs. Or batch cloud requests during off-peak hours to reduce costs.
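One way to sketch that routing decision (the criteria and function names are ours, purely illustrative):

```python
def route_document(is_scanned: bool, has_complex_tables: bool) -> str:
    """Send hard documents to the Cloud API; keep easy ones local."""
    if is_scanned or has_complex_tables:
        return "cloud"   # better OCR and table models
    return "local"       # cheap, private, fast enough

print(route_document(is_scanned=False, has_complex_tables=False))  # → local
print(route_document(is_scanned=True, has_complex_tables=False))   # → cloud
```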
06 — Ecosystem

Integration with LangChain & LlamaIndex

Unstructured.io integrates with the major LLM frameworks: LangChain provides UnstructuredLoader (via the langchain-unstructured package) and LlamaIndex provides UnstructuredReader. Both handle partitioning and chunking transparently.

LangChain Integration

from langchain_unstructured import UnstructuredLoader
from langchain_community.vectorstores import FAISS

# Load and partition in one step; each element becomes a Document
loader = UnstructuredLoader(
    file_path="document.pdf",
    strategy="hi_res",
)
documents = loader.load()

# Element type and page number arrive in Document.metadata
for doc in documents[:5]:
    print(doc.metadata.get("category"), doc.metadata.get("page_number"))

# Chain into embeddings + a vector store (embedder is your embedding model)
vector_store = FAISS.from_documents(documents, embedder)

LlamaIndex Integration

from pathlib import Path

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.file import UnstructuredReader

# Load with the Unstructured backend
reader = UnstructuredReader()
documents = reader.load_data(file=Path("document.pdf"))

# Or route PDFs through Unstructured inside SimpleDirectoryReader
loader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": UnstructuredReader()},
)
documents = loader.load_data()

# Build the index
index = VectorStoreIndex.from_documents(documents)

Custom Pipeline

For full control, partition manually, chunk, embed, and store:

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from pinecone import Pinecone

# Auto-detect file type
elements = partition("document.pdf")

# Custom chunking
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=500,
    max_characters=1000,
)

# Embed and store (embed_model and the index name are your own setup)
pc = Pinecone(api_key="your-api-key")
pinecone_index = pc.Index("docs")

for i, chunk in enumerate(chunks):
    embedding = embed_model.encode(chunk.text)
    pinecone_index.upsert([(
        f"doc__{i}",
        embedding.tolist(),
        {
            "text": chunk.text,
            "category": chunk.category,
            "document": "document.pdf",
        },
    )])

print(f"Indexed {len(chunks)} chunks")
Tools & Ecosystem

Related Tools

  • Unstructured.io (Parsing): Multi-format document parser with cloud and local options
  • Docling (PDF): IBM's structured document converter for PDF, DOCX, PPTX
  • LlamaParse (PDF): LlamaIndex's document parser optimized for tables and complex layouts
  • PyMuPDF (PDF): Fast low-level PDF manipulation and text extraction
  • pdfplumber (PDF): Precise table and text extraction from PDFs
  • Camelot (Tables): Specialized table extraction from PDFs
  • Tesseract (OCR): Open-source OCR engine for scanned documents
  • LangChain (Framework): LLM framework with Unstructured loader integration
  • LlamaIndex (Framework): Data indexing framework with UnstructuredReader
07 — Further Reading

References
