Retrieval · Ingestion

Unstructured.io

Parsing PDFs, HTML, DOCX, images, and tables into clean chunks ready for embedding

25+ file types
7 sections
Python-first SDK
Contents
  1. Why parsing is hard
  2. Element types
  3. Partition functions
  4. Chunking strategies
  5. Cloud vs local
  6. Integration
  7. References
01 — The Problem

Why Document Parsing Is Hard

Most documents are messy. PDFs embed text without structure — you get character positions and font info, but no indication of what is a title, table, or paragraph. HTML mixes markup with content. DOCX files are XML. Images contain text that needs OCR. And tables are notoriously difficult: cell detection, header inference, and value alignment all fail silently on complex layouts.

Naive extraction (splitting on whitespace, regex-based chunking) fails catastrophically: tables get flattened into gibberish, multi-column layouts become interleaved nonsense, and semantic boundaries disappear. When you feed that garbage to an LLM, retrieval quality collapses.
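To make the failure concrete, here is a toy sketch (ours, not Unstructured code) of fixed-size character chunking cutting straight through a small table:

```python
# Fixed-size character chunking ignores structure entirely.
raw = (
    "Quarterly results\n"
    "Region,Q1,Q2\n"
    "EMEA,1.2,1.4\n"
    "APAC,0.9,1.1\n"
)

def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for c in naive_chunks(raw, 20):
    print(repr(c))
# The header word "Region" is sliced across two chunks, so no chunk
# contains it intact and the data rows lose their column labels.
```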

Common Failure Modes

🔲 PDF Layout Loss

  • Two-column layouts interleave lines
  • Sidebars and footnotes appear out of order
  • Headings separate from content

📊 Table Destruction

  • Headers unlinked from data rows
  • Cell values scrambled by column spanning
  • Merged cells create orphan values

🖼️ Image Blindness

  • Text in images never extracted
  • Charts and diagrams treated as void
  • OCR errors corrupt extracted text

🏷️ Semantic Loss

  • No distinction between title and body
  • List structure becomes flat text
  • Code blocks indistinguishable from prose
💡 Why parsing matters: Unstructured.io isn't magic, but it addresses these failure modes systematically. Parsing quality sets the ceiling for RAG quality: garbage parsing ruins even the best retriever.
02 — Semantics

Unstructured Element Types

Unstructured parses documents into semantic elements — not just raw text. Each element has a type, text content, and metadata (font size, position, page number). Understanding element types lets you build smart chunking and retrieval strategies.

Element Type  | Description                   | Example                         | RAG Use
Title         | Top-level document heading    | "Annual Report 2024"            | Add to chunk metadata; high relevance for matching
Heading       | Section header (h2, h3, etc.) | "Financial Performance"         | Mark section boundaries; use for hierarchy
NarrativeText | Body paragraph                | "Revenue grew by 15%..."        | Main content; embed and retrieve
ListItem      | Bulleted or numbered item     | "• Focus on customer retention" | Keep together with siblings; preserve order
Table         | Structured data grid          | CSV-like rows and columns       | Convert to markdown; chunk per row or keep whole table
Image         | Raster or vector graphic      | Embedded PNG or JPG             | OCR for text; optional caption extraction
CodeBlock     | Source code snippet           | Python, SQL, JavaScript         | Preserve indentation; don't split; embed as-is
PageBreak     | Explicit page boundary        | (metadata only)                 | Track document location; include in chunk metadata
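As a sketch of how these types can drive downstream handling, a minimal dispatch table might look like this (the policy names are ours, purely illustrative):

```python
# Map element categories to a downstream handling policy (illustrative).
POLICY = {
    "Title": "attach_as_metadata",
    "Heading": "mark_section_boundary",
    "NarrativeText": "embed",
    "ListItem": "group_with_siblings",
    "Table": "convert_to_markdown",
    "Image": "run_ocr",
    "CodeBlock": "embed_whole",
    "PageBreak": "attach_as_metadata",
}

def handling_for(category: str) -> str:
    """Unknown categories fall back to plain embedding."""
    return POLICY.get(category, "embed")

print(handling_for("Table"))          # → convert_to_markdown
print(handling_for("FigureCaption"))  # → embed
```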

Accessing Element Metadata

Each element object carries rich metadata: text, type, element_id, page_number, bbox (bounding box), and language. Use this to build context-aware chunks — for example, always include the parent heading with narrative text, or keep a code block with its preceding comment.

# Elements carry type, text, and rich metadata
from unstructured.documents.elements import (
    ElementMetadata,
    NarrativeText,
    Title,
)

title = Title(text="Quarterly Report")
para = NarrativeText(
    text="Revenue: $5.2M",
    metadata=ElementMetadata(page_number=3, languages=["en"]),
)

# Use the element type to build smarter chunking
for elem in (title, para):
    if isinstance(elem, Title):
        chunk_importance = "high"   # weight titles more heavily in embeddings
    else:
        chunk_importance = "normal"
⚠️ Element types are not perfect: Classification is heuristic-based. Some headings might be labeled as NarrativeText. Always validate with sampled documents, especially on novel formats.
03 — Core API

Partition Functions

Unstructured provides partition functions for each document type. Each function reads a file, applies format-specific parsing, and returns a list of element objects. The API is consistent: every partition_* function takes a file path and optional parameters and returns a List[Element].

Main Partition Functions

1. partition_pdf() — PDF & scanned docs

Extracts text, layout, and tables from PDFs. Handles both digital PDFs (with embedded text) and scanned PDFs (with images). Two strategies: fast extracts text directly; hi_res uses layout analysis for better structure.

  • strategy="fast": Quick, suitable for clean digital PDFs
  • strategy="hi_res": Slower, handles multi-column and complex layouts
  • infer_table_structure=True: Extract table cell relationships
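With infer_table_structure=True, Table elements expose an HTML rendering of the table via metadata.text_as_html. A stdlib-only sketch for turning that HTML into markdown (the converter is ours, not part of Unstructured):

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Collect rows and cells from a simple HTML <table>."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_data(self, data):
        self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.row.append(" ".join(filter(None, self.cell)))
        elif tag == "tr":
            self.rows.append(self.row)

def table_html_to_markdown(html_str: str) -> str:
    """Render the first-row-as-header table as a markdown pipe table."""
    parser = TableToMarkdown()
    parser.feed(html_str)
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

html = "<table><tr><th>Region</th><th>Q1</th></tr><tr><td>EMEA</td><td>1.2</td></tr></table>"
print(table_html_to_markdown(html))
```

Markdown renderings of tables tend to embed and retrieve far better than flattened cell text.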
2. partition_html() — Web pages

Parses HTML into semantic elements. Strips markup, identifies headings and paragraphs, extracts tables from <table> tags. Optional include/exclude selectors to keep or ignore specific DOM subtrees.

  • include_metadata=True: Preserve CSS classes and IDs
  • skip_headers_footers=True: Skip nav/footer regions
  • Dynamically loaded content must be rendered first (e.g., with Playwright) before partitioning
3. partition_docx() — Microsoft Word

Reads DOCX (Office Open XML) directly. Preserves styles, heading levels, table structure, and embedded images. DOCX is structured, so this is one of the most reliable partitioners.

  • Respects document outline and heading hierarchy
  • Extracts image captions and alt-text
  • Preserves table cell relationships
4. partition_image() — OCR

Extracts text from image files (PNG, JPG, etc.) using Tesseract or cloud OCR. Optional layout analysis. For scanned PDFs, this is called internally.

  • strategy="hi_res": Layout-aware OCR
  • ocr_languages=["en", "fr"]: Multi-language support
  • Integrates with cloud OCR (Azure, Google) for better quality

Partition Example

from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html
from unstructured.partition.docx import partition_docx

# Fast extraction
elements = partition_pdf("report.pdf", strategy="fast")

# High-resolution layout analysis
elements = partition_pdf(
    "complex_layout.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

# HTML parsing
elements = partition_html("article.html", skip_headers_footers=True)

# DOCX with styles preserved
elements = partition_docx("document.docx")

# Print element categories
for elem in elements[:10]:
    print(f"{elem.category:15s} | {elem.text[:60]}")

Strategy Comparison: Fast vs Hi-Res

Fast: Extracts text stream. Best for single-column documents with clean formatting. ~100ms per page.
Hi-Res: Uses layout detection (Detectron2) to understand columns, footers, sidebars. Significantly slower (~5-10s per page) but captures structure. Choose hi_res for multi-column PDFs, magazine layouts, or documents with sidebars.

💡 Start with fast, measure quality: Test both strategies on your documents. Fast often works well and is 50-100× faster. Only upgrade to hi_res if quality assessment shows clear gains.
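That rule of thumb can be written down as a tiny helper (the function name and inputs are ours; detecting the inputs, e.g. whether a PDF has an embedded text layer, is left to your pipeline):

```python
def choose_pdf_strategy(has_text_layer: bool, complex_layout: bool) -> str:
    """Start with "fast"; escalate to "hi_res" only when the document
    demands layout analysis (scans, multi-column, sidebars)."""
    if not has_text_layer:
        return "hi_res"  # scanned pages need OCR plus layout detection
    if complex_layout:
        return "hi_res"  # multi-column or magazine-style layouts
    return "fast"

print(choose_pdf_strategy(has_text_layer=True, complex_layout=False))   # → fast
print(choose_pdf_strategy(has_text_layer=False, complex_layout=False))  # → hi_res
```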
04 — Token Sizing

Chunking with Unstructured

Parsing is step one; chunking is step two. Raw elements are too granular (each paragraph is one element). Chunking combines elements into fixed-size, semantically coherent chunks suitable for embeddings and retrieval.

Built-in Chunking Functions

chunk_by_title() groups elements under each heading, respecting heading hierarchy. chunk_elements() combines elements by character count until reaching max_characters. Both preserve element metadata.

from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# `elements` comes from a prior partition_* call

# Chunk by heading hierarchy
chunks = chunk_by_title(elements, combine_text_under_n_chars=500)
# Result: each chunk is the content under one heading

# Chunk by character budget
chunks = chunk_elements(elements, max_characters=512, new_after_n_chars=400)
# Result: combine elements up to 512 chars; prefer a new chunk after 400

# Access chunk text and metadata
for chunk in chunks[:3]:
    print(f"Words: {len(chunk.text.split())}")
    print(f"Category: {chunk.category}")
    print(f"Text: {chunk.text[:100]}...")
    print()

Chunking Best Practices

✂️ Size Matters

  • 512 chars ≈ 100–130 tokens (English text)
  • For sparse retrieval: 256–512 chars
  • For dense (vector): 256–1024 chars
  • Adjust based on embedding context window
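Character counts map to tokens only approximately. A quick sanity-check helper (the 4-characters-per-token ratio is a rough English-text assumption, not a tokenizer):

```python
def estimated_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; use a real tokenizer for exact budgets."""
    return round(n_chars / chars_per_token)

print(estimated_tokens(512))   # → 128
print(estimated_tokens(1024))  # → 256
```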

🔀 Semantic Boundaries

  • Chunk by heading when possible
  • Don't split tables or code blocks
  • Include context (parent heading) in metadata
  • Preserve list structure

📦 Overlapping Windows

  • Use 10–20% overlap for continuity
  • Helps bridge context between chunks
  • Trade-off: more embeddings, better recall

🏷️ Rich Metadata

  • Add document title, file path
  • Track page numbers, headings
  • Tag sensitive data (PII, confidential)
  • Include chunk index for reranking
⚠️ Beware chunk explosion: With overlap, a 100-page document can generate 10,000+ chunks. Embedding costs balloon. Set aggressive max_characters and test carefully.
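To see how overlap inflates chunk counts (and with them embedding cost), here is a quick sliding-window estimate (the page and character figures are illustrative):

```python
import math

def chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Chunks produced by a sliding window with the given overlap."""
    stride = chunk_size - overlap
    return max(1, math.ceil((total_chars - overlap) / stride))

# 100 pages at ~3,000 characters per page
total = 100 * 3_000
print(chunk_count(total, 512, 0))    # no overlap
print(chunk_count(total, 512, 102))  # ~20% overlap: roughly 25% more chunks
```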
05 — Deployment

Cloud API vs Local Processing

Unstructured.io offers two deployment modes: the Cloud API (managed service) and the Local SDK (self-hosted). Cloud is simpler but sends documents off-premise; Local gives full control but requires your own infrastructure.

Aspect        | Cloud API                              | Local SDK
Setup         | API key, HTTP requests                 | pip install; GPU optional
Latency       | 100–500 ms per document                | 10–100 ms per document (with GPU)
Cost          | $0.001–0.01 per page                   | One-time model download; GPU rental if scaling
Privacy       | Documents sent to Unstructured servers | Fully on-premise
OCR Quality   | Cloud OCR (Azure, Google)              | Tesseract (lower quality)
Scaling       | Unstructured manages scaling           | You manage queue, workers, GPU
Model Updates | Automatic                              | Manual (pip upgrade)

Cloud API Example

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="your-api-key")

# Upload the file to the hosted API for parsing
with open("document.pdf", "rb") as f:
    req = operations.PartitionRequest(
        partition_parameters=shared.PartitionParameters(
            files=shared.Files(content=f.read(), file_name="document.pdf"),
            strategy=shared.Strategy.HI_RES,
        )
    )

res = client.general.partition(request=req)

# Response elements are plain dicts
for elem in res.elements:
    print(f"{elem['type']}: {elem['text'][:80]}")

Local SDK Example

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# All processing happens locally
elements = partition_pdf("document.pdf", strategy="hi_res")

# Chunking is local too
chunks = chunk_by_title(elements, combine_text_under_n_chars=500)

# Hand off to your vector database
# (embedder and store_in_db are placeholders for your own code)
for chunk in chunks:
    embedding = embedder.embed(chunk.text)
    store_in_db(chunk.text, embedding, metadata=chunk.metadata.to_dict())
💡 Hybrid approach: For very large pipelines, use Cloud API for heavy lifting (tables, scanned PDFs), fall back to Local SDK for simple docs. Or batch cloud requests during off-peak hours to reduce costs.
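One way to sketch that routing decision (the criteria and function names are ours, purely illustrative):

```python
def route_document(is_scanned: bool, has_complex_tables: bool) -> str:
    """Send hard documents to the Cloud API; keep easy ones local."""
    if is_scanned or has_complex_tables:
        return "cloud"   # better OCR and table models
    return "local"       # cheap, private, fast enough

print(route_document(is_scanned=False, has_complex_tables=False))  # → local
print(route_document(is_scanned=True, has_complex_tables=False))   # → cloud
```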
06 — Ecosystem

Integration with LangChain & LlamaIndex

Unstructured.io integrates with the major LLM frameworks: LangChain provides UnstructuredLoader (via the langchain-unstructured package) and LlamaIndex provides UnstructuredReader. Both handle partitioning and chunking transparently.

LangChain Integration

from langchain_unstructured import UnstructuredLoader
from langchain_community.vectorstores import FAISS

# Load and partition in one step; each element becomes a Document
loader = UnstructuredLoader(
    file_path="document.pdf",
    strategy="hi_res",
)
documents = loader.load()

# Element type and page number arrive in Document.metadata
for doc in documents[:5]:
    print(doc.metadata.get("category"), doc.metadata.get("page_number"))

# Chain into embeddings + a vector store (embedder is your embedding model)
vector_store = FAISS.from_documents(documents, embedder)

LlamaIndex Integration

from pathlib import Path

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.file import UnstructuredReader

# Load with the Unstructured backend
reader = UnstructuredReader()
documents = reader.load_data(file=Path("document.pdf"))

# Or route PDFs through Unstructured inside SimpleDirectoryReader
loader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": UnstructuredReader()},
)
documents = loader.load_data()

# Build the index
index = VectorStoreIndex.from_documents(documents)

Custom Pipeline

For full control, partition manually, chunk, embed, and store:

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
from pinecone import Pinecone

# Auto-detect file type
elements = partition("document.pdf")

# Custom chunking
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=500,
    max_characters=1000,
)

# Embed and store (embed_model and the index name are your own setup)
pc = Pinecone(api_key="your-api-key")
pinecone_index = pc.Index("docs")

for i, chunk in enumerate(chunks):
    embedding = embed_model.encode(chunk.text)
    pinecone_index.upsert([(
        f"doc__{i}",
        embedding.tolist(),
        {
            "text": chunk.text,
            "category": chunk.category,
            "document": "document.pdf",
        },
    )])

print(f"Indexed {len(chunks)} chunks")
Tools & Ecosystem

Related Tools

  • Unstructured.io (Parsing): Multi-format document parser with cloud and local options
  • Docling (PDF): IBM's structured document converter for PDF, DOCX, PPTX
  • LlamaParse (PDF): LlamaIndex's document parser optimized for tables and complex layouts
  • PyMuPDF (PDF): Fast low-level PDF manipulation and text extraction
  • pdfplumber (PDF): Precise table and text extraction from PDFs
  • Camelot (Tables): Specialized table extraction from PDFs
  • Tesseract (OCR): Open-source OCR engine for scanned documents
  • LangChain (Framework): LLM framework with Unstructured loader integration
  • LlamaIndex (Framework): Data indexing framework with UnstructuredReader
07 — Further Reading

References
