01 — The Problem
Why Document Parsing Is Hard
Most documents are messy. PDFs embed text without structure — you get character positions and font info, but no indication of which runs of text are titles, tables, or paragraphs. HTML mixes markup with content. DOCX files are zipped XML. Images contain text that needs OCR. And tables are notoriously difficult: cell detection, header inference, and value alignment all fail silently on complex layouts.
Naive extraction (splitting on whitespace, regex-based chunking) fails catastrophically: tables get flattened into gibberish, multi-column layouts become interleaved nonsense, and semantic boundaries disappear. When you feed that garbage to an LLM, retrieval quality collapses.
Common Failure Modes
🔲 PDF Layout Loss
- Two-column layouts interleave lines
- Sidebars and footnotes appear out of order
- Headings separate from content
📊 Table Destruction
- Headers unlinked from data rows
- Cell values scrambled by column spanning
- Merged cells create orphan values
🖼️ Image Blindness
- Text in images never extracted
- Charts and diagrams treated as void
- OCR errors corrupt extracted text
🏷️ Semantic Loss
- No distinction between title and body
- List structure becomes flat text
- Code blocks indistinguishable from prose
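The two-column failure mode is easy to reproduce. A toy sketch (pure Python, simulated page text — the column width and content are invented for illustration):

```python
# Simulated page: two columns rendered side by side, exactly as a naive
# line-by-line text extractor would see them.
page_lines = [
    "Revenue grew 15%     The board approved",
    "in Q3, driven by     a new buyback plan",
    "enterprise sales.    starting in Q4.",
]

# Naive extraction: join lines in visual order -> the columns interleave.
naive = " ".join(page_lines)

# Layout-aware extraction: split each line at the column gutter first.
GUTTER = 21
left = " ".join(line[:GUTTER].strip() for line in page_lines)
right = " ".join(line[GUTTER:].strip() for line in page_lines)

print(naive)   # interleaved nonsense
print(left)    # "Revenue grew 15% in Q3, driven by enterprise sales."
print(right)   # "The board approved a new buyback plan starting in Q4."
```

Real layout analysis has to detect the gutter position itself, per page — which is exactly what the hi_res strategies below do.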
💡
The parsing cost: Unstructured.io isn't magic, but it solves these systematically. Document understanding is 70% of RAG quality. Garbage parsing ruins even the best retriever.
02 — Semantics
Unstructured Element Types
Unstructured parses documents into semantic elements — not just raw text. Each element has a type, text content, and metadata (font size, position, page number). Understanding element types lets you build smart chunking and retrieval strategies.
| Element Type | Description | Example | RAG Use |
| --- | --- | --- | --- |
| Title | Top-level document heading | "Annual Report 2024" | Add to chunk metadata; high relevance for matching |
| Heading | Section header (h2, h3, etc.) | "Financial Performance" | Mark section boundaries; use for hierarchy |
| NarrativeText | Body paragraph | "Revenue grew by 15%..." | Main content; embed and retrieve |
| ListItem | Bulleted or numbered item | "• Focus on customer retention" | Keep together with siblings; preserve order |
| Table | Structured data grid | CSV-like rows and columns | Convert to markdown; chunk per row or preserve whole table |
| Image | Raster or vector graphic | Embedded PNG, JPG | OCR for text; optional caption extraction |
| CodeBlock | Source code snippet | Python, SQL, JavaScript | Preserve indentation; don't split; embed as-is |
| PageBreak | Explicit page boundary | (metadata only) | Track document location; include in chunk metadata |
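The "RAG Use" column maps naturally to a per-type dispatch. A minimal sketch, using plain (category, text) tuples to stand in for real Unstructured elements; the rendering choices are illustrative, not prescribed by the library:

```python
# Route each element to a handling strategy by its type.
def handle(category: str, text: str) -> str:
    if category == "Table":
        return f"[table->markdown]\n{text}"   # convert, keep whole
    if category in ("Title", "Heading"):
        return f"# {text}"                    # mark section boundary
    if category == "ListItem":
        return f"- {text}"                    # preserve list structure
    return text                               # NarrativeText and friends

elements = [
    ("Title", "Q3 Results"),
    ("NarrativeText", "Revenue grew by 15%..."),
    ("ListItem", "Focus on customer retention"),
]
doc = "\n".join(handle(c, t) for c, t in elements)
print(doc)
```

The same dispatch shape works whether elements come from partition_pdf, partition_html, or partition_docx, since all of them emit the same element vocabulary.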
Accessing Element Metadata
Each element object carries rich metadata: text, type, element_id, page_number, bbox (bounding box), and language. Use this to build context-aware chunks — for example, always include the parent heading with narrative text, or keep a code block with its preceding comment.
# Elements carry rich metadata
from unstructured.documents.elements import ElementMetadata, NarrativeText, Title

title = Title(text="Quarterly Report")
# Note: section headings are typically detected as Title elements too,
# with the nesting level in metadata.category_depth.
para = NarrativeText(
    text="Revenue: $5.2M",
    metadata=ElementMetadata(page_number=3, filename="report.pdf"),
)

# Use the element type to build smarter chunking
for element in (title, para):
    if isinstance(element, Title):
        chunk_importance = "high"    # weight titles in retrieval
    else:
        chunk_importance = "normal"
    print(element.category, element.metadata.page_number, chunk_importance)
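The "include the parent heading" idea reduces to a single pass that tracks the most recent heading. A minimal sketch, using plain dicts to stand in for Unstructured elements (real ones expose .category, .text, and .metadata):

```python
# Attach the nearest preceding heading to each text element, so chunks
# carry their section context into retrieval.
def attach_headings(elements):
    current = None
    out = []
    for el in elements:
        if el["category"] in ("Title", "Heading"):
            current = el["text"]
        else:
            out.append({**el, "parent_heading": current})
    return out

elements = [
    {"category": "Title", "text": "Q3 Results"},
    {"category": "NarrativeText", "text": "Revenue: $5.2M"},
]
print(attach_headings(elements))
# [{'category': 'NarrativeText', 'text': 'Revenue: $5.2M', 'parent_heading': 'Q3 Results'}]
```

Storing parent_heading in chunk metadata lets the retriever match on "Q3 Results" even when the paragraph itself never repeats the phrase.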
⚠️
Element types are not perfect: Classification is heuristic-based. Some headings might be labeled as NarrativeText. Always validate with sampled documents, especially on novel formats.
03 — Core API
Partition Functions
Unstructured provides partition functions for each document type. Each function reads a file, applies format-specific parsing, and returns a list of element objects. The API is consistent: partition_* takes a file path and optional parameters, returns List[Element].
Main Partition Functions
1
partition_pdf() — PDF & scanned docs
Extracts text, layout, and tables from PDFs. Handles both digital PDFs (with embedded text) and scanned PDFs (with images). Two strategies: fast extracts text directly; hi_res uses layout analysis for better structure.
strategy="fast": Quick, suitable for clean digital PDFs
strategy="hi_res": Slower, handles multi-column and complex layouts
infer_table_structure=True: Extract table cell relationships
2
partition_html() — Web pages
Parses HTML into semantic elements. Strips markup, identifies headings and paragraphs, extracts tables from <table> tags. Optional include/exclude selectors to keep or ignore specific DOM subtrees.
include_metadata=True: Attach element metadata (page info, link URLs)
skip_headers_and_footers=True: Skip nav/footer regions
- Parses static HTML only; render JS-heavy pages first (e.g., with Playwright) and pass the result
3
partition_docx() — Microsoft Word
Reads DOCX (Office Open XML) directly. Preserves styles, heading levels, table structure, and embedded images. DOCX is structured, so this is one of the most reliable partitioners.
- Respects document outline and heading hierarchy
- Extracts image captions and alt-text
- Preserves table cell relationships
4
partition_image() — OCR
Extracts text from image files (PNG, JPG, etc.) using Tesseract or cloud OCR. Optional layout analysis. For scanned PDFs, this is called internally.
strategy="hi_res": Layout-aware OCR
ocr_languages=["en", "fr"]: Multi-language support
- Integrates with cloud OCR (Azure, Google) for better quality
Partition Example
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.html import partition_html
from unstructured.partition.docx import partition_docx
# Fast extraction
elements = partition_pdf("report.pdf", strategy="fast")
# High-resolution layout analysis
elements = partition_pdf(
    "complex_layout.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)
# HTML parsing
elements = partition_html("article.html", skip_headers_and_footers=True)
# DOCX with styles preserved
elements = partition_docx("document.docx")
# Print element types
for elem in elements[:10]:
    print(f"{elem.category:15s} | {elem.text[:60]}")
Strategy Comparison: Fast vs Hi-Res
Fast: Extracts text stream. Best for single-column documents with clean formatting. ~100ms per page.
Hi-Res: Uses layout detection (Detectron2) to understand columns, footers, sidebars. Significantly slower (~5-10s per page) but captures structure. Choose hi_res for multi-column PDFs, magazine layouts, or documents with sidebars.
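The per-page figures above translate into very different batch times. A back-of-envelope calculation using those quoted numbers (your hardware will differ):

```python
# Batch latency at the per-page figures quoted above:
# fast ~0.1 s/page, hi_res ~5-10 s/page (midpoint used).
pages = 1000
fast_s = pages * 0.1
hi_res_s = pages * 7.5

print(f"fast:   {fast_s / 60:.1f} min")    # 1.7 min
print(f"hi_res: {hi_res_s / 3600:.1f} h")  # 2.1 h
```

At a thousand pages the gap is minutes versus hours, which is why the tip below recommends starting with fast and upgrading only where quality demands it.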
💡
Start with fast, measure quality: Test both strategies on your documents. Fast often works well and is 50-100× faster. Only upgrade to hi_res if quality assessment shows clear gains.
04 — Token Sizing
Chunking with Unstructured
Parsing is step one; chunking is step two. Raw elements are too granular (each paragraph is one element). Chunking combines elements into fixed-size, semantically coherent chunks suitable for embeddings and retrieval.
Built-in Chunking Functions
chunk_by_title() groups elements under each heading, respecting heading hierarchy. chunk_elements() combines elements until a character budget (max_characters) is reached. Both preserve element type and metadata.
from unstructured.chunking.title import chunk_by_title
from unstructured.chunking.basic import chunk_elements

# Chunk by heading hierarchy
chunks = chunk_by_title(elements, combine_text_under_n_chars=500)
# Result: each chunk holds the content under one heading

# Chunk by character budget
chunks = chunk_elements(elements, max_characters=512, new_after_n_chars=400)
# Result: elements combine up to 512 chars, with a soft break after 400

# Access chunk text and metadata
for chunk in chunks[:3]:
    print(f"Words: {len(chunk.text.split())}")
    print(f"Page:  {chunk.metadata.page_number}")
    print(f"Text:  {chunk.text[:100]}...")
    print()
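The max_characters / new_after_n_chars semantics can be sketched in a few lines. This is an illustration of the packing logic, not the library's actual implementation:

```python
# Append element texts to the current chunk until max_characters would be
# exceeded, preferring a break once the chunk passes new_after_n_chars.
def pack(texts, max_characters=512, new_after_n_chars=400):
    chunks, current = [], ""
    for text in texts:
        candidate = (current + "\n\n" + text).strip()
        if current and (len(candidate) > max_characters
                        or len(current) >= new_after_n_chars):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

chunks = pack(["a" * 300, "b" * 300, "c" * 50])
print([len(c) for c in chunks])  # [300, 352]
```

Note how the second 300-char element starts a new chunk (it would overflow the 512 budget), while the 50-char tail still fits alongside it.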
Chunking Best Practices
🔀 Semantic Boundaries
- Chunk by heading when possible
- Don't split tables or code blocks
- Include context (parent heading) in metadata
- Preserve list structure
📦 Overlapping Windows
- Use 10–20% overlap for continuity
- Helps bridge context between chunks
- Trade-off: more embeddings, better recall
🏷️ Rich Metadata
- Add document title, file path
- Track page numbers, headings
- Tag sensitive data (PII, confidential)
- Include chunk index for reranking
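The overlapping-window idea is a fixed-size slice with a stride shorter than the window. A minimal sketch over a token list, with a 15-token overlap (the 15% figure from the bullet above):

```python
# Fixed-size token windows with overlap, so adjacent chunks share context.
def windows(tokens, size=100, overlap=15):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = list(range(250))
w = windows(tokens)
print(len(w), [(x[0], x[-1]) for x in w])
# 3 [(0, 99), (85, 184), (170, 249)]
```

Each neighbouring pair shares 15 tokens, which is what bridges a sentence split across a chunk boundary — at the cost of embedding those tokens twice.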
⚠️
Beware chunk explosion: with overlap and small chunk sizes, a long document multiplies into thousands of chunks and embedding costs balloon. Pick max_characters deliberately and measure chunk counts before indexing at scale.
05 — Deployment
Cloud API vs Local Processing
Unstructured.io offers two deployment modes: Cloud API (managed service) and Local SDK (self-hosted). Cloud is simpler but sends documents off-premise; Local gives full control but requires your own infrastructure.
| Aspect | Cloud API | Local SDK |
| --- | --- | --- |
| Setup | API key, HTTP requests | pip install, GPU optional |
| Latency | 100–500 ms per document | 10–100 ms per document (with GPU) |
| Cost | $0.001–0.01 per page | One-time model download; GPU rental if scaling |
| Privacy | Documents sent to Unstructured servers | Fully on-premise |
| OCR Quality | Cloud OCR (Azure, Google) | Tesseract (lower quality) |
| Scaling | Unstructured manages scaling | You manage queue, workers, GPU |
| Model Updates | Automatic | Manual (pip upgrade) |
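The cost row invites a break-even estimate. A back-of-envelope sketch: the per-page price comes from the table, while the GPU rental rate and local throughput are assumptions you should replace with your own numbers:

```python
# Break-even: cloud per-page pricing vs. an assumed GPU rental rate.
cloud_per_page = 0.005     # $/page, midpoint of the $0.001-0.01 range
gpu_per_hour = 1.00        # $/h rental (assumption)
pages_per_hour = 500       # local hi_res throughput (assumption)

local_per_page = gpu_per_hour / pages_per_hour
print(f"cloud: ${cloud_per_page:.4f}/page, local: ${local_per_page:.4f}/page")
```

Under these assumed numbers, local processing wins on marginal cost once volume justifies the setup effort; at low volume the cloud's zero fixed cost usually dominates.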
Cloud API Example
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
client = UnstructuredClient(api_key_auth="your-api-key")

# Upload the file and partition it in the cloud
# (the exact request shape varies across unstructured-client versions)
with open("document.pdf", "rb") as f:
    req = shared.PartitionParameters(
        files=shared.Files(content=f.read(), file_name="document.pdf"),
        strategy="hi_res",
    )
resp = client.general.partition(req)

# The API returns parsed elements as plain dicts
for elem in resp.elements:
    print(f"{elem['type']}: {elem['text'][:80]}")
Local SDK Example
from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title
# All processing happens locally
elements = partition_pdf(
    "document.pdf",
    strategy="hi_res",
)
# Chunking also runs locally
chunks = chunk_by_title(elements, combine_text_under_n_chars=500)
# Hand off to your vector database (embedder / store_in_db are your own code)
for chunk in chunks:
    embedding = embedder.embed(chunk.text)
    store_in_db(chunk.text, embedding, metadata=chunk.metadata)
💡
Hybrid approach: For very large pipelines, use Cloud API for heavy lifting (tables, scanned PDFs), fall back to Local SDK for simple docs. Or batch cloud requests during off-peak hours to reduce costs.
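The hybrid routing idea reduces to a small predicate over document properties. A sketch of that decision, where the property names and the table-count threshold are assumptions, not library features:

```python
# Send hard documents (scanned or table-heavy) to the cloud API,
# keep simple ones on the local SDK.
def route(doc):
    if doc.get("scanned") or doc.get("table_count", 0) > 5:
        return "cloud"
    return "local"

docs = [
    {"name": "invoice_scan.pdf", "scanned": True},
    {"name": "notes.docx", "table_count": 0},
]
print({d["name"]: route(d) for d in docs})
# {'invoice_scan.pdf': 'cloud', 'notes.docx': 'local'}
```

In practice the "scanned" flag can come from a cheap probe (e.g., whether partition_pdf with strategy="fast" returns any text at all).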
06 — Ecosystem
Integration with LangChain & LlamaIndex
Unstructured.io integrates with the major LLM frameworks: LangChain via UnstructuredLoader (the langchain-unstructured package) and LlamaIndex via UnstructuredReader. Both wrap partitioning behind their standard document-loading interfaces.
LangChain Integration
from langchain_unstructured import UnstructuredLoader  # pip install langchain-unstructured
from langchain_community.vectorstores import FAISS

# Load the file; each parsed element becomes one LangChain Document
loader = UnstructuredLoader(
    file_path="document.pdf",
    strategy="hi_res",
)
documents = loader.load()

# Element type and location arrive in Document.metadata
for doc in documents[:5]:
    print(doc.metadata.get("category"), doc.metadata.get("page_number"))

# Chain to embeddings + vector store (embedder is your embeddings object)
vector_store = FAISS.from_documents(documents, embedder)
LlamaIndex Integration
from pathlib import Path

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.file import UnstructuredReader  # pip install llama-index-readers-file

# Load with the Unstructured backend
reader = UnstructuredReader()
documents = reader.load_data(file=Path("document.pdf"))

# Or route all PDFs in a directory through Unstructured
loader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": UnstructuredReader()},
)
documents = loader.load_data()

# Build the index
index = VectorStoreIndex.from_documents(documents)
Custom Pipeline
For full control, partition manually, chunk, embed, and store:
from pinecone import Pinecone
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Auto-detect file type
elements = partition("document.pdf")

# Custom chunking
chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=500,
    max_characters=1000,
)

# Embed and store (embed_model is your own embedding model)
pc = Pinecone(api_key="your-api-key")
pinecone_index = pc.Index("docs")
for i, chunk in enumerate(chunks):
    embedding = embed_model.encode(chunk.text)
    pinecone_index.upsert([(
        f"doc__{i}",
        embedding,
        {
            "text": chunk.text,
            "type": chunk.category,
            "document": "document.pdf",
        },
    )])
print(f"Indexed {len(chunks)} chunks")