Intelligent parsing, extraction, and transformation of PDFs, HTML, tables, and scanned documents for RAG and document intelligence applications.
PDFs are not designed for machine consumption: they encode layout, not structure. A PDF records visual primitives (text runs, fonts, positioning, bounding boxes) but does not explicitly mark paragraphs, sections, tables, or semantic boundaries. Extracting meaning requires heuristics, OCR, and layout analysis.
| Challenge | What goes wrong | Why it matters |
|---|---|---|
| Multi-column layouts | Extractors read left-to-right across columns, scrambling reading order | Chunking breaks mid-sentence; context is lost |
| Tables | Naive PDF extraction treats cells as separate chunks, loses row/col relationships | RAG can't reason about table structure |
| Scanned PDFs | No embedded text — need OCR (Tesseract, PaddleOCR) | OCR is slow and error-prone |
| Headers/footers | Repeated on every page; pollutes chunks | Redundant context confuses LLM |
| Images and diagrams | Text extraction ignores them; important context lost | Missing visual information |
| Special formatting | Bold, italics, font size hint structure but are lost | Semantic hierarchy is flattened |
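One common fix for the header/footer problem in the table above: collect the lines on each page and drop any line that repeats across most pages. A minimal sketch; the 60% threshold is an arbitrary assumption, and page numbers survive because they differ page to page:

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove lines that repeat across most pages (likely headers/footers).

    pages: list of per-page text strings.
    threshold: fraction of pages a line must appear on to count as boilerplate.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        counts.update(set(page.splitlines()))
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff and line.strip()}
    return [
        "\n".join(ln for ln in page.splitlines() if ln not in boilerplate)
        for page in pages
    ]

pages = [
    "ACME Corp Confidential\nIntro text\nPage 1",
    "ACME Corp Confidential\nMore text\nPage 2",
    "ACME Corp Confidential\nFinal text\nPage 3",
]
print(strip_repeated_lines(pages)[0])  # header gone; "Page 1" stays (appears once)
```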
The ecosystem spans simple libraries (PyMuPDF) to sophisticated ML-based parsers (Docling). Choice depends on document diversity, latency budget, and accuracy targets.
| Tool | Speed | Table handling | Layout awareness | Cost | Best for |
|---|---|---|---|---|---|
| Docling | Slow | Excellent | Excellent (ML layout models) | Open | Complex PDFs, precision-critical |
| Unstructured | Medium | Good | Good (heuristic) | Open + Managed API | Diverse doc types, no setup |
| PyMuPDF | Very fast | Poor | None | Open | Simple PDFs, low latency |
| pdfplumber | Fast | Good | Poor | Open | Table-heavy documents |
| Marker | Fast | Good | Good (CV) | Open | Balance speed and quality |
Different document types need different approaches. A research paper has sections and citations. A manual has steps and warnings. A legal document has articles and clauses. Parse accordingly.
- Digital PDFs: use Docling or Marker to infer document structure (title, abstract, sections) and preserve the hierarchy in markdown.
- Scanned PDFs: first check whether text is embedded. If not, apply PaddleOCR or Tesseract; accept lower accuracy and filter low-confidence tokens.
- HTML: parse the document structure and remove boilerplate (nav, ads, footer). Use BeautifulSoup or trafilatura for cleanup.
- Word documents: use python-docx or Unstructured to parse the document tree. Preserve formatting; extract text plus metadata.
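The per-type routing can be sketched as a dispatch table keyed on file extension. The parser functions below are placeholders standing in for Docling, PaddleOCR, trafilatura, python-docx, and the rest:

```python
from pathlib import Path

# Hypothetical parser callables; in a real pipeline these would wrap
# Docling/Marker, an OCR engine, trafilatura, python-docx, etc.
def parse_pdf(path):  return f"pdf-parsed:{path.name}"
def parse_html(path): return f"html-parsed:{path.name}"
def parse_docx(path): return f"docx-parsed:{path.name}"

PARSERS = {
    ".pdf": parse_pdf,
    ".html": parse_html,
    ".htm": parse_html,
    ".docx": parse_docx,
}

def parse_document(path):
    """Route a file to the parser registered for its extension."""
    path = Path(path)
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"No parser registered for {path.suffix!r}")
    return parser(path)

print(parse_document("report.pdf"))  # pdf-parsed:report.pdf
```

In practice the dispatch key is often richer than the extension, e.g. a PDF is further routed to OCR if no embedded text is found.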
Tables and diagrams carry semantic information that naive text extraction loses. Strategies: preserve table structure in markdown, embed images separately, or use multi-modal LLMs that understand images.
Naive extraction: "cell1 cell2 cell3..." → loses row/column relationships. Better: output markdown tables or CSV. Best: preserve table as structured data (rows, columns, metadata). Tools: pdfplumber has table.extract(), Docling outputs tables in markdown, Unstructured returns Table elements.
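A minimal sketch of the "better" option: rendering rows as returned by an extractor (pdfplumber's `table.extract()` yields a list of row lists, with `None` for empty cells) as a markdown table. The sample rows are invented:

```python
def rows_to_markdown(rows):
    """Render extracted table rows (list of lists) as a markdown table.

    Assumes the first row is the header row.
    """
    header, *body = rows
    lines = [
        "| " + " | ".join(str(c) for c in header) + " |",
        "|" + "---|" * len(header),
    ]
    for row in body:
        # Empty cells come back as None from extractors like pdfplumber.
        lines.append("| " + " | ".join("" if c is None else str(c) for c in row) + " |")
    return "\n".join(lines)

rows = [["Region", "Q1", "Q2"], ["EMEA", "1.2", "1.4"], ["APAC", "0.9", None]]
print(rows_to_markdown(rows))
```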
- Approach 1 (naive): ignore images. Problem: charts, diagrams, and screenshots carry critical context.
- Approach 2 (save separately): extract images, store them by reference, and include the caption in the text.
- Approach 3 (multi-modal): use a vision LLM to generate captions or detailed descriptions of images, and include them in the chunks.
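Approach 2 can be sketched as a caption-substitution pass over the parsed text. The `[[image:id]]` placeholder syntax is an assumption of this sketch, not any library's convention:

```python
def inline_image_captions(text, captions):
    """Replace image placeholders with stored captions (approach 2).

    captions maps an image id (e.g. 'img-3') to its caption; the image
    files themselves are stored separately and referenced by id.
    """
    for image_id, caption in captions.items():
        placeholder = f"[[image:{image_id}]]"
        text = text.replace(placeholder, f"[Image {image_id}: {caption}]")
    return text

doc = "Revenue grew sharply. [[image:img-3]] See appendix."
print(inline_image_captions(doc, {"img-3": "Quarterly revenue bar chart, 2021-2024"}))
```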
Once parsed into clean markdown, chunk it. But naive chunking ignores the document structure you just extracted. Better: respect section boundaries, keep tables intact, avoid breaking mid-concept.
- Hierarchical: split on ## or ### headings, then merge sections until the chunk size limit. Respects the document outline.
- Semantic: embed paragraphs individually, use similarity to detect thematic breaks, and split there.
- Overlap: add the last N tokens of the previous chunk to the next. Bridges concept boundaries.
- Table-aware: never split a table mid-row. Keep tables with their surrounding paragraphs.
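The hierarchical strategy can be sketched in a few lines: split on ## / ### headings, then greedily merge whole sections up to a size limit, so chunk boundaries always fall on heading boundaries. Character counts stand in for tokens here:

```python
import re

def hierarchical_chunks(markdown, max_chars=400):
    """Split markdown on ##/### headings, merge sections up to max_chars."""
    # Lookahead split keeps each heading attached to the text that follows it.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{2,3} )", markdown) if s.strip()]
    chunks, current = [], ""
    for section in sections:
        candidate = (current + "\n\n" + section).strip() if current else section
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush: next section starts a new chunk
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

doc = "## Intro\nShort intro.\n## Methods\n" + "Details. " * 60 + "\n## Results\nBrief."
for c in hierarchical_chunks(doc):
    print(len(c), c.splitlines()[0])
```

Note that an oversized single section is kept whole in this sketch; a production version would fall back to paragraph-level splitting inside it.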
Each chunk needs metadata for filtering, tracking, and re-ranking. Metadata can be extracted (from document) or generated (via LLM).
- Structural: source_file, page_number, section_title, document_type
- Temporal: created_date, modified_date, retrieval_date
- Content: language, length_tokens, contains_table, contains_image
- Semantic: summary, keywords, entity_types (if extracted)
- Access: accessible_users, classification_level, owner
| Metadata field | How to extract | Use case |
|---|---|---|
| section_title | Parse markdown headers | Context in retrieval |
| page_number | From PDF; infer from layout | Citation, pagination |
| created_date | File metadata or text extraction | Recency ranking |
| contains_table | Detect table markers in markdown | Filter table-heavy docs |
| keywords | TF-IDF, NER, or LLM extraction | Sparse retrieval, discovery |
| summary | Abstractive summarization | Chunk re-ranking, preview |
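The cheap structural fields in the table can be derived with a few regexes over the markdown chunk; this sketch uses whitespace-split word count as a crude stand-in for a real tokenizer:

```python
import re

def extract_metadata(chunk, source_file, page_number):
    """Derive cheap structural metadata from a markdown chunk."""
    heading = re.search(r"(?m)^#{1,6} (.+)$", chunk)
    return {
        "source_file": source_file,
        "page_number": page_number,
        "section_title": heading.group(1) if heading else None,
        # A line wrapped in pipes is treated as a markdown table row.
        "contains_table": bool(re.search(r"(?m)^\|.*\|$", chunk)),
        "length_tokens": len(chunk.split()),  # crude whitespace proxy
    }

chunk = "## Pricing\n| Plan | Cost |\n|---|---|\n| Pro | $20 |"
print(extract_metadata(chunk, "pricing.pdf", 4))
```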
A production document intelligence pipeline orchestrates: ingestion → parsing → chunking → enrichment → embedding → indexing. Stages can be parallel or sequential depending on scale.
- Ingestion: watch a folder, queue, or API. Deduplicate by hash.
- Parsing: dispatch to the appropriate parser. Handle errors gracefully; log problematic PDFs.
- Chunking: apply a hierarchy-aware strategy. Track sources for citations.
- Enrichment: add metadata and summaries. Optional: LLM-based entity extraction.
- Embedding: batch-embed chunks. Retry failed chunks.
- Indexing: upsert into the vector DB. Update the metadata store.
- Monitoring: log quality metrics, per-stage latency, and failure rates.
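The deduplicate-by-hash step of ingestion, sketched with a SHA-256 content digest (in production the `seen` set would live in a database, not memory):

```python
import hashlib

class Ingestor:
    """Deduplicate incoming documents by content hash before parsing."""

    def __init__(self):
        self.seen = set()

    def ingest(self, raw_bytes):
        digest = hashlib.sha256(raw_bytes).hexdigest()
        if digest in self.seen:
            return None  # exact duplicate: skip all downstream stages
        self.seen.add(digest)
        return digest    # new document: hash doubles as a stable doc id

ing = Ingestor()
print(ing.ingest(b"report v1") is not None)  # True: new document
print(ing.ingest(b"report v1"))              # None: duplicate skipped
```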
A production document processing pipeline typically combines several of the approaches above. The tools compared below target PDFs with mixed text, tables, and images, the most common enterprise case.
| Tool | Best for | Limitation |
|---|---|---|
| PyMuPDF4LLM | Clean digital PDFs, markdown output | Struggles with scanned docs |
| Unstructured | Mixed PDFs, tables, images | Slower; requires hi_res for accuracy |
| Azure Document Intelligence | Enterprise forms, invoices | Cost; external API dependency |
| Docling (IBM) | Scientific papers, complex layouts | Newer; smaller community |
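A skeleton of the end-to-end pattern, with stubs where a real parser and embedding model would plug in. The stub logic (decode-as-parse, fixed-size chunking, length-as-embedding) is illustrative only:

```python
def parse(raw):
    # Stub: swap in PyMuPDF4LLM, Unstructured, or Docling here.
    return raw.decode()

def chunk(markdown, size=200):
    # Naive fixed-size split standing in for hierarchy-aware chunking.
    return [markdown[i:i + size] for i in range(0, len(markdown), size)]

def embed(chunks):
    # Stub: swap in a real embedding model (batched, with retries).
    return [[float(len(c))] for c in chunks]

def run_pipeline(raw, source):
    """Ingest -> parse -> chunk -> embed -> records ready for indexing."""
    markdown = parse(raw)
    chunks = chunk(markdown)
    return [
        {"text": c, "embedding": e,
         "metadata": {"source_file": source, "chunk_index": i}}
        for i, (c, e) in enumerate(zip(chunks, embed(chunks)))
    ]  # each record is ready to upsert into a vector DB

records = run_pipeline(b"# Report\nMixed text, tables, images...", "report.pdf")
print(len(records), records[0]["metadata"])
```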