APPLICATIONS & SYSTEMS

Document Processing

Intelligent parsing, extraction, and transformation of PDFs, HTML, tables, and scanned documents for RAG and document intelligence applications.

parse → chunk → embed — the pipeline
multi-modal — text + tables + images
structure-aware — preserve semantics
Contents
  1. Why document parsing is hard
  2. Tools and ecosystem
  3. Parsing strategies
  4. Table and image handling
  5. Smart chunking
  6. Metadata enrichment
  7. End-to-end pipeline design
01 — Challenge

Why Document Parsing Is Hard

PDFs are not designed for machine consumption. They encode layout, not structure. A PDF is a visual format — text, fonts, positioning, bounding boxes — but doesn't explicitly mark paragraphs, sections, tables, or semantic boundaries. Extracting meaning requires heuristics, OCR, and layout analysis.

Core Challenges

| Challenge | What goes wrong | Why it matters |
|---|---|---|
| Multi-column layouts | Text order is wrong if not read left-to-right, top-to-bottom | Chunking breaks mid-sentence; context is lost |
| Tables | Naive PDF extraction treats cells as separate chunks and loses row/column relationships | RAG can't reason about table structure |
| Scanned PDFs | No embedded text — needs OCR (Tesseract, PaddleOCR) | OCR is slow and error-prone |
| Headers/footers | Repeated on every page; pollute chunks | Redundant context confuses the LLM |
| Images and diagrams | Text extraction ignores them; important context is lost | Missing visual information |
| Special formatting | Bold, italics, and font size hint at structure but are lost | Semantic hierarchy is flattened |
💡 The core insight: PDFs encode presentation, not structure. Modern document intelligence tools (Docling, Unstructured.io) reverse this: they infer structure from layout using ML models, then output clean markdown with preserved hierarchies.
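The multi-column problem above is ultimately a reading-order problem: sorting text blocks purely top-to-bottom interleaves the columns. A minimal sketch of the fix, assuming a layout parser has already produced (x0, y0, text) bounding-box positions for each block — the block tuples and page width here are made up for illustration:

```python
# Sketch: restore reading order for a multi-column page, assuming each text
# block is (x0, y0, text) with coordinates from a layout parser's bounding boxes.

def reading_order(blocks, page_width, n_cols=2):
    """Assign each block to a column by its x position, then sort each
    column top-to-bottom and read columns left-to-right."""
    col_width = page_width / n_cols
    columns = [[] for _ in range(n_cols)]
    for x0, y0, text in blocks:
        col = min(int(x0 // col_width), n_cols - 1)
        columns[col].append((y0, text))
    ordered = []
    for col in columns:
        ordered.extend(text for _, text in sorted(col))
    return ordered

blocks = [
    (310, 50, "Meanwhile, in column two..."),  # top of right column
    (30, 50, "Paragraph one starts here"),     # top of left column
    (30, 200, "and continues below it."),      # bottom of left column
]
print(reading_order(blocks, page_width=600))
# → ['Paragraph one starts here', 'and continues below it.', 'Meanwhile, in column two...']
```

A naive y-sort of the same blocks would interleave the two columns mid-sentence; real tools do the same thing with learned column detection rather than a fixed column count.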
02 — Tools

Document Processing Tools

The ecosystem spans simple libraries (PyMuPDF) to sophisticated ML-based parsers (Docling). Choice depends on document diversity, latency budget, and accuracy targets.

Popular Document Processing Tools

Parsers
  • Docling — SOTA PDF-to-markdown converter with layout analysis. Handles tables, images, and multi-column layouts. By IBM.
  • Unstructured.io — Unified API for PDFs, HTML, and DOCX. Partitions documents into elements and extracts tables. Managed + open-source.

Libraries
  • PyMuPDF (fitz) — Fast, lightweight PDF text extraction. No layout analysis. Good for simple documents.
  • pdfplumber — Extracts text, tables, and metadata. Better table detection than PyMuPDF.
  • Marker — Fast PDF-to-markdown using computer vision. Lighter weight than Docling.

OCR
  • PaddleOCR / Tesseract — Optical character recognition for scanned PDFs. PaddleOCR is faster and multi-language.

Tool Comparison Matrix

| Tool | Speed | Table handling | Layout awareness | Cost | Best for |
|---|---|---|---|---|---|
| Docling | Slow | Excellent | Excellent (ML layout models) | Open | Complex PDFs, precision-critical |
| Unstructured | Medium | Good | Good (heuristic) | Open + managed API | Diverse doc types, no setup |
| PyMuPDF | Very fast | Poor | None | Open | Simple PDFs, low latency |
| pdfplumber | Fast | Good | Poor | Open | Table-heavy documents |
| Marker | Fast | Good | Good (CV) | Open | Balance of speed and quality |
03 — Extraction

Parsing Strategies

Different document types need different approaches. A research paper has sections and citations. A manual has steps and warnings. A legal document has articles and clauses. Parse accordingly.

Strategy by Document Type

1. PDFs (scientific papers, reports) — Layout-aware parsing

Use Docling or Marker: infer document structure (title, abstract, sections), preserve hierarchy in markdown.

  • Detects columns, paragraphs, list items
  • Outputs markdown with ### hierarchy preserved
  • Chunks can respect section boundaries
2. Scanned/image PDFs — OCR pipeline

First check if text is embedded. If not, apply PaddleOCR or Tesseract. Accept lower accuracy; filter low-confidence tokens.

  • Detect if PDF is scanned (check for text layer)
  • Apply OCR with language hints
  • Post-process: fix common OCR errors (O→0, etc)
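The post-processing step can be as simple as rule-based substitution. A hedged sketch, assuming OCR text is already in hand; the substitution table and the "mostly digits" heuristic are illustrative, not a standard:

```python
import re

# Sketch of rule-based OCR cleanup: fix classic letter/digit confusions
# (O→0, l→1, S→5) only inside tokens that are already mostly digits,
# so ordinary words are left untouched.

SUBS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"}

def fix_ocr_digits(text: str) -> str:
    def fix_token(match):
        token = match.group(0)
        digits = sum(c.isdigit() for c in token)
        if digits / len(token) < 0.5:   # mostly letters: leave it alone
            return token
        return "".join(SUBS.get(c, c) for c in token)
    return re.sub(r"\b\w+\b", fix_token, text)

print(fix_ocr_digits("Invoice 2O23-l0 total $4S.00 for Oliver"))
# → "Invoice 2023-10 total $45.00 for Oliver"
```

Real pipelines add confidence filtering from the OCR engine itself and dictionary checks on the letter-heavy tokens; the point is that cheap rules recover a surprising amount of numeric data.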
3. HTML / web content — DOM extraction

Parse HTML structure; remove boilerplate (nav, ads, footer). BeautifulSoup or trafilatura for cleanup.

  • Identify main content blocks (article, main, post)
  • Remove script, style, nav elements
  • Preserve semantic tags (h1, h2, ul, blockquote)
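The DOM-extraction idea can be sketched with only the standard library; a real pipeline would use BeautifulSoup or trafilatura, which handle malformed HTML and boilerplate detection far more robustly. The tag list here is an assumption:

```python
from html.parser import HTMLParser

# Sketch: keep text outside of boilerplate containers, drop everything
# inside script/style/nav/header/footer/aside subtrees.

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

page = """<html><head><style>p{color:red}</style></head>
<body><nav>Home | About</nav><article><h1>Title</h1>
<p>Main content here.</p></article><footer>© 2024</footer></body></html>"""

parser = MainTextExtractor()
parser.feed(page)
print(" ".join(parser.parts))  # → "Title Main content here."
```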
4. DOCX / rich formatted documents — AST parsing

Use python-docx or Unstructured to parse document tree. Preserve formatting, extract text + metadata.

  • Read XML tree inside DOCX
  • Preserve paragraphs, lists, tables, styles
  • Extract embedded images and hyperlinks
Best practice: Detect the document type automatically (MIME type, file extension, magic bytes) and route to the appropriate parser. Combine lightweight and heavyweight parsers: try fast PyMuPDF first, and fall back to Docling if tables are detected.
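The detection-and-routing step might look like the following sketch. The magic-byte checks are standard (PDF files start with %PDF; DOCX is a ZIP container, so it starts with PK\x03\x04), but the ROUTES registry and its parser names are purely illustrative:

```python
import tempfile
from pathlib import Path

# Sketch: route a file to a parser by magic bytes first, extension second.

def detect_doc_type(path: Path) -> str:
    head = path.read_bytes()[:32]
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):   # DOCX/XLSX are ZIP containers
        return "docx"
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    return path.suffix.lstrip(".").lower() or "unknown"

ROUTES = {  # hypothetical parser registry
    "pdf": "docling",
    "docx": "python-docx",
    "html": "trafilatura",
}

with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "report.bin"          # wrong extension on purpose
    f.write_bytes(b"%PDF-1.7 rest of file...")
    kind = detect_doc_type(f)
    print(kind, "->", ROUTES.get(kind, "fallback"))  # → pdf -> docling
```

Magic bytes beat extensions because uploaded files are routinely misnamed; the extension is only a last resort.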
04 — Tables & Images

Table and Image Handling

Tables and diagrams carry semantic information that naive text extraction loses. Strategies: preserve table structure in markdown, embed images separately, or use multi-modal LLMs that understand images.

Table Extraction

Naive extraction ("cell1 cell2 cell3...") loses row/column relationships. Better: output markdown tables or CSV. Best: preserve the table as structured data (rows, columns, metadata). Tooling: pdfplumber exposes extract_table(), Docling outputs tables as markdown, and Unstructured returns Table elements.

Input PDF table:

  ┌─────────────┬──────────┐
  │ Product     │ Price    │
  ├─────────────┼──────────┤
  │ Widget A    │ $29.99   │
  │ Widget B    │ $39.99   │
  └─────────────┴──────────┘

Naive extraction output:

  "Product Price Widget A $29.99 Widget B $39.99"
  → Lost structure; can't query "what's the price of Widget B?"

Docling markdown output:

  | Product  | Price  |
  |----------|--------|
  | Widget A | $29.99 |
  | Widget B | $39.99 |

  → Structure preserved; the LLM understands it's a table
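The "better" option (markdown tables) is easy to produce once a parser has handed you the rows as a list of lists, e.g. from pdfplumber's extract_table(). A minimal sketch:

```python
# Sketch: serialize a table given as list-of-rows (e.g. the return value of
# pdfplumber's page.extract_table()) into a markdown table string.

def rows_to_markdown(rows):
    header, *body = rows
    def fmt(row):
        return "| " + " | ".join("" if c is None else str(c) for c in row) + " |"
    lines = [fmt(header), "|" + "|".join("---" for _ in header) + "|"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)

rows = [["Product", "Price"], ["Widget A", "$29.99"], ["Widget B", "$39.99"]]
print(rows_to_markdown(rows))
```

The None-handling matters in practice: pdfplumber emits None for empty cells, and a bare "None" string in a chunk would mislead the LLM.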

Image Handling

  • Approach 1 (naive): ignore images. Problem: charts, diagrams, and screenshots carry critical context.
  • Approach 2 (save separately): extract images, store them by reference, and include the caption in the text.
  • Approach 3 (multi-modal): use a vision LLM to generate captions or detailed descriptions of images, and include them in chunks.

⚠️ Cost/quality tradeoff: Vision LLMs (Claude 3.5 Sonnet, GPT-4V) can caption images but cost ~$0.01 per image. For production, cache descriptions or use lightweight models (BLIP, LLaVA).
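The caching suggested above can key on a content hash of the image bytes, so duplicate images (logos, repeated diagrams) are sent to the vision model exactly once. A sketch under assumptions: cached_caption is an illustrative helper, and fake_vision_model stands in for the real, expensive API call:

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Sketch: cache vision-model captions by image content hash.
# `caption_fn` is a hypothetical stand-in for a real vision-LLM call.

def cached_caption(image_bytes: bytes, caption_fn, cache_dir: Path) -> str:
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():                        # cache hit: no model call
        return json.loads(cache_file.read_text())["caption"]
    caption = caption_fn(image_bytes)              # cache miss: pay once
    cache_file.write_text(json.dumps({"caption": caption}))
    return caption

calls = []
def fake_vision_model(img):                        # stands in for the API call
    calls.append(img)
    return "Bar chart of quarterly revenue"

with tempfile.TemporaryDirectory() as tmp:
    img = b"\x89PNG...fake image bytes"
    first = cached_caption(img, fake_vision_model, Path(tmp))
    second = cached_caption(img, fake_vision_model, Path(tmp))
    print(first == second, len(calls))  # → True 1
```

Hashing content rather than filename means the cache survives re-ingestion of the same document under a new name.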
05 — Segmentation

Intelligent Chunking for Documents

Once parsed into clean markdown, chunk it. But naive chunking ignores the document structure you just extracted. Better: respect section boundaries, keep tables intact, avoid breaking mid-concept.

Structure-Aware Chunking

  • Hierarchical: split on ## or ###, merge until the chunk size limit. Respects the document outline.
  • Semantic: embed paragraphs individually, use similarity to detect thematic breaks, and split there.
  • Overlap: add the last N tokens of the previous chunk to the next. Bridges concept boundaries.
  • Table-aware: never split a table mid-row. Keep tables with their surrounding paragraphs.

Hierarchical chunking example:

  Document markdown:
    # Chapter 1: Foundations
    ## 1.1 Introduction  [10 sentences]
    ## 1.2 Methodology   [8 sentences]
    ## 1.3 Results       [15 sentences]

  Chunk 1: "# Chapter 1..." + "## 1.1..." (full section)
  Chunk 2: "## 1.2..." (full section)
  Chunk 3: "## 1.3..." (full section)

  vs naive fixed-size chunking:
  Chunk 1: 512 tokens from the start → cuts 1.2 in half; loses context

Structure-aware chunking is strictly better for documents with headers.
Recommendation: After parsing with Docling/Marker, you have markdown with ## structure. Use a recursive character splitter with separators=["\n## ", "\n### ", "\n\n", "\n"] to respect the hierarchy. Never hardcode fixed sizes without understanding the document structure.
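The hierarchical strategy above can be sketched in a few lines of plain Python: split at headers so each section keeps its heading, then greedily merge neighboring sections until a size budget is reached. The character budget and example document are illustrative; production code would use a library splitter and token counts:

```python
import re

# Sketch of structure-aware chunking over parsed markdown.

def chunk_markdown(md: str, max_chars: int = 500):
    # Split *before* every ## / ### header (lookahead keeps the header
    # attached to its own body), then merge sections up to the budget.
    sections = re.split(r"\n(?=#{2,3} )", md)
    chunks, current = [], ""
    for sec in sections:
        if current and len(current) + len(sec) > max_chars:
            chunks.append(current.strip())
            current = sec
        else:
            current = f"{current}\n{sec}" if current else sec
    if current:
        chunks.append(current.strip())
    return chunks

doc = (
    "# Chapter 1: Foundations\n"
    "## 1.1 Introduction\n" + "intro sentence. " * 10 + "\n"
    "## 1.2 Methodology\n" + "method sentence. " * 8 + "\n"
    "## 1.3 Results\n" + "result sentence. " * 15
)
for i, c in enumerate(chunk_markdown(doc, max_chars=300)):
    print(i, c.splitlines()[0])   # every chunk begins at a heading boundary
```

Note that no chunk ever starts mid-paragraph: a section either fits alongside its neighbors or opens a fresh chunk with its heading intact.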
06 — Enhancement

Metadata Enrichment

Each chunk needs metadata for filtering, tracking, and re-ranking. Metadata can be extracted (from document) or generated (via LLM).

Essential Metadata

  • Structural: source_file, page_number, section_title, document_type
  • Temporal: created_date, modified_date, retrieval_date
  • Content: language, length_tokens, contains_table, contains_image
  • Semantic: summary, keywords, entity_types (if extracted)
  • Access: accessible_users, classification_level, owner

| Metadata field | How to extract | Use case |
|---|---|---|
| section_title | Parse markdown headers | Context in retrieval |
| page_number | From PDF; infer from layout | Citation, pagination |
| created_date | File metadata or text extraction | Recency ranking |
| contains_table | Detect table markers in markdown | Filter table-heavy docs |
| keywords | TF-IDF, NER, or LLM extraction | Sparse retrieval, discovery |
| summary | Abstractive summarization | Chunk re-ranking, preview |
💡 Cost optimization: Extract structural metadata (headers, page num) from parsing. Generate semantic metadata (keywords, summary) lazily or in batch, not per-chunk at index time. Cache summaries; reuse across queries.
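Structural fields really can come straight from the parsed markdown, as the table suggests. A sketch, assuming each chunk is a markdown string; the whitespace token count is a rough stand-in for a real tokenizer:

```python
import re

# Sketch: derive structural metadata for a chunk from its markdown alone.

def extract_structural_metadata(chunk: str, source: str) -> dict:
    header = re.search(r"^#{1,6}\s+(.*)$", chunk, flags=re.M)
    return {
        "source_file": source,
        "section_title": header.group(1) if header else None,
        "length_tokens": len(chunk.split()),   # rough whitespace estimate
        "contains_table": bool(re.search(r"^\|.*\|\s*$", chunk, flags=re.M)),
    }

chunk = "## 1.2 Methodology\n\n| Step | Tool |\n|---|---|\n| parse | Docling |"
print(extract_structural_metadata(chunk, "paper.pdf"))
```

Everything here is free at index time; only the semantic fields (keywords, summary) require model calls, which is exactly why the tip above says to generate those lazily or in batch.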
07 — Production

End-to-End Pipeline Design

A production document intelligence pipeline orchestrates: ingestion → parsing → chunking → enrichment → embedding → indexing. Stages can be parallel or sequential depending on scale.

Pipeline Architecture

  • Ingestion: watch a folder, queue, or API. Deduplicate by hash.
  • Parsing: dispatch to the appropriate parser. Handle errors gracefully; log problematic PDFs.
  • Chunking: apply a hierarchy-aware strategy. Track source for citations.
  • Enrichment: add metadata and summaries. Optional: LLM-based entity extraction.
  • Embedding: batch-embed chunks. Retry failed chunks.
  • Indexing: upsert into the vector DB. Update the metadata store.
  • Monitoring: log quality metrics, per-stage latency, and failure rates.
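The "deduplicate by hash" step in the ingestion stage hashes file contents rather than names, so renamed copies of the same document are indexed only once. A minimal sketch:

```python
import hashlib
import tempfile
from pathlib import Path

# Sketch: drop any file whose content hash has already been seen,
# regardless of filename.

def dedupe_by_hash(paths):
    seen, unique = set(), []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "report.pdf"
    b = Path(tmp) / "report_copy.pdf"   # same bytes, different name
    c = Path(tmp) / "other.pdf"
    a.write_bytes(b"%PDF doc one")
    b.write_bytes(b"%PDF doc one")
    c.write_bytes(b"%PDF doc two")
    print([p.name for p in dedupe_by_hash([a, b, c])])
    # → ['report.pdf', 'other.pdf']
```

In a real pipeline the seen-hash set lives in the metadata store so dedup survives restarts.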

Pseudocode for the pipeline:

    def process_document_batch(documents: List[Path]):
        for doc_path in documents:
            try:
                # Parse
                doc_type = detect_type(doc_path)
                content = parse(doc_type, doc_path)  # → markdown

                # Chunk
                chunks = smart_chunk(content, doc_path)

                # Enrich
                for chunk in chunks:
                    chunk.metadata = extract_metadata(chunk, doc_path)
                    chunk.summary = summarize(chunk.text)

                # Embed + index
                embeddings = embed_batch(chunks)
                vectorstore.upsert(chunks, embeddings)
            except Exception as e:
                log_error(doc_path, e)
                update_status(doc_path, 'failed')

        update_status('processing_complete')
        report_metrics()  # timing, quality, errors
08 — Example

End-to-End Pipeline Example

A production document processing pipeline typically combines several of the approaches above. The pattern below handles PDFs with mixed text, tables, and images — the most common enterprise use case.

    # Full document pipeline: parse → chunk → embed → store
    # pip install pymupdf4llm unstructured[pdf] langchain-openai langchain-community chromadb

    import pymupdf4llm
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    def process_document(pdf_path: str, collection_name: str):
        # 1. Parse: use pymupdf4llm for clean markdown extraction
        md_text = pymupdf4llm.to_markdown(pdf_path)

        # 2. For complex layouts (forms, mixed tables), fall back to unstructured:
        # from unstructured.partition.pdf import partition_pdf
        # elements = partition_pdf(filename=pdf_path, strategy="hi_res",
        #                          extract_images_in_pdf=True)

        # 3. Chunk with overlap; respect document structure
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=800,
            chunk_overlap=100,
            separators=["\n## ", "\n### ", "\n\n", "\n", " "],
        )
        chunks = splitter.create_documents(
            [md_text],
            metadatas=[{"source": pdf_path, "page_count": 0}],
        )

        # 4. Embed and store
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        vectorstore = Chroma.from_documents(
            chunks,
            embeddings,
            collection_name=collection_name,
            persist_directory="./chroma_db",
        )
        print(f"Indexed {len(chunks)} chunks from {pdf_path}")
        return vectorstore

    vs = process_document("annual_report.pdf", "financials")
    results = vs.similarity_search("What was the revenue growth in Q4?", k=4)
    for r in results:
        print(r.page_content[:120])
| Tool | Best for | Limitation |
|---|---|---|
| PyMuPDF4LLM | Clean digital PDFs, markdown output | Struggles with scanned docs |
| Unstructured | Mixed PDFs, tables, images | Slower; requires hi_res for accuracy |
| Azure Document Intelligence | Enterprise forms, invoices | Cost; external API dependency |
| Docling (IBM) | Scientific papers, complex layouts | Newer; smaller community |