Intelligent parsing, extraction, and transformation of PDFs, HTML, tables, and scanned documents for RAG and document intelligence applications.
PDFs are not designed for machine consumption: they encode layout, not structure. A PDF records visual primitives (text runs, fonts, positioning, bounding boxes) but does not explicitly mark paragraphs, sections, tables, or semantic boundaries. Extracting meaning requires heuristics, OCR, and layout analysis.
| Challenge | What goes wrong | Why it matters |
|---|---|---|
| Multi-column layouts | Extractors read left-to-right across columns, scrambling reading order | Chunking breaks mid-sentence; context is lost |
| Tables | Naive PDF extraction treats cells as separate chunks, loses row/col relationships | RAG can't reason about table structure |
| Scanned PDFs | No embedded text — need OCR (Tesseract, PaddleOCR) | OCR is slow and error-prone |
| Headers/footers | Repeated on every page; pollutes chunks | Redundant context confuses LLM |
| Images and diagrams | Text extraction ignores them; important context lost | Missing visual information |
| Special formatting | Bold, italics, font size hint structure but are lost | Semantic hierarchy is flattened |
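One common fix for the header/footer problem in the table above: collect the lines on each page and drop any line that repeats across most pages. A minimal sketch; the 60% threshold is an arbitrary assumption, and page numbers survive because they differ page to page:

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove lines that repeat across most pages (likely headers/footers).

    pages: list of per-page text strings.
    threshold: fraction of pages a line must appear on to count as boilerplate.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        counts.update(set(page.splitlines()))
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff and line.strip()}
    return [
        "\n".join(ln for ln in page.splitlines() if ln not in boilerplate)
        for page in pages
    ]

pages = [
    "ACME Corp Confidential\nIntro text\nPage 1",
    "ACME Corp Confidential\nMore text\nPage 2",
    "ACME Corp Confidential\nFinal text\nPage 3",
]
print(strip_repeated_lines(pages)[0])  # header gone; "Page 1" stays (appears once)
```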
The ecosystem spans simple libraries (PyMuPDF) to sophisticated ML-based parsers (Docling). Choice depends on document diversity, latency budget, and accuracy targets.
| Tool | Speed | Table handling | Layout awareness | Cost | Best for |
|---|---|---|---|---|---|
| Docling | Slow | Excellent | Excellent (ML layout models) | Open | Complex PDFs, precision-critical |
| Unstructured | Medium | Good | Good (heuristic) | Open + Managed API | Diverse doc types, no setup |
| PyMuPDF | Very fast | Poor | None | Open | Simple PDFs, low latency |
| pdfplumber | Fast | Good | Poor | Open | Table-heavy documents |
| Marker | Fast | Good | Good (CV) | Open | Balance speed and quality |
Different document types need different approaches. A research paper has sections and citations. A manual has steps and warnings. A legal document has articles and clauses. Parse accordingly.
- Digital PDFs: use Docling or Marker to infer document structure (title, abstract, sections) and preserve the hierarchy in markdown.
- Scanned PDFs: first check whether text is embedded. If not, apply PaddleOCR or Tesseract; accept lower accuracy and filter low-confidence tokens.
- HTML: parse the document structure and remove boilerplate (nav, ads, footer). Use BeautifulSoup or trafilatura for cleanup.
- Word documents: use python-docx or Unstructured to parse the document tree. Preserve formatting; extract text plus metadata.
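The per-type routing can be sketched as a dispatch table keyed on file extension. The parser functions below are placeholders standing in for Docling, PaddleOCR, trafilatura, python-docx, and the rest:

```python
from pathlib import Path

# Hypothetical parser callables; in a real pipeline these would wrap
# Docling/Marker, an OCR engine, trafilatura, python-docx, etc.
def parse_pdf(path):  return f"pdf-parsed:{path.name}"
def parse_html(path): return f"html-parsed:{path.name}"
def parse_docx(path): return f"docx-parsed:{path.name}"

PARSERS = {
    ".pdf": parse_pdf,
    ".html": parse_html,
    ".htm": parse_html,
    ".docx": parse_docx,
}

def parse_document(path):
    """Route a file to the parser registered for its extension."""
    path = Path(path)
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"No parser registered for {path.suffix!r}")
    return parser(path)

print(parse_document("report.pdf"))  # pdf-parsed:report.pdf
```

In practice the dispatch key is often richer than the extension, e.g. a PDF is further routed to OCR if no embedded text is found.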
Tables and diagrams carry semantic information that naive text extraction loses. Strategies: preserve table structure in markdown, embed images separately, or use multi-modal LLMs that understand images.
Naive extraction: "cell1 cell2 cell3..." → loses row/column relationships. Better: output markdown tables or CSV. Best: preserve table as structured data (rows, columns, metadata). Tools: pdfplumber has table.extract(), Docling outputs tables in markdown, Unstructured returns Table elements.
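A minimal sketch of the "better" option: rendering rows as returned by an extractor (pdfplumber's `table.extract()` yields a list of row lists, with `None` for empty cells) as a markdown table. The sample rows are invented:

```python
def rows_to_markdown(rows):
    """Render extracted table rows (list of lists) as a markdown table.

    Assumes the first row is the header row.
    """
    header, *body = rows
    lines = [
        "| " + " | ".join(str(c) for c in header) + " |",
        "|" + "---|" * len(header),
    ]
    for row in body:
        # Empty cells come back as None from extractors like pdfplumber.
        lines.append("| " + " | ".join("" if c is None else str(c) for c in row) + " |")
    return "\n".join(lines)

rows = [["Region", "Q1", "Q2"], ["EMEA", "1.2", "1.4"], ["APAC", "0.9", None]]
print(rows_to_markdown(rows))
```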
- Approach 1 (naive): ignore images. Problem: charts, diagrams, and screenshots carry critical context.
- Approach 2 (save separately): extract images, store them by reference, and include the caption in the text.
- Approach 3 (multi-modal): use a vision LLM to generate captions or detailed descriptions of images, and include them in the chunks.
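Approach 2 can be sketched as a caption-substitution pass over the parsed text. The `[[image:id]]` placeholder syntax is an assumption of this sketch, not any library's convention:

```python
def inline_image_captions(text, captions):
    """Replace image placeholders with stored captions (approach 2).

    captions maps an image id (e.g. 'img-3') to its caption; the image
    files themselves are stored separately and referenced by id.
    """
    for image_id, caption in captions.items():
        placeholder = f"[[image:{image_id}]]"
        text = text.replace(placeholder, f"[Image {image_id}: {caption}]")
    return text

doc = "Revenue grew sharply. [[image:img-3]] See appendix."
print(inline_image_captions(doc, {"img-3": "Quarterly revenue bar chart, 2021-2024"}))
```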
Once parsed into clean markdown, chunk it. But naive chunking ignores the document structure you just extracted. Better: respect section boundaries, keep tables intact, avoid breaking mid-concept.
- Hierarchical: split on ## or ### headings, then merge sections until the chunk size limit. Respects the document outline.
- Semantic: embed paragraphs individually, use similarity to detect thematic breaks, and split there.
- Overlap: add the last N tokens of the previous chunk to the next. Bridges concept boundaries.
- Table-aware: never split a table mid-row. Keep tables with their surrounding paragraphs.
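The hierarchical strategy can be sketched in a few lines: split on ## / ### headings, then greedily merge whole sections up to a size limit, so chunk boundaries always fall on heading boundaries. Character counts stand in for tokens here:

```python
import re

def hierarchical_chunks(markdown, max_chars=400):
    """Split markdown on ##/### headings, merge sections up to max_chars."""
    # Lookahead split keeps each heading attached to the text that follows it.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{2,3} )", markdown) if s.strip()]
    chunks, current = [], ""
    for section in sections:
        candidate = (current + "\n\n" + section).strip() if current else section
        if current and len(candidate) > max_chars:
            chunks.append(current)   # flush: next section starts a new chunk
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

doc = "## Intro\nShort intro.\n## Methods\n" + "Details. " * 60 + "\n## Results\nBrief."
for c in hierarchical_chunks(doc):
    print(len(c), c.splitlines()[0])
```

Note that an oversized single section is kept whole in this sketch; a production version would fall back to paragraph-level splitting inside it.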
Each chunk needs metadata for filtering, tracking, and re-ranking. Metadata can be extracted (from document) or generated (via LLM).
- Structural: source_file, page_number, section_title, document_type
- Temporal: created_date, modified_date, retrieval_date
- Content: language, length_tokens, contains_table, contains_image
- Semantic: summary, keywords, entity_types (if extracted)
- Access: accessible_users, classification_level, owner
| Metadata field | How to extract | Use case |
|---|---|---|
| section_title | Parse markdown headers | Context in retrieval |
| page_number | From PDF; infer from layout | Citation, pagination |
| created_date | File metadata or text extraction | Recency ranking |
| contains_table | Detect table markers in markdown | Filter table-heavy docs |
| keywords | TF-IDF, NER, or LLM extraction | Sparse retrieval, discovery |
| summary | Abstractive summarization | Chunk re-ranking, preview |
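The cheap structural fields in the table can be derived with a few regexes over the markdown chunk; this sketch uses whitespace-split word count as a crude stand-in for a real tokenizer:

```python
import re

def extract_metadata(chunk, source_file, page_number):
    """Derive cheap structural metadata from a markdown chunk."""
    heading = re.search(r"(?m)^#{1,6} (.+)$", chunk)
    return {
        "source_file": source_file,
        "page_number": page_number,
        "section_title": heading.group(1) if heading else None,
        # A line wrapped in pipes is treated as a markdown table row.
        "contains_table": bool(re.search(r"(?m)^\|.*\|$", chunk)),
        "length_tokens": len(chunk.split()),  # crude whitespace proxy
    }

chunk = "## Pricing\n| Plan | Cost |\n|---|---|\n| Pro | $20 |"
print(extract_metadata(chunk, "pricing.pdf", 4))
```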
A production document intelligence pipeline orchestrates: ingestion → parsing → chunking → enrichment → embedding → indexing. Stages can be parallel or sequential depending on scale.
- Ingestion: watch a folder, queue, or API. Deduplicate by hash.
- Parsing: dispatch to the appropriate parser. Handle errors gracefully; log problematic PDFs.
- Chunking: apply a hierarchy-aware strategy. Track sources for citations.
- Enrichment: add metadata and summaries. Optional: LLM-based entity extraction.
- Embedding: batch-embed chunks. Retry failed chunks.
- Indexing: upsert into the vector DB. Update the metadata store.
- Monitoring: log quality metrics, per-stage latency, and failure rates.
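The deduplicate-by-hash step of ingestion, sketched with a SHA-256 content digest (in production the `seen` set would live in a database, not memory):

```python
import hashlib

class Ingestor:
    """Deduplicate incoming documents by content hash before parsing."""

    def __init__(self):
        self.seen = set()

    def ingest(self, raw_bytes):
        digest = hashlib.sha256(raw_bytes).hexdigest()
        if digest in self.seen:
            return None  # exact duplicate: skip all downstream stages
        self.seen.add(digest)
        return digest    # new document: hash doubles as a stable doc id

ing = Ingestor()
print(ing.ingest(b"report v1") is not None)  # True: new document
print(ing.ingest(b"report v1"))              # None: duplicate skipped
```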
A production document processing pipeline typically combines several of the approaches above. The tools compared below target PDFs with mixed text, tables, and images, the most common enterprise case.
| Tool | Best for | Limitation |
|---|---|---|
| PyMuPDF4LLM | Clean digital PDFs, markdown output | Struggles with scanned docs |
| Unstructured | Mixed PDFs, tables, images | Slower; requires hi_res for accuracy |
| Azure Document Intelligence | Enterprise forms, invoices | Cost; external API dependency |
| Docling (IBM) | Scientific papers, complex layouts | Newer; smaller community |
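A skeleton of the end-to-end pattern, with stubs where a real parser and embedding model would plug in. The stub logic (decode-as-parse, fixed-size chunking, length-as-embedding) is illustrative only:

```python
def parse(raw):
    # Stub: swap in PyMuPDF4LLM, Unstructured, or Docling here.
    return raw.decode()

def chunk(markdown, size=200):
    # Naive fixed-size split standing in for hierarchy-aware chunking.
    return [markdown[i:i + size] for i in range(0, len(markdown), size)]

def embed(chunks):
    # Stub: swap in a real embedding model (batched, with retries).
    return [[float(len(c))] for c in chunks]

def run_pipeline(raw, source):
    """Ingest -> parse -> chunk -> embed -> records ready for indexing."""
    markdown = parse(raw)
    chunks = chunk(markdown)
    return [
        {"text": c, "embedding": e,
         "metadata": {"source_file": source, "chunk_index": i}}
        for i, (c, e) in enumerate(zip(chunks, embed(chunks)))
    ]  # each record is ready to upsert into a vector DB

records = run_pipeline(b"# Report\nMixed text, tables, images...", "report.pdf")
print(len(records), records[0]["metadata"])
```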