Data Ingestion

Docling — Structured Document Parsing

IBM's open-source library for converting PDFs, DOCX, and PPTX into structured DoclingDocument objects, with deep table extraction and Markdown export for RAG.

6+ input formats · ~95% table accuracy · 4 export formats
01 — Overview

Core Concepts

Docling is IBM Research's open-source document conversion library. It converts complex documents (PDFs, Word, PowerPoint) into a unified DoclingDocument object that preserves structure — headings, paragraphs, tables, figures, lists — and can export to Markdown, JSON, HTML, or DocTags for downstream RAG pipelines.

The DocumentConverter

The central entry point is DocumentConverter. Pass a file path to convert() and get back a ConversionResult:

from docling.document_converter import DocumentConverter

# Basic conversion
converter = DocumentConverter()
result = converter.convert("report.pdf")

# Access the structured document
doc = result.document

# Export to Markdown (best for RAG)
markdown = doc.export_to_markdown()
print(markdown[:500])

DoclingDocument Structure

Every conversion produces a DoclingDocument — a rich object with typed blocks, metadata, and page info:

doc = result.document

# Metadata
print(doc.name)          # document name
print(doc.num_pages())   # page count

# Iterate all items in reading order (not every item has text)
for item, level in doc.iterate_items():
    text = getattr(item, "text", "")
    print(type(item).__name__, text[:80])

# Tables only
for table in doc.tables:
    df = table.export_to_dataframe()
    print(df.shape, "columns:", list(df.columns))
💡 Why Docling over PyPDF? PyPDF extracts raw text with no structure. Docling understands layout — it knows which text is a heading vs body, preserves table cells, and handles multi-column PDFs, making chunk quality significantly better for RAG.
02 — Input Formats

Supported Document Formats

Docling handles all common enterprise document formats through a unified pipeline. The same API call works for all types — format is detected automatically from the file extension.

Format       Extension    Table Support       Notes
PDF          .pdf         ✅ TableFormer AI    Best supported; layout + reading order
Word         .docx        ✅ Native            Styles map to heading levels
PowerPoint   .pptx        ✅ Native            Slides become sections
Excel        .xlsx        ✅ Native            Sheets exported as tables
HTML         .html        ✅ Native            Semantic tags preserved
Images       .png/.jpg    ⚠️ OCR only          Requires tesseract or EasyOCR
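The table above can be mirrored in application code as a simple routing check before invoking the converter. A minimal sketch; the SUPPORTED/OCR_ONLY sets and route_document helper are illustrative, not part of Docling's API:

```python
from pathlib import Path

# Illustrative sets mirroring the format table above
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".png", ".jpg"}
OCR_ONLY = {".png", ".jpg"}  # images require tesseract or EasyOCR

def route_document(path: str) -> str:
    """Classify a file before handing it to DocumentConverter."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        return "unsupported"
    return "ocr" if ext in OCR_ONLY else "native"

print(route_document("report.pdf"))   # native
print(route_document("scan.png"))     # ocr
print(route_document("notes.txt"))    # unsupported
```

A check like this lets a pipeline skip or flag files that would silently fall back to OCR-only handling.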

Batch Conversion

from pathlib import Path

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert entire directory
results = converter.convert_all(
    [str(p) for p in Path("./documents").glob("**/*.pdf")],
    raises_on_error=False,
)

for result in results:
    if result.status == ConversionStatus.SUCCESS:
        md = result.document.export_to_markdown()
        out_path = Path("output") / (result.input.file.stem + ".md")
        out_path.write_text(md)
    else:
        print(f"Failed: {result.input.file.name} ({result.status})")
03 — Layout

Layout Analysis

Docling uses deep learning models to reconstruct the logical reading order of a document, classify text blocks by type, and segment the page into coherent regions. This is what makes it far more useful than text extractors for RAG.

Block Type          Description                                       RAG Relevance
SectionHeaderItem   Chapter / section headings with hierarchy level   High: chunk boundaries
TextItem            Body paragraphs, list items                       Core content
TableItem           Structured tabular data (cells + structure)       High: structured facts
PictureItem         Images and figures with captions                  Metadata + alt text
ListItem            Bulleted / numbered list entries                  Preserves enumeration
FormulaItem         Mathematical expressions (LaTeX)                  Technical docs

Iterating Layout Elements

from docling_core.types.doc import SectionHeaderItem, TableItem, TextItem
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("report.pdf").document

# Iterate with level (heading depth); check SectionHeaderItem before
# TextItem, since it is the more specific type
for item, level in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        print(f"{'  ' * level}H{level}: {item.text}")
    elif isinstance(item, TableItem):
        print(f"  [TABLE: {item.data.num_rows}×{item.data.num_cols}]")
    elif isinstance(item, TextItem):
        print(f"  {item.text[:60]}...")
⚠️ Multi-column PDFs: Docling handles 2-column academic layouts. Reading order is reconstructed using layout analysis — text won't be interleaved across columns as it would with naive text extraction.
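Because iterate_items() yields items in reading order with their heading depth, chunk boundaries can be derived directly from layout. A minimal sketch, with items simplified to ("heading" | "text", str) tuples as a stand-in for the typed SectionHeaderItem/TextItem objects:

```python
def chunk_by_headings(items):
    """Group body text under the most recent heading.

    `items` is a list of ("heading" | "text", str) tuples, a simplified
    stand-in for what doc.iterate_items() yields.
    """
    chunks, current = [], {"heading": None, "body": []}
    for kind, text in items:
        if kind == "heading":
            if current["body"]:
                chunks.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["body"]:
        chunks.append(current)
    return chunks

items = [
    ("heading", "1. Revenue"),
    ("text", "Revenue grew 12% year over year."),
    ("heading", "2. Costs"),
    ("text", "Costs were flat."),
]
print(chunk_by_headings(items))
```

The same grouping logic applies unchanged to the real iterator output once each item is mapped to its kind and text.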
04 — Tables

Table Extraction

Docling uses TableFormer, a Transformer-based model trained specifically for table structure recognition. It identifies rows, columns, merged cells, and headers even in complex layouts — including spanning cells and nested headers.
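TableFormer's behavior is configurable through the PDF pipeline options. A configuration sketch, assuming a recent Docling version where TableFormerMode lives in docling.datamodel.pipeline_options:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# ACCURATE trades speed for better structure recognition on complex tables
pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```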

Exporting Tables to DataFrames

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("financial_report.pdf").document

for i, table in enumerate(doc.tables, start=1):
    df = table.export_to_dataframe()
    # Save to CSV
    df.to_csv(f"table_{i}.csv", index=False)

Markdown Export

Tables are preserved in Markdown format during document export:

# Use export_to_markdown() to preserve tables
markdown_output = doc.export_to_markdown()

# Markdown includes table formatting:
# | Header 1 | Header 2 |
# |----------|----------|
# | Value 1  | Value 2  |
💡 TableFormer accuracy: ~95% accuracy on well-formed tables. Performance degrades on very complex layouts. Always validate extracted tables on your domain.
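The validation suggested above can start as simple structural checks on the extracted cells. A sketch; the validate_table helper is illustrative and operates on a plain list-of-lists (e.g. the header row plus df.values.tolist()):

```python
def validate_table(rows):
    """Return a list of structural problems in an extracted table.

    `rows` is a list of lists: header row first, then data rows.
    """
    problems = []
    if not rows:
        return ["table is empty"]
    header = rows[0]
    if any(str(cell).strip() == "" for cell in header):
        problems.append("blank header cell")
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append(f"ragged rows: widths {sorted(widths)}")
    return problems

good = [["Region", "Revenue"], ["EMEA", "1.2M"], ["APAC", "0.9M"]]
bad = [["Region", ""], ["EMEA", "1.2M", "extra"]]
print(validate_table(good))  # []
print(validate_table(bad))   # ['blank header cell', 'ragged rows: widths [2, 3]']
```

Checks like these catch the most common TableFormer failure modes (dropped header cells, mis-merged columns) before bad rows reach the vector store.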
05 — Conversion

Export Formats

DoclingDocument can be exported to multiple formats, each with different tradeoffs for RAG pipelines. Choose based on your downstream needs: Markdown for readability, JSON for structured queries, or HTML for faithful layout preservation.

Format     Use Case                                                         Fidelity    Token Efficiency
Markdown   RAG + LLM consumption; tables, code blocks, headings preserved   High        High
JSON       Structured queries, programmatic access to blocks and tables     Perfect     Lower (verbose)
HTML       Web display, faithful layout with styles                         Very high   Lower (markup heavy)
DocTags    Semantic markup for document structure (experimental)            Very high   Medium
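The token-efficiency gap is easy to see by serializing the same table both ways. A rough sketch using a chars/4 estimate (a common approximation, not a real tokenizer):

```python
import json

table = {"header": ["Region", "Revenue"], "rows": [["EMEA", "1.2M"], ["APAC", "0.9M"]]}

# Markdown rendering of the table
md = "| Region | Revenue |\n|---|---|\n| EMEA | 1.2M |\n| APAC | 0.9M |"

# JSON rendering: keys repeated, quoting and bracket overhead
js = json.dumps(table)

def approx_tokens(text):
    return len(text) // 4  # crude heuristic: ~4 chars per token

print(approx_tokens(md), approx_tokens(js))
```

Even on this tiny table the JSON form costs more tokens; on real documents the gap widens, which is why Markdown is the usual choice for LLM context windows.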

Export Examples

from pathlib import Path
import json

# Markdown export (best for RAG)
Path("output.md").write_text(doc.export_to_markdown())

# JSON export (structured, lossless)
with open("output.json", "w") as f:
    json.dump(doc.export_to_dict(), f)

# HTML export (web display)
doc.save_as_html(Path("output.html"))

# DocTags export (semantic markup)
Path("output.doctags").write_text(doc.export_to_doctags())

Markdown for RAG Pipelines

Markdown export is ideal for RAG: it preserves structure (headings, lists, tables) while being compact and LLM-friendly. Downstream chunking becomes easier.

# Typical RAG pipeline with Docling
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter

# 1. Parse to DoclingDocument
converter = DocumentConverter()
doc = converter.convert("report.pdf").document

# 2. Export to Markdown
markdown = doc.export_to_markdown()

# 3. Chunk by heading hierarchy (preserves semantic boundaries)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown)

# 4. Embed and store (embedder / vector_db are your own components)
for chunk in chunks:
    embedding = embedder.embed(chunk.page_content)
    vector_db.add(chunk.page_content, embedding, chunk.metadata)
06 — Framework Integration

RAG Integration

Docling integrates with LangChain and LlamaIndex. Both frameworks provide loaders that handle conversion and chunking automatically.

LangChain with Docling

The DoclingLoader from the langchain-docling integration package wraps Docling for seamless document loading:

from langchain_docling import DoclingLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF using Docling
loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

# Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Store in vector DB
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(chunks, embedder)

# Query
results = vector_store.similarity_search("revenue forecast")

LlamaIndex with Docling

LlamaIndex's DoclingReader converts Docling output to LlamaIndex Document objects:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.docling import DoclingReader

# Configure Docling reader
reader = DoclingReader()

# Load single file
documents = reader.load_data(file_path="document.pdf")

# Or load a directory
loader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": DoclingReader()},
)
documents = loader.load_data()

# Build index
index = VectorStoreIndex.from_documents(documents)

# Query
retriever = index.as_retriever()
results = retriever.retrieve("revenue forecast")

Custom Pipeline with Full Control

For maximum control, build your own pipeline:

from pathlib import Path

from docling.document_converter import DocumentConverter

# 1. Convert documents
converter = DocumentConverter()
docs = []
for pdf_path in Path("./pdfs").glob("*.pdf"):
    doc = converter.convert(str(pdf_path)).document
    docs.append((pdf_path.stem, doc))

# 2. Extract items with metadata (page number comes from provenance)
chunks = []
for doc_name, doc in docs:
    for idx, (item, level) in enumerate(doc.iterate_items()):
        text = getattr(item, "text", "")
        if not text:
            continue
        page_num = item.prov[0].page_no if item.prov else None
        chunks.append({
            "id": f"{doc_name}_i{idx}",
            "text": text,
            "label": str(item.label),
            "page": page_num,
            "source": doc_name,
        })

print(f"Extracted {len(chunks)} chunks from {len(docs)} documents")
💡 Best practice: Store block type in metadata. Use it during reranking to prioritize tables over paragraphs, or headings over body text, depending on your query.
Tools & Ecosystem

Related Tools

- Docling (Parsing): IBM's structured document converter (PDF, DOCX, PPTX)
- Unstructured.io (Parsing): Multi-format parser with cloud + local options
- TableFormer (Tables): Deep learning for table extraction (in Docling)
- PyMuPDF (PDF): Fast low-level PDF text extraction
- pdfplumber (PDF): Precise table and text extraction
- LlamaParse (Cloud): LlamaIndex's cloud document parser
- LangChain (Framework): LLM framework with DoclingLoader
- LlamaIndex (Framework): Data indexing with DoclingReader
- HuggingFace Hub (Hugging Face): Models and datasets (TableFormer pretrained)