01 — Overview
Core Concepts
Docling is IBM Research's open-source document conversion library. It converts complex documents (PDFs, Word, PowerPoint) into a unified DoclingDocument object that preserves structure — headings, paragraphs, tables, figures, lists — and can export to Markdown, JSON, HTML, or DocTags for downstream RAG pipelines.
The DocumentConverter
The central entry point is DocumentConverter. Pass a file path and get back a conversion result:
```python
from docling.document_converter import DocumentConverter

# Basic conversion
converter = DocumentConverter()
result = converter.convert("report.pdf")

# Access the structured document
doc = result.document

# Export to Markdown (best for RAG)
markdown = doc.export_to_markdown()
print(markdown[:500])
```
DoclingDocument Structure
Every conversion produces a DoclingDocument — a rich object with typed blocks, metadata, and page info:
```python
doc = result.document

# Basic document info
print(doc.name)         # document name
print(doc.num_pages())  # page count

# Iterate all items in reading order
for element, level in doc.iterate_items():
    if hasattr(element, "text") and element.text:
        print(type(element).__name__, element.text[:80])

# Tables only
for table in doc.tables:
    df = table.export_to_dataframe()
    print(df.shape, "columns:", list(df.columns))
```
💡
Why Docling over PyPDF? PyPDF extracts raw text with no structure. Docling understands layout — it knows which text is a heading vs body, preserves table cells, and handles multi-column PDFs, making chunk quality significantly better for RAG.
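The chunk-quality point is easy to see with a toy sketch (plain Python, no PDF libraries; the sample text and helper are illustrative, not Docling output): heading-aware splitting keeps each section intact, while fixed-size splitting of flat text cuts across topic boundaries.

```python
def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into one chunk per top-level section."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("# ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

structured = "# Revenue\nQ3 revenue grew 12%.\n# Risks\nSupply chain delays."
flat = structured.replace("# ", "")  # what a raw text extractor might give you

heading_chunks = split_by_headings(structured)
naive_chunks = [flat[i:i + 30] for i in range(0, len(flat), 30)]  # fixed-size split

print(heading_chunks[0])  # the whole "Revenue" section, nothing else
print(naive_chunks)       # pieces that cut mid-sentence across topics
```

Structure-aware output gives the splitter real boundaries to work with; flat text forces arbitrary ones.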
03 — Layout
Layout Analysis
Docling uses deep learning models to reconstruct the logical reading order of a document, classify text blocks by type, and segment the page into coherent regions. This is what makes it far more useful than text extractors for RAG.
| Block Type | Description | RAG Relevance |
|------------|-------------|---------------|
| SectionHeaderItem | Chapter / section headings with hierarchy level | High — chunk boundaries |
| TextItem | Body paragraphs, list items | Core content |
| TableItem | Structured tabular data (cells + structure) | High — structured facts |
| PictureItem | Images and figures with captions | Metadata + alt text |
| ListItem | Bulleted / numbered list entries | Preserves enumeration |
| FormulaItem | Mathematical expressions (LaTeX) | Technical docs |
Iterating Layout Elements
```python
from docling_core.types.doc import SectionHeaderItem, TableItem, TextItem

doc = converter.convert("report.pdf").document

# Iterate with level (heading depth)
for item, level in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        print(f"{'  ' * level}H{level}: {item.text}")
    elif isinstance(item, TableItem):
        print(f"  [TABLE: {item.data.num_rows}×{item.data.num_cols}]")
    elif isinstance(item, TextItem):
        print(f"  {item.text[:60]}...")
```
⚠️
Multi-column PDFs: Docling handles multi-column layouts such as 2-column academic papers. Reading order is reconstructed using layout analysis — text won't be interleaved across columns as it would be with naive text extraction.
04 — Tables
Table Extraction
Docling uses TableFormer, a Transformer-based model trained specifically for table structure recognition. It identifies rows, columns, merged cells, and headers even in complex layouts — including spanning cells and nested headers.
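To make "structure recognition" concrete, here is a toy sketch (plain Python, no Docling types; the cell layout is invented for illustration) of the key step such a model enables: expanding a cell that spans several columns into a rectangular grid, which is what lets a table export cleanly to a DataFrame. Repeating the spanned text into every covered position is one common flattening convention.

```python
# Each cell: (row, col, row_span, col_span, text) — a simplified stand-in
# for the cell geometry a structure-recognition model produces.
cells = [
    (0, 0, 1, 2, "H1 (spans 2 cols)"),  # merged header cell
    (1, 0, 1, 1, "a"),
    (1, 1, 1, 1, "b"),
]

n_rows, n_cols = 2, 2
grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]

# Expand spans: repeat a cell's text into every grid position it covers.
for row, col, row_span, col_span, text in cells:
    for r in range(row, row + row_span):
        for c in range(col, col + col_span):
            grid[r][c] = text

print(grid)  # the merged header now occupies both columns of row 0
```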
Exporting Tables to DataFrames
```python
import pandas as pd
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("financial_report.pdf").document

for idx, table in enumerate(doc.tables, start=1):
    df = table.export_to_dataframe()
    # Save to CSV
    df.to_csv(f"table_{idx}.csv", index=False)
```
Markdown Export
Tables are preserved in Markdown format during document export:
```python
# Use export_to_markdown() to preserve tables
markdown_output = doc.export_to_markdown()

# Markdown includes table formatting:
# | Header 1 | Header 2 |
# |----------|----------|
# | Value 1  | Value 2  |
```
💡
TableFormer accuracy: roughly 95% on well-formed tables, with performance degrading on very complex layouts. Always validate extracted tables on your domain.
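One lightweight way to act on "always validate" is a sanity check over the extracted grid before trusting it downstream. A minimal sketch (plain Python on a list-of-lists grid; the threshold is an arbitrary assumption to tune for your domain):

```python
def table_looks_sane(grid: list[list[str]], max_empty_ratio: float = 0.5) -> bool:
    """Reject extractions that are empty, ragged, or mostly blank cells."""
    if not grid or not grid[0]:
        return False
    width = len(grid[0])
    if any(len(row) != width for row in grid):
        return False  # ragged rows usually mean structure recognition failed
    cells = [cell for row in grid for cell in row]
    empty = sum(1 for cell in cells if not str(cell).strip())
    return empty / len(cells) <= max_empty_ratio

print(table_looks_sane([["Q", "Revenue"], ["Q3", "12%"]]))  # True
print(table_looks_sane([["Q", "Revenue"], ["Q3"]]))         # False: ragged
```

The same checks apply to a DataFrame via `df.values.tolist()`.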
05 — Conversion
Export Formats
DoclingDocument can be exported to multiple formats, each with different tradeoffs for RAG pipelines. Choose based on your downstream needs: Markdown for readability, JSON for structured queries, or HTML for faithful layout preservation.
| Format | Use Case | Fidelity | Token Efficiency |
|--------|----------|----------|------------------|
| Markdown | RAG + LLM consumption. Tables, code blocks, headings preserved. | High | High |
| JSON | Structured queries, programmatic access to blocks and tables. | Perfect | Lower (verbose) |
| HTML | Web display, faithful layout with styles. | Very High | Lower (markup heavy) |
| DocTags | XML-based semantic markup (experimental). | Very High | Medium |
Export Examples
```python
from pathlib import Path

# Markdown export (best for RAG)
markdown_text = doc.export_to_markdown()
with open("output.md", "w") as f:
    f.write(markdown_text)

# JSON export (structured): lossless serialization of the document
doc.save_as_json(Path("output.json"))

# HTML export (web display)
html_text = doc.export_to_html()
with open("output.html", "w") as f:
    f.write(html_text)

# DocTags export (semantic XML)
doctags = doc.export_to_doctags()
with open("output.doctags", "w") as f:
    f.write(doctags)
```
Markdown for RAG Pipelines
Markdown export is ideal for RAG: it preserves structure (headings, lists, tables) while being compact and LLM-friendly. Downstream chunking becomes easier.
```python
# Typical RAG pipeline with Docling
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter

# 1. Parse to DoclingDocument
converter = DocumentConverter()
doc = converter.convert("report.pdf").document

# 2. Export to Markdown
markdown = doc.export_to_markdown()

# 3. Chunk by heading hierarchy (preserves semantic boundaries)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown)

# 4. Embed and store (embedder and vector_db are your own components)
for chunk in chunks:
    embedding = embedder.embed(chunk.page_content)
    vector_db.add(chunk.page_content, embedding, chunk.metadata)
```
06 — Framework Integration
RAG Integration
Docling integrates with LangChain and LlamaIndex. Both frameworks provide loaders that handle conversion and chunking automatically.
LangChain with Docling
LangChain's DoclingLoader (shipped in the langchain-docling integration package) wraps Docling for seamless document loading:

```python
from langchain_docling import DoclingLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF using Docling
loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

# Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)

# Store in vector DB (embedder is your embedding model)
vector_store = FAISS.from_documents(chunks, embedder)

# Query
results = vector_store.similarity_search("revenue forecast")
```
LlamaIndex with Docling
LlamaIndex's DoclingReader converts Docling output to LlamaIndex Document objects:
```python
from pathlib import Path

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.docling import DoclingReader

# Configure Docling reader
reader = DoclingReader()

# Load a single file
documents = reader.load_data(Path("document.pdf"))

# Or load a directory, routing PDFs through Docling
loader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": reader}
)
documents = loader.load_data()

# Build index
index = VectorStoreIndex.from_documents(documents)

# Query
retriever = index.as_retriever()
results = retriever.retrieve("revenue forecast")
```
Custom Pipeline with Full Control
For maximum control, build your own pipeline:
```python
from pathlib import Path

from docling.document_converter import DocumentConverter

# 1. Convert documents
converter = DocumentConverter()
docs = []
for pdf_path in Path("./pdfs").glob("*.pdf"):
    doc = converter.convert(str(pdf_path)).document
    docs.append((pdf_path.stem, doc))

# 2. Extract text-bearing items with metadata
chunks = []
for doc_name, doc in docs:
    for idx, (item, level) in enumerate(doc.iterate_items()):
        if hasattr(item, "text") and item.text:
            page_no = item.prov[0].page_no if item.prov else None
            chunks.append({
                "id": f"{doc_name}_i{idx}",
                "text": item.text,
                "block_type": type(item).__name__,
                "page": page_no,
                "source": doc_name
            })

print(f"Extracted {len(chunks)} chunks from {len(docs)} documents")
```
💡
Best practice: Store block type in metadata. Use it during reranking to prioritize tables over paragraphs, or headings over body text, depending on your query.
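As a sketch of that idea: boost chunks whose stored block type matches the query's shape, e.g. prefer tables when the query asks for numbers. Pure Python; the scoring weights and the numeric-query heuristic below are assumptions to adapt to your retriever.

```python
def rerank(chunks: list[dict], query: str, table_boost: float = 0.2) -> list[dict]:
    """Re-order retrieved chunks, nudging tables up for number-seeking queries."""
    # Crude heuristic for "this query wants figures" — replace with your own.
    numeric_query = any(w in query.lower() for w in ("how many", "revenue", "%", "total"))

    def score(chunk: dict) -> float:
        s = chunk["similarity"]  # base score from the vector store
        if numeric_query and chunk["block_type"] == "TableItem":
            s += table_boost
        return s

    return sorted(chunks, key=score, reverse=True)

retrieved = [
    {"id": "p1", "block_type": "TextItem", "similarity": 0.80},
    {"id": "t1", "block_type": "TableItem", "similarity": 0.75},
]
ranked = rerank(retrieved, "total revenue by quarter")
print([c["id"] for c in ranked])  # the table overtakes the paragraph
```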