Data Ingestion

Docling — Structured Document Parsing

IBM's open-source library for converting PDFs, DOCX, and PPTX into structured DoclingDocument objects, with deep table extraction and Markdown export for RAG.

6+ input formats · ~95% table accuracy · 4 export formats
01 — Overview

Core Concepts

Docling is IBM Research's open-source document conversion library. It converts complex documents (PDFs, Word, PowerPoint) into a unified DoclingDocument object that preserves structure — headings, paragraphs, tables, figures, lists — and can export to Markdown, JSON, HTML, or DocTags for downstream RAG pipelines.

The DocumentConverter

The central entry point is DocumentConverter. Pass a file path to convert() and get back a ConversionResult:

from docling.document_converter import DocumentConverter

# Basic conversion
converter = DocumentConverter()
result = converter.convert("report.pdf")

# Access the structured document
doc = result.document

# Export to Markdown (best for RAG)
markdown = doc.export_to_markdown()
print(markdown[:500])

DoclingDocument Structure

Every conversion produces a DoclingDocument — a rich object with typed blocks, metadata, and page info:

doc = result.document

# Metadata
print(doc.name)          # document name
print(doc.num_pages())   # page count

# Iterate all items in reading order (not every item has text)
for item, level in doc.iterate_items():
    text = getattr(item, "text", "")
    print(type(item).__name__, text[:80])

# Tables only
for table in doc.tables:
    df = table.export_to_dataframe()
    print(df.shape, "columns:", list(df.columns))
💡 Why Docling over PyPDF? PyPDF extracts raw text with no structure. Docling understands layout — it knows which text is a heading vs body, preserves table cells, and handles multi-column PDFs, making chunk quality significantly better for RAG.
02 — Input Formats

Supported Document Formats

Docling handles all common enterprise document formats through a unified pipeline. The same API call works for all types — format is detected automatically from the file extension.

Format       Extension    Table Support       Notes
PDF          .pdf         ✅ TableFormer AI    Best supported; layout + reading order
Word         .docx        ✅ Native            Styles map to heading levels
PowerPoint   .pptx        ✅ Native            Slides become sections
Excel        .xlsx        ✅ Native            Sheets exported as tables
HTML         .html        ✅ Native            Semantic tags preserved
Images       .png/.jpg    ⚠️ OCR only          Requires tesseract or EasyOCR
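The table above can be mirrored in application code as a simple routing check before invoking the converter. A minimal sketch; the SUPPORTED/OCR_ONLY sets and route_document helper are illustrative, not part of Docling's API:

```python
from pathlib import Path

# Illustrative sets mirroring the format table above
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".png", ".jpg"}
OCR_ONLY = {".png", ".jpg"}  # images require tesseract or EasyOCR

def route_document(path: str) -> str:
    """Classify a file before handing it to DocumentConverter."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        return "unsupported"
    return "ocr" if ext in OCR_ONLY else "native"

print(route_document("report.pdf"))   # native
print(route_document("scan.png"))     # ocr
print(route_document("notes.txt"))    # unsupported
```

A check like this lets a pipeline skip or flag files that would silently fall back to OCR-only handling.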

Batch Conversion

from pathlib import Path

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert entire directory
results = converter.convert_all(
    [str(p) for p in Path("./documents").glob("**/*.pdf")],
    raises_on_error=False,
)

for result in results:
    if result.status == ConversionStatus.SUCCESS:
        md = result.document.export_to_markdown()
        out_path = Path("output") / (result.input.file.stem + ".md")
        out_path.write_text(md)
    else:
        print(f"Failed: {result.input.file.name} ({result.status})")
03 — Layout

Layout Analysis

Docling uses deep learning models to reconstruct the logical reading order of a document, classify text blocks by type, and segment the page into coherent regions. This is what makes it far more useful than text extractors for RAG.

Block Type          Description                                       RAG Relevance
SectionHeaderItem   Chapter / section headings with hierarchy level   High: chunk boundaries
TextItem            Body paragraphs, list items                       Core content
TableItem           Structured tabular data (cells + structure)       High: structured facts
PictureItem         Images and figures with captions                  Metadata + alt text
ListItem            Bulleted / numbered list entries                  Preserves enumeration
FormulaItem         Mathematical expressions (LaTeX)                  Technical docs

Iterating Layout Elements

from docling_core.types.doc import SectionHeaderItem, TableItem, TextItem
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("report.pdf").document

# Iterate with level (heading depth); check SectionHeaderItem before
# TextItem, since it is the more specific type
for item, level in doc.iterate_items():
    if isinstance(item, SectionHeaderItem):
        print(f"{'  ' * level}H{level}: {item.text}")
    elif isinstance(item, TableItem):
        print(f"  [TABLE: {item.data.num_rows}×{item.data.num_cols}]")
    elif isinstance(item, TextItem):
        print(f"  {item.text[:60]}...")
⚠️ Multi-column PDFs: Docling handles 2-column academic layouts. Reading order is reconstructed using layout analysis — text won't be interleaved across columns as it would with naive text extraction.
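Because iterate_items() yields items in reading order with their heading depth, chunk boundaries can be derived directly from layout. A minimal sketch, with items simplified to ("heading" | "text", str) tuples as a stand-in for the typed SectionHeaderItem/TextItem objects:

```python
def chunk_by_headings(items):
    """Group body text under the most recent heading.

    `items` is a list of ("heading" | "text", str) tuples, a simplified
    stand-in for what doc.iterate_items() yields.
    """
    chunks, current = [], {"heading": None, "body": []}
    for kind, text in items:
        if kind == "heading":
            if current["body"]:
                chunks.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["body"]:
        chunks.append(current)
    return chunks

items = [
    ("heading", "1. Revenue"),
    ("text", "Revenue grew 12% year over year."),
    ("heading", "2. Costs"),
    ("text", "Costs were flat."),
]
print(chunk_by_headings(items))
```

The same grouping logic applies unchanged to the real iterator output once each item is mapped to its kind and text.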
04 — Tables

Table Extraction

Docling uses TableFormer, a Transformer-based model trained specifically for table structure recognition. It identifies rows, columns, merged cells, and headers even in complex layouts — including spanning cells and nested headers.
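TableFormer's behavior is configurable through the PDF pipeline options. A configuration sketch, assuming a recent Docling version where TableFormerMode lives in docling.datamodel.pipeline_options:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# ACCURATE trades speed for better structure recognition on complex tables
pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```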

Exporting Tables to DataFrames

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
doc = converter.convert("financial_report.pdf").document

for i, table in enumerate(doc.tables, start=1):
    df = table.export_to_dataframe()
    # Save to CSV
    df.to_csv(f"table_{i}.csv", index=False)

Markdown Export

Tables are preserved in Markdown format during document export:

# Use export_to_markdown() to preserve tables
markdown_output = doc.export_to_markdown()

# Markdown includes table formatting:
# | Header 1 | Header 2 |
# |----------|----------|
# | Value 1  | Value 2  |
💡 TableFormer accuracy: ~95% accuracy on well-formed tables. Performance degrades on very complex layouts. Always validate extracted tables on your domain.
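The validation suggested above can start as simple structural checks on the extracted cells. A sketch; the validate_table helper is illustrative and operates on a plain list-of-lists (e.g. the header row plus df.values.tolist()):

```python
def validate_table(rows):
    """Return a list of structural problems in an extracted table.

    `rows` is a list of lists: header row first, then data rows.
    """
    problems = []
    if not rows:
        return ["table is empty"]
    header = rows[0]
    if any(str(cell).strip() == "" for cell in header):
        problems.append("blank header cell")
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append(f"ragged rows: widths {sorted(widths)}")
    return problems

good = [["Region", "Revenue"], ["EMEA", "1.2M"], ["APAC", "0.9M"]]
bad = [["Region", ""], ["EMEA", "1.2M", "extra"]]
print(validate_table(good))  # []
print(validate_table(bad))   # ['blank header cell', 'ragged rows: widths [2, 3]']
```

Checks like these catch the most common TableFormer failure modes (dropped header cells, mis-merged columns) before bad rows reach the vector store.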
05 — Conversion

Export Formats

DoclingDocument can be exported to multiple formats, each with different tradeoffs for RAG pipelines. Choose based on your downstream needs: Markdown for readability, JSON for structured queries, or HTML for faithful layout preservation.

Format     Use Case                                                         Fidelity    Token Efficiency
Markdown   RAG + LLM consumption; tables, code blocks, headings preserved   High        High
JSON       Structured queries, programmatic access to blocks and tables     Perfect     Lower (verbose)
HTML       Web display, faithful layout with styles                         Very high   Lower (markup heavy)
DocTags    Semantic markup for document structure (experimental)            Very high   Medium
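The token-efficiency gap is easy to see by serializing the same table both ways. A rough sketch using a chars/4 estimate (a common approximation, not a real tokenizer):

```python
import json

table = {"header": ["Region", "Revenue"], "rows": [["EMEA", "1.2M"], ["APAC", "0.9M"]]}

# Markdown rendering of the table
md = "| Region | Revenue |\n|---|---|\n| EMEA | 1.2M |\n| APAC | 0.9M |"

# JSON rendering: keys repeated, quoting and bracket overhead
js = json.dumps(table)

def approx_tokens(text):
    return len(text) // 4  # crude heuristic: ~4 chars per token

print(approx_tokens(md), approx_tokens(js))
```

Even on this tiny table the JSON form costs more tokens; on real documents the gap widens, which is why Markdown is the usual choice for LLM context windows.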

Export Examples

from pathlib import Path
import json

# Markdown export (best for RAG)
Path("output.md").write_text(doc.export_to_markdown())

# JSON export (structured, lossless)
with open("output.json", "w") as f:
    json.dump(doc.export_to_dict(), f)

# HTML export (web display)
doc.save_as_html(Path("output.html"))

# DocTags export (semantic markup)
Path("output.doctags").write_text(doc.export_to_doctags())

Markdown for RAG Pipelines

Markdown export is ideal for RAG: it preserves structure (headings, lists, tables) while being compact and LLM-friendly. Downstream chunking becomes easier.

# Typical RAG pipeline with Docling
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter

# 1. Parse to DoclingDocument
converter = DocumentConverter()
doc = converter.convert("report.pdf").document

# 2. Export to Markdown
markdown = doc.export_to_markdown()

# 3. Chunk by heading hierarchy (preserves semantic boundaries)
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown)

# 4. Embed and store (embedder / vector_db are your own components)
for chunk in chunks:
    embedding = embedder.embed(chunk.page_content)
    vector_db.add(chunk.page_content, embedding, chunk.metadata)
06 — Framework Integration

RAG Integration

Docling integrates with LangChain and LlamaIndex. Both frameworks provide loaders that handle conversion and chunking automatically.

LangChain with Docling

The DoclingLoader from the langchain-docling integration package wraps Docling for seamless document loading:

from langchain_docling import DoclingLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF using Docling
loader = DoclingLoader(file_path="document.pdf")
documents = loader.load()

# Chunk
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Store in vector DB
from langchain_community.vectorstores import FAISS
vector_store = FAISS.from_documents(chunks, embedder)

# Query
results = vector_store.similarity_search("revenue forecast")

LlamaIndex with Docling

LlamaIndex's DoclingReader converts Docling output to LlamaIndex Document objects:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.readers.docling import DoclingReader

# Configure Docling reader
reader = DoclingReader()

# Load single file
documents = reader.load_data(file_path="document.pdf")

# Or load a directory
loader = SimpleDirectoryReader(
    input_dir="./documents",
    file_extractor={".pdf": DoclingReader()},
)
documents = loader.load_data()

# Build index
index = VectorStoreIndex.from_documents(documents)

# Query
retriever = index.as_retriever()
results = retriever.retrieve("revenue forecast")

Custom Pipeline with Full Control

For maximum control, build your own pipeline:

from pathlib import Path

from docling.document_converter import DocumentConverter

# 1. Convert documents
converter = DocumentConverter()
docs = []
for pdf_path in Path("./pdfs").glob("*.pdf"):
    doc = converter.convert(str(pdf_path)).document
    docs.append((pdf_path.stem, doc))

# 2. Extract items with metadata (page number comes from provenance)
chunks = []
for doc_name, doc in docs:
    for idx, (item, level) in enumerate(doc.iterate_items()):
        text = getattr(item, "text", "")
        if not text:
            continue
        page_num = item.prov[0].page_no if item.prov else None
        chunks.append({
            "id": f"{doc_name}_i{idx}",
            "text": text,
            "label": str(item.label),
            "page": page_num,
            "source": doc_name,
        })

print(f"Extracted {len(chunks)} chunks from {len(docs)} documents")
💡 Best practice: Store block type in metadata. Use it during reranking to prioritize tables over paragraphs, or headings over body text, depending on your query.
Tools & Ecosystem

Related Tools

- Docling (Parsing): IBM's structured document converter (PDF, DOCX, PPTX)
- Unstructured.io (Parsing): Multi-format parser with cloud + local options
- TableFormer (Tables): Deep learning for table extraction (in Docling)
- PyMuPDF (PDF): Fast low-level PDF text extraction
- pdfplumber (PDF): Precise table and text extraction
- LlamaParse (Cloud): LlamaIndex's cloud document parser
- LangChain (Framework): LLM framework with DoclingLoader
- LlamaIndex (Framework): Data indexing with DoclingReader
- HuggingFace Hub (Hugging Face): Models and datasets (TableFormer pretrained)