Full-stack RAG framework with 500+ integrations — loaders for any data source, query engines, chat engines, agents, and production-ready pipelines. The most complete RAG toolkit in the Python ecosystem.
Both LlamaIndex and LangChain are Python frameworks for building LLM applications, but they have different design philosophies:
LlamaIndex is data-first. It's optimised for connecting LLMs to your data — documents, databases, APIs — with a rich set of abstractions specifically for retrieval and question-answering over heterogeneous data sources. Its data connectors, index types, and query engines are more mature than LangChain's equivalents for pure RAG use cases.
LangChain is chain-first. It's optimised for composing LLM calls into pipelines (chains) with a large ecosystem of integrations. Better for workflows that aren't purely retrieval-focused.
In practice: use LlamaIndex for sophisticated RAG (multi-document, structured data, hybrid retrieval, agentic RAG), or when you need many data connectors without writing custom loaders. Use LangChain for non-RAG LLM pipelines or if you're already embedded in the LangChain ecosystem. Both frameworks have converged on similar capabilities, so the choice is often about ecosystem fit.
pip install llama-index llama-index-llms-anthropic llama-index-embeddings-huggingface
from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Configure global defaults
Settings.llm = Anthropic(model="claude-haiku-4-5-20251001")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 50
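chunk_size and chunk_overlap define a sliding window over each document: every new chunk starts chunk_size − chunk_overlap tokens after the previous one, so neighbouring chunks share an overlap. A minimal word-level sketch of that arithmetic (an illustration only — LlamaIndex's node parsers count tokens, not words):

```python
def sliding_chunks(words, chunk_size=512, overlap=50):
    """Word-level sliding window: each chunk starts (chunk_size - overlap)
    positions after the previous one, so neighbours share `overlap` words."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1000)]
chunks = sliding_chunks(words)
# A 1000-word document yields chunks starting at words 0, 462, and 924.
```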
# Core abstractions:
# - Document: a piece of text + metadata
# - Node: a chunk of a Document (after splitting)
# - Index: a searchable structure over Nodes (VectorStoreIndex, KeywordIndex, etc.)
# - QueryEngine: takes a query, searches the index, generates a response
# - ChatEngine: stateful conversation over an index (maintains history)
# - Retriever: returns relevant Nodes for a query (can be used standalone)
# Create documents
docs = [
    Document(text="LlamaIndex is a data framework for LLM applications.", metadata={"source": "intro.txt"}),
    Document(text="It supports 500+ data integrations and multiple index types.", metadata={"source": "features.txt"}),
]
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Load documents from a directory (handles PDF, DOCX, TXT, etc.)
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")
# Build index (chunks, embeds, stores in memory by default)
index = VectorStoreIndex.from_documents(documents)
# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=3,  # retrieve top 3 chunks
    response_mode="tree_summarize",  # or "compact", "refine"
)
# Query
response = query_engine.query("What are the main capabilities of LlamaIndex?")
print(response.response)
# See source nodes (which chunks were retrieved)
for node in response.source_nodes:
    print(f"Score: {node.score:.3f} | {node.text[:100]}...")
# Persist index to disk
index.storage_context.persist(persist_dir="./index_store")
# Load from disk
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./index_store")
loaded_index = load_index_from_storage(storage_context)
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SimilarityPostprocessor
# Custom retriever with post-processing
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7) # filter weak matches
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
)
response = query_engine.query("Explain the architecture")
print(f"Retrieved {len(response.source_nodes)} chunks after filtering")
# Chat engine — maintains conversation history
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",  # reformulates questions with chat history
    verbose=True,
)
# Multi-turn conversation
response = chat_engine.chat("What is LlamaIndex?")
print(response.response)
response = chat_engine.chat("How does it compare to alternatives?") # uses history
print(response.response)
chat_engine.reset() # clear conversation history
# Streaming chat
streaming_response = chat_engine.stream_chat("Summarise the key points.")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool
# Wrap query engines as tools
rag_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Useful for answering questions about LlamaIndex documentation.",
)
# Add custom function tools
import requests
def get_github_stars(repo: str) -> str:
    '''Get GitHub star count for a repository (format: owner/repo)'''
    r = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    if r.status_code == 200:
        return f"{r.json()['stargazers_count']:,} stars"
    return "Could not fetch star count"
github_tool = FunctionTool.from_defaults(fn=get_github_stars)
# Create ReAct agent with both tools
agent = ReActAgent.from_tools(
    tools=[rag_tool, github_tool],
    verbose=True,
    max_iterations=10,
)
# Agent decides which tools to use and when
response = agent.chat(
    "How many GitHub stars does LlamaIndex have, and what does the documentation say about its main use cases?"
)
print(response.response)
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
# Multi-document: create per-document indices + a router
# (load_topic_docs stands in for your own per-topic loading logic)
docs_by_topic = {
    "architecture": load_topic_docs("architecture"),
    "api_reference": load_topic_docs("api"),
    "tutorials": load_topic_docs("tutorials"),
}
tools = []
for topic, docs in docs_by_topic.items():
    index = VectorStoreIndex.from_documents(docs)
    tool = QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(),
        name=topic,
        description=f"Answers questions about {topic}",
    )
    tools.append(tool)
# Router directs each query to the most relevant index
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=tools,
)
response = router_engine.query("How do I install LlamaIndex?")
# Routes to "tutorials" automatically
# SQL integration — query databases with natural language
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine
from sqlalchemy import create_engine
engine = create_engine("sqlite:///./sales.db")
sql_database = SQLDatabase(engine, include_tables=["orders", "customers"])
nl_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["orders", "customers"],
)
response = nl_query_engine.query("What were total sales by region last quarter?")
print(response.response) # LlamaIndex generates SQL, runs it, explains results
LlamaIndex's abstractions add indirection that makes debugging hard. When your RAG pipeline returns wrong answers, tracing through LlamaIndex's response modes, postprocessors, and synthesisers to find the problem is non-trivial. Pass verbose=True to query engines, chat engines, and agents to see intermediate outputs, or register a global callback handler to log every LLM call. LlamaIndex also integrates with LangSmith and LlamaTrace for production tracing.
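The quickest debugging lever is the built-in "simple" global handler, which prints every LLM prompt/response pair to stdout (a sketch; assumes llama-index is installed):

```python
from llama_index.core import set_global_handler

# Print every LLM input and output as it happens -- invaluable when a
# query engine returns a wrong answer and you need to see exactly what
# prompt the synthesiser assembled from the retrieved chunks.
set_global_handler("simple")
# Then run any query engine or agent as normal.
```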
The default in-memory index is lost on restart. VectorStoreIndex.from_documents() stores everything in memory. Call storage_context.persist() to save to disk, or connect a real vector database (Qdrant, Weaviate, Pinecone, Chroma) via LlamaIndex's vector store integrations for production use.
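For example, backing the index with a persistent Chroma collection instead of memory (a sketch; assumes the chromadb and llama-index-vector-stores-chroma packages are installed, and ./chroma_db is a path of your choosing):

```python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Chroma writes embeddings to disk, so the index survives restarts
# without a separate persist() step.
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

The same StorageContext pattern applies to the Qdrant, Weaviate, and Pinecone integrations.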
Chunking strategy dramatically affects retrieval quality. LlamaIndex defaults to 512-token chunks with 50-token overlap. This works poorly for documents with tables, code, or structured data. Use SentenceSplitter for prose, CodeSplitter for code files, and consider LlamaIndex's SemanticSplitterNodeParser (embeddings-based chunking) for heterogeneous documents.
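For example, swapping in a SentenceSplitter with explicit sizes and building the index from the resulting nodes (a sketch reusing the documents loaded earlier):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Sentence-aware splitting avoids cutting text mid-sentence the way a
# raw token window can; sizes are measured in tokens.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
```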
| Component | Purpose | Key Class | When to Customise |
|---|---|---|---|
| Document loader | Ingest files, URLs, DBs into Document objects | SimpleDirectoryReader | Custom metadata extraction needed |
| Node parser | Split Documents into indexable Nodes | SentenceSplitter | Domain-specific chunking logic |
| Embedding model | Generate vector representations | OpenAIEmbedding | Cost control or private deployment |
| Vector store | Store and retrieve embeddings | SimpleVectorStore | Scale beyond memory or multi-tenant |
| LLM | Synthesise answers from retrieved nodes | OpenAI / Anthropic | Different task types need different models |
| Response synthesiser | Combine retrieved chunks into final answer | get_response_synthesizer | Map-reduce or refine modes for long docs |
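The last row of the table can be illustrated by wiring an explicit synthesiser into a query engine (a sketch reusing the retriever defined earlier):

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# "refine" feeds retrieved chunks to the LLM one at a time, iteratively
# improving the answer -- useful when the context exceeds one prompt.
synthesizer = get_response_synthesizer(response_mode="refine")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
)
```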
LlamaIndex's Settings object (formerly ServiceContext) is the global configuration container. Set it once at app start rather than passing LLM and embedding objects to every component. Override locally only when a specific pipeline needs a different model — for example, using Haiku for node summarisation during indexing while using Sonnet for final answer synthesis. This keeps configuration DRY and makes model swaps a one-line change.
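For example (the Sonnet model name here is illustrative):

```python
from llama_index.core import Settings
from llama_index.llms.anthropic import Anthropic

# Global default: a cheap, fast model for indexing-time work.
Settings.llm = Anthropic(model="claude-haiku-4-5-20251001")

# Local override: a stronger model for final answer synthesis only.
query_engine = index.as_query_engine(llm=Anthropic(model="claude-sonnet-4-5"))
```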
When migrating from LlamaIndex v0.x to v0.10+, the key breaking change is the removal of the global ServiceContext in favour of Settings. Update every ServiceContext.from_defaults() call to Settings.llm / Settings.embed_model assignments at app startup. Index objects created with the old API must be rebuilt.
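A minimal before/after sketch of the migration, reusing the models from the setup above:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# v0.9 and earlier (removed in v0.10):
#   from llama_index import ServiceContext
#   service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
#   index = VectorStoreIndex.from_documents(docs, service_context=service_context)

# v0.10+: assign global defaults once at startup instead.
Settings.llm = Anthropic(model="claude-haiku-4-5-20251001")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
index = VectorStoreIndex.from_documents(docs)
```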