Full-stack RAG framework with 500+ integrations — loaders for any data source, query engines, chat engines, agents, and production-ready pipelines. The most complete RAG toolkit in the Python ecosystem.
Both LlamaIndex and LangChain are Python frameworks for building LLM applications, but they have different design philosophies:
LlamaIndex is data-first. It's optimised for connecting LLMs to your data — documents, databases, APIs — with a rich set of abstractions specifically for retrieval and question-answering over heterogeneous data sources. Its data connectors, index types, and query engines are more mature than LangChain's equivalents for pure RAG use cases.
LangChain is chain-first. It's optimised for composing LLM calls into pipelines (chains) with a large ecosystem of integrations. Better for workflows that aren't purely retrieval-focused.
In practice: use LlamaIndex for sophisticated RAG (multi-document, structured data, hybrid retrieval, agentic RAG), or when you need many data connectors without writing custom loaders. Use LangChain for non-RAG LLM pipelines or if you're already embedded in the LangChain ecosystem. Both frameworks have converged on similar capabilities, so the choice is often about ecosystem fit.
pip install llama-index llama-index-llms-anthropic llama-index-embeddings-huggingface
from llama_index.core import Settings, VectorStoreIndex, Document
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Configure global defaults
Settings.llm = Anthropic(model="claude-haiku-4-5-20251001")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.chunk_size = 512
Settings.chunk_overlap = 50
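chunk_size and chunk_overlap define a sliding window over each document: every new chunk starts chunk_size − chunk_overlap tokens after the previous one, so neighbouring chunks share an overlap. A minimal word-level sketch of that arithmetic (an illustration only — LlamaIndex's node parsers count tokens, not words):

```python
def sliding_chunks(words, chunk_size=512, overlap=50):
    """Word-level sliding window: each chunk starts (chunk_size - overlap)
    positions after the previous one, so neighbours share `overlap` words."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1000)]
chunks = sliding_chunks(words)
# A 1000-word document yields chunks starting at words 0, 462, and 924.
```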
# Core abstractions:
# - Document: a piece of text + metadata
# - Node: a chunk of a Document (after splitting)
# - Index: a searchable structure over Nodes (VectorStoreIndex, KeywordIndex, etc.)
# - QueryEngine: takes a query, searches the index, generates a response
# - ChatEngine: stateful conversation over an index (maintains history)
# - Retriever: returns relevant Nodes for a query (can be used standalone)
# Create documents
docs = [
    Document(text="LlamaIndex is a data framework for LLM applications.", metadata={"source": "intro.txt"}),
    Document(text="It supports 500+ data integrations and multiple index types.", metadata={"source": "features.txt"}),
]
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Load documents from a directory (handles PDF, DOCX, TXT, etc.)
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")
# Build index (chunks, embeds, stores in memory by default)
index = VectorStoreIndex.from_documents(documents)
# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=3,  # retrieve top 3 chunks
    response_mode="tree_summarize",  # or "compact", "refine"
)
# Query
response = query_engine.query("What are the main capabilities of LlamaIndex?")
print(response.response)
# See source nodes (which chunks were retrieved)
for node in response.source_nodes:
    print(f"Score: {node.score:.3f} | {node.text[:100]}...")
# Persist index to disk
index.storage_context.persist(persist_dir="./index_store")
# Load from disk
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./index_store")
loaded_index = load_index_from_storage(storage_context)
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SimilarityPostprocessor
# Custom retriever with post-processing
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7) # filter weak matches
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
)
response = query_engine.query("Explain the architecture")
print(f"Retrieved {len(response.source_nodes)} chunks after filtering")
# Chat engine — maintains conversation history
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",  # reformulates questions with chat history
    verbose=True,
)
# Multi-turn conversation
response = chat_engine.chat("What is LlamaIndex?")
print(response.response)
response = chat_engine.chat("How does it compare to alternatives?") # uses history
print(response.response)
chat_engine.reset() # clear conversation history
# Streaming chat
streaming_response = chat_engine.stream_chat("Summarise the key points.")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool
# Wrap query engines as tools
rag_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="knowledge_base",
    description="Useful for answering questions about LlamaIndex documentation.",
)
# Add custom function tools
import requests
def get_github_stars(repo: str) -> str:
    '''Get GitHub star count for a repository (format: owner/repo)'''
    r = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    if r.status_code == 200:
        return f"{r.json()['stargazers_count']:,} stars"
    return "Could not fetch star count"
github_tool = FunctionTool.from_defaults(fn=get_github_stars)
# Create ReAct agent with both tools
agent = ReActAgent.from_tools(
    tools=[rag_tool, github_tool],
    verbose=True,
    max_iterations=10,
)
# Agent decides which tools to use and when
response = agent.chat(
    "How many GitHub stars does LlamaIndex have, and what does the documentation say about its main use cases?"
)
print(response.response)
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
# Multi-document: create per-document indices + a router
# (load_topic_docs stands in for your own per-topic loading logic)
docs_by_topic = {
    "architecture": load_topic_docs("architecture"),
    "api_reference": load_topic_docs("api"),
    "tutorials": load_topic_docs("tutorials"),
}
tools = []
for topic, docs in docs_by_topic.items():
    index = VectorStoreIndex.from_documents(docs)
    tool = QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(),
        name=topic,
        description=f"Answers questions about {topic}",
    )
    tools.append(tool)
# Router directs each query to the most relevant index
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=tools,
)
response = router_engine.query("How do I install LlamaIndex?")
# Routes to "tutorials" automatically
# SQL integration — query databases with natural language
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine
from sqlalchemy import create_engine
engine = create_engine("sqlite:///./sales.db")
sql_database = SQLDatabase(engine, include_tables=["orders", "customers"])
nl_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["orders", "customers"],
)
response = nl_query_engine.query("What were total sales by region last quarter?")
print(response.response) # LlamaIndex generates SQL, runs it, explains results
LlamaIndex's abstractions add indirection that makes debugging hard. When your RAG pipeline returns wrong answers, tracing through LlamaIndex's response modes, postprocessors, and synthesisers to find the problem is non-trivial. Pass verbose=True to query engines, chat engines, and agents to see intermediate outputs, or register a global callback handler to log every LLM call. LlamaIndex also integrates with LangSmith and LlamaTrace for production tracing.
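The quickest debugging lever is the built-in "simple" global handler, which prints every LLM prompt/response pair to stdout (a sketch; assumes llama-index is installed):

```python
from llama_index.core import set_global_handler

# Print every LLM input and output as it happens -- invaluable when a
# query engine returns a wrong answer and you need to see exactly what
# prompt the synthesiser assembled from the retrieved chunks.
set_global_handler("simple")
# Then run any query engine or agent as normal.
```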
The default in-memory index is lost on restart. VectorStoreIndex.from_documents() stores everything in memory. Call storage_context.persist() to save to disk, or connect a real vector database (Qdrant, Weaviate, Pinecone, Chroma) via LlamaIndex's vector store integrations for production use.
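For example, backing the index with a persistent Chroma collection instead of memory (a sketch; assumes the chromadb and llama-index-vector-stores-chroma packages are installed, and ./chroma_db is a path of your choosing):

```python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Chroma writes embeddings to disk, so the index survives restarts
# without a separate persist() step.
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

The same StorageContext pattern applies to the Qdrant, Weaviate, and Pinecone integrations.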
Chunking strategy dramatically affects retrieval quality. LlamaIndex defaults to 512-token chunks with 50-token overlap. This works poorly for documents with tables, code, or structured data. Use SentenceSplitter for prose, CodeSplitter for code files, and consider LlamaIndex's SemanticSplitterNodeParser (embeddings-based chunking) for heterogeneous documents.
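For example, swapping in a SentenceSplitter with explicit sizes and building the index from the resulting nodes (a sketch reusing the documents loaded earlier):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Sentence-aware splitting avoids cutting text mid-sentence the way a
# raw token window can; sizes are measured in tokens.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
```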
| Component | Purpose | Key Class | When to Customise |
|---|---|---|---|
| Document loader | Ingest files, URLs, DBs into Document objects | SimpleDirectoryReader | Custom metadata extraction needed |
| Node parser | Split Documents into indexable Nodes | SentenceSplitter | Domain-specific chunking logic |
| Embedding model | Generate vector representations | OpenAIEmbedding | Cost control or private deployment |
| Vector store | Store and retrieve embeddings | SimpleVectorStore | Scale beyond memory or multi-tenant |
| LLM | Synthesise answers from retrieved nodes | OpenAI / Anthropic | Different task types need different models |
| Response synthesiser | Combine retrieved chunks into final answer | get_response_synthesizer | Map-reduce or refine modes for long docs |
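The last row of the table can be illustrated by wiring an explicit synthesiser into a query engine (a sketch reusing the retriever defined earlier):

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine

# "refine" feeds retrieved chunks to the LLM one at a time, iteratively
# improving the answer -- useful when the context exceeds one prompt.
synthesizer = get_response_synthesizer(response_mode="refine")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
)
```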
LlamaIndex's Settings object (formerly ServiceContext) is the global configuration container. Set it once at app start rather than passing LLM and embedding objects to every component. Override locally only when a specific pipeline needs a different model — for example, using Haiku for node summarisation during indexing while using Sonnet for final answer synthesis. This keeps configuration DRY and makes model swaps a one-line change.
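For example (the Sonnet model name here is illustrative):

```python
from llama_index.core import Settings
from llama_index.llms.anthropic import Anthropic

# Global default: a cheap, fast model for indexing-time work.
Settings.llm = Anthropic(model="claude-haiku-4-5-20251001")

# Local override: a stronger model for final answer synthesis only.
query_engine = index.as_query_engine(llm=Anthropic(model="claude-sonnet-4-5"))
```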
When migrating from LlamaIndex v0.x to v0.10+, the key breaking change is the removal of the global ServiceContext in favour of Settings. Update every ServiceContext.from_defaults() call to Settings.llm / Settings.embed_model assignments at app startup. Index objects created with the old API must be rebuilt.
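A minimal before/after sketch of the migration, reusing the models from the setup above:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# v0.9 and earlier (removed in v0.10):
#   from llama_index import ServiceContext
#   service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
#   index = VectorStoreIndex.from_documents(docs, service_context=service_context)

# v0.10+: assign global defaults once at startup instead.
Settings.llm = Anthropic(model="claude-haiku-4-5-20251001")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
index = VectorStoreIndex.from_documents(docs)
```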