An open-source Python framework for building production-ready LLM pipelines — RAG systems, agents, and document processing workflows — with a composable component-based architecture.
Haystack (by deepset) is a framework for building modular LLM applications. Its core abstraction is the Pipeline: a directed acyclic graph of Components that process inputs and pass outputs to downstream components. Components include document converters, text splitters, embedders, retrievers, LLM generators, and output parsers. Pipelines are declaratively defined and can be serialised to YAML.
Every component has typed inputs and outputs. The pipeline connects them: component A's output named 'documents' feeds into component B's input named 'documents'. Pipelines can branch (route to different components based on conditions) and merge (aggregate outputs from multiple branches). This makes complex workflows like hybrid retrieval + reranking + generation easy to express.
```python
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Setup
store = InMemoryDocumentStore()
# Index documents (omitted for brevity)

# Build query pipeline
rag = Pipeline()
rag.add_component("retriever", InMemoryBM25Retriever(document_store=store))
rag.add_component("prompt_builder", PromptBuilder(
    template="Context: {% for doc in documents %}{{ doc.content }}{% endfor %}\nQuestion: {{ question }}\nAnswer:"
))
rag.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))

rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

result = rag.run({"retriever": {"query": "What is RAG?"},
                  "prompt_builder": {"question": "What is RAG?"}})
print(result["llm"]["replies"][0])
```
Haystack integrates with a range of document stores: InMemory (dev/test), OpenSearch, Elasticsearch, Weaviate, Qdrant, Pinecone, pgvector, and Chroma. Switch backends by swapping the DocumentStore and its paired Retriever; the rest of the pipeline is unchanged. All stores support dense (vector) retrieval, but sparse (BM25/keyword) support varies by backend, so check the store's documentation before relying on hybrid retrieval.
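The swap is cheap because every retriever exposes the same output socket, so downstream components never need to know which backend is behind it. A stdlib sketch of the underlying idea (the class names and toy scoring here are illustrative, not Haystack's):

```python
from typing import Protocol

class Retriever(Protocol):
    """Anything exposing run(query) -> {"documents": [...]} works downstream."""
    def run(self, query: str) -> dict: ...

class KeywordRetriever:
    """Toy substring match standing in for a BM25 store."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def run(self, query: str) -> dict:
        return {"documents": [d for d in self.docs if query.lower() in d.lower()]}

class DenseRetriever:
    """Toy character-overlap score standing in for cosine similarity."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def run(self, query: str) -> dict:
        score = lambda d: len(set(query.lower()) & set(d.lower()))
        return {"documents": sorted(self.docs, key=score, reverse=True)[:1]}

def answer(retriever: Retriever, query: str) -> list[str]:
    # The rest of the pipeline depends only on the shared run() interface
    return retriever.run(query)["documents"]

docs = ["RAG combines retrieval and generation", "Pipelines are DAGs"]
print(answer(KeywordRetriever(docs), "RAG"))
print(answer(DenseRetriever(docs), "retrieval"))
```

Either retriever can be handed to `answer` without touching it, which is the same structural property Haystack's typed sockets give you.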
Haystack 2.x supports agents via the Agent component that wraps any LLM with tool-calling capabilities. Tools are defined as Python functions with docstrings describing their parameters. The agent loop runs until no more tool calls are needed. Combine with custom components for specialised tool implementations.
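Haystack's real Agent component drives this loop through the LLM's native tool-calling API; the stdlib sketch below (all names are illustrative, not Haystack's) only shows the loop shape the paragraph describes: call the LLM, execute any requested tool, feed the result back, and stop when no tool call remains.

```python
def run_agent(llm, tools, question, max_steps=5):
    """Minimal agent loop: the LLM either requests a tool or gives a final answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(messages)
        if "tool" in reply:
            # Execute the requested tool and feed its result back to the LLM
            result = tools[reply["tool"]](**reply["args"])
            messages.append({"role": "tool", "content": str(result)})
        else:
            return reply["answer"]   # no more tool calls -> done
    raise RuntimeError("agent did not terminate")

# Hypothetical LLM stand-in: asks for the weather tool once, then answers
def fake_llm(messages):
    if any(m["role"] == "tool" for m in messages):
        return {"answer": f"It is {messages[-1]['content']}."}
    return {"tool": "weather", "args": {"city": "Berlin"}}

tools = {"weather": lambda city: "sunny"}
print(run_agent(fake_llm, tools, "Weather in Berlin?"))  # It is sunny.
```

The `max_steps` cap matters in practice: an agent loop without a step budget can cycle indefinitely if the LLM keeps requesting tools.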
Choose Haystack when: you want a mature, production-tested RAG framework, you need flexible pipeline topologies (branching, merging), or you want YAML-configurable deployable pipelines. Avoid if: you need deep LangChain ecosystem integrations, or your workflow is simple enough that a direct API call suffices.
Haystack's core abstraction is the Pipeline: a directed acyclic graph (DAG) of components. Components are building blocks (Retrievers, LLMs, Writers) connected by edges. Pipelines are composable: define once, reuse across applications. Haystack 2.x also provides a `@component` decorator for writing custom components, which keeps them small and independently testable.
```python
# Haystack 2.x pipeline definition
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_store = InMemoryDocumentStore()  # index documents here (omitted)

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=doc_store))
pipeline.add_component("prompt_builder", PromptBuilder(
    template="{% for doc in documents %}{{ doc.content }}{% endfor %}\n{{ query }}"
))
# Points at a local TGI server; adjust the URL to your endpoint
pipeline.add_component("llm", HuggingFaceTGIGenerator(url="http://localhost:8080"))

# Connect: retriever output -> prompt builder -> LLM input
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "llm.prompt")

# Run the pipeline
result = pipeline.run({"retriever": {"query": "What is RAG?"},
                       "prompt_builder": {"query": "What is RAG?"}})
print(result["llm"]["replies"])
```

Each component (Retriever, LLM, etc.) is independently testable. Mock or replace any component without modifying the others. This enables rapid prototyping: swap a BM25 retriever for a dense retriever, or replace a local LLM with a cloud API, with minimal code changes.
```python
# Test pipeline components independently
from unittest.mock import Mock

mock_retriever = Mock()
mock_retriever.run.return_value = {"documents": [{"content": "test doc"}]}

# Exercise the component in isolation; a plain Mock cannot be wired into a
# real Pipeline (add_component expects typed sockets), so call run() directly
result = mock_retriever.run(query="test")

# Verify that the retriever was called correctly
assert mock_retriever.run.called
assert result["documents"][0]["content"] == "test doc"
```

Pipelines can be serialized to YAML or JSON, version-controlled, and deployed reproducibly. Serialize your pipeline, check it into git, and deploy the exact same pipeline across staging, production, and A/B test environments. The YAML makes it visible at a glance how components are connected and configured.
| Format | Use Case | Version Control | Advantage |
|---|---|---|---|
| YAML | Production pipelines | Git-friendly (human-readable) | Easy diffing, code review |
| JSON | Programmatic generation | Tool-generated, checked in | Strict schema, less ambiguity |
| Python (decorator) | Research & prototyping | Not recommended | Full flexibility, imperative control |
| Pickle/dill | Legacy (avoid) | Binary, not reviewable | Fast serialization, but unsafe |
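A sketch of the round-trip workflow using the stdlib `json` module (the spec layout below is illustrative, not Haystack's actual serialization schema):

```python
import json

# Illustrative pipeline spec; Haystack's real serialized schema differs
spec = {
    "components": {
        "retriever": {"type": "InMemoryBM25Retriever", "init": {"top_k": 5}},
        "prompt_builder": {"type": "PromptBuilder", "init": {"template": "..."}},
        "llm": {"type": "OpenAIGenerator", "init": {"model": "gpt-4o-mini"}},
    },
    "connections": [
        {"sender": "retriever.documents", "receiver": "prompt_builder.documents"},
        {"sender": "prompt_builder.prompt", "receiver": "llm.prompt"},
    ],
}

text = json.dumps(spec, indent=2, sort_keys=True)  # stable key order -> clean git diffs
assert json.loads(text) == spec                    # lossless round trip
```

Sorting keys and fixing the indentation makes the serialized form deterministic, which is what keeps diffs and code review useful.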
Common Pitfall: Pipelines with tightly coupled components are brittle. If a retriever output schema changes, the LLM input expectations break silently. Use typed inputs/outputs and validate at component boundaries. Haystack's type hints help catch these issues early.
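One lightweight guard (plain Python, not a built-in Haystack feature) is to validate the payload shape at the boundary, so schema drift fails loudly at the component that caused it:

```python
from dataclasses import dataclass

@dataclass
class Document:
    content: str

def validate_documents(payload: dict) -> list[Document]:
    """Fail loudly at the component boundary instead of silently downstream."""
    docs = payload.get("documents")
    if not isinstance(docs, list) or not all(isinstance(d, Document) for d in docs):
        raise TypeError(f"expected {{'documents': list[Document]}}, got {payload!r}")
    return docs

validate_documents({"documents": [Document("ok")]})             # passes
try:
    validate_documents({"docs": [{"content": "renamed key"}]})  # schema drifted
except TypeError as err:
    print("caught:", err)
```

The error surfaces at the exact edge where the contract broke, rather than as a confusing failure two components later.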
For production deployments, wrap Haystack pipelines in REST APIs (FastAPI + Haystack) or async task queues (Celery + Haystack). This decouples pipeline execution from the application, enabling independent scaling and monitoring. Use Haystack's built-in observability hooks to log retrieval quality and LLM latency for debugging.
Advanced Haystack: custom components and distribution. Haystack's component API is extensible: write custom components for domain-specific logic (e.g., a custom re-ranker or domain-specific validation). Components are isolated units: test them independently, then integrate them into pipelines. For distributed inference, deploy Haystack on Kubernetes with each component type in separate pods, connected via message queues. This enables independent scaling: if the LLM is the bottleneck, scale the LLM pods without scaling the retrievers. Use asynchronous pipelines for high-throughput batch processing: submit jobs to a queue, process them in parallel, and retrieve results later.
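The submit/process/collect pattern can be sketched with the stdlib `queue` module. A production system would use a real broker such as Redis or RabbitMQ, and `run_pipeline` here is a hypothetical stand-in for an actual pipeline call:

```python
import queue
import threading

jobs: queue.Queue = queue.Queue()
results: dict[int, str] = {}

def run_pipeline(query: str) -> str:
    return f"answer for {query!r}"   # hypothetical stand-in for pipeline.run(...)

def worker():
    while True:
        job = jobs.get()
        if job is None:              # sentinel -> shut this worker down
            break
        job_id, query = job
        results[job_id] = run_pipeline(query)
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

for job_id, query in enumerate(["what is RAG?", "what is BM25?"]):
    jobs.put((job_id, query))        # submit the batch
jobs.join()                          # wait for every submitted job

for w in workers:
    jobs.put(None)                   # one sentinel per worker
for w in workers:
    w.join()
print(results)
```

The job IDs are what let callers retrieve results later, which is the decoupling the paragraph is after: submission and consumption never block each other.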
For production monitoring, add observability hooks: log latency per component, cache hit rates for retrievers, LLM token usage, and end-to-end pipeline latency. Alerting on component slowdowns enables rapid diagnosis of bottlenecks. Version your pipelines in git; pipeline changes are code changes and require review, testing, and gradual rollout to avoid breaking production RAG systems.
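A minimal per-component latency hook can be written in plain Python (Haystack 2.x also has its own tracing support, which this sketch does not use; the components below are hypothetical stand-ins):

```python
import time
from collections import defaultdict

latencies: dict[str, list[float]] = defaultdict(list)

def timed(name: str, run):
    """Wrap a component's run callable and record its wall-clock latency."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return run(*args, **kwargs)
        finally:
            latencies[name].append(time.perf_counter() - start)
    return wrapper

# Hypothetical components: a retriever and an LLM call stand-in
retrieve = timed("retriever", lambda query: {"documents": [query]})
generate = timed("llm", lambda prompt: {"replies": [f"echo: {prompt}"]})

docs = retrieve("what is RAG?")
reply = generate("answer using: " + docs["documents"][0])
print({name: f"{sum(ts) * 1000:.2f} ms over {len(ts)} call(s)"
       for name, ts in latencies.items()})
```

Recording in a `finally` block means failures are timed too, which is exactly when per-component numbers are most useful for diagnosing a bottleneck.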
Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.
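As a worked example of the percentile check (nearest-rank method; the baseline and threshold are illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; fine for dashboards and alerts."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

latencies_ms = [12, 14, 15, 13, 12, 250, 14, 13, 15, 12]  # one slow outlier
p50 = percentile(latencies_ms, 50)   # 13: the median looks healthy
p99 = percentile(latencies_ms, 99)   # 250: the tail exposes the outlier

baseline_p99 = 100                   # illustrative SLO baseline
if p99 > 2 * baseline_p99:
    print(f"ALERT: p99 {p99} ms more than doubled vs baseline {baseline_p99} ms")
```

The point of the example: the median barely moves while p99 jumps an order of magnitude, so alerting only on averages or error rates would miss the regression entirely.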
For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.
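The gradual-rollout step can be sketched as a deterministic hash-based flag (stdlib only; the flag and user names are illustrative):

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically bucket a user into 0-99; stable across calls and hosts."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100 < percent

# Ramping 1% -> 10% -> 100% only ever adds users, never reshuffles them:
# a user whose bucket is below 1 is also below 10 and below 100.
for pct in (1, 10, 100):
    served = sum(in_rollout(f"user-{i}", "new-model-v2", pct) for i in range(1000))
    print(f"{pct:>3}% rollout -> {served} of 1000 users")
```

Hashing on `flag:user_id` (rather than the user alone) decorrelates buckets across flags, so the same users are not always the guinea pigs, and flipping the percentage back to 0 is the instant rollback the text describes.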
Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.
The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.
Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.
Finally, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.
Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.