Streamlit turns Python scripts into interactive web apps with no frontend code. The fastest way to demo an LLM application — build a chatbot, RAG interface, or dashboard in 50 lines of Python. st.chat_message, st.chat_input, and streaming support are purpose-built for LLM apps.
Streamlit is a Python library that renders interactive web apps directly from your script — no HTML, CSS, JavaScript, or web framework knowledge required. Every time the user interacts with a widget, Streamlit re-runs the script from top to bottom and updates the displayed output.
For LLM applications, Streamlit provides exactly the primitives you need: st.chat_message (speech bubble UI), st.chat_input (message input box), st.session_state (persist conversation history across script re-runs), and native support for streaming responses token-by-token. A complete LLM chatbot with streaming, conversation history, and a model selector is ~60 lines of Python.
```bash
pip install streamlit openai
streamlit run app.py
```
```python
import streamlit as st
from openai import OpenAI

st.title("My LLM Chatbot")
client = OpenAI()

# Initialize conversation history in session state
if "messages" not in st.session_state:
    st.session_state.messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

# Display conversation history
for msg in st.session_state.messages[1:]:  # skip system message
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Handle new user input
if prompt := st.chat_input("Ask me anything..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=st.session_state.messages,
        )
        reply = response.choices[0].message.content
        st.markdown(reply)

    st.session_state.messages.append({"role": "assistant", "content": reply})
```
```python
import streamlit as st
from openai import OpenAI

client = OpenAI()

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Message..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        # Stream tokens as they arrive
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": "You are helpful."}]
            + st.session_state.messages,
            stream=True,
        )
        # st.write_stream handles the OpenAI stream natively
        response = st.write_stream(stream)

    st.session_state.messages.append({"role": "assistant", "content": response})
```
st.write_stream (added in Streamlit 1.31) consumes an OpenAI stream and renders tokens progressively — no manual chunk handling needed.
```python
import json

import streamlit as st

# st.session_state persists data across re-runs for each user session.
# Perfect for: conversation history, user preferences, loaded models.
# Pattern: initialize once, use everywhere.
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
if "model" not in st.session_state:
    st.session_state.model = "gpt-4o-mini"

# Can store any Python object
if "rag_pipeline" not in st.session_state:
    # build_rag_pipeline is your own constructor; expensive, so run only once
    st.session_state.rag_pipeline = build_rag_pipeline()

# Clear history button
if st.button("Clear conversation"):
    st.session_state.chat_history = []
    st.rerun()  # re-run the script to update the display

# Download chat history. st.download_button renders its own button, so don't
# nest it inside `if st.button(...)` -- it would vanish on the next re-run.
st.download_button(
    "Export chat as JSON",
    json.dumps(st.session_state.chat_history, indent=2),
    "chat_history.json",
    "application/json",
)
```
```python
import streamlit as st
from openai import OpenAI

if "messages" not in st.session_state:
    st.session_state.messages = []

# Sidebar for configuration
with st.sidebar:
    st.header("Settings")
    model = st.selectbox("Model", ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"])
    temperature = st.slider("Temperature", 0.0, 2.0, 0.7, 0.1)
    max_tokens = st.number_input("Max tokens", 100, 4000, 500)
    system_prompt = st.text_area("System prompt", "You are a helpful assistant.")
    if st.button("Reset conversation"):
        st.session_state.messages = []
        st.rerun()

st.title(f"Chat with {model}")
client = OpenAI()

# Use sidebar values in the API call
if prompt := st.chat_input("..."):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}]
        + st.session_state.messages
        + [{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
```
```text
# requirements.txt
streamlit>=1.31.0
openai>=1.0.0
langchain>=0.1.0
chromadb>=0.4.0
```
Streamlit Community Cloud (streamlit.io/cloud) deploys public GitHub repos for free (with resource limits). Steps: push app.py and requirements.txt to GitHub, create the app from the repo in the Cloud dashboard, and add your API keys under the app's Secrets settings. Secrets are TOML and are read in code via st.secrets:

```toml
# .streamlit/secrets.toml locally, or the Secrets field on Streamlit Cloud
OPENAI_API_KEY = "sk-..."
```

```python
import streamlit as st
from openai import OpenAI

client = OpenAI(api_key=st.secrets["OPENAI_API_KEY"])
```

For private apps or higher resources, run Streamlit in Docker on any cloud VM. A minimal Dockerfile:

```dockerfile
FROM python:3.11
RUN pip install streamlit
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
```
Re-runs and state: Streamlit re-runs the entire script on every interaction. Expensive operations (loading models, building indexes) must be wrapped in @st.cache_resource to avoid re-execution. Without caching, your LLM would reload on every message.
```python
@st.cache_resource  # cached across all users and sessions
def load_rag_pipeline():
    return build_expensive_rag_pipeline()

pipeline = load_rag_pipeline()  # only runs once per Streamlit server start
```
Multiuser state: st.session_state is per-session, not global. Each browser tab gets its own session state. For shared state across users (admin controls, shared document store), use an external database or Redis, not session state.
Production limitations: Streamlit is excellent for demos and internal tools but not ideal for high-traffic production apps (no built-in auth beyond Streamlit Community Cloud's Google OAuth, limited horizontal scaling). For public-facing production apps with 1000+ daily users, consider FastAPI + React instead.
Streamlit's execution model re-runs the entire script from top to bottom on every user interaction, which requires careful state management for LLM applications. Session state persists data across reruns, enabling conversation history, model configuration, and intermediate results to survive the interaction cycle without expensive recomputation.
| Pattern | Implementation | Best For | Pitfall |
|---|---|---|---|
| Chat history | st.session_state.messages list | Conversational apps | Unbounded growth → token limit |
| Cached model load | @st.cache_resource | Local model serving | Stale model on update |
| Streaming output | st.write_stream(generator) | Responsive chat UI | Generator must be consumed once |
| Sidebar config | st.sidebar sliders/selects | Parameter tuning UIs | Triggers full rerun on change |
The @st.cache_resource decorator is essential for LLM applications that load models locally. Without caching, the model weights would be reloaded from disk and transferred to the GPU on every script rerun, which happens on every user interaction. st.cache_resource stores the model object across sessions and reruns, so the expensive initialization happens only once per application deployment, and the decorated function's clear() method can be called programmatically to reload the model when the weights are updated.
Streamlit's st.write_stream function accepts a generator and displays tokens as they arrive, implementing the streaming chat experience with a single function call. The generator can come from any LLM API that supports streaming — OpenAI, Anthropic, or a local model endpoint. Wrapping the streaming call in a try/except block with a finally clause that appends the completed response to the session state chat history ensures history is persisted even if the stream is interrupted by a network error or user navigation.
Streamlit's multipage app feature organizes complex LLM applications into separate pages, each with its own Python file in the pages/ directory. Navigation between pages is handled automatically by Streamlit's sidebar, and session state persists across page transitions within the same browser session. For LLM applications with distinct functional areas — a chat interface, a document analyzer, a settings panel, an evaluation dashboard — multipage structure keeps each component focused and independently maintainable without the complexity of routing logic.
Database connections in Streamlit LLM apps should use st.cache_resource to avoid creating a new connection pool on every script rerun. The same pattern applies to vector store clients, embedding model instances, and any other resource with significant initialization cost. st.cache_resource stores the resource as a singleton, shared across all user sessions and script reruns. Parameterized resources, such as different database connections for different user accounts, can use the hash_funcs argument to cache a separate instance per unique parameter combination.
Deploying Streamlit applications with Docker enables consistent production environments that match the local development setup exactly. The Dockerfile installs Python dependencies, copies the application code, and configures Streamlit's server settings (headless mode, port, CORS policy) via environment variables. Streamlit Cloud provides managed hosting that deploys directly from GitHub repositories with automatic dependency installation from requirements.txt, supporting both public and private apps with GitHub-based access control.
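A sketch of such a Dockerfile, with server settings supplied through Streamlit's STREAMLIT_* environment variables rather than a config.toml (file names and versions are illustrative):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Each config option maps to a STREAMLIT_<SECTION>_<OPTION> variable
ENV STREAMLIT_SERVER_HEADLESS=true \
    STREAMLIT_SERVER_PORT=8501 \
    STREAMLIT_SERVER_ADDRESS=0.0.0.0
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
```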
Streamlit's fragment feature (introduced as st.experimental_fragment and stabilized as st.fragment in version 1.37) allows portions of the UI to rerun independently of the full script, reducing unnecessary recomputation. For LLM chat applications, wrapping the chat message display in a fragment ensures that displaying new messages does not retrigger expensive operations such as sidebar metric calculations or chart renders whose underlying session data the new message leaves unchanged. Fragments improve perceived performance by reducing the visual flicker and latency of partial UI updates in response to user interactions.