Streamlit turns Python scripts into interactive web apps with no frontend code. The fastest way to demo an LLM application — build a chatbot, RAG interface, or dashboard in 50 lines of Python. st.chat_message, st.chat_input, and streaming support are purpose-built for LLM apps.
Streamlit is a Python library that renders interactive web apps directly from your script — no HTML, CSS, JavaScript, or web framework knowledge required. Every time the user interacts with a widget, Streamlit re-runs the script from top to bottom and updates the displayed output.
For LLM applications, Streamlit provides exactly the primitives you need: st.chat_message (speech bubble UI), st.chat_input (message input box), st.session_state (persist conversation history across script re-runs), and native support for streaming responses token-by-token. A complete LLM chatbot with streaming, conversation history, and a model selector is ~60 lines of Python.
```bash
pip install streamlit openai
streamlit run app.py
```
```python
import streamlit as st
from openai import OpenAI

st.title("My LLM Chatbot")
client = OpenAI()

# Initialize conversation history in session state
if "messages" not in st.session_state:
    st.session_state.messages = [
        {"role": "system", "content": "You are a helpful assistant."}
    ]

# Display conversation history
for msg in st.session_state.messages[1:]:  # skip system message
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

# Handle new user input
if prompt := st.chat_input("Ask me anything..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=st.session_state.messages,
        )
        reply = response.choices[0].message.content
        st.markdown(reply)

    st.session_state.messages.append({"role": "assistant", "content": reply})
```
```python
import streamlit as st
from openai import OpenAI

client = OpenAI()

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Message..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        # Stream tokens as they arrive
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": "You are helpful."}]
            + st.session_state.messages,
            stream=True,
        )
        # st.write_stream handles the OpenAI stream natively
        response = st.write_stream(stream)

    st.session_state.messages.append({"role": "assistant", "content": response})
```
st.write_stream (added in Streamlit 1.31) consumes an OpenAI stream and renders tokens progressively — no manual chunk handling needed.
```python
import json

import streamlit as st

# st.session_state persists data across re-runs for each user session.
# Perfect for: conversation history, user preferences, loaded models.
# Pattern: initialize once, use everywhere.
if "chat_history" not in st.session_state:
    st.session_state.chat_history = []
if "model" not in st.session_state:
    st.session_state.model = "gpt-4o-mini"

# Can store any Python object
if "rag_pipeline" not in st.session_state:
    # build_rag_pipeline is your own constructor; expensive, so run only once
    st.session_state.rag_pipeline = build_rag_pipeline()

# Clear history button
if st.button("Clear conversation"):
    st.session_state.chat_history = []
    st.rerun()  # re-run the script to update the display

# Download chat history. st.download_button renders its own button, so don't
# nest it inside `if st.button(...)` -- it would vanish on the next re-run.
st.download_button(
    "Export chat as JSON",
    json.dumps(st.session_state.chat_history, indent=2),
    "chat_history.json",
    "application/json",
)
```
```python
import streamlit as st
from openai import OpenAI

if "messages" not in st.session_state:
    st.session_state.messages = []

# Sidebar for configuration
with st.sidebar:
    st.header("Settings")
    model = st.selectbox("Model", ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"])
    temperature = st.slider("Temperature", 0.0, 2.0, 0.7, 0.1)
    max_tokens = st.number_input("Max tokens", 100, 4000, 500)
    system_prompt = st.text_area("System prompt", "You are a helpful assistant.")
    if st.button("Reset conversation"):
        st.session_state.messages = []
        st.rerun()

st.title(f"Chat with {model}")
client = OpenAI()

# Use sidebar values in the API call
if prompt := st.chat_input("..."):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system_prompt}]
        + st.session_state.messages
        + [{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
```
```text
# requirements.txt
streamlit>=1.31.0
openai>=1.0.0
langchain>=0.1.0
chromadb>=0.4.0
```
Streamlit Community Cloud (streamlit.io/cloud) deploys public GitHub repos for free (with resource limits). Steps: push app.py and requirements.txt to GitHub, create the app from the repo in the Cloud dashboard, and add your API keys under the app's Secrets settings. Secrets are TOML and are read in code via st.secrets:

```toml
# .streamlit/secrets.toml locally, or the Secrets field on Streamlit Cloud
OPENAI_API_KEY = "sk-..."
```

```python
import streamlit as st
from openai import OpenAI

client = OpenAI(api_key=st.secrets["OPENAI_API_KEY"])
```

For private apps or higher resources, run Streamlit in Docker on any cloud VM. A minimal Dockerfile:

```dockerfile
FROM python:3.11
RUN pip install streamlit
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
```
Re-runs and state: Streamlit re-runs the entire script on every interaction. Expensive operations (loading models, building indexes) must be wrapped in @st.cache_resource to avoid re-execution. Without caching, your LLM would reload on every message.
```python
@st.cache_resource  # cached across all users and sessions
def load_rag_pipeline():
    return build_expensive_rag_pipeline()

pipeline = load_rag_pipeline()  # only runs once per Streamlit server start
```
Multiuser state: st.session_state is per-session, not global. Each browser tab gets its own session state. For shared state across users (admin controls, shared document store), use an external database or Redis, not session state.
Production limitations: Streamlit is excellent for demos and internal tools but not ideal for high-traffic production apps (no built-in auth beyond Streamlit Community Cloud's Google OAuth, limited horizontal scaling). For public-facing production apps with 1000+ daily users, consider FastAPI + React instead.
Streamlit's execution model re-runs the entire script from top to bottom on every user interaction, which requires careful state management for LLM applications. Session state persists data across reruns, enabling conversation history, model configuration, and intermediate results to survive the interaction cycle without expensive recomputation.
| Pattern | Implementation | Best For | Pitfall |
|---|---|---|---|
| Chat history | st.session_state.messages list | Conversational apps | Unbounded growth → token limit |
| Cached model load | @st.cache_resource | Local model serving | Stale model on update |
| Streaming output | st.write_stream(generator) | Responsive chat UI | Generator must be consumed once |
| Sidebar config | st.sidebar sliders/selects | Parameter tuning UIs | Triggers full rerun on change |
The @st.cache_resource decorator is essential for LLM applications that load models locally. Without caching, the model weights would be reloaded from disk and transferred to the GPU on every script rerun, which happens on every user interaction. st.cache_resource stores the model object across sessions and reruns, so the expensive initialization happens only once per application deployment, and the decorated function's clear() method can be called programmatically to reload the model when the weights are updated.
Streamlit's st.write_stream function accepts a generator and displays tokens as they arrive, implementing the streaming chat experience with a single function call. The generator can come from any LLM API that supports streaming — OpenAI, Anthropic, or a local model endpoint. Wrapping the streaming call in a try/except block with a finally clause that appends the completed response to the session state chat history ensures history is persisted even if the stream is interrupted by a network error or user navigation.
Streamlit's multipage app feature organizes complex LLM applications into separate pages, each with its own Python file in the pages/ directory. Navigation between pages is handled automatically by Streamlit's sidebar, and session state persists across page transitions within the same browser session. For LLM applications with distinct functional areas — a chat interface, a document analyzer, a settings panel, an evaluation dashboard — multipage structure keeps each component focused and independently maintainable without the complexity of routing logic.
Database connections in Streamlit LLM apps should use st.cache_resource to avoid creating a new connection pool on every script rerun. The same pattern applies to vector store clients, embedding model instances, and any other resource with significant initialization cost. st.cache_resource stores the resource as a singleton, shared across all user sessions and script reruns. Parameterized resources, such as different database connections for different user accounts, can use the hash_funcs argument to cache a separate instance per unique parameter combination.
Deploying Streamlit applications with Docker enables consistent production environments that match the local development setup exactly. The Dockerfile installs Python dependencies, copies the application code, and configures Streamlit's server settings (headless mode, port, CORS policy) via environment variables. Streamlit Cloud provides managed hosting that deploys directly from GitHub repositories with automatic dependency installation from requirements.txt, supporting both public and private apps with GitHub-based access control.
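A sketch of such a Dockerfile, with server settings supplied through Streamlit's STREAMLIT_* environment variables rather than a config.toml (file names and versions are illustrative):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# Each config option maps to a STREAMLIT_<SECTION>_<OPTION> variable
ENV STREAMLIT_SERVER_HEADLESS=true \
    STREAMLIT_SERVER_PORT=8501 \
    STREAMLIT_SERVER_ADDRESS=0.0.0.0
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
```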
Streamlit's fragment feature (introduced as st.experimental_fragment and stabilized as st.fragment in version 1.37) allows portions of the UI to rerun independently of the full script, reducing unnecessary recomputation. For LLM chat applications, wrapping the chat message display in a fragment ensures that displaying new messages does not retrigger expensive operations such as sidebar metric calculations or chart renders whose underlying session data the new message leaves unchanged. Fragments improve perceived performance by reducing the visual flicker and latency of partial UI updates in response to user interactions.