Dev Tools

Gradio

Gradio builds ML demos with a component-based API. gr.ChatInterface creates a full chatbot UI in one function call. Deep HuggingFace integration, free Spaces hosting, and Python/JavaScript client libraries make it the standard for sharing ML demos.

gr.ChatInterface
One-line chatbot
HF Spaces
Free hosting
Blocks API
Custom layouts

Table of Contents

SECTION 01

Gradio vs Streamlit

Both Gradio and Streamlit build interactive ML demos from Python, but they have different mental models:

Gradio: Component-based and function-oriented. You define a Python function (input → output) and Gradio wraps it in a UI. The gr.ChatInterface is particularly strong — one function call creates a complete chatbot with history, streaming, and retry buttons. Deep HuggingFace integration means deploying to Spaces is trivial. Better for: model demos, HuggingFace-centric workflows, sharing via Spaces links.

Streamlit: Script-based and imperative. You write a script that builds the UI procedurally. More flexible for complex layouts, dashboards, and multi-page apps. Better for: internal tools, dashboards, apps with complex state management.

For LLM chatbots specifically: Gradio's ChatInterface requires less boilerplate and includes more features out of the box (retry, undo, like/dislike buttons). Streamlit gives more layout control.

SECTION 02

gr.ChatInterface in 10 lines

import gradio as gr
from openai import OpenAI

client = OpenAI()

def chat(message: str, history: list) -> str:
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})

    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content

demo = gr.ChatInterface(
    fn=chat,
    title="My LLM Chatbot",
    description="Powered by GPT-4o-mini",
    examples=["Tell me a joke", "Explain attention in transformers"],
)
demo.launch()

Gradio auto-generates: chat bubbles, conversation history display, submit/clear/retry/undo buttons, and example prompts. The history parameter is managed automatically: your function receives the conversation so far on each call. Note the format: in Gradio 4.x, history is a list of (user, assistant) tuples, as unpacked above; in Gradio 5 (or with type="messages") it is a list of OpenAI-style role/content dicts.
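The history-flattening loop in chat above can be pulled out into a helper, which also makes the gotcha-prone part easy to unit test. A minimal sketch (the helper name and default system prompt are ours, not part of Gradio):

```python
def history_to_messages(history, system_prompt="You are a helpful assistant."):
    """Flatten Gradio's (user, assistant) tuple history into OpenAI chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    return messages
```

The chat function then reduces to history_to_messages(history) plus the new user message, followed by the API call.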

SECTION 03

Gradio Blocks for custom layouts

import gradio as gr
from openai import OpenAI

client = OpenAI()

with gr.Blocks(theme=gr.themes.Soft()) as demo:
    gr.Markdown("# RAG-Powered Q&A")
    with gr.Row():
        with gr.Column(scale=3):
            chatbot = gr.Chatbot(height=500)
            msg = gr.Textbox(placeholder="Ask about your documents...", show_label=False)
            with gr.Row():
                submit_btn = gr.Button("Send", variant="primary")
                clear_btn = gr.Button("Clear")
        with gr.Column(scale=1):
            gr.Markdown("### Settings")
            model = gr.Dropdown(["gpt-4o", "gpt-4o-mini"], value="gpt-4o-mini", label="Model")
            top_k = gr.Slider(1, 10, value=5, label="Retrieved chunks")
            show_sources = gr.Checkbox(value=True, label="Show sources")

    def respond(message, history, model_choice, k, show_src):
        # Your RAG pipeline here
        response, sources = rag_pipeline(message, model=model_choice, top_k=k)
        if show_src:
            response += f"\n\n**Sources**: {', '.join(sources)}"
        history.append((message, response))
        return "", history

    submit_btn.click(respond, [msg, chatbot, model, top_k, show_sources], [msg, chatbot])
    clear_btn.click(lambda: [], outputs=chatbot)

demo.launch(share=True)  # share=True creates a public tunnel URL

SECTION 04

Streaming and async

import gradio as gr
from openai import OpenAI

client = OpenAI()

def chat_stream(message: str, history: list):
    messages = [{"role": "system", "content": "You are helpful."}]
    for user, assistant in history:
        messages += [{"role": "user", "content": user},
                     {"role": "assistant", "content": assistant}]
    messages.append({"role": "user", "content": message})

    # Yield partial responses for streaming
    partial = ""
    stream = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            partial += chunk.choices[0].delta.content
            yield partial  # Gradio renders each yielded string as the current response

demo = gr.ChatInterface(
    fn=chat_stream,
    title="Streaming Chat",
)
demo.launch()

For generator functions (using yield), Gradio automatically detects streaming and updates the UI token-by-token. No special configuration needed.
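One detail is easy to miss: each yield replaces the currently displayed response instead of appending to it, so the generator must accumulate the text itself. A toy simulation of that pattern, with a fake token list instead of an API stream:

```python
def fake_stream(tokens):
    # Accumulate, then yield the full text so far; Gradio shows the
    # latest yielded value as the assistant's current response.
    partial = ""
    for token in tokens:
        partial += token
        yield partial

outputs = list(fake_stream(["Hel", "lo", " world"]))
```

Yielding just the token would make the UI flicker one fragment at a time instead of growing the message.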

SECTION 05

HuggingFace Spaces integration

# Deploy to HuggingFace Spaces in minutes
# 1. Create a Space at huggingface.co/new-space (select Gradio SDK)
# 2. Clone the Space repo
git clone https://huggingface.co/spaces/your-username/your-space
cd your-space

# 3. Copy your app files into the repo
cp ../your-project/app.py ../your-project/requirements.txt .

# 4. Add secrets via HF Space settings (Settings → Repository secrets)
# OPENAI_API_KEY=sk-...

# 5. Push to deploy
git add . && git commit -m "Initial deploy" && git push

# Your app is live at: huggingface.co/spaces/your-username/your-space

HuggingFace Spaces is free for public demos (with compute limits). CPU spaces run Gradio apps fine for non-GPU tasks. For GPU-accelerated demos, GPU spaces are available (paid). Each Space gets a public URL that you can embed in your blog, portfolio, or share directly.
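For reference, the Space's README.md front matter is what tells Spaces to use the Gradio SDK and which file to run; a typical header looks like this (values are illustrative):

```yaml
---
title: My LLM Chatbot
emoji: 💬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0   # pin the Gradio version the Space installs
app_file: app.py
pinned: false
---
```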

SECTION 06

Custom themes and styling

import gradio as gr

# Built-in themes
demo = gr.Blocks(theme=gr.themes.Base())    # minimal
demo = gr.Blocks(theme=gr.themes.Soft())    # rounded, friendly
demo = gr.Blocks(theme=gr.themes.Monochrome())  # professional

# Custom theme
custom_theme = gr.themes.Default(
    primary_hue=gr.themes.colors.blue,
    secondary_hue=gr.themes.colors.sky,
    font=gr.themes.GoogleFont("Inter"),
)

# Custom CSS
with gr.Blocks(css=".gradio-container {max-width: 900px; margin: auto}") as demo:
    gr.Markdown("## My App")

# Programmatic theme builder
theme = gr.themes.Soft().set(
    body_background_fill="#0f1117",     # dark background
    block_background_fill="#1a1c23",
    button_primary_background_fill="#7c3aed",
)

SECTION 07

Gotchas

History format: in Gradio 4.x, ChatInterface passes history as a list of (user_message, assistant_message) tuples; in Gradio 5 (or with type="messages"), it is a list of OpenAI-style {"role", "content"} dicts. When flattening the tuple format into OpenAI messages, it's easy to accidentally skip the assistant's last message or duplicate messages; the dict format can be passed through largely unchanged.
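Because the history format differs across Gradio versions (tuples in older releases, OpenAI-style dicts with type="messages"), a small normalizer can make a chat function tolerant of both. A sketch (normalize_history is our name, not a Gradio API):

```python
def normalize_history(history):
    """Convert either Gradio history format into OpenAI-style message dicts."""
    messages = []
    for turn in history:
        if isinstance(turn, dict):
            # Messages format: already {"role": ..., "content": ...}
            messages.append({"role": turn["role"], "content": turn["content"]})
        else:
            # Tuple format: (user_message, assistant_message)
            user_msg, assistant_msg = turn
            messages.append({"role": "user", "content": user_msg})
            messages.append({"role": "assistant", "content": assistant_msg})
    return messages
```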

State between requests: Like Streamlit, Gradio is stateless by default. Use gr.State for per-session state: state = gr.State([]). Global Python variables ARE shared across sessions — don't use them for per-user state.

Queue for concurrent users: Enable the queue for multi-user deployments: demo.queue(max_size=10).launch(). Without the queue, concurrent requests can interfere with streaming output.

Large file uploads: Gradio does not cap upload size by default; set a limit with demo.launch(max_file_size="100mb") (available in recent Gradio 4.x releases), and oversized uploads are rejected with an error. For very large files, stream them directly rather than uploading through Gradio's UI.

Gradio Interface Types Compared

Gradio provides multiple interface paradigms for wrapping machine learning models and LLM pipelines as interactive web demos. Each interface type is optimized for a different input/output modality and deployment scenario, from quick function prototyping to complex multi-turn conversational applications.

Interface Type     | Input/Output | Best For                    | Key Feature
gr.Interface       | Any → Any    | Simple functions            | Auto-generates UI from type hints
gr.ChatInterface   | Text → Text  | Chatbots                    | Built-in conversation history
gr.Blocks          | Flexible    | Complex multi-component UIs | Full layout control
gr.TabbedInterface | Any → Any    | Multi-model comparisons     | Tab-based navigation

Gradio Blocks is the most powerful and flexible interface type, allowing precise control over layout, component placement, and event handling. Unlike the simpler Interface class that automatically lays out inputs and outputs, Blocks uses a context manager pattern where components are explicitly placed in rows and columns. Event handlers connect components to Python functions with flexible input/output routing — one button click can trigger a sequence of function calls, feeding output from one step as input to the next.

The streaming support in Gradio ChatInterface uses Python generators to yield partial responses as they arrive from the LLM API, providing a typing-indicator experience rather than waiting for the full response before displaying anything. This is implemented by yielding strings inside the chat function; Gradio handles the incremental UI updates over its streaming connection (WebSockets in Gradio 3.x, server-sent events in 4.x and later) without requiring any frontend code changes. Streaming significantly improves perceived responsiveness for long-form generation tasks.

Gradio's authentication system supports both simple username/password auth and OAuth-based login for Spaces deployments. For internal tools, the auth parameter accepts a list of (username, password) tuples, protecting the app behind a login screen with a single configuration line. For production applications deployed on HuggingFace Spaces, OAuth integration allows users to log in with their HuggingFace account, and the application can access the authenticated user's identity to implement per-user rate limiting, personalization, or access control.

Custom CSS and JavaScript can be injected into Gradio applications to extend the default styling and add client-side behavior. The css parameter on Blocks accepts a CSS string that overrides or extends the default Gradio styles, enabling brand-consistent styling, custom font loading, and layout adjustments. The js parameter accepts a JavaScript string that executes on page load, enabling custom analytics, keyboard shortcuts, or integration with browser APIs that are not natively supported by Gradio components.

Gradio's queue system is critical for production LLM deployments where inference is slow and multiple users may be interacting simultaneously. Without the queue, concurrent requests can overwhelm the Python process and cause timeouts. With queue() enabled, requests are queued and processed under a configurable concurrency limit, and users receive real-time status updates showing their position in the queue. For streaming responses, the queue lets multiple users receive their streamed tokens concurrently without blocking each other.

Gradio's flagging feature allows users to mark responses that seem incorrect or problematic during a demo session. Flagged examples are saved to a local CSV file (or a configurable backend) with the input, output, and optional user comment. For LLM evaluation workflows, this creates a lightweight human feedback collection mechanism that accumulates real-world failure cases during beta testing, which can then be used as negative examples in fine-tuning or as seed inputs for adversarial evaluation datasets.