APPLICATIONS & SYSTEMS

Text-to-Data

Convert natural language questions into SQL, pandas code, and API calls. Enable non-technical users to query databases without SQL knowledge.

schema → LLM → query — the core pattern
SQL + NoSQL + KGs — all query languages
safety & guardrails — critical for production
Contents
  1. Why Text-to-SQL matters
  2. How it works
  3. Schema and context
  4. Few-shot learning
  5. Vanna AI approach
  6. Safety and guardrails
  7. Production challenges
01 — Motivation

Why Text-to-SQL Matters

SQL is powerful but requires technical expertise. Non-technical users struggle with JOIN syntax, aggregations, subqueries. Text-to-SQL bridges this: ask a natural language question, get back results. Democratizes data access.

Use Cases

Business analytics: Execs query sales data without consulting data engineers. Customer support: Agents pull customer records in real-time. Internal tools: Teams explore logs, metrics, historical data. Data exploration: Data scientists prototype queries faster.

Challenges

SQL is precise; LLMs are probabilistic. A small mistake (wrong table, incorrect JOIN) breaks the query. Safety is critical: you can't let an LLM blindly run DELETE statements. Schema understanding is hard: a thousand-column database requires telling the LLM which columns matter.

💡 Key insight: Text-to-SQL success depends 80% on schema quality and context, 20% on LLM capability. A clear schema with good descriptions beats a fancy LLM on a bad schema.
02 — Architecture

How Text-to-SQL Works

The basic pipeline: user question → schema context → LLM generates SQL → validate → execute → explain result.

Step-by-Step

1. Receive question: "What were our top 5 products by revenue last quarter?"
2. Schema retrieval: Identify relevant tables. Use sparse search (keyword matching), dense search (embedding-based), or both.
3. Context assembly: Build the prompt with schema description, example queries, business rules.
4. LLM generation: Prompt the LLM with context + question.
5. Validation: Parse the SQL, check syntax, enforce guardrails (no DELETE, no cross-db access).
6. Execution: Run on the prod DB with a timeout and row limits.
7. Explanation: Summarize results in natural language.

Prompt assembly example:

You are a SQL expert. Generate a single SQL query based on:

Database: sales_db
Tables:
- orders (id, user_id, product_id, order_date, amount)
- products (id, name, category, price)
- users (id, name, email, country)

Example queries:
Q: "Top 3 customers by spend"
A: SELECT u.name, SUM(o.amount) as spend
   FROM orders o JOIN users u ON o.user_id = u.id
   GROUP BY u.id ORDER BY spend DESC LIMIT 3

Question: What were our top 5 products by revenue last quarter?
Your response (SQL only, no markdown):
Best practice: Always parse before execution. Never run untrusted SQL directly. Implement query timeout (5s) and row limits (10k) to prevent runaway queries.
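The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the schema store, the keyword retrieval, and the guardrail check are hypothetical stand-ins, and the LLM call itself is omitted (`build_prompt` produces what you would send to it).

```python
# Minimal Text-to-SQL pipeline sketch: retrieve schema → build prompt →
# validate → execute. All names here are illustrative stand-ins.
import sqlite3

SCHEMA = {
    "orders": "orders (id, user_id, product_id, order_date, amount)",
    "products": "products (id, name, category, price)",
    "users": "users (id, name, email, country)",
}

def retrieve_schema(question: str) -> str:
    # Sparse retrieval: keep tables whose name appears in the question.
    hits = [ddl for name, ddl in SCHEMA.items() if name.rstrip("s") in question.lower()]
    return "\n".join(hits or SCHEMA.values())

def build_prompt(question: str) -> str:
    return (
        "You are a SQL expert. Generate a single SQL query.\n"
        f"Tables:\n{retrieve_schema(question)}\n"
        f"Question: {question}\nSQL only, no markdown:"
    )

def validate(sql: str) -> bool:
    # Guardrail: read-only, single statement.
    s = sql.strip().rstrip(";").upper()
    return s.startswith("SELECT") and ";" not in s

def execute(conn: sqlite3.Connection, sql: str, row_limit: int = 10_000):
    if not validate(sql):
        raise ValueError("Query rejected by guardrails")
    return conn.execute(sql).fetchmany(row_limit)
```

In a real system the LLM call sits between `build_prompt` and `validate`, and `execute` runs against your warehouse driver rather than sqlite3.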
03 — Schema Representation

Schema and Context

LLM performance on Text-to-SQL is dominated by schema clarity. A vague schema ("customer_id INT") loses to a clear one ("customer_id INT — Foreign key to users.id, required").

Schema Description Best Practices

1. Include column descriptions — Context helps
  • Bad: user_id INT
  • Good: user_id INT, foreign key to users.id
  • Better: user_id INT, FK→users(id), required; created_date TIMESTAMP
2. Note enum/categorical values — Constrain choices
  • status VARCHAR, enum: 'pending', 'approved', 'rejected'
  • Helps the LLM write correct WHERE clauses
3. Highlight important columns — Priority hints
  • Mark frequently-queried columns
  • Mention denormalization or composite keys
4. Provide sample data — Concrete examples
  • Show 2–3 rows for each table
  • Helps the LLM reason about data formats
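Following these practices, a small helper can render table metadata into prompt-ready schema text. This is a sketch: `describe_table` and the example column metadata are illustrative, not part of any library.

```python
# Render a table's metadata into prompt-ready schema text, following the
# practices above: typed columns, descriptions, enum values, sample rows.

def describe_table(name: str, columns: list[dict], samples: list[tuple]) -> str:
    lines = [f"Table: {name}"]
    for col in columns:
        desc = f"  {col['name']} {col['type']}"
        if col.get("desc"):
            desc += f" — {col['desc']}"
        if col.get("enum"):
            desc += f", enum: {', '.join(repr(v) for v in col['enum'])}"
        lines.append(desc)
    if samples:
        lines.append("  Sample rows:")
        lines += [f"    {row}" for row in samples]
    return "\n".join(lines)

print(describe_table(
    "orders",
    [
        {"name": "id", "type": "INT", "desc": "primary key"},
        {"name": "user_id", "type": "INT", "desc": "FK→users(id), required"},
        {"name": "status", "type": "VARCHAR", "enum": ["pending", "approved", "rejected"]},
    ],
    [(1, 42, "approved"), (2, 7, "pending")],
))
```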

Schema Compression

Large schemas overwhelm LLMs. For a 100-table database, show the LLM only the ~20 most relevant tables (selected by keyword matching or embedding similarity). Techniques: semantic chunking (split the schema by business domain), relevance ranking (use embedding similarity to rank tables), dynamic selection (parse the question first, then retrieve only the relevant tables).

⚠️ Schema context token cost: A detailed schema for 50 tables = ~2k tokens. Multiply that across queries. For high volume, cache schema context in prompt templates or use shorter schema digests.
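Relevance ranking can be sketched as follows. Simple token overlap stands in for embedding similarity here (swap `score` for cosine similarity over real embeddings), and the table descriptions are illustrative.

```python
# Dynamic table selection: score each table's description against the
# question and keep only the top-k for the prompt context.

TABLES = {
    "orders": "order id user product date amount revenue purchase",
    "products": "product sku name category price inventory",
    "users": "user customer name email country signup",
    "web_logs": "request url status latency ip timestamp",
}

def score(question: str, description: str) -> int:
    # Token overlap as a cheap stand-in for embedding similarity.
    q = set(question.lower().split())
    return len(q & set(description.split()))

def select_tables(question: str, k: int = 2) -> list[str]:
    ranked = sorted(TABLES, key=lambda t: score(question, TABLES[t]), reverse=True)
    return ranked[:k]

print(select_tables("total revenue by product category last month"))
# orders and products should rank above users and web_logs
```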
04 — Learning Techniques

Few-Shot Learning for SQL

Providing example queries (few-shot) dramatically improves LLM SQL generation. Examples show style, common patterns, and business logic.

Few-Shot Example Strategy

Include 3–5 diverse examples: simple (SELECT from single table), moderate (JOIN 2–3 tables), complex (subqueries, aggregation). Show realistic patterns: how do we typically filter by date? Which tables JOIN together? Use actual question/answer pairs from prod: log successful queries and use them as examples. Group by domain: separate examples for finance queries vs. product queries.

| Few-shot quality | LLM SQL accuracy | Example |
| --- | --- | --- |
| No examples | 50–65% | Baseline, schema only |
| Generic examples (simple CRUD) | 65–75% | SELECT, JOIN, GROUP BY |
| Domain-specific examples | 75–85% | Real questions from your biz |
| Dynamic example selection | 80–90% | Retrieve similar past queries |
💡 Pro tip: Maintain a corpus of high-quality queries from your analysts. Use embedding similarity to find similar past queries when new questions arrive. Include those in the few-shot context. This leverages domain knowledge automatically.
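That retrieval step can be sketched as below, assuming a logged corpus of question/SQL pairs; word overlap stands in for embedding similarity, and the corpus entries are illustrative.

```python
# Dynamic few-shot selection: retrieve the past queries most similar to the
# new question and splice them into the prompt as examples.

CORPUS = [
    ("Top 3 customers by spend",
     "SELECT u.name, SUM(o.amount) AS spend FROM orders o "
     "JOIN users u ON o.user_id = u.id GROUP BY u.id ORDER BY spend DESC LIMIT 3"),
    ("Orders per country last month",
     "SELECT u.country, COUNT(*) FROM orders o JOIN users u ON o.user_id = u.id "
     "WHERE o.order_date >= date('now', '-1 month') GROUP BY u.country"),
    ("Average product price by category",
     "SELECT category, AVG(price) FROM products GROUP BY category"),
]

def similarity(a: str, b: str) -> int:
    # Word overlap as a stand-in for embedding cosine similarity.
    return len(set(a.lower().split()) & set(b.lower().split()))

def pick_examples(question: str, k: int = 2) -> list[tuple[str, str]]:
    return sorted(CORPUS, key=lambda ex: similarity(question, ex[0]), reverse=True)[:k]

def few_shot_block(question: str) -> str:
    return "\n\n".join(f'Q: "{q}"\nA: {sql}' for q, sql in pick_examples(question))

print(few_shot_block("Top 5 customers by total spend this year"))
```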
05 — Vanna AI

The Vanna Approach: RAG + Fine-Tuning

Vanna AI is a commercial Text-to-SQL platform that combines RAG (retrieve past successful queries) with optional fine-tuning. Instead of relying purely on prompting, it learns from your data.

How Vanna Works

Indexing: Vanna ingests your schema + successful SQL queries from prod. Retrieval: When a new question arrives, it retrieves similar past queries (by embedding or semantic search). Augmentation: Includes retrieved examples in prompt. Optional fine-tuning: With enough examples (100+), fine-tune a smaller model on your domain.

Vanna vs LangChain SQL Agent

| Aspect | Vanna | LangChain SQL Agent |
| --- | --- | --- |
| Learning mechanism | RAG + optional FT | Prompt engineering |
| Accuracy (typical) | 75–85% | 60–75% |
| Setup complexity | Medium (index queries) | Low (plug and play) |
| Fine-tuning support | Yes (proprietary) | No (DIY) |
| Cost | Managed service | Free (OSS) |
⚠️ Vanna lock-in: Vanna's fine-tuning is proprietary; your data is on their servers. For sensitive data, stick with open-source (LangChain + local examples).
# Text-to-SQL with Vanna: RAG + fine-tuning approach
# pip install 'vanna[openai,chromadb]'
from vanna.openai import OpenAI_Chat
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={"api_key": "sk-...", "model": "gpt-4o-mini"})
vn.connect_to_postgres(host="localhost", dbname="sales", user="analyst", password="...")

# Train: feed it your schema, DDL, and example Q&A pairs
vn.train(ddl="""
CREATE TABLE orders (order_id INT, customer_id INT, product_sku VARCHAR,
                     quantity INT, unit_price DECIMAL, order_date DATE, status VARCHAR);
""")
vn.train(
    question="Top 10 products by revenue last month",
    sql="SELECT product_sku, SUM(quantity*unit_price) as revenue "
        "FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL '30 days' "
        "GROUP BY product_sku ORDER BY revenue DESC LIMIT 10;",
)

# Query in natural language
sql = vn.generate_sql("Show me customers who ordered more than 5 times this year")
print(sql)
df = vn.run_sql(sql)
print(df.head())
06 — Safety

Safety and Guardrails

LLMs can generate dangerous SQL (DELETE, DROP) or inefficient queries (Cartesian products). Guardrails are non-negotiable in production.

Safety Checks

Blocklist dangerous operations: Reject queries with DELETE, DROP, ALTER, INSERT. Whitelist tables: Only allow queries against safe tables. Row limits: Cap results at 10k rows; prevent full-table scans. Timeout: Kill queries exceeding 5–10s. Cost limits: For cloud DBs, cap query cost. Validation: Parse SQL, check syntax before execution. Audit: Log all generated queries, results, user, timestamp.

| Guardrail | Implementation | False positive risk |
| --- | --- | --- |
| Blocklist (DELETE, DROP) | Regex or AST parse | None |
| Whitelist tables | Check table names in query | Low (schema dependent) |
| Row limit | Append LIMIT clause | None |
| Timeout | Execute with timeout_ms=5000 | None (slow queries rejected) |
| Cost cap (cloud) | Cloud provider API | None |
⚠️ Guardrail layering: Use multiple defenses (blocklist + timeout + row limit). No single check is bulletproof. An attacker might evade regex but not AST parsing.
# Safety: validate and sanitise generated SQL before execution
import re
import sqlparse
from sqlparse.sql import Statement

ALLOWED_STATEMENTS = {"SELECT"}
BLOCKED_PATTERNS = [
    r'DROP', r'DELETE', r'TRUNCATE', r'INSERT', r'UPDATE', r'EXEC',
    r';\s*SELECT',  # stacked queries
    r'--',          # SQL comments (injection vector)
    r'/\*',         # block comments
]

def validate_sql(sql: str) -> tuple[bool, str]:
    """Returns (is_safe, reason). Only allows single SELECT statements."""
    # Strip and normalise
    sql = sql.strip().rstrip(';')
    # Check for blocked patterns (case-insensitive)
    for pat in BLOCKED_PATTERNS:
        if re.search(pat, sql, re.IGNORECASE):
            return False, f"Blocked pattern detected: {pat}"
    # Parse and check statement type
    parsed = sqlparse.parse(sql)
    if len(parsed) != 1:
        return False, "Only single statements allowed"
    stmt: Statement = parsed[0]
    stmt_type = stmt.get_type()
    if stmt_type not in ALLOWED_STATEMENTS:
        return False, f"Only SELECT allowed, got {stmt_type}"
    return True, "OK"

generated_sql = "SELECT customer_id, SUM(total) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 20"
safe, reason = validate_sql(generated_sql)
print(f"Safe: {safe} — {reason}")
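Validation covers the query text; the row-limit and timeout guardrails from the table above live on the execution side. A sketch, using sqlite3 as a stand-in for a production driver (most warehouses expose their own statement-timeout setting instead of a progress handler):

```python
# Execution-side guardrails: enforce a row limit by appending LIMIT when
# absent, and a wall-clock timeout via sqlite's progress handler.
import re
import sqlite3
import time

def enforce_row_limit(sql: str, max_rows: int = 10_000) -> str:
    # Leave an existing trailing LIMIT alone; otherwise append one.
    if re.search(r"\bLIMIT\s+\d+\s*$", sql, re.IGNORECASE):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"

def run_guarded(conn: sqlite3.Connection, sql: str,
                max_rows: int = 10_000, timeout_s: float = 5.0):
    deadline = time.monotonic() + timeout_s
    # A nonzero return from the handler aborts the running query
    # (sqlite raises OperationalError).
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        return conn.execute(enforce_row_limit(sql, max_rows)).fetchall()
    finally:
        conn.set_progress_handler(None, 0)
```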
07 — Production

Production Challenges

Moving Text-to-SQL from prototype to prod surfaces real issues: schema changes, query errors, monitoring, user expectations.

Key Challenges

Schema versioning: Your schema evolves. How do you handle renamed tables/columns? Query errors: LLM generates invalid SQL. Retry with error feedback? Incorrect results: Query succeeds but answer is wrong. How do you detect this? Latency: Text-to-SQL adds ~500ms–1s over direct SQL. User trust: Users trust SQL but distrust LLM-generated SQL. Requires good UX (show query, allow editing).

Best Practices

Show the query: Always display generated SQL to users. Let them edit or reject. Validate results: Check that row counts are reasonable. Alert on suspiciously high/low results. Cache results: Same question → same SQL. Cache results for re-runs. Error recovery: If SQL fails, retry with error message in prompt. Monitoring: Track success rate, query execution time, user satisfaction. Feedback loop: Log user corrections; use them to improve examples and schema clarity.

MVP approach: Start with read-only queries (SELECT only). Add complexity (JOINs, aggregates) incrementally. Move to write queries (INSERT, UPDATE) only once you're confident. Measure success by user adoption and query success rate.