Convert natural language questions into SQL, pandas code, and API calls. Enable non-technical users to query databases without SQL knowledge.
SQL is powerful but requires technical expertise: non-technical users struggle with JOIN syntax, aggregations, and subqueries. Text-to-SQL bridges this gap: ask a question in natural language, get back results. It democratizes data access.
Business analytics: Execs query sales data without consulting data engineers. Customer support: Agents pull customer records in real-time. Internal tools: Teams explore logs, metrics, historical data. Data exploration: Data scientists prototype queries faster.
SQL is precise; LLMs are probabilistic. A small mistake (wrong table, incorrect JOIN) breaks the query. Safety is critical: you can't let an LLM blindly run DELETE statements. Schema understanding is hard: a thousand-column database requires telling the LLM which columns matter.
The basic pipeline: user question → schema context → LLM generates SQL → validate → execute → explain result.
1. Receive question: "What were our top 5 products by revenue last quarter?"
2. Schema retrieval: Identify relevant tables. Use sparse search (keyword matching), dense search (embedding-based), or both.
3. Context assembly: Build a prompt with the schema description, example queries, and business rules.
4. LLM generation: Prompt the LLM with context + question.
5. Validation: Parse the SQL, check syntax, enforce guardrails (no DELETE, no cross-db access).
6. Execution: Run on the prod DB with a timeout and row limits.
7. Explanation: Summarize results in natural language.
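The steps above can be sketched end to end. This is a minimal sketch, not a production implementation: `llm_complete` stands in for any text-completion client, execution uses SQLite with a hard row limit, and schema retrieval plus the natural-language explanation step are omitted for brevity.

```python
import sqlite3

def answer_question(question, llm_complete, conn, schema_context, row_limit=1000):
    """End-to-end Text-to-SQL pipeline: context -> generate -> validate -> execute."""
    # Context assembly: schema description + the user's question.
    prompt = f"{schema_context}\n\n-- Question: {question}\n-- SQL:"
    # LLM generation (llm_complete is any text-completion callable).
    sql = llm_complete(prompt).strip().rstrip(";")
    # Validation: read-only guardrail before touching the database.
    if not sql.lower().lstrip().startswith("select"):
        raise ValueError(f"Refusing non-SELECT query: {sql!r}")
    # Execution with a hard row limit wrapped around the generated query.
    rows = conn.execute(f"SELECT * FROM ({sql}) LIMIT {row_limit}").fetchall()
    # A real system would ask the LLM to explain the rows; here we return them.
    return sql, rows
```

In production, `conn` should be a read-only connection to a replica, and the validation step should be a full guardrail pass rather than a single prefix check.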
LLM performance on Text-to-SQL is dominated by schema clarity. A vague schema ("customer_id INT") loses to a clear one ("customer_id INT — Foreign key to users.id, required").
Large schemas overwhelm LLMs. For a 100-table database, show the LLM only the ~20 most relevant tables (selected by keyword matching). Techniques: semantic chunking (split by business domain), relevance ranking (use embedding similarity to rank tables), dynamic selection (parse the question first, then retrieve the relevant tables).
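A minimal sketch of the sparse (keyword-overlap) variant of relevance ranking; a dense variant would replace the overlap score with cosine similarity between question and table-description embeddings. The `table_docs` format (table name mapped to a one-line column description) is an assumption for illustration.

```python
def rank_tables(question, table_docs, top_k=20):
    """Rank tables by keyword overlap between the question and each
    table's name + column description (the sparse half of hybrid retrieval)."""
    q_terms = set(question.lower().split())
    scored = []
    for table, columns in table_docs.items():
        doc_terms = set(f"{table} {columns}".lower().split())
        scored.append((len(q_terms & doc_terms), table))
    scored.sort(reverse=True)
    # Keep only tables with at least one matching term.
    return [table for score, table in scored[:top_k] if score > 0]
```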
Providing example queries (few-shot) dramatically improves LLM SQL generation. Examples show style, common patterns, and business logic.
Include 3–5 diverse examples: simple (SELECT from single table), moderate (JOIN 2–3 tables), complex (subqueries, aggregation). Show realistic patterns: how do we typically filter by date? Which tables JOIN together? Use actual question/answer pairs from prod: log successful queries and use them as examples. Group by domain: separate examples for finance queries vs. product queries.
| Few-shot quality | LLM SQL accuracy | Example |
|---|---|---|
| No examples | 50–65% | Baseline, schema only |
| Generic examples (simple CRUD) | 65–75% | SELECT, JOIN, GROUP BY |
| Domain-specific examples | 75–85% | Real questions from your biz |
| Dynamic example selection | 80–90% | Retrieve similar past queries |
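The few-shot assembly described above can be sketched as a prompt builder that interleaves curated (question, SQL) pairs with the schema. The example pairs and schema text here are placeholders, not a required format:

```python
def build_fewshot_prompt(question, schema, examples):
    """Assemble schema + worked examples + the new question into one prompt.
    `examples` is a list of (natural-language question, SQL) pairs."""
    shots = "\n\n".join(f"-- Q: {q}\n{sql}" for q, sql in examples)
    return f"{schema}\n\n{shots}\n\n-- Q: {question}\n-- SQL:"

# Illustrative curated examples (simple and moderate complexity).
EXAMPLES = [
    ("How many users signed up last week?",
     "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days')"),
    ("Total revenue per product?",
     "SELECT product_id, SUM(amount) FROM orders GROUP BY product_id"),
]
```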
Vanna AI is a Text-to-SQL platform (an open-source Python framework with an optional managed service) that combines RAG (retrieving past successful queries) with optional fine-tuning. Instead of relying purely on prompting, it learns from your data.
Indexing: Vanna ingests your schema + successful SQL queries from prod. Retrieval: When a new question arrives, it retrieves similar past queries (by embedding or semantic search). Augmentation: Includes retrieved examples in prompt. Optional fine-tuning: With enough examples (100+), fine-tune a smaller model on your domain.
| Aspect | Vanna | LangChain SQL Agent |
|---|---|---|
| Learning mechanism | RAG + optional FT | Prompt engineering |
| Accuracy (typical) | 75–85% | 60–75% |
| Setup complexity | Medium (index queries) | Low (plug and play) |
| Fine-tuning support | Yes (proprietary) | No (DIY) |
| Cost | Open-source core; paid managed service | Free (OSS) |
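The index/retrieve/augment loop can be sketched generically (this is not Vanna's actual API; the class and method names are illustrative). Word overlap stands in for the embedding search a real system would use:

```python
class QueryStore:
    """Index successful (question, sql) pairs and retrieve similar ones
    to augment future prompts."""
    def __init__(self):
        self.pairs = []

    def train(self, question, sql):
        # Called whenever a generated query executes and the user accepts it.
        self.pairs.append((question, sql))

    def retrieve(self, question, k=3):
        # Rank logged questions by word overlap with the new question.
        q = set(question.lower().split())
        return sorted(self.pairs,
                      key=lambda pair: len(q & set(pair[0].lower().split())),
                      reverse=True)[:k]
```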
LLMs can generate dangerous SQL (DELETE, DROP) or inefficient queries (Cartesian products). Guardrails are non-negotiable in production.
Blocklist dangerous operations: Reject queries containing DELETE, DROP, ALTER, INSERT, UPDATE, or TRUNCATE. Whitelist tables: Only allow queries against safe tables. Row limits: Cap results at 10k rows; prevent full-table scans. Timeout: Kill queries exceeding 5–10s. Cost limits: For cloud DBs, cap query cost. Validation: Parse the SQL and check syntax before execution. Audit: Log every generated query with its results, user, and timestamp.
| Guardrail | Implementation | False positive risk |
|---|---|---|
| Blocklist (DELETE, DROP) | Regex or AST parse | Low (regex can flag keywords inside strings or comments; AST parsing avoids this) |
| Whitelist tables | Check table names in query | Low (schema dependent) |
| Row limit | Append LIMIT clause | None |
| Timeout | Execute with timeout_ms=5000 | None (slow queries rejected) |
| Cost cap (cloud) | Cloud provider API | None |
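The blocklist, whitelist, and row-limit guardrails above can be sketched with regexes as a minimal illustration; a production system would parse the SQL into an AST (e.g. with sqlglot) instead of pattern-matching text.

```python
import re

BLOCKED = re.compile(r"\b(delete|drop|alter|insert|update|truncate|grant)\b", re.I)

def validate_sql(sql, allowed_tables, row_limit=10000):
    """Guardrail pass: blocklist dangerous keywords, whitelist tables,
    and force a row limit before execution."""
    if BLOCKED.search(sql):
        raise ValueError("Blocked operation in query")
    # Crude table extraction: identifiers following FROM/JOIN.
    for table in re.findall(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", sql, re.I):
        if table not in allowed_tables:
            raise ValueError(f"Table not whitelisted: {table}")
    # Append a LIMIT clause if the query lacks one.
    if not re.search(r"\blimit\b", sql, re.I):
        sql = f"{sql.rstrip().rstrip(';')} LIMIT {row_limit}"
    return sql
```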
Moving Text-to-SQL from prototype to prod surfaces real issues: schema changes, query errors, monitoring, user expectations.
Schema versioning: Your schema evolves. How do you handle renamed tables/columns? Query errors: LLM generates invalid SQL. Retry with error feedback? Incorrect results: Query succeeds but answer is wrong. How do you detect this? Latency: Text-to-SQL adds ~500ms–1s over direct SQL. User trust: Users trust SQL but distrust LLM-generated SQL. Requires good UX (show query, allow editing).
Show the query: Always display generated SQL to users. Let them edit or reject. Validate results: Check that row counts are reasonable. Alert on suspiciously high/low results. Cache results: Same question → same SQL. Cache results for re-runs. Error recovery: If SQL fails, retry with error message in prompt. Monitoring: Track success rate, query execution time, user satisfaction. Feedback loop: Log user corrections; use them to improve examples and schema clarity.