Convert natural language questions into SQL, pandas code, and API calls. Enable non-technical users to query databases without SQL knowledge.
SQL is powerful but requires technical expertise: non-technical users struggle with JOIN syntax, aggregations, and subqueries. Text-to-SQL bridges this gap: ask a question in natural language, get back results. It democratizes data access.
Business analytics: Execs query sales data without consulting data engineers. Customer support: Agents pull customer records in real-time. Internal tools: Teams explore logs, metrics, historical data. Data exploration: Data scientists prototype queries faster.
SQL is precise; LLMs are probabilistic. A small mistake (wrong table, incorrect JOIN) breaks the query. Safety is critical: you can't let an LLM blindly run DELETE statements. Schema understanding is hard: a thousand-column database requires telling the LLM which columns matter.
The basic pipeline: user question → schema context → LLM generates SQL → validate → execute → explain result.
1. Receive question: "What were our top 5 products by revenue last quarter?"
2. Schema retrieval: Identify relevant tables. Use sparse search (keyword matching), dense search (embedding-based), or both.
3. Context assembly: Build a prompt with the schema description, example queries, and business rules.
4. LLM generation: Prompt the LLM with context + question.
5. Validation: Parse the SQL, check syntax, enforce guardrails (no DELETE, no cross-db access).
6. Execution: Run on the prod DB with a timeout and row limits.
7. Explanation: Summarize results in natural language.
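The steps above can be sketched end to end. This is a minimal sketch, not a production implementation: `llm_complete` stands in for any text-completion client, execution uses SQLite with a hard row limit, and schema retrieval plus the natural-language explanation step are omitted for brevity.

```python
import sqlite3

def answer_question(question, llm_complete, conn, schema_context, row_limit=1000):
    """End-to-end Text-to-SQL pipeline: context -> generate -> validate -> execute."""
    # Context assembly: schema description + the user's question.
    prompt = f"{schema_context}\n\n-- Question: {question}\n-- SQL:"
    # LLM generation (llm_complete is any text-completion callable).
    sql = llm_complete(prompt).strip().rstrip(";")
    # Validation: read-only guardrail before touching the database.
    if not sql.lower().lstrip().startswith("select"):
        raise ValueError(f"Refusing non-SELECT query: {sql!r}")
    # Execution with a hard row limit wrapped around the generated query.
    rows = conn.execute(f"SELECT * FROM ({sql}) LIMIT {row_limit}").fetchall()
    # A real system would ask the LLM to explain the rows; here we return them.
    return sql, rows
```

In production, `conn` should be a read-only connection to a replica, and the validation step should be a full guardrail pass rather than a single prefix check.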
LLM performance on Text-to-SQL is dominated by schema clarity. A vague schema ("customer_id INT") loses to a clear one ("customer_id INT — Foreign key to users.id, required").
Large schemas overwhelm LLMs. For a 100-table database, show the LLM only the ~20 most relevant tables (selected by keyword matching). Techniques: semantic chunking (split by business domain), relevance ranking (use embedding similarity to rank tables), dynamic selection (parse the question first, then retrieve the relevant tables).
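A minimal sketch of the sparse (keyword-overlap) variant of relevance ranking; a dense variant would replace the overlap score with cosine similarity between question and table-description embeddings. The `table_docs` format (table name mapped to a one-line column description) is an assumption for illustration.

```python
def rank_tables(question, table_docs, top_k=20):
    """Rank tables by keyword overlap between the question and each
    table's name + column description (the sparse half of hybrid retrieval)."""
    q_terms = set(question.lower().split())
    scored = []
    for table, columns in table_docs.items():
        doc_terms = set(f"{table} {columns}".lower().split())
        scored.append((len(q_terms & doc_terms), table))
    scored.sort(reverse=True)
    # Keep only tables with at least one matching term.
    return [table for score, table in scored[:top_k] if score > 0]
```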
Providing example queries (few-shot) dramatically improves LLM SQL generation. Examples show style, common patterns, and business logic.
Include 3–5 diverse examples: simple (SELECT from single table), moderate (JOIN 2–3 tables), complex (subqueries, aggregation). Show realistic patterns: how do we typically filter by date? Which tables JOIN together? Use actual question/answer pairs from prod: log successful queries and use them as examples. Group by domain: separate examples for finance queries vs. product queries.
| Few-shot quality | LLM SQL accuracy | Example |
|---|---|---|
| No examples | 50–65% | Baseline, schema only |
| Generic examples (simple CRUD) | 65–75% | SELECT, JOIN, GROUP BY |
| Domain-specific examples | 75–85% | Real questions from your biz |
| Dynamic example selection | 80–90% | Retrieve similar past queries |
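The few-shot assembly described above can be sketched as a prompt builder that interleaves curated (question, SQL) pairs with the schema. The example pairs and schema text here are placeholders, not a required format:

```python
def build_fewshot_prompt(question, schema, examples):
    """Assemble schema + worked examples + the new question into one prompt.
    `examples` is a list of (natural-language question, SQL) pairs."""
    shots = "\n\n".join(f"-- Q: {q}\n{sql}" for q, sql in examples)
    return f"{schema}\n\n{shots}\n\n-- Q: {question}\n-- SQL:"

# Illustrative curated examples (simple and moderate complexity).
EXAMPLES = [
    ("How many users signed up last week?",
     "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days')"),
    ("Total revenue per product?",
     "SELECT product_id, SUM(amount) FROM orders GROUP BY product_id"),
]
```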
Vanna AI is a Text-to-SQL platform (an open-source Python framework with an optional managed service) that combines RAG (retrieving past successful queries) with optional fine-tuning. Instead of relying purely on prompting, it learns from your data.
Indexing: Vanna ingests your schema + successful SQL queries from prod. Retrieval: When a new question arrives, it retrieves similar past queries (by embedding or semantic search). Augmentation: Includes retrieved examples in prompt. Optional fine-tuning: With enough examples (100+), fine-tune a smaller model on your domain.
| Aspect | Vanna | LangChain SQL Agent |
|---|---|---|
| Learning mechanism | RAG + optional FT | Prompt engineering |
| Accuracy (typical) | 75–85% | 60–75% |
| Setup complexity | Medium (index queries) | Low (plug and play) |
| Fine-tuning support | Yes (proprietary) | No (DIY) |
| Cost | Open-source core; paid managed service | Free (OSS) |
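The index/retrieve/augment loop can be sketched generically (this is not Vanna's actual API; the class and method names are illustrative). Word overlap stands in for the embedding search a real system would use:

```python
class QueryStore:
    """Index successful (question, sql) pairs and retrieve similar ones
    to augment future prompts."""
    def __init__(self):
        self.pairs = []

    def train(self, question, sql):
        # Called whenever a generated query executes and the user accepts it.
        self.pairs.append((question, sql))

    def retrieve(self, question, k=3):
        # Rank logged questions by word overlap with the new question.
        q = set(question.lower().split())
        return sorted(self.pairs,
                      key=lambda pair: len(q & set(pair[0].lower().split())),
                      reverse=True)[:k]
```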
LLMs can generate dangerous SQL (DELETE, DROP) or inefficient queries (Cartesian products). Guardrails are non-negotiable in production.
Blocklist dangerous operations: Reject queries containing DELETE, DROP, ALTER, INSERT, UPDATE, or TRUNCATE. Whitelist tables: Only allow queries against safe tables. Row limits: Cap results at 10k rows; prevent full-table scans. Timeout: Kill queries exceeding 5–10s. Cost limits: For cloud DBs, cap query cost. Validation: Parse the SQL and check syntax before execution. Audit: Log every generated query with its results, user, and timestamp.
| Guardrail | Implementation | False positive risk |
|---|---|---|
| Blocklist (DELETE, DROP) | Regex or AST parse | Low (regex can flag keywords inside strings or comments; AST parsing avoids this) |
| Whitelist tables | Check table names in query | Low (schema dependent) |
| Row limit | Append LIMIT clause | None |
| Timeout | Execute with timeout_ms=5000 | None (slow queries rejected) |
| Cost cap (cloud) | Cloud provider API | None |
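The blocklist, whitelist, and row-limit guardrails above can be sketched with regexes as a minimal illustration; a production system would parse the SQL into an AST (e.g. with sqlglot) instead of pattern-matching text.

```python
import re

BLOCKED = re.compile(r"\b(delete|drop|alter|insert|update|truncate|grant)\b", re.I)

def validate_sql(sql, allowed_tables, row_limit=10000):
    """Guardrail pass: blocklist dangerous keywords, whitelist tables,
    and force a row limit before execution."""
    if BLOCKED.search(sql):
        raise ValueError("Blocked operation in query")
    # Crude table extraction: identifiers following FROM/JOIN.
    for table in re.findall(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", sql, re.I):
        if table not in allowed_tables:
            raise ValueError(f"Table not whitelisted: {table}")
    # Append a LIMIT clause if the query lacks one.
    if not re.search(r"\blimit\b", sql, re.I):
        sql = f"{sql.rstrip().rstrip(';')} LIMIT {row_limit}"
    return sql
```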
Moving Text-to-SQL from prototype to prod surfaces real issues: schema changes, query errors, monitoring, user expectations.
Schema versioning: Your schema evolves. How do you handle renamed tables/columns? Query errors: LLM generates invalid SQL. Retry with error feedback? Incorrect results: Query succeeds but answer is wrong. How do you detect this? Latency: Text-to-SQL adds ~500ms–1s over direct SQL. User trust: Users trust SQL but distrust LLM-generated SQL. Requires good UX (show query, allow editing).
Show the query: Always display generated SQL to users. Let them edit or reject. Validate results: Check that row counts are reasonable. Alert on suspiciously high/low results. Cache results: Same question → same SQL. Cache results for re-runs. Error recovery: If SQL fails, retry with error message in prompt. Monitoring: Track success rate, query execution time, user satisfaction. Feedback loop: Log user corrections; use them to improve examples and schema clarity.