A query language for LLMs that lets you constrain model outputs using Python control flow and logical constraints.
Ask an LLM "respond with only YES or NO" and sometimes it says "Certainly! The answer is YES." The model knows the constraint exists but still violates it because constraints live in the prompt text, not in the decoding process.
LMQL moves constraints into the decoder. Instead of hoping the model reads your instructions, you inject hard stops, type checks, and value limits at the token-generation level: the model cannot produce output that violates them.
LMQL (Language Model Query Language) is a Python superset where LLM calls are first-class expressions and the runtime enforces structural constraints during decoding, making invalid outputs literally unrepresentable.
```python
import lmql

@lmql.query
async def classify_sentiment(review: str):
    '''lmql
    "Review: {review}\n"
    "Sentiment: [LABEL]" where LABEL in ["positive", "negative", "neutral"]
    return LABEL
    '''
```
`LABEL` can only be one of those three strings, not "mostly positive" or "I'd say negative".
```python
import asyncio
import lmql

@lmql.query
async def structured_review(product: str):
    '''lmql
    "You are a product reviewer.\n"
    "Product: {product}\n\n"
    "Rating (1-5): [RATING]" where INT(RATING) and 1 <= int(RATING) <= 5
    "\nPros (one sentence): [PROS]" where len(TOKENS(PROS)) < 30
    "\nCons (one sentence): [CONS]" where len(TOKENS(CONS)) < 30
    return {"rating": int(RATING), "pros": PROS, "cons": CONS}
    '''

# Run it (queries are async, so drive them with asyncio in synchronous code)
result = asyncio.run(structured_review(product="MX Master 3 mouse"))
print(result)
# {"rating": 5, "pros": "Ergonomic shape and customisable buttons suit long work sessions.", "cons": "High price point may deter casual users."}
```
Key primitives:
- `[VAR]`: placeholder the model fills in.
- `where`: constraint expression evaluated after each token.
- `INT(VAR)`, `len(TOKENS(VAR))`: built-in predicates.
- `VAR in [...]`: set membership (hard classification).

LMQL constraints are evaluated eagerly: at each decoding step the runtime prunes token candidates that would make the constraint unsatisfiable. This is faster and more reliable than post-processing.
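To build intuition for eager pruning, here is a conceptual sketch (plain Python, not LMQL internals) using a toy character-level vocabulary: a candidate character survives only if at least one allowed value is still reachable from the current prefix.

```python
# Conceptual sketch of eager constraint evaluation: prune candidates
# that would make the constraint unsatisfiable, so invalid outputs
# can never be generated. Toy character-level "vocabulary" for clarity.

ALLOWED = ["positive", "negative", "neutral"]

def valid_next_chars(prefix: str) -> set:
    """Characters that keep at least one allowed value reachable."""
    return {v[len(prefix)] for v in ALLOWED
            if v.startswith(prefix) and len(v) > len(prefix)}

print(sorted(valid_next_chars("")))    # ['n', 'p']
print(sorted(valid_next_chars("ne")))  # ['g', 'u'] (negative or neutral)
```

A real implementation works on the model's token vocabulary and logits, but the pruning principle is the same.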
```python
import lmql

@lmql.query
async def extract_date(text: str):
    '''lmql
    "Extract the date from: {text}\n"
    "Year: [YEAR]" where INT(YEAR) and 1900 <= int(YEAR) <= 2100
    "Month: [MONTH]" where INT(MONTH) and 1 <= int(MONTH) <= 12
    "Day: [DAY]" where INT(DAY) and 1 <= int(DAY) <= 31
    return {"year": int(YEAR), "month": int(MONTH), "day": int(DAY)}
    '''
```
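Note that range constraints guarantee bounds, not calendar validity: February 30 passes the checks above. A cheap post-check is worth adding, sketched here under the assumption that the query returns the `{"year", "month", "day"}` dict shown:

```python
# Range constraints allow impossible dates like Feb 30; validate the
# combination after extraction with the standard library.
import datetime

def is_real_date(d: dict) -> bool:
    try:
        datetime.date(d["year"], d["month"], d["day"])
        return True
    except ValueError:
        return False

print(is_real_date({"year": 2024, "month": 2, "day": 29}))  # True (leap year)
print(is_real_date({"year": 2023, "month": 2, "day": 30}))  # False
```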
Useful constraint types:
| Constraint | Meaning |
|---|---|
| `STOPS_AT(VAR, "\n")` | Stop generation at the first newline |
| `len(TOKENS(VAR)) < N` | Hard token budget |
| `VAR in ["a","b","c"]` | Enum enforcement |
| `REGEX(VAR, r"\d{4}-\d{2}-\d{2}")` | Regex match (partial matches checked eagerly) |
| `INT(VAR)` | Must decode as a valid integer |
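"Checked eagerly" for regex means a partial output survives only if some completion could still match the full pattern. The stdlib `re` module has no partial-match API, so this illustrative sketch brute-forces the check against a fixed positional template for the date pattern above:

```python
# Sketch of eager (partial) regex checking for \d{4}-\d{2}-\d{2}:
# a prefix survives if some completion could still match.
TEMPLATE = "dddd-dd-dd"  # d = any digit, '-' = literal dash

def prefix_ok(s: str) -> bool:
    if len(s) > len(TEMPLATE):
        return False
    return all((c.isdigit() if t == "d" else c == t)
               for c, t in zip(s, TEMPLATE))

print(prefix_ok("2024-0"))  # True  (can still be completed)
print(prefix_ok("2024/"))   # False (pruned immediately)
```

Real constrained decoders do this generically by walking the regex's automaton, but the effect is the same: a token leading to an unmatchable prefix is never emitted.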
LMQL exposes decoding as a first-class concept:
```python
# Pick one decoder per query (shown stacked here only for comparison):
@lmql.query(decoder="argmax")                    # deterministic greedy: good for classification
@lmql.query(decoder="sample", temperature=0.7)   # stochastic: good for generation
@lmql.query(decoder="beam", n=5)                 # beam search: good for structured extraction
async def my_query(): ...
```
Beam search + constraints is LMQL's killer combo: beam search explores multiple parses of the output simultaneously while constraints prune invalid paths, giving you the best valid output without exhaustive retries.
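A toy illustration of that combination (plain Python, not LMQL internals): a tiny character "model" scores continuations, beam search keeps the top candidates, and the constraint prunes any prefix that cannot reach an allowed label, even when the invalid path scores highest.

```python
# Toy constrained beam search: 'x' scores highest but is pruned because
# no allowed label is reachable through it.
ALLOWED = ["yes", "no"]
SCORES = {"y": 0.4, "n": 0.5, "o": 0.3, "x": 0.9}  # toy per-char scores

def step(beams, width=2):
    cand = []
    for prefix, score in beams:
        for ch, p in SCORES.items():
            new = prefix + ch
            # Constraint check: keep only prefixes of allowed labels
            if any(v.startswith(new) for v in ALLOWED):
                cand.append((new, score + p))
    return sorted(cand, key=lambda b: -b[1])[:width]

beams = [("", 0.0)]
for _ in range(2):
    beams = step(beams)
print(beams[0][0])  # 'no': invalid high-scoring 'x' paths never appear
```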
| Tool | Constraint level | Ease of use | Backend support |
|---|---|---|---|
| LMQL | Token-level (hardest) | Medium (new syntax to learn) | OpenAI, HuggingFace, local |
| Instructor | Schema-level (retry loop) | High (pure Python) | Any OpenAI-compatible |
| Outlines | Token-level (regex/CFG) | Medium | HuggingFace, vLLM |
| Guidance | Token-level + interleaving | Medium | Local models best |
| Plain JSON mode | JSON syntax only | Very high | OpenAI, Anthropic |
Use LMQL when you need hard numeric or regex constraints that simple JSON mode can't express. For most schema-validation use cases, Instructor is simpler.
Local models only for token-level constraints: True token-by-token constraint enforcement requires access to the model's logits. For hosted APIs (OpenAI, Anthropic) LMQL falls back to prompt-level constraints plus retries, so you lose the hard guarantee.
Constraint conflicts can cause infinite loops: If your constraints are contradictory (e.g., you require an integer between 1 and 5 but the context makes any such number improbable), the decoder may loop. Add a max_len guard.
Debugging is harder: Unlike plain string prompts, LMQL queries require understanding the decoding trace. Use lmql.run(..., verbose=True) to inspect the token-level decisions.
Async everywhere: All LMQL queries are async by default. In synchronous contexts use asyncio.run(my_query(...)).
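A minimal pattern for bridging that gap, with `my_query` standing in for any `@lmql.query`-decorated coroutine:

```python
# Calling an async LMQL-style query from synchronous code.
import asyncio

async def my_query(x: int) -> int:  # placeholder for a real LMQL query
    return x * 2

def my_query_sync(x: int) -> int:
    """Synchronous wrapper; only call outside a running event loop."""
    return asyncio.run(my_query(x))

print(my_query_sync(21))  # 42
```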
LMQL queries can be deployed as services, but several factors affect performance. Constraint checking happens per-token, so tight constraints on long sequences add latency. For production systems, profile your constraints and consider caching frequent queries or using adaptive constraint relaxation.
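One caching approach, sketched with a hypothetical `classify` query: memoize results keyed by arguments. This is only safe for deterministic decoders (argmax); caching sampled output would hide the variety you asked for.

```python
# Hedged sketch: cache results of a deterministic query by its arguments
# so repeated inputs skip constraint checking entirely.
import asyncio
import functools

def async_cache(fn):
    cache = {}
    @functools.wraps(fn)
    async def wrapper(*args):
        if args not in cache:
            cache[args] = await fn(*args)
        return cache[args]
    return wrapper

calls = 0

@async_cache
async def classify(text: str) -> str:  # placeholder for an LMQL query
    global calls
    calls += 1
    return "positive" if "good" in text else "negative"

async def main():
    a = await classify("good product")
    b = await classify("good product")  # served from cache, no second call
    return a, b, calls

print(asyncio.run(main()))  # ('positive', 'positive', 1)
```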
Model considerations: Larger models often satisfy constraints more naturally, reducing backtrack cost. Smaller models (7Bβ13B) may need looser constraints or explicit in-context examples to succeed. Temperature matters too: deterministic modes (temperature=0) hit constraints faster, while sampling explores more freely.
| Deployment Pattern | Pros | Cons |
|---|---|---|
| Batch processing (local) | Full constraint power, no API latency | GPU memory, slow for high volume |
| Hosted API fallback | Easy scaling, managed | Constraints become prompt-only |
| vLLM + Outlines | Fast, integrates token constraints | Different API, migration cost |
| Streaming + validation | Responsive UX, post-hoc fix | May need retries, latency tail |
LMQL ecosystem: The project is maintained by the SRI Lab at ETH Zurich and has active open-source contributions. Community extensions include integrations with additional backends (Hugging Face, local models), custom constraint types, and monitoring tools for production deployments. The design philosophy is to make constraint-driven generation accessible to Python developers without requiring deep knowledge of decoding algorithms.
For teams evaluating constraint-driven generation, LMQL is strongest when you need predictable, schema-compliant outputs from hosted or local models. Its learning curve is moderate: a developer familiar with Python and prompting can be productive in hours. The constraint language feels natural once you understand the distinction between prompt-level (soft) and decoder-level (hard) constraints. For new projects, consider LMQL alongside Instructor and Outlines to pick the best fit.
LMQL real-world use cases: Teams use LMQL for structured extraction (contracts, PDFs, forms), multi-turn dialogue systems with hard validation (customer support bots that must collect required fields), and domain-specific reasoning (medical coding, legal document classification). One advantage: LMQL makes implicit constraints explicit in code, so constraints are version-controlled and auditable alongside your application code. This is critical for regulated industries.
Comparison to hand-coded parsing: if your alternative is regex + post-processing, LMQL is often simpler and more robust. If your alternative is prompt-based retry loops (ask again if output is invalid), LMQL is faster and cheaper because it prunes invalid tokens at generation time rather than starting over. For teams building AI products, adopting LMQL early reduces technical debt from constraint-related bugs.
Scaling LMQL queries in production requires careful attention to error handling. If a constraint becomes unsatisfiable (the model can't generate valid output), the decoder hangs. Mitigation: set strict max_len limits, add timeout guards, and log failures for analysis. Many teams wrap LMQL in a fallback layer: try the constrained query; if it fails after a timeout, fall back to a simpler unconstrained generation with post-hoc validation.
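That fallback layer can be a few lines of asyncio. Both query functions below are placeholders (the constrained one simulates a stall to show the timeout path):

```python
# Timeout-guarded constrained query with an unconstrained fallback
# plus post-hoc validation.
import asyncio

async def constrained_query(text: str) -> str:   # placeholder: may stall
    await asyncio.sleep(10)  # simulate an unsatisfiable-constraint hang
    return "never reached"

async def unconstrained_query(text: str) -> str:  # placeholder fallback
    return "negative"

def is_valid(label: str) -> bool:
    return label in {"positive", "negative", "neutral"}

async def classify_with_fallback(text: str, timeout: float = 0.1) -> str:
    try:
        return await asyncio.wait_for(constrained_query(text), timeout)
    except asyncio.TimeoutError:
        result = await unconstrained_query(text)
        return result if is_valid(result) else "unknown"

print(asyncio.run(classify_with_fallback("meh")))  # 'negative' via fallback
```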
Advanced LMQL constraint patterns: Beyond simple value constraints, LMQL supports complex logic: nested constraints (constraint A only applies if constraint B is satisfied), context-aware constraints (different rules for different input types), and dynamic constraints (adjust constraints based on previous model outputs). For example, after the model generates a category, you can enforce category-specific field constraints. This flexibility makes LMQL suitable for complex domain-specific applications like form extraction with conditional fields.
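One way to organize such conditional constraints is as plain data: after the model picks a category, look up the field rules that apply to it. The category names and field rules below are hypothetical, for illustration only:

```python
# Conditional ("context-aware") constraints as a lookup table:
# category-specific field rules selected after the category is known.
FIELD_RULES = {
    "electronics": {"warranty_months": range(0, 61)},   # 0-60 months
    "food":        {"shelf_life_days": range(1, 366)},  # 1-365 days
}

def check(category: str, field: str, value: int) -> bool:
    """True if the value satisfies the rules for this category's field."""
    rules = FIELD_RULES.get(category, {})
    return value in rules.get(field, range(0))  # unknown field: reject

print(check("electronics", "warranty_months", 24))  # True
print(check("food", "shelf_life_days", 400))        # False
```

In an LMQL query the same table can drive the `where` clause applied to each subsequent placeholder.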
Performance tuning: constraint checking adds per-token overhead. For latency-sensitive applications, profile your constraints and consider loosening them (e.g., token budget of 100 instead of 50) if the quality improvement doesn't justify the latency cost. Beam search with constraints is slower than greedy decoding but more robust; for production, choose based on your latency SLA and quality requirements.
Debugging constraint failures is hard because you don't see the hidden pruning process. Use verbose=True to inspect token-by-token decisions, log the final output, and compare to expected results. If a constraint is never satisfied, check: (1) is the constraint mathematically satisfiable? (2) is the model capable of generating valid outputs? (3) is the constraint too strict? Common fix: relax the constraint or add an escape hatch (fall back to unconstrained generation after a timeout).
LMQL for enterprise systems: Banks and financial institutions use LMQL to extract structured data from customer communications while ensuring compliance. Insurance companies use LMQL to validate claim forms. The key appeal: hard constraints ensure the model can't produce invalid outputs (e.g., a claim ID that doesn't match the expected format). This reduces downstream data quality issues and auditing overhead. Regulations like HIPAA, SOX, and GDPR require that systems handle data correctly; LMQL helps meet these requirements by making invalid outputs literally impossible. For organizations investing in AI infrastructure, LMQL's constraint enforcement is a significant advantage over free-form LLM calls.