Prompt Engineering

LMQL

A query language for LLMs that lets you constrain model outputs using Python control flow and logical constraints.

- SQL-like syntax
- Token-level constraints
- 50× faster than beam search (claimed)

SECTION 01

Why prompting isn't enough for structured output

Ask an LLM "respond with only YES or NO" and sometimes it says "Certainly! The answer is YES." The model knows the constraint exists but still violates it because constraints live in the prompt text, not in the decoding process.

LMQL moves constraints into the decoder. Instead of hoping the model reads your instructions, you inject hard stops, type checks, and value limits at the token-generation level; the model cannot produce output that violates them.
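To make the idea concrete, here is a toy sketch (an illustration of the principle, not LMQL's actual internals) of decoder-level pruning: at each step, only candidate tokens that can still extend to an allowed value survive.

```python
def prune_candidates(prefix, candidates, allowed_values):
    """Keep candidate tokens whose extension of `prefix` is still a
    prefix of at least one allowed value. Everything else is masked
    out before the next token is sampled."""
    keep = []
    for tok in candidates:
        extended = prefix + tok
        if any(v.startswith(extended) for v in allowed_values):
            keep.append(tok)
    return keep

# With allowed outputs {"YES", "NO"}, the polite framing is pruned away
# before it can ever be generated:
print(prune_candidates("", ["Certainly", "YES", "NO", "Y"], {"YES", "NO"}))
# ['YES', 'NO', 'Y']
```

The prompt-level instruction "respond with only YES or NO" becomes unnecessary: no token sequence outside the allowed set is reachable.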

SECTION 02

LMQL in one sentence

LMQL (Language Model Query Language) is a Python superset where LLM calls are first-class expressions and the runtime enforces structural constraints during decoding, making invalid outputs literally unrepresentable.

import lmql

@lmql.query
async def classify_sentiment(review: str):
    '''lmql
    "Review: {review}\n"
    "Sentiment: [LABEL]" where LABEL in ["positive", "negative", "neutral"]
    return LABEL
    '''

LABEL can only be one of those three strings, not "mostly positive" or "I'd say negative".

SECTION 03

Core syntax walkthrough

import lmql

@lmql.query
async def structured_review(product: str):
    '''lmql
    "You are a product reviewer.\n"
    "Product: {product}\n\n"
    "Rating (1-5): [RATING]"        where INT(RATING) and 1 <= int(RATING) <= 5
    "\nPros (one sentence): [PROS]" where len(TOKENS(PROS)) < 30
    "\nCons (one sentence): [CONS]" where len(TOKENS(CONS)) < 30
    return {"rating": int(RATING), "pros": PROS, "cons": CONS}
    '''

# Run it (queries are async; drive them with asyncio from synchronous code)
import asyncio
result = asyncio.run(structured_review(product="MX Master 3 mouse"))
print(result)
# {"rating": 5, "pros": "Ergonomic shape and customisable buttons suit long work sessions.", "cons": "High price point may deter casual users."}

Key primitives:

- [VAR] holes: typed slots the model fills during decoding; each becomes a Python variable
- {expr} interpolation: splices Python values into the prompt string
- where clauses: attach constraints to the hole they follow
- return: ordinary Python, returning the captured variables

SECTION 04

Constraints in depth

LMQL constraints are evaluated eagerly: at each decoding step the runtime prunes token candidates that would make the constraint unsatisfiable. This is faster and more reliable than post-processing.

@lmql.query
async def extract_date(text: str):
    '''lmql
    "Extract the date from: {text}\n"
    "Year:  [YEAR]"  where INT(YEAR) and 1900 <= int(YEAR) <= 2100
    "Month: [MONTH]" where INT(MONTH) and 1 <= int(MONTH) <= 12
    "Day:   [DAY]"   where INT(DAY)   and 1 <= int(DAY)   <= 31
    return {"year": int(YEAR), "month": int(MONTH), "day": int(DAY)}
    '''
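The eager pruning can be illustrated with a small satisfiability check for the integer range constraints above (a sketch of the idea, not LMQL's implementation): a partial digit string is only kept if some completion still lands in range.

```python
def can_complete(partial: str, lo: int, hi: int, max_digits: int = 4) -> bool:
    """Could `partial` (digits generated so far) still extend to an
    integer in [lo, hi]? Brute-force over remaining digits for clarity;
    a real decoder would check this per candidate token."""
    if partial and not partial.isdigit():
        return False
    if partial and lo <= int(partial) <= hi:
        return True
    if len(partial) >= max_digits:
        return False
    return any(can_complete(partial + d, lo, hi, max_digits)
               for d in "0123456789")

# Month constraint 1 <= MONTH <= 12:
print(can_complete("1", 1, 12))   # True: "1", "10", "11", "12" all work
print(can_complete("13", 1, 12))  # False: no suffix brings 13 into range
```

A token that starts "13" for MONTH is rejected the moment it is proposed, so the model never wanders down an invalid path.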

Useful constraint types:

| Constraint | Meaning |
|---|---|
| STOPS_AT(VAR, "\n") | Stop generation at the first newline |
| len(TOKENS(VAR)) < N | Hard token budget |
| VAR in ["a", "b", "c"] | Enum enforcement |
| REGEX(VAR, r"\d{4}-\d{2}-\d{2}") | Regex match (partial, checked eagerly) |
| INT(VAR) | Must decode as a valid integer |

SECTION 05

Decoding strategies

LMQL exposes decoding as a first-class concept:

# Choose one decoding strategy per query:
@lmql.query(decoder="argmax")                   # deterministic greedy: good for classification
@lmql.query(decoder="sample", temperature=0.7)  # stochastic: good for generation
@lmql.query(decoder="beam", n=5)                # beam search: good for structured extraction
async def my_query(): ...

Beam search + constraints is LMQL's killer combo: beam search explores multiple parses of the output simultaneously while constraints prune invalid paths, giving you the best valid output without exhaustive retries.
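As a toy illustration of the combination (a self-contained sketch, not LMQL's decoder): beam search keeps several partial outputs alive while a prefix constraint prunes candidates that can no longer complete to an allowed value.

```python
import heapq

def constrained_beam_search(vocab, score, allowed, beam_width=3, max_len=5):
    """Toy beam search over string 'tokens': candidates that can no
    longer complete to an allowed value are pruned at every step."""
    beams = [("", 0.0)]  # (partial output, cumulative score)
    finished = []
    for _ in range(max_len):
        next_beams = []
        for seq, s in beams:
            for tok in vocab:
                cand = seq + tok
                # Constraint: cand must remain a prefix of some allowed value.
                if not any(v.startswith(cand) for v in allowed):
                    continue
                cand_score = s + score(cand)
                if cand in allowed:
                    finished.append((cand, cand_score))
                else:
                    next_beams.append((cand, cand_score))
        beams = heapq.nlargest(beam_width, next_beams, key=lambda b: b[1])
        if not beams:
            break
    return max(finished, key=lambda b: b[1])[0] if finished else None

labels = {"positive", "negative", "neutral"}
toks = ["pos", "itive", "neg", "ative", "neu", "tral", "very "]
# A toy length penalty stands in for real model log-probabilities:
best = constrained_beam_search(toks, score=lambda c: -len(c), allowed=labels)
print(best)  # 'neutral' scores highest under this toy penalty
```

Note how "very " is pruned immediately (no label starts with it), so the beams only ever explore valid parses.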

SECTION 06

LMQL vs. alternatives

| Tool | Constraint level | Ease of use | Backend support |
|---|---|---|---|
| LMQL | Token-level (hardest) | Medium (new syntax to learn) | OpenAI, HuggingFace, local |
| Instructor | Schema-level (retry loop) | High (pure Python) | Any OpenAI-compatible |
| Outlines | Token-level (regex/CFG) | Medium | HuggingFace, vLLM |
| Guidance | Token-level + interleaving | Medium | Local models best |
| Plain JSON mode | JSON syntax only | Very high | OpenAI, Anthropic |

Use LMQL when you need hard numeric or regex constraints that simple JSON mode can't express. For most schema-validation use cases, Instructor is simpler.
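For contrast, the Instructor-style schema-level approach can be sketched in plain Python as a generate-validate-retry loop (names here are hypothetical stand-ins, not Instructor's API):

```python
def retry_until_valid(generate, validate, max_attempts=3):
    """Schema-level enforcement: generate a full output, validate it
    after the fact, retry on failure. Simpler than token-level
    constraints, but each failure costs an entire generation."""
    for _ in range(max_attempts):
        output = generate()
        if validate(output):
            return output
    raise ValueError(f"no valid output after {max_attempts} attempts")

# Stub generator standing in for an LLM call:
outputs = iter(["Certainly! The answer is YES", "YES"])
result = retry_until_valid(lambda: next(outputs),
                           lambda o: o in {"YES", "NO"})
print(result)  # YES, on the second attempt
```

The trade-off is visible in the sketch: the retry loop pays for whole failed generations, while token-level constraints never generate the invalid output in the first place.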

SECTION 07

Gotchas

Local models only for token-level constraints: True token-by-token constraint enforcement requires access to the model's logits. For hosted APIs (OpenAI, Anthropic) LMQL falls back to prompt-level constraints plus retries, and you lose the hard guarantee.

Constraint conflicts can cause infinite loops: If your constraints are contradictory (e.g., require an integer between 1–5 but the context makes any such number improbable) the decoder may loop. Add a max_len guard.

Debugging is harder: Unlike plain string prompts, LMQL queries require understanding the decoding trace. Use lmql.run(..., verbose=True) to inspect the token-level decisions.

Async everywhere: All LMQL queries are async by default. In synchronous contexts use asyncio.run(my_query(...)).
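A minimal sketch of the sync-context pattern, with a stub coroutine standing in for a decorated query:

```python
import asyncio

# Stub standing in for an @lmql.query function (which is a coroutine).
async def my_query(review: str) -> str:
    return "positive"

# From synchronous code, drive the coroutine with asyncio.run:
label = asyncio.run(my_query("Great mouse, love the scroll wheel."))
print(label)  # positive

# Inside an already-running event loop (e.g. a web handler), await it:
#   label = await my_query(review)
```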

SECTION 08

Production deployment and optimization

LMQL queries can be deployed as services, but several factors affect performance. Constraint checking happens per-token, so tight constraints on long sequences add latency. For production systems, profile your constraints and consider caching frequent queries or using adaptive constraint relaxation.
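Caching frequent queries can be as simple as memoising on the query inputs. A hypothetical sketch (real systems would add TTLs, size bounds, and care around non-hashable arguments):

```python
import asyncio

_cache: dict = {}

async def cached_query(query_fn, *args):
    """Memoise async query results on (function name, args)."""
    key = (query_fn.__name__, args)
    if key not in _cache:
        _cache[key] = await query_fn(*args)
    return _cache[key]

# Stub query that counts how often the 'model' is actually hit:
calls = 0
async def classify(text):
    global calls
    calls += 1
    return "positive"

async def main():
    await cached_query(classify, "great product")
    await cached_query(classify, "great product")  # served from cache
    return calls

print(asyncio.run(main()))  # 1: the second call never reaches the model
```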

Model considerations: Larger models often satisfy constraints more naturally, reducing backtrack cost. Smaller models (7B–13B) may need looser constraints or explicit in-context examples to succeed. Temperature matters too: deterministic modes (temperature=0) hit constraints faster, while sampling explores more freely.

| Deployment pattern | Pros | Cons |
|---|---|---|
| Batch processing (local) | Full constraint power, no API latency | GPU memory, slow for high volume |
| Hosted API fallback | Easy scaling, managed | Constraints become prompt-only |
| vLLM + Outlines | Fast, integrates token constraints | Different API, migration cost |
| Streaming + validation | Responsive UX, post-hoc fix | May need retries, latency tail |

LMQL ecosystem: The project is maintained by the SRI Lab at ETH Zürich and has active open-source contributions. Community extensions include integrations with additional backends (Hugging Face, local models), custom constraint types, and monitoring tools for production deployments. The design philosophy is to make constraint-driven generation accessible to Python developers without requiring deep knowledge of decoding algorithms.

For teams evaluating constraint-driven generation, LMQL is strongest when you need predictable, schema-compliant outputs from hosted or local models. Its learning curve is moderate: a developer familiar with Python and prompting can be productive in hours. The constraint language feels natural once you understand the distinction between prompt-level (soft) and decoder-level (hard) constraints. For new projects, consider LMQL alongside Instructor and Outlines to pick the best fit.

LMQL real-world use cases: Teams use LMQL for structured extraction (contracts, PDFs, forms), multi-turn dialogue systems with hard validation (customer support bots that must collect required fields), and domain-specific reasoning (medical coding, legal document classification). One advantage: LMQL makes implicit constraints explicit in code, so constraints are version-controlled and auditable alongside your application code. This is critical for regulated industries.

Comparison to hand-coded parsing: if your alternative is regex + post-processing, LMQL is often simpler and more robust. If your alternative is prompt-based retry loops (ask again if output is invalid), LMQL is faster and cheaper because it prunes invalid tokens at generation time rather than starting over. For teams building AI products, adopting LMQL early reduces technical debt from constraint-related bugs.

Scaling LMQL queries in production requires careful attention to error handling. If a constraint becomes unsatisfiable (the model cannot generate any valid output), the decoder can hang or loop. Mitigation: set strict max_len limits, add timeout guards, and log failures for analysis. Many teams wrap LMQL in a fallback layer: try the constrained query and, if it fails or times out, fall back to a simpler unconstrained generation with post-hoc validation.
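That fallback layer might look like the following sketch (names are hypothetical; `validate` is your post-hoc check):

```python
import asyncio

async def generate_with_fallback(constrained, unconstrained, validate,
                                 timeout_s=10.0):
    """Try the constrained query first; on timeout or error, fall back
    to unconstrained generation plus post-hoc validation."""
    try:
        return await asyncio.wait_for(constrained(), timeout_s)
    except Exception:  # timeout, or the constrained decoder failed
        output = await unconstrained()
        if not validate(output):
            raise ValueError("fallback output failed validation")
        return output

# Demo with stubs: the constrained path simulates a stuck decoder.
async def slow_constrained():
    await asyncio.sleep(60)  # stands in for an unsatisfiable constraint
    return "never reached"

async def plain_generation():
    return "YES"

result = asyncio.run(generate_with_fallback(
    slow_constrained, plain_generation,
    validate=lambda o: o in {"YES", "NO"}, timeout_s=0.05))
print(result)  # YES, via the fallback path
```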

Advanced LMQL constraint patterns: Beyond simple value constraints, LMQL supports complex logic: nested constraints (constraint A only applies if constraint B is satisfied), context-aware constraints (different rules for different input types), and dynamic constraints (adjust constraints based on previous model outputs). For example, after the model generates a category, you can enforce category-specific field constraints. This flexibility makes LMQL suitable for complex domain-specific applications like form extraction with conditional fields.
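One way to organise such conditional rules in plain Python (a hypothetical sketch; the categories and field names are invented): generate the category first, then look up the constraints to enforce on the remaining fields.

```python
import re

# Hypothetical category-conditional constraint table for form extraction.
FIELD_RULES = {
    "invoice":  {"amount": r"\d+\.\d{2}", "currency": r"USD|EUR|GBP"},
    "contract": {"party": r"[A-Z][A-Za-z ]+", "term_months": r"\d{1,3}"},
}

def rules_for(category: str) -> dict:
    """Pick the field constraints to enforce once the category is known."""
    return FIELD_RULES[category]

def valid_field(category: str, field: str, value: str) -> bool:
    return re.fullmatch(rules_for(category)[field], value) is not None

print(valid_field("invoice", "amount", "19.99"))  # True
print(valid_field("invoice", "currency", "BTC"))  # False
```

In LMQL the lookup would feed a where clause on the later holes; the point is that the rules live in version-controlled code rather than prose instructions.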

Performance tuning: constraint checking adds per-token overhead. For latency-sensitive applications, profile your constraints and consider loosening them (e.g., token budget of 100 instead of 50) if the quality improvement doesn't justify the latency cost. Beam search with constraints is slower than greedy decoding but more robust; for production, choose based on your latency SLA and quality requirements.

Debugging constraint failures is hard because you don't see the hidden pruning process. Use verbose=True to inspect token-by-token decisions, log the final output, and compare it to expected results. If a constraint is never satisfied, check: (1) is the constraint mathematically satisfiable? (2) is the model capable of generating valid outputs? (3) is the constraint too strict? A common fix is to relax the constraint or add an escape hatch (fall back to unconstrained generation after a timeout).

LMQL for enterprise systems: Banks and financial institutions use LMQL to extract structured data from customer communications while ensuring compliance. Insurance companies use LMQL to validate claim forms. The key appeal: hard constraints ensure the model can't produce invalid outputs (e.g., a claim ID that doesn't match the expected format). This reduces downstream data quality issues and auditing overhead. Regulations like HIPAA, SOX, and GDPR require that systems handle data correctly; LMQL helps meet these requirements by making invalid outputs unrepresentable at the decoder (note that this hard guarantee applies to local models; hosted APIs fall back to prompt-level enforcement). For organizations investing in AI infrastructure, LMQL's constraint enforcement is a significant advantage over free-form LLM calls.