
Natural Language to SQL in 2026: Why Demo Magic Still Fails in Production

NL-to-SQL is everywhere in 2026, but reliability, governance, and cost control remain unsolved. Here's what production-grade looks like for Postgres and ClickHouse, and why database vendors building native solutions won't save you.

QueryPanel Team
9 min read
nl-to-sql, text-to-sql, postgres, clickhouse, production, governance, mcp, semantic-layer



"Just ask your database questions in plain English."

Everyone's selling this now. And if your data lives across Postgres (operational truth) and ClickHouse (high-volume analytics), you've probably tried a few tools. Maybe they worked in the demo. Then you shipped to production and watched them:

  • Generate plausible SQL that's subtly wrong
  • Pick the wrong metric definition ("revenue" vs "net revenue")
  • Blow up costs with table scans and huge joins
  • Bypass governance (or force you to neuter it until the tool is useless)

In 2026, NL-to-SQL is mainstream. Production-grade NL-to-SQL is still rare.

This post covers what changed this year, why database vendors building their own solutions won't save you, and what it actually takes to make NL-to-SQL dependable for Postgres + ClickHouse.

2026 Reality Check: Three Things Changed

1. Database Vendors Started Building Native AI Features

ClickHouse 25.7 shipped with native AI-powered SQL generation. Type ?? in the client, ask a question in English, get SQL. They also launched ClickHouse.ai as a hosted natural language interface.

This matters because it validates the market. But it also means vendors are optimizing for their own database, not for the reality where your data spans multiple systems. ClickHouse's native feature doesn't know about your Postgres tables, your semantic layer, or your business definitions.

The pattern will repeat. Postgres tooling vendors will ship their own AI features. You'll end up with fragmented solutions that don't talk to each other.

2. MCP Servers Went Remote (and Became Infrastructure)

The Model Context Protocol started as a way to connect LLMs to local tools. The March 2025 spec update changed everything: Streamable HTTP transport, OAuth 2.1 authorization, remote server support.

By late 2025, there were 10,000+ MCP servers in production. In December, Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation, with backing from AWS, Google, Microsoft, and OpenAI.

Now MCP servers are everywhere. Database connections, schema introspection, query execution. But here's the problem: research from Knostic scanned nearly 2,000 MCP servers exposed to the internet. Almost all of them had no authentication.

Remote MCP servers solve the integration problem. They don't solve governance.

3. Benchmarks Proved Multi-Turn SQL Is Still Broken

The BIRD-INTERACT benchmark dropped in June 2025. It tests what actually happens in production: multi-turn conversations, ambiguous queries, error recovery, clarifying questions.

The results are sobering:

  • Claude-3.7-Sonnet: 17.78% success rate on agentic tasks
  • o3-mini: 24.4% on conversational tasks
  • GPT-5: 8.67% on full tasks

Even frontier models fail more than 75% of the time on realistic interactive SQL generation. The benchmark also found that follow-up questions are significantly harder than initial queries, because models struggle to maintain context and reason about changed database states.

Single-shot text-to-SQL benchmarks were never the hard part. Production is a loop.

The 2026 Question

It's not "Can an LLM write SQL?"

It's "Can a system reliably answer data questions safely, cheaply, and correctly across real schemas and real governance?"

That's a different problem.

Why Postgres + ClickHouse Breaks Most Tools

If you run both databases, you already know why the combo is common:

  • Postgres holds transactional and authoritative relational data
  • ClickHouse handles real-time analytics at scale, with engine-specific modeling (MergeTree, partitioning, skip indexes)

But this combo exposes every weakness in NL-to-SQL systems.

Dialect and Modeling Differences

ClickHouse isn't "just another SQL database." Performance depends on ClickHouse-specific choices: MergeTree ORDER BY behavior, partition pruning, skipping indexes.

A generator that ignores these produces queries that are technically valid but slow or expensive. The January 2026 ClickHouse newsletter highlights that platform teams are capturing schema-optimization knowledge in semantic layers. Without that layer, the AI guesses wrong.
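As a rough sketch (the table, column, and index names here are made up), this is the kind of modeling a generator has to respect:

    -- Hypothetical ClickHouse events table; names are illustrative.
    CREATE TABLE events
    (
        event_date  Date,
        event_time  DateTime,
        user_id     UInt64,
        event_type  LowCardinality(String),
        properties  String,
        -- Data-skipping index: lets ClickHouse skip granules that can't match.
        INDEX idx_event_type event_type TYPE set(100) GRANULARITY 4
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)             -- enables partition pruning
    ORDER BY (event_type, user_id, event_time);   -- sort key drives read cost

    -- Cheap: filters line up with the partition key and the front of the sort key.
    SELECT count() FROM events
    WHERE event_date >= '2026-01-01' AND event_type = 'signup';

    -- Expensive: ignores both, forcing a near-full scan.
    SELECT count() FROM events WHERE properties LIKE '%signup%';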

Governance Is Different (and Must Be Enforced Consistently)

In Postgres, you have RBAC, GRANT, role membership, and Row Level Security (RLS) via CREATE POLICY.

In ClickHouse, you have RBAC plus row policies and column restrictions. But ClickHouse explicitly warns that row policies only make sense for read-only users.
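To make that concrete, here is roughly what the two mechanisms look like (the policy, table, and role names are illustrative):

    -- Postgres: row-level security tied to the querying session's context.
    ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
    CREATE POLICY orders_by_tenant ON orders
        USING (tenant_id = current_setting('app.tenant_id')::int);

    -- ClickHouse: a row policy limiting what a read-only role can SELECT.
    CREATE ROW POLICY tenant_filter ON analytics.events
        USING tenant_id = 42 TO analyst_role;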

A production NL-to-SQL layer must enforce these rules by construction. Not as an afterthought. Not by trusting the LLM to "be careful."

Cost Controls Are Mandatory

Postgres has statement_timeout to abort runaway queries. ClickHouse has max_execution_time and related limits.

A serious NL-to-SQL system sets these automatically. Otherwise one vague question turns into an incident.
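For example (the numbers are placeholders, not recommendations):

    -- Postgres: abort anything this session runs for longer than 30 seconds.
    SET statement_timeout = '30s';

    -- ClickHouse: cap a single query's runtime and how many rows it may read.
    SELECT count() FROM events
    WHERE event_date >= today() - 7
    SETTINGS max_execution_time = 30, max_rows_to_read = 1000000000;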

What Production-Grade Actually Means

Here's the minimum bar for teams that successfully deploy NL-to-SQL across Postgres + ClickHouse.

1. Semantic Grounding

Raw schemas aren't enough. You need a semantic layer that defines:

  • Canonical metrics ("active user", "net revenue", "chargeback rate")
  • Join paths and ownership
  • Business time windows and default filters
  • Table/column descriptions that match how humans actually talk

Without this, the model fills gaps with confident nonsense. The result: a dashboard that "looks right" until it's wrong in the meeting.
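One lightweight way to pin a definition is to expose it as a vetted view the generator is told to prefer; the tables and the definition of "net revenue" below are purely illustrative:

    -- Hypothetical canonical metric: one agreed definition of net revenue.
    CREATE VIEW net_revenue_daily AS
    SELECT
        order_date,
        SUM(gross_amount - refunds - chargebacks) AS net_revenue
    FROM finance.orders
    WHERE is_test_account = false   -- the default filter lives in the definition
    GROUP BY order_date;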

2. Dialect-Aware Generation

Good systems don't swap a SQL "dialect string." They reason about:

  • Which storage is appropriate (OLTP vs OLAP)
  • ClickHouse engine constraints (MergeTree sorting affects reads)
  • Function differences and performance traps
  • What not to do (cross joins on event-scale tables, unbounded scans)

For Postgres, this includes using EXPLAIN to inspect plans and catch problems before running heavy queries.
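A minimal version of that check, where the query itself is a stand-in for whatever was generated:

    -- Inspect the plan without executing. A wrapper can reject plans whose
    -- estimated cost or row count exceeds a budget before anything runs.
    EXPLAIN (FORMAT JSON)
    SELECT o.customer_id, sum(o.total) AS total_spend
    FROM orders o
    WHERE o.created_at >= now() - interval '30 days'
    GROUP BY o.customer_id;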

3. Verification Loops

SQL is executable. Production systems take advantage of that:

  • Parse/plan checks before execution
  • Dry runs where available
  • Bounded repair iterations on errors
  • Sanity checks on result shape (row counts, null explosions, impossible dates)

The BIRD-INTERACT findings show that error recovery is where most systems fall apart. Multi-turn verification isn't optional.
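Cheap pre-flight and post-flight checks cover a lot of this; for instance (the queries are illustrative):

    -- ClickHouse: estimate the parts and rows a query would read before running it.
    EXPLAIN ESTIMATE
    SELECT count() FROM events WHERE event_date >= today() - 30;

    -- Shape sanity check: run the generated query bounded, then verify the row
    -- count and null rate look plausible before anyone sees a chart.
    SELECT count(*) AS rows_returned,
           countIf(user_id IS NULL) AS null_user_ids
    FROM (
        SELECT user_id FROM events
        WHERE event_date >= today() - 30
        LIMIT 1000000
    );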

4. Governance-First by Default

A production layer must respect existing controls:

  • Postgres GRANTs and role membership
  • Postgres RLS policies
  • ClickHouse row/column restrictions

Without relying on users to "ask nicely."

5. Cost and Safety Rails

At minimum:

  • Automatic statement_timeout in Postgres
  • Automatic max_execution_time in ClickHouse
  • Forced LIMITs and sampling defaults
  • Query budgets per user/team
  • Audit logs (prompt → generated SQL → executed SQL → result metadata)

Research shows some text-to-SQL approaches cost up to $0.46 per query with 100+ LLM calls. That's not sustainable. Budget control is a feature.
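ClickHouse can enforce per-user budgets natively, and Postgres timeouts can be pinned per role; the values below are placeholders:

    -- ClickHouse: cap what a role can run per hour.
    CREATE QUOTA analyst_budget
        FOR INTERVAL 1 hour
            MAX queries = 200, read_rows = 10000000000
        TO analyst_role;

    -- Postgres: give an analyst role a default statement timeout.
    ALTER ROLE analyst SET statement_timeout = '30s';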

How QueryPanel Handles This

Most NL-to-SQL tools require you to hand over database credentials or route queries through external servers. QueryPanel takes a fundamentally different approach: a zero-trust SDK architecture where your credentials and data never leave your infrastructure.

Zero-Trust by Design

QueryPanel Cloud only receives schema metadata (table names, column types, relationships). It generates SQL and returns it to the SDK running in your backend. Your callback function executes the query locally. QueryPanel never sees your credentials, never connects to your database, and never receives query results.

This matters for production. You can't have a governance layer if data flows through third parties.

Intent and Risk Classification

The SDK classifies whether a request is:

  • BI-style read query
  • Operational investigation
  • Risky or sensitive
  • Ambiguous (requires clarification)

Ambiguous queries trigger clarifying questions before SQL generation.

Retrieval That Prioritizes Meaning

Instead of dumping 3,000 columns into context, the system retrieves:

  • The smallest relevant schema slice
  • Metric definitions from your semantic layer
  • Query templates your team trusts
  • Documentation that explains "what this table actually means"

Dialect-Aware SQL Synthesis

For Postgres: SQL optimized for transactional correctness and safe resource usage.

For ClickHouse: SQL aligned with analytical modeling patterns. We account for MergeTree realities to avoid "valid but slow" queries.

Client-Side Validation and Execution

Since all SQL execution flows through your callback function, you have complete control:

  • Validate generated SQL using your database's native features (like EXPLAIN)
  • Add custom validation logic or query cost checks
  • Set your own statement_timeout (Postgres) or max_execution_time (ClickHouse)
  • Reject queries that don't meet your security policies

The SDK validates SQL against your allowlisted tables before execution. Queries referencing unauthorized tables are rejected locally.

Explainability

Every response includes:

  • The generated SQL (auditable)
  • The assumptions ("using created_at; last 30 days; excluding test accounts")
  • Which definitions were used ("net revenue = ...")
  • Why tables were joined the way they were

That's how a tool becomes something people rely on.

Example: One Question, Two Engines

User asks: "What changed in chargeback rate after we launched the new onboarding flow?"

A production-grade system will often need data from both:

  • Postgres: Transactional tables (orders, payments, disputes)
  • ClickHouse: High-volume event logs (funnel steps, experiment cohorts)

With QueryPanel's SDK:

  1. The system asks a clarifying question if "launch date" or cohort definition is missing
  2. QueryPanel Cloud generates dialect-appropriate SQL for each engine
  3. Your callback functions execute the queries locally on your infrastructure
  4. Your existing RBAC/RLS policies are respected (QueryPanel never bypasses them)
  5. The SDK returns results with the generated SQL and assumptions visible

Your data never leaves your infrastructure. QueryPanel Cloud only sees the question and schema metadata.
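The two generated queries might look roughly like this; the schemas, launch date, and metric definitions are entirely hypothetical:

    -- Postgres: chargeback rate before vs. after the (clarified) launch date.
    SELECT
        CASE WHEN p.created_at >= DATE '2026-01-15' THEN 'after' ELSE 'before' END AS period,
        count(d.id)::numeric / count(p.id) AS chargeback_rate
    FROM payments p
    LEFT JOIN disputes d ON d.payment_id = p.id AND d.kind = 'chargeback'
    WHERE p.created_at >= DATE '2025-12-15'
    GROUP BY 1;

    -- ClickHouse: onboarding funnel completion by week, from high-volume events.
    SELECT
        toStartOfWeek(event_time) AS week,
        countIf(step = 'completed') / countIf(step = 'started') AS completion_rate
    FROM onboarding_events
    WHERE event_date >= '2025-12-15'
    GROUP BY week
    ORDER BY week;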

The Build vs Buy Decision

Database vendors are shipping native AI features. Open-source MCP servers are everywhere. Why not build your own?

You can. But you'll end up maintaining:

  • Semantic layer sync across databases
  • Governance enforcement for each database's permission model
  • Multi-turn conversation handling and error recovery
  • Cost controls and query budgeting
  • Audit logging and compliance tracking
  • Dialect-specific SQL optimization

That's a product, not a weekend project.

Evaluation Checklist

If you're evaluating NL-to-SQL for Postgres + ClickHouse:

  • Does it respect your existing Postgres GRANTs and role membership?
  • Does it work with Postgres RLS policies (not bypass them)?
  • Does it work with ClickHouse row/column restrictions?
  • Can you set timeouts and query budgets in your execution layer?
  • Can it explain the query and assumptions?
  • Does it handle multi-turn reality (clarify, repair, iterate)?
  • Can you control the semantic model (metrics, joins, definitions)?
  • Is everything logged and auditable?
  • Do your credentials and data stay on your infrastructure?

Try It

Integrate the QueryPanel SDK with your backend and point it at a read-only replica. Import a semantic model (or we'll help you create one). Run your top 20 "executive questions" end-to-end.

Judge it on outcomes, not vibes.


Further Reading