Malik Hamza Shabbir
AI Engineering · structured-outputs · typescript · zod · json-schema

Reliable JSON From LLMs: Structured Outputs Compared 2026

Malik Hamza Shabbir · 7 min read

In short

As of June 2026, all three major LLM providers ship true constrained decoding: OpenAI through strict json_schema, Anthropic through output_config and strict tool use (public beta since November 2025), and Google through Gemini's improved responseSchema. The practical gap is large: plain JSON mode fails to match a non-trivial schema 8-15% of the time, while strict structured outputs hold roughly 99.9% compliance. Every AI auto-reply in my production reputation SaaS now flows through a strict schema, and parse failures are effectively zero.


What are structured outputs and how do they differ from JSON mode?

Structured outputs compile your JSON Schema into a grammar that constrains token sampling at inference, so the model cannot emit a token that would violate the schema. JSON mode only promises syntactically valid JSON. In my testing, that gap is the difference between 8-15% failures and fewer than one failure per thousand requests.

The one-sentence version I keep in my notes: constrained decoding means the provider compiles your schema into a formal grammar and masks every candidate token that would break it before sampling. Invalid output is not filtered after generation; it is impossible during generation.
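To make the masking idea concrete, here is a toy sketch. It is not how any provider implements it: real engines compile the schema into a grammar and mask logits inside the inference loop. The `Candidate` type, the helper names, and the list of allowed outputs are all hypothetical stand-ins for that grammar.

```typescript
// Toy illustration only: a list of allowed outputs stands in for a compiled grammar.
type Candidate = { token: string; logit: number };

// A token is legal if prefix + token is still a prefix of some allowed output.
function isAllowed(prefix: string, token: string, allowedOutputs: string[]): boolean {
  return allowedOutputs.some((out) => out.startsWith(prefix + token));
}

function pickConstrained(
  prefix: string,
  candidates: Candidate[],
  allowedOutputs: string[],
): string {
  // Mask: drop every candidate that would break the "grammar" before choosing.
  const legal = candidates.filter((c) => isAllowed(prefix, c.token, allowedOutputs));
  if (legal.length === 0) throw new Error("no legal token for this prefix");
  // Greedy pick among the legal tokens (real samplers renormalize and sample).
  return legal.reduce((best, c) => (c.logit > best.logit ? c : best)).token;
}

const allowedOutputs = ['{"tone":"warm"}', '{"tone":"formal"}'];
const nextToken = pickConstrained(
  '{"tone":"',
  [
    { token: "friendly", logit: 3.2 }, // most likely, but not in the enum
    { token: "warm", logit: 2.1 },
    { token: "formal", logit: 1.4 },
  ],
  allowedOutputs,
);
console.log(nextToken); // prints: warm
```

The highest-scoring token loses because it was never eligible, which is the whole point: the synonym-for-an-enum failure cannot occur.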

JSON mode, the older feature, guarantees the result parses as JSON and nothing more. The failures I used to see weekly: missing required fields, a string where I asked for a number, hallucinated keys, enums replaced by synonyms, and the classic markdown fence wrapped around otherwise valid JSON. Across my own pipelines, JSON mode without schema enforcement failed between 8% and 15% of the time depending on schema complexity, while OpenAI's strict structured outputs hold about 99.9% schema compliance (under 0.1% failure). That one stat settles the architecture question. As of June 2026, the provider docs themselves treat plain JSON mode as legacy.

How does each provider implement structured outputs in 2026?

OpenAI uses response_format with a json_schema and strict: true. Anthropic shipped native Structured Outputs in public beta in November 2025: output_config.format for response shape and strict: true for tool inputs, gated by the anthropic-beta: structured-outputs-2025-11-13 header on Sonnet 4.5, Opus 4.1, and newer. Gemini uses responseSchema, with noticeably better complex-type handling since its 2026 update.
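As a rough side-by-side, the three request bodies can be sketched as plain objects. The OpenAI and Gemini shapes follow their public REST formats; the Anthropic shape follows the description in this article and should be checked against the current docs. The function names, the `"reply"` schema name, and the `max_tokens` value are placeholders.

```typescript
type JsonSchema = Record<string, unknown>;

// OpenAI Chat Completions: the schema sits under response_format.json_schema.
function openAiBody(model: string, prompt: string, schema: JsonSchema) {
  return {
    model,
    messages: [{ role: "user", content: prompt }],
    response_format: {
      type: "json_schema",
      json_schema: { name: "reply", strict: true, schema }, // "reply" is a placeholder
    },
  };
}

// Anthropic Messages: shape as described in this article (assumption, verify in docs).
// The beta header is sent separately as an HTTP header, not in the body.
function anthropicBody(model: string, prompt: string, schema: JsonSchema) {
  return {
    model,
    max_tokens: 2048,
    messages: [{ role: "user", content: prompt }],
    output_config: { format: { type: "json_schema", schema } },
  };
}

// Gemini generateContent: the model name goes in the URL, not the body.
function geminiBody(prompt: string, schema: JsonSchema) {
  return {
    contents: [{ role: "user", parts: [{ text: prompt }] }],
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: schema,
    },
  };
}
```

Keeping these as pure builders makes the per-provider differences easy to diff and easy to unit test without network calls.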

Three dialect quirks cost me real debugging time. OpenAI's strict mode requires every key to appear in required, so an optional field becomes a null union, which your Zod schema must mirror with .nullable() instead of .optional(). Claude's dialect rejects recursive schemas, and the SDK silently strips numeric and string-length constraints, so z.number().min(1).max(5) still needs client-side validation to enforce the bounds. Gemini grew up on an OpenAPI subset, and even after the 2026 update I see occasional coercion oddities on deeply nested unions.
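The OpenAI quirk is mechanical enough to sketch as a transform over plain JSON Schema. This is a minimal illustration, assuming a schema built only from `type`, `properties`, `required`, and `items`; the `toStrict` name is hypothetical and the function ignores `$defs`, `anyOf`, and other keywords a real converter would need to handle.

```typescript
type Schema = { [key: string]: unknown };

function isObject(v: unknown): v is Schema {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Rewrites object schemas the way strict mode expects: every key listed in
// `required`, formerly optional keys widened to accept null, and
// additionalProperties forced to false. Recurses into properties and items.
function toStrict(schema: Schema): Schema {
  const out: Schema = { ...schema };
  const items = out.items;
  if (isObject(items)) out.items = toStrict(items);

  const properties = out.properties;
  if (out.type === "object" && isObject(properties)) {
    const required = out.required;
    const wasRequired = new Set<unknown>(Array.isArray(required) ? required : []);
    const props: Schema = {};
    for (const [key, value] of Object.entries(properties)) {
      const child = isObject(value) ? toStrict(value) : {};
      // Optional in the source schema becomes "required but nullable".
      props[key] = wasRequired.has(key) ? child : { anyOf: [child, { type: "null" }] };
    }
    out.properties = props;
    out.required = Object.keys(props);
    out.additionalProperties = false;
  }
  return out;
}

const strictSchema = toStrict({
  type: "object",
  properties: { rating: { type: "integer" }, note: { type: "string" } },
  required: ["rating"],
});
console.log(JSON.stringify(strictSchema.required)); // prints: ["rating","note"]
```

On the Zod side the equivalent is writing `.nullable()` where you would otherwise reach for `.optional()`, so the TypeScript type and the wire schema agree that the key is always present.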

The Claude path in TypeScript is short:

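TYPESCRIPT
import Anthropic from "@anthropic-ai/sdk";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import { z } from "zod";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Illustrative schema and prompt; substitute your own
const ReplySchema = z.object({
  reply: z.string().describe("Reply text shown to the reviewer"),
  escalate: z.boolean().describe("True if a human should follow up"),
});
const prompt = "Draft a reply to this review: ...";

const response = await client.messages.parse(
  {
    model: "claude-sonnet-4-5",
    max_tokens: 2048,
    messages: [{ role: "user", content: prompt }],
    output_config: { format: zodOutputFormat(ReplySchema) },
  },
  { headers: { "anthropic-beta": "structured-outputs-2025-11-13" } },
);
// response.parsed_output is typed and already validated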

How do I make one Zod schema the single source of truth in TypeScript?

Diagram comparing constrained decoding for structured outputs across OpenAI, Claude, and Gemini with one shared Zod schema

Define the schema once in Zod, derive the provider payload from it, and validate the response with the same object. Zod 4 ships z.toJSONSchema() natively, so the schema that types my TypeScript code is also the source of the schema the API enforces (minus any constraints a provider dialect drops), and safeParse failures drive a bounded repair loop.

My production pattern in five steps:

  1. Define the schema once in Zod 4, with .describe() on every field. Descriptions reach the model and measurably improve fill quality.

  2. Derive the wire schema with z.toJSONSchema(schema). The zod-to-json-schema package is the fallback if you are stuck on Zod 3.

  3. Send it as response_format (OpenAI), output_config.format (Claude), or responseSchema (Gemini).

  4. Run schema.safeParse on the way back, even though the provider enforced the grammar, because client-side validation catches the bounds the dialects do not support.

  5. On failure, feed the Zod error paths into a correction prompt and retry, capped at two attempts.


The wrapper I use across providers:

TYPESCRIPT
import { z } from "zod";

type ChatMessage = { role: "user" | "assistant"; content: string };
type CallFn = (
  jsonSchema: Record<string, unknown>,
  messages: ChatMessage[],
) => Promise<string>;

export async function safeStructured<T>(
  schema: z.ZodType<T>,
  messages: ChatMessage[],
  call: CallFn,
  maxRetries = 2,
): Promise<T> {
  const jsonSchema = z.toJSONSchema(schema) as Record<string, unknown>;
  let convo = [...messages];

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await call(jsonSchema, convo);

    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      // With constrained decoding this branch almost never fires
      convo = [...convo, { role: "assistant", content: raw },
        { role: "user", content: "That was not valid JSON. Return only corrected JSON." }];
      continue;
    }

    const result = schema.safeParse(parsed);
    if (result.success) return result.data;

    const issues = result.error.issues
      .map((i) => `${i.path.join(".") || "(root)"}: ${i.message}`)
      .join("; ");

    convo = [...convo, { role: "assistant", content: raw },
      { role: "user", content: `The JSON failed validation on these paths: ${issues}. Return only corrected JSON.` }];
  }

  throw new Error(`safeStructured: schema not satisfied after ${maxRetries + 1} attempts`);
}

This is the same contract-first discipline I apply in API development: the schema is the contract, and both sides of the wire validate against it. The wrapper also feeds every extraction pipeline I run, including the fact extraction that populates agent memory. If you are choosing that layer, I compared Mem0, Letta, and Zep for agent memory separately.

When does constrained decoding hurt?

Constrained decoding hurts when the extraction itself requires reasoning. Forcing the model to emit schema-shaped tokens from the first token narrows its working space, and on hard tasks I have measured accuracy drops even though every response parses. The fix is the two-call pattern: reason in prose first, then extract.

On a sarcasm-heavy subset of review data, direct-to-schema extraction scored about four points worse on aspect-level sentiment than letting the model think in free prose and extracting from that prose in a second call. The grammar does not make the model less capable, but it removes the scratch space where intermediate reasoning would normally happen.
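A minimal sketch of the two-call pattern, with both model calls injected as functions so the shape is visible without committing to a provider. The `Complete` type, the function name, and the prompt wording are hypothetical, not the prompts from my pipeline.

```typescript
type Complete = (prompt: string) => Promise<string>;

async function reasonThenExtract<T>(
  input: string,
  reason: Complete, // unconstrained call: free prose, no schema
  extract: Complete, // strict-schema call: can run on a smaller model
  validate: (value: unknown) => T, // e.g. (v) => schema.parse(v) with Zod
): Promise<T> {
  // Call 1: let the model think in prose, with no grammar in the way.
  const analysis = await reason(
    `Reason step by step about the sentiment of each aspect in this review:\n\n${input}`,
  );
  // Call 2: extraction only. The hard judgment has already been made.
  const raw = await extract(
    `Extract the conclusions of this analysis as JSON. Do not re-analyze.\n\n${analysis}`,
  );
  return validate(JSON.parse(raw));
}
```

The second call sees only the analysis, not the original review, which keeps it a pure transcription task and is what makes a cheap model viable there.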

There is also a latency tax. A new schema pays a one-time grammar compilation cost on the first request, and both OpenAI and Anthropic cache the compiled schema for roughly 24 hours, so the first call after a schema change can add noticeable delay while steady-state overhead is negligible. The two-call pattern doubles request count, but the extraction call can run on a small, cheap model since the reasoning already happened. Pair it with the techniques from my guide on reducing LLM API costs with caching and routing and the overhead stays minor.

For simple extraction (contact details, classification, my review replies) a single strict call is fine. Reach for two calls only when the answer requires multi-step judgment.

What failure rates did I measure on the same 50-field schema?

I ran an identical 50-field review-analysis schema against all three providers, 500 requests each, in May 2026. OpenAI strict mode failed zero times, Claude failed once (a max_tokens truncation, my misconfiguration), Gemini failed three times, and the JSON-mode baseline without schema enforcement failed 58 times out of 500.

The schema covered a reviewer profile, per-aspect sentiment with eight enums, a suggested reply, escalation flags, and confidence scores, nested four levels deep. Same prompts, default settings, May 2026 model snapshots.

| Dimension | OpenAI | Anthropic (Claude) | Google (Gemini) |
| --- | --- | --- | --- |
| Request parameter | response_format: { type: "json_schema", strict: true } | output_config.format (type: "json_schema") plus strict: true on tools | generationConfig.responseSchema with JSON MIME type |
| Opt-in status | GA | Public beta, header anthropic-beta: structured-outputs-2025-11-13 | GA, improved handling rolled out 2026 |
| additionalProperties | Must be false on every object | Must be false on every object | Not part of the dialect |
| Optional fields | Every key in required; optional means a nullable union | Omit the key from required | nullable: true, OpenAPI style |
| Unions | anyOf supported, oneOf rejected | anyOf supported | anyOf reliable since 2026, shaky beyond ~3 levels deep |
| Recursion | Supported via $defs and $ref | Not supported | Depth-limited |
| Numeric and length bounds | Not enforced at decode time | Not enforced; SDK strips them | Partially enforced |
| Enums | Supported | Supported | Supported |
| Streaming | Yes | Yes | Yes |
| Typical retry need | Almost never (<0.1%) | Almost never; watch max_tokens truncation | Rare; deep unions occasionally |
| Model coverage | GPT-4o and newer | Sonnet 4.5, Opus 4.1, and newer | Gemini 2.x |

| Provider and mode | Requests | Parse or schema failures | Failure rate |
| --- | --- | --- | --- |
| OpenAI, strict json_schema | 500 | 0 | 0.0% |
| Claude, output_config + beta header | 500 | 1 | 0.2% |
| Gemini, responseSchema | 500 | 3 | 0.6% |
| JSON mode, no schema (baseline) | 500 | 58 | 11.6% |

Claude's single failure was a truncation mid-array; the grammar held right up to the cutoff, so I count it against my config, not the feature. Gemini's three failures were union coercions on the deepest nested object. The baseline lands inside the 8-15% band quoted earlier, which matches what I see across client projects.
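For anyone re-running the benchmark, the failure-rate column is simply failures divided by requests, rounded to one decimal place. A small sketch using the counts reported above (the helper name is hypothetical):

```typescript
const runs = [
  { mode: "OpenAI, strict json_schema", requests: 500, failures: 0 },
  { mode: "Claude, output_config + beta header", requests: 500, failures: 1 },
  { mode: "Gemini, responseSchema", requests: 500, failures: 3 },
  { mode: "JSON mode, no schema (baseline)", requests: 500, failures: 58 },
];

// Percentage rounded to one decimal place.
function failureRatePct(failures: number, requests: number): number {
  return Math.round((failures / requests) * 1000) / 10;
}

for (const run of runs) {
  console.log(`${run.mode}: ${failureRatePct(run.failures, run.requests).toFixed(1)}%`);
}
```

With 500 requests per arm, a single failure moves the rate by 0.2 points, so the gap between the three strict modes is within a handful of requests while the gap to the baseline is not.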

In my reputation SaaS, every AI auto-reply passes through a strict schema: reply text capped at four lines, a tone enum, an escalation boolean. After moving off JSON mode plus regex repair, parse failures went to effectively zero across roughly 6,000 generated replies a month, and the two-retry repair loop has fired twice in the last quarter. What changed most was monitoring.

Once parse failures hit zero, every failure that remains is semantic. The model returns a perfectly valid object with a wrong sentiment label or an invented order ID, and no schema will catch that. Strict schemas do not remove the need for monitoring; they relocate it.

That relocation is why I trace every structured call end to end, using the setup from my guide to AI agent observability in Node.js with OpenTelemetry. The alerts now watch field-level distributions, not parse errors.
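A field-level distribution check can be as small as the sketch below. It is an illustration under assumptions, not my production alerting: the function names, the 0.15 threshold, and the sample labels are made up, and a real setup would read the labels from trace attributes.

```typescript
type Shares = Record<string, number>;

// Turns a window of enum labels into per-label shares that sum to 1.
function distribution(labels: string[]): Shares {
  const shares: Shares = {};
  for (const label of labels) shares[label] = (shares[label] ?? 0) + 1;
  const total = labels.length || 1;
  for (const key of Object.keys(shares)) shares[key] = (shares[key] ?? 0) / total;
  return shares;
}

// Labels whose share moved more than `threshold` (absolute) from the baseline.
function driftedLabels(baseline: Shares, current: Shares, threshold = 0.15): string[] {
  const keys = Array.from(new Set([...Object.keys(baseline), ...Object.keys(current)]));
  return keys.filter(
    (key) => Math.abs((current[key] ?? 0) - (baseline[key] ?? 0)) > threshold,
  );
}

const baselineShares: Shares = { positive: 0.6, neutral: 0.3, negative: 0.1 };
const lastWindow = distribution([
  "positive", "positive", "positive",
  "neutral", "neutral",
  "negative", "negative", "negative", "negative", "negative",
]);
console.log(driftedLabels(baselineShares, lastWindow)); // prints: [ 'positive', 'negative' ]
```

Every object in that window would pass schema validation, which is exactly why the alert has to look at the shape of the values rather than the shape of the JSON.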

My verdict: in 2026, never parse free-text JSON from an LLM in production. One Zod schema should drive the request, the validation, and the retry. This schema-first pattern is the backbone of the AI systems I build for clients, from RAG MVPs to agent backends.

Key takeaways

  • Plain JSON mode fails 8-15% of the time on non-trivial schemas; strict structured outputs hold roughly 99.9% compliance. Treat JSON mode as legacy.

  • All three majors now do true constrained decoding: OpenAI response_format with strict: true, Claude output_config behind the structured-outputs-2025-11-13 beta header, Gemini responseSchema.

  • One Zod schema should be the single source of truth: it generates the wire schema, validates the response, and drives a max-two-attempt repair loop.

  • Constrained decoding can degrade reasoning-heavy extraction. Reason in prose first, then extract with a second cheap call.

  • With syntax solved, monitoring moves up a level: track semantic correctness in traces, not parse success.

FAQ

Is JSON mode the same as structured outputs?

No. JSON mode only guarantees output that parses as valid JSON; it does not enforce your schema, so fields go missing and types drift, failing 8-15% of the time on complex shapes. Structured outputs compile the schema into a decoding grammar, which pushes compliance to roughly 99.9%. In 2026, JSON mode is legacy.

Does Claude support structured outputs natively in 2026?

Yes. Anthropic shipped Structured Outputs in public beta in November 2025. You send `output_config.format` with a JSON schema, or set `strict: true` on tool definitions, plus the `anthropic-beta: structured-outputs-2025-11-13` header. It works on Sonnet 4.5, Opus 4.1, and newer models, and the TypeScript SDK ships a `zodOutputFormat` helper.

Do structured outputs work with streaming responses?

Yes, on all three providers as of June 2026. The grammar constrains each token as it streams, so partial output is always a prefix of valid JSON. You still validate the complete document with `safeParse` at the end; for incremental UI rendering, run a partial-JSON parser over the accumulating buffer.
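For the incremental-rendering case, a best-effort partial parser can be sketched in a few lines: close whatever is still open, try to parse, and give up quietly if the buffer ends at an awkward spot. This is a hypothetical helper, not a library API; dedicated partial-JSON packages handle more edge cases.

```typescript
// Best-effort: close any open string, array, or object, then try to parse.
// Returns undefined when the buffer ends mid-key, mid-literal, or right after
// a comma or colon. A number cut off mid-digits (12 of 123) will still parse,
// so treat intermediate values as provisional.
function parsePartialJson(buffer: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;

  for (let i = 0; i < buffer.length; i++) {
    const ch = buffer.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  const candidate = buffer + (inString ? '"' : "") + closers.reverse().join("");
  try {
    return JSON.parse(candidate);
  } catch {
    return undefined; // keep showing the previous snapshot in the UI
  }
}

console.log(parsePartialJson('{"reply": "Thanks for vis')); // prints: { reply: 'Thanks for vis' }
```

Call it on each chunk and keep the last defined result; the final `safeParse` on the complete document remains the only validation that counts.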

Malik Hamza Shabbir · Full-Stack & AI Engineer

I build full-stack and AI products solo: a reputation SaaS in production, RAG pipelines, and React Native apps. I write from what I ship, not from documentation summaries.
