Every Way To Get Structured Output From LLMs
By Sam Lijin - @sxlijin
Update (Jun 18): check out the discussion on Hacker News and /r/LocalLLaMA! Thanks for all the feedback and comments, folks, and keep it coming!
This post will be interesting to you if:
- you're trying to get structured output from an LLM,
- you've tried `response_format: "json"` and function calling and been disappointed by the results,
- you're tired of stacking regex on regex on regex to extract JSON from an LLM,
- you're trying to figure out what your options are.
Everyone using LLMs in production runs into this problem sooner or later: what we really want is a magical black box that returns JSON in exactly the format we want. Unfortunately, LLMs return English, not JSON, and it turns out that converting English to JSON is kinda hard.
Here's every framework we could find that solves this problem, and how they compare.
(Disclaimer: as a player in this space, we're a little biased!)
Comparison
| Framework | Language Support | Does it handle or prevent malformed JSON? | How do I build the prompt? | Do I have full control over the prompt? | How do I see the final prompt? | Supported Model Providers | API flavors | How do I define output types? | Test Framework? |
|---|---|---|---|---|---|---|---|---|---|
| BAML | Python | ✅ Yes, using a new Rust-based error-tolerant parser (e.g. can parse `{"foo": "bar}`) | Jinja templates | ✅ Yes | VSCode extension | ✅ OpenAI | ❌ sync | BAML schemas, transpiled to Pydantic | ✅ VSCode Extension 🚧 CLI |
| BAML | TypeScript | | | | | | ❌ sync | BAML schemas, transpiled to TS | |
| BAML | Ruby | | | | | | ✅ sync | BAML schemas, transpiled to Sorbet | |
| Instructor | Python | ⚠️ Supports LLM-based retries (none by default) | Build the messages array | ❌ No (feature request) | No supported mechanism | ✅ OpenAI | ✅ sync | Pydantic | Via the Parea platform |
| Instructor | TypeScript | | Build the messages array | ❌ No | | ✅ OpenAI | ❌ sync | Zod | |
| TypeChat | Python | ⚠️ Automatic LLM-based retries | pass in a string | ❌ No | n/a | ✅ OpenAI | ❌ sync | Pydantic | None |
| TypeChat | TypeScript | | pass in a string | ❌ No | | | | Zod | None |
| TypeChat | C#/.NET | | pass in a string | ❌ No | | | | C# class | None |
| Marvin | Python | ⚠️ Supports LLM-based retries (none by default) | Jinja templates | ❌ No | No supported mechanism | OpenAI | ✅ sync | Pydantic | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | ⚠️ OpenAI¹ | ✅ sync | Pydantic | None |
| | Python (Example pending) | ⚠️ OpenAI | pass in a string | ✅ Yes | n/a | ✅ llama.cpp | ✅ sync | | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | OpenAI¹ | ✅ sync | | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | Transformers² | ✅ sync | JSON schema | None |
| | TypeScript (Example pending) | ❌ No | TODO | TODO | TODO | ⚠️ Google AI | TODO | Zod | None |
| | | ❌ OpenAI | TODO | TODO | TODO | TODO | TODO | Regex | None |
| | | ❌ OpenAI | TODO | TODO | TODO | TODO | TODO | JSON schema | None |
*: we've omitted LangChain from this list because we haven't heard of anyone using it in production - look no further than the top posts of all time on /r/LangChain.
**: Honorable mention to Microsoft's AICI, which is working on creating a shim for cooperative constraints implemented in Python/JS using a WASM runtime. We haven't included it in the list because it is lower-level than the others, and setup is very involved.
1: Applying constraints to OpenAI models can be very error-prone, because the OpenAI API does not expose sufficient information about the underlying model operations for the framework to actually apply constraints effectively. See this discussion about limitations from the LMQL documentation.
2: Transformers refers to "HuggingFace Transformers"
3: Constrained streaming generation produces partial objects, but there are no good ways of interacting with the partial objects, since they are not yet parse-able. We only consider a framework to support streaming if it allows interacting with the partial objects (e.g. if streaming back an object with properties `foo` and `bar`, you can access `obj.foo` before `bar` has been streamed to the client).
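To illustrate what that means, here's a minimal, framework-agnostic sketch of a partial object whose `foo` is usable before `bar` arrives; the streaming source itself is left abstract, and the names here are just for illustration:

```python
from typing import Iterable, Optional
from pydantic import BaseModel

class Partial(BaseModel):
    # All fields are optional, so the object is valid mid-stream.
    foo: Optional[str] = None
    bar: Optional[str] = None

def consume(stream: Iterable[Partial]) -> None:
    for obj in stream:
        # We can act on `foo` even though `bar` hasn't streamed in yet.
        if obj.foo is not None and obj.bar is None:
            print("got foo early:", obj.foo)

# e.g. consume([Partial(foo="hello"), Partial(foo="hello", bar="world")])
```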
Criteria
Most of our criteria are pretty self-explanatory, but there are two that we want to call out:
Does it handle/prevent malformed JSON? If so, how?
LLMs make a lot of the same mistakes that humans do when producing JSON (e.g. a } in the wrong place or a missing comma), so it's important that the framework can help you handle these errors.
A lot of frameworks "solve" this by feeding the malformed JSON back into the LLM and asking it to repair the JSON. This kinda works, but it's also slow and expensive. If your LLM calls individually take multiple seconds already, you don't really want to make that even slower!
Two techniques exist for handling or preventing this: actually parse the malformed JSON (this is the approach BAML takes), or constrain the LLM's token generation so that only valid JSON can be produced (this is what Outlines, Guidance, and a few others do).
Parsing the malformed JSON is our preferred approach: it most closely aligns with what the LLM was designed to do (emit tokens), it's fast (parsing takes microseconds, with no extra LLM calls), and it's flexible (it works with any LLM). It does have limitations: it can't magically make sense of completely nonsensical JSON, after all.
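To make this concrete, here's a toy sketch of the error-tolerant-parsing idea in Python. This is not BAML's actual parser (that one is written in Rust and covers far more failure modes); it just shows how the `{"foo": "bar}` example above can be repaired without another LLM call:

```python
import json

def tolerant_parse(raw: str):
    """Toy, hypothetical repair pass for common LLM JSON mistakes."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    repaired = raw.strip()
    # Strip markdown code fences that models sometimes wrap JSON in.
    if repaired.startswith("```"):
        repaired = repaired.strip("`").removeprefix("json").strip()
    # Close an unterminated string, e.g. {"foo": "bar}
    if repaired.count('"') % 2 == 1:
        if repaired.endswith("}"):
            repaired = repaired[:-1] + '"}'
        else:
            repaired += '"'
    # Balance any missing closing braces/brackets.
    repaired += "}" * (repaired.count("{") - repaired.count("}"))
    repaired += "]" * (repaired.count("[") - repaired.count("]"))
    return json.loads(repaired)

print(tolerant_parse('{"foo": "bar}'))  # -> {'foo': 'bar'}
```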
Applying constraints to LLM token generation, by contrast, can be robust, but has its own issues: doing this efficiently requires applying runtime transforms to the model itself, so this only works with self-hosted models (e.g. Llama, Transformers) and does not work with models like OpenAI's ChatGPT or Anthropic's Claude.
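For the constrained-generation approach, usage looks roughly like this sketch with Outlines (the API names follow Outlines' docs at the time of writing and may have changed since; the model name is just an example). Note that it requires a model you host yourself:

```python
# Constrained generation with Outlines: the schema is compiled into a
# guide that masks invalid tokens at every decoding step, so the output
# is guaranteed to parse. This works because we control decoding, which
# is exactly what the OpenAI API doesn't let us do.
import outlines
from pydantic import BaseModel

class Resume(BaseModel):
    name: str
    skills: list[str]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Resume)

resume = generator("Extract a resume: Jane Doe knows Python and Rust.")
print(resume)  # always a valid Resume instance
```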
Can you see the actual prompt? Do you have full control over the prompt?
Prompts are how we "program" LLMs to give us output.
The best way to get an LLM to return structured data is to craft a prompt designed to return data matching your specific schema. To do that, you need to
- see the prompt actually getting sent to ChatGPT, and
- try different prompts.
Most frameworks, unfortunately, have hardcoded prompt templates baked in, which prevents you from doing either.
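To make that concrete, here's the kind of schema-aware prompt you end up wanting to see and iterate on (a hand-rolled sketch, not any particular framework's template; the schema notation is informal):

```python
# A hand-rolled prompt that bakes the output schema into the request,
# so you can see and tweak exactly what gets sent to the model.
PROMPT_TEMPLATE = """\
Extract the following information from the resume below.

{resume_text}

Answer in JSON using this schema, and nothing else:
{{
  "name": string,
  "skills": string[]
}}
"""

def build_prompt(resume_text: str) -> str:
    return PROMPT_TEMPLATE.format(resume_text=resume_text)

print(build_prompt("JANE DOE\nPython, Rust, distributed systems"))
```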
Example code
For each framework listed above, we've included the example code that the framework's documentation provides for how you would use it.
BAML (Python)
From `baml-examples/fastapi-starter/fast_api_starter/app.py`:

From `baml-examples/fastapi-starter/baml_src/extract_resume.baml`:
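As a rough sketch of how the pieces fit together: you declare the schema and prompt in `baml_src/*.baml`, BAML generates a typed client, and your Python code calls it like a normal function. The import path, function name, and fields below are illustrative assumptions, not the starter repo's exact contents:

```python
# Illustrative only: `ExtractResume` and the `Resume` fields are assumed,
# and the generated client's import path varies across BAML versions.
from baml_client import b

async def extract(resume_text: str):
    resume = await b.ExtractResume(resume_text)
    # `resume` is a Pydantic model transpiled from the BAML schema.
    print(resume.name, resume.skills)
```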
BAML (TS)
From `baml-examples/nextjs-starter/app/api/example_baml/route.ts`:

From `baml-examples/nextjs-starter/baml_src/classify_message.baml`:
BAML (Ruby)
From `baml-ruby-starter/examples.rb`:

From `baml-ruby-starter/baml_src/classify_message.baml`:
Instructor (Python)
From `simple_prediction.py`:
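In rough outline, Instructor's flow looks like this (a condensed sketch based on its documented usage; depending on your Instructor version you'd call `instructor.from_openai` or the older `instructor.patch`, and the model and fields here are placeholders):

```python
# Patch the OpenAI client, then pass a Pydantic model as `response_model`
# and get a validated instance back (with LLM-based retries on failure).
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Jason is 25 years old"}],
)
print(user.name, user.age)
```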
instructor-js
From `simple_prediction/index.ts`:
TypeChat (Python)
From `examples/sentiment/demo.py`:

From `examples/sentiment/schema.py`:
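In rough outline, the sentiment example defines a `TypedDict` schema and asks a "translator" to produce it. This is a sketch patterned on the TypeChat Python package; treat the class names and result API below as assumptions rather than the exact interface:

```python
# Schema-as-TypedDict plus a translator that turns free text into it.
from typing import Annotated, Literal, TypedDict
import typechat

class Sentiment(TypedDict):
    sentiment: Annotated[
        Literal["negative", "neutral", "positive"],
        "The sentiment of the text",
    ]

async def classify(text: str, model: typechat.TypeChatLanguageModel) -> None:
    validator = typechat.TypeChatValidator(Sentiment)
    translator = typechat.TypeChatJsonTranslator(model, validator, Sentiment)
    result = await translator.translate(text)
    if isinstance(result, typechat.Success):
        print(result.value["sentiment"])
    else:
        print("translation failed:", result.message)
```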
TypeChat (TypeScript)
From `examples/sentiment/src/main.ts`:

From `examples/sentiment/src/sentimentSchema.ts`:
TypeChat (C#/.NET)
From `examples/Sentiment/Program.cs`:

From `examples/Sentiment/SentimentSchema.cs`:
Marvin
From the Marvin docs:
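In rough outline, Marvin's structured-output flow looks like this (a sketch of Marvin 2.x's documented `cast` helper; the input string and expected output are illustrative):

```python
# Define the target type with Pydantic, then cast natural language into it.
import marvin
from pydantic import BaseModel

class Location(BaseModel):
    city: str
    state: str

location = marvin.cast("the big apple", target=Location)
print(location)  # roughly: Location(city='New York', state='NY')
```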
Last thoughts
This is a living document, and we'll be updating it as we learn more about other frameworks.
If you have any questions, comments, or suggestions, feel free to reach out to us on Discord or Twitter at @boundaryml. We're also happy to meet and help with any prompting / AI engineering questions you might have.