May 30, 2024

Every Way To Get Structured Output From LLMs

By Sam Lijin - @sxlijin

Update (Jun 18): check out the discussion on Hacker News and /r/LocalLLaMA! Thanks for all the feedback and comments, folks; keep it coming!

This post will be interesting to you if:

  • you're trying to get structured output from an LLM,
  • you've tried response_format: "json" and function calling and been disappointed by the results,
  • you're tired of stacking regex on regex on regex to extract JSON from an LLM,
  • you're trying to figure out what your options are.

Everyone using LLMs in production runs into this problem sooner or later: what we really want is a magical black box that returns JSON in exactly the format we want. Unfortunately, LLMs return English, not JSON, and it turns out that converting English to JSON is kinda hard.
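To make the problem concrete, here's roughly what the naive approach looks like: a minimal sketch using the OpenAI Python client, where the model name and prompt are just placeholders.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

completion = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": "Extract the name and skills from this resume as JSON: ...",
    }],
)

raw = completion.choices[0].message.content
try:
    data = json.loads(raw)
except json.JSONDecodeError:
    # The model wrapped the JSON in prose, truncated it, or dropped a quote.
    # This is where the regex stacking usually starts.
    data = None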

Here's every framework we could find that solves this problem, and how they compare.

(Disclaimer: as a player in this space, we're a little biased!)

Comparison

For each framework, we compare: language support, whether it handles or prevents malformed JSON, how you build the prompt, whether you have full control over the prompt, how you can see the final prompt, supported model providers, API flavors, how you define output types, and whether it ships a test framework.

BAML

  • Language support: Python, TypeScript, and Ruby (beta); example code for each below
  • Handles/prevents malformed JSON: Yes, using a new Rust-based error-tolerant parser (e.g. it can parse {"foo": "bar})
  • How do I build the prompt? Jinja templates
  • Full control over the prompt? Yes
  • How do I see the final prompt? VSCode extension
  • Supported model providers: ✅ OpenAI, ✅ Azure OpenAI, ✅ Anthropic, ✅ Ollama
  • API flavors: Python/TypeScript: ✅ sync, ✅ async, ✅ streaming; Ruby: ✅ sync, ❌ async, ❌ streaming
  • Output types: BAML schemas, transpiled to Pydantic (Python), TypeScript types, or Sorbet types (Ruby)
  • Test framework: VSCode extension, 🚧 CLI

Instructor

  • Language support: Python and TypeScript; example code for each below
  • Handles/prevents malformed JSON: ⚠️ Supports LLM-based retries (none by default)
  • How do I build the prompt? Build the messages array
  • Full control over the prompt? No (there's an open feature request)
  • How do I see the final prompt? No supported mechanism
  • Supported model providers: Python: ✅ OpenAI, ✅ Anthropic, ✅ Cohere, ✅ Gemini, ✅ LiteLLM; TypeScript: ✅ OpenAI, ⚠️ support for others in beta
  • API flavors: ✅ sync, ✅ async, ✅ streaming
  • Output types: Pydantic (Python), Zod (TypeScript)
  • Test framework: Via the Parea platform (Python)

TypeChat

  • Language support: Python (not on PyPI), TypeScript, and C#/.NET; example code for each below
  • Handles/prevents malformed JSON: ⚠️ Automatic LLM-based retries
  • How do I build the prompt? Pass in a string
  • Full control over the prompt? No
  • How do I see the final prompt? n/a
  • Supported model providers: ✅ OpenAI, ✅ Azure OpenAI, ✅ bring-your-own
  • API flavors: ❌ sync, ✅ async, ❌ streaming
  • Output types: Pydantic (Python), Zod (TypeScript), C# classes (C#/.NET)
  • Test framework: None

Marvin

  • Language support: Python; example code below
  • Handles/prevents malformed JSON: ⚠️ Supports LLM-based retries (none by default)
  • How do I build the prompt? Jinja templates
  • Full control over the prompt? No
  • How do I see the final prompt? No supported mechanism
  • Supported model providers: ✅ OpenAI
  • API flavors: ✅ sync, ✅ async, ❌ streaming
  • Output types: Pydantic
  • Test framework: None

Outlines

  • Language support: Python (example pending)
  • Handles/prevents malformed JSON: ❌ OpenAI; ✅ Self-hosted models can be constrained (structured generation)
  • How do I build the prompt? Pass in a string
  • Full control over the prompt? Yes
  • How do I see the final prompt? n/a
  • Supported model providers: ⚠️ OpenAI [1], ✅ Transformers [2], ✅ llama.cpp, ⚠️ .txt (private beta)
  • API flavors: ✅ sync, ✅ async, ⚠️ streaming [3]
  • Output types: Pydantic, JSON schema, EBNF grammar
  • Test framework: None

Guidance

  • Language support: Python (example pending)
  • Handles/prevents malformed JSON: ⚠️ OpenAI; ✅ Self-hosted models can be constrained (token healing)
  • How do I build the prompt? Pass in a string
  • Full control over the prompt? Yes
  • How do I see the final prompt? n/a
  • Supported model providers: ✅ llama.cpp, ✅ Transformers [2], ⚠️ Anthropic, ⚠️ Azure OpenAI, ⚠️ Cohere, ⚠️ Google AI, ⚠️ LiteLLM, ⚠️ OpenAI [1], ⚠️ Vertex AI
  • API flavors: ✅ sync, ✅ async, ⚠️ streaming [3]
  • Output types: Enums, regex, Pydantic, JSON schema
  • Test framework: None

LMQL

  • Language support: Python (example pending)
  • Handles/prevents malformed JSON: ❌ OpenAI; ✅ Self-hosted models can be constrained (token masking)
  • How do I build the prompt? Pass in a string
  • Full control over the prompt? Yes
  • How do I see the final prompt? n/a
  • Supported model providers: ⚠️ OpenAI [1], ✅ Transformers [2], ⚠️ Azure OpenAI, ✅ llama.cpp, ✅ Replicate
  • API flavors: ✅ sync, ✅ async, ⚠️ streaming [3]
  • Output types: LMQL constraints
  • Test framework: None

JSONformer

  • Language support: Python (example pending)
  • Handles/prevents malformed JSON: ❌ OpenAI; ✅ Self-hosted models can be constrained (content tokens)
  • How do I build the prompt? Pass in a string
  • Full control over the prompt? Yes
  • How do I see the final prompt? n/a
  • Supported model providers: ✅ Transformers [2]
  • API flavors: ✅ sync, ❌ async, ❌ streaming
  • Output types: JSON schema
  • Test framework: None

Firebase Genkit

  • Language support: TypeScript (example pending)
  • Handles/prevents malformed JSON: ❌ No
  • How do I build the prompt? TODO
  • Full control over the prompt? TODO
  • How do I see the final prompt? TODO
  • Supported model providers: ⚠️ Google AI
  • API flavors: TODO
  • Output types: Zod
  • Test framework: None

SGLang

  • Language support: Python
  • Handles/prevents malformed JSON: ❌ OpenAI; ✅ Self-hosted models can be constrained (regex)
  • How do I build the prompt? TODO
  • Full control over the prompt? TODO
  • How do I see the final prompt? TODO
  • Supported model providers: TODO
  • API flavors: TODO
  • Output types: Regex
  • Test framework: None

lm-format-enforcer

  • Language support: Python
  • Handles/prevents malformed JSON: ❌ OpenAI; ✅ Self-hosted models can be constrained (token filtering)
  • How do I build the prompt? TODO
  • Full control over the prompt? TODO
  • How do I see the final prompt? TODO
  • Supported model providers: TODO
  • API flavors: TODO
  • Output types: JSON schema, JSON, regex
  • Test framework: None

*: We've omitted LangChain from this list because we haven't heard of anyone using it in production; look no further than the top posts of all time on /r/LangChain.

**: Honorable mention to Microsoft's AICI, which is building a shim for cooperative constraints implemented in Python/JS on a WASM runtime. We haven't included it in the list because it's lower-level than the others and setup is very involved.

[1]: Applying constraints to OpenAI models can be very error-prone, because the OpenAI API does not expose sufficient information about the underlying model operations for the framework to actually apply constraints effectively. See this discussion about limitations from the LMQL documentation.

[2]: Transformers refers to "HuggingFace Transformers".

[3]: Constrained streaming generation produces partial objects, but offers no good way of interacting with them, since they are not yet parse-able. We only consider a framework to support streaming if it lets you interact with the partial objects (e.g. if you're streaming back an object with properties foo and bar, you can access obj.foo before bar has been streamed to the client).

Criteria

Most of our criteria are pretty self-explanatory, but there are two that we want to call out:

Does it handle/prevent malformed JSON? If so, how?

LLMs make a lot of the same mistakes that humans do when producing JSON (e.g. a } in the wrong place or a missing comma), so it's important that the framework can help you handle these errors.

A lot of frameworks "solve" this by feeding the malformed JSON back into the LLM and asking it to repair the JSON. This kinda works, but it's also slow and expensive. If your LLM calls individually take multiple seconds already, you don't really want to make that even slower!

There are two techniques for handling or preventing this: actually parse the malformed JSON (BAML takes this approach), or constrain the LLM's token generation so that only valid JSON can be produced (this is what Outlines, Guidance, and a few others do).

Parsing the malformed JSON is our preferred approach: it most closely aligns with what the LLM was designed to do (emit tokens), it's fast (parsing takes microseconds), and it's flexible (it works with any LLM). It does have limits: it can't magically make sense of completely nonsensical output, after all.
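To give a flavor of what error-tolerant parsing means, here's a toy repair function (nothing like BAML's actual Rust parser, just an illustration of the idea): it closes unterminated strings and unbalanced brackets so the standard JSON parser can take over.

import json

def repair_json(text: str) -> str:
    out, closers = [], []
    in_string = escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        else:
            if ch == '"':
                in_string = True
            elif ch in "{[":
                closers.append("}" if ch == "{" else "]")
            elif ch in "}]" and closers:
                closers.pop()
        out.append(ch)
    if in_string:
        out.append('"')            # close the dangling string
    out.extend(reversed(closers))  # close any open objects/arrays
    return "".join(out)

print(json.loads(repair_json('{"name": "John Doe", "skills": ["python"')))
# {'name': 'John Doe', 'skills': ['python']}

A real parser goes much further (coercing types, stripping markdown fences, recovering from prose wrapped around the JSON), but the point is that all of this happens locally, with no extra LLM calls.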

Applying constraints to LLM token generation, by contrast, can be robust, but has its own issues: doing this efficiently requires applying runtime transforms to the model itself, so it only works with models you host yourself (e.g. Llama served via Transformers or llama.cpp) and does not work with API-only models like OpenAI's GPT models or Anthropic's Claude.
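For example, with Outlines and a self-hosted model the code looks roughly like this (a sketch: the model name is just an example, and the exact API may differ between Outlines versions):

import outlines
from pydantic import BaseModel

class Resume(BaseModel):
    name: str
    skills: list[str]

# Constrained generation needs logit-level access to the model, which is why
# this works with self-hosted weights but not with API-only models.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Resume)

resume = generator("Extract the name and skills from this resume: ...")
# `resume` is a Resume instance: the sampler was never allowed to emit invalid JSON.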

Can you see the actual prompt? Do you have full control over the prompt?

You might remember this from "Fuck You, Show Me The Prompt".

Prompts are how we "program" LLMs to give us output.

The best way to get an LLM to return structured data is to craft a prompt designed to return data matching your specific schema. To do that, you need to

  1. see the prompt actually getting sent to ChatGPT, and
  2. try different prompts.

Most frameworks, unfortunately, have hardcoded prompt templates baked in, which prevents you from doing either.
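By contrast, here's the level of control you want, shown with a raw OpenAI call (a sketch; the schema hint and wording are placeholders):

import json
from openai import OpenAI

client = OpenAI()

resume_text = "John Doe [...] Experience: Software Engineer Intern [...]"
schema_hint = json.dumps({"name": "string", "skills": ["string"]}, indent=2)

messages = [
    {"role": "system", "content": f"Extract the resume into JSON with this shape:\n{schema_hint}"},
    {"role": "user", "content": resume_text},
]

print(messages)  # (1) see exactly what is being sent to the model
reply = client.chat.completions.create(model="gpt-4o", messages=messages)
# (2) not happy with the output? edit the system message above and rerun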

Example code

For each framework listed above, we've included example code from the framework's documentation showing how you would use it.


BAML (Python)

From baml-examples/fastapi-starter/fast_api_starter/app.py:

from baml_client import b
 
resume = """John Doe [...] Experience: Software Engineer Intern [...]"""
 
async def async_call():
  parsed = await b.ExtractResume(resume)
 
async def streamed_call():
  stream = b.stream.ExtractResume(resume)
  async for partial in stream:
    print(partial) # This is an object with auto complete for the partial Resume type
  response = await stream.get_final_result() # auto complete here to the full Resume type
 

From baml-examples/fastapi-starter/baml_src/extract_resume.baml:

class Resume {
  name string
  education Education[]
  skills string[]
}
 
class Education {
  school string
  degree string
  year int
}
 
function ExtractResume(raw_text: string) -> Resume {
  client GPT4
  prompt #"
    Parse the following resume and return a structured representation of the data in the schema below.
 
    Resume:
    ---
    {{raw_text}}
    ---
 
    Output JSON format (only include these fields, and no others):
    {{ ctx.output_format(prefix=null) }}
 
    Output JSON:
  "#
}

BAML (TS)

From baml-examples/nextjs-starter/app/api/example_baml/route.ts:

import b from './baml_client'
import { Role } from './baml_client/types';
 
// Async call
const result = await b.ClassifyMessage({
    convo: [
        {
            role: Role.Customer,
            content: "I want to cancel my subscription"
        }
    ]
});
 
// Streamed call
const stream = b.stream.ClassifyMessage({
    convo: [
        {
            role: Role.Customer,
            content: "I want to cancel my subscription"
        }
    ]
});
 
for await (const partial of stream) {
    console.log(partial); // Autocompletes to a Category[]
}
const final = await stream.get_final_result(); // Autocompletes to a Category[]

From baml-examples/nextjs-starter/baml_src/classify_message.baml:

enum Category {
    Refund
    CancelOrder
    TechnicalSupport
    AccountIssue
    Question
}
 
class Message {
  role Role
  content string
}
 
enum Role {
  Customer
  Assistant
}
 
template_string PrintMessage(msg: Message, prefix: string?) #"
  {{ _.role('user' if msg.role == "Customer" else 'assistant') }}
  {% if prefix %}
  {{ prefix }}
  {% endif %}
  {{ msg.content }}
"#
 
function ClassifyMessage(convo: Message[]) -> Category[] {
  client GPT4
  prompt #"
    {# 
      Prompts are auto-dedented and trimmed.
      We use JINJA for our prompt syntax
      (but we added some static analysis to make sure it's valid!)
    #}
 
    {{ ctx.output_format(prefix="Classify with the following json:") }}
 
    {% for c in convo %}
    {{ PrintMessage(c, 
      'This is the message to classify:' if loop.last and convo|length > 1 else null
    ) }}
    {% endfor %}
 
    {{ _.role('assistant') }}
    JSON array of categories that match:
  "#
}

BAML (Ruby)

From baml-ruby-starter/examples.rb:

require_relative "baml_client/client"
 
b = Baml::BamlClient.from_directory("baml_src")
 
input = "Can't access my account using my usual login credentials"
classified = b.ClassifyMessage(input: input)
 
puts classified.categories

From baml-ruby-starter/baml_src/classify_message.baml:

enum Category {
    Refund
    CancelOrder
    TechnicalSupport
    AccountIssue
    Question
}
 
class MessageFeatures {
    categories Category[]
}
 
function ClassifyMessage(input: string) -> MessageFeatures {
  client GPT4Turbo
 
  prompt #"
    {# _.role("system") starts a system message #}
    {{ _.role("system") }}
 
    Classify the following INPUT.
 
    {{ ctx.output_format }}
 
    {# This starts a user message #}
    {{ _.role("user") }}
 
    INPUT: {{ input }}
 
    Response:
  "#
}

Instructor (Python)

From simple_prediction.py:

import enum

import instructor
from openai import OpenAI
from pydantic import BaseModel

# Wrap the OpenAI client so that `response_model` is supported
client = instructor.from_openai(OpenAI())

class Labels(str, enum.Enum):
    SPAM = "spam"
    NOT_SPAM = "not_spam"
 
class SinglePrediction(BaseModel):
    """
    Correct class label for the given text
    """
 
    class_label: Labels
 
def classify(data: str) -> SinglePrediction:
    return client.chat.completions.create(
        model="gpt-3.5-turbo-0613",
        response_model=SinglePrediction,
        messages=[
            {
                "role": "user",
                "content": f"Classify the following text: {data}",
            },
        ],
    )  # type: ignore
 
prediction = classify("Hello there I'm a nigerian prince and I want to give you money")
assert prediction.class_label == Labels.SPAM

instructor-js

From simple_prediction/index.ts:

import { z } from "zod"
 
enum CLASSIFICATION_LABELS {
  "SPAM" = "SPAM",
  "NOT_SPAM" = "NOT_SPAM"
}
 
const SimpleClassificationSchema = z.object({
  class_label: z.nativeEnum(CLASSIFICATION_LABELS)
})
 
const createClassification = async (data: string) => {
  const classification = await client.chat.completions.create({
    messages: [{ role: "user", content: `"Classify the following text: ${data}` }],
    model: "gpt-3.5-turbo",
    response_model: { schema: SimpleClassificationSchema, name: "SimpleClassification" },
    max_retries: 3,
    seed: 1
  })
 
  return classification
}
 
const classification = await createClassification(
  "Hello there I'm a nigerian prince and I want to give you money"
)
// OUTPUT: { class_label: 'SPAM' }
 
console.log({ classification })
 
assert(
  classification?.class_label === CLASSIFICATION_LABELS.SPAM,
  `Expected ${classification?.class_label} to be ${CLASSIFICATION_LABELS.SPAM}`
)

TypeChat (Python)

From examples/sentiment/demo.py:

import asyncio
 
import sys
from dotenv import dotenv_values
import schema as sentiment
from typechat import Failure, TypeChatJsonTranslator, TypeChatValidator, create_language_model, process_requests
 
async def main():    
    env_vals = dotenv_values()
    model = create_language_model(env_vals)
    validator = TypeChatValidator(sentiment.Sentiment)
    translator = TypeChatJsonTranslator(model, validator, sentiment.Sentiment)
 
    async def request_handler(message: str):
        result = await translator.translate(message)
        if isinstance(result, Failure):
            print(result.message)
        else:
            result = result.value
            print(f"The sentiment is {result.sentiment}")
 
    file_path = sys.argv[1] if len(sys.argv) == 2 else None
    await process_requests("😀> ", file_path, request_handler)
 
 
if __name__ == "__main__":
    asyncio.run(main())

From examples/sentiment/schema.py:

from dataclasses import dataclass
from typing_extensions import Literal, Annotated, Doc
 
@dataclass
class Sentiment:
    """
    The following is a schema definition for determining the sentiment of a some user input.
    """
 
    sentiment: Annotated[Literal["negative", "neutral", "positive"],
                         Doc("The sentiment for the text")]

TypeChat (TypeScript)

From examples/sentiment/src/main.ts:

import assert from "assert";
import dotenv from "dotenv";
import findConfig from "find-config";
import fs from "fs";
import path from "path";
import { createJsonTranslator, createLanguageModel } from "typechat";
import { processRequests } from "typechat/interactive";
import { createTypeScriptJsonValidator } from "typechat/ts";
import { SentimentResponse } from "./sentimentSchema";
 
const dotEnvPath = findConfig(".env");
assert(dotEnvPath, ".env file not found!");
dotenv.config({ path: dotEnvPath });
 
const model = createLanguageModel(process.env);
const schema = fs.readFileSync(path.join(__dirname, "sentimentSchema.ts"), "utf8");
const validator = createTypeScriptJsonValidator<SentimentResponse>(schema, "SentimentResponse");
const translator = createJsonTranslator(model, validator);
 
// Process requests interactively or from the input file specified on the command line
processRequests("😀> ", process.argv[2], async (request) => {
    const response = await translator.translate(request);
    if (!response.success) {
        console.log(response.message);
        return;
    }
    console.log(`The sentiment is ${response.data.sentiment}`);
});

From examples/sentiment/src/sentimentSchema.ts:

export interface SentimentResponse {
    sentiment: "negative" | "neutral" | "positive";  // The sentiment of the text
}

TypeChat (C#/.NET)

From examples/Sentiment/Program.cs:

using Microsoft.TypeChat;
 
namespace Sentiment;
 
public class SentimentApp : ConsoleApp
{
    JsonTranslator<SentimentResponse> _translator;
 
    public SentimentApp()
    {
        OpenAIConfig config = Config.LoadOpenAI();
        // Although this sample uses config files, you can also load config from environment variables
        // OpenAIConfig config = OpenAIConfig.LoadFromJsonFile("your path");
        // OpenAIConfig config = OpenAIConfig.FromEnvironment();
        _translator = new JsonTranslator<SentimentResponse>(new LanguageModel(config));
    }
 
    public override async Task ProcessInputAsync(string input, CancellationToken cancelToken)
    {
        SentimentResponse response = await _translator.TranslateAsync(input, cancelToken);
        Console.WriteLine($"The sentiment is {response.Sentiment}");
    }
}

From examples/Sentiment/SentimentSchema.cs:

using System.Text.Json.Serialization;
using Microsoft.TypeChat.Schema;
 
namespace Sentiment;
 
public class SentimentResponse
{
    [JsonPropertyName("sentiment")]
    [JsonVocab("negative | neutral | positive")]
    public string Sentiment { get; set; }
}

Marvin

From the Marvin docs:

import marvin
from pydantic import BaseModel
 
class Recipe(BaseModel):
    name: str
    cook_time_minutes: int
    ingredients: list[str]
    steps: list[str]
 
@marvin.fn
def recipe(
    ingredients: list[str], 
    max_cook_time: int = 15, 
    cuisine: str = "North Italy", 
    experience_level:str = "beginner"
) -> Recipe:
    """
    Returns a complete recipe that uses all the `ingredients` and 
    takes less than `max_cook_time`  minutes to prepare. Takes 
    `cuisine` style and the chef's `experience_level` into account 
    as well.
    """

Last thoughts

This is a living document, and we'll be updating it as we learn more about other frameworks.

If you have any questions, comments, or suggestions, feel free to reach out to us on Discord or Twitter at @boundaryml. We're also happy to meet and help with any prompting or AI engineering questions you might have.


Thanks for reading!