Every Way To Get Structured Output From LLMs
By Sam Lijin - @sxlijin
Update (Jun 18): check out the discussion on Hacker News and /r/LocalLLaMA! Thanks for all the feedback and comments, folks, and keep it coming!
This post will be interesting to you if:
- you're trying to get structured output from an LLM,
- you've tried `response_format: "json"` and function calling and been disappointed by the results,
- you're tired of stacking regex on regex on regex to extract JSON from an LLM,
- you're trying to figure out what your options are.
Everyone using LLMs in production runs into this problem sooner or later: what we really want is a magical black box that returns JSON in exactly the format we want. Unfortunately, LLMs return English, not JSON, and it turns out that converting English to JSON is kinda hard.
Here's every framework we could find that solves this problem, and how they compare.
(Disclaimer: as a player in this space, we're a little biased!)
Comparison
| Framework | Language Support | Does it handle or prevent malformed JSON? | How do I build the prompt? | Do I have full control over the prompt? | How do I see the final prompt? | Supported Model Providers | API flavors | How do I define output types? | Test Framework? |
|---|---|---|---|---|---|---|---|---|---|
| BAML | Python | ✅ Yes, using a new Rust-based error-tolerant parser (e.g. can parse `{"foo": "bar}`) | Jinja templates | ✅ Yes | VSCode extension | ✅ OpenAI | ❌ sync | BAML schemas, transpiled to Pydantic | ✅ VSCode Extension 🚧 CLI |
| BAML | TypeScript | | | | | | ❌ sync | BAML schemas, transpiled to TS | |
| BAML | Ruby | | | | | | ✅ sync | BAML schemas, transpiled to Sorbet | |
| Instructor | Python | ⚠️ Supports LLM-based retries (none by default) | Build the messages array | ❌ No (feature request) | No supported mechanism | ✅ OpenAI | ✅ sync | Pydantic | Via the Parea platform |
| Instructor | TypeScript | | Build the messages array | ❌ No | | ✅ OpenAI | ❌ sync | Zod | |
| TypeChat | Python | ⚠️ Automatic LLM-based retries | pass in a string | ❌ No | n/a | ✅ OpenAI | ❌ sync | Pydantic | None |
| TypeChat | TypeScript | | pass in a string | ❌ No | | | | Zod | None |
| TypeChat | C#/.NET | | pass in a string | ❌ No | | | | C# class | None |
| Marvin | Python | ⚠️ Supports LLM-based retries (none by default) | Jinja templates | ❌ No | No supported mechanism | OpenAI | ✅ sync | Pydantic | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | ⚠️ OpenAI¹ | ✅ sync | Pydantic | None |
| | Python (Example pending) | ⚠️ OpenAI | pass in a string | ✅ Yes | n/a | ✅ llama.cpp | ✅ sync | | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | OpenAI¹ | ✅ sync | | None |
| | Python (Example pending) | ❌ OpenAI | pass in a string | ✅ Yes | n/a | Transformers² | ✅ sync | JSON schema | None |
| | TypeScript (Example pending) | ❌ No | TODO | TODO | TODO | ⚠️ Google AI | TODO | Zod | None |
| | | ❌ OpenAI | TODO | TODO | TODO | TODO | TODO | Regex | None |
| | | ❌ OpenAI | TODO | TODO | TODO | TODO | TODO | JSON schema | None |
*: we've omitted LangChain from this list because we haven't heard of anyone using it in production - look no further than the top posts of all time on /r/LangChain.
**: Honorable mention to Microsoft's AICI, which is working on creating a shim for cooperative constraints implemented in Python/JS using a WASM runtime. We haven't included it in the list because it is lower-level than the others, and setup is very involved.
1: Applying constraints to OpenAI models can be very error-prone, because the OpenAI API does not expose sufficient information about the underlying model operations for the framework to actually apply constraints effectively. See this discussion about limitations from the LMQL documentation.
2: Transformers refers to "HuggingFace Transformers"
3: Constrained streaming generation produces partial objects, but there are no good ways of interacting with the partial objects, since they are not yet parse-able. We only consider a framework to support streaming if it allows interacting with the partial objects (e.g. if streaming back an object with properties `foo` and `bar`, you can access `obj.foo` before `bar` has been streamed to the client).
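To illustrate what that means, here's a minimal, framework-agnostic sketch of a partial object whose `foo` is usable before `bar` arrives; the streaming source itself is left abstract, and the names here are just for illustration:

```python
from typing import Iterable, Optional
from pydantic import BaseModel

class Partial(BaseModel):
    # All fields are optional, so the object is valid mid-stream.
    foo: Optional[str] = None
    bar: Optional[str] = None

def consume(stream: Iterable[Partial]) -> None:
    for obj in stream:
        # We can act on `foo` even though `bar` hasn't streamed in yet.
        if obj.foo is not None and obj.bar is None:
            print("got foo early:", obj.foo)

# e.g. consume([Partial(foo="hello"), Partial(foo="hello", bar="world")])
```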
Criteria
Most of our criteria are pretty self-explanatory, but there are two that we want to call out:
Does it handle/prevent malformed JSON? If so, how?
LLMs make a lot of the same mistakes that humans do when producing JSON (e.g. a } in the wrong place or a missing comma), so it's important that the framework can help you handle these errors.
A lot of frameworks "solve" this by feeding the malformed JSON back into the LLM and asking it to repair the JSON. This kinda works, but it's also slow and expensive. If your LLM calls individually take multiple seconds already, you don't really want to make that even slower!
Two techniques exist for handling or preventing this: actually parse the malformed JSON (this is the approach BAML takes), or constrain the LLM's token generation so that only valid JSON can be produced (this is what Outlines, Guidance, and a few others do).
Parsing the malformed JSON is our preferred approach: it most closely aligns with what the LLM was designed to do (emit tokens), it's fast (parsing takes microseconds, with no extra LLM calls), and it's flexible (it works with any LLM). It does have limitations: it can't magically make sense of completely nonsensical JSON, after all.
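To make this concrete, here's a toy sketch of the error-tolerant-parsing idea in Python. This is not BAML's actual parser (that one is written in Rust and covers far more failure modes); it just shows how the `{"foo": "bar}` example above can be repaired without another LLM call:

```python
import json

def tolerant_parse(raw: str):
    """Toy, hypothetical repair pass for common LLM JSON mistakes."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    repaired = raw.strip()
    # Strip markdown code fences that models sometimes wrap JSON in.
    if repaired.startswith("```"):
        repaired = repaired.strip("`").removeprefix("json").strip()
    # Close an unterminated string, e.g. {"foo": "bar}
    if repaired.count('"') % 2 == 1:
        if repaired.endswith("}"):
            repaired = repaired[:-1] + '"}'
        else:
            repaired += '"'
    # Balance any missing closing braces/brackets.
    repaired += "}" * (repaired.count("{") - repaired.count("}"))
    repaired += "]" * (repaired.count("[") - repaired.count("]"))
    return json.loads(repaired)

print(tolerant_parse('{"foo": "bar}'))  # -> {'foo': 'bar'}
```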
Applying constraints to LLM token generation, by contrast, can be robust, but has its own issues: doing this efficiently requires applying runtime transforms to the model itself, so this only works with self-hosted models (e.g. Llama, Transformers) and does not work with models like OpenAI's ChatGPT or Anthropic's Claude.
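For the constrained-generation approach, usage looks roughly like this sketch with Outlines (the API names follow Outlines' docs at the time of writing and may have changed since; the model name is just an example). Note that it requires a model you host yourself:

```python
# Constrained generation with Outlines: the schema is compiled into a
# guide that masks invalid tokens at every decoding step, so the output
# is guaranteed to parse. This works because we control decoding, which
# is exactly what the OpenAI API doesn't let us do.
import outlines
from pydantic import BaseModel

class Resume(BaseModel):
    name: str
    skills: list[str]

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Resume)

resume = generator("Extract a resume: Jane Doe knows Python and Rust.")
print(resume)  # always a valid Resume instance
```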
Can you see the actual prompt? Do you have full control over the prompt?
Prompts are how we "program" LLMs to give us output.
The best way to get an LLM to return structured data is to craft a prompt designed to return data matching your specific schema. To do that, you need to
- see the prompt actually getting sent to ChatGPT, and
- try different prompts.
Most frameworks, unfortunately, have hardcoded prompt templates baked in, which prevents you from doing either.
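To make that concrete, here's the kind of schema-aware prompt you end up wanting to see and iterate on (a hand-rolled sketch, not any particular framework's template; the schema notation is informal):

```python
# A hand-rolled prompt that bakes the output schema into the request,
# so you can see and tweak exactly what gets sent to the model.
PROMPT_TEMPLATE = """\
Extract the following information from the resume below.

{resume_text}

Answer in JSON using this schema, and nothing else:
{{
  "name": string,
  "skills": string[]
}}
"""

def build_prompt(resume_text: str) -> str:
    return PROMPT_TEMPLATE.format(resume_text=resume_text)

print(build_prompt("JANE DOE\nPython, Rust, distributed systems"))
```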
Example code
For each framework listed above, we've included the example code that the framework's documentation provides for how you would use it.
BAML (Python)
From `baml-examples/fastapi-starter/fast_api_starter/app.py`:

From `baml-examples/fastapi-starter/baml_src/extract_resume.baml`:
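As a rough sketch of how the pieces fit together: you declare the schema and prompt in `baml_src/*.baml`, BAML generates a typed client, and your Python code calls it like a normal function. The import path, function name, and fields below are illustrative assumptions, not the starter repo's exact contents:

```python
# Illustrative only: `ExtractResume` and the `Resume` fields are assumed,
# and the generated client's import path varies across BAML versions.
from baml_client import b

async def extract(resume_text: str):
    resume = await b.ExtractResume(resume_text)
    # `resume` is a Pydantic model transpiled from the BAML schema.
    print(resume.name, resume.skills)
```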
BAML (TS)
From `baml-examples/nextjs-starter/app/api/example_baml/route.ts`:

From `baml-examples/nextjs-starter/baml_src/classify_message.baml`:
BAML (Ruby)
From `baml-ruby-starter/examples.rb`:

From `baml-ruby-starter/baml_src/classify_message.baml`:
Instructor (Python)
From `simple_prediction.py`:
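In rough outline, Instructor's flow looks like this (a condensed sketch based on its documented usage; depending on your Instructor version you'd call `instructor.from_openai` or the older `instructor.patch`, and the model and fields here are placeholders):

```python
# Patch the OpenAI client, then pass a Pydantic model as `response_model`
# and get a validated instance back (with LLM-based retries on failure).
import instructor
from openai import OpenAI
from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Jason is 25 years old"}],
)
print(user.name, user.age)
```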
instructor-js
From `simple_prediction/index.ts`:
TypeChat (Python)
From `examples/sentiment/demo.py`:

From `examples/sentiment/schema.py`:
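In rough outline, the sentiment example defines a `TypedDict` schema and asks a "translator" to produce it. This is a sketch patterned on the TypeChat Python package; treat the class names and result API below as assumptions rather than the exact interface:

```python
# Schema-as-TypedDict plus a translator that turns free text into it.
from typing import Annotated, Literal, TypedDict
import typechat

class Sentiment(TypedDict):
    sentiment: Annotated[
        Literal["negative", "neutral", "positive"],
        "The sentiment of the text",
    ]

async def classify(text: str, model: typechat.TypeChatLanguageModel) -> None:
    validator = typechat.TypeChatValidator(Sentiment)
    translator = typechat.TypeChatJsonTranslator(model, validator, Sentiment)
    result = await translator.translate(text)
    if isinstance(result, typechat.Success):
        print(result.value["sentiment"])
    else:
        print("translation failed:", result.message)
```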
TypeChat (TypeScript)
From `examples/sentiment/src/main.ts`:

From `examples/sentiment/src/sentimentSchema.ts`:
TypeChat (C#/.NET)
From `examples/Sentiment/Program.cs`:

From `examples/Sentiment/SentimentSchema.cs`:
Marvin
From the Marvin docs:
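In rough outline, Marvin's structured-output flow looks like this (a sketch of Marvin 2.x's documented `cast` helper; the input string and expected output are illustrative):

```python
# Define the target type with Pydantic, then cast natural language into it.
import marvin
from pydantic import BaseModel

class Location(BaseModel):
    city: str
    state: str

location = marvin.cast("the big apple", target=Location)
print(location)  # roughly: Location(city='New York', state='NY')
```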
Last thoughts
This is a living document, and we'll be updating it as we learn more about other frameworks.
If you have any questions, comments, or suggestions, feel free to reach out to us on Discord or Twitter at @boundaryml. We're also happy to meet and help with any prompting / AI engineering questions you might have.