Inversion: fast, reliable structured LLMs

Today we're excited to announce Inversion - our family of structured language models designed to solve the speed, reliability, and reasoning issues in traditional AI systems.

[Chart: Speed. Inference speed in characters per second, 33rd/66th/99th percentile across 600 extraction & reasoning tests. Higher is better.]

Our first generation models are state of the art at structured tasks such as extraction and function calling: they run up to 100× faster with 10× lower latency than the best alternatives, produce 100% reliably structured output with 10,000× less overhead, and offer the deepest support for typed JSON output available anywhere.*

Inversion models do more with less - they use less compute, less time, and less data to produce outputs with higher quality, reliability, and reasoning.

100× faster typed LLM inference

We started building what became Inversion back in February 2023, inspired by the newfound abilities of general-purpose large language models to understand systems and glue together human intent and machine action through natural language and structured data.

As we built products on top of these models, we found that the models were far too unreliable, expensive, and slow for production use in structured data tasks. We needed to create a new kind of model that could handle our workloads in the real world at scale.

The key insight is that structured inference is fundamentally accelerative - and that if we build models that can always reliably output structured data with constraints, we can massively improve both the speed and quality of the outputs.

[Chart: Latency. Time-to-first-token in milliseconds, min/avg/max across 600 extraction & reasoning tests. Lower is better.]

We set ourselves the task of matching the output quality of the best available LLMs on workloads like function calling - or actions/workflows and dynamic UI generation - while bringing response time down from around one minute to under 200 milliseconds, roughly the threshold at which a human perceives a response as instant.

We also wanted to ensure that outputs would always match the data types we expected; in our case, that meant validating against a JSON schema for function arguments or component props.
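As a concrete illustration of the guarantee we wanted (using the off-the-shelf Ajv validator and a made-up schema, not anything Inversion-specific):

import Ajv from 'ajv'

// Illustrative schema for one function's arguments.
const argsSchema = {
  type: 'object',
  properties: {
    city: { type: 'string' },
    days: { type: 'integer', minimum: 1 },
  },
  required: ['city', 'days'],
  additionalProperties: false,
}

const ajv = new Ajv()
const validate = ajv.compile(argsSchema)

// A model with guaranteed structure never produces output that fails this check.
console.log(validate({ city: 'Tokyo', days: 3 }))       // true
console.log(validate({ city: 'Tokyo', days: 'three' })) // false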

Such a system would unlock a Cambrian explosion of newly viable applications for AI with a reliable real-time feel - everything from humanoid robot assistants and game NPCs that react to complex, dynamic environments, to natural language interfaces and agents that can understand and act on complex human intent in the blink of an eye.

This means:

  1. We needed to process schemas/grammars in nearly no time,
  2. We needed to bring down time-to-first-token to nearly no time,
  3. We needed to accelerate inference to over 10,000 char/s.

[Chart: Type compilation. Time to compile output constraints in microseconds, min/avg/max across 400 JSON schema tests. No caching, fully dynamic, identical hardware. Ratios are averages of per-test ratios, not ratios of test averages. Lower is better.]

The first set of components we built were the systems we use to process data structures and constrain model outputs with them. We invented a new kind of compiler and physics-based projection model that achieves stricter constraints than the best comparable libraries for typed JSON generation, with around 10,000× faster compilation.

The Inversion compiler processes a typical never-before-seen JSON schema in around 400 μs (microseconds) and samples model constraints at runtime in around 20 μs, supporting up to 50,000 tokens per second inference with perfectly structured output.
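To make the runtime side concrete, here is a minimal sketch of constraint-guided sampling. The CompiledSchema automaton and logits function are hypothetical stand-ins, not Inversion's internals: at each step, only tokens the compiled schema allows are eligible.

type State = number

// Hypothetical interface to a compiled schema automaton.
interface CompiledSchema {
  start: State
  allowedTokens(state: State): Set<number> // token ids legal at this position
  advance(state: State, token: number): State
  isDone(state: State): boolean
}

// Greedy decoding restricted to schema-legal tokens.
function sampleConstrained(
  logits: (context: number[]) => Float32Array, // stand-in for the model forward pass
  schema: CompiledSchema,
  maxTokens = 512,
): number[] {
  const out: number[] = []
  let state = schema.start
  while (out.length < maxTokens && !schema.isDone(state)) {
    const scores = logits(out)
    let best = -1
    let bestScore = -Infinity
    // Consider only tokens the schema permits at this position.
    for (const t of schema.allowedTokens(state)) {
      if (scores[t] > bestScore) { bestScore = scores[t]; best = t }
    }
    if (best < 0) break // no legal token: schema and tokenizer disagree
    out.push(best)
    state = schema.advance(state, best)
  }
  return out
}

The per-token legality check is where the ~20 μs runtime sampling figure above applies, which is what keeps the constraint from becoming a bottleneck at high token rates.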

[Chart: Type errors. Percentage of outputs with invalid data structure across 600 extraction & reasoning tests. Lower is better.]

Read more about the supported types and constraints in the docs.

The developer experience of knowing you're going to get exactly the type you ask for is bliss.

Always-valid outputs are a game changer for structured workloads, dramatically improving the reliability and reasoning level of LLMs across most tasks. Inversion models often match or beat every other model we've tested, even ones with around 10× or 100× as many parameters.

[Chart: Ability. Percentage of correct outputs across 600 tests, grouped by actions/functions, extraction, and typed data generation. Higher is better.]

We're expanding access to the first generation of Inversion models shortly, and have begun building the next generation of models targeting on the order of 100,000 char/s inference.

Accelerating structured inference

You might be wondering - how does any of this actually work?

Why is Inversion so far ahead on multiple fronts that might otherwise seem at odds with each other, like constraint and speed?

When we started with Inversion, as of early last spring, it was barely faster and barely more reliable than the strongest comparable models. The first step was to achieve guaranteed structure on common JSON types, which is where the compiler originated.

Next, we leveraged the compiled structures to "invert" the inference process, using the constraint of the output to dynamically scale up and down the amount of compute required to produce each token.

You can think of this as switching on and off large clusters of individual neurons in the model based on the position within the output structure, instead of only modifying token sampling. Mathematically, this is a projection from the full model onto a smaller model with undesired sublayers pruned away.
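As an illustration of the idea (a toy sketch with made-up names and a hard-coded policy, not our actual architecture), gating sublayers by position in the output structure looks roughly like this:

// Positions in the output structure demand very different amounts of compute:
// schema-forced punctuation is nearly free, open-ended text needs the full model.
type StructPos = 'punctuation' | 'key' | 'number' | 'freeText'

interface Sublayer {
  apply(hidden: Float32Array): Float32Array
}

// Toy policy: keep only a fraction of sublayers, by position kind.
// In practice this selection would be learned, not hard-coded.
function activeSublayers(layers: Sublayer[], pos: StructPos): Sublayer[] {
  const budget =
    pos === 'punctuation' ? 0.05 :
    pos === 'key' ? 0.25 :
    pos === 'number' ? 0.5 : 1.0
  const keep = Math.max(1, Math.round(layers.length * budget))
  return layers.slice(0, keep)
}

// Forward pass through only the active sublayers: a projection of the full
// model onto a smaller one, chosen per token.
function forward(layers: Sublayer[], hidden: Float32Array, pos: StructPos): Float32Array {
  let h = hidden
  for (const layer of activeSublayers(layers, pos)) h = layer.apply(h)
  return h
}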

[Chart: Throughput over time. Throughput in characters per second across tests of Inversion models, with third-party models shown in grey. Higher is better.]

We built a simple version of this last summer that gained a ~10× inference speed boost, but to achieve the current level of performance we had to completely rewrite our inference and learning systems from scratch, and train new kinds of neural networks to augment the transformers with dynamic acceleration.

Today, our models combine dozens of major improvements and hundreds of minor optimizations over the status quo at every addressable level of the stack, and we have been consistently improving their efficiency month by month.

Inversion v2 & onwards

We're also working on a fundamentally new class of models for the next generation of Inversion. They are not ready yet, but so far we expect another several orders of magnitude of improvement across the board: as many heavy workloads as possible completing in single- or double-digit milliseconds, for fractions of a cent in compute cost, at unprecedented reliability and quality.

One particular improvement we're excited about is in attention, where we're building towards processing especially large input prompts in milliseconds instead of minutes.

We're also excited about:

  • Breaking the trend of pre-trained, generic models by moving to an architecture that adapts to every user on the fly and constantly improves with updated information.
  • Generative UI composition on the fly, with typed sandboxed code generation in real time.
  • Improved multilingual support, with deep understanding of dialect and personal nuance.
  • Much, much more!

We've made incredibly promising advances toward these next generation systems, and we're excited to share more about them in the coming months.

Built for developers

We're aiming to deliver the best possible developer experience. Here's an example of making a request in JS/TS (with zod); the same request can be expressed in Python (with pydantic) or via cURL (with JSON Schema):

import { z } from 'zod'

// Assumes `ai` is an initialized Rysana API client (setup not shown).
const schema = z.object({
  name: z.string(),
  age: z.number().int(),
})

await ai.completions.create({
  schema,
  prompt: 'extract json derulo is 32',
})

Output:
{ "name": "json derulo", "age": 32 }

Our shared future

We see a path into the future where AI evolves with humans, shaping our shared future as powerful & helpful allies. As AI progresses, we will work to ensure that it remains accessible & beneficial for all.

Together we'll create a future where technology augments human genius, trivializes mundane tasks, and empowers everyone to live better lives & pursue their passions.

Join us on this journey as we share insights & access to the technology we're building.

Reach us on X @RysanaAI.


Summary

We've created Inversion - a family of structured language models designed to solve the speed, reliability, and reasoning issues in traditional AI systems.

Our first generation models are state of the art at structured tasks such as extraction and function calling: they run up to 100× faster with 10× lower latency than the best alternatives, produce 100% reliably structured output with 10,000× less overhead, and offer the deepest support for typed JSON output available anywhere.

We're expanding access to the first generation of Inversion models shortly, and have begun building the next generation of models targeting on the order of 100,000 char/s inference.

Be among the first to try Inversion. Sign up for early access.

*All approximate numbers based on 1000 tests as of March-April 2024. Results may vary. Experimental models not yet finalized. Models are tested by a single client making sequential requests to each model API server in rapid succession. Inference speed is calculated as the total output characters divided by the total request time. Throughput is calculated as the total output characters divided by the total request time minus the time to first tokens. Latency is calculated as the time from start of request to receiving the first tokens on the requesting client. Type error rate is the percentage of outputs that fail to parse according to the requested schema.
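
In code, those definitions amount to the following (a sketch; the timing fields are illustrative, not our harness's actual schema):

interface RequestTiming {
  startMs: number      // request sent
  firstTokenMs: number // first tokens received on the client
  endMs: number        // response complete
  outputChars: number
}

function metrics(runs: RequestTiming[]) {
  const chars = runs.reduce((s, r) => s + r.outputChars, 0)
  const totalMs = runs.reduce((s, r) => s + (r.endMs - r.startMs), 0)
  const ttftMs = runs.reduce((s, r) => s + (r.firstTokenMs - r.startMs), 0)
  return {
    inferenceSpeed: chars / (totalMs / 1000),        // chars per second, whole request
    throughput: chars / ((totalMs - ttftMs) / 1000), // excludes time to first tokens
    avgLatencyMs: ttftMs / runs.length,              // mean time to first tokens
  }
}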