Efficient Guided Generation for Large Language Models: Abstract and Intro

2 Jun 2024

Authors:

(1) Brandon T. Willard, Normal Computing;

(2) Rémi Louf, Normal Computing.

Abstract

In this article we show how the problem of neural text generation can be constructively reformulated in terms of transitions between the states of a finite-state machine. This framework leads to an efficient approach to guiding text generation with regular expressions and context-free grammars by allowing the construction of an index over a language model’s vocabulary. The approach is model agnostic, allows one to enforce domain-specific knowledge and constraints, and enables the construction of reliable interfaces by guaranteeing the structure of the generated text. It adds little overhead to the token sequence generation process and significantly outperforms existing solutions. An implementation is provided in the open source Python library Outlines [Louf and Willard].

1. Introduction

We are concerned with the problem of generating sequences of tokens from a large language model (LLM) [Vaswani et al., 2017, Radford et al., 2019] that conform to regular expressions or context-free grammars (CFGs). This kind of guided LLM generation is used to make LLM output usable under rigid formatting requirements that are either hard or costly to capture through fine-tuning alone [Beurer-Kellner et al., 2023, Scholak et al., 2021, Poesia et al., 2022a, Rabinovich et al., 2017, Weng, 2021, Dong et al., 2023, Poesia et al., 2022b, Geng et al., 2023, Wang et al., 2023]. Such features have recently been generalized in prompting libraries and interfaces [Microsoft, 2023, Beurer-Kellner et al., 2023, Rickard, 2023a,b], but their applicability can be limited by their scaling costs.

Most implementations of guided generation bias the score values used to determine the probabilities of the tokens in an LLM’s vocabulary. A common and sufficient approach involves repeated evaluations over the entire vocabulary in order to determine which tokens are valid, given the constraints and the previously sampled tokens, and setting the probabilities of invalid tokens to zero. This approach entails a fixed O(N) cost for each token generated, where N is the size of the LLM’s vocabulary.
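To make the O(N) cost concrete, here is a minimal sketch of the vocabulary-scan approach described above. It is an illustration only: the toy automaton for `[0-9]+(\.[0-9]+)?`, the six-token vocabulary, and the names `step`, `walk`, and `masked_probs` are all assumptions made for this example and are not taken from any existing library.

```python
import math

DIGITS = "0123456789"

# Hypothetical toy DFA for the regular expression [0-9]+(\.[0-9]+)?
# States: 0 = start, 1 = integer part, 2 = just read ".", 3 = fractional part.
def step(state, ch):
    """Return the next DFA state, or None if `ch` is not allowed in `state`."""
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None

def walk(state, token):
    """Feed every character of `token` to the DFA, starting from `state`."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def masked_probs(logits, vocab, state):
    """Softmax over `logits` after masking tokens the DFA rejects.

    Every vocabulary entry is checked at every call, so each generated
    token costs O(N) in the vocabulary size N.
    """
    valid = [walk(state, tok) is not None for tok in vocab]
    if not any(valid):
        raise ValueError("no token in the vocabulary is valid in this state")
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    top = max(masked)
    exps = [math.exp(l - top) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy vocabulary; a real LLM vocabulary has tens of thousands of entries.
vocab = ["1", "23", ".", ".5", "a", "4."]
probs = masked_probs([0.0] * len(vocab), vocab, state=0)
```

With uniform logits and the automaton in its start state, only the tokens that begin with a digit keep nonzero probability. The scan in `masked_probs` is repeated for every generated token, which is the cost the rest of this section aims to avoid.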

We propose an approach that uses the finite state machine (FSM) formulation of regular expressions both to start and stop guided generation at arbitrary points and to allow the construction of an index with which the set of nonzero-probability tokens can be obtained efficiently at each step. The result is an algorithm that costs O(1) on average.
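The indexing idea can be sketched as follows, under simplifying assumptions: the vocabulary is scanned once, ahead of generation, to build a map from each FSM state to the tokens that are valid there and the state each token leads to. The toy automaton for `[0-9]+(\.[0-9]+)?`, the vocabulary, and the names `build_index` and `generate` are hypothetical and chosen for illustration; this is not the Outlines implementation, and uniform random sampling stands in for sampling from an LLM's masked distribution.

```python
import random

DIGITS = "0123456789"
STATES = (0, 1, 2, 3)

# Hypothetical toy DFA for the regular expression [0-9]+(\.[0-9]+)?
# States: 0 = start, 1 = integer part, 2 = just read ".", 3 = fractional part.
def step(state, ch):
    """Return the next DFA state, or None if `ch` is not allowed in `state`."""
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None

def build_index(vocab):
    """Map each state to {token_id: state reached after reading the token}.

    This one-time pass costs O(number of states * N). Tokens are assumed
    to be non-empty strings.
    """
    index = {}
    for state in STATES:
        allowed = {}
        for token_id, token in enumerate(vocab):
            nxt = state
            for ch in token:
                nxt = step(nxt, ch)
                if nxt is None:
                    break
            if nxt is not None:
                allowed[token_id] = nxt
        index[state] = allowed
    return index

def generate(vocab, index, max_tokens, rng):
    """Sample tokens using only dictionary lookups at each step."""
    state, pieces = 0, []
    for _ in range(max_tokens):
        allowed = index[state]  # average O(1) lookup, no vocabulary scan
        if not allowed:
            break
        # Stand-in for sampling from the LLM's distribution restricted to `allowed`.
        token_id = rng.choice(sorted(allowed))
        pieces.append(vocab[token_id])
        state = allowed[token_id]
    return "".join(pieces), state

# Assumed toy vocabulary; a real LLM vocabulary is far larger.
vocab = ["1", "23", ".", ".5", "a", "4."]
index = build_index(vocab)
text, final_state = generate(vocab, index, max_tokens=8, rng=random.Random(0))
```

Because the index stores the destination state alongside each valid token, the generation loop never re-parses the text produced so far; it only follows the stored transition for the sampled token.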

For the regular expression case, our approach is most similar to that of Kuchnik et al. [2023], which uses a transducer formulation to obtain FSMs defined over a language model’s vocabulary; these FSMs contain much of the same information and scaling benefits as the indices described here. Our approach does not require the complete transducer abstraction and can be used to more easily extend existing, efficient regular expression libraries without modifying the underlying automata and their implementations.

More importantly, our indexing approach can also be extended to CFGs and LALR(1) parsers to allow for efficient guided generation according to popular data formats and programming languages (e.g. JSON, Python, SQL). The transition to parsing is made by way of augmentations to traditional LALR(1) parser components and operations, making it, again, an approach that can be used to extend existing parser implementations.

This paper is available on arXiv under a CC 4.0 license.