xgrammar¶
Classes:
- CompiledGrammar – This is the primary object to store compiled grammar.
- Grammar – This class represents a grammar object in XGrammar, and can be used later in the grammar-guided generation.
- GrammarCompiler – The compiler for grammars.
- GrammarMatcher – Match the output of the LLM to the specified grammar, then generate the mask for the next token.
- TokenizerInfo – The tokenizer info contains the vocabulary, the type of the vocabulary, and necessary information for the grammar-guided generation.
- VocabType – The type of the vocabulary.
Functions:
- allocate_token_bitmask – Allocate the bitmask for the next token prediction.
- apply_token_bitmask_inplace – Apply the bitmask to the logits in-place.
- get_bitmask_shape – Return the shape of the bitmask: (batch_size, ceil(vocab_size / 32)).
- reset_token_bitmask – Reset the bitmask to the full mask.
- class xgrammar.CompiledGrammar¶
This is the primary object to store compiled grammar.
A CompiledGrammar can be used to construct GrammarMatcher to generate token masks efficiently.
Note
Do not construct this class directly; instead, use GrammarCompiler to construct the object.
Attributes:
- grammar – The original grammar.
- tokenizer_info – The tokenizer info associated with the compiled grammar.
- property tokenizer_info: TokenizerInfo¶
The tokenizer info associated with the compiled grammar.
- class xgrammar.Grammar¶
This class represents a grammar object in XGrammar, and can be used later in the grammar-guided generation.
The Grammar object supports context-free grammar (CFG). EBNF (extended Backus-Naur Form) is used as the format of the grammar. There are many specifications for EBNF in the literature, and we follow the specification of GBNF (GGML BNF) in https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md.
When printed, the grammar will be converted to GBNF format.
Methods:
- builtin_json_grammar() – Get the grammar of standard JSON.
- from_ebnf(ebnf_string, *[, root_rule_name]) – Construct a grammar from an EBNF string.
- from_json_schema(schema, *[, ...]) – Construct a grammar from a JSON schema.
- static builtin_json_grammar() Grammar ¶
Get the grammar of standard JSON. This is compatible with the official JSON grammar specification in https://www.json.org/json-en.html.
- Returns:
grammar – The JSON grammar.
- Return type:
Grammar
- static from_ebnf(ebnf_string: str, *, root_rule_name: str = 'root') Grammar ¶
Construct a grammar from an EBNF string. The EBNF string should follow the format in https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md.
- Parameters:
ebnf_string (str) – The grammar string in EBNF format.
root_rule_name (str, default: "root") – The name of the root rule in the grammar.
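Example (illustrative sketch): constructing a grammar from a GBNF-style EBNF string. The grammar below is a hypothetical two-rule example, not taken from this reference.

import xgrammar as xgr

# Illustrative GBNF-style grammar: the root rule matches "yes" or "no".
ebnf_string = r"""
root ::= answer
answer ::= "yes" | "no"
"""

grammar = xgr.Grammar.from_ebnf(ebnf_string, root_rule_name="root")
print(grammar)  # printed in GBNF format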
- static from_json_schema(schema: Union[str, Type[BaseModel]], *, any_whitespace: bool = True, indent: Optional[int] = None, separators: Optional[Tuple[str, str]] = None, strict_mode: bool = True) Grammar ¶
Construct a grammar from a JSON schema. Either a Pydantic model or a JSON schema string can be used to specify the schema.
Any whitespace is allowed by default. If you want to control the format of the generated JSON, set any_whitespace to False and use the indent and separators parameters. The meaning and the default values of these parameters follow the convention in json.dumps().
It internally converts the JSON schema to an EBNF grammar.
- Parameters:
schema (Union[str, Type[BaseModel]]) – The schema string or Pydantic model.
any_whitespace (bool, default: True) – Whether to use any whitespace. If True, the generated grammar will ignore the indent and separators parameters, and allow any whitespace.
indent (Optional[int], default: None) –
The number of spaces for indentation. If None, the output will be in one line.
Note that specifying the indentation forces the LLM to generate strictly formatted JSON strings. However, some models tend to generate JSON strings that are not strictly formatted, and forcing strict formatting may then degrade the generation quality. See <https://github.com/sgl-project/sglang/issues/2216#issuecomment-2516192009> for more details.
separators (Optional[Tuple[str, str]], default: None) – Two separators used in the schema: comma and colon. Examples: (",", ":"), (", ", ": "). If None, the default separators will be used: (",", ": ") when the indent is not None, and (", ", ": ") otherwise.
strict_mode (bool, default: True) –
Whether to use strict mode. In strict mode, the generated grammar will not allow properties and items that are not specified in the schema. This is equivalent to setting unevaluatedProperties and unevaluatedItems to false. It also disallows empty JSON objects and arrays.
This helps the LLM generate accurate output in grammar-guided generation with a JSON schema.
- Returns:
grammar – The constructed grammar.
- Return type:
Grammar
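Example (illustrative sketch): constructing grammars from a hypothetical Pydantic model, with and without whitespace control.

import xgrammar as xgr
from pydantic import BaseModel

# Hypothetical schema used only for illustration.
class Person(BaseModel):
    name: str
    age: int

# Default: any whitespace is allowed between JSON tokens.
grammar = xgr.Grammar.from_json_schema(Person)

# Force a compact, strictly formatted layout instead.
compact = xgr.Grammar.from_json_schema(
    Person, any_whitespace=False, separators=(",", ":")
)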
- class xgrammar.GrammarCompiler(tokenizer_info: TokenizerInfo, *, max_threads: int = 8, cache_enabled: bool = True)¶
The compiler for grammars. It is associated with a specific tokenizer info and compiles grammars into CompiledGrammar objects using that tokenizer info. It supports parallel compilation with multiple threads and caches compilation results, so the same grammar is not compiled multiple times.
- Parameters:
tokenizer_info (TokenizerInfo) – The tokenizer info.
max_threads (int, default: 8) – The maximum number of threads used to compile the grammar.
cache_enabled (bool, default: True) – Whether to enable the cache.
Methods:
- clear_cache() – Clear all cached compiled grammars.
- compile_builtin_json_grammar() – Get the CompiledGrammar for the standard JSON grammar.
- compile_json_schema(schema, *[, ...]) – Get CompiledGrammar from the specified JSON schema and format.
- clear_cache() None ¶
Clear all cached compiled grammars.
- compile_builtin_json_grammar() CompiledGrammar ¶
Get the CompiledGrammar for the standard JSON grammar.
- Returns:
compiled_grammar – The compiled grammar.
- Return type:
CompiledGrammar
- compile_json_schema(schema: Union[str, Type[BaseModel]], *, any_whitespace: bool = True, indent: Optional[int] = None, separators: Optional[Tuple[str, str]] = None, strict_mode: bool = True) CompiledGrammar ¶
Get CompiledGrammar from the specified JSON schema and format. The indent and separators parameters follow the same convention as in json.dumps().
- Parameters:
schema (Union[str, Type[BaseModel]]) – The schema string or Pydantic model.
indent (Optional[int], default: None) – The number of spaces for indentation. If None, the output will be in one line.
separators (Optional[Tuple[str, str]], default: None) – Two separators used in the schema: comma and colon. Examples: (",", ":"), (", ", ": "). If None, the default separators will be used: (",", ": ") when the indent is not None, and (", ", ": ") otherwise.
strict_mode (bool, default: True) – Whether to use strict mode. In strict mode, the generated grammar will not allow properties and items that are not specified in the schema. This is equivalent to setting unevaluatedProperties and unevaluatedItems to false.
- Returns:
compiled_grammar – The compiled grammar.
- Return type:
CompiledGrammar
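Example (illustrative sketch): compiling grammars with a shared compiler. The tokenizer_info object is assumed to be built already (see TokenizerInfo.from_huggingface below); the JSON schema string is a hypothetical example.

import xgrammar as xgr

# Repeated compilations of the same grammar hit the cache when cache_enabled=True.
compiler = xgr.GrammarCompiler(tokenizer_info, max_threads=8, cache_enabled=True)

json_grammar = compiler.compile_builtin_json_grammar()
schema_grammar = compiler.compile_json_schema(
    '{"type": "object", "properties": {"name": {"type": "string"}}, "required": ["name"]}'
)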
- class xgrammar.GrammarMatcher(compiled_grammar: CompiledGrammar, *, override_stop_tokens: Optional[Union[List[int], int]] = None, terminate_without_stop_token: bool = False, max_rollback_tokens: int = 0)¶
Match the output of the LLM to the specified grammar, then generate the mask for the next token. This is the core class in the grammar-guided generation.
This class maintains a stateful matcher that can accept tokens and strings, then match them to the specified grammar. The matcher can provide a bitmask for the next token prediction, so that the output of the LLM follows the specified grammar. Its state can be reset and rolled back by tokens. It also provides utilities for jump-forward decoding.
After matching the whole grammar, the matcher will accept a stop token; the token mask at that point only allows stop tokens. After accepting the stop token, the matcher terminates and can no longer accept tokens or generate token masks, meaning the generation is finished.
Under the hood, it utilizes a pushdown automaton with backtracking to match the grammar, with optimizations specific to LLM token mask generation.
- Parameters:
compiled_grammar (CompiledGrammar) – The initialization context for the grammar matcher.
override_stop_tokens (Optional[Union[int, List[int]]], default: None) – If not None, the stop tokens to override the ones in the grammar.
terminate_without_stop_token (bool, default: False) – Whether to terminate the matcher without accepting a stop token.
max_rollback_tokens (int, default: 0) – The maximum number of rollback tokens allowed. The rollback operation is useful for jump-forward decoding and speculative decoding.
Methods:
- accept_token(token_id, *[, debug_print]) – Accept one token and update the state of the matcher.
- fill_next_token_bitmask(bitmask[, index]) – Fill the bitmask for the next token prediction.
- find_jump_forward_string() – Find the jump-forward string for jump-forward decoding.
- is_terminated() – Check if the matcher has terminated.
- reset() – Reset the matcher to the initial state.
- rollback([num_tokens]) – Rollback the matcher to a previous state by several tokens.
Attributes:
- max_rollback_tokens – Get the maximum number of rollback tokens allowed.
- stop_token_ids – The ids of the stop tokens used in the matcher.
- accept_token(token_id: int, *, debug_print: bool = False) bool ¶
Accept one token and update the state of the matcher.
- Parameters:
token_id (int) – The id of the token to accept.
debug_print (bool, default: False) – Whether to print information about the internal state of the matcher. Helpful for debugging.
- Returns:
accepted – Whether the token is accepted.
- Return type:
bool
- fill_next_token_bitmask(bitmask: Tensor, index: int = 0) None ¶
Fill the bitmask for the next token prediction. The input bitmask can be generated by allocate_token_bitmask, and must be on CPU. bitmask[index] will be filled with the next token bitmask.
This method does not change the matcher state.
- Parameters:
bitmask (torch.Tensor) – The bitmask for the next token prediction.
index (int, default: 0) – The batch id of the bitmask.
- find_jump_forward_string() str ¶
Find the jump-forward string for jump-forward decoding. This is the longest string that certainly conforms with the current grammar from the current matcher state. This string can become the output of the LLM without requiring LLM decoding.
This method does not change the matcher state.
- Returns:
jump_forward_string – The jump-forward string.
- Return type:
str
- is_terminated() bool ¶
Check if the matcher has terminated. If terminate_without_stop_token is False, the matcher will terminate if it has accepted the stop token. Otherwise, the matcher will terminate after matching the whole grammar.
- Returns:
terminated – Whether the matcher has terminated.
- Return type:
bool
- property max_rollback_tokens: int¶
Get the maximum number of rollback tokens allowed.
- Returns:
max_rollback_tokens – The maximum number of rollback tokens.
- Return type:
int
- reset() None ¶
Reset the matcher to the initial state.
- rollback(num_tokens: int = 1) None ¶
Rollback the matcher to a previous state by several tokens.
- Parameters:
num_tokens (int, default: 1) – The number of tokens to rollback. It cannot exceed the current number of steps, nor can it exceed the specified maximum number of rollback tokens.
- property stop_token_ids: List[int]¶
The ids of the stop tokens used in the matcher. If specified, the provided stop tokens will be used. Otherwise, the stop tokens will be detected from the vocabulary.
- Returns:
stop_token_ids – The ids of the stop tokens.
- Return type:
List[int]
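Example (illustrative sketch): a minimal constrained decoding loop. compiled_grammar, tokenizer_info, and the next_logits() helper (one decoding step of your model, returning a tensor of shape (1, vocab_size)) are assumptions, not part of this API.

import torch
import xgrammar as xgr

matcher = xgr.GrammarMatcher(compiled_grammar)
bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)

while not matcher.is_terminated():
    logits = next_logits()                    # hypothetical model step
    matcher.fill_next_token_bitmask(bitmask)  # the bitmask stays on CPU
    xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
    next_token_id = int(torch.argmax(logits, dim=-1).item())
    matcher.accept_token(next_token_id)       # returns False if the token is rejected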
- class xgrammar.TokenizerInfo(encoded_vocab: Union[List[bytes], List[str]], vocab_type: VocabType = VocabType.RAW, *, vocab_size: Optional[int] = None, stop_token_ids: Optional[Union[List[int], int]] = None, prepend_space_in_tokenization: bool = False)¶
The tokenizer info contains the vocabulary, the type of the vocabulary, and necessary information for the grammar-guided generation.
Note that although some tokenizers will encode the tokens in a special format, e.g. "<0x1B>" for the escape character (U+001B) in the ByteFallback tokenizer, and "Ġ" for " " in the Byte-Level BPE tokenizer, TokenizerInfo always decodes the vocabulary to the original format (e.g. the escape character itself and " ").
Also note that some models (e.g. Phi-3 and Deepseek-V2) may pad the vocabulary to a multiple of 32. In this case, the model’s vocab_size is larger than the tokenizer’s vocabulary size. Please pass the model’s vocab_size to the vocab_size parameter in the constructor, because this information is used to determine the size of the token mask.
- Parameters:
encoded_vocab (Union[List[bytes], List[str]]) – The encoded vocabulary of the tokenizer.
vocab_type (VocabType, default: VocabType.RAW) – The type of the vocabulary. See also VocabType.
vocab_size (Optional[int], default: None) – The size of the vocabulary. If not provided, the vocabulary size will be len(encoded_vocab).
stop_token_ids (Optional[List[int]], default: None) – The stop token ids. If not provided, the stop token ids will be auto detected (but may not be correct).
prepend_space_in_tokenization (bool, default: False) – Whether the tokenizer will prepend a space before the text in the tokenization process.
Attributes:
- decoded_vocab – The decoded vocabulary of the tokenizer.
- prepend_space_in_tokenization – Whether the tokenizer will prepend a space before the text in the tokenization process.
- special_token_ids – The special token ids.
- stop_token_ids – The stop token ids.
- vocab_size – The size of the vocabulary.
- vocab_type – The type of the vocabulary.
Methods:
- dump_metadata() – Dump the metadata of the tokenizer to a json string.
- from_huggingface(tokenizer, *[, vocab_size, ...]) – Construct the tokenizer info from the huggingface tokenizer.
- from_vocab_and_metadata(encoded_vocab, metadata) – Construct the tokenizer info from the vocabulary and the metadata string in json format.
- property decoded_vocab: List[bytes]¶
The decoded vocabulary of the tokenizer. This converts the tokens in the LLM’s vocabulary back to the original format of the input text. E.g. for type ByteFallback, the token "<0x1B>" is converted back to the escape character (U+001B).
- dump_metadata() str ¶
Dump the metadata of the tokenizer to a json string. It can be used to construct the tokenizer info from the vocabulary and the metadata string.
- static from_huggingface(tokenizer: PreTrainedTokenizerBase, *, vocab_size: Optional[int] = None, stop_token_ids: Optional[Union[List[int], int]] = None) TokenizerInfo ¶
Construct the tokenizer info from the huggingface tokenizer. This constructor supports various tokenizer backends, including the huggingface fast tokenizer and tiktoken tokenizer. Necessary information is automatically detected from the tokenizer.
Note that some models (e.g. Phi-3 and Deepseek-V2) may pad the vocabulary to a multiple of 32. In this case, the model’s vocab_size is larger than the tokenizer’s vocabulary size. Please pass the model’s vocab_size (this should be defined in the model config) to the vocab_size parameter in the constructor, because this information is used to determine the size of the token mask.
Some models can have more than one stop token id, and auto detection may not find all of them. In this case, you can specify the stop token ids manually.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The huggingface tokenizer.
vocab_size (Optional[int], default: None) – The size of the vocabulary. If not provided, the vocabulary size will be len(encoded_vocab).
stop_token_ids (Optional[List[int]], default: None) – The stop token ids. If not provided, the stop token ids will be auto detected (but may not be correct).
- Returns:
tokenizer_info – The tokenizer info.
- Return type:
TokenizerInfo
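Example (illustrative sketch): building a TokenizerInfo from a huggingface tokenizer. The model id is a placeholder; passing the model's vocab_size and explicit stop token ids is shown for the padded-vocabulary case described above.

import xgrammar as xgr
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

tokenizer_info = xgr.TokenizerInfo.from_huggingface(
    tokenizer,
    vocab_size=config.vocab_size,              # the model's (possibly padded) vocab size
    stop_token_ids=[tokenizer.eos_token_id],   # override auto detection if needed
)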
- static from_vocab_and_metadata(encoded_vocab: List[Union[bytes, str]], metadata: str) TokenizerInfo ¶
Construct the tokenizer info from the vocabulary and the metadata string in json format.
- Parameters:
encoded_vocab (List[Union[bytes, str]]) – The encoded vocabulary of the tokenizer.
metadata (str) – The metadata string in json format.
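Example (illustrative sketch): serializing and rebuilding a TokenizerInfo. tokenizer_info and encoded_vocab (the tokenizer's encoded vocabulary list) are assumed to exist already.

import xgrammar as xgr

metadata = tokenizer_info.dump_metadata()   # JSON string describing the tokenizer
rebuilt = xgr.TokenizerInfo.from_vocab_and_metadata(encoded_vocab, metadata)
assert rebuilt.vocab_size == tokenizer_info.vocab_size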
- property prepend_space_in_tokenization: bool¶
Whether the tokenizer will prepend a space before the text in the tokenization process.
- property special_token_ids: List[int]¶
The special token ids. Special tokens include control tokens, reserved tokens, padded tokens, etc. They are currently detected automatically from the vocabulary.
- property stop_token_ids: List[int]¶
The stop token ids.
- property vocab_size: int¶
The size of the vocabulary.
- class xgrammar.VocabType(value)¶
The type of the vocabulary. Used in TokenizerInfo. XGrammar supports three types of vocabularies:
- RAW
The vocabulary is in the raw format. The tokens in the vocabulary are kept in their original form without any processing. This kind of tokenizer includes the tiktoken tokenizer, e.g. microsoft/Phi-3-small-8k-instruct, Qwen/Qwen-7B-Chat, etc.
- BYTE_FALLBACK
The vocabulary used in the byte fallback BPE tokenizer. The tokens are encoded through the byte-fallback conversion. E.g. the escape character (U+001B) -> "<0x1B>", " apple" -> "▁apple". This kind of tokenizer includes meta-llama/Llama-2-7b-chat, microsoft/Phi-3.5-mini-instruct, etc.
- BYTE_LEVEL
The vocabulary used in the byte level BPE tokenizer. The tokens are encoded through the byte-to-unicode conversion, as in https://github.com/huggingface/transformers/blob/87be06ca77166e6a6215eee5a990ab9f07238a18/src/transformers/models/gpt2/tokenization_gpt2.py#L38-L59
This kind of tokenizer includes meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3.1-8B-Instruct, etc.
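Example (illustrative sketch): inspecting the detected vocabulary type via the vocab_type attribute listed above; tokenizer_info is assumed to be built via TokenizerInfo.from_huggingface.

import xgrammar as xgr

if tokenizer_info.vocab_type == xgr.VocabType.BYTE_LEVEL:
    print("byte-level BPE vocabulary (e.g. Llama-3 style)")
elif tokenizer_info.vocab_type == xgr.VocabType.BYTE_FALLBACK:
    print("byte-fallback BPE vocabulary (e.g. Llama-2 style)")
else:
    print("raw vocabulary (e.g. tiktoken style)")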
- xgrammar.allocate_token_bitmask(batch_size: int, vocab_size: int) Tensor ¶
Allocate the bitmask for the next token prediction. The bitmask is an int32 tensor on CPU with shape (batch_size, ceil(vocab_size / 32)). Users who have their own needs to manage CUDA memory can construct the tensor with get_bitmask_shape and bitmask_dtype themselves.
The reason why we use int32 instead of uint32 is that old versions of PyTorch do not support uint32.
- Parameters:
batch_size (int) – The batch size of the bitmask.
vocab_size (int) – The size of the vocabulary.
- Returns:
bitmask – The allocated bitmask.
- Return type:
torch.Tensor
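Example (illustrative sketch): allocating and resetting a bitmask; the batch and vocabulary sizes are arbitrary.

import torch
import xgrammar as xgr

batch_size, vocab_size = 4, 128000          # illustrative sizes
bitmask = xgr.allocate_token_bitmask(batch_size, vocab_size)
assert tuple(bitmask.shape) == xgr.get_bitmask_shape(batch_size, vocab_size)
assert bitmask.dtype == torch.int32

# reset_token_bitmask restores the "all tokens allowed" state so the tensor can be reused.
xgr.reset_token_bitmask(bitmask)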
- xgrammar.apply_token_bitmask_inplace(logits: Tensor, bitmask: Tensor, *, indices: Optional[List[int]] = None) None ¶
Apply the bitmask to the logits in-place. The bitmask is a bit-packed 0/1 tensor, where 0 means the token is masked (disallowed) and 1 means the token is not masked. It can be generated by allocate_token_bitmask and filled by fill_next_token_bitmask. After applying the bitmask, the masked logits will be set to -inf.
The shape of logits and bitmask should be (batch_size, vocab_size) and (batch_size, bitmask_size) respectively, where bitmask_size = ceil(vocab_size / 32). The operation is:

for i in range(batch_size):
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, i, j) == 0:
            logits[i, j] = -inf

get_bitmask_value(bitmask, i, j) gets the j-th bit of the i-th row of the bitmask.
indices can be used to specify which logits in the batch the bitmask is applied to. This is especially useful when structured and unstructured requests are mixed in the same batch, since masking can be skipped for the unstructured requests. When specified, the operation becomes:

for batch_id in indices:
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, batch_id, j) == 0:
            logits[batch_id, j] = -inf
The logits and bitmask should be on the same device. If both are on CUDA, a CUDA kernel is launched to apply the bitmask; if both are on CPU, a CPU implementation is used. The CUDA kernel is optimized and should be preferred.
In practice, the bitmask is allocated on CPU and the logits are usually on GPU, so users should manually copy the bitmask to the GPU before calling this function.
- Parameters:
logits (torch.Tensor) – The tensor to apply the bitmask to.
bitmask (torch.Tensor) – The bitmask to apply.
indices (Optional[List[int]], default: None) – A list of indices to specify which logits in the batch to apply the bitmask to. If None, apply the bitmask to all logits in the batch.
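Example (illustrative sketch): masking only the structured requests in a mixed batch. The logits here are random placeholders; in practice they come from the model, and rows 0 and 2 of the bitmask would be filled by GrammarMatcher.fill_next_token_bitmask.

import torch
import xgrammar as xgr

batch_size, vocab_size = 4, 128000
logits = torch.randn(batch_size, vocab_size)
bitmask = xgr.allocate_token_bitmask(batch_size, vocab_size)
# ... fill rows 0 and 2 with matcher.fill_next_token_bitmask(bitmask, index=i) ...

# Only rows 0 and 2 are masked; rows 1 and 3 (unstructured requests) are untouched.
xgr.apply_token_bitmask_inplace(logits, bitmask.to(logits.device), indices=[0, 2])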
- xgrammar.get_bitmask_shape(batch_size: int, vocab_size: int) Tuple[int, int] ¶
Return the shape of the bitmask: (batch_size, ceil(vocab_size / 32))
- xgrammar.reset_token_bitmask(bitmask: Tensor) None ¶
Reset the bitmask to the full mask.