xgr.TokenizerInfo¶
- class xgrammar.VocabType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases: Enum
The type of the vocabulary. Used in TokenizerInfo. XGrammar supports three types of vocabularies: RAW, BYTE_FALLBACK, BYTE_LEVEL.
Attributes:
- RAW: The vocabulary is in the raw format.
- BYTE_FALLBACK: The vocabulary used in the byte fallback BPE tokenizer.
- BYTE_LEVEL: The vocabulary used in the byte level BPE tokenizer.
- RAW = 0¶
The vocabulary is in the raw format.
The tokens in the vocabulary are kept in their original form without any processing. This kind of tokenizer includes the tiktoken tokenizer, e.g. microsoft/Phi-3-small-8k-instruct, Qwen/Qwen-7B-Chat, etc.
- BYTE_FALLBACK = 1¶
The vocabulary used in the byte fallback BPE tokenizer.
The tokens are encoded through the byte-fallback conversion. E.g. the escape character "\x1b" -> "<0x1B>", " apple" -> "▁apple". This kind of tokenizer includes meta-llama/Llama-2-7b-chat, microsoft/Phi-3.5-mini-instruct, etc.
- BYTE_LEVEL = 2¶
The vocabulary used in the byte level BPE tokenizer.
The tokens are encoded through the byte-to-unicode conversion, as in https://github.com/huggingface/transformers/blob/87be06ca77166e6a6215eee5a990ab9f07238a18/src/transformers/models/gpt2/tokenization_gpt2.py#L38-L59
This kind of tokenizer includes meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3.1-8B-Instruct, etc.
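A minimal sketch of referencing the three values; it assumes only the enum itself and the TokenizerInfo constructor documented below.

```python
import xgrammar as xgr

# The three supported vocabulary types.
print(list(xgr.VocabType))  # RAW, BYTE_FALLBACK, BYTE_LEVEL

# A TokenizerInfo (see below) records which type its vocabulary uses.
info = xgr.TokenizerInfo(["hello", "world"], vocab_type=xgr.VocabType.RAW)
assert info.vocab_type == xgr.VocabType.RAW
```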
- class xgrammar.TokenizerInfo(encoded_vocab: Union[List[bytes], List[str]], vocab_type: VocabType = VocabType.RAW, *, vocab_size: Optional[int] = None, stop_token_ids: Optional[Union[List[int], int]] = None, add_prefix_space: bool = False)[source]¶
The tokenizer info contains the vocabulary, the type of the vocabulary, and necessary information for grammar-guided generation.
Note that although some tokenizers encode the tokens in a special format, e.g. "<0x1B>" for the escape character "\x1b" in the ByteFallback tokenizer, and "Ġ" for " " in the Byte-Level BPE tokenizer, TokenizerInfo always decodes the vocabulary to the original format (e.g. "\x1b" and " ").
Also note that some models (e.g. Phi-3 and Deepseek-V2) may pad the vocabulary to a multiple of 32. In this case, the model’s vocab_size is larger than the tokenizer’s vocabulary size. Please pass the model’s vocab_size to the vocab_size parameter in the constructor, because this information is used to determine the size of the token mask.
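As a hedged illustration of this note, the model's vocab_size can be read from its config and passed to from_huggingface (documented below); microsoft/Phi-3.5-mini-instruct is used only because it is mentioned above, and any model with a padded lm_head works the same way.

```python
import xgrammar as xgr
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"  # example model from the note above
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

# Pass the model's (possibly padded) vocab_size, not the tokenizer's,
# so that the token mask has the right size.
tokenizer_info = xgr.TokenizerInfo.from_huggingface(
    tokenizer, vocab_size=config.vocab_size
)
print(tokenizer_info.vocab_size)  # equals config.vocab_size
```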
Methods:
- __init__(encoded_vocab[, vocab_type, ...]): Construct the tokenizer info.
- from_huggingface(tokenizer, *[, vocab_size, ...]): Construct the tokenizer info from the huggingface tokenizer.
- dump_metadata(): Dump the metadata of the tokenizer to a JSON string.
- from_vocab_and_metadata(encoded_vocab, metadata): Construct the tokenizer info from the vocabulary and the metadata string in JSON format.
- serialize_json(): Serialize the tokenizer_info to a JSON string.
- deserialize_json(json_string, encoded_vocab): Deserialize a TokenizerInfo from a JSON string.
Attributes:
- vocab_type: The type of the vocabulary.
- vocab_size: The size of the vocabulary.
- add_prefix_space: Whether the tokenizer will prepend a space before the text in the tokenization process.
- prepend_space_in_tokenization: Whether the tokenizer will prepend a space before the text in the tokenization process (deprecated; use add_prefix_space).
- decoded_vocab: The decoded vocabulary of the tokenizer.
- stop_token_ids: The stop token ids.
- special_token_ids: The special token ids.
- __init__(encoded_vocab: Union[List[bytes], List[str]], vocab_type: VocabType = VocabType.RAW, *, vocab_size: Optional[int] = None, stop_token_ids: Optional[Union[List[int], int]] = None, add_prefix_space: bool = False) None [source]¶
Construct the tokenizer info.
- Parameters:
encoded_vocab (Union[List[bytes], List[str]]) – The encoded vocabulary of the tokenizer.
vocab_type (VocabType, default: VocabType.RAW) – The type of the vocabulary. See also VocabType.
vocab_size (Optional[int], default: None) – The size of the vocabulary. If not provided, the vocabulary size will be len(encoded_vocab).
stop_token_ids (Optional[List[int]], default: None) – The stop token ids. If not provided, the stop token ids will be auto detected (but may not be correct).
add_prefix_space (bool, default: False) – Whether the tokenizer will prepend a space before the text in the tokenization process.
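A minimal, self-contained sketch of direct construction; the token strings, padded size, and stop token id below are made up for illustration.

```python
import xgrammar as xgr

# A tiny made-up raw vocabulary; real vocabularies come from a tokenizer.
encoded_vocab = ["<unk>", "<s>", "</s>", "hello", "world", "!"]

tokenizer_info = xgr.TokenizerInfo(
    encoded_vocab,
    vocab_type=xgr.VocabType.RAW,
    vocab_size=8,          # pretend the model's lm_head is padded to 8
    stop_token_ids=[2],    # "</s>" acts as the stop token
    add_prefix_space=False,
)
print(tokenizer_info.vocab_size)  # 8, the padded size used for the token mask
```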
- static from_huggingface(tokenizer: PreTrainedTokenizerBase, *, vocab_size: Optional[int] = None, stop_token_ids: Optional[Union[List[int], int]] = None) TokenizerInfo [source]¶
Construct the tokenizer info from the huggingface tokenizer. This constructor supports various tokenizer backends, including the huggingface fast tokenizer and tiktoken tokenizer. Necessary information is automatically detected from the tokenizer.
The vocab_size parameter is introduced to handle the misalignment between the model's vocab_size and the tokenizer's vocabulary size. Users should pass the model's vocab_size (which may be defined in the model config) here. See the docs of vocab_size for more details.
By default, the stop token ids are the tokenizer's eos_token_id. If there are other stop tokens, you can specify them manually.
- Parameters:
tokenizer (PreTrainedTokenizerBase) – The huggingface tokenizer.
vocab_size (Optional[int], default: None) –
The vocabulary size defined by the model (not the tokenizer). This equals the vocab dimension of the model's lm_head and is the size of the token mask.
It can be:
1. The same as the tokenizer's vocabulary size. This is the most common case.
2. Larger than the tokenizer's vocabulary size. This happens when the model pads lm_head, possibly to align its size to a power of 2, e.g. Phi-3 and Deepseek-V2.
3. Smaller than the tokenizer's vocabulary size. This happens when the tokenizer has added tokens that are not supported by the model, e.g. Llama-3.2 Vision and Molmo-72B-0924 have padded <|image|> tokens, but these are not included in lm_head and will not be generated by the model.
vocab_size needs to be provided in cases 2 and 3. If not provided, it will be set to the tokenizer's vocabulary size.
stop_token_ids (Optional[List[int]], default: None) – The stop token ids. If not provided, the eos_token_id of the tokenizer will be used.
- Returns:
tokenizer_info – The tokenizer info.
- Return type: TokenizerInfo
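A sketch of overriding the auto-detected stop tokens. The model id is taken from the examples above (it may require HuggingFace access), and the extra "<|eot_id|>" stop token is an assumption for illustration.

```python
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# By default only the tokenizer's eos_token_id is used as a stop token.
# Pass stop_token_ids explicitly if generation should also stop on other
# tokens, e.g. a chat end-of-turn token (illustrative here).
stop_ids = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, stop_token_ids=stop_ids)
print(tokenizer_info.stop_token_ids)
```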
- property add_prefix_space: bool¶
Whether the tokenizer will prepend a space before the text in the tokenization process.
- property prepend_space_in_tokenization: bool¶
Whether the tokenizer will prepend a space before the text in the tokenization process.
This property is deprecated. Use add_prefix_space instead.
- property decoded_vocab: List[bytes]¶
The decoded vocabulary of the tokenizer. This converts the tokens in the LLM's vocabulary back to the original format of the input text. E.g. for type ByteFallback, the token "<0x1B>" is converted back to the escape character "\x1b".
- property special_token_ids: List[int]¶
The special token ids. Special tokens include control tokens, reserved tokens, padded tokens, etc. They are currently detected automatically from the vocabulary.
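A small sketch inspecting both properties on a toy byte-fallback vocabulary; the token strings are made up, and a real vocabulary would come from from_huggingface.

```python
import xgrammar as xgr

# Toy byte-fallback vocabulary: "<0x1B>" encodes the escape byte and
# "▁apple" encodes " apple", following the conventions described above.
encoded_vocab = ["<unk>", "<s>", "</s>", "<0x1B>", "▁apple"]
info = xgr.TokenizerInfo(encoded_vocab, vocab_type=xgr.VocabType.BYTE_FALLBACK)

# decoded_vocab holds the tokens in their original byte form.
for token_id, token in enumerate(info.decoded_vocab):
    print(token_id, token)

# Special tokens (control/reserved/padded) are detected automatically.
print(info.special_token_ids)
```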
- dump_metadata() str [source]¶
Dump the metadata of the tokenizer to a JSON string. Together with the vocabulary, it can be used to reconstruct the tokenizer info (see from_vocab_and_metadata).
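A sketch of a round trip through dump_metadata and from_vocab_and_metadata, assuming the encoded vocabulary is still at hand (here a made-up toy one).

```python
import xgrammar as xgr

encoded_vocab = ["<unk>", "<s>", "</s>", "hello", "world"]
info = xgr.TokenizerInfo(encoded_vocab, vocab_type=xgr.VocabType.RAW, stop_token_ids=[2])

# The tokenizer metadata as a JSON string.
metadata = info.dump_metadata()
print(metadata)

# Rebuild an equivalent TokenizerInfo from the vocabulary plus the metadata.
restored = xgr.TokenizerInfo.from_vocab_and_metadata(encoded_vocab, metadata)
assert restored.vocab_size == info.vocab_size
```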