xgrammar.VocabType¶
- class xgrammar.VocabType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases: Enum
The type of the vocabulary. Used in TokenizerInfo. XGrammar supports three types of vocabularies:
- RAW
The vocabulary is in the raw format: the tokens are kept in their original form without any processing. Tokenizers of this kind include tiktoken-based tokenizers, e.g. microsoft/Phi-3-small-8k-instruct, Qwen/Qwen-7B-Chat, etc.
- BYTE_FALLBACK
The vocabulary used in byte-fallback BPE tokenizers. The tokens are encoded through the byte-fallback conversion, e.g. "\x1b" -> "<0x1B>", " apple" -> "▁apple". Tokenizers of this kind include meta-llama/Llama-2-7b-chat, microsoft/Phi-3.5-mini-instruct, etc.
- BYTE_LEVEL
The vocabulary used in byte-level BPE tokenizers. The tokens are encoded through the byte-to-unicode conversion, as in https://github.com/huggingface/transformers/blob/87be06ca77166e6a6215eee5a990ab9f07238a18/src/transformers/models/gpt2/tokenization_gpt2.py#L38-L59 (see the sketch after this list). Tokenizers of this kind include meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3.1-8B-Instruct, etc.
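The byte-to-unicode conversion used by BYTE_LEVEL vocabularies can be reproduced with a short helper adapted from the GPT-2 tokenization code linked above. The sketch below is illustrative only and is not part of the XGrammar API:

```python
def bytes_to_unicode():
    """Map every byte value to a printable unicode character, as in GPT-2's
    byte-level BPE. Printable bytes keep their own code point; the remaining
    bytes are shifted above 255 so that none of them is invisible."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # 'Ġ': how a leading space appears in a BYTE_LEVEL token
print(mapping[0x0A])      # the newline byte is also replaced by a printable stand-in
```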
- __init__(*args, **kwds)¶
Attributes
- BYTE_FALLBACK = 1¶
- BYTE_LEVEL = 2¶
- RAW = 0¶
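As a usage note, the vocabulary type of a concrete tokenizer can be inspected through TokenizerInfo. The following is a minimal sketch, assuming the from_huggingface constructor and the vocab_type property of TokenizerInfo are available in your xgrammar version; the model name is only an example:

```python
from transformers import AutoTokenizer
import xgrammar

# Load a Hugging Face tokenizer and derive XGrammar's tokenizer metadata from it.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
info = xgrammar.TokenizerInfo.from_huggingface(tokenizer)

# Branch on the detected vocabulary type.
if info.vocab_type == xgrammar.VocabType.BYTE_LEVEL:
    print("byte-level BPE vocabulary")
elif info.vocab_type == xgrammar.VocabType.BYTE_FALLBACK:
    print("byte-fallback BPE vocabulary")
else:
    print("raw vocabulary")
```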