xgrammar.VocabType

class xgrammar.VocabType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

The type of the vocabulary, used in TokenizerInfo. XGrammar supports three types of vocabularies:

RAW

The vocabulary is in raw format: tokens are kept in their original form without any conversion. This kind of tokenizer includes tiktoken-based tokenizers, e.g. microsoft/Phi-3-small-8k-instruct, Qwen/Qwen-7B-Chat, etc.

BYTE_FALLBACK

The vocabulary used in byte-fallback BPE tokenizers. Tokens are encoded through the byte-fallback conversion, e.g. "\u001B" -> "<0x1B>", " apple" -> "▁apple". This kind of tokenizer includes meta-llama/Llama-2-7b-chat, microsoft/Phi-3.5-mini-instruct, etc.

BYTE_LEVEL

The vocabulary used in byte-level BPE tokenizers. Tokens are encoded through the byte-to-unicode conversion, as in https://github.com/huggingface/transformers/blob/87be06ca77166e6a6215eee5a990ab9f07238a18/src/transformers/models/gpt2/tokenization_gpt2.py#L38-L59. This kind of tokenizer includes meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Meta-Llama-3.1-8B-Instruct, etc.
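
A minimal usage sketch (assuming the Hugging Face transformers package is installed, access to the model id used below, and that TokenizerInfo.from_huggingface and the vocab_type property behave as in recent xgrammar releases): build a TokenizerInfo from a tokenizer and inspect which VocabType it uses.

    from transformers import AutoTokenizer

    import xgrammar

    # Llama-3 uses a byte-level BPE tokenizer, so BYTE_LEVEL is expected here.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    tokenizer_info = xgrammar.TokenizerInfo.from_huggingface(tokenizer)

    if tokenizer_info.vocab_type == xgrammar.VocabType.BYTE_LEVEL:
        print("byte-level BPE vocabulary")
    elif tokenizer_info.vocab_type == xgrammar.VocabType.BYTE_FALLBACK:
        print("byte-fallback BPE vocabulary")
    else:  # xgrammar.VocabType.RAW
        print("raw vocabulary")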

__init__(*args, **kwds)

Attributes

RAW = 0

BYTE_FALLBACK = 1

BYTE_LEVEL = 2