Advanced Topics

This section covers advanced topics about XGrammar.

Multi-threaded Grammar Compilation and Cache

To accelerate grammar compilation, xgr.GrammarCompiler is multi-threaded. It uses multiple threads to process a single grammar and can also compile multiple grammars in parallel. The xgr.GrammarCompiler.compile_* functions release the GIL, so you can use asyncio to compile multiple grammars concurrently.

The max_threads parameter controls the maximum number of threads used. We recommend setting it to half the number of your CPU’s virtual cores for optimal performance.
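For example, one way to derive this value at runtime (a minimal sketch; it assumes os.cpu_count() reports the number of virtual cores, which is typical but not guaranteed):

import os

# Use half of the reported virtual cores, with a floor of one thread
max_threads = max(1, (os.cpu_count() or 2) // 2)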

import asyncio

import xgrammar as xgr

# tokenizer_info, grammar1, and grammar2 are assumed to be constructed elsewhere
grammar_compiler = xgr.GrammarCompiler(tokenizer_info, max_threads=8)

# Use asyncio to compile multiple grammars in parallel
async def compile_grammars():
    # Submit both compilations to worker threads; they run in parallel
    # because compile_grammar releases the GIL
    task1 = asyncio.create_task(asyncio.to_thread(grammar_compiler.compile_grammar, grammar1))
    task2 = asyncio.create_task(asyncio.to_thread(grammar_compiler.compile_grammar, grammar2))

    # Wait for both tasks to complete
    compiled_grammar1 = await task1
    compiled_grammar2 = await task2

    return compiled_grammar1, compiled_grammar2

compiled_grammar1, compiled_grammar2 = asyncio.run(compile_grammars())

xgr.GrammarCompiler also includes a cache. If the same grammar is compiled again, the cached result is returned directly. Set cache_enabled to True to enable the cache, and cache_limit_bytes to cap the cache's memory usage. The cache uses an LRU (Least Recently Used) eviction policy.

The EBNF string, the JSON Schema string, and the regex pattern are used as the cache keys for compile_grammar, compile_json_schema, and compile_regex, respectively. Because the cache is keyed on the input string directly, a cache hit also skips the time spent constructing the grammar.

grammar_compiler = xgr.GrammarCompiler(
    tokenizer_info, cache_enabled=True, cache_limit_bytes=128 * 1024 * 1024
)
compiled_grammar1 = grammar_compiler.compile_grammar(grammar)
# The second call returns immediately from the cache
compiled_grammar2 = grammar_compiler.compile_grammar(grammar)
# Clear the cache to release memory
grammar_compiler.clear_cache()
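The other compile_* entry points are cached the same way. A brief sketch (the schema string below is only an illustration): compiling an identical JSON Schema string twice returns the cached result on the second call.

schema = '{"type": "object", "properties": {"name": {"type": "string"}}}'

compiled1 = grammar_compiler.compile_json_schema(schema)
# Same schema string, so this returns the cached compiled grammar
compiled2 = grammar_compiler.compile_json_schema(schema)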

Handle Padding to the LLM Output Logits

Sometimes the shape of the LLM's output logits is larger than the size of the tokenizer's vocabulary, because the model pads the output tensor. For example, the tokenizer of DeepSeek-V3 only defines 128,815 tokens, but its output probability distribution has a dimension of 129,280.

Note that XGrammar always treats the size of the model's output logits as the vocabulary size, because the bitmask operates on those logits. This vocabulary size is passed to xgr.TokenizerInfo and xgr.allocate_token_bitmask:

# Use the padded logits dimension (129,280), not the 128,815 tokens defined by the tokenizer
tokenizer_info = xgr.TokenizerInfo(tokenizer, vocab_size=129280)
# Allocate a bitmask for batch size 1 with the same vocabulary size
token_bitmask = xgr.allocate_token_bitmask(1, tokenizer_info.vocab_size)
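Because the bitmask matches the padded logits dimension, it can be applied to the model's output directly. A minimal sketch, assuming a compiled_grammar is already available and using a random tensor in place of real model logits:

import torch

matcher = xgr.GrammarMatcher(compiled_grammar)

# Stand-in for the model's padded output logits: batch size 1, dimension 129,280
logits = torch.randn(1, tokenizer_info.vocab_size)

# Fill the bitmask for the next token, then mask disallowed tokens in place
matcher.fill_next_token_bitmask(token_bitmask)
xgr.apply_token_bitmask_inplace(logits, token_bitmask)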

For most models, the logits’ vocabulary size can be found in the model config.

from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_path)
vocab_size = config.vocab_size
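You can then construct xgr.TokenizerInfo with this value (a short sketch; model_path is assumed to point at the same model used above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer_info = xgr.TokenizerInfo(tokenizer, vocab_size=vocab_size)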