Bitmask Operations¶
- xgrammar.allocate_token_bitmask(batch_size: int, vocab_size: int) → Tensor [source]¶
Allocate the bitmask for next-token prediction. The bitmask is an int32 tensor on CPU with shape (batch_size, ceil(vocab_size / 32)). Users who need to manage CUDA memory themselves can construct the tensor using get_bitmask_shape and bitmask_dtype.
We use int32 instead of uint32 because older versions of PyTorch do not support uint32.
- Parameters:
batch_size (int) – The batch size of the bitmask.
vocab_size (int) – The size of the vocabulary.
- Returns:
bitmask – The allocated token bitmask.
- Return type:
torch.Tensor
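For illustration, a minimal allocation sketch (the batch and vocabulary sizes are arbitrary example values):

```python
import torch
import xgrammar

batch_size, vocab_size = 4, 128000
bitmask = xgrammar.allocate_token_bitmask(batch_size, vocab_size)

assert bitmask.dtype == torch.int32                            # xgrammar.bitmask_dtype
assert bitmask.device.type == "cpu"                            # allocated on the CPU
assert bitmask.shape == (batch_size, (vocab_size + 31) // 32)  # ceil(vocab_size / 32) = 4000
```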
- xgrammar.apply_token_bitmask_inplace(logits: Tensor, bitmask: Tensor, *, vocab_size: Optional[int] = None, indices: Optional[List[int]] = None) → None [source]¶
Apply the bitmask to the logits in-place. The bitmask is a bit-packed 0/1 tensor, where 0 means the token is masked and 1 means the token is not masked. It can be allocated with allocate_token_bitmask and filled by fill_next_token_bitmask. After the bitmask is applied, the masked logits are set to -inf.
The shapes of logits and bitmask should be (batch_size, vocab_size) and (batch_size, bitmask_size) respectively, where bitmask_size = ceil(vocab_size / 32). The operation is:
for i in range(batch_size):
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, i, j) == 0:
            logits[i, j] = -inf
get_bitmask_value(bitmask, i, j) gets the j-th bit of the i-th row of the bitmask.
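For reference only, the operation above can be written as plain Python over PyTorch tensors. This sketch assumes the common bit layout in which token j maps to bit j % 32 of 32-bit word j // 32; the real implementation is a vectorized CPU/GPU kernel.

```python
import torch

def get_bitmask_value(bitmask: torch.Tensor, i: int, j: int) -> int:
    # Token j of row i lives in 32-bit word j // 32, at bit position j % 32.
    word = int(bitmask[i, j // 32].item())
    return (word >> (j % 32)) & 1

def apply_token_bitmask_reference(logits: torch.Tensor, bitmask: torch.Tensor) -> None:
    batch_size, vocab_size = logits.shape
    for i in range(batch_size):
        for j in range(vocab_size):
            if get_bitmask_value(bitmask, i, j) == 0:
                logits[i, j] = float("-inf")
```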
Notes
- Padding:
This method allows additional padding on the vocabulary dimension of logits or bitmask. If padding exists, provide the real vocab size to the vocab_size parameter, and the operation will be applied to logits[…, :vocab_size] and bitmask[…, :ceil(vocab_size / 32)].
If vocab_size is not provided, the vocab size will be detected as min(logits.shape[-1], bitmask.shape[-1] * 32).
- Indices:
Indices can be used to specify which rows of logits in the batch to apply the bitmask to. This is especially useful when structured and unstructured requests are mixed in the same batch: masking can be skipped for the unstructured requests (see the sketch after these notes). When indices is specified, the operation is
for batch_id in indices:
    for j in range(vocab_size):
        if get_bitmask_value(bitmask, batch_id, j) == 0:
            logits[batch_id, j] = -inf
When indices is specified, the batch sizes of logits and bitmask do not need to match; the operation is performed as long as every index is within bounds for both tensors.
- Device:
The logits and bitmask should be on the same device. If both are on GPU, a GPU kernel is launched to apply the bitmask; if both are on CPU, a CPU implementation is used. The GPU kernel is optimized and should be preferred.
In practice, the bitmask is allocated on CPU while the logits are usually on GPU, so users should copy the bitmask to the GPU before calling this function (a sketch follows the parameter list below).
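A small, self-contained sketch of the padding and indices notes above, using CPU tensors and toy sizes. The bit layout (bit j % 32 of word j // 32 for token j) and the exact masking behavior are assumed to match the description; all sizes, indices, and allowed tokens are made up for illustration.

```python
import torch
import xgrammar

batch_size = 3
real_vocab_size = 50       # actual tokenizer vocabulary size
padded_vocab_size = 64     # logits are padded beyond the real vocabulary

logits = torch.zeros(batch_size, padded_vocab_size)
bitmask = xgrammar.allocate_token_bitmask(batch_size, real_vocab_size)

# Row 0 allows only token 0, row 2 allows only token 1; row 1 is unstructured.
bitmask.zero_()
bitmask[0, 0] = 1       # bit 0 of word 0 -> token 0 allowed
bitmask[2, 0] = 1 << 1  # bit 1 of word 0 -> token 1 allowed

xgrammar.apply_token_bitmask_inplace(
    logits,
    bitmask,
    vocab_size=real_vocab_size,  # restrict masking to the real vocabulary
    indices=[0, 2],              # row 1 (unstructured) is skipped entirely
)

assert torch.isinf(logits[0, 1:real_vocab_size]).all()  # row 0: everything but token 0 masked
assert logits[0, 0] == 0                                # row 0: token 0 kept
assert (logits[1] == 0).all()                           # row 1: untouched
assert (logits[:, real_vocab_size:] == 0).all()         # padded region: untouched
```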
- Parameters:
logits (torch.Tensor) – The tensor to apply the bitmask to.
bitmask (torch.Tensor) – The bitmask to apply.
vocab_size (Optional[int], default: None) – The size of the vocabulary. If not provided, the vocab size will be detected as min(logits.shape[-1], bitmask.shape[-1] * 32).
indices (Optional[List[int]], default: None) – A list of indices to specify which logits in the batch to apply the bitmask to. Should be unique. If None, apply the bitmask to all logits in the batch.
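As noted under Device above, the usual pattern is to fill the bitmask on the CPU and copy it to the logits' device before applying it. A minimal sketch (assumes a CUDA device is available; filling the mask with the grammar matcher's fill_next_token_bitmask is only indicated in a comment):

```python
import torch
import xgrammar

batch_size, vocab_size = 2, 128000
bitmask = xgrammar.allocate_token_bitmask(batch_size, vocab_size)  # CPU, int32

# ... fill each row on the CPU, e.g. with the grammar matcher's
#     fill_next_token_bitmask for each request ...

logits = torch.randn(batch_size, vocab_size, device="cuda")

# Copy the CPU bitmask to the logits' device; with both tensors on the GPU,
# the optimized CUDA kernel is used.
xgrammar.apply_token_bitmask_inplace(logits, bitmask.to(logits.device))
```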
- xgrammar.get_bitmask_shape(batch_size: int, vocab_size: int) → Tuple[int, int] [source]¶
Return the shape of the bitmask: (batch_size, ceil(vocab_size / 32)).
- xgrammar.bitmask_dtype = torch.int32¶
The dtype of the bitmask: int32.
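For users who manage device memory themselves (as mentioned under allocate_token_bitmask), these two symbols are enough to construct the bitmask manually; a sketch that places it directly on the GPU (assumes a CUDA device):

```python
import torch
import xgrammar

batch_size, vocab_size = 4, 128000
shape = xgrammar.get_bitmask_shape(batch_size, vocab_size)  # (4, 4000)

# torch.empty leaves the contents uninitialized; fill the mask before applying it.
bitmask = torch.empty(shape, dtype=xgrammar.bitmask_dtype, device="cuda")
```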