chore: remove usage of load_tiktoken_bpe

The `load_tiktoken_bpe()` function depends on blobfile to load
tokenizer.model files. However, blobfile brings in pycryptodomex, which
is primarily used for JWT signing against GCP, functionality we don’t
need since we always load tokenizers from local files. pycryptodomex
implements its own cryptographic primitives, which are known to be
problematic and insecure. While blobfile could potentially switch to the
more secure PyCA cryptography library, the project appears inactive, so
this transition may not happen soon. Fortunately, `load_tiktoken_bpe()`
is a simple function that just reads a BPE file and returns a dictionary
mapping byte sequences to their mergeable ranks. It’s straightforward
enough for us to implement ourselves.
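
For reference, the tiktoken BPE file format is one base64-encoded token
and its integer rank per line, so a minimal loader suffices. A sketch of
such a loader is below; the actual `load_bpe_file()` added in
`tokenizer_utils.py` (not shown in this excerpt) may differ in details
such as error handling:

```python
import base64
from pathlib import Path


def load_bpe_file(model_path: Path) -> dict[bytes, int]:
    """Parse a tiktoken-format BPE file: one 'base64(token) rank' pair per line."""
    mergeable_ranks: dict[bytes, int] = {}
    with open(model_path, "rb") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            token_b64, rank = line.split()
            mergeable_ranks[base64.b64decode(token_b64)] = int(rank)
    return mergeable_ranks
```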

Signed-off-by: Sébastien Han <seb@redhat.com>
Sébastien Han 2025-05-27 10:49:03 +02:00
parent a8f75d3897
commit b45cc42202
6 changed files with 234 additions and 17 deletions


@@ -4,7 +4,6 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-import os
 from collections.abc import Collection, Iterator, Sequence, Set
 from logging import getLogger
 from pathlib import Path
@@ -14,7 +13,8 @@ from typing import (
 )
 
 import tiktoken
-from tiktoken.load import load_tiktoken_bpe
+
+from llama_stack.models.llama.tokenizer_utils import load_bpe_file
 
 logger = getLogger(__name__)
@@ -48,19 +48,20 @@ class Tokenizer:
         global _INSTANCE
 
         if _INSTANCE is None:
-            _INSTANCE = Tokenizer(os.path.join(os.path.dirname(__file__), "tokenizer.model"))
+            _INSTANCE = Tokenizer(Path(__file__).parent / "tokenizer.model")
         return _INSTANCE
 
-    def __init__(self, model_path: str):
+    def __init__(self, model_path: Path):
         """
         Initializes the Tokenizer with a Tiktoken model.
 
         Args:
             model_path (str): The path to the Tiktoken model file.
         """
-        assert os.path.isfile(model_path), model_path
+        if not model_path.exists():
+            raise FileNotFoundError(f"Tokenizer model file not found: {model_path}")
 
-        mergeable_ranks = load_tiktoken_bpe(model_path)
+        mergeable_ranks = load_bpe_file(model_path)
         num_base_tokens = len(mergeable_ranks)
         special_tokens = [
             "<|begin_of_text|>",
@@ -83,7 +84,7 @@ class Tokenizer:
         self.special_tokens = {token: num_base_tokens + i for i, token in enumerate(special_tokens)}
 
         self.model = tiktoken.Encoding(
-            name=Path(model_path).name,
+            name=model_path.name,
             pat_str=self.pat_str,
             mergeable_ranks=mergeable_ranks,
             special_tokens=self.special_tokens,
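
As a quick sanity check of the rebuilt encoding (a sketch assuming only
what the diff shows, namely that the wrapper stores the
`tiktoken.Encoding` as `self.model`):

```python
tok = Tokenizer.get_instance()

# Round-trip through the tiktoken.Encoding built from load_bpe_file():
ids = tok.model.encode("hello world")
assert tok.model.decode(ids) == "hello world"
```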