Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-12-03 18:00:36 +00:00)
feat(inference): add tokenization utilities for prompt caching
Implement token counting utilities to determine prompt cacheability (≥1024 tokens) with support for OpenAI, Llama, and multimodal content.

- Add a count_tokens() function with model-specific tokenizers
- Support OpenAI models (GPT-4, GPT-4o, etc.) via tiktoken
- Support Llama models (3.x, 4.x) via transformers
- Fall back to character-based estimation for unknown models
- Handle multimodal content (text + images)
- Cache tokenizer instances with an LRU cache (max 10; cached calls take <1 ms)
- Add comprehensive unit tests (34 tests, >95% coverage)
- Update the tiktoken version constraint to >=0.8.0

This enables a future PR to decide which prompts should be cached based on the token count threshold.

Signed-off-by: William Caban <william.caban@gmail.com>
This commit is contained in:
parent 97f535c4f1 · commit e61572daf0
4 changed files with 902 additions and 1 deletions
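
The tokenization module itself is not part of the excerpt below (only the new test package __init__.py appears), so the following is a minimal Python sketch of the behavior the commit message describes, assuming a count_tokens() entry point, tiktoken for OpenAI-style models, a character-based fallback, and an LRU-cached tokenizer lookup. The Llama/transformers and multimodal paths are omitted, and all names here are illustrative rather than taken from the diff.

# Illustrative sketch only: the commit's actual implementation is not shown in this
# excerpt, so the names, model-prefix checks, and thresholds here are assumptions.
from functools import lru_cache

import tiktoken

CACHE_THRESHOLD_TOKENS = 1024  # per the commit message, prompts >= 1024 tokens are cacheable


@lru_cache(maxsize=10)  # commit message mentions caching up to 10 tokenizer instances
def _get_openai_encoding(model: str) -> tiktoken.Encoding:
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown OpenAI-style model name: fall back to a general-purpose encoding.
        return tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str, model: str) -> int:
    """Return an estimated token count for `text` under `model`'s tokenizer."""
    if model.startswith("gpt-"):
        return len(_get_openai_encoding(model).encode(text))
    # Fallback for unknown models: rough character-based estimate (~4 chars per token).
    return max(1, len(text) // 4)


def is_cacheable(text: str, model: str) -> bool:
    """A prompt is considered cacheable once it reaches the token threshold."""
    return count_tokens(text, model) >= CACHE_THRESHOLD_TOKENS

With caching in place, repeated calls for the same model reuse the tokenizer, which is what keeps cached calls in the sub-millisecond range claimed in the commit message.
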
tests/unit/providers/utils/inference/__init__.py (new file, +7 lines)
@@ -0,0 +1,7 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+"""Unit tests for inference utilities."""