llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 09:53:45 +00:00

William Caban e61572daf0	2025-11-15 17:27:08 -05:00

feat(inference): add tokenization utilities for prompt caching

Implement token counting utilities to determine prompt cacheability (≥1024 tokens) with support for OpenAI, Llama, and multimodal content.

- Add count_tokens() function with model-specific tokenizers
- Support OpenAI models (GPT-4, GPT-4o, etc.) via tiktoken
- Support Llama models (3.x, 4.x) via transformers
- Fallback to character-based estimation for unknown models
- Handle multimodal content (text + images)
- LRU cache for tokenizer instances (max 10, <1ms cached calls)
- Comprehensive unit tests (34 tests, >95% coverage)
- Update tiktoken version constraint to >=0.8.0

This enables future PR to determine which prompts should be cached based on token count threshold.

Signed-off-by: William Caban <william.caban@gmail.com>
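
The commit message above names a count_tokens() function, a character-based fallback, a 10-entry LRU cache for tokenizers, and a 1024-token cacheability threshold, but it does not show the code. The following is only a rough sketch of how those pieces could fit together. The function signatures, the `get_tokenizer` and `is_cacheable` helpers, and the four-characters-per-token ratio are all assumptions made for illustration; the real utilities in the commit may look quite different, and this sketch deliberately leaves out the tiktoken and transformers paths.

```python
from functools import lru_cache
from typing import Callable

# From the commit message: prompts with at least 1024 tokens are cacheable.
CACHE_THRESHOLD_TOKENS = 1024
# Assumption: a rough heuristic ratio, not taken from the commit.
CHARS_PER_TOKEN = 4


def _estimate_tokens(text: str) -> int:
    """Character-based fallback: about one token per CHARS_PER_TOKEN characters, rounded up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


@lru_cache(maxsize=10)  # the commit mentions an LRU cache holding at most 10 tokenizers
def get_tokenizer(model: str) -> Callable[[str], int]:
    """Hypothetical helper: return a token-counting callable for `model`.

    A fuller version would dispatch to a model-specific tokenizer here.
    This sketch always returns the character-based fallback.
    """
    return _estimate_tokens


def count_tokens(text: str, model: str = "unknown") -> int:
    """Count tokens in `text` using the (cached) tokenizer for `model`."""
    return get_tokenizer(model)(text)


def is_cacheable(text: str, model: str = "unknown") -> bool:
    """Hypothetical helper: True when the prompt meets the token threshold."""
    return count_tokens(text, model) >= CACHE_THRESHOLD_TOKENS
```

Under these assumptions, `is_cacheable("a" * 4096)` is true because the estimate lands exactly on the threshold, while a prompt four characters shorter falls just below it.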
agents/meta_reference	test: Restore responses unit tests (#4153)	2025-11-14 13:16:03 -08:00
batches	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
files	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
inference	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
inline	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
nvidia	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
utils	feat(inference): add tokenization utilities for prompt caching	2025-11-15 17:27:08 -05:00
vector_io	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
test_bedrock.py	fix: rename llama_stack_api dir (#4155)	2025-11-13 15:04:36 -08:00
test_configs.py	chore(rename): move llama_stack.distribution to llama_stack.core (#2975)	2025-07-30 23:30:53 -07:00