llama-stack-mirror/tests/unit/providers/utils
William Caban e61572daf0 feat(inference): add tokenization utilities for prompt caching
Implement token-counting utilities to determine whether a prompt is cacheable
(≥1024 tokens), with support for OpenAI, Llama, and multimodal content.

- Add count_tokens() function with model-specific tokenizers (see the sketch after this list)
- Support OpenAI models (GPT-4, GPT-4o, etc.) via tiktoken
- Support Llama models (3.x, 4.x) via transformers
- Fallback to character-based estimation for unknown models
- Handle multimodal content (text + images)
- LRU cache for tokenizer instances (max 10, <1ms cached calls)
- Comprehensive unit tests (34 tests, >95% coverage)
- Update tiktoken version constraint to >=0.8.0
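For orientation, a minimal sketch of the shape the bullets above describe; this is not the shipped code: the dispatch rules, helper names, and the multimodal/image handling (omitted here) are assumptions.

    # Minimal sketch, assuming tiktoken and transformers are installed.
    from functools import lru_cache

    @lru_cache(maxsize=10)  # cap cached tokenizer instances at 10, per the commit
    def _get_tokenizer(model: str):
        if model.startswith("gpt-"):  # OpenAI family (GPT-4, GPT-4o, ...) -> tiktoken
            import tiktoken
            return ("tiktoken", tiktoken.encoding_for_model(model))
        if "llama" in model.lower():  # Llama 3.x/4.x -> HF transformers tokenizer
            from transformers import AutoTokenizer
            return ("hf", AutoTokenizer.from_pretrained(model))
        return ("estimate", None)  # unknown model -> character-based fallback

    def count_tokens(text: str, model: str) -> int:
        kind, tok = _get_tokenizer(model)
        if kind in ("tiktoken", "hf"):
            return len(tok.encode(text))
        return len(text) // 4  # rough heuristic: ~4 characters per token

The lru_cache on the tokenizer factory is what keeps repeat calls cheap (the "<1ms cached calls" bullet): loading a transformers tokenizer is expensive, but subsequent lookups for the same model string hit the cache.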

This enables a future PR to determine which prompts should be cached based on a token-count threshold.
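For illustration, the gate such a follow-up PR might layer on top (helper name hypothetical; the ≥1024 threshold comes from this commit):

    # Hypothetical helper: a prompt qualifies for caching at >=1024 tokens.
    def is_cacheable(prompt: str, model: str) -> bool:
        return count_tokens(prompt, model) >= 1024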

Signed-off-by: William Caban <william.caban@gmail.com>
2025-11-15 17:27:08 -05:00
inference feat(inference): add tokenization utilities for prompt caching 2025-11-15 17:27:08 -05:00
memory fix: rename llama_stack_api dir (#4155) 2025-11-13 15:04:36 -08:00
__init__.py fix: add check for interleavedContent (#1973) 2025-05-06 09:55:07 -07:00
test_form_data.py fix(expires_after): make sure multipart/form-data is properly parsed (#3612) 2025-09-30 16:14:03 -04:00
test_model_registry.py fix: rename llama_stack_api dir (#4155) 2025-11-13 15:04:36 -08:00
test_openai_compat_conversion.py feat(tools)!: substantial clean up of "Tool" related datatypes (#3627) 2025-10-02 15:12:03 -07:00
test_scheduler.py chore: default to pytest asyncio-mode=auto (#2730) 2025-07-11 13:00:24 -07:00