feat(inference): add tokenization utilities for prompt caching

Implement token counting utilities to determine prompt cacheability
(≥1024 tokens) with support for OpenAI, Llama, and multimodal content.

- Add count_tokens() function with model-specific tokenizers (see the sketch after this list)
- Support OpenAI models (GPT-4, GPT-4o, etc.) via tiktoken
- Support Llama models (3.x, 4.x) via transformers
- Fallback to character-based estimation for unknown models
- Handle multimodal content (text + images)
- LRU cache for tokenizer instances (max 10, <1ms cached calls)
- Comprehensive unit tests (34 tests, >95% coverage)
- Update tiktoken version constraint to >=0.8.0
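A minimal sketch of the counting path described above, assuming the behavior listed in the bullets; the helper names, constants, and the Llama repo id are illustrative, not the exact identifiers in this commit:

    from functools import lru_cache

    CACHE_MIN_TOKENS = 1024          # prompts at or above this size are cacheable
    CHARS_PER_TOKEN_ESTIMATE = 4     # rough fallback ratio for unknown models

    @lru_cache(maxsize=10)
    def _get_tokenizer(model: str):
        """Load and memoize one tokenizer per model (LRU keeps at most 10)."""
        if model.startswith("gpt-"):
            import tiktoken
            try:
                return tiktoken.encoding_for_model(model)
            except KeyError:
                return tiktoken.get_encoding("o200k_base")
        if "llama" in model.lower():
            from transformers import AutoTokenizer
            # Illustrative HF repo id; the real model mapping lives in the PR.
            return AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
        return None  # unknown model -> character-based estimate

    def count_tokens(text: str, model: str) -> int:
        """Exact count when a tokenizer is available, character-based estimate otherwise."""
        tokenizer = _get_tokenizer(model)
        if tokenizer is None:
            return max(1, len(text) // CHARS_PER_TOKEN_ESTIMATE)
        return len(tokenizer.encode(text))

The multimodal (text + images) handling added in the PR is not shown here; this sketch covers text-only prompts.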

This enables a future PR to determine which prompts should be cached based on the token-count threshold.
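For example, a downstream caller could gate caching on that threshold (is_prompt_cacheable is a hypothetical helper, not part of this commit):

    def is_prompt_cacheable(prompt: str, model: str) -> bool:
        # Cache only prompts that meet the 1024-token minimum.
        return count_tokens(prompt, model) >= CACHE_MIN_TOKENS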

Signed-off-by: William Caban <william.caban@gmail.com>
William Caban 2025-11-15 17:27:08 -05:00
parent 97f535c4f1
commit e61572daf0
4 changed files with 902 additions and 1 deletions


@@ -40,7 +40,7 @@ dependencies = [
"rich",
"starlette",
"termcolor",
"tiktoken",
"tiktoken>=0.8.0",
"pillow",
"h11>=0.16.0",
"python-multipart>=0.0.20", # For fastapi Form