feat(inference): add tokenization utilities for prompt caching

Implement token counting utilities to determine prompt cacheability
(≥1024 tokens) with support for OpenAI, Llama, and multimodal content.

- Add count_tokens() function with model-specific tokenizers (see the sketch after this list)
- Support OpenAI models (GPT-4, GPT-4o, etc.) via tiktoken
- Support Llama models (3.x, 4.x) via transformers
- Fallback to character-based estimation for unknown models
- Handle multimodal content (text + images)
- LRU cache for tokenizer instances (max 10, <1ms cached calls)
- Comprehensive unit tests (34 tests, >95% coverage)
- Update tiktoken version constraint to >=0.8.0
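A minimal sketch of the counting path described above, assuming the behavior listed in the bullets; the helper names, constants, and the Llama repo id are illustrative, not the exact identifiers in this commit:

    from functools import lru_cache

    CACHE_MIN_TOKENS = 1024          # prompts at or above this size are cacheable
    CHARS_PER_TOKEN_ESTIMATE = 4     # rough fallback ratio for unknown models

    @lru_cache(maxsize=10)
    def _get_tokenizer(model: str):
        """Load and memoize one tokenizer per model (LRU keeps at most 10)."""
        if model.startswith("gpt-"):
            import tiktoken
            try:
                return tiktoken.encoding_for_model(model)
            except KeyError:
                return tiktoken.get_encoding("o200k_base")
        if "llama" in model.lower():
            from transformers import AutoTokenizer
            # Illustrative HF repo id; the real model mapping lives in the PR.
            return AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
        return None  # unknown model -> character-based estimate

    def count_tokens(text: str, model: str) -> int:
        """Exact count when a tokenizer is available, character-based estimate otherwise."""
        tokenizer = _get_tokenizer(model)
        if tokenizer is None:
            return max(1, len(text) // CHARS_PER_TOKEN_ESTIMATE)
        return len(tokenizer.encode(text))

The multimodal (text + images) handling added in the PR is not shown here; this sketch covers text-only prompts.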

This enables a future PR to determine which prompts should be cached based on the token-count threshold.
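For example, a downstream caller could gate caching on that threshold (is_prompt_cacheable is a hypothetical helper, not part of this commit):

    def is_prompt_cacheable(prompt: str, model: str) -> bool:
        # Cache only prompts that meet the 1024-token minimum.
        return count_tokens(prompt, model) >= CACHE_MIN_TOKENS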

Signed-off-by: William Caban <william.caban@gmail.com>
William Caban 2025-11-15 17:27:08 -05:00
parent 97f535c4f1
commit e61572daf0
4 changed files with 902 additions and 1 deletions


@@ -40,7 +40,7 @@ dependencies = [
"rich",
"starlette",
"termcolor",
"tiktoken",
"tiktoken>=0.8.0",
"pillow",
"h11>=0.16.0",
"python-multipart>=0.0.20", # For fastapi Form