Mirror of https://github.com/BerriAI/litellm.git, synced 2025-04-25 18:54:30 +00:00
Litellm dev 12 07 2024 (#7086)
All checks were successful: Read Version from pyproject.toml / read-version (push), successful in 11s
* fix(main.py): support passing max retries to azure/openai embedding integrations
  Fixes https://github.com/BerriAI/litellm/issues/7003
* feat(team_endpoints.py): allow updating team model aliases
  Closes https://github.com/BerriAI/litellm/issues/6956
* feat(router.py): allow specifying a model id as a fallback - skips any cooldown check
  Allows a default model to be checked if all models are in cooldown
  s/o @micahjsmith
* docs(reliability.md): add fallback to a specific model to the docs
* fix(utils.py): new 'is_prompt_caching_valid_prompt' helper util
  Allows the user to identify if messages/tools have prompt caching
  Related issue: https://github.com/BerriAI/litellm/issues/6784
* feat(router.py): store the model id for a prompt-caching-valid prompt
  Allows routing to that model id on subsequent requests
* fix(router.py): only cache if the prompt is a valid prompt-caching prompt
  Prevents storing unnecessary items in the cache
* feat(router.py): support routing prompt-caching-enabled models to previous deployments
  Closes https://github.com/BerriAI/litellm/issues/6784
* test: fix linting errors
* feat(databricks/): convert BaseModel to dict and exclude None values
  Allows passing a pydantic message to databricks
* fix(utils.py): ensure all chat completion messages are dicts
* (feat) Track `custom_llm_provider` in LiteLLMSpendLogs (#7081)
  * add custom_llm_provider to SpendLogsPayload
  * add custom_llm_provider to SpendLogs
  * add custom llm provider to SpendLogs payload
  * test_spend_logs_payload
* Add MLflow to the sidebar (#7031)
  Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
* (bug fix) SpendLogs update DB: catch all possible DB errors for retrying (#7082)
  * catch DB_CONNECTION_ERROR_TYPES
  * fix DB retry mechanism for SpendLog updates
  * use DB_CONNECTION_ERROR_TYPES in auth checks
  * fix exponential backoff for writing SpendLogs
  * use _raise_failed_update_spend_exception to ensure errors print as non-blocking
  * test_update_spend_logs_multiple_batches_with_failure
* (Feat) Add StructuredOutputs support for Fireworks.AI (#7085)
  * fix model cost map fireworks ai: "supports_response_schema": true
  * fix supports_response_schema
  * fix map openai params fireworks ai
  * test_map_response_format
* added deepinfra/Meta-Llama-3.1-405B-Instruct (#7084)
* bump: version 1.53.9 → 1.54.0
* fix deepinfra
* litellm db fixes LiteLLM_UserTable (#7089)
* ci/cd queue new release
* fix llama-3.3-70b-versatile
* refactor: use consistent file naming convention `AI21/` -> `ai21` (#7090)
  * fix refactor - use consistent file naming convention
  * ci/cd run again
  * fix naming structure
* fix: use consistent naming (#7092)

---------

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Yuki Watanabe <31463517+B-Step62@users.noreply.github.com>
Co-authored-by: ali sayyah <ali.sayyah2@gmail.com>
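For illustration, a minimal sketch of the fallback-to-a-specific-deployment change described in the commit message above. The model_list shape follows litellm's standard Router configuration; the model names and deployment ids are hypothetical, and the assumption that a fallback entry may name a deployment's `model_info.id` directly (skipping the cooldown check) is taken from the commit description, not from this diff, so treat the exact config shape as an assumption.

# Sketch only: two "gpt-4o" deployments; if the group is in cooldown,
# fall back to the specific deployment id "backup-us-1" (assumed behavior per commit message).
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "primary-key"},
            "model_info": {"id": "primary-eu-1"},  # hypothetical deployment id
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "backup-key"},
            "model_info": {"id": "backup-us-1"},  # hypothetical deployment id
        },
    ],
    # Per the commit message, a fallback entry can now point at a deployment id,
    # and that deployment is tried even if it would otherwise be in cooldown.
    fallbacks=[{"gpt-4o": ["backup-us-1"]}],
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
)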
Parent: 36e99ebce7
Commit: 0c0498dd60
24 changed files with 840 additions and 193 deletions
litellm/router_utils/prompt_caching_cache.py (new file, 193 lines)
@@ -0,0 +1,193 @@
"""
Wrapper around the router cache. Meant to store the model id when a prompt-caching-supported prompt is called.
"""

import hashlib
import json
import time
from typing import TYPE_CHECKING, Any, List, Optional, Tuple, TypedDict

import litellm
from litellm import verbose_logger
from litellm.caching.caching import Cache, DualCache
from litellm.caching.in_memory_cache import InMemoryCache
from litellm.types.llms.openai import AllMessageValues, ChatCompletionToolParam

if TYPE_CHECKING:
    from opentelemetry.trace import Span as _Span

    from litellm.router import Router

    litellm_router = Router
    Span = _Span
else:
    Span = Any
    litellm_router = Any


class PromptCachingCacheValue(TypedDict):
    model_id: str


class PromptCachingCache:
    def __init__(self, cache: DualCache):
        self.cache = cache
        self.in_memory_cache = InMemoryCache()

    @staticmethod
    def serialize_object(obj: Any) -> Any:
        """Helper function to serialize Pydantic objects, dictionaries, or fall back to string."""
        if hasattr(obj, "dict"):
            # If the object is a Pydantic model, use its `dict()` method
            return obj.dict()
        elif isinstance(obj, dict):
            # If the object is a dictionary, serialize it with sorted keys
            return json.dumps(
                obj, sort_keys=True, separators=(",", ":")
            )  # Standardize serialization
        elif isinstance(obj, list):
            # Serialize lists by ensuring each element is handled properly
            return [PromptCachingCache.serialize_object(item) for item in obj]
        elif isinstance(obj, (int, float, bool)):
            return obj  # Keep primitive types as-is
        return str(obj)

    @staticmethod
    def get_prompt_caching_cache_key(
        messages: Optional[List[AllMessageValues]],
        tools: Optional[List[ChatCompletionToolParam]],
    ) -> Optional[str]:
        if messages is None and tools is None:
            return None
        # Use serialize_object for consistent and stable serialization
        data_to_hash = {}
        if messages is not None:
            serialized_messages = PromptCachingCache.serialize_object(messages)
            data_to_hash["messages"] = serialized_messages
        if tools is not None:
            serialized_tools = PromptCachingCache.serialize_object(tools)
            data_to_hash["tools"] = serialized_tools

        # Combine serialized data into a single string
        data_to_hash_str = json.dumps(
            data_to_hash,
            sort_keys=True,
            separators=(",", ":"),
        )

        # Create a hash of the serialized data for a stable cache key
        hashed_data = hashlib.sha256(data_to_hash_str.encode()).hexdigest()
        return f"deployment:{hashed_data}:prompt_caching"

    def add_model_id(
        self,
        model_id: str,
        messages: Optional[List[AllMessageValues]],
        tools: Optional[List[ChatCompletionToolParam]],
    ) -> None:
        if messages is None and tools is None:
            return None

        cache_key = PromptCachingCache.get_prompt_caching_cache_key(messages, tools)
        self.cache.set_cache(
            cache_key, PromptCachingCacheValue(model_id=model_id), ttl=300
        )
        return None

    async def async_add_model_id(
        self,
        model_id: str,
        messages: Optional[List[AllMessageValues]],
        tools: Optional[List[ChatCompletionToolParam]],
    ) -> None:
        if messages is None and tools is None:
            return None

        cache_key = PromptCachingCache.get_prompt_caching_cache_key(messages, tools)
        await self.cache.async_set_cache(
            cache_key,
            PromptCachingCacheValue(model_id=model_id),
            ttl=300,  # store for 5 minutes
        )
        return None

    async def async_get_model_id(
        self,
        messages: Optional[List[AllMessageValues]],
        tools: Optional[List[ChatCompletionToolParam]],
    ) -> Optional[PromptCachingCacheValue]:
        """
        If messages is not None:
        - check the full messages
        - check messages[:-1]
        - check messages[:-2]
        - check messages[:-3]

        Uses self.cache.async_batch_get_cache(keys=potential_cache_keys)
        """
        if messages is None and tools is None:
            return None

        # Generate potential cache keys by slicing messages
        potential_cache_keys = []

        if messages is not None:
            full_cache_key = PromptCachingCache.get_prompt_caching_cache_key(
                messages, tools
            )
            potential_cache_keys.append(full_cache_key)

            # Check progressively shorter message slices
            for i in range(1, min(4, len(messages))):
                partial_messages = messages[:-i]
                partial_cache_key = PromptCachingCache.get_prompt_caching_cache_key(
                    partial_messages, tools
                )
                potential_cache_keys.append(partial_cache_key)

        # Perform batch cache lookup
        cache_results = await self.cache.async_batch_get_cache(
            keys=potential_cache_keys
        )

        if cache_results is None:
            return None

        # Return the first non-None cache result
        for result in cache_results:
            if result is not None:
                return result

        return None

    def get_model_id(
        self,
        messages: Optional[List[AllMessageValues]],
        tools: Optional[List[ChatCompletionToolParam]],
    ) -> Optional[PromptCachingCacheValue]:
        if messages is None and tools is None:
            return None

        cache_key = PromptCachingCache.get_prompt_caching_cache_key(messages, tools)
        return self.cache.get_cache(cache_key)

    async def async_get_prompt_caching_deployment(
        self,
        router: litellm_router,
        messages: Optional[List[AllMessageValues]],
        tools: Optional[List[ChatCompletionToolParam]],
    ) -> Optional[dict]:
        model_id_dict = await self.async_get_model_id(
            messages=messages,
            tools=tools,
        )

        if model_id_dict is not None:
            healthy_deployment_pydantic_obj = router.get_deployment(
                model_id=model_id_dict["model_id"]
            )
            if healthy_deployment_pydantic_obj is not None:
                return healthy_deployment_pydantic_obj.model_dump(exclude_none=True)
        return None