LiteLLM Minor Fixes & Improvements (10/12/2024) (#6179)

* build(model_prices_and_context_window.json): add bedrock llama3.2 pricing

* build(model_prices_and_context_window.json): add bedrock cross region inference pricing

* Revert "(perf) move s3 logging to Batch logging + async [94% faster perf under 100 RPS on 1 litellm instance] (#6165)"

This reverts commit 2a5624af47.

* add azure/gpt-4o-2024-05-13 (#6174)

* LiteLLM Minor Fixes & Improvements (10/10/2024)  (#6158)

* refactor(vertex_ai_partner_models/anthropic): refactor anthropic to use partner model logic

* fix(vertex_ai/): support passing custom api base to partner models

Fixes https://github.com/BerriAI/litellm/issues/4317
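
A rough, hypothetical sketch of what this fix enables: routing a Vertex AI partner (Anthropic) model through a custom API base. The model name and parameters follow litellm's docs; the endpoint, project, and location values are made up for illustration and are not taken from this PR:

```python
# Hypothetical usage sketch; requires valid GCP credentials to actually run.
import litellm

resp = litellm.completion(
    model="vertex_ai/claude-3-5-sonnet@20240620",    # partner model served via Vertex AI
    messages=[{"role": "user", "content": "Hello"}],
    api_base="https://my-vertex-proxy.example.com",  # custom api base, the subject of this fix
    vertex_project="my-gcp-project",
    vertex_location="us-east5",
)
print(resp.choices[0].message.content)
```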

* fix(proxy_server.py): Fix prometheus premium user check logic

* docs(prometheus.md): update quick start docs

* fix(custom_llm.py): support passing dynamic api key + api base
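
As a rough illustration of this fix (see also the custom_llm_server.md doc update further down), a handler registered via `litellm.custom_provider_map` can be driven with a per-request api_key and api_base. The provider name, the handler class, and the assumption that litellm forwards `api_base`/`api_key` to the handler as keyword arguments are illustrative, not code from this PR:

```python
# Hypothetical sketch: pass a dynamic api key + api base through to a custom handler.
import litellm
from litellm import CustomLLM, completion


class MyCustomLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> litellm.ModelResponse:
        # Assumption: the per-request values arrive as keyword arguments on the handler.
        api_base = kwargs.get("api_base")
        api_key = kwargs.get("api_key")
        print(f"custom handler called with api_base={api_base}, api_key set: {api_key is not None}")
        # Return a mocked response so the sketch runs without a real backend.
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "hello"}],
            mock_response="Hi from the custom handler!",
        )  # type: ignore


litellm.custom_provider_map = [
    {"provider": "my-custom-llm", "custom_handler": MyCustomLLM()}
]

resp = completion(
    model="my-custom-llm/my-model",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="https://my-endpoint.example.com",  # dynamic, per-request
    api_key="sk-my-dynamic-key",                 # dynamic, per-request
)
print(resp.choices[0].message.content)
```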

* fix(realtime_api/main.py): Add request/response logging for realtime api endpoints

Closes https://github.com/BerriAI/litellm/issues/6081

* feat(openai/realtime): add openai realtime api logging

Closes https://github.com/BerriAI/litellm/issues/6081

* fix(realtime_streaming.py): fix linting errors

* fix(realtime_streaming.py): fix linting errors

* fix: fix linting errors

* fix pattern match router

* Add literalai in the sidebar observability category (#6163)

* fix: add literalai in the sidebar

* fix: typo

* update (#6160)

* Feat: Add Langtrace integration (#5341)

* Feat: Add Langtrace integration

* add langtrace service name

* fix timestamps for traces

* add tests

* Discard Callback + use existing otel logger

* cleanup

* remove print statements

* remove callback

* add docs

* docs

* add logging docs

* format logging

* remove emoji and add litellm proxy example

* format logging

* format `logging.md`

* add langtrace docs to logging.md

* sync conflict

* docs fix

* (perf) move s3 logging to Batch logging + async [94% faster perf under 100 RPS on 1 litellm instance] (#6165)

* fix move s3 to use customLogger

* add basic s3 logging test

* add s3 to custom logger compatible

* use batch logger for s3

* s3 set flush interval and batch size

* fix s3 logging

* add notes on s3 logging

* fix s3 logging

* add basic s3 logging test

* fix s3 type errors

* add test for sync logging on s3

* fix: switch to debug log

---------

Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Willy Douhard <willy.douhard@gmail.com>
Co-authored-by: yujonglee <yujonglee.dev@gmail.com>
Co-authored-by: Ali Waleed <ali@scale3labs.com>

* docs(custom_llm_server.md): update doc on passing custom params

* fix(pass_through_endpoints.py): don't require headers

Fixes https://github.com/BerriAI/litellm/issues/6128

* feat(utils.py): add support for caching rerank endpoints

Closes https://github.com/BerriAI/litellm/issues/6144
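
For reference, the new rerank caching path can be exercised roughly as sketched below. This is a minimal, hypothetical example: the model name, the cache settings, and the `_hidden_params` lookup are assumptions based on litellm's public docs and the diff below, not code taken from this PR or its tests.

```python
# Hypothetical sketch: with a cache configured, repeating an identical rerank call
# should be served from the cache instead of hitting the provider again.
# Requires COHERE_API_KEY for the first (uncached) call.
import litellm
from litellm.caching import Cache

# Assumption: "rerank"/"arerank" are accepted in supported_call_types after this change.
litellm.cache = Cache(supported_call_types=["rerank", "arerank"])

kwargs = dict(
    model="cohere/rerank-english-v3.0",
    query="What is the capital of the United States?",
    documents=[
        "Carson City is the capital of the state of Nevada.",
        "Washington, D.C. is the capital of the United States.",
    ],
    top_n=1,
)

first = litellm.rerank(**kwargs)   # hits the provider, response is written to the cache
second = litellm.rerank(**kwargs)  # identical call, rebuilt from the cached dict

# On a cache hit, this change attaches the cache key to the hidden params (when present).
print(getattr(second, "_hidden_params", {}).get("cache_key"))
```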

* feat(litellm_logging.py): add response headers for failed requests

Closes https://github.com/BerriAI/litellm/issues/6159

---------

Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: Willy Douhard <willy.douhard@gmail.com>
Co-authored-by: yujonglee <yujonglee.dev@gmail.com>
Co-authored-by: Ali Waleed <ali@scale3labs.com>
commit 2acb0c0675 (parent 2cb65b450d)
Krish Dholakia, 2024-10-12 11:48:34 -07:00, committed by GitHub
18 changed files with 533 additions and 82 deletions


@@ -60,7 +60,6 @@ from litellm.caching import DualCache
 from litellm.integrations.custom_logger import CustomLogger
 from litellm.litellm_core_utils.core_helpers import map_finish_reason
 from litellm.litellm_core_utils.exception_mapping_utils import (
-_get_litellm_response_headers,
 _get_response_headers,
 exception_type,
 get_error_message,
@@ -82,6 +81,7 @@ from litellm.types.llms.openai import (
 ChatCompletionToolParam,
 ChatCompletionToolParamFunctionChunk,
 )
+from litellm.types.rerank import RerankResponse
 from litellm.types.utils import FileTypes # type: ignore
 from litellm.types.utils import (
 OPENAI_RESPONSE_HEADERS,
@@ -720,6 +720,7 @@ def client(original_function):
 or kwargs.get("atext_completion", False) is True
 or kwargs.get("atranscription", False) is True
 or kwargs.get("arerank", False) is True
+or kwargs.get("_arealtime", False) is True
 ):
 # [OPTIONAL] CHECK MAX RETRIES / REQUEST
 if litellm.num_retries_per_request is not None:
@@ -819,6 +820,8 @@
 and kwargs.get("acompletion", False) is not True
 and kwargs.get("aimg_generation", False) is not True
 and kwargs.get("atranscription", False) is not True
+and kwargs.get("arerank", False) is not True
+and kwargs.get("_arealtime", False) is not True
 ): # allow users to control returning cached responses from the completion function
 # checking cache
 print_verbose("INSIDE CHECKING CACHE")
@@ -835,7 +838,6 @@
 )
 cached_result = litellm.cache.get_cache(*args, **kwargs)
 if cached_result is not None:
-print_verbose("Cache Hit!")
 if "detail" in cached_result:
 # implies an error occurred
 pass
@@ -867,7 +869,13 @@
 response_object=cached_result,
 response_type="embedding",
 )
+elif call_type == CallTypes.rerank.value and isinstance(
+cached_result, dict
+):
+cached_result = convert_to_model_response_object(
+response_object=cached_result,
+response_type="rerank",
+)
 # LOG SUCCESS
 cache_hit = True
 end_time = datetime.datetime.now()
@@ -916,6 +924,12 @@
 target=logging_obj.success_handler,
 args=(cached_result, start_time, end_time, cache_hit),
 ).start()
+cache_key = kwargs.get("preset_cache_key", None)
+if (
+isinstance(cached_result, BaseModel)
+or isinstance(cached_result, CustomStreamWrapper)
+) and hasattr(cached_result, "_hidden_params"):
+cached_result._hidden_params["cache_key"] = cache_key # type: ignore
 return cached_result
 else:
 print_verbose(
@@ -991,8 +1005,7 @@
 if (
 litellm.cache is not None
 and litellm.cache.supported_call_types is not None
-and str(original_function.__name__)
-in litellm.cache.supported_call_types
+and call_type in litellm.cache.supported_call_types
 ) and (kwargs.get("cache", {}).get("no-store", False) is not True):
 litellm.cache.add_cache(result, *args, **kwargs)
@@ -1257,6 +1270,14 @@
 model_response_object=EmbeddingResponse(),
 response_type="embedding",
 )
+elif call_type == CallTypes.arerank.value and isinstance(
+cached_result, dict
+):
+cached_result = convert_to_model_response_object(
+response_object=cached_result,
+model_response_object=None,
+response_type="rerank",
+)
 elif call_type == CallTypes.atranscription.value and isinstance(
 cached_result, dict
 ):
@@ -1460,6 +1481,7 @@
 isinstance(result, litellm.ModelResponse)
 or isinstance(result, litellm.EmbeddingResponse)
 or isinstance(result, TranscriptionResponse)
+or isinstance(result, RerankResponse)
 ):
 if (
 isinstance(result, EmbeddingResponse)
@@ -5880,10 +5902,16 @@ def convert_to_streaming_response(response_object: Optional[dict] = None):
 def convert_to_model_response_object(
 response_object: Optional[dict] = None,
 model_response_object: Optional[
-Union[ModelResponse, EmbeddingResponse, ImageResponse, TranscriptionResponse]
+Union[
+ModelResponse,
+EmbeddingResponse,
+ImageResponse,
+TranscriptionResponse,
+RerankResponse,
+]
 ] = None,
 response_type: Literal[
-"completion", "embedding", "image_generation", "audio_transcription"
+"completion", "embedding", "image_generation", "audio_transcription", "rerank"
 ] = "completion",
 stream=False,
 start_time=None,
@@ -6133,6 +6161,27 @@
 if _response_headers is not None:
 model_response_object._response_headers = _response_headers
 return model_response_object
+elif response_type == "rerank" and (
+model_response_object is None
+or isinstance(model_response_object, RerankResponse)
+):
+if response_object is None:
+raise Exception("Error in response object format")
+if model_response_object is None:
+model_response_object = RerankResponse(**response_object)
+return model_response_object
+if "id" in response_object:
+model_response_object.id = response_object["id"]
+if "meta" in response_object:
+model_response_object.meta = response_object["meta"]
+if "results" in response_object:
+model_response_object.results = response_object["results"]
+return model_response_object
 except Exception:
 raise Exception(