(Feat) - return x-litellm-attempted-fallbacks in responses from litellm proxy (#8558)

* add_fallback_headers_to_response
* test x-litellm-attempted-fallbacks
* unit test attempted fallbacks
* fix add_fallback_headers_to_response
* docs document response headers
* fix file name

This commit is contained in:
parent a9276f27f9
commit 6b3bfa2b42

9 changed files with 200 additions and 117 deletions
@@ -1,17 +1,20 @@
# Rate Limit Headers
# Response Headers

When you make a request to the proxy, the proxy will return the following [OpenAI-compatible headers](https://platform.openai.com/docs/guides/rate-limits/rate-limits-in-headers):
When you make a request to the proxy, the proxy will return the following headers:

- `x-ratelimit-remaining-requests` - Optional[int]: The remaining number of requests that are permitted before exhausting the rate limit.
- `x-ratelimit-remaining-tokens` - Optional[int]: The remaining number of tokens that are permitted before exhausting the rate limit.
- `x-ratelimit-limit-requests` - Optional[int]: The maximum number of requests that are permitted before exhausting the rate limit.
- `x-ratelimit-limit-tokens` - Optional[int]: The maximum number of tokens that are permitted before exhausting the rate limit.
- `x-ratelimit-reset-requests` - Optional[int]: The time at which the rate limit will reset.
- `x-ratelimit-reset-tokens` - Optional[int]: The time at which the rate limit will reset.

## Rate Limit Headers

[OpenAI-compatible headers](https://platform.openai.com/docs/guides/rate-limits/rate-limits-in-headers):

These headers are useful for clients to understand the current rate limit status and adjust their request rate accordingly.

| Header | Type | Description |
|--------|------|-------------|
| `x-ratelimit-remaining-requests` | Optional[int] | The remaining number of requests that are permitted before exhausting the rate limit |
| `x-ratelimit-remaining-tokens` | Optional[int] | The remaining number of tokens that are permitted before exhausting the rate limit |
| `x-ratelimit-limit-requests` | Optional[int] | The maximum number of requests that are permitted before exhausting the rate limit |
| `x-ratelimit-limit-tokens` | Optional[int] | The maximum number of tokens that are permitted before exhausting the rate limit |
| `x-ratelimit-reset-requests` | Optional[int] | The time at which the rate limit will reset |
| `x-ratelimit-reset-tokens` | Optional[int] | The time at which the rate limit will reset |

## How are these headers calculated?
### How Rate Limit Headers work

**If key has rate limits set**

@@ -19,6 +22,50 @@ The proxy will return the [remaining rate limits for that key](https://github.co

**If key does not have rate limits set**

The proxy returns the remaining requests/tokens returned by the backend provider.
The proxy returns the remaining requests/tokens returned by the backend provider. (LiteLLM will standardize the backend provider's response headers to match the OpenAI format)

If the backend provider does not return these headers, the value will be `None`.

These headers are useful for clients to understand the current rate limit status and adjust their request rate accordingly.
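For example, a client can read these values from the raw HTTP response. Below is a minimal, hedged sketch using the OpenAI Python SDK's `with_raw_response` helper; the proxy URL (`http://localhost:4000`), virtual key (`sk-1234`), and model alias are placeholder assumptions, not values from this diff.

```python
# Sketch: reading rate limit headers returned by a LiteLLM proxy.
# Assumes a proxy at http://localhost:4000 with key "sk-1234" and a
# "gpt-3.5-turbo" model alias configured -- adjust to your deployment.
import openai

client = openai.OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

# These may be None/absent if neither the key nor the backend exposes limits.
print(raw.headers.get("x-ratelimit-remaining-requests"))
print(raw.headers.get("x-ratelimit-remaining-tokens"))

completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
```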
## Latency Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-response-duration-ms` | float | Total duration of the API response in milliseconds |
| `x-litellm-overhead-duration-ms` | float | LiteLLM processing overhead in milliseconds |

## Retry, Fallback Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-attempted-retries` | int | Number of retry attempts made |
| `x-litellm-attempted-fallbacks` | int | Number of fallback attempts made |
| `x-litellm-max-fallbacks` | int | Maximum number of fallback attempts allowed |
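To see how a client might surface these values, here is a small hedged sketch that calls the proxy directly with `requests`; the URL, key, and model group are illustrative assumptions only.

```python
# Sketch: inspecting retry/fallback headers on a raw proxy response.
# The URL, key, and model group below are placeholders for illustration.
import requests

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-1234"},
    json={
        "model": "badly-configured-openai-endpoint",
        "messages": [{"role": "user", "content": "Hey, how's it going?"}],
    },
    timeout=60,
)

# "0" means the primary deployment answered; "1" or more means fallbacks ran.
print(resp.headers.get("x-litellm-attempted-fallbacks"))
print(resp.headers.get("x-litellm-attempted-retries"))
```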
## Cost Tracking Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-response-cost` | float | Cost of the API call |
| `x-litellm-key-spend` | float | Total spend for the API key |

## LiteLLM Specific Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-call-id` | string | Unique identifier for the API call |
| `x-litellm-model-id` | string | Unique identifier for the model used |
| `x-litellm-model-api-base` | string | Base URL of the API endpoint |
| `x-litellm-version` | string | Version of LiteLLM being used |
| `x-litellm-model-group` | string | Model group identifier |

## Response headers from LLM providers

LiteLLM also returns the original response headers from the LLM provider. These headers are prefixed with `llm_provider-` to distinguish them from LiteLLM's headers.

Example response headers:
```
llm_provider-openai-processing-ms: 256
llm_provider-openai-version: 2020-10-01
llm_provider-x-ratelimit-limit-requests: 30000
llm_provider-x-ratelimit-limit-tokens: 150000000
```
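As a rough sketch (same placeholder proxy URL and key as in the earlier examples), a client could collect just these pass-through headers by filtering on the prefix:

```python
# Sketch: collecting provider pass-through headers (prefix "llm_provider-").
# URL, key, and model are placeholder assumptions.
import requests

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-1234"},
    json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
)

provider_headers = {
    name: value
    for name, value in resp.headers.items()
    if name.lower().startswith("llm_provider-")
}
print(provider_headers)
```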
@@ -65,8 +65,8 @@ const sidebars = {
        items: [
          "proxy/user_keys",
          "proxy/clientside_auth",
          "proxy/response_headers",
          "proxy/request_headers",
          "proxy/response_headers",
        ],
      },
      {
@@ -57,7 +57,10 @@ from litellm.router_strategy.lowest_tpm_rpm import LowestTPMLoggingHandler
from litellm.router_strategy.lowest_tpm_rpm_v2 import LowestTPMLoggingHandler_v2
from litellm.router_strategy.simple_shuffle import simple_shuffle
from litellm.router_strategy.tag_based_routing import get_deployments_for_tag
from litellm.router_utils.add_retry_headers import add_retry_headers_to_response
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
    add_retry_headers_to_response,
)
from litellm.router_utils.batch_utils import (
    _get_router_metadata_variable_name,
    replace_model_in_jsonl,
@@ -2888,6 +2891,10 @@ class Router:
            else:
                response = await self.async_function_with_retries(*args, **kwargs)
                verbose_router_logger.debug(f"Async Response: {response}")
                response = add_fallback_headers_to_response(
                    response=response,
                    attempted_fallbacks=0,
                )
                return response
        except Exception as e:
            verbose_router_logger.debug(f"Traceback{traceback.format_exc()}")
@@ -5,24 +5,13 @@ from pydantic import BaseModel
from litellm.types.utils import HiddenParams


def add_retry_headers_to_response(
    response: Any,
    attempted_retries: int,
    max_retries: Optional[int] = None,
) -> Any:
def _add_headers_to_response(response: Any, headers: dict) -> Any:
    """
    Add retry headers to the request
    Helper function to add headers to a response's hidden params
    """

    if response is None or not isinstance(response, BaseModel):
        return response

    retry_headers = {
        "x-litellm-attempted-retries": attempted_retries,
    }
    if max_retries is not None:
        retry_headers["x-litellm-max-retries"] = max_retries

    hidden_params: Optional[Union[dict, HiddenParams]] = getattr(
        response, "_hidden_params", {}
    )
@@ -33,8 +22,47 @@ def add_retry_headers_to_response(
        hidden_params = hidden_params.model_dump()

    hidden_params.setdefault("additional_headers", {})
    hidden_params["additional_headers"].update(retry_headers)
    hidden_params["additional_headers"].update(headers)

    setattr(response, "_hidden_params", hidden_params)

    return response


def add_retry_headers_to_response(
    response: Any,
    attempted_retries: int,
    max_retries: Optional[int] = None,
) -> Any:
    """
    Add retry headers to the request
    """
    retry_headers = {
        "x-litellm-attempted-retries": attempted_retries,
    }
    if max_retries is not None:
        retry_headers["x-litellm-max-retries"] = max_retries

    return _add_headers_to_response(response, retry_headers)


def add_fallback_headers_to_response(
    response: Any,
    attempted_fallbacks: int,
) -> Any:
    """
    Add fallback headers to the response

    Args:
        response: The response to add the headers to
        attempted_fallbacks: The number of fallbacks attempted

    Returns:
        The response with the headers added

    Note: It's intentional that we don't add max_fallbacks in response headers
    Want to avoid bloat in the response headers for performance.
    """
    fallback_headers = {
        "x-litellm-attempted-fallbacks": attempted_fallbacks,
    }
    return _add_headers_to_response(response, fallback_headers)
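To make the helper's effect concrete, here is a minimal hedged usage sketch; constructing a bare `litellm.ModelResponse()` as the input is illustrative and not taken from this diff.

```python
# Sketch: what add_fallback_headers_to_response does to a response object.
# The ModelResponse below is a stand-in constructed with defaults (assumption).
import litellm
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
)

response = litellm.ModelResponse()
response = add_fallback_headers_to_response(response=response, attempted_fallbacks=2)

# The value lands in _hidden_params["additional_headers"], which the proxy
# later surfaces as the x-litellm-attempted-fallbacks HTTP response header.
print(response._hidden_params["additional_headers"])
# expected: {"x-litellm-attempted-fallbacks": 2}
```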
@@ -4,6 +4,9 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
import litellm
from litellm._logging import verbose_router_logger
from litellm.integrations.custom_logger import CustomLogger
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
)
from litellm.types.router import LiteLLMParamsTypedDict

if TYPE_CHECKING:
@@ -130,12 +133,17 @@ async def run_async_fallback(
            kwargs.setdefault("metadata", {}).update(
                {"model_group": kwargs.get("model", None)}
            ) # update model_group used, if fallbacks are done
            kwargs["fallback_depth"] = fallback_depth + 1
            fallback_depth = fallback_depth + 1
            kwargs["fallback_depth"] = fallback_depth
            kwargs["max_fallbacks"] = max_fallbacks
            response = await litellm_router.async_function_with_fallbacks(
                *args, **kwargs
            )
            verbose_router_logger.info("Successful fallback b/w models.")
            response = add_fallback_headers_to_response(
                response=response,
                attempted_fallbacks=fallback_depth,
            )
            # callback for successfull_fallback_event():
            await log_success_fallback_event(
                original_model_group=original_model_group,
@@ -153,55 +161,6 @@ async def run_async_fallback(
    raise error_from_fallbacks


def run_sync_fallback(
    litellm_router: LitellmRouter,
    *args: Tuple[Any],
    fallback_model_group: List[str],
    original_model_group: str,
    original_exception: Exception,
    **kwargs,
) -> Any:
    """
    Synchronous version of run_async_fallback.
    Loops through all the fallback model groups and calls kwargs["original_function"] with the arguments and keyword arguments provided.

    If the call is successful, returns the response.
    If the call fails, continues to the next fallback model group.
    If all fallback model groups fail, it raises the most recent exception.

    Args:
        litellm_router: The litellm router instance.
        *args: Positional arguments.
        fallback_model_group: List[str] of fallback model groups. example: ["gpt-4", "gpt-3.5-turbo"]
        original_model_group: The original model group. example: "gpt-3.5-turbo"
        original_exception: The original exception.
        **kwargs: Keyword arguments.

    Returns:
        The response from the successful fallback model group.
    Raises:
        The most recent exception if all fallback model groups fail.
    """
    error_from_fallbacks = original_exception
    for mg in fallback_model_group:
        if mg == original_model_group:
            continue
        try:
            # LOGGING
            kwargs = litellm_router.log_retry(kwargs=kwargs, e=original_exception)
            verbose_router_logger.info(f"Falling back to model_group = {mg}")
            kwargs["model"] = mg
            kwargs.setdefault("metadata", {}).update(
                {"model_group": mg}
            ) # update model_group used, if fallbacks are done
            response = litellm_router.function_with_fallbacks(*args, **kwargs)
            verbose_router_logger.info("Successful fallback b/w models.")
            return response
        except Exception as e:
            error_from_fallbacks = e
    raise error_from_fallbacks


async def log_success_fallback_event(
    original_model_group: str, kwargs: dict, original_exception: Exception
):
@@ -135,6 +135,13 @@ model_list:
      api_key: my-fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/
      timeout: 1
  - model_name: badly-configured-openai-endpoint
    litellm_params:
      model: openai/my-fake-model
      api_key: my-fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.appxxxx/


litellm_settings:
  # set_verbose: True # Uncomment this if you want to see verbose logs; not recommended in production
  drop_params: True
@@ -25,7 +25,6 @@ sys.path.insert(0, os.path.abspath("../.."))

from litellm.router_utils.fallback_event_handlers import (
    run_async_fallback,
    run_sync_fallback,
    log_success_fallback_event,
    log_failure_fallback_event,
)
@@ -109,44 +108,6 @@ async def test_run_async_fallback(original_function):
        assert isinstance(result, litellm.EmbeddingResponse)


@pytest.mark.parametrize("original_function", [router._completion, router._embedding])
def test_run_sync_fallback(original_function):
    litellm.set_verbose = True
    fallback_model_group = ["gpt-4"]
    original_model_group = "gpt-3.5-turbo"
    original_exception = litellm.exceptions.InternalServerError(
        message="Simulated error",
        llm_provider="openai",
        model="gpt-3.5-turbo",
    )

    request_kwargs = {
        "mock_response": "hello this is a test for run_async_fallback",
        "metadata": {"previous_models": ["gpt-3.5-turbo"]},
    }

    if original_function == router._embedding:
        request_kwargs["input"] = "hello this is a test for run_async_fallback"
    elif original_function == router._completion:
        request_kwargs["messages"] = [{"role": "user", "content": "Hello, world!"}]
    result = run_sync_fallback(
        router,
        original_function=original_function,
        num_retries=1,
        fallback_model_group=fallback_model_group,
        original_model_group=original_model_group,
        original_exception=original_exception,
        **request_kwargs
    )

    assert result is not None

    if original_function == router._completion:
        assert isinstance(result, litellm.ModelResponse)
    elif original_function == router._embedding:
        assert isinstance(result, litellm.EmbeddingResponse)


class CustomTestLogger(CustomLogger):
    def __init__(self):
        super().__init__()
@@ -1604,3 +1604,54 @@ def test_fallbacks_with_different_messages():
    )

    print(resp)


@pytest.mark.parametrize("expected_attempted_fallbacks", [0, 1, 3])
@pytest.mark.asyncio
async def test_router_attempted_fallbacks_in_response(expected_attempted_fallbacks):
    """
    Test that the router returns the correct number of attempted fallbacks in the response

    - Test cases: works on first try, `x-litellm-attempted-fallbacks` is 0
    - Works on 1st fallback, `x-litellm-attempted-fallbacks` is 1
    - Works on 3rd fallback, `x-litellm-attempted-fallbacks` is 3
    """
    router = Router(
        model_list=[
            {
                "model_name": "working-fake-endpoint",
                "litellm_params": {
                    "model": "openai/working-fake-endpoint",
                    "api_key": "my-fake-key",
                    "api_base": "https://exampleopenaiendpoint-production.up.railway.app",
                },
            },
            {
                "model_name": "badly-configured-openai-endpoint",
                "litellm_params": {
                    "model": "openai/my-fake-model",
                    "api_base": "https://exampleopenaiendpoint-production.up.railway.appzzzzz",
                },
            },
        ],
        fallbacks=[{"badly-configured-openai-endpoint": ["working-fake-endpoint"]}],
    )

    if expected_attempted_fallbacks == 0:
        resp = router.completion(
            model="working-fake-endpoint",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
        )
        assert (
            resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"]
            == expected_attempted_fallbacks
        )
    elif expected_attempted_fallbacks == 1:
        resp = router.completion(
            model="badly-configured-openai-endpoint",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
        )
        assert (
            resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"]
            == expected_attempted_fallbacks
        )
@@ -156,6 +156,29 @@ async def test_chat_completion_with_retries():
        assert headers["x-litellm-max-retries"] == "50"


@pytest.mark.asyncio
async def test_chat_completion_with_fallbacks():
    """
    make chat completion call with prompt > context window. expect it to work with fallback
    """
    async with aiohttp.ClientSession() as session:
        model = "badly-configured-openai-endpoint"
        messages = [
            {"role": "system", "content": text},
            {"role": "user", "content": "Who was Alexander?"},
        ]
        response, headers = await chat_completion(
            session=session,
            key="sk-1234",
            model=model,
            messages=messages,
            fallbacks=["fake-openai-endpoint-5"],
            return_headers=True,
        )
        print(f"headers: {headers}")
        assert headers["x-litellm-attempted-fallbacks"] == "1"


@pytest.mark.asyncio
async def test_chat_completion_with_timeout():
    """