(Feat) - return x-litellm-attempted-fallbacks in responses from litellm proxy (#8558)

* add_fallback_headers_to_response
* test x-litellm-attempted-fallbacks
* unit test attempted fallbacks
* fix add_fallback_headers_to_response
* docs document response headers
* fix file name

This commit is contained in:
parent a9276f27f9
commit 6b3bfa2b42

9 changed files with 200 additions and 117 deletions
@@ -1,17 +1,20 @@
# Rate Limit Headers
# Response Headers

When you make a request to the proxy, the proxy will return the following [OpenAI-compatible headers](https://platform.openai.com/docs/guides/rate-limits/rate-limits-in-headers):
When you make a request to the proxy, the proxy will return the following headers:

- `x-ratelimit-remaining-requests` - Optional[int]: The remaining number of requests that are permitted before exhausting the rate limit.
- `x-ratelimit-remaining-tokens` - Optional[int]: The remaining number of tokens that are permitted before exhausting the rate limit.
- `x-ratelimit-limit-requests` - Optional[int]: The maximum number of requests that are permitted before exhausting the rate limit.
- `x-ratelimit-limit-tokens` - Optional[int]: The maximum number of tokens that are permitted before exhausting the rate limit.
- `x-ratelimit-reset-requests` - Optional[int]: The time at which the rate limit will reset.
- `x-ratelimit-reset-tokens` - Optional[int]: The time at which the rate limit will reset.

## Rate Limit Headers

[OpenAI-compatible headers](https://platform.openai.com/docs/guides/rate-limits/rate-limits-in-headers):

These headers are useful for clients to understand the current rate limit status and adjust their request rate accordingly.

| Header | Type | Description |
|--------|------|-------------|
| `x-ratelimit-remaining-requests` | Optional[int] | The remaining number of requests that are permitted before exhausting the rate limit |
| `x-ratelimit-remaining-tokens` | Optional[int] | The remaining number of tokens that are permitted before exhausting the rate limit |
| `x-ratelimit-limit-requests` | Optional[int] | The maximum number of requests that are permitted before exhausting the rate limit |
| `x-ratelimit-limit-tokens` | Optional[int] | The maximum number of tokens that are permitted before exhausting the rate limit |
| `x-ratelimit-reset-requests` | Optional[int] | The time at which the rate limit will reset |
| `x-ratelimit-reset-tokens` | Optional[int] | The time at which the rate limit will reset |

## How are these headers calculated?
### How Rate Limit Headers work

**If key has rate limits set**

@@ -19,6 +22,50 @@ The proxy will return the [remaining rate limits for that key](https://github.co

**If key does not have rate limits set**

The proxy returns the remaining requests/tokens returned by the backend provider.
The proxy returns the remaining requests/tokens returned by the backend provider. (LiteLLM will standardize the backend provider's response headers to match the OpenAI format)

If the backend provider does not return these headers, the value will be `None`.

These headers are useful for clients to understand the current rate limit status and adjust their request rate accordingly.
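For example, a client can read these values from the raw HTTP response. Below is a minimal, hedged sketch using the OpenAI Python SDK's `with_raw_response` helper; the proxy URL (`http://localhost:4000`), virtual key (`sk-1234`), and model alias are placeholder assumptions, not values from this diff.

```python
# Sketch: reading rate limit headers returned by a LiteLLM proxy.
# Assumes a proxy at http://localhost:4000 with key "sk-1234" and a
# "gpt-3.5-turbo" model alias configured -- adjust to your deployment.
import openai

client = openai.OpenAI(base_url="http://localhost:4000", api_key="sk-1234")

raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

# These may be None/absent if neither the key nor the backend exposes limits.
print(raw.headers.get("x-ratelimit-remaining-requests"))
print(raw.headers.get("x-ratelimit-remaining-tokens"))

completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)
```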
## Latency Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-response-duration-ms` | float | Total duration of the API response in milliseconds |
| `x-litellm-overhead-duration-ms` | float | LiteLLM processing overhead in milliseconds |

## Retry, Fallback Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-attempted-retries` | int | Number of retry attempts made |
| `x-litellm-attempted-fallbacks` | int | Number of fallback attempts made |
| `x-litellm-max-fallbacks` | int | Maximum number of fallback attempts allowed |
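To see how a client might surface these values, here is a small hedged sketch that calls the proxy directly with `requests`; the URL, key, and model group are illustrative assumptions only.

```python
# Sketch: inspecting retry/fallback headers on a raw proxy response.
# The URL, key, and model group below are placeholders for illustration.
import requests

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-1234"},
    json={
        "model": "badly-configured-openai-endpoint",
        "messages": [{"role": "user", "content": "Hey, how's it going?"}],
    },
    timeout=60,
)

# "0" means the primary deployment answered; "1" or more means fallbacks ran.
print(resp.headers.get("x-litellm-attempted-fallbacks"))
print(resp.headers.get("x-litellm-attempted-retries"))
```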
## Cost Tracking Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-response-cost` | float | Cost of the API call |
| `x-litellm-key-spend` | float | Total spend for the API key |

## LiteLLM Specific Headers

| Header | Type | Description |
|--------|------|-------------|
| `x-litellm-call-id` | string | Unique identifier for the API call |
| `x-litellm-model-id` | string | Unique identifier for the model used |
| `x-litellm-model-api-base` | string | Base URL of the API endpoint |
| `x-litellm-version` | string | Version of LiteLLM being used |
| `x-litellm-model-group` | string | Model group identifier |

## Response headers from LLM providers

LiteLLM also returns the original response headers from the LLM provider. These headers are prefixed with `llm_provider-` to distinguish them from LiteLLM's headers.

Example response headers:
```
llm_provider-openai-processing-ms: 256
llm_provider-openai-version: 2020-10-01
llm_provider-x-ratelimit-limit-requests: 30000
llm_provider-x-ratelimit-limit-tokens: 150000000
```
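As a rough sketch (same placeholder proxy URL and key as in the earlier examples), a client could collect just these pass-through headers by filtering on the prefix:

```python
# Sketch: collecting provider pass-through headers (prefix "llm_provider-").
# URL, key, and model are placeholder assumptions.
import requests

resp = requests.post(
    "http://localhost:4000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-1234"},
    json={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
)

provider_headers = {
    name: value
    for name, value in resp.headers.items()
    if name.lower().startswith("llm_provider-")
}
print(provider_headers)
```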
@@ -65,8 +65,8 @@ const sidebars = {
        items: [
          "proxy/user_keys",
          "proxy/clientside_auth",
          "proxy/response_headers",
          "proxy/request_headers",
          "proxy/response_headers",
        ],
      },
      {
@@ -57,7 +57,10 @@ from litellm.router_strategy.lowest_tpm_rpm import LowestTPMLoggingHandler
from litellm.router_strategy.lowest_tpm_rpm_v2 import LowestTPMLoggingHandler_v2
from litellm.router_strategy.simple_shuffle import simple_shuffle
from litellm.router_strategy.tag_based_routing import get_deployments_for_tag
from litellm.router_utils.add_retry_headers import add_retry_headers_to_response
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
    add_retry_headers_to_response,
)
from litellm.router_utils.batch_utils import (
    _get_router_metadata_variable_name,
    replace_model_in_jsonl,
@@ -2888,6 +2891,10 @@ class Router:
            else:
                response = await self.async_function_with_retries(*args, **kwargs)
                verbose_router_logger.debug(f"Async Response: {response}")
                response = add_fallback_headers_to_response(
                    response=response,
                    attempted_fallbacks=0,
                )
                return response
        except Exception as e:
            verbose_router_logger.debug(f"Traceback{traceback.format_exc()}")
@@ -5,24 +5,13 @@ from pydantic import BaseModel
from litellm.types.utils import HiddenParams


def add_retry_headers_to_response(
    response: Any,
    attempted_retries: int,
    max_retries: Optional[int] = None,
) -> Any:
def _add_headers_to_response(response: Any, headers: dict) -> Any:
    """
    Add retry headers to the request
    Helper function to add headers to a response's hidden params
    """

    if response is None or not isinstance(response, BaseModel):
        return response

    retry_headers = {
        "x-litellm-attempted-retries": attempted_retries,
    }
    if max_retries is not None:
        retry_headers["x-litellm-max-retries"] = max_retries

    hidden_params: Optional[Union[dict, HiddenParams]] = getattr(
        response, "_hidden_params", {}
    )
@@ -33,8 +22,47 @@ def add_retry_headers_to_response(
        hidden_params = hidden_params.model_dump()

    hidden_params.setdefault("additional_headers", {})
    hidden_params["additional_headers"].update(retry_headers)
    hidden_params["additional_headers"].update(headers)

    setattr(response, "_hidden_params", hidden_params)

    return response


def add_retry_headers_to_response(
    response: Any,
    attempted_retries: int,
    max_retries: Optional[int] = None,
) -> Any:
    """
    Add retry headers to the request
    """
    retry_headers = {
        "x-litellm-attempted-retries": attempted_retries,
    }
    if max_retries is not None:
        retry_headers["x-litellm-max-retries"] = max_retries

    return _add_headers_to_response(response, retry_headers)


def add_fallback_headers_to_response(
    response: Any,
    attempted_fallbacks: int,
) -> Any:
    """
    Add fallback headers to the response

    Args:
        response: The response to add the headers to
        attempted_fallbacks: The number of fallbacks attempted

    Returns:
        The response with the headers added

    Note: It's intentional that we don't add max_fallbacks in response headers
    Want to avoid bloat in the response headers for performance.
    """
    fallback_headers = {
        "x-litellm-attempted-fallbacks": attempted_fallbacks,
    }
    return _add_headers_to_response(response, fallback_headers)
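To make the helper's effect concrete, here is a minimal hedged usage sketch; constructing a bare `litellm.ModelResponse()` as the input is illustrative and not taken from this diff.

```python
# Sketch: what add_fallback_headers_to_response does to a response object.
# The ModelResponse below is a stand-in constructed with defaults (assumption).
import litellm
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
)

response = litellm.ModelResponse()
response = add_fallback_headers_to_response(response=response, attempted_fallbacks=2)

# The value lands in _hidden_params["additional_headers"], which the proxy
# later surfaces as the x-litellm-attempted-fallbacks HTTP response header.
print(response._hidden_params["additional_headers"])
# expected: {"x-litellm-attempted-fallbacks": 2}
```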
@@ -4,6 +4,9 @@ from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union
import litellm
from litellm._logging import verbose_router_logger
from litellm.integrations.custom_logger import CustomLogger
from litellm.router_utils.add_retry_fallback_headers import (
    add_fallback_headers_to_response,
)
from litellm.types.router import LiteLLMParamsTypedDict

if TYPE_CHECKING:
@@ -130,12 +133,17 @@ async def run_async_fallback(
            kwargs.setdefault("metadata", {}).update(
                {"model_group": kwargs.get("model", None)}
            ) # update model_group used, if fallbacks are done
            kwargs["fallback_depth"] = fallback_depth + 1
            fallback_depth = fallback_depth + 1
            kwargs["fallback_depth"] = fallback_depth
            kwargs["max_fallbacks"] = max_fallbacks
            response = await litellm_router.async_function_with_fallbacks(
                *args, **kwargs
            )
            verbose_router_logger.info("Successful fallback b/w models.")
            response = add_fallback_headers_to_response(
                response=response,
                attempted_fallbacks=fallback_depth,
            )
            # callback for successfull_fallback_event():
            await log_success_fallback_event(
                original_model_group=original_model_group,
@@ -153,55 +161,6 @@ async def run_async_fallback(
    raise error_from_fallbacks


def run_sync_fallback(
    litellm_router: LitellmRouter,
    *args: Tuple[Any],
    fallback_model_group: List[str],
    original_model_group: str,
    original_exception: Exception,
    **kwargs,
) -> Any:
    """
    Synchronous version of run_async_fallback.
    Loops through all the fallback model groups and calls kwargs["original_function"] with the arguments and keyword arguments provided.

    If the call is successful, returns the response.
    If the call fails, continues to the next fallback model group.
    If all fallback model groups fail, it raises the most recent exception.

    Args:
        litellm_router: The litellm router instance.
        *args: Positional arguments.
        fallback_model_group: List[str] of fallback model groups. example: ["gpt-4", "gpt-3.5-turbo"]
        original_model_group: The original model group. example: "gpt-3.5-turbo"
        original_exception: The original exception.
        **kwargs: Keyword arguments.

    Returns:
        The response from the successful fallback model group.
    Raises:
        The most recent exception if all fallback model groups fail.
    """
    error_from_fallbacks = original_exception
    for mg in fallback_model_group:
        if mg == original_model_group:
            continue
        try:
            # LOGGING
            kwargs = litellm_router.log_retry(kwargs=kwargs, e=original_exception)
            verbose_router_logger.info(f"Falling back to model_group = {mg}")
            kwargs["model"] = mg
            kwargs.setdefault("metadata", {}).update(
                {"model_group": mg}
            ) # update model_group used, if fallbacks are done
            response = litellm_router.function_with_fallbacks(*args, **kwargs)
            verbose_router_logger.info("Successful fallback b/w models.")
            return response
        except Exception as e:
            error_from_fallbacks = e
    raise error_from_fallbacks


async def log_success_fallback_event(
    original_model_group: str, kwargs: dict, original_exception: Exception
):
@@ -135,6 +135,13 @@ model_list:
      api_key: my-fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/
      timeout: 1
  - model_name: badly-configured-openai-endpoint
    litellm_params:
      model: openai/my-fake-model
      api_key: my-fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.appxxxx/


litellm_settings:
  # set_verbose: True # Uncomment this if you want to see verbose logs; not recommended in production
  drop_params: True
@@ -25,7 +25,6 @@ sys.path.insert(0, os.path.abspath("../.."))

from litellm.router_utils.fallback_event_handlers import (
    run_async_fallback,
    run_sync_fallback,
    log_success_fallback_event,
    log_failure_fallback_event,
)
@@ -109,44 +108,6 @@ async def test_run_async_fallback(original_function):
        assert isinstance(result, litellm.EmbeddingResponse)


@pytest.mark.parametrize("original_function", [router._completion, router._embedding])
def test_run_sync_fallback(original_function):
    litellm.set_verbose = True
    fallback_model_group = ["gpt-4"]
    original_model_group = "gpt-3.5-turbo"
    original_exception = litellm.exceptions.InternalServerError(
        message="Simulated error",
        llm_provider="openai",
        model="gpt-3.5-turbo",
    )

    request_kwargs = {
        "mock_response": "hello this is a test for run_async_fallback",
        "metadata": {"previous_models": ["gpt-3.5-turbo"]},
    }

    if original_function == router._embedding:
        request_kwargs["input"] = "hello this is a test for run_async_fallback"
    elif original_function == router._completion:
        request_kwargs["messages"] = [{"role": "user", "content": "Hello, world!"}]
    result = run_sync_fallback(
        router,
        original_function=original_function,
        num_retries=1,
        fallback_model_group=fallback_model_group,
        original_model_group=original_model_group,
        original_exception=original_exception,
        **request_kwargs
    )

    assert result is not None

    if original_function == router._completion:
        assert isinstance(result, litellm.ModelResponse)
    elif original_function == router._embedding:
        assert isinstance(result, litellm.EmbeddingResponse)


class CustomTestLogger(CustomLogger):
    def __init__(self):
        super().__init__()
@@ -1604,3 +1604,54 @@ def test_fallbacks_with_different_messages():
    )

    print(resp)


@pytest.mark.parametrize("expected_attempted_fallbacks", [0, 1, 3])
@pytest.mark.asyncio
async def test_router_attempted_fallbacks_in_response(expected_attempted_fallbacks):
    """
    Test that the router returns the correct number of attempted fallbacks in the response

    - Test cases: works on first try, `x-litellm-attempted-fallbacks` is 0
    - Works on 1st fallback, `x-litellm-attempted-fallbacks` is 1
    - Works on 3rd fallback, `x-litellm-attempted-fallbacks` is 3
    """
    router = Router(
        model_list=[
            {
                "model_name": "working-fake-endpoint",
                "litellm_params": {
                    "model": "openai/working-fake-endpoint",
                    "api_key": "my-fake-key",
                    "api_base": "https://exampleopenaiendpoint-production.up.railway.app",
                },
            },
            {
                "model_name": "badly-configured-openai-endpoint",
                "litellm_params": {
                    "model": "openai/my-fake-model",
                    "api_base": "https://exampleopenaiendpoint-production.up.railway.appzzzzz",
                },
            },
        ],
        fallbacks=[{"badly-configured-openai-endpoint": ["working-fake-endpoint"]}],
    )

    if expected_attempted_fallbacks == 0:
        resp = router.completion(
            model="working-fake-endpoint",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
        )
        assert (
            resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"]
            == expected_attempted_fallbacks
        )
    elif expected_attempted_fallbacks == 1:
        resp = router.completion(
            model="badly-configured-openai-endpoint",
            messages=[{"role": "user", "content": "Hey, how's it going?"}],
        )
        assert (
            resp._hidden_params["additional_headers"]["x-litellm-attempted-fallbacks"]
            == expected_attempted_fallbacks
        )
@@ -156,6 +156,29 @@ async def test_chat_completion_with_retries():
        assert headers["x-litellm-max-retries"] == "50"


@pytest.mark.asyncio
async def test_chat_completion_with_fallbacks():
    """
    make chat completion call with prompt > context window. expect it to work with fallback
    """
    async with aiohttp.ClientSession() as session:
        model = "badly-configured-openai-endpoint"
        messages = [
            {"role": "system", "content": text},
            {"role": "user", "content": "Who was Alexander?"},
        ]
        response, headers = await chat_completion(
            session=session,
            key="sk-1234",
            model=model,
            messages=messages,
            fallbacks=["fake-openai-endpoint-5"],
            return_headers=True,
        )
        print(f"headers: {headers}")
        assert headers["x-litellm-attempted-fallbacks"] == "1"


@pytest.mark.asyncio
async def test_chat_completion_with_timeout():
    """