LiteLLM Minor Fixes & Improvements (11/04/2024) (#6572)

* feat: initial commit for watsonx chat endpoint support

Closes https://github.com/BerriAI/litellm/issues/6562

* feat(watsonx/chat/handler.py): support tool calling for watsonx

Closes https://github.com/BerriAI/litellm/issues/6562
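As a usage sketch (not taken from the PR itself), tool calling through the new watsonx chat route looks like any other OpenAI-style tools call in litellm; the tool definition and model id below are placeholders:

```python
import litellm

# Hypothetical tool definition; the watsonx model id is a placeholder, not from the PR.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = litellm.completion(
    model="watsonx/meta-llama/llama-3-8b-instruct",  # placeholder watsonx model id
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```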

* fix(streaming_utils.py): return empty chunk instead of failing if streaming value is invalid dict

ensures streaming works for ibm watsonx
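A minimal sketch of that behavior, assuming a chunk parser along these lines (the function name and chunk shape are illustrative, not the actual streaming_utils.py code):

```python
from typing import Any, Dict


def parse_stream_chunk(raw_chunk: Any) -> Dict[str, Any]:
    """Return a usable delta dict; fall back to an empty chunk instead of raising."""
    empty_chunk: Dict[str, Any] = {"choices": [{"delta": {"content": ""}}]}
    if not isinstance(raw_chunk, dict):
        return empty_chunk
    choices = raw_chunk.get("choices")
    if not isinstance(choices, list) or not choices:
        # e.g. a metadata-only or malformed frame mid-stream (seen with watsonx)
        return empty_chunk
    return raw_chunk
```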

* fix(openai_like/chat/handler.py): ensure asynchttphandler is passed correctly for openai like calls

* fix: ensure exception mapping works well for watsonx calls

* fix(openai_like/chat/handler.py): handle async streaming correctly

* feat(main.py): Make it clear when a user is passing an invalid message

adds validation for user message content

 Closes https://github.com/BerriAI/litellm/issues/6565

* fix: cleanup

* fix(utils.py): loosen validation check, to just make sure content types are valid

make litellm robust to future content updates

* fix: fix linting error

* fix: fix linting errors

* fix(utils.py): make validation check more flexible

* test: handle langfuse list index out of range error

* Litellm dev 11 02 2024 (#6561)

* fix(dual_cache.py): update in-memory check for redis batch get cache

Fixes the latency overhead of async_batch_redis_cache

* fix(service_logger.py): fix race condition causing otel service logging to be overwritten if service_callbacks set

* feat(user_api_key_auth.py): add parent otel component for auth

allows us to isolate how much latency is added by auth checks

* perf(parallel_request_limiter.py): move async_set_cache_pipeline (from max parallel request limiter) out of execution path (background task)

reduces latency by 200ms
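The pattern is roughly: schedule the cache write instead of awaiting it on the request path (a sketch assuming asyncio and an async pipeline-style cache setter; the class and method names are illustrative):

```python
import asyncio
from typing import List, Tuple


class RateLimitCache:
    async def async_set_cache_pipeline(self, items: List[Tuple[str, int]]) -> None:
        await asyncio.sleep(0.2)  # stand-in for the redis pipeline round trip


async def handle_request(cache: RateLimitCache, usage: List[Tuple[str, int]]) -> str:
    # Before: `await cache.async_set_cache_pipeline(usage)` sat on the hot path.
    # After: fire-and-forget; keep a reference in real code so the task isn't GC'd.
    asyncio.create_task(cache.async_set_cache_pipeline(usage))
    return "response returned without waiting on the cache write"
```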

* feat(user_api_key_auth.py): have user api key auth object return user tpm/rpm limits - reduces redis calls in downstream task (parallel_request_limiter)

Reduces latency by 400-800ms

* fix(parallel_request_limiter.py): use batch get cache to reduce user/key/team usage object calls

reduces latency by 50-100ms
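Conceptually: one batched lookup for the user/key/team usage objects instead of three sequential cache calls (a sketch; the cache interface and key formats shown are assumptions):

```python
from typing import Dict, List, Optional


class DualCache:
    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}

    async def async_batch_get_cache(self, keys: List[str]) -> List[Optional[dict]]:
        # one batched lookup (in-memory first, a redis MGET behind it in the real thing)
        return [self._store.get(k) for k in keys]


async def load_usage_objects(cache: DualCache, user_id: str, api_key: str, team_id: str):
    # Before: three separate awaits -> three round trips.
    # After: a single batched call returning all three objects.
    return await cache.async_batch_get_cache(
        [f"user:{user_id}", f"key:{api_key}", f"team:{team_id}"]
    )
```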

* fix: fix linting error

* fix(_service_logger.py): fix import

* fix(user_api_key_auth.py): fix service logging

* fix(dual_cache.py): don't pass 'self'

* fix: fix python3.8 error

* fix: fix init

* bump: version 1.51.4 → 1.51.5

* build(deps): bump cookie and express in /docs/my-website (#6566)

Bumps [cookie](https://github.com/jshttp/cookie) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.

Updates `cookie` from 0.6.0 to 0.7.1
- [Release notes](https://github.com/jshttp/cookie/releases)
- [Commits](https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.1)

Updates `express` from 4.20.0 to 4.21.1
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.20.0...4.21.1)

---
updated-dependencies:
- dependency-name: cookie
  dependency-type: indirect
- dependency-name: express
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* docs(virtual_keys.md): update Dockerfile reference (#6554)

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>

* (proxy fix) - call connect on prisma client when running setup (#6534)

* critical fix - call connect on prisma client when running setup

* fix test_proxy_server_prisma_setup

* fix test_proxy_server_prisma_setup

* Add 3.5 haiku (#6588)

* feat: add claude-3-5-haiku-20241022 entries

* feat: add claude-3-5-haiku-20241022 and vertex_ai/claude-3-5-haiku@20241022 models

* add missing entries, remove vision

* remove image token costs

* Litellm perf improvements 3 (#6573)

* perf: move writing key to cache, to background task

* perf(litellm_pre_call_utils.py): add otel tracing for pre-call utils

adds 200ms on calls with pgdb connected

* fix(litellm_pre_call_utils.py): rename call_type to the actual call used

* perf(proxy_server.py): remove db logic from _get_config_from_file

was causing db calls to occur on every llm request, if team_id was set on key

* fix(auth_checks.py): add check for reducing db calls if user/team id does not exist in db

reduces latency/call by ~100ms
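One way to realize that kind of check is negative caching: remember ids that were already looked up and not found, so repeat requests skip the database entirely (a sketch of the idea only, not the actual auth_checks.py logic):

```python
from typing import Optional, Set


class UserLookup:
    def __init__(self) -> None:
        self._known_missing: Set[str] = set()

    async def _db_fetch(self, user_id: str) -> Optional[dict]:
        ...  # stand-in for the real postgres query

    async def get_user(self, user_id: str) -> Optional[dict]:
        if user_id in self._known_missing:
            return None  # skip the ~100ms db round trip for ids we know are absent
        user = await self._db_fetch(user_id)
        if user is None:
            self._known_missing.add(user_id)
        return user
```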

* fix(proxy_server.py): minor fix for existing_settings not including alerting

* fix(exception_mapping_utils.py): map databricks exception string

* fix(auth_checks.py): fix auth check logic

* test: correctly mark flaky test

* fix(utils.py): handle auth token error for tokenizers.from_pretrained

* build: fix map

* build: fix map

* build: fix json for model map

* fix ImageObject conversion (#6584)

* (fix) litellm.text_completion raises a non-blocking error on simple usage (#6546)

* unit test test_huggingface_text_completion_logprobs

* fix return TextCompletionHandler convert_chat_to_text_completion

* fix hf rest api

* fix test_huggingface_text_completion_logprobs

* fix linting errors

* fix import of LiteLLMResponseObjectHandler

* fix test for LiteLLMResponseObjectHandler

* fix test text completion

* fix: allow using 15 seconds for the premium license check

* testing fix: bedrock has deprecated cohere.command-text-v14

* (feat) add `Predicted Outputs` for OpenAI  (#6594)

* bump openai to openai==1.54.0

* add 'prediction' param

* testing fix: bedrock has deprecated cohere.command-text-v14

* test test_openai_prediction_param.py

* test_openai_prediction_param_with_caching

* doc Predicted Outputs

* doc Predicted Outputs
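A hedged usage sketch of the passthrough (the `prediction` shape follows OpenAI's Predicted Outputs API; the model name and prompt are placeholders):

```python
import litellm

code = "def sum(a, b):\n    return a + b\n"

response = litellm.completion(
    model="gpt-4o-mini",  # placeholder; any model that supports Predicted Outputs
    messages=[
        {"role": "user", "content": "Rename the function to `add` and return the full file."},
        {"role": "user", "content": code},
    ],
    # Hint the expected output so unchanged text doesn't need to be re-generated.
    prediction={"type": "content", "content": code},
)
print(response.choices[0].message.content)
```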

* (fix) Vertex Improve Performance when using `image_url`  (#6593)

* fix transformation vertex

* test test_process_gemini_image

* test_image_completion_request

* testing fix - bedrock has deprecated cohere.command-text-v14

* fix vertex pdf
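For context, an `image_url` request against a Vertex Gemini model through litellm looks roughly like this (the model id and URL are placeholders; the PR's change is in how such URLs are transformed, which this sketch does not reproduce):

```python
import litellm

response = litellm.completion(
    model="vertex_ai/gemini-1.5-flash",  # placeholder Vertex model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # OpenAI-style image part; can point at an https or gs:// object
                {"type": "image_url", "image_url": {"url": "gs://my-bucket/cat.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```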

* bump: version 1.51.5 → 1.52.0

* fix(lowest_tpm_rpm_routing.py): fix parallel rate limit check (#6577)

* fix(lowest_tpm_rpm_routing.py): fix parallel rate limit check

* fix(lowest_tpm_rpm_v2.py): return headers in correct format

* test: update test

* test: remove eol model

* fix(proxy_server.py): fix db config loading logic

* fix(proxy_server.py): fix order of config / db updates, to ensure fields are not overwritten

* test: skip test if required env var is missing

* test: fix test

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com>

* test: mark flaky test

* test: handle anthropic api instability

* test: update test

* test: bump num retries on langfuse tests - their api is quite bad

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com>
Commit 741da7e182 (parent 4ed7f96360)
Authored by Krish Dholakia on 2024-11-06 17:53:46 +05:30; committed by GitHub
24 changed files with 1510 additions and 554 deletions


@@ -69,6 +69,9 @@ from litellm.litellm_core_utils.get_llm_provider_logic import (
_is_non_openai_azure_model,
get_llm_provider,
)
from litellm.litellm_core_utils.get_supported_openai_params import (
get_supported_openai_params,
)
from litellm.litellm_core_utils.llm_request_utils import _ensure_extra_body_is_safe
from litellm.litellm_core_utils.llm_response_utils.convert_dict_to_response import (
LiteLLMResponseObjectHandler,
@@ -962,9 +965,10 @@ def client(original_function): # noqa: PLR0915
result._hidden_params["additional_headers"] = process_response_headers(
result._hidden_params.get("additional_headers") or {}
) # GUARANTEE OPENAI HEADERS IN RESPONSE
result._response_ms = (
end_time - start_time
).total_seconds() * 1000 # return response latency in ms like openai
if result is not None:
result._response_ms = (
end_time - start_time
).total_seconds() * 1000 # return response latency in ms like openai
return result
except Exception as e:
call_type = original_function.__name__
@@ -3622,43 +3626,30 @@ def get_optional_params( # noqa: PLR0915
model=model, custom_llm_provider=custom_llm_provider
)
_check_valid_arg(supported_params=supported_params)
if max_tokens is not None:
optional_params["max_new_tokens"] = max_tokens
if stream:
optional_params["stream"] = stream
if temperature is not None:
optional_params["temperature"] = temperature
if top_p is not None:
optional_params["top_p"] = top_p
if frequency_penalty is not None:
optional_params["repetition_penalty"] = frequency_penalty
if seed is not None:
optional_params["random_seed"] = seed
if stop is not None:
optional_params["stop_sequences"] = stop
# WatsonX-only parameters
extra_body = {}
if "decoding_method" in passed_params:
extra_body["decoding_method"] = passed_params.pop("decoding_method")
if "min_tokens" in passed_params or "min_new_tokens" in passed_params:
extra_body["min_new_tokens"] = passed_params.pop(
"min_tokens", passed_params.pop("min_new_tokens")
)
if "top_k" in passed_params:
extra_body["top_k"] = passed_params.pop("top_k")
if "truncate_input_tokens" in passed_params:
extra_body["truncate_input_tokens"] = passed_params.pop(
"truncate_input_tokens"
)
if "length_penalty" in passed_params:
extra_body["length_penalty"] = passed_params.pop("length_penalty")
if "time_limit" in passed_params:
extra_body["time_limit"] = passed_params.pop("time_limit")
if "return_options" in passed_params:
extra_body["return_options"] = passed_params.pop("return_options")
optional_params["extra_body"] = (
extra_body # openai client supports `extra_body` param
optional_params = litellm.IBMWatsonXChatConfig().map_openai_params(
non_default_params=non_default_params,
optional_params=optional_params,
model=model,
drop_params=(
drop_params
if drop_params is not None and isinstance(drop_params, bool)
else False
),
)
# WatsonX-text param check
for param in passed_params.keys():
if litellm.IBMWatsonXAIConfig().is_watsonx_text_param(param):
raise ValueError(
f"LiteLLM now defaults to Watsonx's `/text/chat` endpoint. Please use the `watsonx_text` provider instead, to call the `/text/generation` endpoint. Param: {param}"
)
elif custom_llm_provider == "watsonx_text":
supported_params = get_supported_openai_params(
model=model, custom_llm_provider=custom_llm_provider
)
_check_valid_arg(supported_params=supported_params)
optional_params = litellm.IBMWatsonXAIConfig().map_openai_params(
non_default_params=non_default_params,
optional_params=optional_params,
)
elif custom_llm_provider == "openai":
supported_params = get_supported_openai_params(
@@ -4160,290 +4151,6 @@ def get_first_chars_messages(kwargs: dict) -> str:
return ""
def get_supported_openai_params( # noqa: PLR0915
model: str,
custom_llm_provider: Optional[str] = None,
request_type: Literal["chat_completion", "embeddings"] = "chat_completion",
) -> Optional[list]:
"""
Returns the supported openai params for a given model + provider
Example:
```
get_supported_openai_params(model="anthropic.claude-3", custom_llm_provider="bedrock")
```
Returns:
- List if custom_llm_provider is mapped
- None if unmapped
"""
if not custom_llm_provider:
try:
custom_llm_provider = litellm.get_llm_provider(model=model)[1]
except BadRequestError:
return None
if custom_llm_provider == "bedrock":
return litellm.AmazonConverseConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "ollama":
return litellm.OllamaConfig().get_supported_openai_params()
elif custom_llm_provider == "ollama_chat":
return litellm.OllamaChatConfig().get_supported_openai_params()
elif custom_llm_provider == "anthropic":
return litellm.AnthropicConfig().get_supported_openai_params()
elif custom_llm_provider == "fireworks_ai":
if request_type == "embeddings":
return litellm.FireworksAIEmbeddingConfig().get_supported_openai_params(
model=model
)
else:
return litellm.FireworksAIConfig().get_supported_openai_params()
elif custom_llm_provider == "nvidia_nim":
if request_type == "chat_completion":
return litellm.nvidiaNimConfig.get_supported_openai_params(model=model)
elif request_type == "embeddings":
return litellm.nvidiaNimEmbeddingConfig.get_supported_openai_params()
elif custom_llm_provider == "cerebras":
return litellm.CerebrasConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "xai":
return litellm.XAIChatConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "ai21_chat":
return litellm.AI21ChatConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "volcengine":
return litellm.VolcEngineConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "groq":
return litellm.GroqChatConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "hosted_vllm":
return litellm.HostedVLLMChatConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "deepseek":
return [
# https://platform.deepseek.com/api-docs/api/create-chat-completion
"frequency_penalty",
"max_tokens",
"presence_penalty",
"response_format",
"stop",
"stream",
"temperature",
"top_p",
"logprobs",
"top_logprobs",
"tools",
"tool_choice",
]
elif custom_llm_provider == "cohere":
return [
"stream",
"temperature",
"max_tokens",
"logit_bias",
"top_p",
"frequency_penalty",
"presence_penalty",
"stop",
"n",
"extra_headers",
]
elif custom_llm_provider == "cohere_chat":
return [
"stream",
"temperature",
"max_tokens",
"top_p",
"frequency_penalty",
"presence_penalty",
"stop",
"n",
"tools",
"tool_choice",
"seed",
"extra_headers",
]
elif custom_llm_provider == "maritalk":
return [
"stream",
"temperature",
"max_tokens",
"top_p",
"presence_penalty",
"stop",
]
elif custom_llm_provider == "openai":
return litellm.OpenAIConfig().get_supported_openai_params(model=model)
elif custom_llm_provider == "azure":
if litellm.AzureOpenAIO1Config().is_o1_model(model=model):
return litellm.AzureOpenAIO1Config().get_supported_openai_params(
model=model
)
else:
return litellm.AzureOpenAIConfig().get_supported_openai_params()
elif custom_llm_provider == "openrouter":
return [
"temperature",
"top_p",
"frequency_penalty",
"presence_penalty",
"repetition_penalty",
"seed",
"max_tokens",
"logit_bias",
"logprobs",
"top_logprobs",
"response_format",
"stop",
"tools",
"tool_choice",
]
elif custom_llm_provider == "mistral" or custom_llm_provider == "codestral":
# mistal and codestral api have the exact same params
if request_type == "chat_completion":
return litellm.MistralConfig().get_supported_openai_params()
elif request_type == "embeddings":
return litellm.MistralEmbeddingConfig().get_supported_openai_params()
elif custom_llm_provider == "text-completion-codestral":
return litellm.MistralTextCompletionConfig().get_supported_openai_params()
elif custom_llm_provider == "replicate":
return [
"stream",
"temperature",
"max_tokens",
"top_p",
"stop",
"seed",
"tools",
"tool_choice",
"functions",
"function_call",
]
elif custom_llm_provider == "huggingface":
return litellm.HuggingfaceConfig().get_supported_openai_params()
elif custom_llm_provider == "together_ai":
return [
"stream",
"temperature",
"max_tokens",
"top_p",
"stop",
"frequency_penalty",
"tools",
"tool_choice",
"response_format",
]
elif custom_llm_provider == "ai21":
return [
"stream",
"n",
"temperature",
"max_tokens",
"top_p",
"stop",
"frequency_penalty",
"presence_penalty",
]
elif custom_llm_provider == "databricks":
if request_type == "chat_completion":
return litellm.DatabricksConfig().get_supported_openai_params()
elif request_type == "embeddings":
return litellm.DatabricksEmbeddingConfig().get_supported_openai_params()
elif custom_llm_provider == "palm" or custom_llm_provider == "gemini":
return litellm.GoogleAIStudioGeminiConfig().get_supported_openai_params()
elif custom_llm_provider == "vertex_ai":
if request_type == "chat_completion":
if model.startswith("meta/"):
return litellm.VertexAILlama3Config().get_supported_openai_params()
if model.startswith("mistral"):
return litellm.MistralConfig().get_supported_openai_params()
if model.startswith("codestral"):
return (
litellm.MistralTextCompletionConfig().get_supported_openai_params()
)
if model.startswith("claude"):
return litellm.VertexAIAnthropicConfig().get_supported_openai_params()
return litellm.VertexAIConfig().get_supported_openai_params()
elif request_type == "embeddings":
return litellm.VertexAITextEmbeddingConfig().get_supported_openai_params()
elif custom_llm_provider == "vertex_ai_beta":
if request_type == "chat_completion":
return litellm.VertexGeminiConfig().get_supported_openai_params()
elif request_type == "embeddings":
return litellm.VertexAITextEmbeddingConfig().get_supported_openai_params()
elif custom_llm_provider == "sagemaker":
return ["stream", "temperature", "max_tokens", "top_p", "stop", "n"]
elif custom_llm_provider == "aleph_alpha":
return [
"max_tokens",
"stream",
"top_p",
"temperature",
"presence_penalty",
"frequency_penalty",
"n",
"stop",
]
elif custom_llm_provider == "cloudflare":
return ["max_tokens", "stream"]
elif custom_llm_provider == "nlp_cloud":
return [
"max_tokens",
"stream",
"temperature",
"top_p",
"presence_penalty",
"frequency_penalty",
"n",
"stop",
]
elif custom_llm_provider == "petals":
return ["max_tokens", "temperature", "top_p", "stream"]
elif custom_llm_provider == "deepinfra":
return litellm.DeepInfraConfig().get_supported_openai_params()
elif custom_llm_provider == "perplexity":
return [
"temperature",
"top_p",
"stream",
"max_tokens",
"presence_penalty",
"frequency_penalty",
]
elif custom_llm_provider == "anyscale":
return [
"temperature",
"top_p",
"stream",
"max_tokens",
"stop",
"frequency_penalty",
"presence_penalty",
]
elif custom_llm_provider == "watsonx":
return litellm.IBMWatsonXAIConfig().get_supported_openai_params()
elif custom_llm_provider == "custom_openai" or "text-completion-openai":
return [
"functions",
"function_call",
"temperature",
"top_p",
"n",
"stream",
"stream_options",
"stop",
"max_tokens",
"presence_penalty",
"frequency_penalty",
"logit_bias",
"user",
"response_format",
"seed",
"tools",
"tool_choice",
"max_retries",
"logprobs",
"top_logprobs",
"extra_headers",
]
return None
def _count_characters(text: str) -> int:
# Remove white spaces and count characters
filtered_text = "".join(char for char in text if not char.isspace())
@@ -8640,3 +8347,47 @@ def add_dummy_tool(custom_llm_provider: str) -> List[ChatCompletionToolParam]:
),
)
]
from litellm.types.llms.openai import (
ChatCompletionAudioObject,
ChatCompletionImageObject,
ChatCompletionTextObject,
ChatCompletionUserMessage,
OpenAIMessageContent,
ValidUserMessageContentTypes,
)
def validate_chat_completion_user_messages(messages: List[AllMessageValues]):
"""
Ensures all user messages are valid OpenAI chat completion messages.
Args:
messages: List of message dictionaries
message_content_type: Type to validate content against
Returns:
List[dict]: The validated messages
Raises:
ValueError: If any message is invalid
"""
for idx, m in enumerate(messages):
try:
if m["role"] == "user":
user_content = m.get("content")
if user_content is not None:
if isinstance(user_content, str):
continue
elif isinstance(user_content, list):
for item in user_content:
if isinstance(item, dict):
if item.get("type") not in ValidUserMessageContentTypes:
raise Exception("invalid content type")
except Exception:
raise Exception(
f"Invalid user message={m} at index {idx}. Please ensure all user messages are valid OpenAI chat completion messages."
)
return messages
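A quick usage sketch of the validator added above (assuming it is importable from litellm.utils, where this diff appears to add it; the messages are illustrative):

```python
from litellm.utils import validate_chat_completion_user_messages

messages = [
    {"role": "user", "content": [{"type": "text", "text": "hi"}]},             # valid content type
    {"role": "user", "content": [{"type": "imag_url", "image_url": "..."}]},   # misspelled type -> rejected
]

try:
    validate_chat_completion_user_messages(messages=messages)
except Exception as err:
    # "Invalid user message=... at index 1. Please ensure all user messages are valid
    #  OpenAI chat completion messages."
    print(err)
```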