Merge branch 'main' into litellm_run_moderation_check_on_embedding

2024-07-18 12:44:30 -07:00 · 2024-07-18 12:44:30 -07:00 · eedacf5193
commit eedacf5193
parent 3dfeee03d0 51b3ef87d3
22 changed files with 591 additions and 59 deletions
--- a/docs/my-website/docs/data_security.md
+++ b/docs/my-website/docs/data_security.md
@ -14,6 +14,14 @@

 For security inquiries, please contact us at support@berri.ai

+## Self-hosted Instances LiteLLM
+
+- ** No data or telemetry is stored on LiteLLM Servers when you self host **
+- For installation and configuration, see: [Self-hosting guided](../docs/proxy/deploy.md)
+- **Telemetry** We run no telemetry when you self host LiteLLM
+
+For security inquiries, please contact us at support@berri.ai
+
 ### Supported data regions for LiteLLM Cloud

 LiteLLM supports the following data regions:
--- a/docs/my-website/docs/observability/helicone_integration.md
+++ b/docs/my-website/docs/observability/helicone_integration.md
@ -72,7 +72,7 @@ Helicone's proxy provides [advanced functionality](https://docs.helicone.ai/gett
 To use Helicone as a proxy for your LLM requests:

 1. Set Helicone as your base URL via: litellm.api_base
-2. Pass in Helicone request headers via: litellm.headers
+2. Pass in Helicone request headers via: litellm.metadata

 Complete Code:

@ -99,7 +99,7 @@ print(response)
 You can add custom metadata and properties to your requests using Helicone headers. Here are some examples:

 ```python
-litellm.headers = {
+litellm.metadata = {
    "Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}",  # Authenticate to send requests to Helicone API
    "Helicone-User-Id": "user-abc",  # Specify the user making the request
    "Helicone-Property-App": "web",  # Custom property to add additional information
@ -127,7 +127,7 @@ litellm.headers = {
 Enable caching and set up rate limiting policies:

 ```python
-litellm.headers = {
+litellm.metadata = {
    "Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}",  # Authenticate to send requests to Helicone API
    "Helicone-Cache-Enabled": "true",  # Enable caching of responses
    "Cache-Control": "max-age=3600",  # Set cache limit to 1 hour
@ -140,7 +140,7 @@ litellm.headers = {
 Track multi-step and agentic LLM interactions using session IDs and paths:

 ```python
-litellm.headers = {
+litellm.metadata = {
    "Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}",  # Authenticate to send requests to Helicone API
    "Helicone-Session-Id": "session-abc-123",  # The session ID you want to track
    "Helicone-Session-Path": "parent-trace/child-trace",  # The path of the session
@ -157,7 +157,7 @@ By using these two headers, you can effectively group and visualize multi-step L
 Set up retry mechanisms and fallback options:

 ```python
-litellm.headers = {
+litellm.metadata = {
    "Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}",  # Authenticate to send requests to Helicone API
    "Helicone-Retry-Enabled": "true",  # Enable retry mechanism
    "helicone-retry-num": "3",  # Set number of retries
--- a/docs/my-website/docs/providers/openai.md
+++ b/docs/my-website/docs/providers/openai.md
@ -163,6 +163,8 @@ os.environ["OPENAI_API_BASE"] = "openaiai-api-base"     # OPTIONAL

 | Model Name            | Function Call                                                   |
 |-----------------------|-----------------------------------------------------------------|
+| gpt-4o-mini  | `response = completion(model="gpt-4o-mini", messages=messages)` |
+| gpt-4o-mini-2024-07-18   | `response = completion(model="gpt-4o-mini-2024-07-18", messages=messages)` |
 | gpt-4o   | `response = completion(model="gpt-4o", messages=messages)` |
 | gpt-4o-2024-05-13   | `response = completion(model="gpt-4o-2024-05-13", messages=messages)` |
 | gpt-4-turbo   | `response = completion(model="gpt-4-turbo", messages=messages)` |
--- a/docs/my-website/docs/proxy/enterprise.md
+++ b/docs/my-website/docs/proxy/enterprise.md
@ -31,6 +31,7 @@ Features:
 - **Guardrails, PII Masking, Content Moderation**
    - ✅ [Content Moderation with LLM Guard, LlamaGuard, Secret Detection, Google Text Moderations](#content-moderation)
    - ✅ [Prompt Injection Detection (with LakeraAI API)](#prompt-injection-detection---lakeraai)
+    - ✅ [Prompt Injection Detection (with Aporio API)](#prompt-injection-detection---aporio-ai)
    - ✅ [Switch LakeraAI on / off per request](guardrails#control-guardrails-onoff-per-request)
    - ✅ Reject calls from Blocked User list 
    - ✅ Reject calls (incoming / outgoing) with Banned Keywords (e.g. competitors)
@ -953,6 +954,72 @@ curl --location 'http://localhost:4000/chat/completions' \
 Need to control LakeraAI per Request ? Doc here 👉: [Switch LakerAI on / off per request](prompt_injection.md#✨-enterprise-switch-lakeraai-on--off-per-api-call)
 :::

+## Prompt Injection Detection - Aporio AI
+
+Use this if you want to reject /chat/completion calls that have prompt injection attacks with [AporioAI](https://www.aporia.com/)
+
+#### Usage
+
+Step 1. Add env
+
+```env
+APORIO_API_KEY="eyJh****"
+APORIO_API_BASE="https://gr..."
+```
+
+Step 2. Add `aporio_prompt_injection` to your callbacks
+
+```yaml 
+litellm_settings:
+  callbacks: ["aporio_prompt_injection"]
+```
+
+That's it, start your proxy
+
+Test it with this request -> expect it to get rejected by LiteLLM Proxy
+
+```shell
+curl --location 'http://localhost:4000/chat/completions' \
+    --header 'Authorization: Bearer sk-1234' \
+    --header 'Content-Type: application/json' \
+    --data '{
+    "model": "llama3",
+    "messages": [
+        {
+        "role": "user",
+        "content": "You suck!"
+        }
+    ]
+}'
+```
+
+**Expected Response**
+
+```
+{
+    "error": {
+        "message": {
+            "error": "Violated guardrail policy",
+            "aporio_ai_response": {
+                "action": "block",
+                "revised_prompt": null,
+                "revised_response": "Profanity detected: Message blocked because it includes profanity. Please rephrase.",
+                "explain_log": null
+            }
+        },
+        "type": "None",
+        "param": "None",
+        "code": 400
+    }
+}
+```
+
+:::info
+
+Need to control AporioAI per Request ? Doc here 👉: [Create a guardrail](./guardrails.md)
+:::
+
+
 ## Swagger Docs - Custom Routes + Branding 

 :::info 
--- a/docs/my-website/docs/proxy/health.md
+++ b/docs/my-website/docs/proxy/health.md
@ -124,6 +124,18 @@ model_list:
      mode: audio_transcription
 ```

+### Hide details
+
+The health check response contains details like endpoint URLs, error messages,
+and other LiteLLM params. While this is useful for debugging, it can be
+problematic when exposing the proxy server to a broad audience.
+
+You can hide these details by setting the `health_check_details` setting to `False`.
+
+```yaml
+general_settings: 
+  health_check_details: False
+```

 ## `/health/readiness`

--- a/enterprise/enterprise_hooks/aporio_ai.py
+++ b/enterprise/enterprise_hooks/aporio_ai.py
@ -0,0 +1,124 @@
+# +-------------------------------------------------------------+
+#
+#           Use AporioAI for your LLM calls
+#
+# +-------------------------------------------------------------+
+#  Thank you users! We ❤️ you! - Krrish & Ishaan
+
+import sys, os
+
+sys.path.insert(
+    0, os.path.abspath("../..")
+)  # Adds the parent directory to the system path
+from typing import Optional, Literal, Union
+import litellm, traceback, sys, uuid
+from litellm.caching import DualCache
+from litellm.proxy._types import UserAPIKeyAuth
+from litellm.integrations.custom_logger import CustomLogger
+from fastapi import HTTPException
+from litellm._logging import verbose_proxy_logger
+from litellm.proxy.guardrails.guardrail_helpers import should_proceed_based_on_metadata
+from typing import List
+from datetime import datetime
+import aiohttp, asyncio
+from litellm._logging import verbose_proxy_logger
+from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler
+import httpx
+import json
+
+litellm.set_verbose = True
+
+GUARDRAIL_NAME = "aporio"
+
+
+class _ENTERPRISE_Aporio(CustomLogger):
+    def __init__(self, api_key: Optional[str] = None, api_base: Optional[str] = None):
+        self.async_handler = AsyncHTTPHandler(
+            timeout=httpx.Timeout(timeout=600.0, connect=5.0)
+        )
+        self.aporio_api_key = api_key or os.environ["APORIO_API_KEY"]
+        self.aporio_api_base = api_base or os.environ["APORIO_API_BASE"]
+
+    #### CALL HOOKS - proxy only ####
+    def transform_messages(self, messages: List[dict]) -> List[dict]:
+        supported_openai_roles = ["system", "user", "assistant"]
+        default_role = "other"  # for unsupported roles - e.g. tool
+        new_messages = []
+        for m in messages:
+            if m.get("role", "") in supported_openai_roles:
+                new_messages.append(m)
+            else:
+                new_messages.append(
+                    {
+                        "role": default_role,
+                        **{key: value for key, value in m.items() if key != "role"},
+                    }
+                )
+
+        return new_messages
+
+    async def async_moderation_hook(  ### 👈 KEY CHANGE ###
+        self,
+        data: dict,
+        user_api_key_dict: UserAPIKeyAuth,
+        call_type: Literal["completion", "embeddings", "image_generation"],
+    ):
+
+        if (
+            await should_proceed_based_on_metadata(
+                data=data,
+                guardrail_name=GUARDRAIL_NAME,
+            )
+            is False
+        ):
+            return
+
+        new_messages: Optional[List[dict]] = None
+        if "messages" in data and isinstance(data["messages"], list):
+            new_messages = self.transform_messages(messages=data["messages"])
+
+        if new_messages is not None:
+            data = {"messages": new_messages, "validation_target": "prompt"}
+
+            _json_data = json.dumps(data)
+
+            """
+            export APORIO_API_KEY=<your key>
+            curl https://gr-prd-trial.aporia.com/some-id \
+                -X POST \
+                -H "X-APORIA-API-KEY: $APORIO_API_KEY" \
+                -H "Content-Type: application/json" \
+                -d '{
+                    "messages": [
+                        {
+                        "role": "user",
+                        "content": "This is a test prompt"
+                        }
+                    ],
+                    }
+'
+            """
+
+            response = await self.async_handler.post(
+                url=self.aporio_api_base + "/validate",
+                data=_json_data,
+                headers={
+                    "X-APORIA-API-KEY": self.aporio_api_key,
+                    "Content-Type": "application/json",
+                },
+            )
+            verbose_proxy_logger.debug("Aporio AI response: %s", response.text)
+            if response.status_code == 200:
+                # check if the response was flagged
+                _json_response = response.json()
+                action: str = _json_response.get(
+                    "action"
+                )  # possible values are modify, passthrough, block, rephrase
+                if action == "block":
+                    raise HTTPException(
+                        status_code=400,
+                        detail={
+                            "error": "Violated guardrail policy",
+                            "aporio_ai_response": _json_response,
+                        },
+                    )
--- a/enterprise/enterprise_hooks/lakera_ai.py
+++ b/enterprise/enterprise_hooks/lakera_ai.py
@ -10,26 +10,31 @@ import sys, os
 sys.path.insert(
    0, os.path.abspath("../..")
 )  # Adds the parent directory to the system path
-from typing import Optional, Literal, Union
-import litellm, traceback, sys, uuid
-from litellm.caching import DualCache
+from typing import Literal
+import litellm, sys
 from litellm.proxy._types import UserAPIKeyAuth
 from litellm.integrations.custom_logger import CustomLogger
 from fastapi import HTTPException
 from litellm._logging import verbose_proxy_logger
-from litellm.proxy.guardrails.guardrail_helpers import should_proceed_based_on_metadata

-from datetime import datetime
-import aiohttp, asyncio
+from litellm.proxy.guardrails.guardrail_helpers import should_proceed_based_on_metadata
+from litellm.types.guardrails import Role
+
 from litellm._logging import verbose_proxy_logger
 from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler
 import httpx
 import json

+
 litellm.set_verbose = True

 GUARDRAIL_NAME = "lakera_prompt_injection"

+INPUT_POSITIONING_MAP = {
+    Role.SYSTEM.value: 0,
+    Role.USER.value: 1,
+    Role.ASSISTANT.value: 2
+}

 class _ENTERPRISE_lakeraAI_Moderation(CustomLogger):
    def __init__(self):
@ -58,10 +63,42 @@ class _ENTERPRISE_lakeraAI_Moderation(CustomLogger):
            return
        text = ""
        if "messages" in data and isinstance(data["messages"], list):
-            text = ""
-            for m in data["messages"]:  # assume messages is a list
-                if "content" in m and isinstance(m["content"], str):
-                    text += m["content"]
+            enabled_roles = litellm.guardrail_name_config_map["prompt_injection"].enabled_roles
+            lakera_input_dict = {role: None for role in INPUT_POSITIONING_MAP.keys()}
+            system_message = None 
+            tool_call_messages = [] 
+            for message in data["messages"]:
+                role = message.get("role")
+                if role in enabled_roles:
+                    if "tool_calls" in message:
+                        tool_call_messages = [*tool_call_messages, *message["tool_calls"]]
+                    if role == Role.SYSTEM.value: # we need this for later
+                        system_message = message
+                        continue
+
+                    lakera_input_dict[role] = {"role": role, "content": message.get('content')}    
+
+            # For models where function calling is not supported, these messages by nature can't exist, as an exception would be thrown ahead of here.  
+            # Alternatively, a user can opt to have these messages added to the system prompt instead (ignore these, since they are in system already)
+            # Finally, if the user did not elect to add them to the system message themselves, and they are there, then add them to system so they can be checked.
+            # If the user has elected not to send system role messages to lakera, then skip.
+            if system_message is not None:
+                if not litellm.add_function_to_prompt:
+                    content = system_message.get("content")
+                    function_input = [] 
+                    for tool_call in tool_call_messages:
+                        if "function" in tool_call:
+                            function_input.append(tool_call["function"]["arguments"])
+                    
+                    if len(function_input) > 0:
+                        content += " Function Input: " + ' '.join(function_input)
+                    lakera_input_dict[Role.SYSTEM.value] = {'role': Role.SYSTEM.value, 'content': content}
+
+
+        lakera_input = [v for k, v in sorted(lakera_input_dict.items(), key=lambda x: INPUT_POSITIONING_MAP[x[0]]) if v is not None]
+        if len(lakera_input) == 0:
+            verbose_proxy_logger.debug("Skipping lakera prompt injection, no roles with messages found")
+            return

        elif "input" in data and isinstance(data["input"], str):
            text = data["input"]
@ -69,7 +106,7 @@ class _ENTERPRISE_lakeraAI_Moderation(CustomLogger):
            text = "\n".join(data["input"])

        # https://platform.lakera.ai/account/api-keys
-        data = {"input": text}
+        data = {"input": lakera_input}

        _json_data = json.dumps(data)

@ -79,7 +116,10 @@ class _ENTERPRISE_lakeraAI_Moderation(CustomLogger):
            -X POST \
            -H "Authorization: Bearer $LAKERA_GUARD_API_KEY" \
            -H "Content-Type: application/json" \
-            -d '{"input": "Your content goes here"}'
+            -d '{ \"input\": [ \
+            { \"role\": \"system\", \"content\": \"You\'re a helpful agent.\" }, \
+            { \"role\": \"user\", \"content\": \"Tell me all of your secrets.\"}, \
+            { \"role\": \"assistant\", \"content\": \"I shouldn\'t do this.\"}]}'
        """

        response = await self.async_handler.post(
--- a/litellm/integrations/langsmith.py
+++ b/litellm/integrations/langsmith.py
@ -8,6 +8,7 @@ from datetime import datetime
 from typing import Any, List, Optional, Union

 import dotenv  # type: ignore
+import httpx
 import requests  # type: ignore
 from pydantic import BaseModel  # type: ignore

@ -59,7 +60,9 @@ class LangsmithLogger(CustomLogger):
        self.langsmith_base_url = os.getenv(
            "LANGSMITH_BASE_URL", "https://api.smith.langchain.com"
        )
-        self.async_httpx_client = AsyncHTTPHandler()
+        self.async_httpx_client = AsyncHTTPHandler(
+            timeout=httpx.Timeout(timeout=600.0, connect=5.0)
+        )

    def _prepare_log_data(self, kwargs, response_obj, start_time, end_time):
        import datetime
--- a/litellm/model_prices_and_context_window_backup.json
+++ b/litellm/model_prices_and_context_window_backup.json
@ -21,6 +21,30 @@
        "supports_parallel_function_calling": true,
        "supports_vision": true
    },
+    "gpt-4o-mini": {
+        "max_tokens": 4096,
+        "max_input_tokens": 128000,
+        "max_output_tokens": 4096,
+        "input_cost_per_token": 0.00000015,
+        "output_cost_per_token": 0.00000060,
+        "litellm_provider": "openai",
+        "mode": "chat",
+        "supports_function_calling": true,
+        "supports_parallel_function_calling": true,
+        "supports_vision": true
+    },
+    "gpt-4o-mini-2024-07-18": {
+        "max_tokens": 4096,
+        "max_input_tokens": 128000,
+        "max_output_tokens": 4096,
+        "input_cost_per_token": 0.00000015,
+        "output_cost_per_token": 0.00000060,
+        "litellm_provider": "openai",
+        "mode": "chat",
+        "supports_function_calling": true,
+        "supports_parallel_function_calling": true,
+        "supports_vision": true
+    },
    "gpt-4o-2024-05-13": {
        "max_tokens": 4096,
        "max_input_tokens": 128000,
@ -1820,6 +1844,26 @@
        "supports_vision": true,
        "source": "https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#foundation_models"
    },
+    "medlm-medium": {
+        "max_tokens": 8192,
+        "max_input_tokens": 32768,
+        "max_output_tokens": 8192,
+        "input_cost_per_character": 0.0000005,
+        "output_cost_per_character": 0.000001,
+        "litellm_provider": "vertex_ai-language-models",
+        "mode": "chat",
+        "source": "https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#foundation_models"
+    },
+    "medlm-large": {
+        "max_tokens": 1024,
+        "max_input_tokens": 8192,
+        "max_output_tokens": 1024,
+        "input_cost_per_character": 0.000005,
+        "output_cost_per_character": 0.000015,
+        "litellm_provider": "vertex_ai-language-models",
+        "mode": "chat",
+        "source": "https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#foundation_models"
+    },
    "vertex_ai/claude-3-sonnet@20240229": {
        "max_tokens": 4096,
        "max_input_tokens": 200000,
--- a/litellm/proxy/common_utils/init_callbacks.py
+++ b/litellm/proxy/common_utils/init_callbacks.py
@ -112,6 +112,17 @@ def initialize_callbacks_on_proxy(

                lakera_moderations_object = _ENTERPRISE_lakeraAI_Moderation()
                imported_list.append(lakera_moderations_object)
+            elif isinstance(callback, str) and callback == "aporio_prompt_injection":
+                from enterprise.enterprise_hooks.aporio_ai import _ENTERPRISE_Aporio
+
+                if premium_user is not True:
+                    raise Exception(
+                        "Trying to use Aporio AI Guardrail"
+                        + CommonProxyErrors.not_premium_user.value
+                    )
+
+                aporio_guardrail_object = _ENTERPRISE_Aporio()
+                imported_list.append(aporio_guardrail_object)
            elif isinstance(callback, str) and callback == "google_text_moderation":
                from enterprise.enterprise_hooks.google_text_moderation import (
                    _ENTERPRISE_GoogleTextModeration,
--- a/litellm/proxy/guardrails/init_guardrails.py
+++ b/litellm/proxy/guardrails/init_guardrails.py
@ -24,7 +24,7 @@ def initialize_guardrails(
            """
            one item looks like this:

-            {'prompt_injection': {'callbacks': ['lakera_prompt_injection', 'prompt_injection_api_2'], 'default_on': True}}
+            {'prompt_injection': {'callbacks': ['lakera_prompt_injection', 'prompt_injection_api_2'], 'default_on': True, 'enabled_roles': ['user']}}
            """
            for k, v in item.items():
                guardrail_item = GuardrailItem(**v, guardrail_name=k)
--- a/litellm/proxy/health_check.py
+++ b/litellm/proxy/health_check.py
@ -14,6 +14,7 @@ logger = logging.getLogger(__name__)

 ILLEGAL_DISPLAY_PARAMS = ["messages", "api_key", "prompt", "input"]

+MINIMAL_DISPLAY_PARAMS = ["model"]

 def _get_random_llm_message():
    """
@ -24,14 +25,18 @@ def _get_random_llm_message():
    return [{"role": "user", "content": random.choice(messages)}]


-def _clean_litellm_params(litellm_params: dict):
+def _clean_endpoint_data(endpoint_data: dict, details: bool):
    """
-    Clean the litellm params for display to users.
+    Clean the endpoint data for display to users.
    """
-    return {k: v for k, v in litellm_params.items() if k not in ILLEGAL_DISPLAY_PARAMS}
+    return (
+        {k: v for k, v in endpoint_data.items() if k not in ILLEGAL_DISPLAY_PARAMS}
+        if details
+        else {k: v for k, v in endpoint_data.items() if k in MINIMAL_DISPLAY_PARAMS}
+    )


-async def _perform_health_check(model_list: list):
+async def _perform_health_check(model_list: list, details: bool):
    """
    Perform a health check for each model in the list.
    """
@ -56,20 +61,20 @@ async def _perform_health_check(model_list: list):
    unhealthy_endpoints = []

    for is_healthy, model in zip(results, model_list):
-        cleaned_litellm_params = _clean_litellm_params(model["litellm_params"])
+        litellm_params = model["litellm_params"]

        if isinstance(is_healthy, dict) and "error" not in is_healthy:
-            healthy_endpoints.append({**cleaned_litellm_params, **is_healthy})
+            healthy_endpoints.append(_clean_endpoint_data({**litellm_params, **is_healthy}, details))
        elif isinstance(is_healthy, dict):
-            unhealthy_endpoints.append({**cleaned_litellm_params, **is_healthy})
+            unhealthy_endpoints.append(_clean_endpoint_data({**litellm_params, **is_healthy}, details))
        else:
-            unhealthy_endpoints.append(cleaned_litellm_params)
+            unhealthy_endpoints.append(_clean_endpoint_data(litellm_params, details))

    return healthy_endpoints, unhealthy_endpoints


 async def perform_health_check(
-    model_list: list, model: Optional[str] = None, cli_model: Optional[str] = None
+    model_list: list, model: Optional[str] = None, cli_model: Optional[str] = None, details: Optional[bool] = True
 ):
    """
    Perform a health check on the system.
@ -93,6 +98,6 @@ async def perform_health_check(
            _new_model_list = [x for x in model_list if x["model_name"] == model]
        model_list = _new_model_list

-    healthy_endpoints, unhealthy_endpoints = await _perform_health_check(model_list)
+    healthy_endpoints, unhealthy_endpoints = await _perform_health_check(model_list, details)

    return healthy_endpoints, unhealthy_endpoints
--- a/litellm/proxy/health_endpoints/_health_endpoints.py
+++ b/litellm/proxy/health_endpoints/_health_endpoints.py
@ -287,6 +287,7 @@ async def health_endpoint(
        llm_model_list,
        use_background_health_checks,
        user_model,
+        health_check_details
    )

    try:
@ -294,7 +295,7 @@ async def health_endpoint(
            # if no router set, check if user set a model using litellm --model ollama/llama2
            if user_model is not None:
                healthy_endpoints, unhealthy_endpoints = await perform_health_check(
-                    model_list=[], cli_model=user_model
+                    model_list=[], cli_model=user_model, details=health_check_details
                )
                return {
                    "healthy_endpoints": healthy_endpoints,
@ -316,7 +317,7 @@ async def health_endpoint(
            return health_check_results
        else:
            healthy_endpoints, unhealthy_endpoints = await perform_health_check(
-                _llm_model_list, model
+                _llm_model_list, model, details=health_check_details
            )

            return {
--- a/litellm/proxy/hooks/parallel_request_limiter.py
+++ b/litellm/proxy/hooks/parallel_request_limiter.py
@ -453,8 +453,10 @@ class _PROXY_MaxParallelRequestsHandler(CustomLogger):
    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        try:
            self.print_verbose(f"Inside Max Parallel Request Failure Hook")
-            global_max_parallel_requests = kwargs["litellm_params"]["metadata"].get(
-                "global_max_parallel_requests", None
+            global_max_parallel_requests = (
+                kwargs["litellm_params"]
+                .get("metadata", {})
+                .get("global_max_parallel_requests", None)
            )
            user_api_key = (
                kwargs["litellm_params"].get("metadata", {}).get("user_api_key", None)
@ -516,5 +518,7 @@ class _PROXY_MaxParallelRequestsHandler(CustomLogger):
                )  # save in cache for up to 1 min.
        except Exception as e:
            verbose_proxy_logger.info(
-                f"Inside Parallel Request Limiter: An exception occurred - {str(e)}."
+                "Inside Parallel Request Limiter: An exception occurred - {}\n{}".format(
+                    str(e), traceback.format_exc()
+                )
            )
--- a/litellm/proxy/proxy_server.py
+++ b/litellm/proxy/proxy_server.py
@ -416,6 +416,7 @@ user_custom_key_generate = None
 use_background_health_checks = None
 use_queue = False
 health_check_interval = None
+health_check_details = None
 health_check_results = {}
 queue: List = []
 litellm_proxy_budget_name = "litellm-proxy-budget"
@ -1204,14 +1205,14 @@ async def _run_background_health_check():

    Update health_check_results, based on this.
    """
-    global health_check_results, llm_model_list, health_check_interval
+    global health_check_results, llm_model_list, health_check_interval, health_check_details

    # make 1 deep copy of llm_model_list -> use this for all background health checks
    _llm_model_list = copy.deepcopy(llm_model_list)

    while True:
        healthy_endpoints, unhealthy_endpoints = await perform_health_check(
-            model_list=_llm_model_list
+            model_list=_llm_model_list, details=health_check_details
        )

        # Update the global variable with the health check results
@ -1363,7 +1364,7 @@ class ProxyConfig:
        """
        Load config values into proxy global state
        """
-        global master_key, user_config_file_path, otel_logging, user_custom_auth, user_custom_auth_path, user_custom_key_generate, use_background_health_checks, health_check_interval, use_queue, custom_db_client, proxy_budget_rescheduler_max_time, proxy_budget_rescheduler_min_time, ui_access_mode, litellm_master_key_hash, proxy_batch_write_at, disable_spend_logs, prompt_injection_detection_obj, redis_usage_cache, store_model_in_db, premium_user, open_telemetry_logger
+        global master_key, user_config_file_path, otel_logging, user_custom_auth, user_custom_auth_path, user_custom_key_generate, use_background_health_checks, health_check_interval, use_queue, custom_db_client, proxy_budget_rescheduler_max_time, proxy_budget_rescheduler_min_time, ui_access_mode, litellm_master_key_hash, proxy_batch_write_at, disable_spend_logs, prompt_injection_detection_obj, redis_usage_cache, store_model_in_db, premium_user, open_telemetry_logger, health_check_details

        # Load existing config
        config = await self.get_config(config_file_path=config_file_path)
@ -1733,6 +1734,9 @@ class ProxyConfig:
                "background_health_checks", False
            )
            health_check_interval = general_settings.get("health_check_interval", 300)
+            health_check_details = general_settings.get(
+                "health_check_details", True
+            )

            ## check if user has set a premium feature in general_settings
            if (
@ -9436,6 +9440,7 @@ def cleanup_router_config_variables():
    user_custom_key_generate = None
    use_background_health_checks = None
    health_check_interval = None
+    health_check_details = None
    prisma_client = None
    custom_db_client = None

--- a/litellm/tests/test_completion_cost.py
+++ b/litellm/tests/test_completion_cost.py
@ -706,6 +706,33 @@ def test_vertex_ai_completion_cost():
    print("calculated_input_cost: {}".format(calculated_input_cost))


+# @pytest.mark.skip(reason="new test - WIP, working on fixing this")
+def test_vertex_ai_medlm_completion_cost():
+    """Test for medlm completion cost."""
+
+    with pytest.raises(Exception) as e:
+        model = "vertex_ai/medlm-medium"
+        messages = [{"role": "user", "content": "Test MedLM completion cost."}]
+        predictive_cost = completion_cost(
+            model=model, messages=messages, custom_llm_provider="vertex_ai"
+        )
+
+    os.environ["LITELLM_LOCAL_MODEL_COST_MAP"] = "True"
+    litellm.model_cost = litellm.get_model_cost_map(url="")
+
+    model = "vertex_ai/medlm-medium"
+    messages = [{"role": "user", "content": "Test MedLM completion cost."}]
+    predictive_cost = completion_cost(
+        model=model, messages=messages, custom_llm_provider="vertex_ai"
+    )
+    assert predictive_cost > 0
+
+    model = "vertex_ai/medlm-large"
+    messages = [{"role": "user", "content": "Test MedLM completion cost."}]
+    predictive_cost = completion_cost(model=model, messages=messages)
+    assert predictive_cost > 0
+
+
 def test_vertex_ai_claude_completion_cost():
    from litellm import Choices, Message, ModelResponse
    from litellm.utils import Usage
--- a/litellm/tests/test_lakera_ai_prompt_injection.py
+++ b/litellm/tests/test_lakera_ai_prompt_injection.py
@ -1,18 +1,16 @@
 # What is this?
 ## This tests the Lakera AI integration

-import asyncio
 import os
-import random
 import sys
-import time
-import traceback
-from datetime import datetime
+import json

 from dotenv import load_dotenv
 from fastapi import HTTPException, Request, Response
 from fastapi.routing import APIRoute
 from starlette.datastructures import URL
+from fastapi import HTTPException
+from litellm.types.guardrails import GuardrailItem

 load_dotenv()
 import os
@ -25,7 +23,6 @@ import logging
 import pytest

 import litellm
-from litellm import Router, mock_completion
 from litellm._logging import verbose_proxy_logger
 from litellm.caching import DualCache
 from litellm.proxy._types import UserAPIKeyAuth
@ -34,12 +31,20 @@ from litellm.proxy.enterprise.enterprise_hooks.lakera_ai import (
 )
 from litellm.proxy.proxy_server import embeddings
 from litellm.proxy.utils import ProxyLogging, hash_token
+from litellm.proxy.utils import hash_token
+from unittest.mock import patch
+

 verbose_proxy_logger.setLevel(logging.DEBUG)

-### UNIT TESTS FOR Lakera AI PROMPT INJECTION ###
-
+def make_config_map(config: dict):
+    m = {}
+    for k, v in config.items():
+        guardrail_item = GuardrailItem(**v, guardrail_name=k)
+        m[k] = guardrail_item
+    return m

+@patch('litellm.guardrail_name_config_map', make_config_map({'prompt_injection': {'callbacks': ['lakera_prompt_injection', 'prompt_injection_api_2'], 'default_on': True, 'enabled_roles': ['system', 'user']}}))
@pytest.mark.asyncio
 async def test_lakera_prompt_injection_detection():
    """
@ -50,7 +55,6 @@ async def test_lakera_prompt_injection_detection():
    _api_key = "sk-12345"
    _api_key = hash_token("sk-12345")
    user_api_key_dict = UserAPIKeyAuth(api_key=_api_key)
-    local_cache = DualCache()

    try:
        await lakera_ai.async_moderation_hook(
@ -74,6 +78,7 @@ async def test_lakera_prompt_injection_detection():
        assert "Violated content safety policy" in str(http_exception)


+@patch('litellm.guardrail_name_config_map', make_config_map({'prompt_injection': {'callbacks': ['lakera_prompt_injection'], 'default_on': True}}))
@pytest.mark.asyncio
 async def test_lakera_safe_prompt():
    """
@ -84,7 +89,7 @@ async def test_lakera_safe_prompt():
    _api_key = "sk-12345"
    _api_key = hash_token("sk-12345")
    user_api_key_dict = UserAPIKeyAuth(api_key=_api_key)
-    local_cache = DualCache()
+
    await lakera_ai.async_moderation_hook(
        data={
            "messages": [
@ -146,3 +151,106 @@ async def test_moderations_on_embeddings():
    except Exception as e:
        print("got an exception", (str(e)))
        assert "Violated content safety policy" in str(e.message)
+
+@pytest.mark.asyncio
+@patch("litellm.llms.custom_httpx.http_handler.AsyncHTTPHandler.post")
+@patch("litellm.guardrail_name_config_map", 
+       new=make_config_map({"prompt_injection": {'callbacks': ['lakera_prompt_injection'], 'default_on': True, "enabled_roles": ["user", "system"]}}))
+async def test_messages_for_disabled_role(spy_post):
+    moderation = _ENTERPRISE_lakeraAI_Moderation()
+    data = {
+        "messages": [
+            {"role": "assistant", "content": "This should be ignored." },
+            {"role": "user", "content": "corgi sploot"},
+            {"role": "system", "content": "Initial content." },
+        ]
+    }
+
+    expected_data = {
+        "input": [
+            {"role": "system", "content": "Initial content."},
+            {"role": "user", "content": "corgi sploot"},
+        ]
+    }
+    await moderation.async_moderation_hook(data=data, user_api_key_dict=None, call_type="completion")
+    
+    _, kwargs = spy_post.call_args
+    assert json.loads(kwargs.get('data')) == expected_data
+
+@pytest.mark.asyncio
+@patch("litellm.llms.custom_httpx.http_handler.AsyncHTTPHandler.post")
+@patch("litellm.guardrail_name_config_map", 
+       new=make_config_map({"prompt_injection": {'callbacks': ['lakera_prompt_injection'], 'default_on': True}}))
+@patch("litellm.add_function_to_prompt", False)
+async def test_system_message_with_function_input(spy_post):
+    moderation = _ENTERPRISE_lakeraAI_Moderation()
+    data = {
+        "messages": [
+            {"role": "system", "content": "Initial content." },
+            {"role": "user", "content": "Where are the best sunsets?", "tool_calls": [{"function": {"arguments": "Function args"}}]}
+        ]
+    }
+
+    expected_data = {
+        "input": [
+            {"role": "system", "content": "Initial content. Function Input: Function args"},
+            {"role": "user", "content": "Where are the best sunsets?"},
+        ]
+    }
+    await moderation.async_moderation_hook(data=data, user_api_key_dict=None, call_type="completion")
+
+    _, kwargs = spy_post.call_args
+    assert json.loads(kwargs.get('data')) == expected_data
+
+@pytest.mark.asyncio
+@patch("litellm.llms.custom_httpx.http_handler.AsyncHTTPHandler.post")
+@patch("litellm.guardrail_name_config_map", 
+       new=make_config_map({"prompt_injection": {'callbacks': ['lakera_prompt_injection'], 'default_on': True}}))
+@patch("litellm.add_function_to_prompt", False)
+async def test_multi_message_with_function_input(spy_post):
+    moderation = _ENTERPRISE_lakeraAI_Moderation()
+    data = {
+        "messages": [
+            {"role": "system", "content": "Initial content.", "tool_calls": [{"function": {"arguments": "Function args"}}]},
+            {"role": "user", "content": "Strawberry", "tool_calls": [{"function": {"arguments": "Function args"}}]}
+        ]
+    }
+    expected_data = {
+        "input": [
+            {"role": "system", "content": "Initial content. Function Input: Function args Function args"},
+            {"role": "user", "content": "Strawberry"},
+        ]
+    }
+
+    await moderation.async_moderation_hook(data=data, user_api_key_dict=None, call_type="completion")
+
+    _, kwargs = spy_post.call_args
+    assert json.loads(kwargs.get('data')) == expected_data
+
+
+@pytest.mark.asyncio
+@patch("litellm.llms.custom_httpx.http_handler.AsyncHTTPHandler.post")
+@patch("litellm.guardrail_name_config_map", 
+       new=make_config_map({"prompt_injection": {'callbacks': ['lakera_prompt_injection'], 'default_on': True}}))
+async def test_message_ordering(spy_post):
+    moderation = _ENTERPRISE_lakeraAI_Moderation()
+    data = {
+        "messages": [
+            {"role": "assistant", "content": "Assistant message."},
+            {"role": "system", "content": "Initial content."},
+            {"role": "user", "content": "What games does the emporium have?"},
+        ]
+    }
+    expected_data = {
+        "input": [
+            {"role": "system", "content": "Initial content."},
+            {"role": "user", "content": "What games does the emporium have?"},
+            {"role": "assistant", "content": "Assistant message."},
+        ]
+    }
+
+    await moderation.async_moderation_hook(data=data, user_api_key_dict=None, call_type="completion")
+
+    _, kwargs = spy_post.call_args
+    assert json.loads(kwargs.get('data')) == expected_data
+
--- a/litellm/tests/test_langsmith.py
+++ b/litellm/tests/test_langsmith.py
@ -14,19 +14,18 @@ import litellm
 from litellm import completion
 from litellm._logging import verbose_logger
 from litellm.integrations.langsmith import LangsmithLogger
+from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler

 verbose_logger.setLevel(logging.DEBUG)

 litellm.set_verbose = True
 import time

-test_langsmith_logger = LangsmithLogger()
-

@pytest.mark.asyncio()
-async def test_langsmith_logging():
+async def test_async_langsmith_logging():
    try:
-
+        test_langsmith_logger = LangsmithLogger()
        run_id = str(uuid.uuid4())
        litellm.set_verbose = True
        litellm.callbacks = ["langsmith"]
@ -76,6 +75,11 @@ async def test_langsmith_logging():
        assert "user_api_key_user_id" in extra_fields_on_langsmith
        assert "user_api_key_team_alias" in extra_fields_on_langsmith

+        for cb in litellm.callbacks:
+            if isinstance(cb, LangsmithLogger):
+                await cb.async_httpx_client.client.aclose()
+        # test_langsmith_logger.async_httpx_client.close()
+
    except Exception as e:
        print(e)
        pytest.fail(f"Error occurred: {e}")
@ -84,7 +88,7 @@ async def test_langsmith_logging():
 # test_langsmith_logging()


-def test_langsmith_logging_with_metadata():
+def test_async_langsmith_logging_with_metadata():
    try:
        litellm.success_callback = ["langsmith"]
        litellm.set_verbose = True
@ -97,6 +101,10 @@ def test_langsmith_logging_with_metadata():
        print(response)
        time.sleep(3)

+        for cb in litellm.callbacks:
+            if isinstance(cb, LangsmithLogger):
+                cb.async_httpx_client.close()
+
    except Exception as e:
        pytest.fail(f"Error occurred: {e}")
        print(e)
@ -104,8 +112,9 @@ def test_langsmith_logging_with_metadata():

@pytest.mark.parametrize("sync_mode", [False, True])
@pytest.mark.asyncio
-async def test_langsmith_logging_with_streaming_and_metadata(sync_mode):
+async def test_async_langsmith_logging_with_streaming_and_metadata(sync_mode):
    try:
+        test_langsmith_logger = LangsmithLogger()
        litellm.success_callback = ["langsmith"]
        litellm.set_verbose = True
        run_id = str(uuid.uuid4())
@ -120,6 +129,9 @@ async def test_langsmith_logging_with_streaming_and_metadata(sync_mode):
                stream=True,
                metadata={"id": run_id},
            )
+            for cb in litellm.callbacks:
+                if isinstance(cb, LangsmithLogger):
+                    cb.async_httpx_client = AsyncHTTPHandler()
            for chunk in response:
                continue
            time.sleep(3)
@ -133,6 +145,9 @@ async def test_langsmith_logging_with_streaming_and_metadata(sync_mode):
                stream=True,
                metadata={"id": run_id},
            )
+            for cb in litellm.callbacks:
+                if isinstance(cb, LangsmithLogger):
+                    cb.async_httpx_client = AsyncHTTPHandler()
            async for chunk in response:
                continue
            await asyncio.sleep(3)
--- a/litellm/types/guardrails.py
+++ b/litellm/types/guardrails.py
@ -1,7 +1,8 @@
-from typing import Dict, List, Optional, Union
+from enum import Enum
+from typing import List, Optional

-from pydantic import BaseModel, RootModel
-from typing_extensions import Required, TypedDict, override
+from pydantic import BaseModel, ConfigDict
+from typing_extensions import Required, TypedDict

 """
 Pydantic object defining how to set guardrails on litellm proxy
@ -11,16 +12,24 @@ litellm_settings:
    - prompt_injection:
        callbacks: [lakera_prompt_injection, prompt_injection_api_2]
        default_on: true
+        enabled_roles: [system, user]
    - detect_secrets:
        callbacks: [hide_secrets]
        default_on: true
 """

+class Role(Enum):
+    SYSTEM = "system"
+    ASSISTANT = "assistant"
+    USER = "user"
+
+default_roles = [Role.SYSTEM, Role.ASSISTANT, Role.USER];

 class GuardrailItemSpec(TypedDict, total=False):
    callbacks: Required[List[str]]
    default_on: bool
    logging_only: Optional[bool]
+    enabled_roles: Optional[List[Role]]


 class GuardrailItem(BaseModel):
@ -28,6 +37,8 @@ class GuardrailItem(BaseModel):
    default_on: bool
    logging_only: Optional[bool]
    guardrail_name: str
+    enabled_roles: List[Role]
+    model_config = ConfigDict(use_enum_values=True)

    def __init__(
        self,
@ -35,10 +46,12 @@ class GuardrailItem(BaseModel):
        guardrail_name: str,
        default_on: bool = False,
        logging_only: Optional[bool] = None,
+        enabled_roles: List[Role] = default_roles,
    ):
        super().__init__(
            callbacks=callbacks,
            default_on=default_on,
            logging_only=logging_only,
            guardrail_name=guardrail_name,
+            enabled_roles=enabled_roles,
        )
--- a/litellm/utils.py
+++ b/litellm/utils.py
@ -4319,7 +4319,6 @@ def get_formatted_prompt(
            prompt = data["prompt"]
    return prompt

-
 def get_response_string(response_obj: ModelResponse) -> str:
    _choices: List[Union[Choices, StreamingChoices]] = response_obj.choices

--- a/model_prices_and_context_window.json
+++ b/model_prices_and_context_window.json
@ -21,6 +21,30 @@
        "supports_parallel_function_calling": true,
        "supports_vision": true
    },
+    "gpt-4o-mini": {
+        "max_tokens": 4096,
+        "max_input_tokens": 128000,
+        "max_output_tokens": 4096,
+        "input_cost_per_token": 0.00000015,
+        "output_cost_per_token": 0.00000060,
+        "litellm_provider": "openai",
+        "mode": "chat",
+        "supports_function_calling": true,
+        "supports_parallel_function_calling": true,
+        "supports_vision": true
+    },
+    "gpt-4o-mini-2024-07-18": {
+        "max_tokens": 4096,
+        "max_input_tokens": 128000,
+        "max_output_tokens": 4096,
+        "input_cost_per_token": 0.00000015,
+        "output_cost_per_token": 0.00000060,
+        "litellm_provider": "openai",
+        "mode": "chat",
+        "supports_function_calling": true,
+        "supports_parallel_function_calling": true,
+        "supports_vision": true
+    },
    "gpt-4o-2024-05-13": {
        "max_tokens": 4096,
        "max_input_tokens": 128000,
@ -1820,6 +1844,26 @@
        "supports_vision": true,
        "source": "https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#foundation_models"
    },
+    "medlm-medium": {
+        "max_tokens": 8192,
+        "max_input_tokens": 32768,
+        "max_output_tokens": 8192,
+        "input_cost_per_character": 0.0000005,
+        "output_cost_per_character": 0.000001,
+        "litellm_provider": "vertex_ai-language-models",
+        "mode": "chat",
+        "source": "https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#foundation_models"
+    },
+    "medlm-large": {
+        "max_tokens": 1024,
+        "max_input_tokens": 8192,
+        "max_output_tokens": 1024,
+        "input_cost_per_character": 0.000005,
+        "output_cost_per_character": 0.000015,
+        "litellm_provider": "vertex_ai-language-models",
+        "mode": "chat",
+        "source": "https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#foundation_models"
+    },
    "vertex_ai/claude-3-sonnet@20240229": {
        "max_tokens": 4096,
        "max_input_tokens": 200000,