forked from phoenix/litellm-mirror
Merge branch 'main' into main
This commit is contained in:
commit
d770df2259
198 changed files with 10972 additions and 7448 deletions
3
.github/workflows/interpret_load_test.py
vendored
|
@ -77,6 +77,9 @@ if __name__ == "__main__":
|
|||
new_release_body = (
|
||||
existing_release_body
|
||||
+ "\n\n"
|
||||
+ "### Don't want to maintain your internal proxy? get in touch 🎉"
|
||||
+ "\nHosted Proxy Alpha: https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat"
|
||||
+ "\n\n"
|
||||
+ "## Load Test LiteLLM Proxy Results"
|
||||
+ "\n\n"
|
||||
+ markdown_table
|
||||
|
|
2
.gitignore
vendored
|
@ -50,3 +50,5 @@ kub.yaml
|
|||
loadtest_kub.yaml
|
||||
litellm/proxy/_new_secret_config.yaml
|
||||
litellm/proxy/_new_secret_config.yaml
|
||||
litellm/proxy/_super_secret_config.yaml
|
||||
litellm/proxy/_super_secret_config.yaml
|
||||
|
|
|
@ -5,7 +5,7 @@
|
|||
<p align="center">Call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, etc.]
|
||||
<br>
|
||||
</p>
|
||||
<h4 align="center"><a href="https://docs.litellm.ai/docs/simple_proxy" target="_blank">OpenAI Proxy Server</a> | <a href="https://docs.litellm.ai/docs/enterprise" target="_blank">Enterprise Tier</a></h4>
<h4 align="center"><a href="https://docs.litellm.ai/docs/simple_proxy" target="_blank">OpenAI Proxy Server</a> | <a href="https://docs.litellm.ai/docs/hosted" target="_blank">Hosted Proxy (Preview)</a> | <a href="https://docs.litellm.ai/docs/enterprise" target="_blank">Enterprise Tier</a></h4>
|
||||
<h4 align="center">
|
||||
<a href="https://pypi.org/project/litellm/" target="_blank">
|
||||
<img src="https://img.shields.io/pypi/v/litellm.svg" alt="PyPI Version">
|
||||
|
@ -128,7 +128,9 @@ response = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content
|
|||
|
||||
# OpenAI Proxy - ([Docs](https://docs.litellm.ai/docs/simple_proxy))
|
||||
|
||||
Set Budgets & Rate limits across multiple projects
|
||||
Track spend + Load Balance across multiple projects
|
||||
|
||||
[Hosted Proxy (Preview)](https://docs.litellm.ai/docs/hosted)
|
||||
|
||||
The proxy provides:
|
||||
|
||||
|
@ -225,6 +227,7 @@ curl 'http://0.0.0.0:4000/key/generate' \
|
|||
| [perplexity-ai](https://docs.litellm.ai/docs/providers/perplexity) | ✅ | ✅ | ✅ | ✅ |
|
||||
| [Groq AI](https://docs.litellm.ai/docs/providers/groq) | ✅ | ✅ | ✅ | ✅ |
|
||||
| [anyscale](https://docs.litellm.ai/docs/providers/anyscale) | ✅ | ✅ | ✅ | ✅ |
|
||||
| [IBM - watsonx.ai](https://docs.litellm.ai/docs/providers/watsonx) | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| [voyage ai](https://docs.litellm.ai/docs/providers/voyage) | | | | | ✅ |
|
||||
| [xinference [Xorbits Inference]](https://docs.litellm.ai/docs/providers/xinference) | | | | | ✅ |
|
||||
|
||||
|
|
300
cookbook/liteLLM_IBM_Watsonx.ipynb
vendored
Normal file
File diff suppressed because one or more lines are too long
|
@ -23,6 +23,14 @@ response = completion(model="gpt-3.5-turbo", messages=messages)
|
|||
response = completion("command-nightly", messages)
|
||||
```
|
||||
|
||||
## JSON Logs
|
||||
|
||||
If you need to store the logs as JSON, just set `litellm.json_logs = True`.
|
||||
|
||||
We currently just log the raw POST request from litellm as JSON - [**See Code**].
|
||||
|
||||
[Share feedback here](https://github.com/BerriAI/litellm/issues)
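For illustration, here's a minimal sketch of enabling JSON logs (the model name and message are placeholders):

```python
import litellm
from litellm import completion

# Emit LiteLLM's raw request logs as JSON
litellm.json_logs = True

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
```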
|
||||
|
||||
## Logger Function
|
||||
But sometimes all you care about is seeing exactly what's being sent in your API call and what's being returned - e.g. if the API call is failing, why is that happening? What are the exact params being set?
|
||||
|
||||
|
|
|
@ -8,12 +8,13 @@ For companies that need SSO, user management and professional support for LiteLL
|
|||
:::
|
||||
|
||||
This covers:
|
||||
- ✅ **Features under the [LiteLLM Commercial License](https://docs.litellm.ai/docs/proxy/enterprise):**
|
||||
- ✅ **Features under the [LiteLLM Commercial License (Content Mod, Custom Tags, etc.)](https://docs.litellm.ai/docs/proxy/enterprise)**
|
||||
- ✅ **Feature Prioritization**
|
||||
- ✅ **Custom Integrations**
|
||||
- ✅ **Professional Support - Dedicated discord + slack**
|
||||
- ✅ **Custom SLAs**
|
||||
- ✅ **Secure access with Single Sign-On**
|
||||
- ✅ [**Secure UI access with Single Sign-On**](../docs/proxy/ui.md#setup-ssoauth-for-ui)
|
||||
- ✅ [**JWT-Auth**](../docs/proxy/token_auth.md)
|
||||
|
||||
|
||||
## Frequently Asked Questions
|
||||
|
|
49
docs/my-website/docs/hosted.md
Normal file
|
@ -0,0 +1,49 @@
|
|||
import Image from '@theme/IdealImage';
|
||||
|
||||
# Hosted LiteLLM Proxy
|
||||
|
||||
LiteLLM maintains the proxy, so you can focus on your core products.
|
||||
|
||||
## [**Get Onboarded**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
||||
|
||||
This is in alpha. Schedule a call with us, and we'll give you a hosted proxy within 30 minutes.
|
||||
|
||||
[**🚨 Schedule Call**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
||||
|
||||
### **Status**: Alpha
|
||||
|
||||
Our proxy is already used in production by customers.
|
||||
|
||||
See our status page for [**live reliability**](https://status.litellm.ai/)
|
||||
|
||||
### **Benefits**
|
||||
- **No Maintenance, No Infra**: We'll maintain the proxy, and spin up any additional infrastructure (e.g.: separate server for spend logs) to make sure you can load balance + track spend across multiple LLM projects.
|
||||
- **Reliable**: Our hosted proxy is tested on 1k requests per second, making it reliable for high load.
|
||||
- **Secure**: LiteLLM is currently undergoing SOC-2 compliance, to make sure your data is as secure as possible.
|
||||
|
||||
### Pricing
|
||||
|
||||
Pricing is based on usage. We can figure out a price that works for your team, on the call.
|
||||
|
||||
[**🚨 Schedule Call**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
||||
|
||||
## **Screenshots**
|
||||
|
||||
### 1. Create keys
|
||||
|
||||
<Image img={require('../img/litellm_hosted_ui_create_key.png')} />
|
||||
|
||||
### 2. Add Models
|
||||
|
||||
<Image img={require('../img/litellm_hosted_ui_add_models.png')}/>
|
||||
|
||||
### 3. Track spend
|
||||
|
||||
<Image img={require('../img/litellm_hosted_usage_dashboard.png')} />
|
||||
|
||||
|
||||
### 4. Configure load balancing
|
||||
|
||||
<Image img={require('../img/litellm_hosted_ui_router.png')} />
|
||||
|
||||
#### [**🚨 Schedule Call**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
|
@ -213,3 +213,349 @@ asyncio.run(loadtest_fn())
|
|||
|
||||
```
|
||||
|
||||
## Multi-Instance TPM/RPM Load Test (Router)
|
||||
|
||||
Test if your defined tpm/rpm limits are respected across multiple instances of the Router object.
|
||||
|
||||
In our test:
|
||||
- Max RPM per deployment = 100 requests per minute
|
||||
- Max Throughput / min on router = 200 requests per minute (2 deployments)
|
||||
- Load we'll send through router = 600 requests per minute
|
||||
|
||||
:::info
|
||||
|
||||
If you don't want to call a real LLM API endpoint, you can set up a fake OpenAI server. [See code](#extra---setup-fake-openai-server)
|
||||
|
||||
:::
|
||||
|
||||
### Code
|
||||
|
||||
Let's hit the router with 600 requests per minute.
|
||||
|
||||
Copy this script 👇. Save it as `test_loadtest_router.py` AND run it with `python3 test_loadtest_router.py`
|
||||
|
||||
|
||||
```python
|
||||
from litellm import Router
|
||||
import litellm
|
||||
litellm.suppress_debug_info = True
|
||||
litellm.set_verbose = False
|
||||
import logging
|
||||
logging.basicConfig(level=logging.CRITICAL)
|
||||
import os, random, uuid, time, asyncio
|
||||
|
||||
# Model list for OpenAI and Anthropic models
|
||||
model_list = [
|
||||
{
|
||||
"model_name": "fake-openai-endpoint",
|
||||
"litellm_params": {
|
||||
"model": "gpt-3.5-turbo",
|
||||
"api_key": "my-fake-key",
|
||||
"api_base": "http://0.0.0.0:8080",
|
||||
"rpm": 100
|
||||
},
|
||||
},
|
||||
{
|
||||
"model_name": "fake-openai-endpoint",
|
||||
"litellm_params": {
|
||||
"model": "gpt-3.5-turbo",
|
||||
"api_key": "my-fake-key",
|
||||
"api_base": "http://0.0.0.0:8081",
|
||||
"rpm": 100
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
router_1 = Router(model_list=model_list, num_retries=0, enable_pre_call_checks=True, routing_strategy="usage-based-routing-v2", redis_host=os.getenv("REDIS_HOST"), redis_port=os.getenv("REDIS_PORT"), redis_password=os.getenv("REDIS_PASSWORD"))
|
||||
router_2 = Router(model_list=model_list, num_retries=0, routing_strategy="usage-based-routing-v2", enable_pre_call_checks=True, redis_host=os.getenv("REDIS_HOST"), redis_port=os.getenv("REDIS_PORT"), redis_password=os.getenv("REDIS_PASSWORD"))
|
||||
|
||||
|
||||
|
||||
async def router_completion_non_streaming():
|
||||
try:
|
||||
client: Router = random.sample([router_1, router_2], 1)[0] # randomly pick b/w clients
|
||||
# print(f"client={client}")
|
||||
response = await client.acompletion(
|
||||
model="fake-openai-endpoint", # [CHANGE THIS] (if you call it something else on your proxy)
|
||||
messages=[{"role": "user", "content": f"This is a test: {uuid.uuid4()}"}],
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# print(e)
|
||||
return None
|
||||
|
||||
async def loadtest_fn():
|
||||
start = time.time()
|
||||
n = 600 # Number of concurrent tasks
|
||||
tasks = [router_completion_non_streaming() for _ in range(n)]
|
||||
chat_completions = await asyncio.gather(*tasks)
|
||||
successful_completions = [c for c in chat_completions if c is not None]
|
||||
print(n, time.time() - start, len(successful_completions))
|
||||
|
||||
def get_utc_datetime():
|
||||
import datetime as dt
|
||||
from datetime import datetime
|
||||
|
||||
if hasattr(dt, "UTC"):
|
||||
return datetime.now(dt.UTC) # type: ignore
|
||||
else:
|
||||
return datetime.utcnow() # type: ignore
|
||||
|
||||
|
||||
# Run the event loop to execute the async function
|
||||
async def parent_fn():
|
||||
for _ in range(10):
|
||||
dt = get_utc_datetime()
|
||||
current_minute = dt.strftime("%H-%M")
|
||||
print(f"triggered new batch - {current_minute}")
|
||||
await loadtest_fn()
|
||||
await asyncio.sleep(10)
|
||||
|
||||
asyncio.run(parent_fn())
|
||||
```
|
||||
## Multi-Instance TPM/RPM Load Test (Proxy)
|
||||
|
||||
Test if your defined tpm/rpm limits are respected across multiple instances.
|
||||
|
||||
The quickest way to do this is by testing the [proxy](./proxy/quick_start.md). The proxy uses the [router](./routing.md) under the hood, so if you're using either of them, this test should work for you.
|
||||
|
||||
In our test:
|
||||
- Max RPM per deployment = 100 requests per minute
|
||||
- Max Throughput / min on proxy = 200 requests per minute (2 deployments)
|
||||
- Load we'll send to proxy = 600 requests per minute
|
||||
|
||||
|
||||
So we'll send 600 requests per minute, but expect only 200 requests per minute to succeed.
|
||||
|
||||
:::info
|
||||
|
||||
If you don't want to call a real LLM API endpoint, you can set up a fake OpenAI server. [See code](#extra---setup-fake-openai-server)
|
||||
|
||||
:::
|
||||
|
||||
### 1. Setup config
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- litellm_params:
|
||||
api_base: http://0.0.0.0:8080
|
||||
api_key: my-fake-key
|
||||
model: openai/my-fake-model
|
||||
rpm: 100
|
||||
model_name: fake-openai-endpoint
|
||||
- litellm_params:
|
||||
api_base: http://0.0.0.0:8081
|
||||
api_key: my-fake-key
|
||||
model: openai/my-fake-model-2
|
||||
rpm: 100
|
||||
model_name: fake-openai-endpoint
|
||||
router_settings:
|
||||
num_retries: 0
|
||||
enable_pre_call_checks: true
|
||||
redis_host: os.environ/REDIS_HOST ## 👈 IMPORTANT! Setup the proxy w/ redis
|
||||
redis_password: os.environ/REDIS_PASSWORD
|
||||
redis_port: os.environ/REDIS_PORT
|
||||
routing_strategy: usage-based-routing-v2
|
||||
```
|
||||
|
||||
### 2. Start proxy 2 instances
|
||||
|
||||
**Instance 1**
|
||||
```bash
|
||||
litellm --config /path/to/config.yaml --port 4000
|
||||
|
||||
## RUNNING on http://0.0.0.0:4000
|
||||
```
|
||||
|
||||
**Instance 2**
|
||||
```bash
|
||||
litellm --config /path/to/config.yaml --port 4001
|
||||
|
||||
## RUNNING on http://0.0.0.0:4001
|
||||
```
|
||||
|
||||
### 3. Run Test
|
||||
|
||||
Let's hit the proxy with 600 requests per minute.
|
||||
|
||||
Copy this script 👇. Save it as `test_loadtest_proxy.py` AND run it with `python3 test_loadtest_proxy.py`
|
||||
|
||||
```python
|
||||
from openai import AsyncOpenAI, AsyncAzureOpenAI
|
||||
import random, uuid
|
||||
import time, asyncio, litellm
|
||||
# import logging
|
||||
# logging.basicConfig(level=logging.DEBUG)
|
||||
#### LITELLM PROXY ####
|
||||
litellm_client = AsyncOpenAI(
|
||||
api_key="sk-1234", # [CHANGE THIS]
|
||||
base_url="http://0.0.0.0:4000"
|
||||
)
|
||||
litellm_client_2 = AsyncOpenAI(
|
||||
api_key="sk-1234", # [CHANGE THIS]
|
||||
base_url="http://0.0.0.0:4001"
|
||||
)
|
||||
|
||||
async def proxy_completion_non_streaming():
|
||||
try:
|
||||
client = random.sample([litellm_client, litellm_client_2], 1)[0] # randomly pick b/w clients
|
||||
# print(f"client={client}")
|
||||
response = await client.chat.completions.create(
|
||||
model="fake-openai-endpoint", # [CHANGE THIS] (if you call it something else on your proxy)
|
||||
messages=[{"role": "user", "content": f"This is a test: {uuid.uuid4()}"}],
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# print(e)
|
||||
return None
|
||||
|
||||
async def loadtest_fn():
|
||||
start = time.time()
|
||||
n = 600 # Number of concurrent tasks
|
||||
tasks = [proxy_completion_non_streaming() for _ in range(n)]
|
||||
chat_completions = await asyncio.gather(*tasks)
|
||||
successful_completions = [c for c in chat_completions if c is not None]
|
||||
print(n, time.time() - start, len(successful_completions))
|
||||
|
||||
def get_utc_datetime():
|
||||
import datetime as dt
|
||||
from datetime import datetime
|
||||
|
||||
if hasattr(dt, "UTC"):
|
||||
return datetime.now(dt.UTC) # type: ignore
|
||||
else:
|
||||
return datetime.utcnow() # type: ignore
|
||||
|
||||
|
||||
# Run the event loop to execute the async function
|
||||
async def parent_fn():
|
||||
for _ in range(10):
|
||||
dt = get_utc_datetime()
|
||||
current_minute = dt.strftime("%H-%M")
|
||||
print(f"triggered new batch - {current_minute}")
|
||||
await loadtest_fn()
|
||||
await asyncio.sleep(10)
|
||||
|
||||
asyncio.run(parent_fn())
|
||||
|
||||
```
|
||||
|
||||
|
||||
### Extra - Setup Fake OpenAI Server
|
||||
|
||||
Let's set up a fake OpenAI server with an RPM limit of 100.
|
||||
|
||||
Let's call our file `fake_openai_server.py`.
|
||||
|
||||
```python
|
||||
# import sys, os
|
||||
# sys.path.insert(
|
||||
# 0, os.path.abspath("../")
|
||||
# ) # Adds the parent directory to the system path
|
||||
from fastapi import FastAPI, Request, status, HTTPException, Depends
|
||||
from fastapi.responses import StreamingResponse
|
||||
from fastapi.security import OAuth2PasswordBearer
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.responses import JSONResponse
|
||||
from fastapi import FastAPI, Request, HTTPException, UploadFile, File
|
||||
import httpx, os, json
|
||||
from openai import AsyncOpenAI
|
||||
from typing import Optional
|
||||
from slowapi import Limiter
|
||||
from slowapi.util import get_remote_address
|
||||
from slowapi.errors import RateLimitExceeded
|
||||
from fastapi import FastAPI, Request, HTTPException
|
||||
from fastapi.responses import PlainTextResponse
|
||||
|
||||
|
||||
class ProxyException(Exception):
|
||||
# NOTE: DO NOT MODIFY THIS
|
||||
# This is used to map exactly to OPENAI Exceptions
|
||||
def __init__(
|
||||
self,
|
||||
message: str,
|
||||
type: str,
|
||||
param: Optional[str],
|
||||
code: Optional[int],
|
||||
):
|
||||
self.message = message
|
||||
self.type = type
|
||||
self.param = param
|
||||
self.code = code
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Converts the ProxyException instance to a dictionary."""
|
||||
return {
|
||||
"message": self.message,
|
||||
"type": self.type,
|
||||
"param": self.param,
|
||||
"code": self.code,
|
||||
}
|
||||
|
||||
|
||||
limiter = Limiter(key_func=get_remote_address)
|
||||
app = FastAPI()
|
||||
app.state.limiter = limiter
|
||||
|
||||
@app.exception_handler(RateLimitExceeded)
|
||||
async def _rate_limit_exceeded_handler(request: Request, exc: RateLimitExceeded):
|
||||
return JSONResponse(status_code=429,
|
||||
content={"detail": "Rate Limited!"})
|
||||
|
||||
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
|
||||
|
||||
app.add_middleware(
|
||||
CORSMiddleware,
|
||||
allow_origins=["*"],
|
||||
allow_credentials=True,
|
||||
allow_methods=["*"],
|
||||
allow_headers=["*"],
|
||||
)
|
||||
|
||||
# for completion
|
||||
@app.post("/chat/completions")
|
||||
@app.post("/v1/chat/completions")
|
||||
@limiter.limit("100/minute")
|
||||
async def completion(request: Request):
|
||||
# raise HTTPException(status_code=429, detail="Rate Limited!")
|
||||
return {
|
||||
"id": "chatcmpl-123",
|
||||
"object": "chat.completion",
|
||||
"created": 1677652288,
|
||||
"model": None,
|
||||
"system_fingerprint": "fp_44709d6fcb",
|
||||
"choices": [{
|
||||
"index": 0,
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "\n\nHello there, how may I assist you today?",
|
||||
},
|
||||
"logprobs": None,
|
||||
"finish_reason": "stop"
|
||||
}],
|
||||
"usage": {
|
||||
"prompt_tokens": 9,
|
||||
"completion_tokens": 12,
|
||||
"total_tokens": 21
|
||||
}
|
||||
}
|
||||
|
||||
if __name__ == "__main__":
|
||||
import socket
|
||||
import uvicorn
|
||||
port = 8080
|
||||
while True:
|
||||
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||
result = sock.connect_ex(('0.0.0.0', port))
|
||||
if result != 0:
|
||||
print(f"Port {port} is available, starting server...")
|
||||
break
|
||||
else:
|
||||
port += 1
|
||||
|
||||
uvicorn.run(app, host="0.0.0.0", port=port)
|
||||
```
|
||||
|
||||
```bash
|
||||
python3 fake_openai_server.py
|
||||
```
|
||||
|
|
|
@ -331,49 +331,25 @@ response = litellm.completion(model="gpt-3.5-turbo", messages=messages, metadata
|
|||
## Examples
|
||||
|
||||
### Custom Callback to track costs for Streaming + Non-Streaming
|
||||
By default, the response cost is accessible in the logging object via `kwargs["response_cost"]` on success (sync + async)
|
||||
```python
|
||||
|
||||
# Step 1. Write your custom callback function
|
||||
def track_cost_callback(
|
||||
kwargs, # kwargs to completion
|
||||
completion_response, # response from completion
|
||||
start_time, end_time # start/end time
|
||||
):
|
||||
try:
|
||||
# init logging config
|
||||
logging.basicConfig(
|
||||
filename='cost.log',
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
|
||||
# check if it has collected an entire stream response
|
||||
if "complete_streaming_response" in kwargs:
|
||||
# for tracking streaming cost we pass the "messages" and the output_text to litellm.completion_cost
|
||||
completion_response=kwargs["complete_streaming_response"]
|
||||
input_text = kwargs["messages"]
|
||||
output_text = completion_response["choices"][0]["message"]["content"]
|
||||
response_cost = litellm.completion_cost(
|
||||
model = kwargs["model"],
|
||||
messages = input_text,
|
||||
completion=output_text
|
||||
)
|
||||
print("streaming response_cost", response_cost)
|
||||
logging.info(f"Model {kwargs['model']} Cost: ${response_cost:.8f}")
|
||||
|
||||
# for non streaming responses
|
||||
else:
|
||||
# we pass the completion_response obj
|
||||
if kwargs["stream"] != True:
|
||||
response_cost = litellm.completion_cost(completion_response=completion_response)
|
||||
print("regular response_cost", response_cost)
|
||||
logging.info(f"Model {completion_response.model} Cost: ${response_cost:.8f}")
|
||||
response_cost = kwargs["response_cost"] # litellm calculates response cost for you
|
||||
print("regular response_cost", response_cost)
|
||||
except:
|
||||
pass
|
||||
|
||||
# Assign the custom callback function
|
||||
# Step 2. Assign the custom callback function
|
||||
litellm.success_callback = [track_cost_callback]
|
||||
|
||||
# Step 3. Make litellm.completion call
|
||||
response = completion(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[
|
||||
|
|
68
docs/my-website/docs/observability/greenscale_integration.md
Normal file
|
@ -0,0 +1,68 @@
|
|||
# Greenscale Tutorial
|
||||
|
||||
[Greenscale](https://greenscale.ai/) is a production monitoring platform for your LLM-powered app that gives you granular insights into your GenAI spend and responsible usage. Greenscale only captures metadata to minimize the exposure risk of personally identifiable information (PII).
|
||||
|
||||
## Getting Started
|
||||
|
||||
Use Greenscale to log requests across all LLM Providers
|
||||
|
||||
liteLLM provides `callbacks`, making it easy for you to log data depending on the status of your responses.
|
||||
|
||||
## Using Callbacks
|
||||
|
||||
First, email `hello@greenscale.ai` to get an API_KEY.
|
||||
|
||||
Use just 1 line of code, to instantly log your responses **across all providers** with Greenscale:
|
||||
|
||||
```python
|
||||
litellm.success_callback = ["greenscale"]
|
||||
```
|
||||
|
||||
### Complete code
|
||||
|
||||
```python
|
||||
import os
import litellm
from litellm import completion
|
||||
|
||||
## set env variables
|
||||
os.environ['GREENSCALE_API_KEY'] = 'your-greenscale-api-key'
|
||||
os.environ['GREENSCALE_ENDPOINT'] = 'greenscale-endpoint'
|
||||
os.environ["OPENAI_API_KEY"]= ""
|
||||
|
||||
# set callback
|
||||
litellm.success_callback = ["greenscale"]
|
||||
|
||||
#openai call
|
||||
response = completion(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}]
|
||||
metadata={
|
||||
"greenscale_project": "acme-project",
|
||||
"greenscale_application": "acme-application"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Additional information in metadata
|
||||
|
||||
You can send additional information to Greenscale by using the `metadata` field in `completion` with a `greenscale_` prefix. This is useful for attaching metadata about the request, such as the project and application name, customer_id, environment, or any other information you want to track. `greenscale_project` and `greenscale_application` are required fields.
|
||||
|
||||
```python
|
||||
#openai call with additional metadata
|
||||
response = completion(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[
|
||||
{"role": "user", "content": "Hi 👋 - i'm openai"}
|
||||
],
|
||||
metadata={
|
||||
"greenscale_project": "acme-project",
|
||||
"greenscale_application": "acme-application",
|
||||
"greenscale_customer_id": "customer-123"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Support & Talk with Greenscale Team
|
||||
|
||||
- [Schedule Demo 👋](https://calendly.com/nandesh/greenscale)
|
||||
- [Website 💻](https://greenscale.ai)
|
||||
- Our email ✉️ `hello@greenscale.ai`
|
|
@ -121,10 +121,12 @@ response = completion(
|
|||
metadata={
|
||||
"generation_name": "ishaan-test-generation", # set langfuse Generation Name
|
||||
"generation_id": "gen-id22", # set langfuse Generation ID
|
||||
"trace_id": "trace-id22", # set langfuse Trace ID
|
||||
"trace_user_id": "user-id2", # set langfuse Trace User ID
|
||||
"session_id": "session-1", # set langfuse Session ID
|
||||
"tags": ["tag1", "tag2"] # set langfuse Tags
|
||||
"trace_id": "trace-id22", # set langfuse Trace ID
|
||||
### OR ###
|
||||
"existing_trace_id": "trace-id22", # if generation is continuation of past trace. This prevents default behaviour of setting a trace name
|
||||
},
|
||||
)
|
||||
|
||||
|
@ -167,6 +169,9 @@ messages = [
|
|||
chat(messages)
|
||||
```
|
||||
|
||||
## Redacting Messages, Response Content from Langfuse Logging
|
||||
|
||||
Set `litellm.turn_off_message_logging=True`. This prevents messages and responses from being logged to Langfuse, but request metadata will still be logged.
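For illustration, a minimal sketch of using this flag alongside the Langfuse callback (the model name and message are placeholders):

```python
import litellm
from litellm import completion

# assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY (and your LLM API key) are set in the env
litellm.success_callback = ["langfuse"]

# Redact message/response content from Langfuse logs; request metadata is still sent
litellm.turn_off_message_logging = True

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "sensitive prompt"}],
)
```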
|
||||
|
||||
## Troubleshooting & Errors
|
||||
### Data not getting logged to Langfuse ?
|
||||
|
|
|
@ -57,7 +57,7 @@ os.environ["LANGSMITH_API_KEY"] = ""
|
|||
os.environ['OPENAI_API_KEY']=""
|
||||
|
||||
# set langfuse as a callback, litellm will send the data to langfuse
|
||||
litellm.success_callback = ["langfuse"]
|
||||
litellm.success_callback = ["langsmith"]
|
||||
|
||||
response = litellm.completion(
|
||||
model="gpt-3.5-turbo",
|
||||
|
@ -76,4 +76,4 @@ print(response)
|
|||
- [Schedule Demo 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
|
||||
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
|
||||
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
|
|
|
@ -40,5 +40,9 @@ response = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content
|
|||
print(response)
|
||||
```
|
||||
|
||||
## Redacting Messages, Response Content from Sentry Logging
|
||||
|
||||
Set `litellm.turn_off_message_logging=True`. This prevents messages and responses from being logged to Sentry, but request metadata will still be logged.
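As a rough sketch (assuming Sentry is already configured as a LiteLLM callback per the setup earlier on this page; model name and message are placeholders):

```python
import litellm
from litellm import completion

# Redact message/response content; request metadata is still logged
litellm.turn_off_message_logging = True

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "sensitive prompt"}],
)
```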
|
||||
|
||||
[Let us know](https://github.com/BerriAI/litellm/issues/new?assignees=&labels=enhancement&projects=&template=feature_request.yml&title=%5BFeature%5D%3A+) if you need any additional options from Sentry.
|
||||
|
||||
|
|
|
@ -224,6 +224,91 @@ assert isinstance(
|
|||
```
|
||||
|
||||
|
||||
### Parallel Function Calling
|
||||
|
||||
Here's how to pass the result of a function call back to an anthropic model:
|
||||
|
||||
```python
|
||||
import litellm
from litellm import completion
|
||||
import os
|
||||
|
||||
os.environ["ANTHROPIC_API_KEY"] = "sk-ant.."
|
||||
|
||||
|
||||
litellm.set_verbose = True
|
||||
|
||||
### 1ST FUNCTION CALL ###
|
||||
tools = [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "get_current_weather",
|
||||
"description": "Get the current weather in a given location",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
|
||||
},
|
||||
"required": ["location"],
|
||||
},
|
||||
},
|
||||
}
|
||||
]
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What's the weather like in Boston today in Fahrenheit?",
|
||||
}
|
||||
]
|
||||
try:
|
||||
# test without max tokens
|
||||
response = completion(
|
||||
model="anthropic/claude-3-opus-20240229",
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
tool_choice="auto",
|
||||
)
|
||||
# Add any assertions, here to check response args
|
||||
print(response)
|
||||
assert isinstance(response.choices[0].message.tool_calls[0].function.name, str)
|
||||
assert isinstance(
|
||||
response.choices[0].message.tool_calls[0].function.arguments, str
|
||||
)
|
||||
|
||||
messages.append(
|
||||
response.choices[0].message.model_dump()
|
||||
) # Add assistant tool invokes
|
||||
tool_result = (
|
||||
'{"location": "Boston", "temperature": "72", "unit": "fahrenheit"}'
|
||||
)
|
||||
# Add user submitted tool results in the OpenAI format
|
||||
messages.append(
|
||||
{
|
||||
"tool_call_id": response.choices[0].message.tool_calls[0].id,
|
||||
"role": "tool",
|
||||
"name": response.choices[0].message.tool_calls[0].function.name,
|
||||
"content": tool_result,
|
||||
}
|
||||
)
|
||||
### 2ND FUNCTION CALL ###
|
||||
# In the second response, Claude should deduce answer from tool results
|
||||
second_response = completion(
|
||||
model="anthropic/claude-3-opus-20240229",
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
tool_choice="auto",
|
||||
)
|
||||
print(second_response)
|
||||
except Exception as e:
|
||||
print(f"An error occurred - {str(e)}")
|
||||
```
|
||||
|
||||
s/o @[Shekhar Patnaik](https://www.linkedin.com/in/patnaikshekhar) for requesting this!
|
||||
|
||||
## Usage - Vision
|
||||
|
||||
```python
|
||||
|
|
|
@ -23,7 +23,7 @@ In certain use-cases you may need to make calls to the models and pass [safety s
|
|||
```python
|
||||
response = completion(
|
||||
model="gemini/gemini-pro",
|
||||
messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}]
|
||||
messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}],
|
||||
safety_settings=[
|
||||
{
|
||||
"category": "HARM_CATEGORY_HARASSMENT",
|
||||
|
|
|
@ -48,6 +48,8 @@ We support ALL Groq models, just set `groq/` as a prefix when sending completion
|
|||
|
||||
| Model Name | Function Call |
|
||||
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| llama3-8b-8192 | `completion(model="groq/llama3-8b-8192", messages)` |
|
||||
| llama3-70b-8192 | `completion(model="groq/llama3-70b-8192", messages)` |
|
||||
| llama2-70b-4096 | `completion(model="groq/llama2-70b-4096", messages)` |
|
||||
| mixtral-8x7b-32768 | `completion(model="groq/mixtral-8x7b-32768", messages)` |
|
||||
| gemma-7b-it | `completion(model="groq/gemma-7b-it", messages)` |
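For illustration, a minimal sketch of calling one of the models above (assuming `GROQ_API_KEY` is the environment variable LiteLLM reads for Groq):

```python
import os
from litellm import completion

os.environ["GROQ_API_KEY"] = ""  # your Groq API key

# the `groq/` prefix tells LiteLLM to route the call to Groq
response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "Hello from LiteLLM!"}],
)
print(response)
```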
|
||||
|
|
|
@ -53,6 +53,50 @@ All models listed here https://docs.mistral.ai/platform/endpoints are supported.
|
|||
| open-mixtral-8x22b | `completion(model="mistral/open-mixtral-8x22b", messages)` |
|
||||
|
||||
|
||||
## Function Calling
|
||||
|
||||
```python
|
||||
import os
from litellm import completion
|
||||
|
||||
# set env
|
||||
os.environ["MISTRAL_API_KEY"] = "your-api-key"
|
||||
|
||||
tools = [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "get_current_weather",
|
||||
"description": "Get the current weather in a given location",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
|
||||
},
|
||||
"required": ["location"],
|
||||
},
|
||||
},
|
||||
}
|
||||
]
|
||||
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
|
||||
|
||||
response = completion(
|
||||
model="mistral/mistral-large-latest",
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
tool_choice="auto",
|
||||
)
|
||||
# Add any assertions, here to check response args
|
||||
print(response)
|
||||
assert isinstance(response.choices[0].message.tool_calls[0].function.name, str)
|
||||
assert isinstance(
|
||||
response.choices[0].message.tool_calls[0].function.arguments, str
|
||||
)
|
||||
```
|
||||
|
||||
## Sample Usage - Embedding
|
||||
```python
|
||||
from litellm import embedding
|
||||
|
|
|
@ -1,7 +1,16 @@
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Replicate
|
||||
|
||||
LiteLLM supports all models on Replicate.
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="sdk" label="SDK">
|
||||
|
||||
### API KEYS
|
||||
```python
|
||||
import os
|
||||
|
@ -16,14 +25,175 @@ import os
|
|||
## set ENV variables
|
||||
os.environ["REPLICATE_API_KEY"] = "replicate key"
|
||||
|
||||
# replicate llama-2 call
|
||||
# replicate llama-3 call
|
||||
response = completion(
|
||||
model="replicate/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf",
|
||||
model="replicate/meta/meta-llama-3-8b-instruct",
|
||||
messages = [{ "content": "Hello, how are you?","role": "user"}]
|
||||
)
|
||||
```
|
||||
|
||||
### Example - Calling Replicate Deployments
|
||||
</TabItem>
|
||||
<TabItem value="proxy" label="PROXY">
|
||||
|
||||
1. Add models to your config.yaml
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3
|
||||
litellm_params:
|
||||
model: replicate/meta/meta-llama-3-8b-instruct
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
```
|
||||
|
||||
|
||||
|
||||
2. Start the proxy
|
||||
|
||||
```bash
|
||||
$ litellm --config /path/to/config.yaml --debug
|
||||
```
|
||||
|
||||
3. Send Request to LiteLLM Proxy Server
|
||||
|
||||
<Tabs>
|
||||
|
||||
<TabItem value="openai" label="OpenAI Python v1.0.0+">
|
||||
|
||||
```python
|
||||
import openai
|
||||
client = openai.OpenAI(
|
||||
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
|
||||
base_url="http://0.0.0.0:4000" # litellm-proxy-base url
|
||||
)
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="llama-3",
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "Be a good human!"
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What do you know about earth?"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
print(response)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
|
||||
<TabItem value="curl" label="curl">
|
||||
|
||||
```shell
|
||||
curl --location 'http://0.0.0.0:4000/chat/completions' \
|
||||
--header 'Authorization: Bearer sk-1234' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--data '{
|
||||
"model": "llama-3",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "Be a good human!"
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What do you know about earth?"
|
||||
}
|
||||
],
|
||||
}'
|
||||
```
|
||||
</TabItem>
|
||||
|
||||
</Tabs>
|
||||
|
||||
|
||||
### Expected Replicate Call
|
||||
|
||||
This is the call litellm will make to replicate, from the above example:
|
||||
|
||||
```bash
|
||||
|
||||
POST Request Sent from LiteLLM:
|
||||
curl -X POST \
|
||||
https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct \
|
||||
-H 'Authorization: Token your-api-key' -H 'Content-Type: application/json' \
|
||||
-d '{'version': 'meta/meta-llama-3-8b-instruct', 'input': {'prompt': '<|start_header_id|>system<|end_header_id|>\n\nBe a good human!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat do you know about earth?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'}}'
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
|
||||
</Tabs>
|
||||
|
||||
## Advanced Usage - Prompt Formatting
|
||||
|
||||
LiteLLM has prompt template mappings for all `meta-llama` llama3 instruct models. [**See Code**](https://github.com/BerriAI/litellm/blob/4f46b4c3975cd0f72b8c5acb2cb429d23580c18a/litellm/llms/prompt_templates/factory.py#L1360)
|
||||
|
||||
To apply a custom prompt template:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="sdk" label="SDK">
|
||||
|
||||
```python
|
||||
import litellm
|
||||
|
||||
import os
|
||||
os.environ["REPLICATE_API_KEY"] = ""
|
||||
|
||||
# Create your own custom prompt template
|
||||
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    initial_prompt_value="You are a good assistant", # [OPTIONAL]
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n", # [OPTIONAL]
            "post_message": "\n<</SYS>>\n [/INST]\n" # [OPTIONAL]
        },
        "user": {
            "pre_message": "[INST] ", # [OPTIONAL]
            "post_message": " [/INST]" # [OPTIONAL]
        },
        "assistant": {
            "pre_message": "\n", # [OPTIONAL]
            "post_message": "\n" # [OPTIONAL]
        }
    },
    final_prompt_value="Now answer as best you can:" # [OPTIONAL]
)
|
||||
|
||||
def test_replicate_custom_model():
    model = "replicate/togethercomputer/LLaMA-2-7B-32K"
    messages = [{"role": "user", "content": "Hello, how are you?"}]
    response = litellm.completion(model=model, messages=messages)
    print(response['choices'][0]['message']['content'])
    return response
|
||||
|
||||
test_replicate_custom_model()
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="proxy" label="PROXY">
|
||||
|
||||
```yaml
|
||||
# Model-specific parameters
|
||||
model_list:
|
||||
- model_name: mistral-7b # model alias
|
||||
litellm_params: # actual params for litellm.completion()
|
||||
model: "replicate/mistralai/Mistral-7B-Instruct-v0.1"
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
initial_prompt_value: "\n"
|
||||
roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
|
||||
final_prompt_value: "\n"
|
||||
bos_token: "<s>"
|
||||
eos_token: "</s>"
|
||||
max_tokens: 4096
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
|
||||
</Tabs>
|
||||
|
||||
## Advanced Usage - Calling Replicate Deployments
|
||||
Calling a [deployed Replicate LLM](https://replicate.com/deployments)
Add the `replicate/deployments/` prefix to your model so LiteLLM calls the `deployments` endpoint. This calls the `ishaan-jaff/ishaan-mistral` deployment on Replicate, as sketched below.
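A minimal sketch of that call (uses the deployment name from the table further down; swap in your own deployment):

```python
import os
from litellm import completion

os.environ["REPLICATE_API_KEY"] = ""

# the `replicate/deployments/` prefix routes the request to Replicate's deployments endpoint
response = completion(
    model="replicate/deployments/ishaan-jaff/ishaan-mistral",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response)
```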
|
||||
|
||||
|
@ -40,7 +210,7 @@ Replicate responses can take 3-5 mins due to replicate cold boots, if you're try
|
|||
|
||||
:::
|
||||
|
||||
### Replicate Models
|
||||
## Replicate Models
|
||||
liteLLM supports all replicate LLMs
|
||||
|
||||
For replicate models ensure to add a `replicate/` prefix to the `model` arg. liteLLM detects it using this arg.
|
||||
|
@ -49,15 +219,15 @@ Below are examples on how to call replicate LLMs using liteLLM
|
|||
|
||||
Model Name | Function Call | Required OS Variables |
|
||||
-----------------------------|----------------------------------------------------------------|--------------------------------------|
|
||||
replicate/llama-2-70b-chat | `completion(model='replicate/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf', messages, supports_system_prompt=True)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
a16z-infra/llama-2-13b-chat| `completion(model='replicate/a16z-infra/llama-2-13b-chat:2a7f981751ec7fdf87b5b91ad4db53683a98082e9ff7bfd12c8cd5ea85980a52', messages, supports_system_prompt=True)`| `os.environ['REPLICATE_API_KEY']` |
|
||||
replicate/llama-2-70b-chat | `completion(model='replicate/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
a16z-infra/llama-2-13b-chat| `completion(model='replicate/a16z-infra/llama-2-13b-chat:2a7f981751ec7fdf87b5b91ad4db53683a98082e9ff7bfd12c8cd5ea85980a52', messages)`| `os.environ['REPLICATE_API_KEY']` |
|
||||
replicate/vicuna-13b | `completion(model='replicate/vicuna-13b:6282abe6a492de4145d7bb601023762212f9ddbbe78278bd6771c8b3b2f2a13b', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
daanelson/flan-t5-large | `completion(model='replicate/daanelson/flan-t5-large:ce962b3f6792a57074a601d3979db5839697add2e4e02696b3ced4c022d4767f', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
custom-llm | `completion(model='replicate/custom-llm-version-id', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
replicate deployment | `completion(model='replicate/deployments/ishaan-jaff/ishaan-mistral', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
|
||||
|
||||
### Passing additional params - max_tokens, temperature
|
||||
## Passing additional params - max_tokens, temperature
|
||||
See all litellm.completion supported params [here](https://docs.litellm.ai/docs/completion/input)
|
||||
|
||||
```python
|
||||
|
@ -73,11 +243,22 @@ response = completion(
|
|||
messages = [{ "content": "Hello, how are you?","role": "user"}],
|
||||
max_tokens=20,
|
||||
temperature=0.5
|
||||
|
||||
)
|
||||
```
|
||||
|
||||
### Passing Replicate-specific params
|
||||
**proxy**
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3
|
||||
litellm_params:
|
||||
model: replicate/meta/meta-llama-3-8b-instruct
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
max_tokens: 20
|
||||
temperature: 0.5
|
||||
```
|
||||
|
||||
## Passing Replicate-specific params
|
||||
Send params [not supported by `litellm.completion()`](https://docs.litellm.ai/docs/completion/input) but supported by Replicate by passing them to `litellm.completion`
|
||||
|
||||
For example, `seed` and `min_tokens` are Replicate-specific params.
|
||||
|
@ -98,3 +279,15 @@ response = completion(
|
|||
top_k=20,
|
||||
)
|
||||
```
|
||||
|
||||
**proxy**
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3
|
||||
litellm_params:
|
||||
model: replicate/meta/meta-llama-3-8b-instruct
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
min_tokens: 2
|
||||
top_k: 20
|
||||
```
|
||||
|
|
|
@ -4,6 +4,13 @@ LiteLLM supports all models on VLLM.
|
|||
|
||||
🚀[Code Tutorial](https://github.com/BerriAI/litellm/blob/main/cookbook/VLLM_Model_Testing.ipynb)
|
||||
|
||||
|
||||
:::info
|
||||
|
||||
To call a HOSTED VLLM Endpoint use [these docs](./openai_compatible.md)
|
||||
|
||||
:::
|
||||
|
||||
### Quick Start
|
||||
```
|
||||
pip install litellm vllm
|
||||
|
|
284
docs/my-website/docs/providers/watsonx.md
Normal file
|
@ -0,0 +1,284 @@
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# IBM watsonx.ai
|
||||
|
||||
LiteLLM supports all IBM [watsonx.ai](https://watsonx.ai/) foundational models and embeddings.
|
||||
|
||||
## Environment Variables
|
||||
```python
|
||||
os.environ["WATSONX_URL"] = "" # (required) Base URL of your WatsonX instance
|
||||
# (required) either one of the following:
|
||||
os.environ["WATSONX_APIKEY"] = "" # IBM cloud API key
|
||||
os.environ["WATSONX_TOKEN"] = "" # IAM auth token
|
||||
# optional - can also be passed as params to completion() or embedding()
|
||||
os.environ["WATSONX_PROJECT_ID"] = "" # Project ID of your WatsonX instance
|
||||
os.environ["WATSONX_DEPLOYMENT_SPACE_ID"] = "" # ID of your deployment space to use deployed models
|
||||
```
|
||||
|
||||
See [here](https://cloud.ibm.com/apidocs/watsonx-ai#api-authentication) for more information on how to get an access token to authenticate to watsonx.ai.
|
||||
|
||||
## Usage
|
||||
|
||||
<a target="_blank" href="https://colab.research.google.com/github/BerriAI/litellm/blob/main/cookbook/liteLLM_IBM_Watsonx.ipynb">
|
||||
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||
</a>
|
||||
|
||||
```python
|
||||
import os
|
||||
from litellm import completion
|
||||
|
||||
os.environ["WATSONX_URL"] = ""
|
||||
os.environ["WATSONX_APIKEY"] = ""
|
||||
|
||||
response = completion(
|
||||
model="watsonx/ibm/granite-13b-chat-v2",
|
||||
messages=[{ "content": "what is your favorite colour?","role": "user"}],
|
||||
project_id="<my-project-id>" # or pass with os.environ["WATSONX_PROJECT_ID"]
|
||||
)
|
||||
|
||||
response = completion(
|
||||
model="watsonx/meta-llama/llama-3-8b-instruct",
|
||||
messages=[{ "content": "what is your favorite colour?","role": "user"}],
|
||||
project_id="<my-project-id>"
|
||||
)
|
||||
```
|
||||
|
||||
## Usage - Streaming
|
||||
```python
|
||||
import os
|
||||
from litellm import completion
|
||||
|
||||
os.environ["WATSONX_URL"] = ""
|
||||
os.environ["WATSONX_APIKEY"] = ""
|
||||
os.environ["WATSONX_PROJECT_ID"] = ""
|
||||
|
||||
response = completion(
|
||||
model="watsonx/ibm/granite-13b-chat-v2",
|
||||
messages=[{ "content": "what is your favorite colour?","role": "user"}],
|
||||
stream=True
|
||||
)
|
||||
for chunk in response:
|
||||
print(chunk)
|
||||
```
|
||||
|
||||
#### Example Streaming Output Chunk
|
||||
```json
|
||||
{
|
||||
"choices": [
|
||||
{
|
||||
"finish_reason": null,
|
||||
"index": 0,
|
||||
"delta": {
|
||||
"content": "I don't have a favorite color, but I do like the color blue. What's your favorite color?"
|
||||
}
|
||||
}
|
||||
],
|
||||
"created": null,
|
||||
"model": "watsonx/ibm/granite-13b-chat-v2",
|
||||
"usage": {
|
||||
"prompt_tokens": null,
|
||||
"completion_tokens": null,
|
||||
"total_tokens": null
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Usage - Models in deployment spaces
|
||||
|
||||
Models that have been deployed to a deployment space (e.g.: tuned models) can be called using the `deployment/<deployment_id>` format (where `<deployment_id>` is the ID of the deployed model in your deployment space).
|
||||
|
||||
The ID of your deployment space must also be set in the environment variable `WATSONX_DEPLOYMENT_SPACE_ID` or passed to the function as `space_id=<deployment_space_id>`.
|
||||
|
||||
```python
|
||||
import litellm
|
||||
response = litellm.completion(
|
||||
model="watsonx/deployment/<deployment_id>",
|
||||
messages=[{"content": "Hello, how are you?", "role": "user"}],
|
||||
space_id="<deployment_space_id>"
|
||||
)
|
||||
```
|
||||
|
||||
## Usage - Embeddings
|
||||
|
||||
LiteLLM also supports making requests to IBM watsonx.ai embedding models. The credentials needed are the same as for completion.
|
||||
|
||||
```python
|
||||
from litellm import embedding
|
||||
|
||||
response = embedding(
|
||||
model="watsonx/ibm/slate-30m-english-rtrvr",
|
||||
input=["What is the capital of France?"],
|
||||
project_id="<my-project-id>"
|
||||
)
|
||||
print(response)
|
||||
# EmbeddingResponse(model='ibm/slate-30m-english-rtrvr', data=[{'object': 'embedding', 'index': 0, 'embedding': [-0.037463713, -0.02141933, -0.02851813, 0.015519324, ..., -0.0021367231, -0.01704561, -0.001425816, 0.0035238306]}], object='list', usage=Usage(prompt_tokens=8, total_tokens=8))
|
||||
```
|
||||
|
||||
## OpenAI Proxy Usage
|
||||
|
||||
Here's how to call IBM watsonx.ai with the LiteLLM Proxy Server
|
||||
|
||||
### 1. Save keys in your environment
|
||||
|
||||
```bash
|
||||
export WATSONX_URL=""
|
||||
export WATSONX_APIKEY=""
|
||||
export WATSONX_PROJECT_ID=""
|
||||
```
|
||||
|
||||
### 2. Start the proxy
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="cli" label="CLI">
|
||||
|
||||
```bash
|
||||
$ litellm --model watsonx/meta-llama/llama-3-8b-instruct
|
||||
|
||||
# Server running on http://0.0.0.0:4000
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="config" label="config.yaml">
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3-8b
|
||||
litellm_params:
|
||||
# all params accepted by litellm.completion()
|
||||
model: watsonx/meta-llama/llama-3-8b-instruct
|
||||
api_key: "os.environ/WATSONX_API_KEY" # does os.getenv("WATSONX_API_KEY")
|
||||
```
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
### 3. Test it
|
||||
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="Curl" label="Curl Request">
|
||||
|
||||
```shell
|
||||
curl --location 'http://0.0.0.0:4000/chat/completions' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--data ' {
|
||||
"model": "llama-3-8b",
|
||||
"messages": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "what is your favorite colour?"
|
||||
}
|
||||
]
|
||||
}
|
||||
'
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="openai" label="OpenAI v1.0.0+">
|
||||
|
||||
```python
|
||||
import openai
|
||||
client = openai.OpenAI(
|
||||
api_key="anything",
|
||||
base_url="http://0.0.0.0:4000"
|
||||
)
|
||||
|
||||
# request sent to model set on litellm proxy, `litellm --model`
|
||||
response = client.chat.completions.create(model="llama-3-8b", messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "what is your favorite colour?"
|
||||
}
|
||||
])
|
||||
|
||||
print(response)
|
||||
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="langchain" label="Langchain">
|
||||
|
||||
```python
|
||||
from langchain.chat_models import ChatOpenAI
|
||||
from langchain.prompts.chat import (
|
||||
ChatPromptTemplate,
|
||||
HumanMessagePromptTemplate,
|
||||
SystemMessagePromptTemplate,
|
||||
)
|
||||
from langchain.schema import HumanMessage, SystemMessage
|
||||
|
||||
chat = ChatOpenAI(
|
||||
openai_api_base="http://0.0.0.0:4000", # set openai_api_base to the LiteLLM Proxy
|
||||
model = "llama-3-8b",
|
||||
temperature=0.1
|
||||
)
|
||||
|
||||
messages = [
|
||||
SystemMessage(
|
||||
content="You are a helpful assistant that im using to make a test request to."
|
||||
),
|
||||
HumanMessage(
|
||||
content="test from litellm. tell me why it's amazing in 1 sentence"
|
||||
),
|
||||
]
|
||||
response = chat(messages)
|
||||
|
||||
print(response)
|
||||
```
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
|
||||
## Authentication
|
||||
|
||||
### Passing credentials as parameters
|
||||
|
||||
You can also pass the credentials as parameters to the completion and embedding functions.
|
||||
|
||||
```python
|
||||
import os
|
||||
from litellm import completion
|
||||
|
||||
response = completion(
|
||||
model="watsonx/ibm/granite-13b-chat-v2",
|
||||
messages=[{ "content": "What is your favorite color?","role": "user"}],
|
||||
url="",
|
||||
api_key="",
|
||||
project_id=""
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
## Supported IBM watsonx.ai Models
|
||||
|
||||
Here are some examples of models available in IBM watsonx.ai that you can use with LiteLLM:
|
||||
|
||||
| Model Name | Command |
|
||||
| ---------- | --------- |
|
||||
| Flan T5 XXL | `completion(model=watsonx/google/flan-t5-xxl, messages=messages)` |
|
||||
| Flan Ul2 | `completion(model=watsonx/google/flan-ul2, messages=messages)` |
|
||||
| Mt0 XXL | `completion(model=watsonx/bigscience/mt0-xxl, messages=messages)` |
|
||||
| Gpt Neox | `completion(model=watsonx/eleutherai/gpt-neox-20b, messages=messages)` |
|
||||
| Mpt 7B Instruct2 | `completion(model=watsonx/ibm/mpt-7b-instruct2, messages=messages)` |
|
||||
| Starcoder | `completion(model=watsonx/bigcode/starcoder, messages=messages)` |
|
||||
| Llama 2 70B Chat | `completion(model=watsonx/meta-llama/llama-2-70b-chat, messages=messages)` |
|
||||
| Llama 2 13B Chat | `completion(model=watsonx/meta-llama/llama-2-13b-chat, messages=messages)` |
|
||||
| Granite 13B Instruct | `completion(model=watsonx/ibm/granite-13b-instruct-v1, messages=messages)` |
|
||||
| Granite 13B Chat | `completion(model=watsonx/ibm/granite-13b-chat-v1, messages=messages)` |
|
||||
| Flan T5 XL | `completion(model=watsonx/google/flan-t5-xl, messages=messages)` |
|
||||
| Granite 13B Chat V2 | `completion(model=watsonx/ibm/granite-13b-chat-v2, messages=messages)` |
|
||||
| Granite 13B Instruct V2 | `completion(model=watsonx/ibm/granite-13b-instruct-v2, messages=messages)` |
|
||||
| Elyza Japanese Llama 2 7B Instruct | `completion(model=watsonx/elyza/elyza-japanese-llama-2-7b-instruct, messages=messages)` |
|
||||
| Mixtral 8X7B Instruct V01 Q | `completion(model=watsonx/ibm-mistralai/mixtral-8x7b-instruct-v01-q, messages=messages)` |
|
||||
|
||||
|
||||
For a list of all available models in watsonx.ai, see [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx&locale=en&audience=wdp).
|
||||
|
||||
|
||||
## Supported IBM watsonx.ai Embedding Models
|
||||
|
||||
| Model Name | Function Call |
|
||||
|----------------------|---------------------------------------------|
|
||||
| Slate 30m | `embedding(model="watsonx/ibm/slate-30m-english-rtrvr", input=input)` |
|
||||
| Slate 125m | `embedding(model="watsonx/ibm/slate-125m-english-rtrvr", input=input)` |
|
||||
|
||||
|
||||
For a list of all available embedding models in watsonx.ai, see [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-embed.html?context=wx).
|
|
@ -1,13 +1,13 @@
|
|||
# Slack Alerting
|
||||
# 🚨 Alerting
|
||||
|
||||
Get alerts for:
|
||||
- hanging LLM api calls
|
||||
- failed LLM api calls
|
||||
- slow LLM api calls
|
||||
- budget Tracking per key/user:
|
||||
- Hanging LLM api calls
|
||||
- Failed LLM api calls
|
||||
- Slow LLM api calls
|
||||
- Budget Tracking per key/user:
|
||||
- When a User/Key crosses their Budget
|
||||
- When a User/Key is 15% away from crossing their Budget
|
||||
- failed db read/writes
|
||||
- Failed db read/writes
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
|
|
@ -61,6 +61,22 @@ litellm_settings:
|
|||
ttl: 600 # will be cached on redis for 600s
|
||||
```
|
||||
|
||||
|
||||
## SSL
|
||||
|
||||
Just set `REDIS_SSL="True"` in your .env, and LiteLLM will pick this up.
|
||||
|
||||
```env
|
||||
REDIS_SSL="True"
|
||||
```
|
||||
|
||||
For quick testing, you can also use `REDIS_URL`, e.g.:
|
||||
|
||||
```env
|
||||
REDIS_URL="rediss://.."
|
||||
```
|
||||
|
||||
However, we **don't** recommend using `REDIS_URL` in production. We've noticed a performance difference between using it vs. `redis_host`, `redis_port`, etc.
|
||||
#### Step 2: Add Redis Credentials to .env
|
||||
Set either `REDIS_URL` or `REDIS_HOST` in your OS environment to enable caching.
|
||||
|
||||
|
|
|
@ -62,9 +62,11 @@ model_list:
|
|||
|
||||
litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
|
||||
drop_params: True
|
||||
success_callback: ["langfuse"] # OPTIONAL - if you want to start sending LLM Logs to Langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your env
|
||||
|
||||
general_settings:
|
||||
master_key: sk-1234 # [OPTIONAL] Only use this if you to require all calls to contain this key (Authorization: Bearer sk-1234)
|
||||
alerting: ["slack"] # [OPTIONAL] If you want Slack Alerts for Hanging LLM requests, Slow llm responses, Budget Alerts. Make sure to set `SLACK_WEBHOOK_URL` in your env
|
||||
```
|
||||
:::info
|
||||
|
||||
|
@ -600,6 +602,7 @@ general_settings:
|
|||
"general_settings": {
|
||||
"completion_model": "string",
|
||||
"disable_spend_logs": "boolean", # turn off writing each transaction to the db
|
||||
"disable_master_key_return": "boolean", # turn off returning master key on UI (checked on '/user/info' endpoint)
|
||||
"disable_reset_budget": "boolean", # turn off reset budget scheduled task
|
||||
"enable_jwt_auth": "boolean", # allow proxy admin to auth in via jwt tokens with 'litellm_proxy_admin' in claims
|
||||
"enforce_user_param": "boolean", # requires all openai endpoint requests to have a 'user' param
|
||||
|
|
|
@ -231,13 +231,16 @@ Your OpenAI proxy server is now running on `http://127.0.0.1:4000`.
|
|||
| Docs | When to Use |
|
||||
| --- | --- |
|
||||
| [Quick Start](#quick-start) | call 100+ LLMs + Load Balancing |
|
||||
| [Deploy with Database](#deploy-with-database) | + use Virtual Keys + Track Spend |
|
||||
| [Deploy with Database](#deploy-with-database) | + use Virtual Keys + Track Spend (Note: when deploying with a database, `DATABASE_URL` and `LITELLM_MASTER_KEY` are required in your env) |
|
||||
| [LiteLLM container + Redis](#litellm-container--redis) | + load balance across multiple litellm containers |
|
||||
| [LiteLLM Database container + PostgresDB + Redis](#litellm-database-container--postgresdb--redis) | + use Virtual Keys + Track Spend + load balance across multiple litellm containers |
|
||||
|
||||
## Deploy with Database
|
||||
### Docker, Kubernetes, Helm Chart
|
||||
|
||||
Requirements:
|
||||
- You need a Postgres database (e.g. [Supabase](https://supabase.com/), [Neon](https://neon.tech/), etc.). Set `DATABASE_URL=postgresql://<user>:<password>@<host>:<port>/<dbname>` in your env
|
||||
- Set a `LITELLM_MASTER_KEY`. This is your Proxy Admin key - you can use it to create other keys (🚨 it must start with `sk-`)
|
||||
|
||||
<Tabs>
|
||||
|
||||
|
@ -252,6 +255,8 @@ docker pull ghcr.io/berriai/litellm-database:main-latest
|
|||
```shell
|
||||
docker run \
|
||||
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
|
||||
-e LITELLM_MASTER_KEY=sk-1234 \
|
||||
-e DATABASE_URL=postgresql://<user>:<password>@<host>:<port>/<dbname> \
|
||||
-e AZURE_API_KEY=d6*********** \
|
||||
-e AZURE_API_BASE=https://openai-***********/ \
|
||||
-p 4000:4000 \
|
||||
|
@ -267,26 +272,63 @@ Your OpenAI proxy server is now running on `http://0.0.0.0:4000`.
|
|||
#### Step 1. Create deployment.yaml
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: litellm-deployment
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: litellm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: litellm
|
||||
spec:
|
||||
containers:
|
||||
- name: litellm-container
|
||||
image: ghcr.io/berriai/litellm-database:main-latest
|
||||
env:
|
||||
- name: DATABASE_URL
|
||||
value: postgresql://<user>:<password>@<host>:<port>/<dbname>
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: litellm-deployment
|
||||
spec:
|
||||
replicas: 3
|
||||
selector:
|
||||
matchLabels:
|
||||
app: litellm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: litellm
|
||||
spec:
|
||||
containers:
|
||||
- name: litellm-container
|
||||
image: ghcr.io/berriai/litellm:main-latest
|
||||
imagePullPolicy: Always
|
||||
env:
|
||||
- name: AZURE_API_KEY
|
||||
value: "d6******"
|
||||
- name: AZURE_API_BASE
|
||||
value: "https://ope******"
|
||||
- name: LITELLM_MASTER_KEY
|
||||
value: "sk-1234"
|
||||
- name: DATABASE_URL
|
||||
value: "po**********"
|
||||
args:
|
||||
- "--config"
|
||||
- "/app/proxy_config.yaml" # Update the path to mount the config file
|
||||
volumeMounts: # Define volume mount for proxy_config.yaml
|
||||
- name: config-volume
|
||||
mountPath: /app
|
||||
readOnly: true
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /health/liveliness
|
||||
port: 4000
|
||||
initialDelaySeconds: 120
|
||||
periodSeconds: 15
|
||||
successThreshold: 1
|
||||
failureThreshold: 3
|
||||
timeoutSeconds: 10
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /health/readiness
|
||||
port: 4000
|
||||
initialDelaySeconds: 120
|
||||
periodSeconds: 15
|
||||
successThreshold: 1
|
||||
failureThreshold: 3
|
||||
timeoutSeconds: 10
|
||||
volumes: # Define volume to mount proxy_config.yaml
|
||||
- name: config-volume
|
||||
configMap:
|
||||
name: litellm-config
|
||||
|
||||
```
|
||||
|
||||
```bash
|
||||
|
|
|
@ -401,7 +401,7 @@ litellm_settings:
|
|||
Start the LiteLLM Proxy and make a test request to verify the logs reached your callback API
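For example, a quick way to trigger a test log is to send one request through the proxy with the OpenAI SDK (a minimal sketch; the key and port are placeholders):

```python
from openai import OpenAI

# hypothetical virtual key + local proxy address
client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "this is a test request"}],
)
print(response)
```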
|
||||
|
||||
## Logging Proxy Input/Output - Langfuse
|
||||
We will use the `--config` to set `litellm.success_callback = ["langfuse"]` this will log all successfull LLM calls to langfuse
|
||||
We will use the `--config` to set `litellm.success_callback = ["langfuse"]`, this will log all successful LLM calls to langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your environment
|
||||
|
||||
**Step 1** Install langfuse
|
||||
|
||||
|
@ -419,7 +419,13 @@ litellm_settings:
|
|||
success_callback: ["langfuse"]
|
||||
```
|
||||
|
||||
**Step 3**: Start the proxy, make a test request
|
||||
**Step 3**: Set required env variables for logging to langfuse
|
||||
```shell
|
||||
export LANGFUSE_PUBLIC_KEY="pk_kk"
|
||||
export LANGFUSE_SECRET_KEY="sk_ss"
|
||||
```
|
||||
|
||||
**Step 4**: Start the proxy, make a test request
|
||||
|
||||
Start proxy
|
||||
```shell
|
||||
|
@ -569,6 +575,22 @@ curl -X POST 'http://0.0.0.0:4000/key/generate' \
|
|||
|
||||
All requests made with these keys will log data to their team-specific logging.
|
||||
|
||||
### Redacting Messages, Response Content from Langfuse Logging
|
||||
|
||||
Set `litellm.turn_off_message_logging=True`. This will prevent the messages and responses from being logged to Langfuse, but request metadata will still be logged.
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: gpt-3.5-turbo
|
||||
litellm_params:
|
||||
model: gpt-3.5-turbo
|
||||
litellm_settings:
|
||||
success_callback: ["langfuse"]
|
||||
turn_off_message_logging: True
|
||||
```
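For SDK users, the same redaction can be applied directly on the `litellm` module (a minimal sketch, assuming the Langfuse env vars are already set):

```python
import litellm

litellm.success_callback = ["langfuse"]
litellm.turn_off_message_logging = True  # prompts/responses are redacted; request metadata is still logged

litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "hi"}],
)
```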
|
||||
|
||||
|
||||
|
||||
## Logging Proxy Input/Output - DataDog
|
||||
We will use the `--config` to set `litellm.success_callback = ["datadog"]`, this will log all successful LLM calls to DataDog
|
||||
|
||||
|
|
|
@ -16,7 +16,7 @@ Expected Performance in Production
|
|||
| `/chat/completions` Requests/hour | `126K` |
|
||||
|
||||
|
||||
## 1. Switch of Debug Logging
|
||||
## 1. Switch off Debug Logging
|
||||
|
||||
Remove `set_verbose: True` from your config.yaml
|
||||
```yaml
|
||||
|
@ -40,7 +40,7 @@ Use this Docker `CMD`. This will start the proxy with 1 Uvicorn Async Worker
|
|||
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml"]
|
||||
```
|
||||
|
||||
## 2. Batch write spend updates every 60s
|
||||
## 3. Batch write spend updates every 60s
|
||||
|
||||
The default proxy batch write is 10s. This is to make it easy to see spend when debugging locally.
|
||||
|
||||
|
@ -49,11 +49,35 @@ In production, we recommend using a longer interval period of 60s. This reduces
|
|||
```yaml
|
||||
general_settings:
|
||||
master_key: sk-1234
|
||||
proxy_batch_write_at: 5 # 👈 Frequency of batch writing logs to server (in seconds)
|
||||
proxy_batch_write_at: 60 # 👈 Frequency of batch writing logs to server (in seconds)
|
||||
```
|
||||
|
||||
## 4. Use Redis 'port', 'host', 'password'. NOT 'redis_url'
|
||||
|
||||
## 3. Move spend logs to separate server
|
||||
When connecting to Redis, use the 'port', 'host', and 'password' params, not 'redis_url'. We've seen an 80 RPS difference between these two approaches when using the async redis client.
|
||||
|
||||
This is still something we're investigating. Keep track of it [here](https://github.com/BerriAI/litellm/issues/3188)
|
||||
|
||||
Recommended to do this for prod:
|
||||
|
||||
```yaml
|
||||
router_settings:
|
||||
routing_strategy: usage-based-routing-v2
|
||||
# redis_url: "os.environ/REDIS_URL"
|
||||
redis_host: os.environ/REDIS_HOST
|
||||
redis_port: os.environ/REDIS_PORT
|
||||
redis_password: os.environ/REDIS_PASSWORD
|
||||
```
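For SDK users, a rough Router equivalent is sketched below (assuming the same Redis env vars are set; the deployment entry is illustrative):

```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo"},
        }
    ],
    routing_strategy="usage-based-routing-v2",
    # pass host/port/password instead of redis_url
    redis_host=os.environ["REDIS_HOST"],
    redis_port=int(os.environ["REDIS_PORT"]),
    redis_password=os.environ["REDIS_PASSWORD"],
)
```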
|
||||
|
||||
## 5. Switch off resetting budgets
|
||||
|
||||
Add this to your config.yaml. (Budgets for Keys, Users and Teams will no longer be reset on their scheduled interval.)
|
||||
```yaml
|
||||
general_settings:
|
||||
disable_reset_budget: true
|
||||
```
|
||||
|
||||
## 6. Move spend logs to separate server (BETA)
|
||||
|
||||
Writing each spend log to the db can slow down your proxy. In testing, we saw a 70% improvement in median response time by moving spend log writes to a separate server.
|
||||
|
||||
|
@ -141,24 +165,6 @@ A t2.micro should be sufficient to handle 1k logs / minute on this server.
|
|||
|
||||
This consumes at max 120MB, and <0.1 vCPU.
|
||||
|
||||
## 4. Switch off resetting budgets
|
||||
|
||||
Add this to your config.yaml. (Only spend per Key, User and Team will be tracked - spend per API Call will not be written to the LiteLLM Database)
|
||||
```yaml
|
||||
general_settings:
|
||||
disable_spend_logs: true
|
||||
disable_reset_budget: true
|
||||
```
|
||||
|
||||
## 5. Switch of `litellm.telemetry`
|
||||
|
||||
Switch of all telemetry tracking done by litellm
|
||||
|
||||
```yaml
|
||||
litellm_settings:
|
||||
telemetry: False
|
||||
```
|
||||
|
||||
## Machine Specifications to Deploy LiteLLM
|
||||
|
||||
| Service | Spec | CPUs | Memory | Architecture | Version|
|
||||
|
|
|
@ -14,6 +14,7 @@ model_list:
|
|||
model: gpt-3.5-turbo
|
||||
litellm_settings:
|
||||
success_callback: ["prometheus"]
|
||||
failure_callback: ["prometheus"]
|
||||
```
|
||||
|
||||
Start the proxy
|
||||
|
@ -48,9 +49,10 @@ http://localhost:4000/metrics
|
|||
|
||||
| Metric Name | Description |
|
||||
|----------------------|--------------------------------------|
|
||||
| `litellm_requests_metric` | Number of requests made, per `"user", "key", "model"` |
|
||||
| `litellm_spend_metric` | Total Spend, per `"user", "key", "model"` |
|
||||
| `litellm_total_tokens` | input + output tokens per `"user", "key", "model"` |
|
||||
| `litellm_requests_metric` | Number of requests made, per `"user", "key", "model", "team", "end-user"` |
|
||||
| `litellm_spend_metric` | Total Spend, per `"user", "key", "model", "team", "end-user"` |
|
||||
| `litellm_total_tokens` | input + output tokens per `"user", "key", "model", "team", "end-user"` |
|
||||
| `litellm_llm_api_failed_requests_metric` | Number of failed LLM API requests per `"user", "key", "model", "team", "end-user"` |
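As a quick sanity check, the metrics above can be scraped and filtered programmatically (a minimal sketch, assuming the proxy runs locally on port 4000):

```python
import requests

# fetch the Prometheus exposition text from the proxy
metrics_text = requests.get("http://localhost:4000/metrics").text

# print only the litellm_* series
for line in metrics_text.splitlines():
    if line.startswith("litellm_"):
        print(line)
```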
|
||||
|
||||
## Monitor System Health
|
||||
|
||||
|
@ -69,3 +71,4 @@ litellm_settings:
|
|||
|----------------------|--------------------------------------|
|
||||
| `litellm_redis_latency` | histogram latency for redis calls |
|
||||
| `litellm_redis_fails` | Number of failed redis calls |
|
||||
| `litellm_self_latency` | Histogram latency for successful litellm api call |
|
|
@ -348,6 +348,29 @@ query_result = embeddings.embed_query(text)
|
|||
|
||||
print(f"TITAN EMBEDDINGS")
|
||||
print(query_result[:5])
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="litellm" label="LiteLLM SDK">
|
||||
|
||||
This is **not recommended**. There is duplicate logic as the proxy also uses the sdk, which might lead to unexpected errors.
|
||||
|
||||
```python
|
||||
from litellm import completion
|
||||
|
||||
response = completion(
|
||||
model="openai/gpt-3.5-turbo",
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "this is a test request, write a short poem"
|
||||
}
|
||||
],
|
||||
api_key="anything",
|
||||
base_url="http://0.0.0.0:4000"
|
||||
)
|
||||
|
||||
print(response)
|
||||
|
||||
```
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
|
|
@ -136,6 +136,21 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \
|
|||
'
|
||||
```
|
||||
|
||||
### Test it!
|
||||
|
||||
|
||||
```bash
|
||||
curl --location 'http://0.0.0.0:4000/chat/completions' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--data-raw '{
|
||||
"model": "zephyr-beta", # 👈 MODEL NAME to fallback from
|
||||
"messages": [
|
||||
{"role": "user", "content": "what color is red"}
|
||||
],
|
||||
"mock_testing_fallbacks": true
|
||||
}'
|
||||
```
|
||||
|
||||
## Advanced - Context Window Fallbacks
|
||||
|
||||
**Before a call is made**, check if the call is within the model's context window with **`enable_pre_call_checks: true`**.
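A minimal Router sketch with this flag enabled (illustrative only; the deployment entry is an assumption):

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo"},
        }
    ],
    enable_pre_call_checks=True,  # skip deployments whose context window is too small for the prompt
)
```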
|
||||
|
|
|
@ -121,6 +121,9 @@ from langchain.prompts.chat import (
|
|||
SystemMessagePromptTemplate,
|
||||
)
|
||||
from langchain.schema import HumanMessage, SystemMessage
|
||||
import os
|
||||
|
||||
os.environ["OPENAI_API_KEY"] = "anything"
|
||||
|
||||
chat = ChatOpenAI(
|
||||
openai_api_base="http://0.0.0.0:4000",
|
||||
|
|
|
@ -95,7 +95,7 @@ print(response)
|
|||
- `router.image_generation()` - completion calls in OpenAI `/v1/images/generations` endpoint format
|
||||
- `router.aimage_generation()` - async image generation calls
|
||||
|
||||
### Advanced - Routing Strategies
|
||||
## Advanced - Routing Strategies
|
||||
#### Routing Strategies - Weighted Pick, Rate Limit Aware, Least Busy, Latency Based
|
||||
|
||||
Router provides 4 strategies for routing your calls across multiple deployments:
|
||||
|
@ -279,7 +279,7 @@ router_settings:
|
|||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="simple-shuffle" label="(Default) Weighted Pick">
|
||||
<TabItem value="simple-shuffle" label="(Default) Weighted Pick (Async)">
|
||||
|
||||
**Default** Picks a deployment based on the provided **Requests per minute (rpm) or Tokens per minute (tpm)**
|
||||
|
||||
|
@ -443,6 +443,35 @@ asyncio.run(router_acompletion())
|
|||
|
||||
## Basic Reliability
|
||||
|
||||
### Max Parallel Requests (ASYNC)
|
||||
|
||||
Used in a semaphore for async requests on the router. Limits the max concurrent calls made to a deployment. Useful in high-traffic scenarios.
|
||||
|
||||
If tpm/rpm is set, and no max parallel request limit is given, we use the RPM or calculated RPM (tpm/1000/6) as the max parallel request limit. For example, with tpm=60,000 and no rpm or explicit limit set, the deployment would allow 60,000/1000/6 = 10 concurrent requests.
|
||||
|
||||
|
||||
```python
|
||||
from litellm import Router
|
||||
|
||||
model_list = [{
|
||||
"model_name": "gpt-4",
|
||||
"litellm_params": {
|
||||
"model": "azure/gpt-4",
|
||||
...
|
||||
"max_parallel_requests": 10 # 👈 SET PER DEPLOYMENT
|
||||
}
|
||||
}]
|
||||
|
||||
### OR ###
|
||||
|
||||
router = Router(model_list=model_list, default_max_parallel_requests=20) # 👈 SET DEFAULT MAX PARALLEL REQUESTS
|
||||
|
||||
|
||||
# deployment max parallel requests > default max parallel requests
|
||||
```
|
||||
|
||||
[**See Code**](https://github.com/BerriAI/litellm/blob/a978f2d8813c04dad34802cb95e0a0e35a3324bc/litellm/utils.py#L5605)
|
||||
|
||||
### Timeouts
|
||||
|
||||
The timeout set in router is for the entire length of the call, and is passed down to the completion() call level as well.
|
||||
|
|
|
@ -5,6 +5,9 @@ LiteLLM allows you to specify the following:
|
|||
* API Base
|
||||
* API Version
|
||||
* API Type
|
||||
* Project
|
||||
* Location
|
||||
* Token
|
||||
|
||||
Useful Helper functions:
|
||||
* [`check_valid_key()`](#check_valid_key)
|
||||
|
@ -43,6 +46,24 @@ os.environ['AZURE_API_TYPE'] = "azure" # [OPTIONAL]
|
|||
os.environ['OPENAI_API_BASE'] = "https://openai-gpt-4-test2-v-12.openai.azure.com/"
|
||||
```
|
||||
|
||||
### Setting Project, Location, Token
|
||||
|
||||
For cloud providers:
|
||||
- Azure
|
||||
- Bedrock
|
||||
- GCP
|
||||
- Watson AI
|
||||
|
||||
you might need to set additional parameters. LiteLLM provides a common set of params that we map across all providers.
|
||||
|
||||
| | LiteLLM param | Watson | Vertex AI | Azure | Bedrock |
|
||||
|------|--------------|--------------|--------------|--------------|--------------|
|
||||
| Project | project | watsonx_project | vertex_project | n/a | n/a |
|
||||
| Region | region_name | watsonx_region_name | vertex_location | n/a | aws_region_name |
|
||||
| Token | token | watsonx_token or token | n/a | azure_ad_token | n/a |
|
||||
|
||||
If you want, you can call them by their provider-specific params as well.
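For example, a sketch of calling a Vertex AI model with the common params (values are placeholders; per the table above they map to `vertex_project` / `vertex_location`):

```python
from litellm import completion

response = completion(
    model="vertex_ai/gemini-pro",   # placeholder model name
    messages=[{"role": "user", "content": "hello"}],
    project="my-gcp-project",       # common param -> vertex_project
    region_name="us-central1",      # common param -> vertex_location
)
print(response)
```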
|
||||
|
||||
## litellm variables
|
||||
|
||||
### litellm.api_key
|
||||
|
|
|
@ -105,6 +105,12 @@ const config = {
|
|||
label: 'Enterprise',
|
||||
to: "docs/enterprise"
|
||||
},
|
||||
{
|
||||
sidebarId: 'tutorialSidebar',
|
||||
position: 'left',
|
||||
label: '🚀 Hosted',
|
||||
to: "docs/hosted"
|
||||
},
|
||||
{
|
||||
href: 'https://github.com/BerriAI/litellm',
|
||||
label: 'GitHub',
|
||||
|
|
BIN
docs/my-website/img/litellm_hosted_ui_add_models.png
Normal file
BIN
docs/my-website/img/litellm_hosted_ui_add_models.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 398 KiB |
BIN
docs/my-website/img/litellm_hosted_ui_create_key.png
Normal file
BIN
docs/my-website/img/litellm_hosted_ui_create_key.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 496 KiB |
BIN
docs/my-website/img/litellm_hosted_ui_router.png
Normal file
BIN
docs/my-website/img/litellm_hosted_ui_router.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 348 KiB |
BIN
docs/my-website/img/litellm_hosted_usage_dashboard.png
Normal file
BIN
docs/my-website/img/litellm_hosted_usage_dashboard.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 460 KiB |
|
@ -43,6 +43,12 @@ const sidebars = {
|
|||
"proxy/user_keys",
|
||||
"proxy/enterprise",
|
||||
"proxy/virtual_keys",
|
||||
"proxy/alerting",
|
||||
{
|
||||
type: "category",
|
||||
label: "Logging",
|
||||
items: ["proxy/logging", "proxy/streaming_logging"],
|
||||
},
|
||||
"proxy/team_based_routing",
|
||||
"proxy/ui",
|
||||
"proxy/cost_tracking",
|
||||
|
@ -58,12 +64,7 @@ const sidebars = {
|
|||
"proxy/pii_masking",
|
||||
"proxy/prompt_injection",
|
||||
"proxy/caching",
|
||||
{
|
||||
type: "category",
|
||||
label: "Logging, Alerting",
|
||||
items: ["proxy/logging", "proxy/alerting", "proxy/streaming_logging"],
|
||||
},
|
||||
"proxy/grafana_metrics",
|
||||
"proxy/prometheus",
|
||||
"proxy/call_hooks",
|
||||
"proxy/rules",
|
||||
"proxy/cli",
|
||||
|
@ -148,6 +149,7 @@ const sidebars = {
|
|||
"providers/openrouter",
|
||||
"providers/custom_openai_proxy",
|
||||
"providers/petals",
|
||||
"providers/watsonx",
|
||||
],
|
||||
},
|
||||
"proxy/custom_pricing",
|
||||
|
@ -173,6 +175,7 @@ const sidebars = {
|
|||
"observability/langsmith_integration",
|
||||
"observability/slack_integration",
|
||||
"observability/traceloop_integration",
|
||||
"observability/athina_integration",
|
||||
"observability/lunary_integration",
|
||||
"observability/athina_integration",
|
||||
"observability/helicone_integration",
|
||||
|
|
|
@ -16,7 +16,7 @@ However, we also expose 3 public helper functions to calculate token usage acros
|
|||
```python
|
||||
from litellm import token_counter
|
||||
|
||||
messages = [{"user": "role", "content": "Hey, how's it going"}]
|
||||
messages = [{"role": "user", "content": "Hey, how's it going"}]
|
||||
print(token_counter(model="gpt-3.5-turbo", messages=messages))
|
||||
```
|
||||
|
||||
|
|
8
litellm-js/spend-logs/package-lock.json
generated
8
litellm-js/spend-logs/package-lock.json
generated
|
@ -6,7 +6,7 @@
|
|||
"": {
|
||||
"dependencies": {
|
||||
"@hono/node-server": "^1.9.0",
|
||||
"hono": "^4.1.5"
|
||||
"hono": "^4.2.7"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/node": "^20.11.17",
|
||||
|
@ -463,9 +463,9 @@
|
|||
}
|
||||
},
|
||||
"node_modules/hono": {
|
||||
"version": "4.1.5",
|
||||
"resolved": "https://registry.npmjs.org/hono/-/hono-4.1.5.tgz",
|
||||
"integrity": "sha512-3ChJiIoeCxvkt6vnkxJagplrt1YZg3NyNob7ssVeK2PUqEINp4q1F94HzFnvY9QE8asVmbW5kkTDlyWylfg2vg==",
|
||||
"version": "4.2.7",
|
||||
"resolved": "https://registry.npmjs.org/hono/-/hono-4.2.7.tgz",
|
||||
"integrity": "sha512-k1xHi86tJnRIVvqhFMBDGFKJ8r5O+bEsT4P59ZK59r0F300Xd910/r237inVfuT/VmE86RQQffX4OYNda6dLXw==",
|
||||
"engines": {
|
||||
"node": ">=16.0.0"
|
||||
}
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
},
|
||||
"dependencies": {
|
||||
"@hono/node-server": "^1.9.0",
|
||||
"hono": "^4.1.5"
|
||||
"hono": "^4.2.7"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/node": "^20.11.17",
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
import threading, requests, os
|
||||
from typing import Callable, List, Optional, Dict, Union, Any, Literal
|
||||
from litellm.caching import Cache
|
||||
from litellm._logging import set_verbose, _turn_on_debug, verbose_logger
|
||||
from litellm._logging import set_verbose, _turn_on_debug, verbose_logger, json_logs
|
||||
from litellm.proxy._types import (
|
||||
KeyManagementSystem,
|
||||
KeyManagementSettings,
|
||||
|
@ -16,11 +16,24 @@ dotenv.load_dotenv()
|
|||
if set_verbose == True:
|
||||
_turn_on_debug()
|
||||
#############################################
|
||||
### Callbacks /Logging / Success / Failure Handlers ###
|
||||
input_callback: List[Union[str, Callable]] = []
|
||||
success_callback: List[Union[str, Callable]] = []
|
||||
failure_callback: List[Union[str, Callable]] = []
|
||||
service_callback: List[Union[str, Callable]] = []
|
||||
callbacks: List[Callable] = []
|
||||
_langfuse_default_tags: Optional[
|
||||
List[
|
||||
Literal[
|
||||
"user_api_key_alias",
|
||||
"user_api_key_user_id",
|
||||
"user_api_key_user_email",
|
||||
"user_api_key_team_alias",
|
||||
"semantic-similarity",
|
||||
"proxy_base_url",
|
||||
]
|
||||
]
|
||||
] = None
|
||||
_async_input_callback: List[Callable] = (
|
||||
[]
|
||||
) # internal variable - async custom callbacks are routed here.
|
||||
|
@ -32,6 +45,9 @@ _async_failure_callback: List[Callable] = (
|
|||
) # internal variable - async custom callbacks are routed here.
|
||||
pre_call_rules: List[Callable] = []
|
||||
post_call_rules: List[Callable] = []
|
||||
turn_off_message_logging: Optional[bool] = False
|
||||
## end of callbacks #############
|
||||
|
||||
email: Optional[str] = (
|
||||
None # Not used anymore, will be removed in next MAJOR release - https://github.com/BerriAI/litellm/discussions/648
|
||||
)
|
||||
|
@ -43,6 +59,7 @@ max_tokens = 256 # OpenAI Defaults
|
|||
drop_params = False
|
||||
modify_params = False
|
||||
retry = True
|
||||
### AUTH ###
|
||||
api_key: Optional[str] = None
|
||||
openai_key: Optional[str] = None
|
||||
azure_key: Optional[str] = None
|
||||
|
@ -52,6 +69,7 @@ cohere_key: Optional[str] = None
|
|||
clarifai_key: Optional[str] = None
|
||||
maritalk_key: Optional[str] = None
|
||||
ai21_key: Optional[str] = None
|
||||
ollama_key: Optional[str] = None
|
||||
openrouter_key: Optional[str] = None
|
||||
huggingface_key: Optional[str] = None
|
||||
vertex_project: Optional[str] = None
|
||||
|
@ -61,7 +79,12 @@ cloudflare_api_key: Optional[str] = None
|
|||
baseten_key: Optional[str] = None
|
||||
aleph_alpha_key: Optional[str] = None
|
||||
nlp_cloud_key: Optional[str] = None
|
||||
common_cloud_provider_auth_params: dict = {
|
||||
"params": ["project", "region_name", "token"],
|
||||
"providers": ["vertex_ai", "bedrock", "watsonx", "azure"],
|
||||
}
|
||||
use_client: bool = False
|
||||
ssl_verify: bool = True
|
||||
disable_streaming_logging: bool = False
|
||||
### GUARDRAILS ###
|
||||
llamaguard_model_name: Optional[str] = None
|
||||
|
@ -283,6 +306,7 @@ aleph_alpha_models: List = []
|
|||
bedrock_models: List = []
|
||||
deepinfra_models: List = []
|
||||
perplexity_models: List = []
|
||||
watsonx_models: List = []
|
||||
for key, value in model_cost.items():
|
||||
if value.get("litellm_provider") == "openai":
|
||||
open_ai_chat_completion_models.append(key)
|
||||
|
@ -327,6 +351,8 @@ for key, value in model_cost.items():
|
|||
deepinfra_models.append(key)
|
||||
elif value.get("litellm_provider") == "perplexity":
|
||||
perplexity_models.append(key)
|
||||
elif value.get("litellm_provider") == "watsonx":
|
||||
watsonx_models.append(key)
|
||||
|
||||
# known openai compatible endpoints - we'll eventually move this list to the model_prices_and_context_window.json dictionary
|
||||
openai_compatible_endpoints: List = [
|
||||
|
@ -530,6 +556,7 @@ model_list = (
|
|||
+ perplexity_models
|
||||
+ maritalk_models
|
||||
+ vertex_language_models
|
||||
+ watsonx_models
|
||||
)
|
||||
|
||||
provider_list: List = [
|
||||
|
@ -569,6 +596,7 @@ provider_list: List = [
|
|||
"cloudflare",
|
||||
"xinference",
|
||||
"fireworks_ai",
|
||||
"watsonx",
|
||||
"custom", # custom apis
|
||||
]
|
||||
|
||||
|
@ -590,6 +618,7 @@ models_by_provider: dict = {
|
|||
"deepinfra": deepinfra_models,
|
||||
"perplexity": perplexity_models,
|
||||
"maritalk": maritalk_models,
|
||||
"watsonx": watsonx_models,
|
||||
}
|
||||
|
||||
# mapping for those models which have larger equivalents
|
||||
|
@ -701,9 +730,11 @@ from .llms.bedrock import (
|
|||
AmazonLlamaConfig,
|
||||
AmazonStabilityConfig,
|
||||
AmazonMistralConfig,
|
||||
AmazonBedrockGlobalConfig,
|
||||
)
|
||||
from .llms.openai import OpenAIConfig, OpenAITextCompletionConfig
|
||||
from .llms.azure import AzureOpenAIConfig, AzureOpenAIError
|
||||
from .llms.watsonx import IBMWatsonXAIConfig
|
||||
from .main import * # type: ignore
|
||||
from .integrations import *
|
||||
from .exceptions import (
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
import logging
|
||||
|
||||
set_verbose = False
|
||||
|
||||
json_logs = False
|
||||
# Create a handler for the logger (you may need to adapt this based on your needs)
|
||||
handler = logging.StreamHandler()
|
||||
handler.setLevel(logging.DEBUG)
|
||||
|
|
|
@ -32,6 +32,25 @@ def _get_redis_kwargs():
|
|||
return available_args
|
||||
|
||||
|
||||
def _get_redis_url_kwargs(client=None):
|
||||
if client is None:
|
||||
client = redis.Redis.from_url
|
||||
arg_spec = inspect.getfullargspec(client)
|
||||
|
||||
# Only allow primitive arguments
|
||||
exclude_args = {
|
||||
"self",
|
||||
"connection_pool",
|
||||
"retry",
|
||||
}
|
||||
|
||||
include_args = ["url"]
|
||||
|
||||
available_args = [x for x in arg_spec.args if x not in exclude_args] + include_args
|
||||
|
||||
return available_args
|
||||
|
||||
|
||||
def _get_redis_env_kwarg_mapping():
|
||||
PREFIX = "REDIS_"
|
||||
|
||||
|
@ -91,27 +110,39 @@ def _get_redis_client_logic(**env_overrides):
|
|||
redis_kwargs.pop("password", None)
|
||||
elif "host" not in redis_kwargs or redis_kwargs["host"] is None:
|
||||
raise ValueError("Either 'host' or 'url' must be specified for redis.")
|
||||
litellm.print_verbose(f"redis_kwargs: {redis_kwargs}")
|
||||
# litellm.print_verbose(f"redis_kwargs: {redis_kwargs}")
|
||||
return redis_kwargs
|
||||
|
||||
|
||||
def get_redis_client(**env_overrides):
|
||||
redis_kwargs = _get_redis_client_logic(**env_overrides)
|
||||
if "url" in redis_kwargs and redis_kwargs["url"] is not None:
|
||||
redis_kwargs.pop(
|
||||
"connection_pool", None
|
||||
) # redis.from_url doesn't support setting your own connection pool
|
||||
return redis.Redis.from_url(**redis_kwargs)
|
||||
args = _get_redis_url_kwargs()
|
||||
url_kwargs = {}
|
||||
for arg in redis_kwargs:
|
||||
if arg in args:
|
||||
url_kwargs[arg] = redis_kwargs[arg]
|
||||
|
||||
return redis.Redis.from_url(**url_kwargs)
|
||||
return redis.Redis(**redis_kwargs)
|
||||
|
||||
|
||||
def get_redis_async_client(**env_overrides):
|
||||
redis_kwargs = _get_redis_client_logic(**env_overrides)
|
||||
if "url" in redis_kwargs and redis_kwargs["url"] is not None:
|
||||
redis_kwargs.pop(
|
||||
"connection_pool", None
|
||||
) # redis.from_url doesn't support setting your own connection pool
|
||||
return async_redis.Redis.from_url(**redis_kwargs)
|
||||
args = _get_redis_url_kwargs(client=async_redis.Redis.from_url)
|
||||
url_kwargs = {}
|
||||
for arg in redis_kwargs:
|
||||
if arg in args:
|
||||
url_kwargs[arg] = redis_kwargs[arg]
|
||||
else:
|
||||
litellm.print_verbose(
|
||||
"REDIS: ignoring argument: {}. Not an allowed async_redis.Redis.from_url arg.".format(
|
||||
arg
|
||||
)
|
||||
)
|
||||
return async_redis.Redis.from_url(**url_kwargs)
|
||||
|
||||
return async_redis.Redis(
|
||||
socket_timeout=5,
|
||||
**redis_kwargs,
|
||||
|
@ -124,4 +155,9 @@ def get_redis_connection_pool(**env_overrides):
|
|||
return async_redis.BlockingConnectionPool.from_url(
|
||||
timeout=5, url=redis_kwargs["url"]
|
||||
)
|
||||
connection_class = async_redis.Connection
|
||||
if "ssl" in redis_kwargs and redis_kwargs["ssl"] is not None:
|
||||
connection_class = async_redis.SSLConnection
|
||||
redis_kwargs.pop("ssl", None)
|
||||
redis_kwargs["connection_class"] = connection_class
|
||||
return async_redis.BlockingConnectionPool(timeout=5, **redis_kwargs)
|
||||
|
|
|
@ -1,9 +1,13 @@
|
|||
import litellm
|
||||
import litellm, traceback
|
||||
from litellm.proxy._types import UserAPIKeyAuth
|
||||
from .types.services import ServiceTypes, ServiceLoggerPayload
|
||||
from .integrations.prometheus_services import PrometheusServicesLogger
|
||||
from .integrations.custom_logger import CustomLogger
|
||||
from datetime import timedelta
|
||||
from typing import Union
|
||||
|
||||
|
||||
class ServiceLogging:
|
||||
class ServiceLogging(CustomLogger):
|
||||
"""
|
||||
Separate class used for monitoring health of litellm-adjacent services (redis/postgres).
|
||||
"""
|
||||
|
@ -14,11 +18,12 @@ class ServiceLogging:
|
|||
self.mock_testing_async_success_hook = 0
|
||||
self.mock_testing_sync_failure_hook = 0
|
||||
self.mock_testing_async_failure_hook = 0
|
||||
|
||||
if "prometheus_system" in litellm.service_callback:
|
||||
self.prometheusServicesLogger = PrometheusServicesLogger()
|
||||
|
||||
def service_success_hook(self, service: ServiceTypes, duration: float):
|
||||
def service_success_hook(
|
||||
self, service: ServiceTypes, duration: float, call_type: str
|
||||
):
|
||||
"""
|
||||
[TODO] Not implemented for sync calls yet. V0 is focused on async monitoring (used by proxy).
|
||||
"""
|
||||
|
@ -26,7 +31,7 @@ class ServiceLogging:
|
|||
self.mock_testing_sync_success_hook += 1
|
||||
|
||||
def service_failure_hook(
|
||||
self, service: ServiceTypes, duration: float, error: Exception
|
||||
self, service: ServiceTypes, duration: float, error: Exception, call_type: str
|
||||
):
|
||||
"""
|
||||
[TODO] Not implemented for sync calls yet. V0 is focused on async monitoring (used by proxy).
|
||||
|
@ -34,7 +39,9 @@ class ServiceLogging:
|
|||
if self.mock_testing:
|
||||
self.mock_testing_sync_failure_hook += 1
|
||||
|
||||
async def async_service_success_hook(self, service: ServiceTypes, duration: float):
|
||||
async def async_service_success_hook(
|
||||
self, service: ServiceTypes, duration: float, call_type: str
|
||||
):
|
||||
"""
|
||||
- For counting if the redis, postgres call is successful
|
||||
"""
|
||||
|
@ -42,7 +49,11 @@ class ServiceLogging:
|
|||
self.mock_testing_async_success_hook += 1
|
||||
|
||||
payload = ServiceLoggerPayload(
|
||||
is_error=False, error=None, service=service, duration=duration
|
||||
is_error=False,
|
||||
error=None,
|
||||
service=service,
|
||||
duration=duration,
|
||||
call_type=call_type,
|
||||
)
|
||||
for callback in litellm.service_callback:
|
||||
if callback == "prometheus_system":
|
||||
|
@ -51,7 +62,11 @@ class ServiceLogging:
|
|||
)
|
||||
|
||||
async def async_service_failure_hook(
|
||||
self, service: ServiceTypes, duration: float, error: Exception
|
||||
self,
|
||||
service: ServiceTypes,
|
||||
duration: float,
|
||||
error: Union[str, Exception],
|
||||
call_type: str,
|
||||
):
|
||||
"""
|
||||
- For counting if the redis, postgres call is unsuccessful
|
||||
|
@ -59,8 +74,18 @@ class ServiceLogging:
|
|||
if self.mock_testing:
|
||||
self.mock_testing_async_failure_hook += 1
|
||||
|
||||
error_message = ""
|
||||
if isinstance(error, Exception):
|
||||
error_message = str(error)
|
||||
elif isinstance(error, str):
|
||||
error_message = error
|
||||
|
||||
payload = ServiceLoggerPayload(
|
||||
is_error=True, error=str(error), service=service, duration=duration
|
||||
is_error=True,
|
||||
error=error_message,
|
||||
service=service,
|
||||
duration=duration,
|
||||
call_type=call_type,
|
||||
)
|
||||
for callback in litellm.service_callback:
|
||||
if callback == "prometheus_system":
|
||||
|
@ -69,3 +94,37 @@ class ServiceLogging:
|
|||
await self.prometheusServicesLogger.async_service_failure_hook(
|
||||
payload=payload
|
||||
)
|
||||
|
||||
async def async_post_call_failure_hook(
|
||||
self, original_exception: Exception, user_api_key_dict: UserAPIKeyAuth
|
||||
):
|
||||
"""
|
||||
Hook to track failed litellm-service calls
|
||||
"""
|
||||
return await super().async_post_call_failure_hook(
|
||||
original_exception, user_api_key_dict
|
||||
)
|
||||
|
||||
async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
|
||||
"""
|
||||
Hook to track latency for litellm proxy llm api calls
|
||||
"""
|
||||
try:
|
||||
_duration = end_time - start_time
|
||||
if isinstance(_duration, timedelta):
|
||||
_duration = _duration.total_seconds()
|
||||
elif isinstance(_duration, float):
|
||||
pass
|
||||
else:
|
||||
raise Exception(
|
||||
"Duration={} is not a float or timedelta object. type={}".format(
|
||||
_duration, type(_duration)
|
||||
)
|
||||
) # invalid _duration value
|
||||
await self.async_service_success_hook(
|
||||
service=ServiceTypes.LITELLM,
|
||||
duration=_duration,
|
||||
call_type=kwargs["call_type"],
|
||||
)
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
|
|
@ -13,7 +13,6 @@ import json, traceback, ast, hashlib
|
|||
from typing import Optional, Literal, List, Union, Any, BinaryIO
|
||||
from openai._models import BaseModel as OpenAIObject
|
||||
from litellm._logging import verbose_logger
|
||||
from litellm._service_logger import ServiceLogging
|
||||
from litellm.types.services import ServiceLoggerPayload, ServiceTypes
|
||||
import traceback
|
||||
|
||||
|
@ -90,6 +89,13 @@ class InMemoryCache(BaseCache):
|
|||
return_val.append(val)
|
||||
return return_val
|
||||
|
||||
def increment_cache(self, key, value: int, **kwargs) -> int:
|
||||
# get the value
|
||||
init_value = self.get_cache(key=key) or 0
|
||||
value = init_value + value
|
||||
self.set_cache(key, value, **kwargs)
|
||||
return value
|
||||
|
||||
async def async_get_cache(self, key, **kwargs):
|
||||
return self.get_cache(key=key, **kwargs)
|
||||
|
||||
|
@ -132,6 +138,7 @@ class RedisCache(BaseCache):
|
|||
**kwargs,
|
||||
):
|
||||
from ._redis import get_redis_client, get_redis_connection_pool
|
||||
from litellm._service_logger import ServiceLogging
|
||||
import redis
|
||||
|
||||
redis_kwargs = {}
|
||||
|
@ -142,18 +149,19 @@ class RedisCache(BaseCache):
|
|||
if password is not None:
|
||||
redis_kwargs["password"] = password
|
||||
|
||||
### HEALTH MONITORING OBJECT ###
|
||||
if kwargs.get("service_logger_obj", None) is not None and isinstance(
|
||||
kwargs["service_logger_obj"], ServiceLogging
|
||||
):
|
||||
self.service_logger_obj = kwargs.pop("service_logger_obj")
|
||||
else:
|
||||
self.service_logger_obj = ServiceLogging()
|
||||
|
||||
redis_kwargs.update(kwargs)
|
||||
self.redis_client = get_redis_client(**redis_kwargs)
|
||||
self.redis_kwargs = redis_kwargs
|
||||
self.async_redis_conn_pool = get_redis_connection_pool(**redis_kwargs)
|
||||
|
||||
if "url" in redis_kwargs and redis_kwargs["url"] is not None:
|
||||
parsed_kwargs = redis.connection.parse_url(redis_kwargs["url"])
|
||||
redis_kwargs.update(parsed_kwargs)
|
||||
self.redis_kwargs.update(parsed_kwargs)
|
||||
# pop url
|
||||
self.redis_kwargs.pop("url")
|
||||
|
||||
# redis namespaces
|
||||
self.namespace = namespace
|
||||
# for high traffic, we store the redis results in memory and then batch write to redis
|
||||
|
@ -165,8 +173,15 @@ class RedisCache(BaseCache):
|
|||
except Exception as e:
|
||||
pass
|
||||
|
||||
### HEALTH MONITORING OBJECT ###
|
||||
self.service_logger_obj = ServiceLogging()
|
||||
### ASYNC HEALTH PING ###
|
||||
try:
|
||||
# asyncio.get_running_loop().create_task(self.ping())
|
||||
result = asyncio.get_running_loop().create_task(self.ping())
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
### SYNC HEALTH PING ###
|
||||
self.redis_client.ping()
|
||||
|
||||
def init_async_client(self):
|
||||
from ._redis import get_redis_async_client
|
||||
|
@ -198,6 +213,42 @@ class RedisCache(BaseCache):
|
|||
f"LiteLLM Caching: set() - Got exception from REDIS : {str(e)}"
|
||||
)
|
||||
|
||||
def increment_cache(self, key, value: int, **kwargs) -> int:
|
||||
_redis_client = self.redis_client
|
||||
start_time = time.time()
|
||||
try:
|
||||
result = _redis_client.incr(name=key, amount=value)
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="increment_cache",
|
||||
)
|
||||
)
|
||||
return result
|
||||
except Exception as e:
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="increment_cache",
|
||||
)
|
||||
)
|
||||
verbose_logger.error(
|
||||
"LiteLLM Redis Caching: increment_cache() - Got exception from REDIS %s, Writing value=%s",
|
||||
str(e),
|
||||
value,
|
||||
)
|
||||
traceback.print_exc()
|
||||
raise e
|
||||
|
||||
async def async_scan_iter(self, pattern: str, count: int = 100) -> list:
|
||||
start_time = time.time()
|
||||
try:
|
||||
|
@ -216,7 +267,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_scan_iter",
|
||||
)
|
||||
) # DO NOT SLOW DOWN CALL B/C OF THIS
|
||||
return keys
|
||||
|
@ -227,7 +280,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_scan_iter",
|
||||
)
|
||||
)
|
||||
raise e
|
||||
|
@ -267,7 +323,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_set_cache",
|
||||
)
|
||||
)
|
||||
except Exception as e:
|
||||
|
@ -275,7 +333,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_set_cache",
|
||||
)
|
||||
)
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
|
@ -292,6 +353,10 @@ class RedisCache(BaseCache):
|
|||
"""
|
||||
_redis_client = self.init_async_client()
|
||||
start_time = time.time()
|
||||
|
||||
print_verbose(
|
||||
f"Set Async Redis Cache: key list: {cache_list}\nttl={ttl}, redis_version={self.redis_version}"
|
||||
)
|
||||
try:
|
||||
async with _redis_client as redis_client:
|
||||
async with redis_client.pipeline(transaction=True) as pipe:
|
||||
|
@ -316,7 +381,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_set_cache_pipeline",
|
||||
)
|
||||
)
|
||||
return results
|
||||
|
@ -326,7 +393,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_set_cache_pipeline",
|
||||
)
|
||||
)
|
||||
|
||||
|
@ -359,6 +429,7 @@ class RedisCache(BaseCache):
|
|||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_increment",
|
||||
)
|
||||
)
|
||||
return result
|
||||
|
@ -368,7 +439,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_increment",
|
||||
)
|
||||
)
|
||||
verbose_logger.error(
|
||||
|
@ -459,7 +533,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_get_cache",
|
||||
)
|
||||
)
|
||||
return response
|
||||
|
@ -469,7 +545,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_get_cache",
|
||||
)
|
||||
)
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
|
@ -497,7 +576,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_batch_get_cache",
|
||||
)
|
||||
)
|
||||
|
||||
|
@ -519,21 +600,81 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_batch_get_cache",
|
||||
)
|
||||
)
|
||||
print_verbose(f"Error occurred in pipeline read - {str(e)}")
|
||||
return key_value_dict
|
||||
|
||||
async def ping(self):
|
||||
def sync_ping(self) -> bool:
|
||||
"""
|
||||
Tests if the sync redis client is correctly setup.
|
||||
"""
|
||||
print_verbose(f"Pinging Sync Redis Cache")
|
||||
start_time = time.time()
|
||||
try:
|
||||
response = self.redis_client.ping()
|
||||
print_verbose(f"Redis Cache PING: {response}")
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
self.service_logger_obj.service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="sync_ping",
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
self.service_logger_obj.service_failure_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="sync_ping",
|
||||
)
|
||||
print_verbose(
|
||||
f"LiteLLM Redis Cache PING: - Got exception from REDIS : {str(e)}"
|
||||
)
|
||||
traceback.print_exc()
|
||||
raise e
|
||||
|
||||
async def ping(self) -> bool:
|
||||
_redis_client = self.init_async_client()
|
||||
start_time = time.time()
|
||||
async with _redis_client as redis_client:
|
||||
print_verbose(f"Pinging Async Redis Cache")
|
||||
try:
|
||||
response = await redis_client.ping()
|
||||
print_verbose(f"Redis Cache PING: {response}")
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_ping",
|
||||
)
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_ping",
|
||||
)
|
||||
)
|
||||
print_verbose(
|
||||
f"LiteLLM Redis Cache PING: - Got exception from REDIS : {str(e)}"
|
||||
)
|
||||
|
@ -1064,6 +1205,30 @@ class DualCache(BaseCache):
|
|||
except Exception as e:
|
||||
print_verbose(e)
|
||||
|
||||
def increment_cache(
|
||||
self, key, value: int, local_only: bool = False, **kwargs
|
||||
) -> int:
|
||||
"""
|
||||
Key - the key in cache
|
||||
|
||||
Value - int - the value you want to increment by
|
||||
|
||||
Returns - int - the incremented value
|
||||
"""
|
||||
try:
|
||||
result: int = value
|
||||
if self.in_memory_cache is not None:
|
||||
result = self.in_memory_cache.increment_cache(key, value, **kwargs)
|
||||
|
||||
if self.redis_cache is not None and local_only == False:
|
||||
result = self.redis_cache.increment_cache(key, value, **kwargs)
|
||||
|
||||
return result
|
||||
except Exception as e:
|
||||
print_verbose(f"LiteLLM Cache: Excepton async add_cache: {str(e)}")
|
||||
traceback.print_exc()
|
||||
raise e
|
||||
|
||||
def get_cache(self, key, local_only: bool = False, **kwargs):
|
||||
# Try to fetch from in-memory cache first
|
||||
try:
|
||||
|
@ -1116,7 +1281,7 @@ class DualCache(BaseCache):
|
|||
self.in_memory_cache.set_cache(key, redis_result[key], **kwargs)
|
||||
|
||||
for key, value in redis_result.items():
|
||||
result[sublist_keys.index(key)] = value
|
||||
result[keys.index(key)] = value
|
||||
|
||||
print_verbose(f"async batch get cache: cache result: {result}")
|
||||
return result
|
||||
|
@ -1166,10 +1331,8 @@ class DualCache(BaseCache):
|
|||
keys, **kwargs
|
||||
)
|
||||
|
||||
print_verbose(f"in_memory_result: {in_memory_result}")
|
||||
if in_memory_result is not None:
|
||||
result = in_memory_result
|
||||
|
||||
if None in result and self.redis_cache is not None and local_only == False:
|
||||
"""
|
||||
- for the none values in the result
|
||||
|
@ -1185,22 +1348,23 @@ class DualCache(BaseCache):
|
|||
|
||||
if redis_result is not None:
|
||||
# Update in-memory cache with the value from Redis
|
||||
for key in redis_result:
|
||||
await self.in_memory_cache.async_set_cache(
|
||||
key, redis_result[key], **kwargs
|
||||
)
|
||||
for key, value in redis_result.items():
|
||||
if value is not None:
|
||||
await self.in_memory_cache.async_set_cache(
|
||||
key, redis_result[key], **kwargs
|
||||
)
|
||||
for key, value in redis_result.items():
|
||||
index = keys.index(key)
|
||||
result[index] = value
|
||||
|
||||
sublist_dict = dict(zip(sublist_keys, redis_result))
|
||||
|
||||
for key, value in sublist_dict.items():
|
||||
result[sublist_keys.index(key)] = value
|
||||
|
||||
print_verbose(f"async batch get cache: cache result: {result}")
|
||||
return result
|
||||
except Exception as e:
|
||||
traceback.print_exc()
|
||||
|
||||
async def async_set_cache(self, key, value, local_only: bool = False, **kwargs):
|
||||
print_verbose(
|
||||
f"async set cache: cache key: {key}; local_only: {local_only}; value: {value}"
|
||||
)
|
||||
try:
|
||||
if self.in_memory_cache is not None:
|
||||
await self.in_memory_cache.async_set_cache(key, value, **kwargs)
|
||||
|
|
|
@ -82,14 +82,18 @@ class UnprocessableEntityError(UnprocessableEntityError): # type: ignore
|
|||
|
||||
class Timeout(APITimeoutError): # type: ignore
|
||||
def __init__(self, message, model, llm_provider):
|
||||
self.status_code = 408
|
||||
self.message = message
|
||||
self.model = model
|
||||
self.llm_provider = llm_provider
|
||||
request = httpx.Request(method="POST", url="https://api.openai.com/v1")
|
||||
super().__init__(
|
||||
request=request
|
||||
) # Call the base class constructor with the parameters it needs
|
||||
self.status_code = 408
|
||||
self.message = message
|
||||
self.model = model
|
||||
self.llm_provider = llm_provider
|
||||
|
||||
# custom function to convert to str
|
||||
def __str__(self):
|
||||
return str(self.message)
|
||||
|
||||
|
||||
class PermissionDeniedError(PermissionDeniedError): # type:ignore
|
||||
|
|
|
@ -6,7 +6,7 @@ import requests
|
|||
from litellm.proxy._types import UserAPIKeyAuth
|
||||
from litellm.caching import DualCache
|
||||
|
||||
from typing import Literal, Union
|
||||
from typing import Literal, Union, Optional
|
||||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
import traceback
|
||||
|
@ -46,6 +46,17 @@ class CustomLogger: # https://docs.litellm.ai/docs/observability/custom_callbac
|
|||
async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
|
||||
pass
|
||||
|
||||
#### PRE-CALL CHECKS - router/proxy only ####
|
||||
"""
|
||||
Allows usage-based-routing-v2 to run pre-call rpm checks within the picked deployment's semaphore (concurrency-safe tpm/rpm checks).
|
||||
"""
|
||||
|
||||
async def async_pre_call_check(self, deployment: dict) -> Optional[dict]:
|
||||
pass
|
||||
|
||||
def pre_call_check(self, deployment: dict) -> Optional[dict]:
|
||||
pass
|
||||
|
||||
#### CALL HOOKS - proxy only ####
|
||||
"""
|
||||
Control the modify incoming / outgoung data before calling the model
|
||||
|
|
51
litellm/integrations/greenscale.py
Normal file
51
litellm/integrations/greenscale.py
Normal file
|
@ -0,0 +1,51 @@
|
|||
import requests
|
||||
import json
|
||||
import traceback
|
||||
from datetime import datetime, timezone
|
||||
|
||||
class GreenscaleLogger:
|
||||
def __init__(self):
|
||||
import os
|
||||
self.greenscale_api_key = os.getenv("GREENSCALE_API_KEY")
|
||||
self.headers = {
|
||||
"api-key": self.greenscale_api_key,
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
self.greenscale_logging_url = os.getenv("GREENSCALE_ENDPOINT")
|
||||
|
||||
def log_event(self, kwargs, response_obj, start_time, end_time, print_verbose):
|
||||
try:
|
||||
response_json = response_obj.model_dump() if response_obj else {}
|
||||
data = {
|
||||
"modelId": kwargs.get("model"),
|
||||
"inputTokenCount": response_json.get("usage", {}).get("prompt_tokens"),
|
||||
"outputTokenCount": response_json.get("usage", {}).get("completion_tokens"),
|
||||
}
|
||||
data["timestamp"] = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
|
||||
|
||||
if type(end_time) == datetime and type(start_time) == datetime:
|
||||
data["invocationLatency"] = int((end_time - start_time).total_seconds() * 1000)
|
||||
|
||||
|
||||
# Add additional metadata keys to tags
|
||||
tags = []
|
||||
metadata = kwargs.get("litellm_params", {}).get("metadata", {})
|
||||
for key, value in metadata.items():
|
||||
if key.startswith("greenscale"):
|
||||
if key == "greenscale_project":
|
||||
data["project"] = value
|
||||
elif key == "greenscale_application":
|
||||
data["application"] = value
|
||||
else:
|
||||
tags.append({"key": key.replace("greenscale_", ""), "value": str(value)})
|
||||
|
||||
data["tags"] = tags
|
||||
|
||||
response = requests.post(self.greenscale_logging_url, headers=self.headers, data=json.dumps(data, default=str))
|
||||
if response.status_code != 200:
|
||||
print_verbose(f"Greenscale Logger Error - {response.text}, {response.status_code}")
|
||||
else:
|
||||
print_verbose(f"Greenscale Logger Succeeded - {response.text}")
|
||||
except Exception as e:
|
||||
print_verbose(f"Greenscale Logger Error - {e}, Stack trace: {traceback.format_exc()}")
|
||||
pass
|
|
@ -12,7 +12,9 @@ import litellm
|
|||
|
||||
class LangFuseLogger:
|
||||
# Class variables or attributes
|
||||
def __init__(self, langfuse_public_key=None, langfuse_secret=None):
|
||||
def __init__(
|
||||
self, langfuse_public_key=None, langfuse_secret=None, flush_interval=1
|
||||
):
|
||||
try:
|
||||
from langfuse import Langfuse
|
||||
except Exception as e:
|
||||
|
@ -31,9 +33,17 @@ class LangFuseLogger:
|
|||
host=self.langfuse_host,
|
||||
release=self.langfuse_release,
|
||||
debug=self.langfuse_debug,
|
||||
flush_interval=1, # flush interval in seconds
|
||||
flush_interval=flush_interval, # flush interval in seconds
|
||||
)
|
||||
|
||||
# set the current langfuse project id in the environ
|
||||
# this is used by Alerting to link to the correct project
|
||||
try:
|
||||
project_id = self.Langfuse.client.projects.get().data[0].id
|
||||
os.environ["LANGFUSE_PROJECT_ID"] = project_id
|
||||
except:
|
||||
project_id = None
|
||||
|
||||
if os.getenv("UPSTREAM_LANGFUSE_SECRET_KEY") is not None:
|
||||
self.upstream_langfuse_secret_key = os.getenv(
|
||||
"UPSTREAM_LANGFUSE_SECRET_KEY"
|
||||
|
@ -76,6 +86,7 @@ class LangFuseLogger:
|
|||
print_verbose(
|
||||
f"Langfuse Logging - Enters logging function for model {kwargs}"
|
||||
)
|
||||
|
||||
litellm_params = kwargs.get("litellm_params", {})
|
||||
metadata = (
|
||||
litellm_params.get("metadata", {}) or {}
|
||||
|
@ -133,6 +144,7 @@ class LangFuseLogger:
|
|||
self._log_langfuse_v2(
|
||||
user_id,
|
||||
metadata,
|
||||
litellm_params,
|
||||
output,
|
||||
start_time,
|
||||
end_time,
|
||||
|
@ -224,6 +236,7 @@ class LangFuseLogger:
|
|||
self,
|
||||
user_id,
|
||||
metadata,
|
||||
litellm_params,
|
||||
output,
|
||||
start_time,
|
||||
end_time,
|
||||
|
@ -252,15 +265,18 @@ class LangFuseLogger:
|
|||
tags = metadata_tags
|
||||
|
||||
trace_name = metadata.get("trace_name", None)
|
||||
if trace_name is None:
|
||||
trace_id = metadata.get("trace_id", None)
|
||||
existing_trace_id = metadata.get("existing_trace_id", None)
|
||||
if trace_name is None and existing_trace_id is None:
|
||||
# just log `litellm-{call_type}` as the trace name
|
||||
## DO NOT SET TRACE_NAME if trace-id set. this can lead to overwriting of past traces.
|
||||
trace_name = f"litellm-{kwargs.get('call_type', 'completion')}"
|
||||
|
||||
trace_params = {
|
||||
"name": trace_name,
|
||||
"input": input,
|
||||
"user_id": metadata.get("trace_user_id", user_id),
|
||||
"id": metadata.get("trace_id", None),
|
||||
"id": trace_id or existing_trace_id,
|
||||
"session_id": metadata.get("session_id", None),
|
||||
}
|
||||
|
||||
|
@ -278,13 +294,13 @@ class LangFuseLogger:
|
|||
clean_metadata = {}
|
||||
if isinstance(metadata, dict):
|
||||
for key, value in metadata.items():
|
||||
# generate langfuse tags
|
||||
if key in [
|
||||
"user_api_key",
|
||||
"user_api_key_user_id",
|
||||
"user_api_key_team_id",
|
||||
"semantic-similarity",
|
||||
]:
|
||||
|
||||
# generate langfuse tags - Default Tags sent to Langfuse from LiteLLM Proxy
|
||||
if (
|
||||
litellm._langfuse_default_tags is not None
|
||||
and isinstance(litellm._langfuse_default_tags, list)
|
||||
and key in litellm._langfuse_default_tags
|
||||
):
|
||||
tags.append(f"{key}:{value}")
|
||||
|
||||
# clean litellm metadata before logging
|
||||
|
@ -298,13 +314,53 @@ class LangFuseLogger:
|
|||
else:
|
||||
clean_metadata[key] = value
|
||||
|
||||
if (
|
||||
litellm._langfuse_default_tags is not None
|
||||
and isinstance(litellm._langfuse_default_tags, list)
|
||||
and "proxy_base_url" in litellm._langfuse_default_tags
|
||||
):
|
||||
proxy_base_url = os.environ.get("PROXY_BASE_URL", None)
|
||||
if proxy_base_url is not None:
|
||||
tags.append(f"proxy_base_url:{proxy_base_url}")
|
||||
|
||||
api_base = litellm_params.get("api_base", None)
|
||||
if api_base:
|
||||
clean_metadata["api_base"] = api_base
|
||||
|
||||
vertex_location = kwargs.get("vertex_location", None)
|
||||
if vertex_location:
|
||||
clean_metadata["vertex_location"] = vertex_location
|
||||
|
||||
aws_region_name = kwargs.get("aws_region_name", None)
|
||||
if aws_region_name:
|
||||
clean_metadata["aws_region_name"] = aws_region_name
|
||||
|
||||
if supports_tags:
|
||||
if "cache_hit" in kwargs:
|
||||
if kwargs["cache_hit"] is None:
|
||||
kwargs["cache_hit"] = False
|
||||
tags.append(f"cache_hit:{kwargs['cache_hit']}")
|
||||
clean_metadata["cache_hit"] = kwargs["cache_hit"]
|
||||
trace_params.update({"tags": tags})
|
||||
|
||||
proxy_server_request = litellm_params.get("proxy_server_request", None)
|
||||
if proxy_server_request:
|
||||
method = proxy_server_request.get("method", None)
|
||||
url = proxy_server_request.get("url", None)
|
||||
headers = proxy_server_request.get("headers", None)
|
||||
clean_headers = {}
|
||||
if headers:
|
||||
for key, value in headers.items():
|
||||
# these headers can leak our API keys and/or JWT tokens
|
||||
if key.lower() not in ["authorization", "cookie", "referer"]:
|
||||
clean_headers[key] = value
|
||||
|
||||
clean_metadata["request"] = {
|
||||
"method": method,
|
||||
"url": url,
|
||||
"headers": clean_headers,
|
||||
}
|
||||
|
||||
print_verbose(f"trace_params: {trace_params}")
|
||||
|
||||
trace = self.Langfuse.trace(**trace_params)
|
||||
|
@ -323,7 +379,11 @@ class LangFuseLogger:
|
|||
# just log `litellm-{call_type}` as the generation name
|
||||
generation_name = f"litellm-{kwargs.get('call_type', 'completion')}"
|
||||
|
||||
system_fingerprint = response_obj.get("system_fingerprint", None)
|
||||
if response_obj is not None and "system_fingerprint" in response_obj:
|
||||
system_fingerprint = response_obj.get("system_fingerprint", None)
|
||||
else:
|
||||
system_fingerprint = None
|
||||
|
||||
if system_fingerprint is not None:
|
||||
optional_params["system_fingerprint"] = system_fingerprint
|
||||
|
||||
|
|
|
@ -7,6 +7,19 @@ from datetime import datetime
|
|||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
import traceback
|
||||
import asyncio
|
||||
import types
|
||||
from pydantic import BaseModel
|
||||
|
||||
|
||||
def is_serializable(value):
|
||||
non_serializable_types = (
|
||||
types.CoroutineType,
|
||||
types.FunctionType,
|
||||
types.GeneratorType,
|
||||
BaseModel,
|
||||
)
|
||||
return not isinstance(value, non_serializable_types)
|
||||
|
||||
|
||||
class LangsmithLogger:
|
||||
|
@ -21,7 +34,9 @@ class LangsmithLogger:
|
|||
def log_event(self, kwargs, response_obj, start_time, end_time, print_verbose):
|
||||
# Method definition
|
||||
# inspired by Langsmith http api here: https://github.com/langchain-ai/langsmith-cookbook/blob/main/tracing-examples/rest/rest.ipynb
|
||||
metadata = kwargs.get('litellm_params', {}).get("metadata", {}) or {} # if metadata is None
|
||||
metadata = (
|
||||
kwargs.get("litellm_params", {}).get("metadata", {}) or {}
|
||||
) # if metadata is None
|
||||
|
||||
# set project name and run_name for langsmith logging
|
||||
# users can pass project_name and run name to litellm.completion()
|
||||
|
@@ -51,26 +66,42 @@ class LangsmithLogger:
        new_kwargs = {}
        for key in kwargs:
            value = kwargs[key]
            if key == "start_time" or key == "end_time":
            if key == "start_time" or key == "end_time" or value is None:
                pass
            elif type(value) == datetime.datetime:
                new_kwargs[key] = value.isoformat()
            elif type(value) != dict:
            elif type(value) != dict and is_serializable(value=value):
                new_kwargs[key] = value

        requests.post(
        if isinstance(response_obj, BaseModel):
            try:
                response_obj = response_obj.model_dump()
            except:
                response_obj = response_obj.dict()  # type: ignore

        print(f"response_obj: {response_obj}")

        data = {
            "name": run_name,
            "run_type": "llm",  # this should always be llm, since litellm always logs llm calls. Langsmith allow us to log "chain"
            "inputs": new_kwargs,
            "outputs": response_obj,
            "session_name": project_name,
            "start_time": start_time,
            "end_time": end_time,
        }
        print(f"data: {data}")

        response = requests.post(
            "https://api.smith.langchain.com/runs",
            json={
                "name": run_name,
                "run_type": "llm",  # this should always be llm, since litellm always logs llm calls. Langsmith allow us to log "chain"
                "inputs": {**new_kwargs},
                "outputs": response_obj.json(),
                "session_name": project_name,
                "start_time": start_time,
                "end_time": end_time,
            },
            json=data,
            headers={"x-api-key": self.langsmith_api_key},
        )

        if response.status_code >= 300:
            print_verbose(f"Error: {response.status_code}")
        else:
            print_verbose("Run successfully created")
        print_verbose(
            f"Langsmith Layer Logging - final response object: {response_obj}"
        )
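# A rough sketch of the payload filtering above: drop non-serializable values,
# ISO-format datetimes, then POST a run to the Langsmith REST API. The endpoint and
# the "x-api-key" header mirror the diff; everything else is a simplified stand-in
# for the logger's fields.
import datetime
import types

import requests
from pydantic import BaseModel

NON_SERIALIZABLE = (types.CoroutineType, types.FunctionType, types.GeneratorType, BaseModel)

def build_langsmith_inputs(kwargs: dict) -> dict:
    inputs = {}
    for key, value in kwargs.items():
        if key in ("start_time", "end_time") or value is None:
            continue  # start/end time are sent as top-level run fields instead
        if isinstance(value, datetime.datetime):
            inputs[key] = value.isoformat()
        elif not isinstance(value, dict) and not isinstance(value, NON_SERIALIZABLE):
            inputs[key] = value
    return inputs

def post_run(api_key: str, run_name: str, project: str, kwargs: dict, outputs: dict):
    now = datetime.datetime.utcnow().isoformat()
    data = {
        "name": run_name,
        "run_type": "llm",
        "inputs": build_langsmith_inputs(kwargs),
        "outputs": outputs,
        "session_name": project,
        "start_time": now,
        "end_time": now,
    }
    return requests.post(
        "https://api.smith.langchain.com/runs",
        json=data,
        headers={"x-api-key": api_key},
    )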
@@ -19,27 +19,33 @@ class PrometheusLogger:
        **kwargs,
    ):
        try:
            verbose_logger.debug(f"in init prometheus metrics")
            print(f"in init prometheus metrics")
            from prometheus_client import Counter

            self.litellm_llm_api_failed_requests_metric = Counter(
                name="litellm_llm_api_failed_requests_metric",
                documentation="Total number of failed LLM API calls via litellm",
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )

            self.litellm_requests_metric = Counter(
                name="litellm_requests_metric",
                documentation="Total number of LLM calls to litellm",
                labelnames=["end_user", "hashed_api_key", "model", "team"],
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )

            # Counter for spend
            self.litellm_spend_metric = Counter(
                "litellm_spend_metric",
                "Total spend on LLM requests",
                labelnames=["end_user", "hashed_api_key", "model", "team"],
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )

            # Counter for total_output_tokens
            self.litellm_tokens_metric = Counter(
                "litellm_total_tokens",
                "Total number of input + output tokens from LLM requests",
                labelnames=["end_user", "hashed_api_key", "model", "team"],
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )
        except Exception as e:
            print_verbose(f"Got exception on init prometheus client {str(e)}")

@@ -61,15 +67,21 @@ class PrometheusLogger:

            # unpack kwargs
            model = kwargs.get("model", "")
            response_cost = kwargs.get("response_cost", 0.0)
            response_cost = kwargs.get("response_cost", 0.0) or 0
            litellm_params = kwargs.get("litellm_params", {}) or {}
            proxy_server_request = litellm_params.get("proxy_server_request") or {}
            end_user_id = proxy_server_request.get("body", {}).get("user", None)
            user_id = litellm_params.get("metadata", {}).get(
                "user_api_key_user_id", None
            )
            user_api_key = litellm_params.get("metadata", {}).get("user_api_key", None)
            user_api_team = litellm_params.get("metadata", {}).get(
                "user_api_key_team_id", None
            )
            tokens_used = response_obj.get("usage", {}).get("total_tokens", 0)
            if response_obj is not None:
                tokens_used = response_obj.get("usage", {}).get("total_tokens", 0)
            else:
                tokens_used = 0

            print_verbose(
                f"inside track_prometheus_metrics, model {model}, response_cost {response_cost}, tokens_used {tokens_used}, end_user_id {end_user_id}, user_api_key {user_api_key}"

@@ -85,14 +97,20 @@ class PrometheusLogger:
                user_api_key = hash_token(user_api_key)

            self.litellm_requests_metric.labels(
                end_user_id, user_api_key, model, user_api_team
                end_user_id, user_api_key, model, user_api_team, user_id
            ).inc()
            self.litellm_spend_metric.labels(
                end_user_id, user_api_key, model, user_api_team
                end_user_id, user_api_key, model, user_api_team, user_id
            ).inc(response_cost)
            self.litellm_tokens_metric.labels(
                end_user_id, user_api_key, model, user_api_team
                end_user_id, user_api_key, model, user_api_team, user_id
            ).inc(tokens_used)

            ### FAILURE INCREMENT ###
            if "exception" in kwargs:
                self.litellm_llm_api_failed_requests_metric.labels(
                    end_user_id, user_api_key, model, user_api_team, user_id
                ).inc()
        except Exception as e:
            traceback.print_exc()
            verbose_logger.debug(
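# A small sketch of the label change above: the counters now carry a fifth "user"
# label, so every .labels(...) call has to pass the user id as well. Metric and label
# names mirror the diff; the sample label values are made up.
from prometheus_client import Counter

litellm_requests_metric = Counter(
    name="litellm_requests_metric",
    documentation="Total number of LLM calls to litellm",
    labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
)

# Every increment must now supply all five labels, in the same order.
litellm_requests_metric.labels(
    "end-user-123", "hashed-key-abc", "gpt-3.5-turbo", "team-a", "user-42"
).inc()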
@ -44,9 +44,18 @@ class PrometheusServicesLogger:
|
|||
) # store the prometheus histogram/counter we need to call for each field in payload
|
||||
|
||||
for service in self.services:
|
||||
histogram = self.create_histogram(service)
|
||||
counter = self.create_counter(service)
|
||||
self.payload_to_prometheus_map[service] = [histogram, counter]
|
||||
histogram = self.create_histogram(service, type_of_request="latency")
|
||||
counter_failed_request = self.create_counter(
|
||||
service, type_of_request="failed_requests"
|
||||
)
|
||||
counter_total_requests = self.create_counter(
|
||||
service, type_of_request="total_requests"
|
||||
)
|
||||
self.payload_to_prometheus_map[service] = [
|
||||
histogram,
|
||||
counter_failed_request,
|
||||
counter_total_requests,
|
||||
]
|
||||
|
||||
self.prometheus_to_amount_map: dict = (
|
||||
{}
|
||||
|
@ -74,26 +83,26 @@ class PrometheusServicesLogger:
|
|||
return metric
|
||||
return None
|
||||
|
||||
def create_histogram(self, label: str):
|
||||
metric_name = "litellm_{}_latency".format(label)
|
||||
def create_histogram(self, service: str, type_of_request: str):
|
||||
metric_name = "litellm_{}_{}".format(service, type_of_request)
|
||||
is_registered = self.is_metric_registered(metric_name)
|
||||
if is_registered:
|
||||
return self.get_metric(metric_name)
|
||||
return self.Histogram(
|
||||
metric_name,
|
||||
"Latency for {} service".format(label),
|
||||
labelnames=[label],
|
||||
"Latency for {} service".format(service),
|
||||
labelnames=[service],
|
||||
)
|
||||
|
||||
def create_counter(self, label: str):
|
||||
metric_name = "litellm_{}_failed_requests".format(label)
|
||||
def create_counter(self, service: str, type_of_request: str):
|
||||
metric_name = "litellm_{}_{}".format(service, type_of_request)
|
||||
is_registered = self.is_metric_registered(metric_name)
|
||||
if is_registered:
|
||||
return self.get_metric(metric_name)
|
||||
return self.Counter(
|
||||
metric_name,
|
||||
"Total failed requests for {} service".format(label),
|
||||
labelnames=[label],
|
||||
"Total {} for {} service".format(type_of_request, service),
|
||||
labelnames=[service],
|
||||
)
|
||||
|
||||
def observe_histogram(
|
||||
|
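# The refactor above derives one metric name per (service, request type) pair. A tiny
# sketch of that naming scheme, with made-up service names:
def service_metric_name(service: str, type_of_request: str) -> str:
    return "litellm_{}_{}".format(service, type_of_request)

for svc in ("redis", "postgres"):
    for kind in ("latency", "failed_requests", "total_requests"):
        print(service_metric_name(svc, kind))
# e.g. litellm_redis_latency, litellm_redis_failed_requests, litellm_redis_total_requests, ...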
@ -129,6 +138,12 @@ class PrometheusServicesLogger:
|
|||
labels=payload.service.value,
|
||||
amount=payload.duration,
|
||||
)
|
||||
elif isinstance(obj, self.Counter) and "total_requests" in obj._name:
|
||||
self.increment_counter(
|
||||
counter=obj,
|
||||
labels=payload.service.value,
|
||||
amount=1, # LOG TOTAL REQUESTS TO PROMETHEUS
|
||||
)
|
||||
|
||||
def service_failure_hook(self, payload: ServiceLoggerPayload):
|
||||
if self.mock_testing:
|
||||
|
@ -141,7 +156,7 @@ class PrometheusServicesLogger:
|
|||
self.increment_counter(
|
||||
counter=obj,
|
||||
labels=payload.service.value,
|
||||
amount=1, # LOG ERROR COUNT TO PROMETHEUS
|
||||
amount=1, # LOG ERROR COUNT / TOTAL REQUESTS TO PROMETHEUS
|
||||
)
|
||||
|
||||
async def async_service_success_hook(self, payload: ServiceLoggerPayload):
|
||||
|
@ -160,6 +175,12 @@ class PrometheusServicesLogger:
|
|||
labels=payload.service.value,
|
||||
amount=payload.duration,
|
||||
)
|
||||
elif isinstance(obj, self.Counter) and "total_requests" in obj._name:
|
||||
self.increment_counter(
|
||||
counter=obj,
|
||||
labels=payload.service.value,
|
||||
amount=1, # LOG TOTAL REQUESTS TO PROMETHEUS
|
||||
)
|
||||
|
||||
async def async_service_failure_hook(self, payload: ServiceLoggerPayload):
|
||||
print(f"received error payload: {payload.error}")
|
||||
|
|
litellm/integrations/slack_alerting.py (new file, 541 lines)

@@ -0,0 +1,541 @@
#### What this does ####
|
||||
# Class for sending Slack Alerts #
|
||||
import dotenv, os
|
||||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
import copy
|
||||
import traceback
|
||||
from litellm._logging import verbose_logger, verbose_proxy_logger
|
||||
import litellm
|
||||
from typing import List, Literal, Any, Union, Optional, Dict
|
||||
from litellm.caching import DualCache
|
||||
import asyncio
|
||||
import aiohttp
|
||||
from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler
|
||||
import datetime
|
||||
|
||||
|
||||
class SlackAlerting:
|
||||
# Class variables or attributes
|
||||
def __init__(
|
||||
self,
|
||||
alerting_threshold: float = 300,
|
||||
alerting: Optional[List] = [],
|
||||
alert_types: Optional[
|
||||
List[
|
||||
Literal[
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
]
|
||||
]
|
||||
] = [
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
],
|
||||
alert_to_webhook_url: Optional[
|
||||
Dict
|
||||
] = None, # if user wants to separate alerts to diff channels
|
||||
):
|
||||
self.alerting_threshold = alerting_threshold
|
||||
self.alerting = alerting
|
||||
self.alert_types = alert_types
|
||||
self.internal_usage_cache = DualCache()
|
||||
self.async_http_handler = AsyncHTTPHandler()
|
||||
self.alert_to_webhook_url = alert_to_webhook_url
|
||||
self.langfuse_logger = None
|
||||
|
||||
try:
|
||||
from litellm.integrations.langfuse import LangFuseLogger
|
||||
|
||||
self.langfuse_logger = LangFuseLogger(
|
||||
os.getenv("LANGFUSE_PUBLIC_KEY"),
|
||||
os.getenv("LANGFUSE_SECRET_KEY"),
|
||||
flush_interval=1,
|
||||
)
|
||||
except:
|
||||
pass
|
||||
|
||||
pass
|
||||
|
||||
def update_values(
|
||||
self,
|
||||
alerting: Optional[List] = None,
|
||||
alerting_threshold: Optional[float] = None,
|
||||
alert_types: Optional[List] = None,
|
||||
alert_to_webhook_url: Optional[Dict] = None,
|
||||
):
|
||||
if alerting is not None:
|
||||
self.alerting = alerting
|
||||
if alerting_threshold is not None:
|
||||
self.alerting_threshold = alerting_threshold
|
||||
if alert_types is not None:
|
||||
self.alert_types = alert_types
|
||||
|
||||
if alert_to_webhook_url is not None:
|
||||
# update the dict
|
||||
if self.alert_to_webhook_url is None:
|
||||
self.alert_to_webhook_url = alert_to_webhook_url
|
||||
else:
|
||||
self.alert_to_webhook_url.update(alert_to_webhook_url)
|
||||
|
||||
async def deployment_in_cooldown(self):
|
||||
pass
|
||||
|
||||
async def deployment_removed_from_cooldown(self):
|
||||
pass
|
||||
|
||||
def _all_possible_alert_types(self):
|
||||
# used by the UI to show all supported alert types
|
||||
# Note: This is not the alerts the user has configured, instead it's all possible alert types a user can select
|
||||
return [
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
]
|
||||
|
||||
def _add_langfuse_trace_id_to_alert(
|
||||
self,
|
||||
request_info: str,
|
||||
request_data: Optional[dict] = None,
|
||||
kwargs: Optional[dict] = None,
|
||||
type: Literal["hanging_request", "slow_response"] = "hanging_request",
|
||||
start_time: Optional[datetime.datetime] = None,
|
||||
end_time: Optional[datetime.datetime] = None,
|
||||
):
|
||||
import uuid
|
||||
|
||||
# For now: do nothing as we're debugging why this is not working as expected
|
||||
if request_data is not None:
|
||||
trace_id = request_data.get("metadata", {}).get(
|
||||
"trace_id", None
|
||||
) # get langfuse trace id
|
||||
if trace_id is None:
|
||||
trace_id = "litellm-alert-trace-" + str(uuid.uuid4())
|
||||
request_data["metadata"]["trace_id"] = trace_id
|
||||
elif kwargs is not None:
|
||||
_litellm_params = kwargs.get("litellm_params", {})
|
||||
trace_id = _litellm_params.get("metadata", {}).get(
|
||||
"trace_id", None
|
||||
) # get langfuse trace id
|
||||
if trace_id is None:
|
||||
trace_id = "litellm-alert-trace-" + str(uuid.uuid4())
|
||||
_litellm_params["metadata"]["trace_id"] = trace_id
|
||||
|
||||
# Log hanging request as an error on langfuse
|
||||
if type == "hanging_request":
|
||||
if self.langfuse_logger is not None:
|
||||
_logging_kwargs = copy.deepcopy(request_data)
|
||||
if _logging_kwargs is None:
|
||||
_logging_kwargs = {}
|
||||
_logging_kwargs["litellm_params"] = {}
|
||||
request_data = request_data or {}
|
||||
_logging_kwargs["litellm_params"]["metadata"] = request_data.get(
|
||||
"metadata", {}
|
||||
)
|
||||
# log to langfuse in a separate thread
|
||||
import threading
|
||||
|
||||
threading.Thread(
|
||||
target=self.langfuse_logger.log_event,
|
||||
args=(
|
||||
_logging_kwargs,
|
||||
None,
|
||||
start_time,
|
||||
end_time,
|
||||
None,
|
||||
print,
|
||||
"ERROR",
|
||||
"Requests is hanging",
|
||||
),
|
||||
).start()
|
||||
|
||||
_langfuse_host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
|
||||
_langfuse_project_id = os.environ.get("LANGFUSE_PROJECT_ID")
|
||||
|
||||
# langfuse urls look like: https://us.cloud.langfuse.com/project/************/traces/litellm-alert-trace-ididi9dk-09292-************
|
||||
|
||||
_langfuse_url = (
|
||||
f"{_langfuse_host}/project/{_langfuse_project_id}/traces/{trace_id}"
|
||||
)
|
||||
request_info += f"\n🪢 Langfuse Trace: {_langfuse_url}"
|
||||
return request_info
|
||||
|
||||
def _response_taking_too_long_callback(
|
||||
self,
|
||||
kwargs, # kwargs to completion
|
||||
start_time,
|
||||
end_time, # start/end time
|
||||
):
|
||||
try:
|
||||
time_difference = end_time - start_time
|
||||
# Convert the timedelta to float (in seconds)
|
||||
time_difference_float = time_difference.total_seconds()
|
||||
litellm_params = kwargs.get("litellm_params", {})
|
||||
model = kwargs.get("model", "")
|
||||
api_base = litellm.get_api_base(model=model, optional_params=litellm_params)
|
||||
messages = kwargs.get("messages", None)
|
||||
# if messages does not exist fallback to "input"
|
||||
if messages is None:
|
||||
messages = kwargs.get("input", None)
|
||||
|
||||
# only use first 100 chars for alerting
|
||||
_messages = str(messages)[:100]
|
||||
|
||||
return time_difference_float, model, api_base, _messages
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
||||
def _get_deployment_latencies_to_alert(self, metadata=None):
|
||||
if metadata is None:
|
||||
return None
|
||||
|
||||
if "_latency_per_deployment" in metadata:
|
||||
# Translate model_id to -> api_base
|
||||
# _latency_per_deployment is a dictionary that looks like this:
|
||||
"""
|
||||
_latency_per_deployment: {
|
||||
api_base: 0.01336697916666667
|
||||
}
|
||||
"""
|
||||
_message_to_send = ""
|
||||
_deployment_latencies = metadata["_latency_per_deployment"]
|
||||
if len(_deployment_latencies) == 0:
|
||||
return None
|
||||
try:
|
||||
# try sorting deployments by latency
|
||||
_deployment_latencies = sorted(
|
||||
_deployment_latencies.items(), key=lambda x: x[1]
|
||||
)
|
||||
_deployment_latencies = dict(_deployment_latencies)
|
||||
except:
|
||||
pass
|
||||
for api_base, latency in _deployment_latencies.items():
|
||||
_message_to_send += f"\n{api_base}: {round(latency,2)}s"
|
||||
_message_to_send = "```" + _message_to_send + "```"
|
||||
return _message_to_send
|
||||
|
||||
async def response_taking_too_long_callback(
|
||||
self,
|
||||
kwargs, # kwargs to completion
|
||||
completion_response, # response from completion
|
||||
start_time,
|
||||
end_time, # start/end time
|
||||
):
|
||||
if self.alerting is None or self.alert_types is None:
|
||||
return
|
||||
|
||||
time_difference_float, model, api_base, messages = (
|
||||
self._response_taking_too_long_callback(
|
||||
kwargs=kwargs,
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
)
|
||||
)
|
||||
request_info = f"\nRequest Model: `{model}`\nAPI Base: `{api_base}`\nMessages: `{messages}`"
|
||||
slow_message = f"`Responses are slow - {round(time_difference_float,2)}s response time > Alerting threshold: {self.alerting_threshold}s`"
|
||||
if time_difference_float > self.alerting_threshold:
|
||||
if "langfuse" in litellm.success_callback:
|
||||
request_info = self._add_langfuse_trace_id_to_alert(
|
||||
request_info=request_info, kwargs=kwargs, type="slow_response"
|
||||
)
|
||||
# add deployment latencies to alert
|
||||
if (
|
||||
kwargs is not None
|
||||
and "litellm_params" in kwargs
|
||||
and "metadata" in kwargs["litellm_params"]
|
||||
):
|
||||
_metadata = kwargs["litellm_params"]["metadata"]
|
||||
|
||||
_deployment_latency_map = self._get_deployment_latencies_to_alert(
|
||||
metadata=_metadata
|
||||
)
|
||||
if _deployment_latency_map is not None:
|
||||
request_info += (
|
||||
f"\nAvailable Deployment Latencies\n{_deployment_latency_map}"
|
||||
)
|
||||
await self.send_alert(
|
||||
message=slow_message + request_info,
|
||||
level="Low",
|
||||
alert_type="llm_too_slow",
|
||||
)
|
||||
|
||||
async def log_failure_event(self, original_exception: Exception):
|
||||
pass
|
||||
|
||||
async def response_taking_too_long(
|
||||
self,
|
||||
start_time: Optional[datetime.datetime] = None,
|
||||
end_time: Optional[datetime.datetime] = None,
|
||||
type: Literal["hanging_request", "slow_response"] = "hanging_request",
|
||||
request_data: Optional[dict] = None,
|
||||
):
|
||||
if self.alerting is None or self.alert_types is None:
|
||||
return
|
||||
if request_data is not None:
|
||||
model = request_data.get("model", "")
|
||||
messages = request_data.get("messages", None)
|
||||
if messages is None:
|
||||
# if messages does not exist fallback to "input"
|
||||
messages = request_data.get("input", None)
|
||||
|
||||
# try casting messages to str and get the first 100 characters, else mark as None
|
||||
try:
|
||||
messages = str(messages)
|
||||
messages = messages[:100]
|
||||
except:
|
||||
messages = ""
|
||||
request_info = f"\nRequest Model: `{model}`\nMessages: `{messages}`"
|
||||
else:
|
||||
request_info = ""
|
||||
|
||||
if type == "hanging_request":
|
||||
await asyncio.sleep(
|
||||
self.alerting_threshold
|
||||
) # Set it to 5 minutes - i'd imagine this might be different for streaming, non-streaming, non-completion (embedding + img) requests
|
||||
if (
|
||||
request_data is not None
|
||||
and request_data.get("litellm_status", "") != "success"
|
||||
and request_data.get("litellm_status", "") != "fail"
|
||||
):
|
||||
if request_data.get("deployment", None) is not None and isinstance(
|
||||
request_data["deployment"], dict
|
||||
):
|
||||
_api_base = litellm.get_api_base(
|
||||
model=model,
|
||||
optional_params=request_data["deployment"].get(
|
||||
"litellm_params", {}
|
||||
),
|
||||
)
|
||||
|
||||
if _api_base is None:
|
||||
_api_base = ""
|
||||
|
||||
request_info += f"\nAPI Base: {_api_base}"
|
||||
elif request_data.get("metadata", None) is not None and isinstance(
|
||||
request_data["metadata"], dict
|
||||
):
|
||||
# In hanging requests sometime it has not made it to the point where the deployment is passed to the `request_data``
|
||||
# in that case we fallback to the api base set in the request metadata
|
||||
_metadata = request_data["metadata"]
|
||||
_api_base = _metadata.get("api_base", "")
|
||||
if _api_base is None:
|
||||
_api_base = ""
|
||||
request_info += f"\nAPI Base: `{_api_base}`"
|
||||
# only alert hanging responses if they have not been marked as success
|
||||
alerting_message = (
|
||||
f"`Requests are hanging - {self.alerting_threshold}s+ request time`"
|
||||
)
|
||||
|
||||
if "langfuse" in litellm.success_callback:
|
||||
request_info = self._add_langfuse_trace_id_to_alert(
|
||||
request_info=request_info,
|
||||
request_data=request_data,
|
||||
type="hanging_request",
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
)
|
||||
|
||||
# add deployment latencies to alert
|
||||
_deployment_latency_map = self._get_deployment_latencies_to_alert(
|
||||
metadata=request_data.get("metadata", {})
|
||||
)
|
||||
if _deployment_latency_map is not None:
|
||||
request_info += f"\nDeployment Latencies\n{_deployment_latency_map}"
|
||||
|
||||
await self.send_alert(
|
||||
message=alerting_message + request_info,
|
||||
level="Medium",
|
||||
alert_type="llm_requests_hanging",
|
||||
)
|
||||
|
||||
async def budget_alerts(
|
||||
self,
|
||||
type: Literal[
|
||||
"token_budget",
|
||||
"user_budget",
|
||||
"user_and_proxy_budget",
|
||||
"failed_budgets",
|
||||
"failed_tracking",
|
||||
"projected_limit_exceeded",
|
||||
],
|
||||
user_max_budget: float,
|
||||
user_current_spend: float,
|
||||
user_info=None,
|
||||
error_message="",
|
||||
):
|
||||
if self.alerting is None or self.alert_types is None:
|
||||
# do nothing if alerting is not switched on
|
||||
return
|
||||
if "budget_alerts" not in self.alert_types:
|
||||
return
|
||||
_id: str = "default_id" # used for caching
|
||||
if type == "user_and_proxy_budget":
|
||||
user_info = dict(user_info)
|
||||
user_id = user_info["user_id"]
|
||||
_id = user_id
|
||||
max_budget = user_info["max_budget"]
|
||||
spend = user_info["spend"]
|
||||
user_email = user_info["user_email"]
|
||||
user_info = f"""\nUser ID: {user_id}\nMax Budget: ${max_budget}\nSpend: ${spend}\nUser Email: {user_email}"""
|
||||
elif type == "token_budget":
|
||||
token_info = dict(user_info)
|
||||
token = token_info["token"]
|
||||
_id = token
|
||||
spend = token_info["spend"]
|
||||
max_budget = token_info["max_budget"]
|
||||
user_id = token_info["user_id"]
|
||||
user_info = f"""\nToken: {token}\nSpend: ${spend}\nMax Budget: ${max_budget}\nUser ID: {user_id}"""
|
||||
elif type == "failed_tracking":
|
||||
user_id = str(user_info)
|
||||
_id = user_id
|
||||
user_info = f"\nUser ID: {user_id}\n Error {error_message}"
|
||||
message = "Failed Tracking Cost for" + user_info
|
||||
await self.send_alert(
|
||||
message=message, level="High", alert_type="budget_alerts"
|
||||
)
|
||||
return
|
||||
elif type == "projected_limit_exceeded" and user_info is not None:
|
||||
"""
|
||||
Input variables:
|
||||
user_info = {
|
||||
"key_alias": key_alias,
|
||||
"projected_spend": projected_spend,
|
||||
"projected_exceeded_date": projected_exceeded_date,
|
||||
}
|
||||
user_max_budget=soft_limit,
|
||||
user_current_spend=new_spend
|
||||
"""
|
||||
message = f"""\n🚨 `ProjectedLimitExceededError` 💸\n\n`Key Alias:` {user_info["key_alias"]} \n`Expected Day of Error`: {user_info["projected_exceeded_date"]} \n`Current Spend`: {user_current_spend} \n`Projected Spend at end of month`: {user_info["projected_spend"]} \n`Soft Limit`: {user_max_budget}"""
|
||||
await self.send_alert(
|
||||
message=message, level="High", alert_type="budget_alerts"
|
||||
)
|
||||
return
|
||||
else:
|
||||
user_info = str(user_info)
|
||||
|
||||
# percent of max_budget left to spend
|
||||
if user_max_budget > 0:
|
||||
percent_left = (user_max_budget - user_current_spend) / user_max_budget
|
||||
else:
|
||||
percent_left = 0
|
||||
verbose_proxy_logger.debug(
|
||||
f"Budget Alerts: Percent left: {percent_left} for {user_info}"
|
||||
)
|
||||
|
||||
## PREVENTITIVE ALERTING ## - https://github.com/BerriAI/litellm/issues/2727
|
||||
# - Alert once within 28d period
|
||||
# - Cache this information
|
||||
# - Don't re-alert, if alert already sent
|
||||
_cache: DualCache = self.internal_usage_cache
|
||||
|
||||
# check if crossed budget
|
||||
if user_current_spend >= user_max_budget:
|
||||
verbose_proxy_logger.debug("Budget Crossed for %s", user_info)
|
||||
message = "Budget Crossed for" + user_info
|
||||
result = await _cache.async_get_cache(key=message)
|
||||
if result is None:
|
||||
await self.send_alert(
|
||||
message=message, level="High", alert_type="budget_alerts"
|
||||
)
|
||||
await _cache.async_set_cache(key=message, value="SENT", ttl=2419200)
|
||||
return
|
||||
|
||||
# check if 5% of max budget is left
|
||||
if percent_left <= 0.05:
|
||||
message = "5% budget left for" + user_info
|
||||
cache_key = "alerting:{}".format(_id)
|
||||
result = await _cache.async_get_cache(key=cache_key)
|
||||
if result is None:
|
||||
await self.send_alert(
|
||||
message=message, level="Medium", alert_type="budget_alerts"
|
||||
)
|
||||
|
||||
await _cache.async_set_cache(key=cache_key, value="SENT", ttl=2419200)
|
||||
|
||||
return
|
||||
|
||||
# check if 15% of max budget is left
|
||||
if percent_left <= 0.15:
|
||||
message = "15% budget left for" + user_info
|
||||
result = await _cache.async_get_cache(key=message)
|
||||
if result is None:
|
||||
await self.send_alert(
|
||||
message=message, level="Low", alert_type="budget_alerts"
|
||||
)
|
||||
await _cache.async_set_cache(key=message, value="SENT", ttl=2419200)
|
||||
return
|
||||
|
||||
return
|
||||
|
||||
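# A simplified sketch of the budget thresholds above: alert when spend crosses the
# budget, at 5% of budget remaining, and at 15% remaining; the diff additionally caches
# each sent alert for 28 days (ttl=2419200) so it is not re-sent.
def budget_alert_level(max_budget: float, current_spend: float):
    if max_budget <= 0:
        return None
    if current_spend >= max_budget:
        return ("High", "Budget Crossed")
    percent_left = (max_budget - current_spend) / max_budget
    if percent_left <= 0.05:
        return ("Medium", "5% budget left")
    if percent_left <= 0.15:
        return ("Low", "15% budget left")
    return None

print(budget_alert_level(100.0, 97.0))   # ('Medium', '5% budget left')
print(budget_alert_level(100.0, 120.0))  # ('High', 'Budget Crossed')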
async def send_alert(
|
||||
self,
|
||||
message: str,
|
||||
level: Literal["Low", "Medium", "High"],
|
||||
alert_type: Literal[
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
],
|
||||
):
|
||||
"""
|
||||
Alerting based on thresholds: - https://github.com/BerriAI/litellm/issues/1298
|
||||
|
||||
- Responses taking too long
|
||||
- Requests are hanging
|
||||
- Calls are failing
|
||||
- DB Read/Writes are failing
|
||||
- Proxy Close to max budget
|
||||
- Key Close to max budget
|
||||
|
||||
Parameters:
|
||||
level: str - Low|Medium|High - if calls might fail (Medium) or are failing (High); Currently, no alerts would be 'Low'.
|
||||
message: str - what is the alert about
|
||||
"""
|
||||
if self.alerting is None:
|
||||
return
|
||||
|
||||
from datetime import datetime
|
||||
import json
|
||||
|
||||
# Get the current timestamp
|
||||
current_time = datetime.now().strftime("%H:%M:%S")
|
||||
_proxy_base_url = os.getenv("PROXY_BASE_URL", None)
|
||||
formatted_message = (
|
||||
f"Level: `{level}`\nTimestamp: `{current_time}`\n\nMessage: {message}"
|
||||
)
|
||||
if _proxy_base_url is not None:
|
||||
formatted_message += f"\n\nProxy URL: `{_proxy_base_url}`"
|
||||
|
||||
# check if we find the slack webhook url in self.alert_to_webhook_url
|
||||
if (
|
||||
self.alert_to_webhook_url is not None
|
||||
and alert_type in self.alert_to_webhook_url
|
||||
):
|
||||
slack_webhook_url = self.alert_to_webhook_url[alert_type]
|
||||
else:
|
||||
slack_webhook_url = os.getenv("SLACK_WEBHOOK_URL", None)
|
||||
|
||||
if slack_webhook_url is None:
|
||||
raise Exception("Missing SLACK_WEBHOOK_URL from environment")
|
||||
payload = {"text": formatted_message}
|
||||
headers = {"Content-type": "application/json"}
|
||||
|
||||
response = await self.async_http_handler.post(
|
||||
url=slack_webhook_url,
|
||||
headers=headers,
|
||||
data=json.dumps(payload),
|
||||
)
|
||||
if response.status_code == 200:
|
||||
pass
|
||||
else:
|
||||
print("Error sending slack alert. Error=", response.text) # noqa
|
|
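# A minimal sketch of the webhook selection in send_alert above: prefer a per-alert-type
# webhook from alert_to_webhook_url, fall back to the SLACK_WEBHOOK_URL env var, and
# raise if neither is set. The payload shape ({"text": ...}) matches the diff; the alert
# type and URL below are placeholders.
import os
from typing import Dict, Optional

def resolve_slack_webhook(alert_type: str, alert_to_webhook_url: Optional[Dict[str, str]]) -> str:
    if alert_to_webhook_url and alert_type in alert_to_webhook_url:
        return alert_to_webhook_url[alert_type]
    url = os.getenv("SLACK_WEBHOOK_URL")
    if url is None:
        raise Exception("Missing SLACK_WEBHOOK_URL from environment")
    return url

# Example: route budget alerts to a dedicated channel, everything else to the default.
routes = {"budget_alerts": "https://hooks.slack.com/services/T000/B000/budget"}
print(resolve_slack_webhook("budget_alerts", routes))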
@ -298,7 +298,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -258,8 +258,9 @@ class AnthropicChatCompletion(BaseLLM):
|
|||
self.async_handler = AsyncHTTPHandler(
|
||||
timeout=httpx.Timeout(timeout=600.0, connect=5.0)
|
||||
)
|
||||
data["stream"] = True
|
||||
response = await self.async_handler.post(
|
||||
api_base, headers=headers, data=json.dumps(data)
|
||||
api_base, headers=headers, data=json.dumps(data), stream=True
|
||||
)
|
||||
|
||||
if response.status_code != 200:
|
||||
|
|
|
@ -137,7 +137,8 @@ class AnthropicTextCompletion(BaseLLM):
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
|
||||
setattr(model_response, "usage", usage)
|
||||
|
||||
return model_response
|
||||
|
||||
|
|
|
@ -96,6 +96,15 @@ class AzureOpenAIConfig(OpenAIConfig):
|
|||
top_p,
|
||||
)
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
return {"token": "azure_ad_token"}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
for param, value in non_default_params.items():
|
||||
if param == "token":
|
||||
optional_params["azure_ad_token"] = value
|
||||
return optional_params
|
||||
|
||||
|
||||
def select_azure_base_url_or_endpoint(azure_client_params: dict):
|
||||
# azure_client_params = {
|
||||
|
|
|
@ -55,9 +55,11 @@ def completion(
|
|||
"inputs": prompt,
|
||||
"prompt": prompt,
|
||||
"parameters": optional_params,
|
||||
"stream": True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False,
|
||||
"stream": (
|
||||
True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False
|
||||
),
|
||||
}
|
||||
|
||||
## LOGGING
|
||||
|
@ -71,9 +73,11 @@ def completion(
|
|||
completion_url_fragment_1 + model + completion_url_fragment_2,
|
||||
headers=headers,
|
||||
data=json.dumps(data),
|
||||
stream=True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False,
|
||||
stream=(
|
||||
True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False
|
||||
),
|
||||
)
|
||||
if "text/event-stream" in response.headers["Content-Type"] or (
|
||||
"stream" in optional_params and optional_params["stream"] == True
|
||||
|
@ -102,28 +106,28 @@ def completion(
|
|||
and "data" in completion_response["model_output"]
|
||||
and isinstance(completion_response["model_output"]["data"], list)
|
||||
):
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["model_output"]["data"][0]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["model_output"]["data"][0]
|
||||
)
|
||||
elif isinstance(completion_response["model_output"], str):
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["model_output"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["model_output"]
|
||||
)
|
||||
elif "completion" in completion_response and isinstance(
|
||||
completion_response["completion"], str
|
||||
):
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["completion"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["completion"]
|
||||
)
|
||||
elif isinstance(completion_response, list) and len(completion_response) > 0:
|
||||
if "generated_text" not in completion_response:
|
||||
raise BasetenError(
|
||||
message=f"Unable to parse response. Original response: {response.text}",
|
||||
status_code=response.status_code,
|
||||
)
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response[0]["generated_text"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response[0]["generated_text"]
|
||||
)
|
||||
## GETTING LOGPROBS
|
||||
if (
|
||||
"details" in completion_response[0]
|
||||
|
@ -155,7 +159,8 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -29,6 +29,24 @@ class BedrockError(Exception):
|
|||
) # Call the base class constructor with the parameters it needs
|
||||
|
||||
|
||||
class AmazonBedrockGlobalConfig:
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
"""
|
||||
Mapping of common auth params across bedrock/vertex/azure/watsonx
|
||||
"""
|
||||
return {"region_name": "aws_region_name"}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
mapped_params = self.get_mapped_special_auth_params()
|
||||
for param, value in non_default_params.items():
|
||||
if param in mapped_params:
|
||||
optional_params[mapped_params[param]] = value
|
||||
return optional_params
|
||||
|
||||
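# A short sketch of the common-auth-param mapping above: a provider-agnostic param
# (region_name) is translated to the provider-specific key (aws_region_name) before the
# request is built. The mapping mirrors the diff; the input values are examples.
def map_special_auth_params(non_default_params: dict, optional_params: dict) -> dict:
    mapped_params = {"region_name": "aws_region_name"}
    for param, value in non_default_params.items():
        if param in mapped_params:
            optional_params[mapped_params[param]] = value
    return optional_params

print(map_special_auth_params({"region_name": "us-west-2"}, {}))
# {'aws_region_name': 'us-west-2'}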
|
||||
class AmazonTitanConfig:
|
||||
"""
|
||||
Reference: https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/providers?model=titan-text-express-v1
|
||||
|
@ -653,6 +671,10 @@ def convert_messages_to_prompt(model, messages, provider, custom_prompt_dict):
|
|||
prompt = prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="bedrock"
|
||||
)
|
||||
elif provider == "meta":
|
||||
prompt = prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="bedrock"
|
||||
)
|
||||
else:
|
||||
prompt = ""
|
||||
for message in messages:
|
||||
|
@ -1028,7 +1050,7 @@ def completion(
|
|||
total_tokens=response_body["usage"]["input_tokens"]
|
||||
+ response_body["usage"]["output_tokens"],
|
||||
)
|
||||
model_response.usage = _usage
|
||||
setattr(model_response, "usage", _usage)
|
||||
else:
|
||||
outputText = response_body["completion"]
|
||||
model_response["finish_reason"] = response_body["stop_reason"]
|
||||
|
@ -1071,8 +1093,10 @@ def completion(
|
|||
status_code=response_metadata.get("HTTPStatusCode", 500),
|
||||
)
|
||||
|
||||
## CALCULATING USAGE - baseten charges on time, not tokens - have some mapping of cost here.
|
||||
if getattr(model_response.usage, "total_tokens", None) is None:
|
||||
## CALCULATING USAGE - bedrock charges on time, not tokens - have some mapping of cost here.
|
||||
if not hasattr(model_response, "usage"):
|
||||
setattr(model_response, "usage", Usage())
|
||||
if getattr(model_response.usage, "total_tokens", None) is None: # type: ignore
|
||||
prompt_tokens = response_metadata.get(
|
||||
"x-amzn-bedrock-input-token-count", len(encoding.encode(prompt))
|
||||
)
|
||||
|
@ -1089,7 +1113,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = model
|
||||
|
|
|
@ -167,7 +167,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -237,7 +237,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -43,6 +43,7 @@ class CohereChatConfig:
|
|||
presence_penalty (float, optional): Used to reduce repetitiveness of generated tokens.
|
||||
tools (List[Dict[str, str]], optional): A list of available tools (functions) that the model may suggest invoking.
|
||||
tool_results (List[Dict[str, Any]], optional): A list of results from invoking tools.
|
||||
seed (int, optional): A seed to assist reproducibility of the model's response.
|
||||
"""
|
||||
|
||||
preamble: Optional[str] = None
|
||||
|
@ -62,6 +63,7 @@ class CohereChatConfig:
|
|||
presence_penalty: Optional[int] = None
|
||||
tools: Optional[list] = None
|
||||
tool_results: Optional[list] = None
|
||||
seed: Optional[int] = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
|
@ -82,6 +84,7 @@ class CohereChatConfig:
|
|||
presence_penalty: Optional[int] = None,
|
||||
tools: Optional[list] = None,
|
||||
tool_results: Optional[list] = None,
|
||||
seed: Optional[int] = None,
|
||||
) -> None:
|
||||
locals_ = locals()
|
||||
for key, value in locals_.items():
|
||||
|
@ -302,5 +305,5 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
|
|
@ -41,13 +41,12 @@ class AsyncHTTPHandler:
|
|||
data: Optional[Union[dict, str]] = None, # type: ignore
|
||||
params: Optional[dict] = None,
|
||||
headers: Optional[dict] = None,
|
||||
stream: bool = False,
|
||||
):
|
||||
response = await self.client.post(
|
||||
url,
|
||||
data=data, # type: ignore
|
||||
params=params,
|
||||
headers=headers,
|
||||
req = self.client.build_request(
|
||||
"POST", url, data=data, params=params, headers=headers # type: ignore
|
||||
)
|
||||
response = await self.client.send(req, stream=stream)
|
||||
return response
|
||||
|
||||
def __del__(self) -> None:
|
||||
|
|
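# A small sketch of the build_request/send pattern above, which lets one POST helper
# serve both buffered and streaming responses. The URL is a placeholder; build_request,
# send(stream=...), and aiter_lines are standard httpx APIs.
import asyncio
import httpx

async def post(url: str, data: str, stream: bool = False) -> httpx.Response:
    async with httpx.AsyncClient() as client:
        req = client.build_request("POST", url, content=data)
        resp = await client.send(req, stream=stream)
        if stream:
            async for line in resp.aiter_lines():
                print(line)
            await resp.aclose()
        return resp

# asyncio.run(post("https://example.com/v1/messages", data='{"stream": true}', stream=True))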
|
@ -311,7 +311,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -152,9 +152,9 @@ def completion(
|
|||
else:
|
||||
try:
|
||||
if len(completion_response["answer"]) > 0:
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["answer"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["answer"]
|
||||
)
|
||||
except Exception as e:
|
||||
raise MaritalkError(
|
||||
message=response.text, status_code=response.status_code
|
||||
|
@ -174,7 +174,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -185,9 +185,9 @@ def completion(
|
|||
else:
|
||||
try:
|
||||
if len(completion_response["generated_text"]) > 0:
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["generated_text"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["generated_text"]
|
||||
)
|
||||
except:
|
||||
raise NLPCloudError(
|
||||
message=json.dumps(completion_response),
|
||||
|
@ -205,7 +205,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -228,7 +228,7 @@ def get_ollama_response(
|
|||
model_response["choices"][0]["message"]["content"] = response_json["response"]
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = "ollama/" + model
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(prompt))) # type: ignore
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(prompt, disallowed_special=()))) # type: ignore
|
||||
completion_tokens = response_json.get("eval_count", len(response_json.get("message",dict()).get("content", "")))
|
||||
model_response["usage"] = litellm.Usage(
|
||||
prompt_tokens=prompt_tokens,
|
||||
|
@ -330,7 +330,7 @@ async def ollama_acompletion(url, data, model_response, encoding, logging_obj):
|
|||
]
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = "ollama/" + data["model"]
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(data["prompt"]))) # type: ignore
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(data["prompt"], disallowed_special=()))) # type: ignore
|
||||
completion_tokens = response_json.get("eval_count", len(response_json.get("message",dict()).get("content", "")))
|
||||
model_response["usage"] = litellm.Usage(
|
||||
prompt_tokens=prompt_tokens,
|
||||
|
|
|
@ -148,7 +148,7 @@ class OllamaChatConfig:
|
|||
if param == "top_p":
|
||||
optional_params["top_p"] = value
|
||||
if param == "frequency_penalty":
|
||||
optional_params["repeat_penalty"] = param
|
||||
optional_params["repeat_penalty"] = value
|
||||
if param == "stop":
|
||||
optional_params["stop"] = value
|
||||
if param == "response_format" and value["type"] == "json_object":
|
||||
|
@ -184,6 +184,7 @@ class OllamaChatConfig:
|
|||
# ollama implementation
|
||||
def get_ollama_response(
|
||||
api_base="http://localhost:11434",
|
||||
api_key: Optional[str] = None,
|
||||
model="llama2",
|
||||
messages=None,
|
||||
optional_params=None,
|
||||
|
@ -236,6 +237,7 @@ def get_ollama_response(
|
|||
if stream == True:
|
||||
response = ollama_async_streaming(
|
||||
url=url,
|
||||
api_key=api_key,
|
||||
data=data,
|
||||
model_response=model_response,
|
||||
encoding=encoding,
|
||||
|
@ -244,6 +246,7 @@ def get_ollama_response(
|
|||
else:
|
||||
response = ollama_acompletion(
|
||||
url=url,
|
||||
api_key=api_key,
|
||||
data=data,
|
||||
model_response=model_response,
|
||||
encoding=encoding,
|
||||
|
@ -252,12 +255,17 @@ def get_ollama_response(
|
|||
)
|
||||
return response
|
||||
elif stream == True:
|
||||
return ollama_completion_stream(url=url, data=data, logging_obj=logging_obj)
|
||||
return ollama_completion_stream(
|
||||
url=url, api_key=api_key, data=data, logging_obj=logging_obj
|
||||
)
|
||||
|
||||
response = requests.post(
|
||||
url=f"{url}",
|
||||
json=data,
|
||||
)
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
response = requests.post(**_request) # type: ignore
|
||||
if response.status_code != 200:
|
||||
raise OllamaError(status_code=response.status_code, message=response.text)
|
||||
|
||||
|
@ -307,10 +315,16 @@ def get_ollama_response(
|
|||
return model_response
|
||||
|
||||
|
||||
def ollama_completion_stream(url, data, logging_obj):
|
||||
with httpx.stream(
|
||||
url=url, json=data, method="POST", timeout=litellm.request_timeout
|
||||
) as response:
|
||||
def ollama_completion_stream(url, api_key, data, logging_obj):
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
"method": "POST",
|
||||
"timeout": litellm.request_timeout,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
with httpx.stream(**_request) as response:
|
||||
try:
|
||||
if response.status_code != 200:
|
||||
raise OllamaError(
|
||||
|
@ -329,12 +343,20 @@ def ollama_completion_stream(url, data, logging_obj):
|
|||
raise e
|
||||
|
||||
|
||||
async def ollama_async_streaming(url, data, model_response, encoding, logging_obj):
|
||||
async def ollama_async_streaming(
|
||||
url, api_key, data, model_response, encoding, logging_obj
|
||||
):
|
||||
try:
|
||||
client = httpx.AsyncClient()
|
||||
async with client.stream(
|
||||
url=f"{url}", json=data, method="POST", timeout=litellm.request_timeout
|
||||
) as response:
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
"method": "POST",
|
||||
"timeout": litellm.request_timeout,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
async with client.stream(**_request) as response:
|
||||
if response.status_code != 200:
|
||||
raise OllamaError(
|
||||
status_code=response.status_code, message=response.text
|
||||
|
@ -353,13 +375,25 @@ async def ollama_async_streaming(url, data, model_response, encoding, logging_ob
|
|||
|
||||
|
||||
async def ollama_acompletion(
|
||||
url, data, model_response, encoding, logging_obj, function_name
|
||||
url,
|
||||
api_key: Optional[str],
|
||||
data,
|
||||
model_response,
|
||||
encoding,
|
||||
logging_obj,
|
||||
function_name,
|
||||
):
|
||||
data["stream"] = False
|
||||
try:
|
||||
timeout = aiohttp.ClientTimeout(total=litellm.request_timeout) # 10 minutes
|
||||
async with aiohttp.ClientSession(timeout=timeout) as session:
|
||||
resp = await session.post(url, json=data)
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
resp = await session.post(**_request)
|
||||
|
||||
if resp.status != 200:
|
||||
text = await resp.text()
|
||||
|
|
|
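# A rough sketch of the optional api_key handling above: only attach headers when a key
# is provided. I'm assuming the upstream server expects a standard Authorization bearer
# header; the URL and payload below are placeholders.
from typing import Optional

import requests

def post_to_ollama(url: str, data: dict, api_key: Optional[str] = None) -> requests.Response:
    request_kwargs = {"url": url, "json": data}
    if api_key is not None:
        request_kwargs["headers"] = {"Authorization": f"Bearer {api_key}"}
    return requests.post(**request_kwargs)

# post_to_ollama("http://localhost:11434/api/generate", {"model": "llama2", "prompt": "hi"})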
@ -99,9 +99,9 @@ def completion(
|
|||
)
|
||||
else:
|
||||
try:
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["choices"][0]["message"]["content"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["choices"][0]["message"]["content"]
|
||||
)
|
||||
except:
|
||||
raise OobaboogaError(
|
||||
message=json.dumps(completion_response),
|
||||
|
@ -115,7 +115,7 @@ def completion(
|
|||
completion_tokens=completion_response["usage"]["completion_tokens"],
|
||||
total_tokens=completion_response["usage"]["total_tokens"],
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -223,7 +223,7 @@ class OpenAITextCompletionConfig:
|
|||
model_response_object.choices = choice_list
|
||||
|
||||
if "usage" in response_object:
|
||||
model_response_object.usage = response_object["usage"]
|
||||
setattr(model_response_object, "usage", response_object["usage"])
|
||||
|
||||
if "id" in response_object:
|
||||
model_response_object.id = response_object["id"]
|
||||
|
@ -447,6 +447,7 @@ class OpenAIChatCompletion(BaseLLM):
|
|||
)
|
||||
else:
|
||||
openai_aclient = client
|
||||
|
||||
## LOGGING
|
||||
logging_obj.pre_call(
|
||||
input=data["messages"],
|
||||
|
|
|
@ -191,7 +191,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -41,9 +41,9 @@ class PetalsConfig:
|
|||
"""
|
||||
|
||||
max_length: Optional[int] = None
|
||||
max_new_tokens: Optional[
|
||||
int
|
||||
] = litellm.max_tokens # petals requires max tokens to be set
|
||||
max_new_tokens: Optional[int] = (
|
||||
litellm.max_tokens
|
||||
) # petals requires max tokens to be set
|
||||
do_sample: Optional[bool] = None
|
||||
temperature: Optional[float] = None
|
||||
top_k: Optional[int] = None
|
||||
|
@ -203,7 +203,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -3,8 +3,14 @@ import requests, traceback
|
|||
import json, re, xml.etree.ElementTree as ET
|
||||
from jinja2 import Template, exceptions, meta, BaseLoader
|
||||
from jinja2.sandbox import ImmutableSandboxedEnvironment
|
||||
from typing import Optional, Any
|
||||
from typing import List
|
||||
from typing import (
|
||||
Any,
|
||||
List,
|
||||
Mapping,
|
||||
MutableMapping,
|
||||
Optional,
|
||||
Sequence,
|
||||
)
|
||||
import litellm
|
||||
|
||||
|
||||
|
@ -145,6 +151,12 @@ def mistral_api_pt(messages):
|
|||
elif isinstance(m["content"], str):
|
||||
texts = m["content"]
|
||||
new_m = {"role": m["role"], "content": texts}
|
||||
|
||||
if new_m["role"] == "tool" and m.get("name"):
|
||||
new_m["name"] = m["name"]
|
||||
if m.get("tool_calls"):
|
||||
new_m["tool_calls"] = m["tool_calls"]
|
||||
|
||||
new_messages.append(new_m)
|
||||
return new_messages
|
||||
|
||||
|
@ -218,6 +230,26 @@ def phind_codellama_pt(messages):
|
|||
return prompt
|
||||
|
||||
|
||||
known_tokenizer_config = {
|
||||
"mistralai/Mistral-7B-Instruct-v0.1": {
|
||||
"tokenizer": {
|
||||
"chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",
|
||||
"bos_token": "<s>",
|
||||
"eos_token": "</s>",
|
||||
},
|
||||
"status": "success",
|
||||
},
|
||||
"meta-llama/Meta-Llama-3-8B-Instruct": {
|
||||
"tokenizer": {
|
||||
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
|
||||
"bos_token": "<|begin_of_text|>",
|
||||
"eos_token": "",
|
||||
},
|
||||
"status": "success",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def hf_chat_template(model: str, messages: list, chat_template: Optional[Any] = None):
|
||||
# Define Jinja2 environment
|
||||
env = ImmutableSandboxedEnvironment()
|
||||
|
@ -246,20 +278,23 @@ def hf_chat_template(model: str, messages: list, chat_template: Optional[Any] =
|
|||
else:
|
||||
return {"status": "failure"}
|
||||
|
||||
tokenizer_config = _get_tokenizer_config(model)
|
||||
if model in known_tokenizer_config:
|
||||
tokenizer_config = known_tokenizer_config[model]
|
||||
else:
|
||||
tokenizer_config = _get_tokenizer_config(model)
|
||||
if (
|
||||
tokenizer_config["status"] == "failure"
|
||||
or "chat_template" not in tokenizer_config["tokenizer"]
|
||||
):
|
||||
raise Exception("No chat template found")
|
||||
## read the bos token, eos token and chat template from the json
|
||||
tokenizer_config = tokenizer_config["tokenizer"]
|
||||
bos_token = tokenizer_config["bos_token"]
|
||||
eos_token = tokenizer_config["eos_token"]
|
||||
chat_template = tokenizer_config["chat_template"]
|
||||
tokenizer_config = tokenizer_config["tokenizer"] # type: ignore
|
||||
|
||||
bos_token = tokenizer_config["bos_token"] # type: ignore
|
||||
eos_token = tokenizer_config["eos_token"] # type: ignore
|
||||
chat_template = tokenizer_config["chat_template"] # type: ignore
|
||||
try:
|
||||
template = env.from_string(chat_template)
|
||||
template = env.from_string(chat_template) # type: ignore
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
||||
|
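# A minimal sketch of how a cached tokenizer chat_template is rendered above: the
# template is compiled in a sandboxed Jinja2 environment and fed the message list plus
# bos/eos tokens. The template here is a simplified version of the Mistral one from
# known_tokenizer_config; the messages are made up.
from jinja2.sandbox import ImmutableSandboxedEnvironment

chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}"
    "{% endif %}{% endfor %}"
)

env = ImmutableSandboxedEnvironment()
template = env.from_string(chat_template)
prompt = template.render(
    bos_token="<s>",
    eos_token="</s>",
    messages=[
        {"role": "user", "content": "What is LiteLLM?"},
        {"role": "assistant", "content": "A unified LLM client."},
        {"role": "user", "content": "Thanks!"},
    ],
)
print(prompt)
# <s>[INST] What is LiteLLM? [/INST]A unified LLM client.</s> [INST] Thanks! [/INST]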
@ -402,6 +437,35 @@ def format_prompt_togetherai(messages, prompt_format, chat_template):
|
|||
return prompt
|
||||
|
||||
|
||||
### IBM Granite
|
||||
|
||||
|
||||
def ibm_granite_pt(messages: list):
|
||||
"""
|
||||
IBM's Granite models uses the template:
|
||||
<|system|> {system_message} <|user|> {user_message} <|assistant|> {assistant_message}
|
||||
|
||||
See: https://www.ibm.com/docs/en/watsonx-as-a-service?topic=solutions-supported-foundation-models
|
||||
"""
|
||||
return custom_prompt(
|
||||
messages=messages,
|
||||
role_dict={
|
||||
"system": {
|
||||
"pre_message": "<|system|>\n",
|
||||
"post_message": "\n",
|
||||
},
|
||||
"user": {
|
||||
"pre_message": "<|user|>\n",
|
||||
"post_message": "\n",
|
||||
},
|
||||
"assistant": {
|
||||
"pre_message": "<|assistant|>\n",
|
||||
"post_message": "\n",
|
||||
},
|
||||
},
|
||||
).strip()
|
||||
|
||||
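# A tiny approximation of the Granite chat format described above: each role is wrapped
# in <|system|> / <|user|> / <|assistant|> markers, one turn per block. The conversation
# below is hypothetical and this stands in for the custom_prompt call in the diff.
def granite_prompt(messages: list) -> str:
    markers = {"system": "<|system|>", "user": "<|user|>", "assistant": "<|assistant|>"}
    return "\n".join(f"{markers[m['role']]}\n{m['content']}" for m in messages).strip()

print(granite_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize LiteLLM in one line."},
]))
# <|system|>
# You are a helpful assistant.
# <|user|>
# Summarize LiteLLM in one line.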
|
||||
### ANTHROPIC ###
|
||||
|
||||
|
||||
|
@ -466,10 +530,11 @@ def construct_tool_use_system_prompt(
|
|||
): # from https://github.com/anthropics/anthropic-cookbook/blob/main/function_calling/function_calling.ipynb
|
||||
tool_str_list = []
|
||||
for tool in tools:
|
||||
tool_function = get_attribute_or_key(tool, "function")
|
||||
tool_str = construct_format_tool_for_claude_prompt(
|
||||
tool["function"]["name"],
|
||||
tool["function"].get("description", ""),
|
||||
tool["function"].get("parameters", {}),
|
||||
get_attribute_or_key(tool_function, "name"),
|
||||
get_attribute_or_key(tool_function, "description", ""),
|
||||
get_attribute_or_key(tool_function, "parameters", {}),
|
||||
)
|
||||
tool_str_list.append(tool_str)
|
||||
tool_use_system_prompt = (
|
||||
|
@ -593,7 +658,8 @@ def convert_to_anthropic_tool_result_xml(message: dict) -> str:
|
|||
</function_results>
|
||||
"""
|
||||
name = message.get("name")
|
||||
content = message.get("content")
|
||||
content = message.get("content", "")
|
||||
content = content.replace("<", "<").replace(">", ">").replace("&", "&")
|
||||
|
||||
# We can't determine from openai message format whether it's a successful or
|
||||
# error call result so default to the successful result template
|
||||
|
@ -614,13 +680,15 @@ def convert_to_anthropic_tool_result_xml(message: dict) -> str:
|
|||
def convert_to_anthropic_tool_invoke_xml(tool_calls: list) -> str:
|
||||
invokes = ""
|
||||
for tool in tool_calls:
|
||||
if tool["type"] != "function":
|
||||
if get_attribute_or_key(tool, "type") != "function":
|
||||
continue
|
||||
|
||||
tool_name = tool["function"]["name"]
|
||||
tool_function = get_attribute_or_key(tool, "function")
|
||||
tool_name = get_attribute_or_key(tool_function, "name")
|
||||
tool_arguments = get_attribute_or_key(tool_function, "arguments")
|
||||
parameters = "".join(
|
||||
f"<{param}>{val}</{param}>\n"
|
||||
for param, val in json.loads(tool["function"]["arguments"]).items()
|
||||
for param, val in json.loads(tool_arguments).items()
|
||||
)
|
||||
invokes += (
|
||||
"<invoke>\n"
|
||||
|
@ -674,7 +742,7 @@ def anthropic_messages_pt_xml(messages: list):
|
|||
{
|
||||
"type": "text",
|
||||
"text": (
|
||||
convert_to_anthropic_tool_result(messages[msg_i])
|
||||
convert_to_anthropic_tool_result_xml(messages[msg_i])
|
||||
if messages[msg_i]["role"] == "tool"
|
||||
else messages[msg_i]["content"]
|
||||
),
|
||||
|
@ -695,7 +763,7 @@ def anthropic_messages_pt_xml(messages: list):
|
|||
if messages[msg_i].get(
|
||||
"tool_calls", []
|
||||
): # support assistant tool invoke convertion
|
||||
assistant_text += convert_to_anthropic_tool_invoke( # type: ignore
|
||||
assistant_text += convert_to_anthropic_tool_invoke_xml( # type: ignore
|
||||
messages[msg_i]["tool_calls"]
|
||||
)
|
||||
|
||||
|
@ -807,12 +875,18 @@ def convert_to_anthropic_tool_invoke(tool_calls: list) -> list:
|
|||
anthropic_tool_invoke = [
|
||||
{
|
||||
"type": "tool_use",
|
||||
"id": tool["id"],
|
||||
"name": tool["function"]["name"],
|
||||
"input": json.loads(tool["function"]["arguments"]),
|
||||
"id": get_attribute_or_key(tool, "id"),
|
||||
"name": get_attribute_or_key(
|
||||
get_attribute_or_key(tool, "function"), "name"
|
||||
),
|
||||
"input": json.loads(
|
||||
get_attribute_or_key(
|
||||
get_attribute_or_key(tool, "function"), "arguments"
|
||||
)
|
||||
),
|
||||
}
|
||||
for tool in tool_calls
|
||||
if tool["type"] == "function"
|
||||
if get_attribute_or_key(tool, "type") == "function"
|
||||
]
|
||||
|
||||
return anthropic_tool_invoke
|
||||
|
@ -978,6 +1052,30 @@ def get_system_prompt(messages):
|
|||
return system_prompt, messages
|
||||
|
||||
|
||||
def convert_to_documents(
|
||||
observations: Any,
|
||||
) -> List[MutableMapping]:
|
||||
"""Converts observations into a 'document' dict"""
|
||||
documents: List[MutableMapping] = []
|
||||
if isinstance(observations, str):
|
||||
# strings are turned into a key/value pair and a key of 'output' is added.
|
||||
observations = [{"output": observations}]
|
||||
elif isinstance(observations, Mapping):
|
||||
# single mappings are transformed into a list to simplify the rest of the code.
|
||||
observations = [observations]
|
||||
elif not isinstance(observations, Sequence):
|
||||
# all other types are turned into a key/value pair within a list
|
||||
observations = [{"output": observations}]
|
||||
|
||||
for doc in observations:
|
||||
if not isinstance(doc, Mapping):
|
||||
# types that aren't Mapping are turned into a key/value pair.
|
||||
doc = {"output": doc}
|
||||
documents.append(doc)
|
||||
|
||||
return documents
|
||||
|
||||
|
||||
def convert_openai_message_to_cohere_tool_result(message):
|
||||
"""
|
||||
OpenAI message with a tool result looks like:
|
||||
|
@ -1019,7 +1117,7 @@ def convert_openai_message_to_cohere_tool_result(message):
|
|||
"parameters": {"location": "San Francisco, CA"},
|
||||
"generation_id": tool_call_id,
|
||||
},
|
||||
"outputs": [content],
|
||||
"outputs": convert_to_documents(content),
|
||||
}
|
||||
return cohere_tool_result
|
||||
|
||||
|
@ -1032,8 +1130,9 @@ def cohere_message_pt(messages: list):
|
|||
if message["role"] == "tool":
|
||||
tool_result = convert_openai_message_to_cohere_tool_result(message)
|
||||
tool_results.append(tool_result)
|
||||
else:
|
||||
prompt += message["content"]
|
||||
elif message.get("content"):
|
||||
prompt += message["content"] + "\n\n"
|
||||
prompt = prompt.rstrip()
|
||||
return prompt, tool_results
|
||||
|
||||
|
||||
|
@ -1107,12 +1206,6 @@ def _gemini_vision_convert_messages(messages: list):
|
|||
Returns:
|
||||
tuple: A tuple containing the prompt (a string) and the processed images (a list of objects representing the images).
|
||||
"""
|
||||
try:
|
||||
from PIL import Image
|
||||
except:
|
||||
raise Exception(
|
||||
"gemini image conversion failed please run `pip install Pillow`"
|
||||
)
|
||||
|
||||
try:
|
||||
# given messages for gpt-4 vision, convert them for gemini
|
||||
|
@ -1139,6 +1232,12 @@ def _gemini_vision_convert_messages(messages: list):
|
|||
image = _load_image_from_url(img)
|
||||
processed_images.append(image)
|
||||
else:
|
||||
try:
|
||||
from PIL import Image
|
||||
except:
|
||||
raise Exception(
|
||||
"gemini image conversion failed please run `pip install Pillow`"
|
||||
)
|
||||
# Case 2: Image filepath (e.g. temp.jpeg) given
|
||||
image = Image.open(img)
|
||||
processed_images.append(image)
|
||||
|
@ -1306,18 +1405,62 @@ def prompt_factory(
|
|||
return anthropic_pt(messages=messages)
|
||||
elif "mistral." in model:
|
||||
return mistral_instruct_pt(messages=messages)
|
||||
elif "llama2" in model and "chat" in model:
|
||||
return llama_2_chat_pt(messages=messages)
|
||||
elif "llama3" in model and "instruct" in model:
|
||||
return hf_chat_template(
|
||||
model="meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
messages=messages,
|
||||
)
|
||||
|
||||
elif custom_llm_provider == "clarifai":
|
||||
if "claude" in model:
|
||||
return anthropic_pt(messages=messages)
|
||||
|
||||
elif custom_llm_provider == "perplexity":
|
||||
for message in messages:
|
||||
message.pop("name", None)
|
||||
return messages
|
||||
elif custom_llm_provider == "azure_text":
|
||||
return azure_text_pt(messages=messages)
|
||||
    elif custom_llm_provider == "watsonx":
        if "granite" in model and "chat" in model:
            # granite-13b-chat-v1 and granite-13b-chat-v2 use a specific prompt template
            return ibm_granite_pt(messages=messages)
        elif "ibm-mistral" in model and "instruct" in model:
            # models like ibm-mistral/mixtral-8x7b-instruct-v01-q use the mistral instruct prompt template
            return mistral_instruct_pt(messages=messages)
        elif "meta-llama/llama-3" in model and "instruct" in model:
            # https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
            return custom_prompt(
                role_dict={
                    "system": {
                        "pre_message": "<|start_header_id|>system<|end_header_id|>\n",
                        "post_message": "<|eot_id|>",
                    },
                    "user": {
                        "pre_message": "<|start_header_id|>user<|end_header_id|>\n",
                        "post_message": "<|eot_id|>",
                    },
                    "assistant": {
                        "pre_message": "<|start_header_id|>assistant<|end_header_id|>\n",
                        "post_message": "<|eot_id|>",
                    },
                },
                messages=messages,
                initial_prompt_value="<|begin_of_text|>",
                final_prompt_value="<|start_header_id|>assistant<|end_header_id|>\n",
            )
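            # Rendered prompt sketch for a short system + user exchange (illustrative example,
            # not output captured from a real call):
            #   <|begin_of_text|><|start_header_id|>system<|end_header_id|>
            #   You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
            #   Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>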
try:
|
||||
if "meta-llama/llama-2" in model and "chat" in model:
|
||||
return llama_2_chat_pt(messages=messages)
|
||||
elif (
|
||||
"meta-llama/llama-3" in model or "meta-llama-3" in model
|
||||
) and "instruct" in model:
|
||||
return hf_chat_template(
|
||||
model="meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
messages=messages,
|
||||
)
|
||||
elif (
|
||||
"tiiuae/falcon" in model
|
||||
): # Note: for the instruct models, it's best to use a User: .., Assistant:.. approach in your prompt template.
|
||||
|
@ -1358,3 +1501,9 @@ def prompt_factory(
|
|||
return default_pt(
|
||||
messages=messages
|
||||
) # default that covers Bloom, T-5, any non-chat tuned model (e.g. base Llama2)
|
||||
|
||||
|
||||
def get_attribute_or_key(tool_or_function, attribute, default=None):
    if hasattr(tool_or_function, attribute):
        return getattr(tool_or_function, attribute)
    return tool_or_function.get(attribute, default)
|
||||
|
|
|
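get_attribute_or_key is what lets the tool-call converters above accept tool calls either as plain dicts or as attribute-style SDK objects. A minimal sketch of the two paths, using an illustrative stand-in class:

    get_attribute_or_key({"type": "function"}, "type")  # dict path -> "function"

    class _ToolCall:  # illustrative stand-in for an SDK object
        type = "function"

    get_attribute_or_key(_ToolCall(), "type")  # attribute path -> "function"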
@ -112,10 +112,16 @@ def start_prediction(
|
|||
}
|
||||
|
||||
initial_prediction_data = {
|
||||
"version": version_id,
|
||||
"input": input_data,
|
||||
}
|
||||
|
||||
if ":" in version_id and len(version_id) > 64:
|
||||
model_parts = version_id.split(":")
|
||||
if (
|
||||
len(model_parts) > 1 and len(model_parts[1]) == 64
|
||||
): ## checks if the model name carries a 64-character version hash - e.g. "meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3"
|
||||
initial_prediction_data["version"] = model_parts[1]
|
||||
|
||||
## LOGGING
|
||||
logging_obj.pre_call(
|
||||
input=input_data["prompt"],
|
||||
|
@ -307,9 +313,7 @@ def completion(
|
|||
result, logs = handle_prediction_response(
|
||||
prediction_url, api_key, print_verbose
|
||||
)
|
||||
model_response["ended"] = (
|
||||
time.time()
|
||||
) # for pricing this must remain right after calling api
|
||||
|
||||
## LOGGING
|
||||
logging_obj.post_call(
|
||||
input=prompt,
|
||||
|
@ -345,7 +349,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -399,7 +399,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
@ -617,7 +617,7 @@ async def async_completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -2,6 +2,7 @@
|
|||
Deprecated. We now do together ai calls via the openai client.
|
||||
Reference: https://docs.together.ai/docs/openai-api-compatibility
|
||||
"""
|
||||
|
||||
import os, types
|
||||
import json
|
||||
from enum import Enum
|
||||
|
@ -225,7 +226,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -22,6 +22,35 @@ class VertexAIError(Exception):
|
|||
) # Call the base class constructor with the parameters it needs
|
||||
|
||||
|
||||
class ExtendedGenerationConfig(dict):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
frequency_penalty: Optional[float] = None,
|
||||
presence_penalty: Optional[float] = None,
|
||||
):
|
||||
super().__init__(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
frequency_penalty=frequency_penalty,
|
||||
presence_penalty=presence_penalty,
|
||||
)
|
||||
|
||||
|
||||
class VertexAIConfig:
|
||||
"""
|
||||
Reference: https://cloud.google.com/vertex-ai/docs/generative-ai/chat/test-chat-prompts
|
||||
|
@ -43,6 +72,10 @@ class VertexAIConfig:
|
|||
|
||||
- `stop_sequences` (List[str]): The set of character sequences (up to 5) that will stop output generation. If specified, the API will stop at the first appearance of a stop sequence. The stop sequence will not be included as part of the response.
|
||||
|
||||
- `frequency_penalty` (float): This parameter is used to penalize the model from repeating the same output. The default value is 0.0.
|
||||
|
||||
- `presence_penalty` (float): This parameter is used to penalize the model from generating the same output as the input. The default value is 0.0.
|
||||
|
||||
Note: Please make sure to modify the default parameters as required for your use case.
|
||||
"""
|
||||
|
||||
|
@ -53,6 +86,8 @@ class VertexAIConfig:
|
|||
response_mime_type: Optional[str] = None
|
||||
candidate_count: Optional[int] = None
|
||||
stop_sequences: Optional[list] = None
|
||||
frequency_penalty: Optional[float] = None
|
||||
presence_penalty: Optional[float] = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
|
@ -63,6 +98,8 @@ class VertexAIConfig:
|
|||
response_mime_type: Optional[str] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
stop_sequences: Optional[list] = None,
|
||||
frequency_penalty: Optional[float] = None,
|
||||
presence_penalty: Optional[float] = None,
|
||||
) -> None:
|
||||
locals_ = locals()
|
||||
for key, value in locals_.items():
|
||||
|
@ -106,7 +143,9 @@ class VertexAIConfig:
|
|||
optional_params["temperature"] = value
|
||||
if param == "top_p":
|
||||
optional_params["top_p"] = value
|
||||
if param == "stream":
|
||||
if (
|
||||
param == "stream" and value == True
|
||||
): # sending stream = False, can cause it to get passed unchecked and raise issues
|
||||
optional_params["stream"] = value
|
||||
if param == "n":
|
||||
optional_params["candidate_count"] = value
|
||||
|
@ -119,6 +158,10 @@ class VertexAIConfig:
|
|||
optional_params["max_output_tokens"] = value
|
||||
if param == "response_format" and value["type"] == "json_object":
|
||||
optional_params["response_mime_type"] = "application/json"
|
||||
if param == "frequency_penalty":
|
||||
optional_params["frequency_penalty"] = value
|
||||
if param == "presence_penalty":
|
||||
optional_params["presence_penalty"] = value
|
||||
if param == "tools" and isinstance(value, list):
|
||||
from vertexai.preview import generative_models
|
||||
|
||||
|
@ -141,6 +184,20 @@ class VertexAIConfig:
|
|||
pass
|
||||
return optional_params
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
"""
|
||||
Common auth params across bedrock/vertex_ai/azure/watsonx
|
||||
"""
|
||||
return {"project": "vertex_project", "region_name": "vertex_location"}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
mapped_params = self.get_mapped_special_auth_params()
|
||||
|
||||
for param, value in non_default_params.items():
|
||||
if param in mapped_params:
|
||||
optional_params[mapped_params[param]] = value
|
||||
return optional_params
|
||||
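# Illustrative effect of the mapping above (argument names follow get_mapped_special_auth_params;
# the values are placeholders):
#   completion(..., project="my-gcp-project", region_name="us-central1")
#   -> optional_params["vertex_project"] = "my-gcp-project"
#   -> optional_params["vertex_location"] = "us-central1"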
|
||||
|
||||
import asyncio
|
||||
|
||||
|
@ -184,8 +241,7 @@ def _get_image_bytes_from_url(image_url: str) -> bytes:
|
|||
image_bytes = response.content
|
||||
return image_bytes
|
||||
except requests.exceptions.RequestException as e:
|
||||
# Handle any request exceptions (e.g., connection error, timeout)
|
||||
return b"" # Return an empty bytes object or handle the error as needed
|
||||
raise Exception(f"An exception occurs with this image - {str(e)}")
|
||||
|
||||
|
||||
def _load_image_from_url(image_url: str):
|
||||
|
@ -206,7 +262,8 @@ def _load_image_from_url(image_url: str):
|
|||
)
|
||||
|
||||
image_bytes = _get_image_bytes_from_url(image_url)
|
||||
return Image.from_bytes(image_bytes)
|
||||
|
||||
return Image.from_bytes(data=image_bytes)
|
||||
|
||||
|
||||
def _gemini_vision_convert_messages(messages: list):
|
||||
|
@ -363,42 +420,6 @@ def completion(
|
|||
from google.cloud.aiplatform_v1beta1.types import content as gapic_content_types # type: ignore
|
||||
import google.auth # type: ignore
|
||||
|
||||
class ExtendedGenerationConfig(GenerationConfig):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
):
|
||||
args_spec = inspect.getfullargspec(gapic_content_types.GenerationConfig)
|
||||
|
||||
if "response_mime_type" in args_spec.args:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
)
|
||||
else:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
)
|
||||
|
||||
## Load credentials with the correct quota project ref: https://github.com/googleapis/python-aiplatform/issues/2557#issuecomment-1709284744
|
||||
print_verbose(
|
||||
f"VERTEX AI: vertex_project={vertex_project}; vertex_location={vertex_location}"
|
||||
|
@ -522,6 +543,7 @@ def completion(
|
|||
"instances": instances,
|
||||
"vertex_location": vertex_location,
|
||||
"vertex_project": vertex_project,
|
||||
"safety_settings": safety_settings,
|
||||
**optional_params,
|
||||
}
|
||||
if optional_params.get("stream", False) is True:
|
||||
|
@ -536,8 +558,9 @@ def completion(
|
|||
tools = optional_params.pop("tools", None)
|
||||
prompt, images = _gemini_vision_convert_messages(messages=messages)
|
||||
content = [prompt] + images
|
||||
if "stream" in optional_params and optional_params["stream"] == True:
|
||||
stream = optional_params.pop("stream")
|
||||
stream = optional_params.pop("stream", False)
|
||||
if stream == True:
|
||||
|
||||
request_str += f"response = llm_model.generate_content({content}, generation_config=GenerationConfig(**{optional_params}), safety_settings={safety_settings}, stream={stream})\n"
|
||||
logging_obj.pre_call(
|
||||
input=prompt,
|
||||
|
@ -550,12 +573,12 @@ def completion(
|
|||
|
||||
model_response = llm_model.generate_content(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
stream=True,
|
||||
tools=tools,
|
||||
)
|
||||
optional_params["stream"] = True
|
||||
|
||||
return model_response
|
||||
|
||||
request_str += f"response = llm_model.generate_content({content})\n"
|
||||
|
@ -572,7 +595,7 @@ def completion(
|
|||
## LLM Call
|
||||
response = llm_model.generate_content(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
tools=tools,
|
||||
)
|
||||
|
@ -627,7 +650,7 @@ def completion(
|
|||
},
|
||||
)
|
||||
model_response = chat.send_message_streaming(prompt, **optional_params)
|
||||
optional_params["stream"] = True
|
||||
|
||||
return model_response
|
||||
|
||||
request_str += f"chat.send_message({prompt}, **{optional_params}).text\n"
|
||||
|
@ -659,7 +682,7 @@ def completion(
|
|||
},
|
||||
)
|
||||
model_response = llm_model.predict_streaming(prompt, **optional_params)
|
||||
optional_params["stream"] = True
|
||||
|
||||
return model_response
|
||||
|
||||
request_str += f"llm_model.predict({prompt}, **{optional_params}).text\n"
|
||||
|
@ -784,7 +807,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
except Exception as e:
|
||||
raise VertexAIError(status_code=500, message=str(e))
|
||||
|
@ -805,55 +828,18 @@ async def async_completion(
|
|||
instances=None,
|
||||
vertex_project=None,
|
||||
vertex_location=None,
|
||||
safety_settings=None,
|
||||
**optional_params,
|
||||
):
|
||||
"""
|
||||
Add support for acompletion calls for gemini-pro
|
||||
"""
|
||||
try:
|
||||
from vertexai.preview.generative_models import GenerationConfig
|
||||
from google.cloud.aiplatform_v1beta1.types import content as gapic_content_types # type: ignore
|
||||
|
||||
class ExtendedGenerationConfig(GenerationConfig):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
):
|
||||
args_spec = inspect.getfullargspec(gapic_content_types.GenerationConfig)
|
||||
|
||||
if "response_mime_type" in args_spec.args:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
)
|
||||
else:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
)
|
||||
|
||||
if mode == "vision":
|
||||
print_verbose("\nMaking VertexAI Gemini Pro Vision Call")
|
||||
print_verbose("\nMaking VertexAI Gemini Pro/Vision Call")
|
||||
print_verbose(f"\nProcessing input messages = {messages}")
|
||||
tools = optional_params.pop("tools", None)
|
||||
stream = optional_params.pop("stream", False)
|
||||
|
||||
prompt, images = _gemini_vision_convert_messages(messages=messages)
|
||||
content = [prompt] + images
|
||||
|
@ -870,9 +856,11 @@ async def async_completion(
|
|||
)
|
||||
|
||||
## LLM Call
|
||||
# print(f"final content: {content}")
|
||||
response = await llm_model._generate_content_async(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
tools=tools,
|
||||
)
|
||||
|
||||
|
@ -1030,7 +1018,7 @@ async def async_completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
except Exception as e:
|
||||
raise VertexAIError(status_code=500, message=str(e))
|
||||
|
@ -1051,50 +1039,12 @@ async def async_streaming(
|
|||
instances=None,
|
||||
vertex_project=None,
|
||||
vertex_location=None,
|
||||
safety_settings=None,
|
||||
**optional_params,
|
||||
):
|
||||
"""
|
||||
Add support for async streaming calls for gemini-pro
|
||||
"""
|
||||
from vertexai.preview.generative_models import GenerationConfig
|
||||
from google.cloud.aiplatform_v1beta1.types import content as gapic_content_types # type: ignore
|
||||
|
||||
class ExtendedGenerationConfig(GenerationConfig):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
):
|
||||
args_spec = inspect.getfullargspec(gapic_content_types.GenerationConfig)
|
||||
|
||||
if "response_mime_type" in args_spec.args:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
)
|
||||
else:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
)
|
||||
|
||||
if mode == "vision":
|
||||
stream = optional_params.pop("stream")
|
||||
tools = optional_params.pop("tools", None)
|
||||
|
@ -1115,11 +1065,11 @@ async def async_streaming(
|
|||
|
||||
response = await llm_model._generate_content_streaming_async(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
tools=tools,
|
||||
)
|
||||
optional_params["stream"] = True
|
||||
optional_params["tools"] = tools
|
||||
|
||||
elif mode == "chat":
|
||||
chat = llm_model.start_chat()
|
||||
optional_params.pop(
|
||||
|
@ -1138,7 +1088,7 @@ async def async_streaming(
|
|||
},
|
||||
)
|
||||
response = chat.send_message_streaming_async(prompt, **optional_params)
|
||||
optional_params["stream"] = True
|
||||
|
||||
elif mode == "text":
|
||||
optional_params.pop(
|
||||
"stream", None
|
||||
|
|
|
@ -349,7 +349,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
except Exception as e:
|
||||
raise VertexAIError(status_code=500, message=str(e))
|
||||
|
@ -422,7 +422,7 @@ async def async_completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -104,7 +104,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
@ -186,7 +186,7 @@ def batch_completions(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
final_outputs.append(model_response)
|
||||
return final_outputs
|
||||
|
||||
|
|
609 litellm/llms/watsonx.py Normal file
|
@ -0,0 +1,609 @@
|
|||
from enum import Enum
|
||||
import json, types, time # noqa: E401
|
||||
from contextlib import contextmanager
|
||||
from typing import Callable, Dict, Optional, Any, Union, List
|
||||
|
||||
import httpx
|
||||
import requests
|
||||
import litellm
|
||||
from litellm.utils import ModelResponse, get_secret, Usage
|
||||
|
||||
from .base import BaseLLM
|
||||
from .prompt_templates import factory as ptf
|
||||
|
||||
|
||||
class WatsonXAIError(Exception):
    def __init__(self, status_code, message, url: Optional[str] = None):
        self.status_code = status_code
        self.message = message
        url = url or "https://us-south.ml.cloud.ibm.com"
        self.request = httpx.Request(method="POST", url=url)
        self.response = httpx.Response(status_code=status_code, request=self.request)
        super().__init__(
            self.message
        )  # Call the base class constructor with the parameters it needs
|
||||
|
||||
|
||||
class IBMWatsonXAIConfig:
    """
    Reference: https://cloud.ibm.com/apidocs/watsonx-ai#text-generation
    (See ibm_watsonx_ai.metanames.GenTextParamsMetaNames for a list of all available params)

    Supported params for all available watsonx.ai foundation models.

    - `decoding_method` (str): One of "greedy" or "sample"

    - `temperature` (float): Sets the model temperature for sampling - not available when decoding_method='greedy'.

    - `max_new_tokens` (integer): Maximum number of tokens to generate.

    - `min_new_tokens` (integer): Minimum number of tokens to generate.

    - `length_penalty` (dict): A dictionary with keys "decay_factor" and "start_index".

    - `stop_sequences` (string[]): list of strings to use as stop sequences.

    - `top_k` (integer): top k for sampling - not available when decoding_method='greedy'.

    - `top_p` (float): top p for sampling - not available when decoding_method='greedy'.

    - `repetition_penalty` (float): token repetition penalty during text generation.

    - `truncate_input_tokens` (integer): Truncate input tokens to this length.

    - `include_stop_sequences` (bool): If True, the stop sequence will be included at the end of the generated text in the case of a match.

    - `return_options` (dict): A dictionary of options to return. Options include "input_text", "generated_tokens", "input_tokens", "token_ranks". Values are boolean.

    - `random_seed` (integer): Random seed for text generation.

    - `moderations` (dict): Dictionary of properties that control the moderations, for use cases such as hate and profanity (HAP) and PII filtering.

    - `stream` (bool): If True, the model will return a stream of responses.
    """
|
||||
|
||||
decoding_method: Optional[str] = "sample"
|
||||
temperature: Optional[float] = None
|
||||
max_new_tokens: Optional[int] = None # litellm.max_tokens
|
||||
min_new_tokens: Optional[int] = None
|
||||
length_penalty: Optional[dict] = None # e.g {"decay_factor": 2.5, "start_index": 5}
|
||||
stop_sequences: Optional[List[str]] = None # e.g ["}", ")", "."]
|
||||
top_k: Optional[int] = None
|
||||
top_p: Optional[float] = None
|
||||
repetition_penalty: Optional[float] = None
|
||||
truncate_input_tokens: Optional[int] = None
|
||||
include_stop_sequences: Optional[bool] = False
|
||||
return_options: Optional[Dict[str, bool]] = None
|
||||
random_seed: Optional[int] = None # e.g 42
|
||||
moderations: Optional[dict] = None
|
||||
stream: Optional[bool] = False
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
decoding_method: Optional[str] = None,
|
||||
temperature: Optional[float] = None,
|
||||
max_new_tokens: Optional[int] = None,
|
||||
min_new_tokens: Optional[int] = None,
|
||||
length_penalty: Optional[dict] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
top_k: Optional[int] = None,
|
||||
top_p: Optional[float] = None,
|
||||
repetition_penalty: Optional[float] = None,
|
||||
truncate_input_tokens: Optional[int] = None,
|
||||
include_stop_sequences: Optional[bool] = None,
|
||||
return_options: Optional[dict] = None,
|
||||
random_seed: Optional[int] = None,
|
||||
moderations: Optional[dict] = None,
|
||||
stream: Optional[bool] = None,
|
||||
**kwargs,
|
||||
) -> None:
|
||||
locals_ = locals()
|
||||
for key, value in locals_.items():
|
||||
if key != "self" and value is not None:
|
||||
setattr(self.__class__, key, value)
|
||||
|
||||
@classmethod
|
||||
def get_config(cls):
|
||||
return {
|
||||
k: v
|
||||
for k, v in cls.__dict__.items()
|
||||
if not k.startswith("__")
|
||||
and not isinstance(
|
||||
v,
|
||||
(
|
||||
types.FunctionType,
|
||||
types.BuiltinFunctionType,
|
||||
classmethod,
|
||||
staticmethod,
|
||||
),
|
||||
)
|
||||
and v is not None
|
||||
}
|
||||
|
||||
def get_supported_openai_params(self):
|
||||
return [
|
||||
"temperature", # equivalent to temperature
|
||||
"max_tokens", # equivalent to max_new_tokens
|
||||
"top_p", # equivalent to top_p
|
||||
"frequency_penalty", # equivalent to repetition_penalty
|
||||
"stop", # equivalent to stop_sequences
|
||||
"seed", # equivalent to random_seed
|
||||
"stream", # equivalent to stream
|
||||
]
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
"""
|
||||
Common auth params across bedrock/vertex_ai/azure/watsonx
|
||||
"""
|
||||
return {
|
||||
"project": "watsonx_project",
|
||||
"region_name": "watsonx_region_name",
|
||||
"token": "watsonx_token",
|
||||
}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
mapped_params = self.get_mapped_special_auth_params()
|
||||
|
||||
for param, value in non_default_params.items():
|
||||
if param in mapped_params:
|
||||
optional_params[mapped_params[param]] = value
|
||||
return optional_params
|
||||
|
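# Illustrative call using the OpenAI-style params listed above (the model id and values are
# placeholders); they are translated to the watsonx.ai names noted in the inline comments:
#   litellm.completion(
#       model="watsonx/ibm/granite-13b-chat-v2",
#       messages=[{"role": "user", "content": "hi"}],
#       max_tokens=200,   # -> max_new_tokens
#       stop=["\n\n"],    # -> stop_sequences
#       seed=42,          # -> random_seed
#   )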
||||
|
||||
def convert_messages_to_prompt(model, messages, provider, custom_prompt_dict):
|
||||
# handle anthropic prompts and amazon titan prompts
|
||||
if model in custom_prompt_dict:
|
||||
# check if the model has a registered custom prompt
|
||||
model_prompt_dict = custom_prompt_dict[model]
|
||||
prompt = ptf.custom_prompt(
|
||||
messages=messages,
|
||||
role_dict=model_prompt_dict.get(
|
||||
"role_dict", model_prompt_dict.get("roles")
|
||||
),
|
||||
initial_prompt_value=model_prompt_dict.get("initial_prompt_value", ""),
|
||||
final_prompt_value=model_prompt_dict.get("final_prompt_value", ""),
|
||||
bos_token=model_prompt_dict.get("bos_token", ""),
|
||||
eos_token=model_prompt_dict.get("eos_token", ""),
|
||||
)
|
||||
return prompt
|
||||
elif provider == "ibm":
|
||||
prompt = ptf.prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="watsonx"
|
||||
)
|
||||
elif provider == "ibm-mistralai":
|
||||
prompt = ptf.mistral_instruct_pt(messages=messages)
|
||||
else:
|
||||
prompt = ptf.prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="watsonx"
|
||||
)
|
||||
return prompt
|
||||
|
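# Routing sketch for the helper above (model ids are illustrative; provider is the first path
# segment of the model id, as computed in IBMWatsonXAI.completion):
#   "ibm/granite-13b-chat-v2"          -> prompt_factory(..., custom_llm_provider="watsonx")
#   "ibm-mistralai/<model>"            -> mistral_instruct_pt(...)
#   model registered in custom_prompt_dict -> custom_prompt(...) with the registered template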
||||
|
||||
class WatsonXAIEndpoint(str, Enum):
    TEXT_GENERATION = "/ml/v1/text/generation"
    TEXT_GENERATION_STREAM = "/ml/v1/text/generation_stream"
    DEPLOYMENT_TEXT_GENERATION = "/ml/v1/deployments/{deployment_id}/text/generation"
    DEPLOYMENT_TEXT_GENERATION_STREAM = (
        "/ml/v1/deployments/{deployment_id}/text/generation_stream"
    )
    EMBEDDINGS = "/ml/v1/text/embeddings"
    PROMPTS = "/ml/v1/prompts"
|
||||
|
||||
|
||||
class IBMWatsonXAI(BaseLLM):
|
||||
"""
|
||||
Class to interface with IBM Watsonx.ai API for text generation and embeddings.
|
||||
|
||||
Reference: https://cloud.ibm.com/apidocs/watsonx-ai
|
||||
"""
|
||||
|
||||
api_version = "2024-03-13"
|
||||
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def _prepare_text_generation_req(
|
||||
self,
|
||||
model_id: str,
|
||||
prompt: str,
|
||||
stream: bool,
|
||||
optional_params: dict,
|
||||
print_verbose: Optional[Callable] = None,
|
||||
) -> dict:
|
||||
"""
|
||||
Get the request parameters for text generation.
|
||||
"""
|
||||
api_params = self._get_api_params(optional_params, print_verbose=print_verbose)
|
||||
# build auth headers
|
||||
api_token = api_params.get("token")
|
||||
|
||||
headers = {
|
||||
"Authorization": f"Bearer {api_token}",
|
||||
"Content-Type": "application/json",
|
||||
"Accept": "application/json",
|
||||
}
|
||||
extra_body_params = optional_params.pop("extra_body", {})
|
||||
optional_params.update(extra_body_params)
|
||||
# init the payload to the text generation call
|
||||
payload = {
|
||||
"input": prompt,
|
||||
"moderations": optional_params.pop("moderations", {}),
|
||||
"parameters": optional_params,
|
||||
}
|
||||
request_params = dict(version=api_params["api_version"])
|
||||
# text generation endpoint deployment or model / stream or not
|
||||
if model_id.startswith("deployment/"):
|
||||
# deployment models are passed in as 'deployment/<deployment_id>'
|
||||
if api_params.get("space_id") is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
url=api_params["url"],
|
||||
message="Error: space_id is required for models called using the 'deployment/' endpoint. Pass in the space_id as a parameter or set it in the WX_SPACE_ID environment variable.",
|
||||
)
|
||||
deployment_id = "/".join(model_id.split("/")[1:])
|
||||
endpoint = (
|
||||
WatsonXAIEndpoint.DEPLOYMENT_TEXT_GENERATION_STREAM.value
|
||||
if stream
|
||||
else WatsonXAIEndpoint.DEPLOYMENT_TEXT_GENERATION.value
|
||||
)
|
||||
endpoint = endpoint.format(deployment_id=deployment_id)
|
||||
else:
|
||||
payload["model_id"] = model_id
|
||||
payload["project_id"] = api_params["project_id"]
|
||||
endpoint = (
|
||||
WatsonXAIEndpoint.TEXT_GENERATION_STREAM
|
||||
if stream
|
||||
else WatsonXAIEndpoint.TEXT_GENERATION
|
||||
)
|
||||
url = api_params["url"].rstrip("/") + endpoint
|
||||
return dict(
|
||||
method="POST", url=url, headers=headers, json=payload, params=request_params
|
||||
)
|
||||
|
||||
def _get_api_params(
|
||||
self, params: dict, print_verbose: Optional[Callable] = None
|
||||
) -> dict:
|
||||
"""
|
||||
Find watsonx.ai credentials in the params or environment variables and return them (url, api key/token, project id, etc.) for use in API requests.
|
||||
"""
|
||||
# Load auth variables from params
|
||||
url = params.pop("url", params.pop("api_base", params.pop("base_url", None)))
|
||||
api_key = params.pop("apikey", None)
|
||||
token = params.pop("token", None)
|
||||
project_id = params.pop(
|
||||
"project_id", params.pop("watsonx_project", None)
|
||||
) # watsonx.ai project_id - allow 'watsonx_project' to be consistent with how vertex project implementation works -> reduce provider-specific params
|
||||
space_id = params.pop("space_id", None) # watsonx.ai deployment space_id
|
||||
region_name = params.pop("region_name", params.pop("region", None))
|
||||
if region_name is None:
|
||||
region_name = params.pop(
|
||||
"watsonx_region_name", params.pop("watsonx_region", None)
|
||||
) # consistent with how vertex ai + aws regions are accepted
|
||||
wx_credentials = params.pop(
|
||||
"wx_credentials",
|
||||
params.pop(
|
||||
"watsonx_credentials", None
|
||||
), # follow {provider}_credentials, same as vertex ai
|
||||
)
|
||||
api_version = params.pop("api_version", IBMWatsonXAI.api_version)
|
||||
# Load auth variables from environment variables
|
||||
if url is None:
|
||||
url = (
|
||||
get_secret("WATSONX_API_BASE") # consistent with 'AZURE_API_BASE'
|
||||
or get_secret("WATSONX_URL")
|
||||
or get_secret("WX_URL")
|
||||
or get_secret("WML_URL")
|
||||
)
|
||||
if api_key is None:
|
||||
api_key = (
|
||||
get_secret("WATSONX_APIKEY")
|
||||
or get_secret("WATSONX_API_KEY")
|
||||
or get_secret("WX_API_KEY")
|
||||
)
|
||||
if token is None:
|
||||
token = get_secret("WATSONX_TOKEN") or get_secret("WX_TOKEN")
|
||||
if project_id is None:
|
||||
project_id = (
|
||||
get_secret("WATSONX_PROJECT_ID")
|
||||
or get_secret("WX_PROJECT_ID")
|
||||
or get_secret("PROJECT_ID")
|
||||
)
|
||||
if region_name is None:
|
||||
region_name = (
|
||||
get_secret("WATSONX_REGION")
|
||||
or get_secret("WX_REGION")
|
||||
or get_secret("REGION")
|
||||
)
|
||||
if space_id is None:
|
||||
space_id = (
|
||||
get_secret("WATSONX_DEPLOYMENT_SPACE_ID")
|
||||
or get_secret("WATSONX_SPACE_ID")
|
||||
or get_secret("WX_SPACE_ID")
|
||||
or get_secret("SPACE_ID")
|
||||
)
|
||||
|
||||
# credentials parsing
|
||||
if wx_credentials is not None:
|
||||
url = wx_credentials.get("url", url)
|
||||
api_key = wx_credentials.get(
|
||||
"apikey", wx_credentials.get("api_key", api_key)
|
||||
)
|
||||
token = wx_credentials.get(
|
||||
"token",
|
||||
wx_credentials.get(
|
||||
"watsonx_token", token
|
||||
), # follow format of {provider}_token, same as azure - e.g. 'azure_ad_token=..'
|
||||
)
|
||||
|
||||
# verify that all required credentials are present
|
||||
if url is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
message="Error: Watsonx URL not set. Set WX_URL in environment variables or pass in as a parameter.",
|
||||
)
|
||||
if token is None and api_key is not None:
|
||||
# generate the auth token
|
||||
if print_verbose:
|
||||
print_verbose("Generating IAM token for Watsonx.ai")
|
||||
token = self.generate_iam_token(api_key)
|
||||
elif token is None and api_key is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
url=url,
|
||||
message="Error: API key or token not found. Set WX_API_KEY or WX_TOKEN in environment variables or pass in as a parameter.",
|
||||
)
|
||||
if project_id is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
url=url,
|
||||
message="Error: Watsonx project_id not set. Set WX_PROJECT_ID in environment variables or pass in as a parameter.",
|
||||
)
|
||||
|
||||
return {
|
||||
"url": url,
|
||||
"api_key": api_key,
|
||||
"token": token,
|
||||
"project_id": project_id,
|
||||
"space_id": space_id,
|
||||
"region_name": region_name,
|
||||
"api_version": api_version,
|
||||
}
|
||||
|
||||
def completion(
|
||||
self,
|
||||
model: str,
|
||||
messages: list,
|
||||
custom_prompt_dict: dict,
|
||||
model_response: ModelResponse,
|
||||
print_verbose: Callable,
|
||||
encoding,
|
||||
logging_obj,
|
||||
optional_params: dict,
|
||||
litellm_params: Optional[dict] = None,
|
||||
logger_fn=None,
|
||||
timeout: Optional[float] = None,
|
||||
):
|
||||
"""
|
||||
Send a text generation request to the IBM Watsonx.ai API.
|
||||
Reference: https://cloud.ibm.com/apidocs/watsonx-ai#text-generation
|
||||
"""
|
||||
stream = optional_params.pop("stream", False)
|
||||
|
||||
# Load default configs
|
||||
config = IBMWatsonXAIConfig.get_config()
|
||||
for k, v in config.items():
|
||||
if k not in optional_params:
|
||||
optional_params[k] = v
|
||||
|
||||
# Make prompt to send to model
|
||||
provider = model.split("/")[0]
|
||||
# model_name = "/".join(model.split("/")[1:])
|
||||
prompt = convert_messages_to_prompt(
|
||||
model, messages, provider, custom_prompt_dict
|
||||
)
|
||||
|
||||
def process_text_request(request_params: dict) -> ModelResponse:
|
||||
with self._manage_response(
|
||||
request_params, logging_obj=logging_obj, input=prompt, timeout=timeout
|
||||
) as resp:
|
||||
json_resp = resp.json()
|
||||
|
||||
generated_text = json_resp["results"][0]["generated_text"]
|
||||
prompt_tokens = json_resp["results"][0]["input_token_count"]
|
||||
completion_tokens = json_resp["results"][0]["generated_token_count"]
|
||||
model_response["choices"][0]["message"]["content"] = generated_text
|
||||
model_response["finish_reason"] = json_resp["results"][0]["stop_reason"]
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = model
|
||||
setattr(
|
||||
model_response,
|
||||
"usage",
|
||||
Usage(
|
||||
prompt_tokens=prompt_tokens,
|
||||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
),
|
||||
)
|
||||
return model_response
|
||||
|
||||
def process_stream_request(
|
||||
request_params: dict,
|
||||
) -> litellm.CustomStreamWrapper:
|
||||
# stream the response - generated chunks will be handled
|
||||
# by litellm.utils.CustomStreamWrapper.handle_watsonx_stream
|
||||
with self._manage_response(
|
||||
request_params,
|
||||
logging_obj=logging_obj,
|
||||
stream=True,
|
||||
input=prompt,
|
||||
timeout=timeout,
|
||||
) as resp:
|
||||
response = litellm.CustomStreamWrapper(
|
||||
resp.iter_lines(),
|
||||
model=model,
|
||||
custom_llm_provider="watsonx",
|
||||
logging_obj=logging_obj,
|
||||
)
|
||||
return response
|
||||
|
||||
try:
|
||||
## Get the response from the model
|
||||
req_params = self._prepare_text_generation_req(
|
||||
model_id=model,
|
||||
prompt=prompt,
|
||||
stream=stream,
|
||||
optional_params=optional_params,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
if stream:
|
||||
return process_stream_request(req_params)
|
||||
else:
|
||||
return process_text_request(req_params)
|
||||
except WatsonXAIError as e:
|
||||
raise e
|
||||
except Exception as e:
|
||||
raise WatsonXAIError(status_code=500, message=str(e))
|
||||
|
||||
def embedding(
|
||||
self,
|
||||
model: str,
|
||||
input: Union[list, str],
|
||||
api_key: Optional[str] = None,
|
||||
logging_obj=None,
|
||||
model_response=None,
|
||||
optional_params=None,
|
||||
encoding=None,
|
||||
):
|
||||
"""
|
||||
Send a text embedding request to the IBM Watsonx.ai API.
|
||||
"""
|
||||
if optional_params is None:
|
||||
optional_params = {}
|
||||
# Load default configs
|
||||
config = IBMWatsonXAIConfig.get_config()
|
||||
for k, v in config.items():
|
||||
if k not in optional_params:
|
||||
optional_params[k] = v
|
||||
|
||||
# Load auth variables from environment variables
|
||||
if isinstance(input, str):
|
||||
input = [input]
|
||||
if api_key is not None:
|
||||
optional_params["api_key"] = api_key
|
||||
api_params = self._get_api_params(optional_params)
|
||||
# build auth headers
|
||||
api_token = api_params.get("token")
|
||||
headers = {
|
||||
"Authorization": f"Bearer {api_token}",
|
||||
"Content-Type": "application/json",
|
||||
"Accept": "application/json",
|
||||
}
|
||||
# init the payload to the text generation call
|
||||
payload = {
|
||||
"inputs": input,
|
||||
"model_id": model,
|
||||
"project_id": api_params["project_id"],
|
||||
"parameters": optional_params,
|
||||
}
|
||||
request_params = dict(version=api_params["api_version"])
|
||||
url = api_params["url"].rstrip("/") + WatsonXAIEndpoint.EMBEDDINGS
|
||||
# request = httpx.Request(
|
||||
# "POST", url, headers=headers, json=payload, params=request_params
|
||||
# )
|
||||
req_params = {
|
||||
"method": "POST",
|
||||
"url": url,
|
||||
"headers": headers,
|
||||
"json": payload,
|
||||
"params": request_params,
|
||||
}
|
||||
with self._manage_response(
|
||||
req_params, logging_obj=logging_obj, input=input
|
||||
) as resp:
|
||||
json_resp = resp.json()
|
||||
|
||||
results = json_resp.get("results", [])
|
||||
embedding_response = []
|
||||
for idx, result in enumerate(results):
|
||||
embedding_response.append(
|
||||
{"object": "embedding", "index": idx, "embedding": result["embedding"]}
|
||||
)
|
||||
model_response["object"] = "list"
|
||||
model_response["data"] = embedding_response
|
||||
model_response["model"] = model
|
||||
input_tokens = json_resp.get("input_token_count", 0)
|
||||
model_response.usage = Usage(
|
||||
prompt_tokens=input_tokens, completion_tokens=0, total_tokens=input_tokens
|
||||
)
|
||||
return model_response
|
||||
|
||||
    def generate_iam_token(self, api_key=None, **params):
        headers = {}
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        if api_key is None:
            api_key = get_secret("WX_API_KEY") or get_secret("WATSONX_API_KEY")
        if api_key is None:
            raise ValueError("API key is required")
        headers["Accept"] = "application/json"
        data = {
            "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
            "apikey": api_key,
        }
        response = httpx.post(
            "https://iam.cloud.ibm.com/identity/token", data=data, headers=headers
        )
        response.raise_for_status()
        json_data = response.json()
        iam_access_token = json_data["access_token"]
        self.token = iam_access_token
        return iam_access_token
|
||||
|
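    # Equivalent token exchange with curl (illustrative; $IBM_CLOUD_API_KEY is a placeholder):
    #   curl -X POST "https://iam.cloud.ibm.com/identity/token" \
    #     -H "Content-Type: application/x-www-form-urlencoded" \
    #     -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=$IBM_CLOUD_API_KEY"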
||||
@contextmanager
|
||||
def _manage_response(
|
||||
self,
|
||||
request_params: dict,
|
||||
logging_obj: Any,
|
||||
stream: bool = False,
|
||||
input: Optional[Any] = None,
|
||||
timeout: Optional[float] = None,
|
||||
):
|
||||
request_str = (
|
||||
f"response = {request_params['method']}(\n"
|
||||
f"\turl={request_params['url']},\n"
|
||||
f"\tjson={request_params['json']},\n"
|
||||
f")"
|
||||
)
|
||||
logging_obj.pre_call(
|
||||
input=input,
|
||||
api_key=request_params["headers"].get("Authorization"),
|
||||
additional_args={
|
||||
"complete_input_dict": request_params["json"],
|
||||
"request_str": request_str,
|
||||
},
|
||||
)
|
||||
if timeout:
|
||||
request_params["timeout"] = timeout
|
||||
try:
|
||||
if stream:
|
||||
resp = requests.request(
|
||||
**request_params,
|
||||
stream=True,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
yield resp
|
||||
else:
|
||||
resp = requests.request(**request_params)
|
||||
resp.raise_for_status()
|
||||
yield resp
|
||||
except Exception as e:
|
||||
raise WatsonXAIError(status_code=500, message=str(e))
|
||||
if not stream:
|
||||
logging_obj.post_call(
|
||||
input=input,
|
||||
api_key=request_params["headers"].get("Authorization"),
|
||||
original_response=json.dumps(resp.json()),
|
||||
additional_args={
|
||||
"status_code": resp.status_code,
|
||||
"complete_input_dict": request_params["json"],
|
||||
},
|
||||
)
|
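To tie the new provider together end to end, a minimal usage sketch (the model id, key, and project id values are placeholders; the environment variable names are the ones read by _get_api_params above):

    import os
    import litellm

    os.environ["WATSONX_API_BASE"] = "https://us-south.ml.cloud.ibm.com"
    os.environ["WATSONX_APIKEY"] = "my-ibm-cloud-api-key"
    os.environ["WATSONX_PROJECT_ID"] = "my-project-id"

    response = litellm.completion(
        model="watsonx/ibm/granite-13b-chat-v2",
        messages=[{"role": "user", "content": "Hello from watsonx"}],
    )

    # Models deployed in a deployment space are addressed as 'deployment/<deployment_id>' and
    # additionally require WATSONX_DEPLOYMENT_SPACE_ID (see _prepare_text_generation_req above).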
|
@ -12,7 +12,6 @@ from typing import Any, Literal, Union, BinaryIO
|
|||
from functools import partial
|
||||
import dotenv, traceback, random, asyncio, time, contextvars
|
||||
from copy import deepcopy
|
||||
|
||||
import httpx
|
||||
import litellm
|
||||
from ._logging import verbose_logger
|
||||
|
@ -64,6 +63,7 @@ from .llms import (
|
|||
vertex_ai,
|
||||
vertex_ai_anthropic,
|
||||
maritalk,
|
||||
watsonx,
|
||||
)
|
||||
from .llms.openai import OpenAIChatCompletion, OpenAITextCompletion
|
||||
from .llms.azure import AzureChatCompletion
|
||||
|
@ -343,6 +343,7 @@ async def acompletion(
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=completion_kwargs,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -408,8 +409,10 @@ def mock_completion(
|
|||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = model
|
||||
|
||||
model_response.usage = Usage(
|
||||
prompt_tokens=10, completion_tokens=20, total_tokens=30
|
||||
setattr(
|
||||
model_response,
|
||||
"usage",
|
||||
Usage(prompt_tokens=10, completion_tokens=20, total_tokens=30),
|
||||
)
|
||||
|
||||
try:
|
||||
|
@ -609,6 +612,7 @@ def completion(
|
|||
"client",
|
||||
"rpm",
|
||||
"tpm",
|
||||
"max_parallel_requests",
|
||||
"input_cost_per_token",
|
||||
"output_cost_per_token",
|
||||
"input_cost_per_second",
|
||||
|
@ -652,6 +656,7 @@ def completion(
|
|||
model
|
||||
] # update the model to the actual value if an alias has been passed in
|
||||
model_response = ModelResponse()
|
||||
setattr(model_response, "usage", litellm.Usage())
|
||||
if (
|
||||
kwargs.get("azure", False) == True
|
||||
): # don't remove flag check, to remain backwards compatible for repos like Codium
|
||||
|
@ -1732,13 +1737,14 @@ def completion(
|
|||
or optional_params.pop("vertex_ai_credentials", None)
|
||||
or get_secret("VERTEXAI_CREDENTIALS")
|
||||
)
|
||||
new_params = deepcopy(optional_params)
|
||||
if "claude-3" in model:
|
||||
model_response = vertex_ai_anthropic.completion(
|
||||
model=model,
|
||||
messages=messages,
|
||||
model_response=model_response,
|
||||
print_verbose=print_verbose,
|
||||
optional_params=optional_params,
|
||||
optional_params=new_params,
|
||||
litellm_params=litellm_params,
|
||||
logger_fn=logger_fn,
|
||||
encoding=encoding,
|
||||
|
@ -1754,7 +1760,7 @@ def completion(
|
|||
messages=messages,
|
||||
model_response=model_response,
|
||||
print_verbose=print_verbose,
|
||||
optional_params=optional_params,
|
||||
optional_params=new_params,
|
||||
litellm_params=litellm_params,
|
||||
logger_fn=logger_fn,
|
||||
encoding=encoding,
|
||||
|
@ -1907,6 +1913,43 @@ def completion(
|
|||
|
||||
## RESPONSE OBJECT
|
||||
response = response
|
||||
elif custom_llm_provider == "watsonx":
|
||||
custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
|
||||
response = watsonx.IBMWatsonXAI().completion(
|
||||
model=model,
|
||||
messages=messages,
|
||||
custom_prompt_dict=custom_prompt_dict,
|
||||
model_response=model_response,
|
||||
print_verbose=print_verbose,
|
||||
optional_params=optional_params,
|
||||
litellm_params=litellm_params, # type: ignore
|
||||
logger_fn=logger_fn,
|
||||
encoding=encoding,
|
||||
logging_obj=logging,
|
||||
timeout=timeout,
|
||||
)
|
||||
if (
|
||||
"stream" in optional_params
|
||||
and optional_params["stream"] == True
|
||||
and not isinstance(response, CustomStreamWrapper)
|
||||
):
|
||||
# don't try to access stream object,
|
||||
response = CustomStreamWrapper(
|
||||
iter(response),
|
||||
model,
|
||||
custom_llm_provider="watsonx",
|
||||
logging_obj=logging,
|
||||
)
|
||||
|
||||
if optional_params.get("stream", False):
|
||||
## LOGGING
|
||||
logging.post_call(
|
||||
input=messages,
|
||||
api_key=None,
|
||||
original_response=response,
|
||||
)
|
||||
## RESPONSE OBJECT
|
||||
response = response
|
||||
elif custom_llm_provider == "vllm":
|
||||
custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
|
||||
model_response = vllm.completion(
|
||||
|
@ -1990,9 +2033,16 @@ def completion(
|
|||
or "http://localhost:11434"
|
||||
)
|
||||
|
||||
api_key = (
|
||||
api_key
|
||||
or litellm.ollama_key
|
||||
or os.environ.get("OLLAMA_API_KEY")
|
||||
or litellm.api_key
|
||||
)
|
||||
## LOGGING
|
||||
generator = ollama_chat.get_ollama_response(
|
||||
api_base,
|
||||
api_key,
|
||||
model,
|
||||
messages,
|
||||
optional_params,
|
||||
|
@ -2188,6 +2238,7 @@ def completion(
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -2549,6 +2600,7 @@ async def aembedding(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -2600,6 +2652,7 @@ def embedding(
|
|||
client = kwargs.pop("client", None)
|
||||
rpm = kwargs.pop("rpm", None)
|
||||
tpm = kwargs.pop("tpm", None)
|
||||
max_parallel_requests = kwargs.pop("max_parallel_requests", None)
|
||||
model_info = kwargs.get("model_info", None)
|
||||
metadata = kwargs.get("metadata", None)
|
||||
encoding_format = kwargs.get("encoding_format", None)
|
||||
|
@ -2657,6 +2710,7 @@ def embedding(
|
|||
"client",
|
||||
"rpm",
|
||||
"tpm",
|
||||
"max_parallel_requests",
|
||||
"input_cost_per_token",
|
||||
"output_cost_per_token",
|
||||
"input_cost_per_second",
|
||||
|
@ -2975,6 +3029,15 @@ def embedding(
|
|||
client=client,
|
||||
aembedding=aembedding,
|
||||
)
|
||||
elif custom_llm_provider == "watsonx":
|
||||
response = watsonx.IBMWatsonXAI().embedding(
|
||||
model=model,
|
||||
input=input,
|
||||
encoding=encoding,
|
||||
logging_obj=logging,
|
||||
optional_params=optional_params,
|
||||
model_response=EmbeddingResponse(),
|
||||
)
|
||||
else:
|
||||
args = locals()
|
||||
raise ValueError(f"No valid embedding model args passed in - {args}")
|
||||
|
@ -2990,7 +3053,10 @@ def embedding(
|
|||
)
|
||||
## Map to OpenAI Exception
|
||||
raise exception_type(
|
||||
model=model, original_exception=e, custom_llm_provider=custom_llm_provider
|
||||
model=model,
|
||||
original_exception=e,
|
||||
custom_llm_provider=custom_llm_provider,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3084,6 +3150,7 @@ async def atext_completion(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3421,6 +3488,7 @@ async def aimage_generation(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3511,6 +3579,7 @@ def image_generation(
|
|||
"client",
|
||||
"rpm",
|
||||
"tpm",
|
||||
"max_parallel_requests",
|
||||
"input_cost_per_token",
|
||||
"output_cost_per_token",
|
||||
"hf_model_name",
|
||||
|
@ -3620,6 +3689,7 @@ def image_generation(
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=locals(),
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3669,6 +3739,7 @@ async def atranscription(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
|
|
@ -650,6 +650,7 @@
|
|||
"input_cost_per_token": 0.000002,
|
||||
"output_cost_per_token": 0.000006,
|
||||
"litellm_provider": "mistral",
|
||||
"supports_function_calling": true,
|
||||
"mode": "chat"
|
||||
},
|
||||
"mistral/mistral-small-latest": {
|
||||
|
@ -659,6 +660,7 @@
|
|||
"input_cost_per_token": 0.000002,
|
||||
"output_cost_per_token": 0.000006,
|
||||
"litellm_provider": "mistral",
|
||||
"supports_function_calling": true,
|
||||
"mode": "chat"
|
||||
},
|
||||
"mistral/mistral-medium": {
|
||||
|
@ -735,6 +737,26 @@
|
|||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"groq/llama3-8b-8192": {
|
||||
"max_tokens": 8192,
|
||||
"max_input_tokens": 8192,
|
||||
"max_output_tokens": 8192,
|
||||
"input_cost_per_token": 0.00000010,
|
||||
"output_cost_per_token": 0.00000010,
|
||||
"litellm_provider": "groq",
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"groq/llama3-70b-8192": {
|
||||
"max_tokens": 8192,
|
||||
"max_input_tokens": 8192,
|
||||
"max_output_tokens": 8192,
|
||||
"input_cost_per_token": 0.00000064,
|
||||
"output_cost_per_token": 0.00000080,
|
||||
"litellm_provider": "groq",
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"groq/mixtral-8x7b-32768": {
|
||||
"max_tokens": 32768,
|
||||
"max_input_tokens": 32768,
|
||||
|
@ -789,7 +811,9 @@
|
|||
"input_cost_per_token": 0.00000025,
|
||||
"output_cost_per_token": 0.00000125,
|
||||
"litellm_provider": "anthropic",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true,
|
||||
"tool_use_system_prompt_tokens": 264
|
||||
},
|
||||
"claude-3-opus-20240229": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -798,7 +822,9 @@
|
|||
"input_cost_per_token": 0.000015,
|
||||
"output_cost_per_token": 0.000075,
|
||||
"litellm_provider": "anthropic",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true,
|
||||
"tool_use_system_prompt_tokens": 395
|
||||
},
|
||||
"claude-3-sonnet-20240229": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -807,7 +833,9 @@
|
|||
"input_cost_per_token": 0.000003,
|
||||
"output_cost_per_token": 0.000015,
|
||||
"litellm_provider": "anthropic",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true,
|
||||
"tool_use_system_prompt_tokens": 159
|
||||
},
|
||||
"text-bison": {
|
||||
"max_tokens": 1024,
|
||||
|
@ -1113,7 +1141,8 @@
|
|||
"input_cost_per_token": 0.000003,
|
||||
"output_cost_per_token": 0.000015,
|
||||
"litellm_provider": "vertex_ai-anthropic_models",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"vertex_ai/claude-3-haiku@20240307": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1122,7 +1151,8 @@
|
|||
"input_cost_per_token": 0.00000025,
|
||||
"output_cost_per_token": 0.00000125,
|
||||
"litellm_provider": "vertex_ai-anthropic_models",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"vertex_ai/claude-3-opus@20240229": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1131,7 +1161,8 @@
|
|||
"input_cost_per_token": 0.0000015,
|
||||
"output_cost_per_token": 0.0000075,
|
||||
"litellm_provider": "vertex_ai-anthropic_models",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"textembedding-gecko": {
|
||||
"max_tokens": 3072,
|
||||
|
@ -1387,6 +1418,123 @@
|
|||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-13b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.0000001,
|
||||
"output_cost_per_token": 0.0000005,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-13b-chat": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.0000001,
|
||||
"output_cost_per_token": 0.0000005,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-70b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-70b-chat": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-7b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-7b-chat": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-70b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-70b-instruct": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-8b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-8b-instruct": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/mistralai/mistral-7b-v0.1": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/mistralai/mistral-7b-instruct-v0.2": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/mistralai/mixtral-8x7b-instruct-v0.1": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.0000003,
|
||||
"output_cost_per_token": 0.000001,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"openrouter/openai/gpt-3.5-turbo": {
|
||||
"max_tokens": 4095,
|
||||
"input_cost_per_token": 0.0000015,
|
||||
|
@ -1515,6 +1663,13 @@
|
|||
"litellm_provider": "openrouter",
|
||||
"mode": "chat"
|
||||
},
|
||||
"openrouter/meta-llama/llama-3-70b-instruct": {
|
||||
"max_tokens": 8192,
|
||||
"input_cost_per_token": 0.0000008,
|
||||
"output_cost_per_token": 0.0000008,
|
||||
"litellm_provider": "openrouter",
|
||||
"mode": "chat"
|
||||
},
|
||||
"j2-ultra": {
|
||||
"max_tokens": 8192,
|
||||
"max_input_tokens": 8192,
|
||||
|
@ -1762,7 +1917,8 @@
|
|||
"input_cost_per_token": 0.000003,
|
||||
"output_cost_per_token": 0.000015,
|
||||
"litellm_provider": "bedrock",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"anthropic.claude-3-haiku-20240307-v1:0": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1771,7 +1927,8 @@
|
|||
"input_cost_per_token": 0.00000025,
|
||||
"output_cost_per_token": 0.00000125,
|
||||
"litellm_provider": "bedrock",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"anthropic.claude-3-opus-20240229-v1:0": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1780,7 +1937,8 @@
|
|||
"input_cost_per_token": 0.000015,
|
||||
"output_cost_per_token": 0.000075,
|
||||
"litellm_provider": "bedrock",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"anthropic.claude-v1": {
|
||||
"max_tokens": 8191,
|
||||
|
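The pricing map above is what litellm loads at import time, so once a release ships with these entries the new Groq, Replicate, and OpenRouter models pick up cost tracking and capability checks automatically. A minimal sketch of how a client reads them, assuming the installed `litellm` version includes the updated `model_prices_and_context_window.json` (the helpers shown are the commonly used public utilities; treat their exact signatures as an assumption, not part of this diff):

```python
import litellm

# The JSON above is loaded into litellm.model_cost, keyed by model name
# (assumption: the new groq entry is present in the installed version).
info = litellm.model_cost.get("groq/llama3-8b-8192", {})
print(info.get("input_cost_per_token"))   # expected 1e-07 per the entry above
print(info.get("output_cost_per_token"))  # expected 1e-07 per the entry above

# Context-window and capability helpers read the same entries.
print(litellm.get_max_tokens("groq/llama3-8b-8192"))                   # expected 8192
print(litellm.supports_function_calling(model="groq/llama3-8b-8192"))  # expected True
```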
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long

@@ -1 +1 @@
[Single-line minified Next.js webpack runtime chunk for the proxy UI; regenerated build output. The only difference between the removed and added line is the referenced stylesheet hash: static/css/dc347b0d22ffde5d.css -> static/css/5e699db73bf6f8c2.css.]

File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long

@@ -1 +1 @@
[Single-line prerendered "LiteLLM Dashboard" HTML page; regenerated build output. The change swaps the referenced build assets: webpack-75b5d58291566cf9.js -> webpack-ccae12a25017afa5.js, fd9d1056-bcf69420342937de.js -> fd9d1056-dafd44dfa2da140c.js, 69-442a9c01c3fd20f9.js -> 69-e49705773ae41779.js, static/css/dc347b0d22ffde5d.css -> static/css/5e699db73bf6f8c2.css, page chunk 289-04be6cb9636840d2.js / page-cb85da9a307105a0.js -> 447-9f8d32190ff7d16d.js / page-508c39694bd40fe9.js, and buildId y7Wf8hfvd5KooOO87je1n -> kbGdRQFfI6W3bEwfzmJDI.]

@@ -1,7 +1,7 @@
 2:I[77831,[],""]
-3:I[21225,["289","static/chunks/289-04be6cb9636840d2.js","931","static/chunks/app/page-cb85da9a307105a0.js"],""]
+3:I[27125,["447","static/chunks/447-9f8d32190ff7d16d.js","931","static/chunks/app/page-508c39694bd40fe9.js"],""]
 4:I[5613,[],""]
 5:I[31778,[],""]
[The "0:" line is the single-line Next.js RSC payload for the same dashboard page; the removed and added versions differ only in the buildId (y7Wf8hfvd5KooOO87je1n -> kbGdRQFfI6W3bEwfzmJDI) and the stylesheet hash (dc347b0d22ffde5d.css -> 5e699db73bf6f8c2.css).]
 6:[["$","meta","0",{"name":"viewport","content":"width=device-width, initial-scale=1"}],["$","meta","1",{"charSet":"utf-8"}],["$","title","2",{"children":"LiteLLM Dashboard"}],["$","meta","3",{"name":"description","content":"LiteLLM Proxy Admin UI"}],["$","link","4",{"rel":"icon","href":"/ui/favicon.ico","type":"image/x-icon","sizes":"16x16"}],["$","meta","5",{"name":"next-size-adjust"}]]
 1:null
@@ -1,52 +1,61 @@
model_list:
- model_name: fake-openai-endpoint
litellm_params:
model: openai/my-fake-model
api_key: my-fake-key
# api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
api_base: http://0.0.0.0:8080
stream_timeout: 0.001
rpm: 10
- litellm_params:
model: azure/chatgpt-v-2
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: "2023-07-01-preview"
stream_timeout: 0.001
model_name: azure-gpt-3.5
# - model_name: text-embedding-ada-002
# litellm_params:
# model: text-embedding-ada-002
# api_key: os.environ/OPENAI_API_KEY
- model_name: gpt-instruct
litellm_params:
model: text-completion-openai/gpt-3.5-turbo-instruct
# api_key: my-fake-key
# api_base: https://exampleopenaiendpoint-production.up.railway.app/

environment_variables:
LANGFUSE_PUBLIC_KEY: Q6K8MQN6L7sPYSJiFKM9eNrETOx6V/FxVPup4FqdKsZK1hyR4gyanlQ2KHLg5D5afng99uIt0JCEQ2jiKF9UxFvtnb4BbJ4qpeceH+iK8v/bdg==
LANGFUSE_SECRET_KEY: 5xQ7KMa6YMLsm+H/Pf1VmlqWq1NON5IoCxABhkUBeSck7ftsj2CmpkL2ZwrxwrktgiTUBH+3gJYBX+XBk7lqOOUpvmiLjol/E5lCqq0M1CqLWA==
SLACK_WEBHOOK_URL: RJjhS0Hhz0/s07sCIf1OTXmTGodpK9L2K9p953Z+fOX0l2SkPFT6mB9+yIrLufmlwEaku5NNEBKy//+AG01yOd+7wV1GhK65vfj3B/gTN8t5cuVnR4vFxKY5Rx4eSGLtzyAs+aIBTp4GoNXDIjroCqfCjPkItEZWCg==
general_settings:
alerting:
- slack
alerting_threshold: 300
database_connection_pool_limit: 100
database_connection_timeout: 60
disable_master_key_return: true
health_check_interval: 300
proxy_batch_write_at: 60
ui_access_mode: all
# master_key: sk-1234
litellm_settings:
success_callback: ["prometheus"]
service_callback: ["prometheus_system"]
upperbound_key_generate_params:
max_budget: os.environ/LITELLM_UPPERBOUND_KEYS_MAX_BUDGET

allowed_fails: 3
failure_callback:
- prometheus
num_retries: 3
service_callback:
- prometheus_system
success_callback:
- langfuse
- prometheus
- langsmith
model_list:
- litellm_params:
model: gpt-3.5-turbo
model_name: gpt-3.5-turbo
- litellm_params:
api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
api_key: my-fake-key
model: openai/my-fake-model
stream_timeout: 0.001
model_name: fake-openai-endpoint
- litellm_params:
api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
api_key: my-fake-key
model: openai/my-fake-model-2
stream_timeout: 0.001
model_name: fake-openai-endpoint
- litellm_params:
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: 2023-07-01-preview
model: azure/chatgpt-v-2
stream_timeout: 0.001
model_name: azure-gpt-3.5
- litellm_params:
api_key: os.environ/OPENAI_API_KEY
model: text-embedding-ada-002
model_name: text-embedding-ada-002
- litellm_params:
model: text-completion-openai/gpt-3.5-turbo-instruct
model_name: gpt-instruct
router_settings:
routing_strategy: usage-based-routing-v2
enable_pre_call_checks: true
redis_host: os.environ/REDIS_HOST
redis_password: os.environ/REDIS_PASSWORD
redis_port: os.environ/REDIS_PORT
enable_pre_call_checks: True

general_settings:
master_key: sk-1234
allow_user_auth: true
alerting: ["slack"]
store_model_in_db: True // set via environment variable - os.environ["STORE_MODEL_IN_DB"] = "True"
proxy_batch_write_at: 5 # 👈 Frequency of batch writing logs to server (in seconds)
enable_jwt_auth: True
alerting: ["slack"]
litellm_jwtauth:
admin_jwt_scope: "litellm_proxy_admin"
public_key_ttl: os.environ/LITELLM_PUBLIC_KEY_TTL
user_id_jwt_field: "sub"
org_id_jwt_field: "azp"
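The `router_settings` block in the rewritten config above is what the proxy hands to the litellm Router at startup. For readers who want the same behaviour outside the proxy, here is a minimal sketch in plain Python, assuming a reachable Redis instance and reusing the placeholder model entry from the config; the keyword arguments are the commonly documented `Router` parameters and should be treated as assumptions rather than as part of this diff:

```python
import os
from litellm import Router

# Mirrors router_settings above: usage-based routing v2 backed by Redis,
# with pre-call checks (context-window / rate-limit filtering) enabled.
router = Router(
    model_list=[
        {
            "model_name": "fake-openai-endpoint",   # alias callers use
            "litellm_params": {
                "model": "openai/my-fake-model",    # placeholder backend
                "api_key": "my-fake-key",
                "api_base": "http://0.0.0.0:8080",
            },
        }
    ],
    routing_strategy="usage-based-routing-v2",
    enable_pre_call_checks=True,
    redis_host=os.environ.get("REDIS_HOST", "localhost"),
    redis_password=os.environ.get("REDIS_PASSWORD"),
    redis_port=int(os.environ.get("REDIS_PORT", "6379")),
    num_retries=3,
    allowed_fails=3,
)

response = router.completion(
    model="fake-openai-endpoint",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```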
19 litellm/proxy/_super_secret_config.yaml Normal file

@@ -0,0 +1,19 @@
+model_list:
+- litellm_params:
+    api_base: http://0.0.0.0:8080
+    api_key: my-fake-key
+    model: openai/my-fake-model
+    rpm: 100
+  model_name: fake-openai-endpoint
+- litellm_params:
+    api_base: http://0.0.0.0:8081
+    api_key: my-fake-key
+    model: openai/my-fake-model-2
+    rpm: 100
+  model_name: fake-openai-endpoint
+router_settings:
+  num_retries: 0
+  enable_pre_call_checks: true
+  redis_host: os.environ/REDIS_HOST
+  redis_password: os.environ/REDIS_PASSWORD
+  redis_port: os.environ/REDIS_PORT
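The new `_super_secret_config.yaml` reads like a load-test fixture: two fake OpenAI-compatible backends on ports 8080 and 8081, both registered under the single `fake-openai-endpoint` alias, with retries disabled and Redis-backed routing. A hedged sketch of how such a config is typically exercised once the proxy is running against it; the proxy port, the `sk-1234` key, and the locally running fake backends are assumptions, not part of the diff:

```python
# Assumed startup command (not part of the diff):
#   litellm --config litellm/proxy/_super_secret_config.yaml --port 4000
from openai import OpenAI

# The proxy speaks the OpenAI API, so any OpenAI client can call it; the key
# only matters if a master key / virtual keys are configured on the proxy.
client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

resp = client.chat.completions.create(
    model="fake-openai-endpoint",  # alias defined in the config above
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```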
Some files were not shown because too many files have changed in this diff.