almost working llmonitor

This commit is contained in:
Vince Lwt 2023-08-21 16:26:47 +02:00
parent 22c7e38de5
commit 3675d3e029
5 changed files with 425 additions and 326 deletions


@@ -1,6 +1,7 @@
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)
@@ -11,34 +12,36 @@
![4BC6491E-86D0-4833-B061-9F54524B2579](https://github.com/BerriAI/litellm/assets/17561003/f5dd237b-db5e-42e1-b1ac-f05683b1d724)

## What does liteLLM proxy do
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**

  Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`

  ```json
  {
    "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
    "messages": [
      {
        "content": "Hello, whats the weather in San Francisco??",
        "role": "user"
      }
    ]
  }
  ```
- **Consistent Input/Output** Format
  - Call all models using the OpenAI format - `completion(model, messages)`
  - Text responses will always be available at `['choices'][0]['message']['content']`
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `LLMonitor`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)

  **Example: Logs sent to Supabase**
  <img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">

- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
- **Caching** - Implementation of Semantic Caching
- **Streaming & Async Support** - Return generators to stream text responses
## API Endpoints

### `/chat/completions` (POST)
@@ -46,34 +49,37 @@
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2 etc.

#### Input
This API endpoint accepts all inputs in raw JSON and expects the following inputs
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/),
  e.g. `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/

#### Example JSON body
For claude-2
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```

### Making an API request to the Proxy Server
```python
import requests
import json

# TODO: use your URL
url = "http://localhost:5000/chat/completions"
payload = json.dumps({
@@ -94,34 +100,38 @@ print(response.text)
```
### Output [Response Format]
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```

## Installation & Usage
### Running Locally
1. Clone liteLLM repository to your local machine:
   ```
   git clone https://github.com/BerriAI/liteLLM-proxy
@@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
   python main.py
   ```

## Deploying
1. Quick Start: Deploy on Railway

   [![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/DYqQAW?referralCode=t3ukrU)

2. `GCP`, `AWS`, `Azure`
   This project includes a `Dockerfile` allowing you to build and deploy a Docker project on your providers

# Support / Talk with founders
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai

## Roadmap
- [ ] Support hosted db (e.g. Supabase)
- [ ] Easily send data to places like posthog and sentry.
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limitings


@@ -5,6 +5,7 @@ import traceback
import dotenv
import os
import requests

dotenv.load_dotenv()  # Loading env variables using dotenv
@@ -14,45 +15,34 @@ class LLMonitorLogger:
        # Instance variables
        self.api_url = os.getenv(
            "LLMONITOR_API_URL") or "https://app.llmonitor.com"
        self.app_id = os.getenv("LLMONITOR_APP_ID")

    def log_event(self, type, run_id, error, usage, model, messages,
                  response_obj, user_id, time, print_verbose):
        # Method definition
        try:
            print_verbose(
                f"LLMonitor Logging - Enters logging function for model {model}"
            )

            print(type, model, messages, response_obj, time, end_user)

            headers = {'Content-Type': 'application/json'}

            data = {
                "type": "llm",
                "name": model,
                "runId": run_id,
                "app": self.app_id,
                "error": error,
                "event": type,
                "timestamp": time.isoformat(),
                "userId": user_id,
                "input": messages,
                "output": response_obj['choices'][0]['message']['content'],
            }

            print_verbose(f"LLMonitor Logging - final data object: {data}")
            # response = requests.post(url, headers=headers, json=data)
        except:
            # traceback.print_exc()
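The POST itself is still commented out in this commit. Below is a rough, standalone sketch of what shipping the assembled event could look like; the `/api/report` path and the placeholder field values are assumptions for illustration only — just the base URL, header, and payload keys come from the integration above.

```python
import os
import requests

# Sketch only (not part of the commit): posting an event like the one built in log_event().
api_url = os.getenv("LLMONITOR_API_URL") or "https://app.llmonitor.com"
headers = {"Content-Type": "application/json"}
data = {
    "type": "llm",
    "name": "command-nightly",           # placeholder model name
    "runId": "example-run-id",           # placeholder run id
    "app": os.getenv("LLMONITOR_APP_ID"),
    "event": "end",
    "input": [{"role": "user", "content": "Hello"}],
    "output": "Hi there!",
}
# "/api/report" is an assumed ingestion path; the commit leaves the real endpoint commented out.
response = requests.post(f"{api_url}/api/report", headers=headers, json=data)
print(f"LLMonitor event POST returned {response.status_code}")
```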


@@ -1,28 +1,36 @@
#### What this tests ####
# This tests if logging to the llmonitor integration actually works

import sys
import os
import traceback
import pytest

# Adds the parent directory to the system path
sys.path.insert(0, os.path.abspath('../..'))

from litellm import completion
import litellm

litellm.input_callback = ["llmonitor"]
litellm.success_callback = ["llmonitor"]
litellm.error_callback = ["llmonitor"]

litellm.set_verbose = True

user_message = "Hello, how are you?"
messages = [{"content": user_message, "role": "user"}]

# openai call
# response = completion(model="gpt-3.5-turbo",
#                       messages=[{
#                           "role": "user",
#                           "content": "Hi 👋 - i'm openai"
#                       }])
# print(response)

# #bad request call
# response = completion(model="chatgpt-test", messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])

# cohere call
response = completion(model="command-nightly",
                      messages=[{
                          "role": "user",
                          "content": "Hi 👋 - i'm cohere"
                      }])
print(response)
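For the test above to report anywhere, the logger reads its configuration from the environment. A minimal sketch of the variables involved follows; the LLMonitor names come straight from the integration, while the Cohere key is an assumption needed for the `command-nightly` call, and all values are placeholders.

```python
import os

# Placeholders only - set these before running the test above.
os.environ["LLMONITOR_APP_ID"] = "<your-llmonitor-app-id>"
# Optional: points at a self-hosted instance; defaults to https://app.llmonitor.com
os.environ["LLMONITOR_API_URL"] = "https://app.llmonitor.com"
# Assumed: the command-nightly call needs a Cohere credential.
os.environ["COHERE_API_KEY"] = "<your-cohere-api-key>"
```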


@@ -1,20 +1,7 @@
import aiohttp
import subprocess
import importlib
from typing import List, Dict, Union, Optional
from .exceptions import (
    AuthenticationError,
    InvalidRequestError,
@@ -22,7 +9,32 @@ from .exceptions import (
    ServiceUnavailableError,
    OpenAIError,
)
from openai.openai_object import OpenAIObject
from openai.error import OpenAIError as OriginalError
from .integrations.llmonitor import LLMonitorLogger
from .integrations.litedebugger import LiteDebugger
from .integrations.supabase import Supabase
from .integrations.berrispend import BerriSpendLogger
from .integrations.aispend import AISpendLogger
from .integrations.helicone import HeliconeLogger
import pkg_resources
import sys
import dotenv
import json
import traceback
import threading
import subprocess
import os
import litellm
import openai
import random
import uuid
import requests
import datetime
import time
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

####### ENVIRONMENT VARIABLES ###################
dotenv.load_dotenv()  # Loading env variables using dotenv
@@ -37,6 +49,7 @@ aispendLogger = None
berrispendLogger = None
supabaseClient = None
liteDebuggerClient = None
llmonitorLogger = None
callback_list: Optional[List[str]] = []
user_logger_fn = None
additional_details: Optional[Dict[str, str]] = {}
@@ -63,6 +76,7 @@ local_cache: Optional[Dict[str, str]] = {}
class Message(OpenAIObject):

    def __init__(self, content="default", role="assistant", **params):
        super(Message, self).__init__(**params)
        self.content = content
@@ -70,7 +84,12 @@ class Message(OpenAIObject):
class Choices(OpenAIObject):

    def __init__(self,
                 finish_reason="stop",
                 index=0,
                 message=Message(),
                 **params):
        super(Choices, self).__init__(**params)
        self.finish_reason = finish_reason
        self.index = index
@@ -78,20 +97,22 @@ class Choices(OpenAIObject):
class ModelResponse(OpenAIObject):

    def __init__(self,
                 choices=None,
                 created=None,
                 model=None,
                 usage=None,
                 **params):
        super(ModelResponse, self).__init__(**params)
        self.choices = choices if choices else [Choices()]
        self.created = created
        self.model = model
        self.usage = (usage if usage else {
            "prompt_tokens": None,
            "completion_tokens": None,
            "total_tokens": None,
        })

    def to_dict_recursive(self):
        d = super().to_dict_recursive()
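As a quick illustration of the response shape these classes produce — a sketch, assuming `Choices` also stores its `message` argument (the hunk above cuts off before showing that assignment):

```python
# Build an empty response and read the field the README points callers at:
# response['choices'][0]['message']['content']
resp = ModelResponse(model="gpt-3.5-turbo")
first_choice = resp.choices[0]       # defaults to a single Choices() entry
text = first_choice.message.content  # "default" until a real completion fills it in
print(text)
```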
@@ -108,8 +129,6 @@ def print_verbose(print_statement):
####### Package Import Handler ###################
def install_and_import(package: str):
@@ -139,6 +158,7 @@ def install_and_import(package: str):
# Logging function -> log the exact model details + what's being sent | Non-Blocking
class Logging:
    global supabaseClient, liteDebuggerClient

    def __init__(self, model, messages, optional_params, litellm_params):
        self.model = model
        self.messages = messages
@@ -146,20 +166,20 @@
        self.litellm_params = litellm_params
        self.logger_fn = litellm_params["logger_fn"]
        self.model_call_details = {
            "model": model,
            "messages": messages,
            "optional_params": self.optional_params,
            "litellm_params": self.litellm_params,
        }

    def pre_call(self, input, api_key, additional_args={}):
        try:
            print(f"logging pre call for model: {self.model}")
            self.model_call_details["input"] = input
            self.model_call_details["api_key"] = api_key
            self.model_call_details["additional_args"] = additional_args

            # User Logging -> if you pass in a custom logging function
            print_verbose(
                f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
            )
@@ -173,7 +193,7 @@ class Logging:
                    f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
                )

            # Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
            for callback in litellm.input_callback:
                try:
                    if callback == "supabase":
@@ -185,7 +205,21 @@ class Logging:
                            model=model,
                            messages=messages,
                            end_user=litellm._thread_context.user,
                            litellm_call_id=self.
                            litellm_params["litellm_call_id"],
                            print_verbose=print_verbose,
                        )
                    elif callback == "llmonitor":
                        print_verbose("reaches llmonitor for logging!")
                        model = self.model
                        messages = self.messages
                        print(f"liteDebuggerClient: {liteDebuggerClient}")
                        llmonitorLogger.log_event(
                            type="start",
                            model=model,
                            messages=messages,
                            user_id=litellm._thread_context.user,
                            run_id=self.litellm_params["litellm_call_id"],
                            print_verbose=print_verbose,
                        )
                    elif callback == "lite_debugger":
@@ -197,15 +231,18 @@ class Logging:
                            model=model,
                            messages=messages,
                            end_user=litellm._thread_context.user,
                            litellm_call_id=self.
                            litellm_params["litellm_call_id"],
                            print_verbose=print_verbose,
                        )
                except Exception as e:
                    print_verbose(
                        f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}"
                    )
                    print_verbose(
                        f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
                    )
                    if capture_exception:  # log this error to sentry for debugging
                        capture_exception(e)
        except:
            print_verbose(
@@ -214,9 +251,9 @@ class Logging:
            print_verbose(
                f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
            )
            if capture_exception:  # log this error to sentry for debugging
                capture_exception(e)

    def post_call(self, input, api_key, original_response, additional_args={}):
        # Do something here
        try:
@@ -224,8 +261,8 @@ class Logging:
            self.model_call_details["api_key"] = api_key
            self.model_call_details["original_response"] = original_response
            self.model_call_details["additional_args"] = additional_args

            # User Logging -> if you pass in a custom logging function
            print_verbose(
                f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
            )
@@ -243,9 +280,9 @@ class Logging:
                f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
            )
            pass

    # Add more methods as needed


def exception_logging(
    additional_args={},
@@ -257,7 +294,7 @@ def exception_logging(
        if exception:
            model_call_details["exception"] = exception
        model_call_details["additional_args"] = additional_args

        # User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
        print_verbose(
            f"Logging Details: logger_fn - {logger_fn} | callable(logger_fn) - {callable(logger_fn)}"
        )
@@ -280,20 +317,20 @@ def exception_logging(
####### CLIENT ###################
# make it easy to log if completion/embedding runs succeeded or failed + see what happened | Non-Blocking
def client(original_function):

    def function_setup(
        *args, **kwargs
    ):  # just run once to check if user wants to send their data anywhere - PostHog/Sentry/Slack/etc.
        try:
            global callback_list, add_breadcrumb, user_logger_fn
            if (len(litellm.input_callback) > 0
                    or len(litellm.success_callback) > 0
                    or len(litellm.failure_callback)
                    > 0) and len(callback_list) == 0:
                callback_list = list(
                    set(litellm.input_callback + litellm.success_callback +
                        litellm.failure_callback))
                set_callbacks(callback_list=callback_list, )
            if add_breadcrumb:
                add_breadcrumb(
                    category="litellm.llm_call",
@@ -310,12 +347,11 @@ def client(original_function):
        if litellm.telemetry:
            try:
                model = args[0] if len(args) > 0 else kwargs["model"]
                exception = kwargs[
                    "exception"] if "exception" in kwargs else None
                custom_llm_provider = (kwargs["custom_llm_provider"]
                                       if "custom_llm_provider" in kwargs else
                                       None)
                safe_crash_reporting(
                    model=model,
                    exception=exception,
@@ -340,15 +376,12 @@ def client(original_function):
    def check_cache(*args, **kwargs):
        try:  # never block execution
            prompt = get_prompt(*args, **kwargs)
            if (prompt != None and prompt
                    in local_cache):  # check if messages / prompt exists
                if litellm.caching_with_models:
                    # if caching with model names is enabled, key is prompt + model name
                    if ("model" in kwargs and kwargs["model"]
                            in local_cache[prompt]["models"]):
                        cache_key = prompt + kwargs["model"]
                        return local_cache[cache_key]
                else:  # caching only with prompts
@@ -363,10 +396,8 @@ def client(original_function):
        try:  # never block execution
            prompt = get_prompt(*args, **kwargs)
            if litellm.caching_with_models:  # caching with model + prompt
                if ("model" in kwargs
                        and kwargs["model"] in local_cache[prompt]["models"]):
                    cache_key = prompt + kwargs["model"]
                    local_cache[cache_key] = result
            else:  # caching based only on prompts
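A sketch of how the cache switches above are meant to be used from calling code (assumed usage; the keying follows `check_cache`/`add_cache`):

```python
import litellm
from litellm import completion

litellm.caching = True                # key the local cache on the prompt alone
# litellm.caching_with_models = True  # or key it on prompt + model name

messages = [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}]
first = completion(model="gpt-3.5-turbo", messages=messages)   # hits the provider
second = completion(model="gpt-3.5-turbo", messages=messages)  # served from local_cache
```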
@@ -381,24 +412,24 @@ def client(original_function):
            function_setup(*args, **kwargs)
            litellm_call_id = str(uuid.uuid4())
            kwargs["litellm_call_id"] = litellm_call_id

            # [OPTIONAL] CHECK CACHE
            start_time = datetime.datetime.now()
            if (litellm.caching or litellm.caching_with_models) and (
                    cached_result := check_cache(*args, **kwargs)) is not None:
                result = cached_result
            else:
                # MODEL CALL
                result = original_function(*args, **kwargs)
            end_time = datetime.datetime.now()

            # Add response to CACHE
            if litellm.caching:
                add_cache(result, *args, **kwargs)

            # LOG SUCCESS
            crash_reporting(*args, **kwargs)
            my_thread = threading.Thread(
                target=handle_success,
                args=(args, kwargs, result, start_time,
                      end_time))  # don't interrupt execution of main thread
            my_thread.start()
            return result
        except Exception as e:
@@ -407,7 +438,8 @@ def client(original_function):
            end_time = datetime.datetime.now()
            my_thread = threading.Thread(
                target=handle_failure,
                args=(e, traceback_exception, start_time, end_time, args,
                      kwargs),
            )  # don't interrupt execution of main thread
            my_thread.start()
            raise e
@@ -432,18 +464,18 @@ def token_counter(model, text):
    return num_tokens


def cost_per_token(model="gpt-3.5-turbo",
                   prompt_tokens=0,
                   completion_tokens=0):
    # given
    prompt_tokens_cost_usd_dollar = 0
    completion_tokens_cost_usd_dollar = 0
    model_cost_ref = litellm.model_cost
    if model in model_cost_ref:
        prompt_tokens_cost_usd_dollar = (
            model_cost_ref[model]["input_cost_per_token"] * prompt_tokens)
        completion_tokens_cost_usd_dollar = (
            model_cost_ref[model]["output_cost_per_token"] * completion_tokens)
        return prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar
    else:
        # calculate average input cost
@@ -464,8 +496,9 @@ def completion_cost(model="gpt-3.5-turbo", prompt="", completion=""):
    prompt_tokens = token_counter(model=model, text=prompt)
    completion_tokens = token_counter(model=model, text=completion)
    prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = cost_per_token(
        model=model,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens)
    return prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
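To make the arithmetic concrete, here is a worked sketch of what `completion_cost` computes; the per-token prices are made-up placeholders, not values from `litellm.model_cost`.

```python
# Assumed example prices (USD per token) - placeholders, not litellm's real cost table.
input_cost_per_token = 0.0000015
output_cost_per_token = 0.000002

prompt_tokens = 16
completion_tokens = 41

prompt_cost = prompt_tokens * input_cost_per_token               # 0.000024
completion_cost_usd = completion_tokens * output_cost_per_token  # 0.000082
total_cost = prompt_cost + completion_cost_usd                   # 0.000106
print(f"${total_cost:.6f}")
```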
@@ -557,9 +590,8 @@ def get_optional_params(
        optional_params["max_tokens"] = max_tokens
        if frequency_penalty != 0:
            optional_params["frequency_penalty"] = frequency_penalty
    elif (model == "chat-bison"
          ):  # chat-bison has diff args from chat-bison@001 ty Google
        if temperature != 1:
            optional_params["temperature"] = temperature
        if top_p != 1:
@@ -619,7 +651,10 @@ def load_test_model(
    test_prompt = prompt
    if num_calls:
        test_calls = num_calls
    messages = [[{
        "role": "user",
        "content": test_prompt
    }] for _ in range(test_calls)]
    start_time = time.time()
    try:
        litellm.batch_completion(
@@ -649,7 +684,7 @@ def load_test_model(
def set_callbacks(callback_list):
    global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
    try:
        for callback in callback_list:
            print(f"callback: {callback}")
@@ -657,17 +692,15 @@ def set_callbacks(callback_list):
                try:
                    import sentry_sdk
                except ImportError:
                    print_verbose(
                        "Package 'sentry_sdk' is missing. Installing it...")
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", "sentry_sdk"])
                    import sentry_sdk
                sentry_sdk_instance = sentry_sdk
                sentry_trace_rate = (os.environ.get("SENTRY_API_TRACE_RATE")
                                     if "SENTRY_API_TRACE_RATE" in os.environ
                                     else "1.0")
                sentry_sdk_instance.init(
                    dsn=os.environ.get("SENTRY_API_URL"),
                    traces_sample_rate=float(sentry_trace_rate),
@@ -678,10 +711,10 @@ def set_callbacks(callback_list):
                try:
                    from posthog import Posthog
                except ImportError:
                    print_verbose(
                        "Package 'posthog' is missing. Installing it...")
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", "posthog"])
                    from posthog import Posthog
                posthog = Posthog(
                    project_api_key=os.environ.get("POSTHOG_API_KEY"),
@@ -691,10 +724,10 @@ def set_callbacks(callback_list):
                try:
                    from slack_bolt import App
                except ImportError:
                    print_verbose(
                        "Package 'slack_bolt' is missing. Installing it...")
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", "slack_bolt"])
                    from slack_bolt import App
                slack_app = App(
                    token=os.environ.get("SLACK_API_TOKEN"),
@@ -704,6 +737,8 @@ def set_callbacks(callback_list):
                print_verbose(f"Initialized Slack App: {slack_app}")
            elif callback == "helicone":
                heliconeLogger = HeliconeLogger()
            elif callback == "llmonitor":
                llmonitorLogger = LLMonitorLogger()
            elif callback == "aispend":
                aispendLogger = AISpendLogger()
            elif callback == "berrispend":
@@ -718,7 +753,8 @@ def set_callbacks(callback_list):
        raise e
def handle_failure(exception, traceback_exception, start_time, end_time, args,
                   kwargs):
    global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
    try:
        # print_verbose(f"handle_failure args: {args}")
@@ -728,8 +764,7 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
        failure_handler = additional_details.pop("failure_handler", None)

        additional_details["Event_Name"] = additional_details.pop(
            "failed_event_name", "litellm.failed_query")
        print_verbose(f"self.failure_callback: {litellm.failure_callback}")

        # print_verbose(f"additional_details: {additional_details}")
@@ -746,9 +781,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                for detail in additional_details:
                    slack_msg += f"{detail}: {additional_details[detail]}\n"
                slack_msg += f"Traceback: {traceback_exception}"
                slack_app.client.chat_postMessage(channel=alerts_channel,
                                                  text=slack_msg)
            elif callback == "sentry":
                capture_exception(exception)
            elif callback == "posthog":
@@ -767,9 +801,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                print_verbose(f"ph_obj: {ph_obj}")
                print_verbose(f"PostHog Event Name: {event_name}")
                if "user_id" in additional_details:
                    posthog.capture(additional_details["user_id"],
                                    event_name, ph_obj)
                else:  # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
                    unique_id = str(uuid.uuid4())
                    posthog.capture(unique_id, event_name)
@@ -783,10 +816,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "created": time.time(),
                    "error": traceback_exception,
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                berrispendLogger.log_event(
@@ -805,10 +838,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "model": model,
                    "created": time.time(),
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                aispendLogger.log_event(
@@ -818,6 +851,27 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    end_time=end_time,
                    print_verbose=print_verbose,
                )
            elif callback == "llmonitor":
                print_verbose("reaches llmonitor for logging!")
                model = args[0] if len(args) > 0 else kwargs["model"]
                messages = args[1] if len(args) > 1 else kwargs["messages"]
                usage = {
                    "prompt_tokens":
                    prompt_token_calculator(model, messages=messages),
                    "completion_tokens":
                    0,
                }
                llmonitorLogger.log_event(
                    type="error",
                    user_id=litellm._thread_context.user,
                    model=model,
                    error=traceback_exception,
                    response_obj=result,
                    run_id=kwargs["litellm_call_id"],
                    timestamp=end_time,
                    usage=usage,
                    print_verbose=print_verbose,
                )
            elif callback == "supabase":
                print_verbose("reaches supabase for logging!")
                print_verbose(f"supabaseClient: {supabaseClient}")
@@ -828,10 +882,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "created": time.time(),
                    "error": traceback_exception,
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                supabaseClient.log_event(
@@ -854,10 +908,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "created": time.time(),
                    "error": traceback_exception,
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                liteDebuggerClient.log_event(
@@ -884,19 +938,18 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
            failure_handler(call_details)
            pass
    except Exception as e:
        # LOGGING
        exception_logging(logger_fn=user_logger_fn, exception=e)
        pass
def handle_success(args, kwargs, result, start_time, end_time):
    global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
    try:
        success_handler = additional_details.pop("success_handler", None)
        failure_handler = additional_details.pop("failure_handler", None)
        additional_details["Event_Name"] = additional_details.pop(
            "successful_event_name", "litellm.succes_query")
        for callback in litellm.success_callback:
            try:
                if callback == "posthog":
@@ -905,9 +958,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
                        ph_obj[detail] = additional_details[detail]
                    event_name = additional_details["Event_Name"]
                    if "user_id" in additional_details:
                        posthog.capture(additional_details["user_id"],
                                        event_name, ph_obj)
                    else:  # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
                        unique_id = str(uuid.uuid4())
                        posthog.capture(unique_id, event_name, ph_obj)
@@ -916,9 +968,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
                    slack_msg = ""
                    for detail in additional_details:
                        slack_msg += f"{detail}: {additional_details[detail]}\n"
                    slack_app.client.chat_postMessage(channel=alerts_channel,
                                                      text=slack_msg)
                elif callback == "helicone":
                    print_verbose("reaches helicone for logging!")
                    model = args[0] if len(args) > 0 else kwargs["model"]
@@ -931,6 +982,22 @@ def handle_success(args, kwargs, result, start_time, end_time):
                        end_time=end_time,
                        print_verbose=print_verbose,
                    )
                elif callback == "llmonitor":
                    print_verbose("reaches llmonitor for logging!")
                    model = args[0] if len(args) > 0 else kwargs["model"]
                    messages = args[1] if len(args) > 1 else kwargs["messages"]
                    usage = kwargs["usage"]
                    llmonitorLogger.log_event(
                        type="end",
                        model=model,
                        messages=messages,
                        user_id=litellm._thread_context.user,
                        response_obj=result,
                        time=end_time,
                        usage=usage,
                        run_id=kwargs["litellm_call_id"],
                        print_verbose=print_verbose,
                    )
                elif callback == "aispend":
                    print_verbose("reaches aispend for logging!")
                    model = args[0] if len(args) > 0 else kwargs["model"]
@@ -984,7 +1051,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
                    print_verbose=print_verbose,
                )
            except Exception as e:
                # LOGGING
                exception_logging(logger_fn=user_logger_fn, exception=e)
                print_verbose(
                    f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
@@ -995,7 +1062,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
            success_handler(args, kwargs)
            pass
    except Exception as e:
        # LOGGING
        exception_logging(logger_fn=user_logger_fn, exception=e)
        print_verbose(
            f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
@@ -1046,33 +1113,36 @@ def exception_type(model, original_exception, custom_llm_provider):
        exception_type = ""
        if "claude" in model:  # one of the anthropics
            if hasattr(original_exception, "status_code"):
                print_verbose(
                    f"status_code: {original_exception.status_code}")
                if original_exception.status_code == 401:
                    exception_mapping_worked = True
                    raise AuthenticationError(
                        message=
                        f"AnthropicException - {original_exception.message}",
                        llm_provider="anthropic",
                    )
                elif original_exception.status_code == 400:
                    exception_mapping_worked = True
                    raise InvalidRequestError(
                        message=
                        f"AnthropicException - {original_exception.message}",
                        model=model,
                        llm_provider="anthropic",
                    )
                elif original_exception.status_code == 429:
                    exception_mapping_worked = True
                    raise RateLimitError(
                        message=
                        f"AnthropicException - {original_exception.message}",
                        llm_provider="anthropic",
                    )
            elif ("Could not resolve authentication method. Expected either api_key or auth_token to be set."
                  in error_str):
                exception_mapping_worked = True
                raise AuthenticationError(
                    message=
                    f"AnthropicException - {original_exception.message}",
                    llm_provider="anthropic",
                )
        elif "replicate" in model:
@@ -1096,35 +1166,36 @@ def exception_type(model, original_exception, custom_llm_provider):
                    llm_provider="replicate",
                )
            elif (
                    exception_type == "ReplicateError"
            ):  # ReplicateError implies an error on Replicate server side, not user side
                raise ServiceUnavailableError(
                    message=f"ReplicateException - {error_str}",
                    llm_provider="replicate",
                )
        elif model == "command-nightly":  # Cohere
            if ("invalid api token" in error_str
                    or "No API key provided." in error_str):
                exception_mapping_worked = True
                raise AuthenticationError(
                    message=
                    f"CohereException - {original_exception.message}",
                    llm_provider="cohere",
                )
            elif "too many tokens" in error_str:
                exception_mapping_worked = True
                raise InvalidRequestError(
                    message=
                    f"CohereException - {original_exception.message}",
                    model=model,
                    llm_provider="cohere",
                )
            elif (
                    "CohereConnectionError" in exception_type
            ):  # cohere seems to fire these errors when we load test it (1k+ messages / min)
                exception_mapping_worked = True
                raise RateLimitError(
                    message=
                    f"CohereException - {original_exception.message}",
                    llm_provider="cohere",
                )
        elif custom_llm_provider == "huggingface":
@@ -1132,27 +1203,30 @@ def exception_type(model, original_exception, custom_llm_provider):
            if original_exception.status_code == 401:
                exception_mapping_worked = True
                raise AuthenticationError(
                    message=
                    f"HuggingfaceException - {original_exception.message}",
                    llm_provider="huggingface",
                )
            elif original_exception.status_code == 400:
                exception_mapping_worked = True
                raise InvalidRequestError(
                    message=
                    f"HuggingfaceException - {original_exception.message}",
                    model=model,
                    llm_provider="huggingface",
                )
            elif original_exception.status_code == 429:
                exception_mapping_worked = True
                raise RateLimitError(
                    message=
                    f"HuggingfaceException - {original_exception.message}",
                    llm_provider="huggingface",
                )
            raise original_exception  # base case - return the original exception
        else:
            raise original_exception
    except Exception as e:
        # LOGGING
        exception_logging(
            logger_fn=user_logger_fn,
            additional_args={
@@ -1173,7 +1247,7 @@ def safe_crash_reporting(model=None, exception=None, custom_llm_provider=None):
        "exception": str(exception),
        "custom_llm_provider": custom_llm_provider,
    }
    threading.Thread(target=litellm_telemetry, args=(data, )).start()
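Since `exception_type` re-raises provider errors as litellm's own exception classes, a caller-side sketch of handling them looks like this (assumed usage; the exception names come from the `.exceptions` import at the top of the file):

```python
from litellm import completion
from litellm.exceptions import AuthenticationError, InvalidRequestError, RateLimitError

try:
    response = completion(model="claude-2",
                          messages=[{"role": "user", "content": "Hi 👋"}])
except AuthenticationError:
    print("check the Anthropic API key")        # mapped from a 401
except RateLimitError:
    print("rate limited - back off and retry")  # mapped from a 429
except InvalidRequestError as e:
    print(f"bad request: {e}")                  # mapped from a 400
```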
def litellm_telemetry(data): def litellm_telemetry(data):
@ -1223,11 +1297,13 @@ def get_secret(secret_name):
if litellm.secret_manager_client != None: if litellm.secret_manager_client != None:
# TODO: check which secret manager is being used # TODO: check which secret manager is being used
# currently only supports Infisical # currently only supports Infisical
secret = litellm.secret_manager_client.get_secret(secret_name).secret_value secret = litellm.secret_manager_client.get_secret(
secret_name).secret_value
if secret != None: if secret != None:
return secret # if secret found in secret manager return it return secret # if secret found in secret manager return it
else: else:
raise ValueError(f"Secret '{secret_name}' not found in secret manager") raise ValueError(
f"Secret '{secret_name}' not found in secret manager")
elif litellm.api_key != None: # if users use litellm default key elif litellm.api_key != None: # if users use litellm default key
return litellm.api_key return litellm.api_key
else: else:
@ -1238,6 +1314,7 @@ def get_secret(secret_name):
# wraps the completion stream to return the correct format for the model # wraps the completion stream to return the correct format for the model
# replicate/anthropic/cohere # replicate/anthropic/cohere
class CustomStreamWrapper: class CustomStreamWrapper:
def __init__(self, completion_stream, model, custom_llm_provider=None): def __init__(self, completion_stream, model, custom_llm_provider=None):
self.model = model self.model = model
self.custom_llm_provider = custom_llm_provider self.custom_llm_provider = custom_llm_provider
@@ -1288,7 +1365,8 @@ class CustomStreamWrapper:
        elif self.model == "replicate":
            chunk = next(self.completion_stream)
            completion_obj["content"] = chunk
        elif (self.model == "together_ai") or ("togethercomputer"
                                               in self.model):
            chunk = next(self.completion_stream)
            text_data = self.handle_together_ai_chunk(chunk)
            if text_data == "":
@@ -1321,12 +1399,11 @@ def read_config_args(config_path):

########## ollama implementation ############################
import aiohttp


async def get_ollama_response_stream(api_base="http://localhost:11434",
                                     model="llama2",
                                     prompt="Why is the sky blue?"):
    session = aiohttp.ClientSession()
    url = f"{api_base}/api/generate"
    data = {
@@ -1349,7 +1426,11 @@ async def get_ollama_response_stream(
                    "content": "",
                }
                completion_obj["content"] = j["response"]
                yield {
                    "choices": [{
                        "delta": completion_obj
                    }]
                }
                # self.responses.append(j["response"])
                # yield "blank"
        except Exception as e:
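For orientation, here is a minimal sketch of how this async generator might be consumed. The driver below is an illustration, not part of the diff: the import path is an assumption, and it presumes an Ollama server is reachable at the default `api_base` with the `llama2` model pulled.

```python
import asyncio

# Hypothetical import path for illustration; adjust to wherever
# get_ollama_response_stream actually lives in your checkout.
from litellm.utils import get_ollama_response_stream


async def main():
    # Each yielded chunk mimics the OpenAI streaming shape:
    # {"choices": [{"delta": {"role": "assistant", "content": "..."}}]}
    async for chunk in get_ollama_response_stream(
            model="llama2", prompt="Why is the sky blue?"):
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)


if __name__ == "__main__":
    asyncio.run(main())
```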

View file

@@ -1,6 +1,7 @@
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)
@@ -11,34 +12,36 @@
![4BC6491E-86D0-4833-B061-9F54524B2579](https://github.com/BerriAI/litellm/assets/17561003/f5dd237b-db5e-42e1-b1ac-f05683b1d724)

## What does liteLLM proxy do
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**

  Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`

```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
- **Consistent Input/Output** Format
  - Call all models using the OpenAI format - `completion(model, messages)`
  - Text responses will always be available at `['choices'][0]['message']['content']`
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone`, `LLMonitor` (any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)

**Example: Logs sent to Supabase**
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
- **Caching** - Implementation of Semantic Caching
- **Streaming & Async Support** - Return generators to stream text responses (see the sketch after this list)
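A rough sketch of what streaming a response from the proxy could look like. The exact chunk framing depends on how the server is deployed, so treat this as an illustration rather than a guaranteed wire format:

```python
import requests

# Assumes the proxy is running locally on the default port used in this README.
url = "http://localhost:5000/chat/completions"

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Write a haiku about proxies."}],
    "stream": True,  # ask the proxy to stream chunks back instead of one response
}

with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            # Each line is a raw chunk; how it is framed (plain JSON, SSE, etc.)
            # depends on the deployment, so inspect it before parsing.
            print(line.decode("utf-8"))
```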
## API Endpoints
### `/chat/completions` (POST)
@@ -46,34 +49,37 @@
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.

#### Input
This API endpoint accepts all inputs in raw JSON and expects the following inputs:
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/),
  e.g. `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for the function role).
- Additional optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
#### Example JSON body
For claude-2
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
### Making an API request to the Proxy Server
```python
import requests
import json

# TODO: use your URL
url = "http://localhost:5000/chat/completions"

payload = json.dumps({
@@ -94,34 +100,38 @@ print(response.text)
```

### Output [Response Format]
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
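Because every model is normalized to this shape, client code can read the reply and token usage from fixed paths. A small sketch, assuming the proxy is running locally as in the request example above:

```python
import requests

url = "http://localhost:5000/chat/completions"  # local proxy from the example above

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}],
}

data = requests.post(url, json=payload).json()

# The assistant text is always at the same path, regardless of the underlying model.
print(data["choices"][0]["message"]["content"])

# Token accounting comes back in the `usage` block.
print(data["usage"]["total_tokens"])
```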
## Installation & Usage
### Running Locally
1. Clone liteLLM repository to your local machine:
```
git clone https://github.com/BerriAI/liteLLM-proxy
```
@@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
```
python main.py
```
## Deploying
1. Quick Start: Deploy on Railway

   [![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/DYqQAW?referralCode=t3ukrU)

2. `GCP`, `AWS`, `Azure`

   This project includes a `Dockerfile`, allowing you to build and deploy a Docker image on the cloud provider of your choice.
# Support / Talk with founders
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
## Roadmap
- [ ] Support hosted db (e.g. Supabase)
- [ ] Easily send data to places like Posthog and Sentry
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits