forked from phoenix/litellm-mirror
almost working llmonitor
This commit is contained in:
parent
22c7e38de5
commit
3675d3e029
5 changed files with 425 additions and 326 deletions
|
@ -1,6 +1,7 @@
|
|||
|
||||
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
|
||||
|
||||
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
|
||||
|
||||
[](https://pypi.org/project/litellm/)
|
||||
[](https://pypi.org/project/litellm/0.1.1/)
|
||||

|
||||
|
@ -11,34 +12,36 @@
|
|||

|
||||
|
||||
## What does liteLLM proxy do
|
||||
|
||||
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**
|
||||
|
||||
|
||||
Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']`
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']` (see the snippet after this list)
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `LLMonitor`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
|
||||
|
||||
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
|
||||
- **Caching** - Implementation of Semantic Caching
|
||||
- **Streaming & Async Support** - Return generators to stream text responses
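Because every model is returned in the OpenAI format, reading the generated text out of a proxy response is identical for all of them. A minimal sketch (the localhost URL mirrors the local-run example later in this README and is an assumption about where your proxy is deployed):

```python
import requests

# Assumed local proxy address - replace with wherever your proxy server runs
url = "http://localhost:5000/chat/completions"

payload = {
    "model": "gpt-3.5-turbo",  # any supported model id works here
    "messages": [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}],
}

response = requests.post(url, json=payload).json()

# Same access path for every provider, as noted in the list above
print(response["choices"][0]["message"]["content"])
```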
|
||||
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### `/chat/completions` (POST)
|
||||
|
@ -46,34 +49,37 @@
|
|||
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.
|
||||
|
||||
#### Input
|
||||
|
||||
This API endpoint accepts all inputs in raw JSON and expects the following inputs
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
|
||||
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
|
||||
|
||||
|
||||
#### Example JSON body
|
||||
|
||||
For claude-2
|
||||
|
||||
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
|
||||
### Making an API request to the Proxy Server
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
# TODO: use your URL
|
||||
# TODO: use your URL
|
||||
url = "http://localhost:5000/chat/completions"
|
||||
|
||||
payload = json.dumps({
|
||||
|
@ -94,34 +100,38 @@ print(response.text)
|
|||
```
|
||||
|
||||
### Output [Response Format]
|
||||
Responses from the server are given in the following format.
|
||||
|
||||
Responses from the server are given in the following format.
|
||||
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
|
||||
|
||||
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
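The `usage` block is what makes the **Token Usage & Spend** tracking above easy to consume client-side. A small sketch of turning it into an approximate cost (the per-token prices below are placeholders, not real rates):

```python
# Parsed response from the proxy, with the usage fields shown in the example above
response = {"usage": {"completion_tokens": 41, "prompt_tokens": 16, "total_tokens": 57}}

usage = response["usage"]

# Placeholder USD-per-token prices - substitute the actual rates for your model
INPUT_PRICE, OUTPUT_PRICE = 0.0000015, 0.000002

cost = usage["prompt_tokens"] * INPUT_PRICE + usage["completion_tokens"] * OUTPUT_PRICE
print(f"{usage['total_tokens']} tokens, approx ${cost:.6f}")
```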
|
||||
|
||||
## Installation & Usage
|
||||
|
||||
### Running Locally
|
||||
|
||||
1. Clone liteLLM repository to your local machine:
|
||||
```
|
||||
git clone https://github.com/BerriAI/liteLLM-proxy
|
||||
|
@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
|
|||
python main.py
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Deploying
|
||||
|
||||
1. Quick Start: Deploy on Railway
|
||||
|
||||
[](https://railway.app/template/DYqQAW?referralCode=t3ukrU)
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
# Support / Talk with founders
|
||||
|
||||
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
|
||||
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
|
||||
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
|
||||
|
||||
## Roadmap
|
||||
|
||||
- [ ] Support hosted db (e.g. Supabase)
|
||||
- [ ] Easily send data to places like posthog and sentry.
|
||||
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits
|
||||
|
|
|
@ -5,6 +5,7 @@ import traceback
|
|||
import dotenv
|
||||
import os
|
||||
import requests
|
||||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
|
||||
|
||||
|
@ -14,45 +15,34 @@ class LLMonitorLogger:
|
|||
# Instance variables
|
||||
self.api_url = os.getenv(
|
||||
"LLMONITOR_API_URL") or "https://app.llmonitor.com"
|
||||
self.account_id = os.getenv("LLMONITOR_APP_ID")
|
||||
self.app_id = os.getenv("LLMONITOR_APP_ID")
|
||||
|
||||
def log_event(self, model, messages, response_obj, start_time, end_time, print_verbose):
|
||||
def log_event(self, type, run_id, error, usage, model, messages,
|
||||
response_obj, user_id, time, print_verbose):
|
||||
# Method definition
|
||||
try:
|
||||
print_verbose(
|
||||
f"LLMonitor Logging - Enters logging function for model {model}")
|
||||
f"LLMonitor Logging - Enters logging function for model {model}"
|
||||
)
|
||||
|
||||
print(model, messages, response_obj, start_time, end_time)
|
||||
print(type, model, messages, response_obj, time, user_id)
|
||||
|
||||
# headers = {
|
||||
# 'Content-Type': 'application/json'
|
||||
# }
|
||||
headers = {'Content-Type': 'application/json'}
|
||||
|
||||
# prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = self.price_calculator(
|
||||
# model, response_obj, start_time, end_time)
|
||||
# total_cost = prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
|
||||
data = {
|
||||
"type": "llm",
|
||||
"name": model,
|
||||
"runId": run_id,
|
||||
"app": self.app_id,
|
||||
"error": error,
|
||||
"event": type,
|
||||
"timestamp": time.isoformat(),
|
||||
"userId": user_id,
|
||||
"input": messages,
|
||||
"output": response_obj['choices'][0]['message']['content'],
|
||||
}
|
||||
|
||||
# response_time = (end_time-start_time).total_seconds()
|
||||
# if "response" in response_obj:
|
||||
# data = [{
|
||||
# "response_time": response_time,
|
||||
# "model_id": response_obj["model"],
|
||||
# "total_cost": total_cost,
|
||||
# "messages": messages,
|
||||
# "response": response_obj['choices'][0]['message']['content'],
|
||||
# "account_id": self.account_id
|
||||
# }]
|
||||
# elif "error" in response_obj:
|
||||
# data = [{
|
||||
# "response_time": response_time,
|
||||
# "model_id": response_obj["model"],
|
||||
# "total_cost": total_cost,
|
||||
# "messages": messages,
|
||||
# "error": response_obj['error'],
|
||||
# "account_id": self.account_id
|
||||
# }]
|
||||
|
||||
# print_verbose(f"BerriSpend Logging - final data object: {data}")
|
||||
print_verbose(f"LLMonitor Logging - final data object: {data}")
|
||||
# response = requests.post(url, headers=headers, json=data)
|
||||
except:
    # traceback.print_exc()
    pass
|
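# The POST to LLMonitor is still commented out in this commit. Below is a hedged
# sketch of what the send could look like once wired up - the "/api/report" path and
# the {"events": [...]} envelope are assumptions for illustration, not confirmed
# LLMonitor API details. `requests` is already imported at the top of this module.
def _send_llmonitor_event(api_url, data, print_verbose):
    headers = {'Content-Type': 'application/json'}
    try:
        response = requests.post(f"{api_url}/api/report",
                                 headers=headers,
                                 json={"events": [data]},
                                 timeout=5)
        print_verbose(f"LLMonitor Logging - got response {response.status_code}")
    except Exception:
        pass  # keep logging non-blocking, mirroring the bare except above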
||||
|
|
|
@ -1,28 +1,36 @@
|
|||
#### What this tests ####
|
||||
# This tests if logging to the helicone integration actually works
|
||||
|
||||
from litellm import embedding, completion
|
||||
import litellm
|
||||
# This tests if logging to the llmonitor integration actually works
|
||||
# Adds the parent directory to the system path
|
||||
import sys
|
||||
import os
|
||||
import traceback
|
||||
import pytest
|
||||
|
||||
# Adds the parent directory to the system path
|
||||
sys.path.insert(0, os.path.abspath('../..'))
|
||||
|
||||
from litellm import completion
|
||||
import litellm
|
||||
|
||||
litellm.input_callback = ["llmonitor"]
|
||||
litellm.success_callback = ["llmonitor"]
|
||||
litellm.error_callback = ["llmonitor"]
|
||||
|
||||
litellm.set_verbose = True
|
||||
|
||||
user_message = "Hello, how are you?"
|
||||
messages = [{"content": user_message, "role": "user"}]
|
||||
|
||||
|
||||
# openai call
|
||||
response = completion(model="gpt-3.5-turbo",
|
||||
messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}])
|
||||
# response = completion(model="gpt-3.5-turbo",
|
||||
# messages=[{
|
||||
# "role": "user",
|
||||
# "content": "Hi 👋 - i'm openai"
|
||||
# }])
|
||||
|
||||
# print(response)
|
||||
|
||||
# #bad request call
|
||||
# response = completion(model="chatgpt-test", messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])
|
||||
|
||||
# cohere call
|
||||
# response = completion(model="command-nightly",
|
||||
# messages=[{"role": "user", "content": "Hi 👋 - i'm cohere"}])
|
||||
response = completion(model="command-nightly",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": "Hi 👋 - i'm cohere"
|
||||
}])
|
||||
print(response)
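# A hedged sketch of also exercising litellm.error_callback: the commented-out
# "bad request call" above uses an invalid model name, so re-enabling it inside a
# try/except should make the failure handler fire the "llmonitor" error callback
# while letting this script keep running.
try:
    completion(model="chatgpt-test",
               messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])
except Exception as e:
    print(f"expected failure: {e}")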
|
||||
|
|
437
litellm/utils.py
437
litellm/utils.py
|
@ -1,20 +1,7 @@
|
|||
import sys
|
||||
import dotenv, json, traceback, threading
|
||||
import subprocess, os
|
||||
import litellm, openai
|
||||
import random, uuid, requests
|
||||
import datetime, time
|
||||
import tiktoken
|
||||
|
||||
encoding = tiktoken.get_encoding("cl100k_base")
|
||||
import pkg_resources
|
||||
from .integrations.helicone import HeliconeLogger
|
||||
from .integrations.aispend import AISpendLogger
|
||||
from .integrations.berrispend import BerriSpendLogger
|
||||
from .integrations.supabase import Supabase
|
||||
from .integrations.litedebugger import LiteDebugger
|
||||
from openai.error import OpenAIError as OriginalError
|
||||
from openai.openai_object import OpenAIObject
|
||||
import aiohttp
|
||||
import subprocess
|
||||
import importlib
|
||||
from typing import List, Dict, Union, Optional
|
||||
from .exceptions import (
|
||||
AuthenticationError,
|
||||
InvalidRequestError,
|
||||
|
@ -22,7 +9,32 @@ from .exceptions import (
|
|||
ServiceUnavailableError,
|
||||
OpenAIError,
|
||||
)
|
||||
from typing import List, Dict, Union, Optional
|
||||
from openai.openai_object import OpenAIObject
|
||||
from openai.error import OpenAIError as OriginalError
|
||||
from .integrations.llmonitor import LLMonitorLogger
|
||||
from .integrations.litedebugger import LiteDebugger
|
||||
from .integrations.supabase import Supabase
|
||||
from .integrations.berrispend import BerriSpendLogger
|
||||
from .integrations.aispend import AISpendLogger
|
||||
from .integrations.helicone import HeliconeLogger
|
||||
import pkg_resources
|
||||
import sys
|
||||
import dotenv
|
||||
import json
|
||||
import traceback
|
||||
import threading
|
||||
import subprocess
|
||||
import os
|
||||
import litellm
|
||||
import openai
|
||||
import random
|
||||
import uuid
|
||||
import requests
|
||||
import datetime
|
||||
import time
|
||||
import tiktoken
|
||||
|
||||
encoding = tiktoken.get_encoding("cl100k_base")
|
||||
|
||||
####### ENVIRONMENT VARIABLES ###################
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
|
@ -37,6 +49,7 @@ aispendLogger = None
|
|||
berrispendLogger = None
|
||||
supabaseClient = None
|
||||
liteDebuggerClient = None
|
||||
llmonitorLogger = None
|
||||
callback_list: Optional[List[str]] = []
|
||||
user_logger_fn = None
|
||||
additional_details: Optional[Dict[str, str]] = {}
|
||||
|
@ -63,6 +76,7 @@ local_cache: Optional[Dict[str, str]] = {}
|
|||
|
||||
|
||||
class Message(OpenAIObject):
|
||||
|
||||
def __init__(self, content="default", role="assistant", **params):
|
||||
super(Message, self).__init__(**params)
|
||||
self.content = content
|
||||
|
@ -70,7 +84,12 @@ class Message(OpenAIObject):
|
|||
|
||||
|
||||
class Choices(OpenAIObject):
|
||||
def __init__(self, finish_reason="stop", index=0, message=Message(), **params):
|
||||
|
||||
def __init__(self,
|
||||
finish_reason="stop",
|
||||
index=0,
|
||||
message=Message(),
|
||||
**params):
|
||||
super(Choices, self).__init__(**params)
|
||||
self.finish_reason = finish_reason
|
||||
self.index = index
|
||||
|
@ -78,20 +97,22 @@ class Choices(OpenAIObject):
|
|||
|
||||
|
||||
class ModelResponse(OpenAIObject):
|
||||
def __init__(self, choices=None, created=None, model=None, usage=None, **params):
|
||||
|
||||
def __init__(self,
|
||||
choices=None,
|
||||
created=None,
|
||||
model=None,
|
||||
usage=None,
|
||||
**params):
|
||||
super(ModelResponse, self).__init__(**params)
|
||||
self.choices = choices if choices else [Choices()]
|
||||
self.created = created
|
||||
self.model = model
|
||||
self.usage = (
|
||||
usage
|
||||
if usage
|
||||
else {
|
||||
"prompt_tokens": None,
|
||||
"completion_tokens": None,
|
||||
"total_tokens": None,
|
||||
}
|
||||
)
|
||||
self.usage = (usage if usage else {
|
||||
"prompt_tokens": None,
|
||||
"completion_tokens": None,
|
||||
"total_tokens": None,
|
||||
})
|
||||
|
||||
def to_dict_recursive(self):
|
||||
d = super().to_dict_recursive()
|
||||
|
@ -108,8 +129,6 @@ def print_verbose(print_statement):
|
|||
|
||||
|
||||
####### Package Import Handler ###################
|
||||
import importlib
|
||||
import subprocess
|
||||
|
||||
|
||||
def install_and_import(package: str):
|
||||
|
@ -139,6 +158,7 @@ def install_and_import(package: str):
|
|||
# Logging function -> log the exact model details + what's being sent | Non-Blocking
|
||||
class Logging:
|
||||
global supabaseClient, liteDebuggerClient
|
||||
|
||||
def __init__(self, model, messages, optional_params, litellm_params):
|
||||
self.model = model
|
||||
self.messages = messages
|
||||
|
@ -146,20 +166,20 @@ class Logging:
|
|||
self.litellm_params = litellm_params
|
||||
self.logger_fn = litellm_params["logger_fn"]
|
||||
self.model_call_details = {
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"optional_params": self.optional_params,
|
||||
"litellm_params": self.litellm_params,
|
||||
}
|
||||
|
||||
|
||||
def pre_call(self, input, api_key, additional_args={}):
|
||||
try:
|
||||
print(f"logging pre call for model: {self.model}")
|
||||
self.model_call_details["input"] = input
|
||||
self.model_call_details["api_key"] = api_key
|
||||
self.model_call_details["additional_args"] = additional_args
|
||||
|
||||
## User Logging -> if you pass in a custom logging function
|
||||
|
||||
# User Logging -> if you pass in a custom logging function
|
||||
print_verbose(
|
||||
f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
|
||||
)
|
||||
|
@ -173,7 +193,7 @@ class Logging:
|
|||
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
|
||||
)
|
||||
|
||||
## Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
|
||||
# Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
|
||||
for callback in litellm.input_callback:
|
||||
try:
|
||||
if callback == "supabase":
|
||||
|
@ -185,7 +205,21 @@ class Logging:
|
|||
model=model,
|
||||
messages=messages,
|
||||
end_user=litellm._thread_context.user,
|
||||
litellm_call_id=self.litellm_params["litellm_call_id"],
|
||||
litellm_call_id=self.
|
||||
litellm_params["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "llmonitor":
|
||||
print_verbose("reaches llmonitor for logging!")
|
||||
model = self.model
|
||||
messages = self.messages
|
||||
print(f"liteDebuggerClient: {liteDebuggerClient}")
|
||||
llmonitorLogger.log_event(
|
||||
type="start",
|
||||
model=model,
|
||||
messages=messages,
|
||||
user_id=litellm._thread_context.user,
|
||||
run_id=self.litellm_params["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "lite_debugger":
|
||||
|
@ -197,15 +231,18 @@ class Logging:
|
|||
model=model,
|
||||
messages=messages,
|
||||
end_user=litellm._thread_context.user,
|
||||
litellm_call_id=self.litellm_params["litellm_call_id"],
|
||||
litellm_call_id=self.
|
||||
litellm_params["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
except Exception as e:
|
||||
print_verbose(f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}")
|
||||
print_verbose(
|
||||
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}"
|
||||
)
|
||||
print_verbose(
|
||||
f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
|
||||
)
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
capture_exception(e)
|
||||
except:
|
||||
print_verbose(
|
||||
|
@ -214,9 +251,9 @@ class Logging:
|
|||
print_verbose(
|
||||
f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
|
||||
)
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
capture_exception(e)
|
||||
|
||||
|
||||
def post_call(self, input, api_key, original_response, additional_args={}):
|
||||
# Do something here
|
||||
try:
|
||||
|
@ -224,8 +261,8 @@ class Logging:
|
|||
self.model_call_details["api_key"] = api_key
|
||||
self.model_call_details["original_response"] = original_response
|
||||
self.model_call_details["additional_args"] = additional_args
|
||||
|
||||
## User Logging -> if you pass in a custom logging function
|
||||
|
||||
# User Logging -> if you pass in a custom logging function
|
||||
print_verbose(
|
||||
f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
|
||||
)
|
||||
|
@ -243,9 +280,9 @@ class Logging:
|
|||
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
|
||||
)
|
||||
pass
|
||||
|
||||
|
||||
# Add more methods as needed
|
||||
|
||||
|
||||
|
||||
def exception_logging(
|
||||
additional_args={},
|
||||
|
@ -257,7 +294,7 @@ def exception_logging(
|
|||
if exception:
|
||||
model_call_details["exception"] = exception
|
||||
model_call_details["additional_args"] = additional_args
|
||||
## User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
|
||||
# User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
|
||||
print_verbose(
|
||||
f"Logging Details: logger_fn - {logger_fn} | callable(logger_fn) - {callable(logger_fn)}"
|
||||
)
|
||||
|
@ -280,20 +317,20 @@ def exception_logging(
|
|||
####### CLIENT ###################
|
||||
# make it easy to log if completion/embedding runs succeeded or failed + see what happened | Non-Blocking
|
||||
def client(original_function):
|
||||
|
||||
def function_setup(
|
||||
*args, **kwargs
|
||||
): # just run once to check if user wants to send their data anywhere - PostHog/Sentry/Slack/etc.
|
||||
try:
|
||||
global callback_list, add_breadcrumb, user_logger_fn
|
||||
if (
|
||||
len(litellm.input_callback) > 0 or len(litellm.success_callback) > 0 or len(litellm.failure_callback) > 0
|
||||
) and len(callback_list) == 0:
|
||||
if (len(litellm.input_callback) > 0
|
||||
or len(litellm.success_callback) > 0
|
||||
or len(litellm.failure_callback)
|
||||
> 0) and len(callback_list) == 0:
|
||||
callback_list = list(
|
||||
set(litellm.input_callback + litellm.success_callback + litellm.failure_callback)
|
||||
)
|
||||
set_callbacks(
|
||||
callback_list=callback_list,
|
||||
)
|
||||
set(litellm.input_callback + litellm.success_callback +
|
||||
litellm.failure_callback))
|
||||
set_callbacks(callback_list=callback_list, )
|
||||
if add_breadcrumb:
|
||||
add_breadcrumb(
|
||||
category="litellm.llm_call",
|
||||
|
@ -310,12 +347,11 @@ def client(original_function):
|
|||
if litellm.telemetry:
|
||||
try:
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
exception = kwargs["exception"] if "exception" in kwargs else None
|
||||
custom_llm_provider = (
|
||||
kwargs["custom_llm_provider"]
|
||||
if "custom_llm_provider" in kwargs
|
||||
else None
|
||||
)
|
||||
exception = kwargs[
|
||||
"exception"] if "exception" in kwargs else None
|
||||
custom_llm_provider = (kwargs["custom_llm_provider"]
|
||||
if "custom_llm_provider" in kwargs else
|
||||
None)
|
||||
safe_crash_reporting(
|
||||
model=model,
|
||||
exception=exception,
|
||||
|
@ -340,15 +376,12 @@ def client(original_function):
|
|||
def check_cache(*args, **kwargs):
|
||||
try: # never block execution
|
||||
prompt = get_prompt(*args, **kwargs)
|
||||
if (
|
||||
prompt != None and prompt in local_cache
|
||||
): # check if messages / prompt exists
|
||||
if (prompt != None and prompt
|
||||
in local_cache): # check if messages / prompt exists
|
||||
if litellm.caching_with_models:
|
||||
# if caching with model names is enabled, key is prompt + model name
|
||||
if (
|
||||
"model" in kwargs
|
||||
and kwargs["model"] in local_cache[prompt]["models"]
|
||||
):
|
||||
if ("model" in kwargs and kwargs["model"]
|
||||
in local_cache[prompt]["models"]):
|
||||
cache_key = prompt + kwargs["model"]
|
||||
return local_cache[cache_key]
|
||||
else: # caching only with prompts
|
||||
|
@ -363,10 +396,8 @@ def client(original_function):
|
|||
try: # never block execution
|
||||
prompt = get_prompt(*args, **kwargs)
|
||||
if litellm.caching_with_models: # caching with model + prompt
|
||||
if (
|
||||
"model" in kwargs
|
||||
and kwargs["model"] in local_cache[prompt]["models"]
|
||||
):
|
||||
if ("model" in kwargs
|
||||
and kwargs["model"] in local_cache[prompt]["models"]):
|
||||
cache_key = prompt + kwargs["model"]
|
||||
local_cache[cache_key] = result
|
||||
else: # caching based only on prompts
|
||||
|
@ -381,24 +412,24 @@ def client(original_function):
|
|||
function_setup(*args, **kwargs)
|
||||
litellm_call_id = str(uuid.uuid4())
|
||||
kwargs["litellm_call_id"] = litellm_call_id
|
||||
## [OPTIONAL] CHECK CACHE
|
||||
# [OPTIONAL] CHECK CACHE
|
||||
start_time = datetime.datetime.now()
|
||||
if (litellm.caching or litellm.caching_with_models) and (
|
||||
cached_result := check_cache(*args, **kwargs)
|
||||
) is not None:
|
||||
cached_result := check_cache(*args, **kwargs)) is not None:
|
||||
result = cached_result
|
||||
else:
|
||||
## MODEL CALL
|
||||
# MODEL CALL
|
||||
result = original_function(*args, **kwargs)
|
||||
end_time = datetime.datetime.now()
|
||||
## Add response to CACHE
|
||||
# Add response to CACHE
|
||||
if litellm.caching:
|
||||
add_cache(result, *args, **kwargs)
|
||||
## LOG SUCCESS
|
||||
# LOG SUCCESS
|
||||
crash_reporting(*args, **kwargs)
|
||||
my_thread = threading.Thread(
|
||||
target=handle_success, args=(args, kwargs, result, start_time, end_time)
|
||||
) # don't interrupt execution of main thread
|
||||
target=handle_success,
|
||||
args=(args, kwargs, result, start_time,
|
||||
end_time)) # don't interrupt execution of main thread
|
||||
my_thread.start()
|
||||
return result
|
||||
except Exception as e:
|
||||
|
@ -407,7 +438,8 @@ def client(original_function):
|
|||
end_time = datetime.datetime.now()
|
||||
my_thread = threading.Thread(
|
||||
target=handle_failure,
|
||||
args=(e, traceback_exception, start_time, end_time, args, kwargs),
|
||||
args=(e, traceback_exception, start_time, end_time, args,
|
||||
kwargs),
|
||||
) # don't interrupt execution of main thread
|
||||
my_thread.start()
|
||||
raise e
|
||||
|
@ -432,18 +464,18 @@ def token_counter(model, text):
|
|||
return num_tokens
|
||||
|
||||
|
||||
def cost_per_token(model="gpt-3.5-turbo", prompt_tokens=0, completion_tokens=0):
|
||||
## given
|
||||
def cost_per_token(model="gpt-3.5-turbo",
|
||||
prompt_tokens=0,
|
||||
completion_tokens=0):
|
||||
# given
|
||||
prompt_tokens_cost_usd_dollar = 0
|
||||
completion_tokens_cost_usd_dollar = 0
|
||||
model_cost_ref = litellm.model_cost
|
||||
if model in model_cost_ref:
|
||||
prompt_tokens_cost_usd_dollar = (
|
||||
model_cost_ref[model]["input_cost_per_token"] * prompt_tokens
|
||||
)
|
||||
model_cost_ref[model]["input_cost_per_token"] * prompt_tokens)
|
||||
completion_tokens_cost_usd_dollar = (
|
||||
model_cost_ref[model]["output_cost_per_token"] * completion_tokens
|
||||
)
|
||||
model_cost_ref[model]["output_cost_per_token"] * completion_tokens)
|
||||
return prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar
|
||||
else:
|
||||
# calculate average input cost
|
||||
|
@ -464,8 +496,9 @@ def completion_cost(model="gpt-3.5-turbo", prompt="", completion=""):
|
|||
prompt_tokens = token_counter(model=model, text=prompt)
|
||||
completion_tokens = token_counter(model=model, text=completion)
|
||||
prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = cost_per_token(
|
||||
model=model, prompt_tokens=prompt_tokens, completion_tokens=completion_tokens
|
||||
)
|
||||
model=model,
|
||||
prompt_tokens=prompt_tokens,
|
||||
completion_tokens=completion_tokens)
|
||||
return prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
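# For context, a hedged usage sketch of the two cost helpers above. This is not part
# of the original commit; the model name, token counts, and strings are examples, and
# the per-token prices come from litellm.model_cost.
def _example_cost_usage():
    prompt_usd, completion_usd = cost_per_token(model="gpt-3.5-turbo",
                                                prompt_tokens=16,
                                                completion_tokens=41)
    print(prompt_usd + completion_usd)

    # completion_cost tokenizes the raw strings itself via token_counter
    print(completion_cost(model="gpt-3.5-turbo",
                          prompt="Hello, whats the weather in San Francisco??",
                          completion="It is sunny in San Francisco."))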
|
||||
|
||||
|
||||
|
@ -557,9 +590,8 @@ def get_optional_params(
|
|||
optional_params["max_tokens"] = max_tokens
|
||||
if frequency_penalty != 0:
|
||||
optional_params["frequency_penalty"] = frequency_penalty
|
||||
elif (
|
||||
model == "chat-bison"
|
||||
): # chat-bison has diff args from chat-bison@001 ty Google
|
||||
elif (model == "chat-bison"
|
||||
): # chat-bison has diff args from chat-bison@001 ty Google
|
||||
if temperature != 1:
|
||||
optional_params["temperature"] = temperature
|
||||
if top_p != 1:
|
||||
|
@ -619,7 +651,10 @@ def load_test_model(
|
|||
test_prompt = prompt
|
||||
if num_calls:
|
||||
test_calls = num_calls
|
||||
messages = [[{"role": "user", "content": test_prompt}] for _ in range(test_calls)]
|
||||
messages = [[{
|
||||
"role": "user",
|
||||
"content": test_prompt
|
||||
}] for _ in range(test_calls)]
|
||||
start_time = time.time()
|
||||
try:
|
||||
litellm.batch_completion(
|
||||
|
@ -649,7 +684,7 @@ def load_test_model(
|
|||
|
||||
|
||||
def set_callbacks(callback_list):
|
||||
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
|
||||
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
|
||||
try:
|
||||
for callback in callback_list:
|
||||
print(f"callback: {callback}")
|
||||
|
@ -657,17 +692,15 @@ def set_callbacks(callback_list):
|
|||
try:
|
||||
import sentry_sdk
|
||||
except ImportError:
|
||||
print_verbose("Package 'sentry_sdk' is missing. Installing it...")
|
||||
print_verbose(
|
||||
"Package 'sentry_sdk' is missing. Installing it...")
|
||||
subprocess.check_call(
|
||||
[sys.executable, "-m", "pip", "install", "sentry_sdk"]
|
||||
)
|
||||
[sys.executable, "-m", "pip", "install", "sentry_sdk"])
|
||||
import sentry_sdk
|
||||
sentry_sdk_instance = sentry_sdk
|
||||
sentry_trace_rate = (
|
||||
os.environ.get("SENTRY_API_TRACE_RATE")
|
||||
if "SENTRY_API_TRACE_RATE" in os.environ
|
||||
else "1.0"
|
||||
)
|
||||
sentry_trace_rate = (os.environ.get("SENTRY_API_TRACE_RATE")
|
||||
if "SENTRY_API_TRACE_RATE" in os.environ
|
||||
else "1.0")
|
||||
sentry_sdk_instance.init(
|
||||
dsn=os.environ.get("SENTRY_API_URL"),
|
||||
traces_sample_rate=float(sentry_trace_rate),
|
||||
|
@ -678,10 +711,10 @@ def set_callbacks(callback_list):
|
|||
try:
|
||||
from posthog import Posthog
|
||||
except ImportError:
|
||||
print_verbose("Package 'posthog' is missing. Installing it...")
|
||||
print_verbose(
|
||||
"Package 'posthog' is missing. Installing it...")
|
||||
subprocess.check_call(
|
||||
[sys.executable, "-m", "pip", "install", "posthog"]
|
||||
)
|
||||
[sys.executable, "-m", "pip", "install", "posthog"])
|
||||
from posthog import Posthog
|
||||
posthog = Posthog(
|
||||
project_api_key=os.environ.get("POSTHOG_API_KEY"),
|
||||
|
@ -691,10 +724,10 @@ def set_callbacks(callback_list):
|
|||
try:
|
||||
from slack_bolt import App
|
||||
except ImportError:
|
||||
print_verbose("Package 'slack_bolt' is missing. Installing it...")
|
||||
print_verbose(
|
||||
"Package 'slack_bolt' is missing. Installing it...")
|
||||
subprocess.check_call(
|
||||
[sys.executable, "-m", "pip", "install", "slack_bolt"]
|
||||
)
|
||||
[sys.executable, "-m", "pip", "install", "slack_bolt"])
|
||||
from slack_bolt import App
|
||||
slack_app = App(
|
||||
token=os.environ.get("SLACK_API_TOKEN"),
|
||||
|
@ -704,6 +737,8 @@ def set_callbacks(callback_list):
|
|||
print_verbose(f"Initialized Slack App: {slack_app}")
|
||||
elif callback == "helicone":
|
||||
heliconeLogger = HeliconeLogger()
|
||||
elif callback == "llmonitor":
|
||||
llmonitorLogger = LLMonitorLogger()
|
||||
elif callback == "aispend":
|
||||
aispendLogger = AISpendLogger()
|
||||
elif callback == "berrispend":
|
||||
|
@ -718,7 +753,8 @@ def set_callbacks(callback_list):
|
|||
raise e
|
||||
|
||||
|
||||
def handle_failure(exception, traceback_exception, start_time, end_time, args, kwargs):
|
||||
def handle_failure(exception, traceback_exception, start_time, end_time, args,
|
||||
kwargs):
|
||||
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
|
||||
try:
|
||||
# print_verbose(f"handle_failure args: {args}")
|
||||
|
@ -728,8 +764,7 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
failure_handler = additional_details.pop("failure_handler", None)
|
||||
|
||||
additional_details["Event_Name"] = additional_details.pop(
|
||||
"failed_event_name", "litellm.failed_query"
|
||||
)
|
||||
"failed_event_name", "litellm.failed_query")
|
||||
print_verbose(f"self.failure_callback: {litellm.failure_callback}")
|
||||
|
||||
# print_verbose(f"additional_details: {additional_details}")
|
||||
|
@ -746,9 +781,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
for detail in additional_details:
|
||||
slack_msg += f"{detail}: {additional_details[detail]}\n"
|
||||
slack_msg += f"Traceback: {traceback_exception}"
|
||||
slack_app.client.chat_postMessage(
|
||||
channel=alerts_channel, text=slack_msg
|
||||
)
|
||||
slack_app.client.chat_postMessage(channel=alerts_channel,
|
||||
text=slack_msg)
|
||||
elif callback == "sentry":
|
||||
capture_exception(exception)
|
||||
elif callback == "posthog":
|
||||
|
@ -767,9 +801,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
print_verbose(f"ph_obj: {ph_obj}")
|
||||
print_verbose(f"PostHog Event Name: {event_name}")
|
||||
if "user_id" in additional_details:
|
||||
posthog.capture(
|
||||
additional_details["user_id"], event_name, ph_obj
|
||||
)
|
||||
posthog.capture(additional_details["user_id"],
|
||||
event_name, ph_obj)
|
||||
else: # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
|
||||
unique_id = str(uuid.uuid4())
|
||||
posthog.capture(unique_id, event_name)
|
||||
|
@ -783,10 +816,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"created": time.time(),
|
||||
"error": traceback_exception,
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
berrispendLogger.log_event(
|
||||
|
@ -805,10 +838,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"model": model,
|
||||
"created": time.time(),
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
aispendLogger.log_event(
|
||||
|
@ -818,6 +851,27 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
end_time=end_time,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "llmonitor":
|
||||
print_verbose("reaches llmonitor for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
messages = args[1] if len(args) > 1 else kwargs["messages"]
|
||||
usage = {
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
}
|
||||
llmonitorLogger.log_event(
|
||||
type="error",
|
||||
user_id=litellm._thread_context.user,
|
||||
model=model,
|
||||
error=traceback_exception,
|
||||
response_obj=None,  # no response object is available in the failure path
|
||||
run_id=kwargs["litellm_call_id"],
|
||||
timestamp=end_time,
|
||||
usage=usage,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "supabase":
|
||||
print_verbose("reaches supabase for logging!")
|
||||
print_verbose(f"supabaseClient: {supabaseClient}")
|
||||
|
@ -828,10 +882,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"created": time.time(),
|
||||
"error": traceback_exception,
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
supabaseClient.log_event(
|
||||
|
@ -854,10 +908,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"created": time.time(),
|
||||
"error": traceback_exception,
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
liteDebuggerClient.log_event(
|
||||
|
@ -884,19 +938,18 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
failure_handler(call_details)
|
||||
pass
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(logger_fn=user_logger_fn, exception=e)
|
||||
pass
|
||||
|
||||
|
||||
def handle_success(args, kwargs, result, start_time, end_time):
|
||||
global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient
|
||||
global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
|
||||
try:
|
||||
success_handler = additional_details.pop("success_handler", None)
|
||||
failure_handler = additional_details.pop("failure_handler", None)
|
||||
additional_details["Event_Name"] = additional_details.pop(
|
||||
"successful_event_name", "litellm.succes_query"
|
||||
)
|
||||
"successful_event_name", "litellm.succes_query")
|
||||
for callback in litellm.success_callback:
|
||||
try:
|
||||
if callback == "posthog":
|
||||
|
@ -905,9 +958,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
ph_obj[detail] = additional_details[detail]
|
||||
event_name = additional_details["Event_Name"]
|
||||
if "user_id" in additional_details:
|
||||
posthog.capture(
|
||||
additional_details["user_id"], event_name, ph_obj
|
||||
)
|
||||
posthog.capture(additional_details["user_id"],
|
||||
event_name, ph_obj)
|
||||
else: # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
|
||||
unique_id = str(uuid.uuid4())
|
||||
posthog.capture(unique_id, event_name, ph_obj)
|
||||
|
@ -916,9 +968,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
slack_msg = ""
|
||||
for detail in additional_details:
|
||||
slack_msg += f"{detail}: {additional_details[detail]}\n"
|
||||
slack_app.client.chat_postMessage(
|
||||
channel=alerts_channel, text=slack_msg
|
||||
)
|
||||
slack_app.client.chat_postMessage(channel=alerts_channel,
|
||||
text=slack_msg)
|
||||
elif callback == "helicone":
|
||||
print_verbose("reaches helicone for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
|
@ -931,6 +982,22 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
end_time=end_time,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "llmonitor":
|
||||
print_verbose("reaches llmonitor for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
messages = args[1] if len(args) > 1 else kwargs["messages"]
|
||||
usage = kwargs["usage"]
|
||||
llmonitorLogger.log_event(
|
||||
type="end",
|
||||
model=model,
|
||||
messages=messages,
|
||||
user_id=litellm._thread_context.user,
|
||||
response_obj=result,
|
||||
time=end_time,
|
||||
usage=usage,
|
||||
run_id=kwargs["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "aispend":
|
||||
print_verbose("reaches aispend for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
|
@ -984,7 +1051,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
print_verbose=print_verbose,
|
||||
)
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(logger_fn=user_logger_fn, exception=e)
|
||||
print_verbose(
|
||||
f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
|
||||
|
@ -995,7 +1062,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
success_handler(args, kwargs)
|
||||
pass
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(logger_fn=user_logger_fn, exception=e)
|
||||
print_verbose(
|
||||
f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
|
||||
|
@ -1046,33 +1113,36 @@ def exception_type(model, original_exception, custom_llm_provider):
|
|||
exception_type = ""
|
||||
if "claude" in model: # one of the anthropics
|
||||
if hasattr(original_exception, "status_code"):
|
||||
print_verbose(f"status_code: {original_exception.status_code}")
|
||||
print_verbose(
|
||||
f"status_code: {original_exception.status_code}")
|
||||
if original_exception.status_code == 401:
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif original_exception.status_code == 400:
|
||||
exception_mapping_worked = True
|
||||
raise InvalidRequestError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
model=model,
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif original_exception.status_code == 429:
|
||||
exception_mapping_worked = True
|
||||
raise RateLimitError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif (
|
||||
"Could not resolve authentication method. Expected either api_key or auth_token to be set."
|
||||
in error_str
|
||||
):
|
||||
elif ("Could not resolve authentication method. Expected either api_key or auth_token to be set."
|
||||
in error_str):
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif "replicate" in model:
|
||||
|
@ -1096,35 +1166,36 @@ def exception_type(model, original_exception, custom_llm_provider):
|
|||
llm_provider="replicate",
|
||||
)
|
||||
elif (
|
||||
exception_type == "ReplicateError"
|
||||
): ## ReplicateError implies an error on Replicate server side, not user side
|
||||
exception_type == "ReplicateError"
|
||||
): # ReplicateError implies an error on Replicate server side, not user side
|
||||
raise ServiceUnavailableError(
|
||||
message=f"ReplicateException - {error_str}",
|
||||
llm_provider="replicate",
|
||||
)
|
||||
elif model == "command-nightly": # Cohere
|
||||
if (
|
||||
"invalid api token" in error_str
|
||||
or "No API key provided." in error_str
|
||||
):
|
||||
if ("invalid api token" in error_str
|
||||
or "No API key provided." in error_str):
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"CohereException - {original_exception.message}",
|
||||
message=
|
||||
f"CohereException - {original_exception.message}",
|
||||
llm_provider="cohere",
|
||||
)
|
||||
elif "too many tokens" in error_str:
|
||||
exception_mapping_worked = True
|
||||
raise InvalidRequestError(
|
||||
message=f"CohereException - {original_exception.message}",
|
||||
message=
|
||||
f"CohereException - {original_exception.message}",
|
||||
model=model,
|
||||
llm_provider="cohere",
|
||||
)
|
||||
elif (
|
||||
"CohereConnectionError" in exception_type
|
||||
"CohereConnectionError" in exception_type
|
||||
): # cohere seems to fire these errors when we load test it (1k+ messages / min)
|
||||
exception_mapping_worked = True
|
||||
raise RateLimitError(
|
||||
message=f"CohereException - {original_exception.message}",
|
||||
message=
|
||||
f"CohereException - {original_exception.message}",
|
||||
llm_provider="cohere",
|
||||
)
|
||||
elif custom_llm_provider == "huggingface":
|
||||
|
@ -1132,27 +1203,30 @@ def exception_type(model, original_exception, custom_llm_provider):
|
|||
if original_exception.status_code == 401:
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"HuggingfaceException - {original_exception.message}",
|
||||
message=
|
||||
f"HuggingfaceException - {original_exception.message}",
|
||||
llm_provider="huggingface",
|
||||
)
|
||||
elif original_exception.status_code == 400:
|
||||
exception_mapping_worked = True
|
||||
raise InvalidRequestError(
|
||||
message=f"HuggingfaceException - {original_exception.message}",
|
||||
message=
|
||||
f"HuggingfaceException - {original_exception.message}",
|
||||
model=model,
|
||||
llm_provider="huggingface",
|
||||
)
|
||||
elif original_exception.status_code == 429:
|
||||
exception_mapping_worked = True
|
||||
raise RateLimitError(
|
||||
message=f"HuggingfaceException - {original_exception.message}",
|
||||
message=
|
||||
f"HuggingfaceException - {original_exception.message}",
|
||||
llm_provider="huggingface",
|
||||
)
|
||||
raise original_exception # base case - return the original exception
|
||||
else:
|
||||
raise original_exception
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(
|
||||
logger_fn=user_logger_fn,
|
||||
additional_args={
|
||||
|
@ -1173,7 +1247,7 @@ def safe_crash_reporting(model=None, exception=None, custom_llm_provider=None):
|
|||
"exception": str(exception),
|
||||
"custom_llm_provider": custom_llm_provider,
|
||||
}
|
||||
threading.Thread(target=litellm_telemetry, args=(data,)).start()
|
||||
threading.Thread(target=litellm_telemetry, args=(data, )).start()
|
||||
|
||||
|
||||
def litellm_telemetry(data):
|
||||
|
@ -1223,11 +1297,13 @@ def get_secret(secret_name):
|
|||
if litellm.secret_manager_client != None:
|
||||
# TODO: check which secret manager is being used
|
||||
# currently only supports Infisical
|
||||
secret = litellm.secret_manager_client.get_secret(secret_name).secret_value
|
||||
secret = litellm.secret_manager_client.get_secret(
|
||||
secret_name).secret_value
|
||||
if secret != None:
|
||||
return secret # if secret found in secret manager return it
|
||||
else:
|
||||
raise ValueError(f"Secret '{secret_name}' not found in secret manager")
|
||||
raise ValueError(
|
||||
f"Secret '{secret_name}' not found in secret manager")
|
||||
elif litellm.api_key != None: # if users use litellm default key
|
||||
return litellm.api_key
|
||||
else:
|
||||
|
@ -1238,6 +1314,7 @@ def get_secret(secret_name):
|
|||
# wraps the completion stream to return the correct format for the model
|
||||
# replicate/anthropic/cohere
|
||||
class CustomStreamWrapper:
|
||||
|
||||
def __init__(self, completion_stream, model, custom_llm_provider=None):
|
||||
self.model = model
|
||||
self.custom_llm_provider = custom_llm_provider
|
||||
|
@ -1288,7 +1365,8 @@ class CustomStreamWrapper:
|
|||
elif self.model == "replicate":
|
||||
chunk = next(self.completion_stream)
|
||||
completion_obj["content"] = chunk
|
||||
elif (self.model == "together_ai") or ("togethercomputer" in self.model):
|
||||
elif (self.model == "together_ai") or ("togethercomputer"
|
||||
in self.model):
|
||||
chunk = next(self.completion_stream)
|
||||
text_data = self.handle_together_ai_chunk(chunk)
|
||||
if text_data == "":
|
||||
|
@ -1321,12 +1399,11 @@ def read_config_args(config_path):
|
|||
|
||||
|
||||
########## ollama implementation ############################
|
||||
import aiohttp
|
||||
|
||||
|
||||
async def get_ollama_response_stream(
|
||||
api_base="http://localhost:11434", model="llama2", prompt="Why is the sky blue?"
|
||||
):
|
||||
async def get_ollama_response_stream(api_base="http://localhost:11434",
|
||||
model="llama2",
|
||||
prompt="Why is the sky blue?"):
|
||||
session = aiohttp.ClientSession()
|
||||
url = f"{api_base}/api/generate"
|
||||
data = {
|
||||
|
@ -1349,7 +1426,11 @@ async def get_ollama_response_stream(
|
|||
"content": "",
|
||||
}
|
||||
completion_obj["content"] = j["response"]
|
||||
yield {"choices": [{"delta": completion_obj}]}
|
||||
yield {
|
||||
"choices": [{
|
||||
"delta": completion_obj
|
||||
}]
|
||||
}
|
||||
# self.responses.append(j["response"])
|
||||
# yield "blank"
|
||||
except Exception as e:
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
|
||||
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
|
||||
|
||||
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
|
||||
|
||||
[](https://pypi.org/project/litellm/)
|
||||
[](https://pypi.org/project/litellm/0.1.1/)
|
||||

|
||||
|
@ -11,34 +12,36 @@
|
|||

|
||||
|
||||
## What does liteLLM proxy do
|
||||
|
||||
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**
|
||||
|
||||
|
||||
Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']`
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']`
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone`, `LLMonitor` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
|
||||
|
||||
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
|
||||
- **Caching** - Implementation of Semantic Caching
|
||||
- **Streaming & Async Support** - Return generators to stream text responses
|
||||
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### `/chat/completions` (POST)
|
||||
|
@ -46,34 +49,37 @@
|
|||
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.
|
||||
|
||||
#### Input
|
||||
|
||||
This API endpoint accepts all inputs in raw JSON and expects the following inputs
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
|
||||
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
|
||||
|
||||
|
||||
#### Example JSON body
|
||||
|
||||
For claude-2
|
||||
|
||||
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
|
||||
### Making an API request to the Proxy Server
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
# TODO: use your URL
|
||||
# TODO: use your URL
|
||||
url = "http://localhost:5000/chat/completions"
|
||||
|
||||
payload = json.dumps({
|
||||
|
@ -94,34 +100,38 @@ print(response.text)
|
|||
```
|
||||
|
||||
### Output [Response Format]
|
||||
Responses from the server are given in the following format.
|
||||
|
||||
Responses from the server are given in the following format.
|
||||
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
|
||||
|
||||
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
|
||||
|
||||
## Installation & Usage
|
||||
|
||||
### Running Locally
|
||||
|
||||
1. Clone liteLLM repository to your local machine:
|
||||
```
|
||||
git clone https://github.com/BerriAI/liteLLM-proxy
|
||||
|
@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
|
|||
python main.py
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Deploying
|
||||
|
||||
1. Quick Start: Deploy on Railway
|
||||
|
||||
[](https://railway.app/template/DYqQAW?referralCode=t3ukrU)
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
# Support / Talk with founders
|
||||
|
||||
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
|
||||
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
|
||||
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
|
||||
|
||||
## Roadmap
|
||||
|
||||
- [ ] Support hosted db (e.g. Supabase)
|
||||
- [ ] Easily send data to places like posthog and sentry.
|
||||
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits
|
||||
|
|