
import Image from '@theme/IdealImage';

# LiteLLM Proxy - 1K RPS Load test on locust

Tutorial on how to get to 1K+ RPS with LiteLLM Proxy on locust

## Pre-Testing Checklist

## Load Test - Fake OpenAI Endpoint

### Expected Performance

| Metric | Value |
|--------|-------|
| Requests per Second | 1174+ |
| Median Response Time | 96ms |
| Average Response Time | 142.18ms |

### Run Test

1. Add `fake-openai-endpoint` to your proxy `config.yaml` and start your LiteLLM Proxy. LiteLLM provides a hosted `fake-openai-endpoint` you can load test against. (A single-request sanity check against this setup is sketched after this list.)

```yaml
model_list:
  - model_name: fake-openai-endpoint
    litellm_params:
      model: openai/fake
      api_key: fake-key
      api_base: https://exampleopenaiendpoint-production.up.railway.app/

litellm_settings:
  callbacks: ["prometheus"] # Enterprise LiteLLM Only - use prometheus to get metrics on your load test
```

2. `pip install locust`

3. Create a file called `locustfile.py` on your local machine. Copy the contents from the [Locust file used for testing](#locust-file-used-for-testing) section below.

4. Start locust. Run locust in the same directory as your `locustfile.py` from step 3:

```shell
locust -f locustfile.py --processes 4
```
5. Run the load test on locust

   Head to the locust UI on http://0.0.0.0:8089

   Set `Users=1000`, `Ramp Up Users=1000`, and `Host` to the base URL of your LiteLLM Proxy

6. Expected results

<Image img={require('../img/locust_load_test1.png')} />
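Before ramping up to 1,000 users, it can help to confirm the proxy actually serves the fake endpoint. Here is a minimal sanity-check sketch, assuming the proxy runs locally on its default port `4000` and uses the same `sk-1234` test key that the locustfile below defaults to:

```python
import requests  # pip install requests

# Assumed values -- adjust to your deployment. 4000 is the proxy's default port,
# and sk-1234 matches the locustfile's default API_KEY below.
PROXY_BASE = "http://0.0.0.0:4000"
API_KEY = "sk-1234"

response = requests.post(
    f"{PROXY_BASE}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "fake-openai-endpoint",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=10,
)
response.raise_for_status()  # fails loudly if the proxy or endpoint is misconfigured
print(response.json())
```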

## Load test - Endpoints with Rate Limits

Run a load test on 2 LLM deployments, each with a 10K RPM quota. Expect to see ~20K RPM of successful responses.

### Expected Performance

- We expect to see 20,000+ successful responses in 1 minute
- The remaining requests fail because the endpoint exceeds its 10K RPM quota limit - from the LLM API provider

| Metric | Value |
|--------|-------|
| Successful Responses in 1 minute | 20,000+ |
| Requests per Second | ~1170+ |
| Median Response Time | 70ms |
| Average Response Time | 640.18ms |
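The 20K figure is simply the sum of the per-deployment quotas; a quick back-of-the-envelope check in Python:

```python
# Two deployments, each with a 10K RPM quota from the LLM API provider.
deployments = 2
rpm_quota_per_deployment = 10_000

expected_successes_per_minute = deployments * rpm_quota_per_deployment
print(expected_successes_per_minute)  # 20000 -- requests beyond this get rate limited
```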

### Run Test

1. Add 2 `gemini-vision` deployments on your `config.yaml`. Each deployment can handle 10K RPM. (We set up a fake endpoint with a rate limit of 1000 RPM on the `/v1/projects/bad-adroit-crow` route below.)

:::info

All requests with `model="gemini-vision"` will be load balanced equally across the 2 deployments. (A small script to verify the split is sketched after this list.)

:::

```yaml
model_list:
  - model_name: gemini-vision
    litellm_params:
      model: vertex_ai/gemini-1.0-pro-vision-001
      api_base: https://exampleopenaiendpoint-production.up.railway.app/v1/projects/bad-adroit-crow-413218/locations/us-central1/publishers/google/models/gemini-1.0-pro-vision-001
      vertex_project: "adroit-crow-413218"
      vertex_location: "us-central1"
      vertex_credentials: /etc/secrets/adroit_crow.json
  - model_name: gemini-vision
    litellm_params:
      model: vertex_ai/gemini-1.0-pro-vision-001
      api_base: https://exampleopenaiendpoint-production-c715.up.railway.app/v1/projects/bad-adroit-crow-413218/locations/us-central1/publishers/google/models/gemini-1.0-pro-vision-001
      vertex_project: "adroit-crow-413218"
      vertex_location: "us-central1"
      vertex_credentials: /etc/secrets/adroit_crow.json

litellm_settings:
  callbacks: ["prometheus"] # Enterprise LiteLLM Only - use prometheus to get metrics on your load test
```

2. `pip install locust`

3. Create a file called `locustfile.py` on your local machine. Copy the contents from the [Locust file used for testing](#locust-file-used-for-testing) section below.

4. Start locust. Run locust in the same directory as your `locustfile.py` from step 3, with `-t 60` so the test stops after 60 seconds:

```shell
locust -f locustfile.py --processes 4 -t 60
```
5. Run the load test on locust

   Head to the locust UI on http://0.0.0.0:8089 and use the following settings:

<Image img={require('../img/locust_load_test2_setup.png')} />

6. Expected results
   - Successful responses in 1 minute = 19,800 = (69415 - 49615)
   - Requests per second = ~1170
   - Median response time = 70ms
   - Average response time = 640ms

<Image img={require('../img/locust_load_test2.png')} />
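To confirm traffic really is split evenly across the two deployments, one option is to tally which upstream served each request. A minimal sketch, assuming your proxy version exposes the `x-litellm-model-api-base` response header (which LiteLLM uses to report the deployment that handled a request) and the same base URL/key as before:

```python
from collections import Counter

import requests  # pip install requests

PROXY_BASE = "http://0.0.0.0:4000"  # assumed proxy URL -- adjust to your deployment
API_KEY = "sk-1234"

counts = Counter()
for _ in range(20):
    response = requests.post(
        f"{PROXY_BASE}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gemini-vision",
            "messages": [{"role": "user", "content": "which deployment am I on?"}],
        },
        timeout=30,
    )
    # The header reports the api_base of the deployment that served this request
    counts[response.headers.get("x-litellm-model-api-base", "unknown")] += 1

print(counts)  # expect a roughly 50/50 split across the two api_base values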

## Prometheus Metrics for debugging load tests

Use the following Prometheus metrics to debug your load tests / failures:

| Metric Name | Description |
|-------------|-------------|
| `litellm_deployment_failure_responses` | Total number of failed LLM API calls for a specific LLM deployment. Labels: `"requested_model"`, `"litellm_model_name"`, `"model_id"`, `"api_base"`, `"api_provider"`, `"hashed_api_key"`, `"api_key_alias"`, `"team"`, `"team_alias"`, `"exception_status"`, `"exception_class"` |
| `litellm_deployment_cooled_down` | Number of times a deployment has been cooled down by LiteLLM load balancing logic. Labels: `"litellm_model_name"`, `"model_id"`, `"api_base"`, `"api_provider"`, `"exception_status"` |
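With the `prometheus` callback enabled as in the configs above, the proxy serves these counters on its `/metrics` endpoint. A minimal sketch for pulling just the two metrics above out of a scrape, assuming the proxy runs on `http://0.0.0.0:4000`:

```python
import requests  # pip install requests

PROXY_BASE = "http://0.0.0.0:4000"  # assumed proxy URL -- adjust to your deployment

WATCHED = ("litellm_deployment_failure_responses", "litellm_deployment_cooled_down")

# /metrics serves the standard Prometheus text exposition format, one sample per line.
# Add an Authorization header here if your deployment protects this endpoint.
metrics_text = requests.get(f"{PROXY_BASE}/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if line.startswith(WATCHED):
        print(line)
```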

## Machine Specifications for Running Locust

| Metric | Value |
|--------|-------|
| `locust --processes 4` | 4 |
| vCPUs on Load Testing Machine | 2.0 vCPUs |
| Memory on Load Testing Machine | 450 MB |
| Replicas of Load Testing Machine | 1 |

## Machine Specifications for Running LiteLLM Proxy

👉 **Number of Replicas of LiteLLM Proxy = 20** for getting 1K+ RPS

| Service | Spec | CPUs | Memory | Architecture | Version |
|---------|------|------|--------|--------------|---------|
| Server | `t2.large` | 2 vCPUs | 8GB | x86 | |

## Locust file used for testing

```python
import os
import uuid

from locust import HttpUser, task, between


class MyUser(HttpUser):
    wait_time = between(0.5, 1)  # Random wait time between requests

    @task(100)
    def litellm_completion(self):
        # no cache hits with this
        payload = {
            "model": "fake-openai-endpoint",
            "messages": [{"role": "user", "content": f"{uuid.uuid4()} This is a test there will be no cache hits and we'll fill up the context" * 150}],
            "user": "my-new-end-user-1",
        }
        response = self.client.post("chat/completions", json=payload)
        if response.status_code != 200:
            # log the errors in error.txt
            with open("error.txt", "a") as error_log:
                error_log.write(response.text + "\n")

    def on_start(self):
        self.api_key = os.getenv("API_KEY", "sk-1234")
        self.client.headers.update({"Authorization": f"Bearer {self.api_key}"})
```
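Since this locustfile appends every failing response body to `error.txt`, a quick way to see what actually failed after a run is to bucket those lines. A minimal sketch, assuming each error was written as a single line of JSON (as LiteLLM's error responses usually are):

```python
import json
from collections import Counter

counts = Counter()
with open("error.txt") as error_log:
    for line in error_log:
        line = line.strip()
        if not line:
            continue
        try:
            body = json.loads(line)
        except json.JSONDecodeError:
            counts[line[:80]] += 1  # not JSON -- bucket on the raw text
            continue
        # LiteLLM error bodies usually nest details under an "error" key,
        # but fall back gracefully if the shape differs
        detail = body.get("error", body) if isinstance(body, dict) else body
        message = detail.get("message", str(detail)) if isinstance(detail, dict) else str(detail)
        counts[message[:80]] += 1  # truncate so similar errors bucket together

for message, count in counts.most_common(10):
    print(f"{count:6d}  {message}")
```

Running this after the rate-limit test should show the bulk of failures grouped under the provider's rate-limit error message.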