
Caching

Cache LLM Responses

Caching can be enabled by adding the cache key to the config.yaml.

Step 1: Add cache to the config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo

litellm_settings:
  set_verbose: True
  cache: True          # set cache responses to True, litellm defaults to using a redis cache

Step 2: Add Redis Credentials to .env

Set either REDIS_URL, or REDIS_HOST together with REDIS_PORT and REDIS_PASSWORD, in your OS environment to enable caching.

REDIS_URL = ""        # REDIS_URL='redis://username:password@hostname:port/database'
## OR ## 
REDIS_HOST = ""       # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
REDIS_PORT = ""       # REDIS_PORT='18841'
REDIS_PASSWORD = ""   # REDIS_PASSWORD='liteLlmIsAmazing'

Additional kwargs
You can pass any additional redis.Redis argument by storing the variable and its value in your OS environment, like this:

REDIS_<redis-kwarg-name> = ""
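
For example, to connect over TLS or select a specific database, you could set the following (a sketch, assuming litellm maps these names onto redis.Redis keyword arguments as described above; ssl and db are standard redis.Redis parameters):

REDIS_SSL = "True"    # maps to redis.Redis(ssl=True)
REDIS_DB = "1"        # maps to redis.Redis(db=1)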

See how it's read from the environment

Step 3: Run proxy with config

$ litellm --config /path/to/config.yaml

Using Caching

Send the same request twice:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "write a poem about litellm!"}],
     "temperature": 0.7
   }'

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "write a poem about litellm!"}],
     "temperature": 0.7
   }'
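
The same works from any OpenAI-compatible client. Here is a minimal sketch with the openai Python package (assuming the proxy is running on port 8000 as above; the api_key value is a placeholder, since the proxy holds the provider credentials):

from openai import OpenAI

# Point the OpenAI client at the litellm proxy
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="anything")

for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "write a poem about litellm!"}],
        temperature=0.7,
    )
    print(response.choices[0].message.content)

# The second iteration should be served from the Redis cache.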

Control caching per completion request

Caching can be switched on/off per /chat/completions request (a Python equivalent is sketched after the examples below):

  • Caching on for completion - pass caching=True:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
       "model": "gpt-3.5-turbo",
       "messages": [{"role": "user", "content": "write a poem about litellm!"}],
       "temperature": 0.7,
       "caching": true
     }'
    
  • Caching off for completion - pass caching=False:
    curl http://0.0.0.0:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
       "model": "gpt-3.5-turbo",
       "messages": [{"role": "user", "content": "write a poem about litellm!"}],
       "temperature": 0.7,
       "caching": false
     }'
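
With the openai Python client, the same per-request flag can be passed through extra_body (a sketch, assuming the proxy reads the caching field from the request body exactly as in the curl examples above):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="anything")

# Force a fresh (non-cached) completion for this one request
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "write a poem about litellm!"}],
    temperature=0.7,
    extra_body={"caching": False},  # set to True to opt back in
)
print(response.choices[0].message.content)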