# Caching

liteLLM implements exact match caching. It can be enabled by setting either of the following flags:

  1. `litellm.caching`: When set to `True`, enables caching for all responses. Keys are the input messages, and the values stored in the cache are the corresponding responses.

  2. `litellm.caching_with_models`: When set to `True`, enables caching on a per-model basis. Keys are the input messages + model, and the values stored in the cache are the corresponding responses.

## Usage

  1. Caching - `litellm.caching`: Keys in the cache are the input messages only, so the following example will lead to a cache hit
```python
import litellm
from litellm import completion

litellm.caching = True

# Make completion calls
response1 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
response2 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])

# response1 == response2, response2 is served from the cache

# with a different model but the same messages
response3 = completion(model="command-nightly", messages=[{"role": "user", "content": "Tell me a joke."}])

# response3 == response1 == response2, since cache keys are the messages only
```
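Because this is exact match caching, any change to the messages produces a new key and therefore a cache miss. A minimal sketch (the prompt text is just an illustration):

```python
import litellm
from litellm import completion

litellm.caching = True

# Different messages -> different cache key -> this call goes to the API
response4 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a different joke."}])

# response4 != response1, since the messages are not an exact match
```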
  2. Caching with Models - `litellm.caching_with_models`: Keys in the cache are the input messages + model, so the following example will not lead to a cache hit for the third call
```python
import litellm
from litellm import completion

litellm.caching_with_models = True

# Make completion calls
response1 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
response2 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
# response1 == response2, response2 is served from the cache

# with a different model, this will call the API since the (messages + model) key is not cached
response3 = completion(model="command-nightly", messages=[{"role": "user", "content": "Tell me a joke."}])

# response3 != response1, since cache keys are messages + model
```
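One simple way to sanity-check that caching is active is to time repeated identical calls: the cached call should return almost instantly. The sketch below assumes `litellm.caching = True` and uses Python's standard `time` module; the model name and prompt are placeholders.

```python
import time
import litellm
from litellm import completion

litellm.caching = True
messages = [{"role": "user", "content": "Tell me a joke."}]

start = time.time()
completion(model="gpt-3.5-turbo", messages=messages)  # first call hits the API
print(f"first call:  {time.time() - start:.2f}s")

start = time.time()
completion(model="gpt-3.5-turbo", messages=messages)  # served from the exact match cache
print(f"second call: {time.time() - start:.2f}s")
```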