# Caching
liteLLM implements exact-match caching. It can be enabled by setting either of the following flags (a conceptual sketch follows the list):

- `litellm.caching`: when set to `True`, enables caching for all responses. Keys are the input `messages`, and the values stored in the cache are the corresponding `response`s.
- `litellm.caching_with_models`: when set to `True`, enables caching on a per-model basis. Keys are the input `messages + model`, and the values stored in the cache are the corresponding `response`s.
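Below is a minimal sketch of the exact-match idea, not litellm's actual implementation; `cached_completion` and `call_llm_api` are hypothetical names used only for illustration. The cache is a dictionary whose key is the serialized `messages`, with the `model` name added to the key when per-model caching is on.

```python
import json

cache = {}

def call_llm_api(model, messages):
    # Stand-in for the real network call; returns a fake response for illustration.
    return {"model": model, "choices": [{"message": {"content": "..."}}]}

def cached_completion(model, messages, caching_with_models=False):
    # Hypothetical helper; litellm applies this logic inside completion().
    key = json.dumps(messages, sort_keys=True)
    if caching_with_models:
        key = model + key        # per-model caching: model is part of the key
    if key in cache:
        return cache[key]        # exact match -> serve the stored response
    response = call_llm_api(model, messages)
    cache[key] = response        # store the response under the exact-match key
    return response
```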
## Usage
### Caching - `cache`

Keys in the cache are the input `messages`, so the following example leads to a cache hit:
```python
import litellm
from litellm import completion

litellm.caching = True

# Make completion calls
response1 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
response2 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
# response1 == response2; the first response was cached and the second is served from the cache

# With a different model
response3 = completion(model="command-nightly", messages=[{"role": "user", "content": "Tell me a joke."}])
# response3 == response1 == response2, since the cache key is only the messages, not the model
```
### Caching with Models - `caching_with_models`

Keys in the cache are the input `messages + model`, so sending the same messages to a different model will not lead to a cache hit:
```python
import litellm
from litellm import completion

litellm.caching_with_models = True

# Make completion calls
response1 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
response2 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
# response1 == response2; the first response was cached and the second is served from the cache

# With a different model, this calls the API since messages + model is not yet cached
response3 = completion(model="command-nightly", messages=[{"role": "user", "content": "Tell me a joke."}])
# response3 != response1, since the cache key is messages + model
```
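A quick way to sanity-check that caching is active is to time a repeated call. This is only a sketch: it assumes an API key for the provider is already configured and that a cached response returns almost instantly.

```python
import time
import litellm
from litellm import completion

litellm.caching_with_models = True
messages = [{"role": "user", "content": "Tell me a joke."}]

start = time.time()
completion(model="gpt-3.5-turbo", messages=messages)  # first call goes to the API
first_call = time.time() - start

start = time.time()
completion(model="gpt-3.5-turbo", messages=messages)  # exact match: served from the cache
cached_call = time.time() - start

print(f"API call: {first_call:.2f}s, cached call: {cached_call:.4f}s")
```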