add evaluation pipeline

Commit dcf9372943 (parent 502dfa4331): 4 changed files with 94 additions and 13 deletions.
import Image from '@theme/IdealImage';

# Weights & Biases Weave - Tracing and Evaluation
## What is W&B Weave?

Weights and Biases (W&B) Weave is a framework for tracking, experimenting with, evaluating, deploying, and improving LLM-based applications. Designed for flexibility and scalability, Weave supports every stage of your LLM application development workflow.
W&B Weave's integration with LiteLLM enables you to trace, version control, and debug your LLM applications, and to easily evaluate your AI systems with the flexibility of LiteLLM.

Get started with just two lines of code and track your LiteLLM calls with W&B Weave. Learn more about W&B Weave [here](https://weave-docs.wandb.ai).
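Concretely, those two lines are just importing and initializing Weave before making LiteLLM calls. This is a minimal sketch; the project name is a placeholder of your choosing:

```python
import weave

weave.init("my-litellm-project")  # placeholder project name
```

After `weave.init()`, LiteLLM calls made in the process are captured automatically.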
<Image img={require('../../img/weave_litellm.png')} />
With the W&B Weave integration, you can:

- Inspect the inputs and outputs of calls made to different LLM vendors/models through LiteLLM
- See the cost, token usage, and latency of each call
- Give human feedback using emojis and notes
- Debug your LLM applications by examining traces
- Compare different runs and models
- And more!
## Quick Start

Install W&B Weave:
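The install command and quickstart snippet are elided in this hunk; a minimal sketch of the usual setup, assuming `weave` and `litellm` are installed from PyPI, would be:

```python
# Assumes `pip install weave litellm` has been run; the project name is a placeholder.
import weave
import litellm

weave.init("my-litellm-project")

# Any LiteLLM call made after weave.init() is traced automatically.
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello!"}],
)
print(response.choices[0].message.content)
```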
You will get a Weave URL in the stdout. Open it up to see the trace, cost, token usage, and other details.

Now let's use LiteLLM and W&B Weave to build a simple LLM application that translates text from a source language to a target language.
The function `translate` takes in a text and a target language, and returns the translated text using the model of your choice. Note that the `translate` function is decorated with [`weave.op()`](https://weave-docs.wandb.ai/guides/tracking/ops). This is how W&B Weave knows that the function is part of your application; each call is traced along with its inputs and outputs.

Since the underlying LiteLLM calls are automatically traced, you can see a nested trace of the LiteLLM call(s) made, with details like the model, cost, token usage, etc.
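The body of `translate` is elided in this hunk; the following is a minimal sketch consistent with the call shown below. The prompt wording is an assumption, not the original:

```python
import weave
import litellm

@weave.op()  # Weave traces each call to this function with its inputs and outputs.
def translate(text: str, target_language: str, model: str) -> str:
    response = litellm.completion(
        model=model,
        messages=[
            # Illustrative prompt; the original doc's wording may differ.
            {"role": "system", "content": f"Translate the user's text to {target_language}. Return only the translation."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```

Calling it, as in the `print` statement below, produces a trace with the nested LiteLLM call.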
```python
print(translate("Hello, how are you?", "French", "gpt-4o"))
```

<Image img={require('../../img/weave_trace_application.png')} />
## Building an evaluation pipeline

LiteLLM is powerful for building evaluation pipelines because of the flexibility it provides. Together with W&B Weave, building such pipelines is straightforward.

Below we build an evaluation pipeline to evaluate an LLM's ability to solve maths problems. We first need an evaluation dataset.
```python
samples = [
    {"question": "What is the sum of 45 and 67?", "answer": "112"},
    {"question": "If a triangle has sides 3 cm, 4 cm, and 5 cm, what is its area?", "answer": "6 square cm"},
    {"question": "What is the derivative of x^2 + 3x with respect to x?", "answer": "2x + 3"},
    {"question": "What is the result of 12 multiplied by 8?", "answer": "96"},
    {"question": "What is the value of 10! (10 factorial)?", "answer": "3628800"}
]
```
Next we write a simple function that takes in a sample question and returns the solution to the problem. We will write this function as a method (`predict`) of our `SimpleMathsSolver` class, which inherits from the [`weave.Model`](https://weave-docs.wandb.ai/guides/core-types/models) class. This lets us easily track the attributes (hyperparameters) of our model.
```python
import litellm
import weave

class SimpleMathsSolver(weave.Model):
    model_name: str
    temperature: float

    @weave.op()
    def predict(self, question: str) -> str:
        response = litellm.completion(
            model=self.model_name,
            # Pass the tracked temperature hyperparameter through to LiteLLM.
            temperature=self.temperature,
            messages=[
                {
                    "role": "system",
                    "content": "You are given maths problems. Think step by step to solve it. Only return the exact answer without any explanation in \\boxed{}"
                },
                {
                    "role": "user",
                    "content": question
                }
            ],
        )
        return response.choices[0].message.content

maths_solver = SimpleMathsSolver(
    model_name="gpt-4o",
    temperature=0.0,
)

print(maths_solver.predict("What is 2+3?"))
```
<Image img={require('../../img/weave_maths_solver.png')} />

Now that we have the dataset and the model, let's define a simple exact-match evaluation metric and set up our evaluation pipeline using [`weave.Evaluation`](https://weave-docs.wandb.ai/guides/core-types/evaluations).
```python
import asyncio
import re

@weave.op()
def exact_match(answer: str, output: str):
    # Extract the model's final answer from the \boxed{...} wrapper.
    pattern = r"\\boxed\{(.+?)\}"
    match = re.search(pattern, output)

    if match:
        extracted_value = match.group(1)
        is_correct = extracted_value == answer
        return is_correct
    else:
        return None

evaluation_pipeline = weave.Evaluation(
    dataset=samples, scorers=[exact_match]
)

asyncio.run(evaluation_pipeline.evaluate(maths_solver))
```
The evaluation page will look like the one below. Here you can see the overall score as well as the score for each sample. This is a powerful way to debug the limitations of your LLM application while keeping track of everything that matters in an organized way.

<Image img={require('../../img/weave_evaluation.png')} />
Now say you want to compare the performance of your current model against a different model using the comparison feature in the UI. LiteLLM's flexibility lets you swap models easily, and the W&B Weave evaluation pipeline helps you do this in a structured way.
```python
new_maths_solver = SimpleMathsSolver(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
)

asyncio.run(evaluation_pipeline.evaluate(new_maths_solver))
```

<Image img={require('../../img/weave_comparison_view.png')} />
## Support

* For advanced usage of Weave, visit the [Weave documentation](https://weave-docs.wandb.ai).
* For any questions or issues with this integration, please [submit an issue](https://github.com/wandb/weave/issues/new?template=Blank+issue) on our [GitHub](https://github.com/wandb/weave) repository!
New binary files added:

- docs/my-website/img/weave_comparison_view.png (507 KiB)
- docs/my-website/img/weave_evaluation.png (552 KiB)
- docs/my-website/img/weave_maths_solver.png (154 KiB)