Commit graph

124 commits

Author SHA1 Message Date
Dinesh Yeduguru
516e1a3e59
add embedding model by default to distribution templates (#617)
# What does this PR do?
Adds the sentence transformer provider and the `all-MiniLM-L6-v2`
embedding model to the default models registered in the run.yaml for
all providers.

## Test Plan
```
llama stack build --template together --image-type conda
llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml
```
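
For illustration, a minimal client-side sketch of exercising the default embedding model; the method name and response shape are assumptions based on the client SDK of this era, not something this PR specifies:

```python
from llama_stack_client import LlamaStackClient

# hedged sketch: `inference.embeddings` and its fields are assumed names
client = LlamaStackClient(base_url="http://localhost:5000")
response = client.inference.embeddings(
    model_id="all-MiniLM-L6-v2",
    contents=["What is the capital of France?"],
)
print(len(response.embeddings[0]))  # all-MiniLM-L6-v2 produces 384-dim vectors
```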
2024-12-13 12:48:00 -08:00
Botao Chen
aeb76390fc
[1/n] torchtune <> llama-stack integration skeleton (#540)
### Context 
This is the 1st of series PRs that integrate torchtune with llama-stack
as meta reference post-training implementation. For MVP, we will focus
on single device LoRA SFT.

Though this PR is still WIP, we want to get early feedback on the high
level design of this skeleton while still working on several details

### Scope
To limit the scope of this PR, we focus on the skeleton of the
implementation.

**What are included?**
- refine the post-training SFT APIs
- skeleton of the supervised_fine_tune implementation. We verified that we
can call the supervised_fine_tune API successfully from the llama stack
client SDK (client side PR:
https://github.com/meta-llama/llama-stack-client-python/pull/51); a sketch
of the call appears after this list
- a very basic single-device LoRA training recipe based on torchtune
core components
- parity check with the torchtune library and a post-training API unit test
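
A hypothetical sketch of the client-side call; the parameter names below are illustrative assumptions, and the actual surface is defined in the linked client PR:

```python
from llama_stack_client import LlamaStackClient

# hedged sketch: argument names are assumed, not the real signature
client = LlamaStackClient(base_url="http://localhost:5000")
job = client.post_training.supervised_fine_tune(
    job_uuid="sft-lora-001",
    model="Llama3.2-3B-Instruct",
    # LoRA / optimizer / dataset configs would go here per the post-training API
)
print(job)
```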

**What is not included?**
- implementation of the remaining job management and training-artifact
retrieval APIs (separate PR)
- refactoring the meta reference inference logic to support eval on a
finetuned model (separate PR)
- several pieces of necessary functionality in the training recipe, such as
logging, validation, etc. (separate PR)
- interop with telemetry for tracing and metrics logging; currently we
temporarily log to local disk (separate PR)

### Testing
**e2e test**
Although we haven't added detailed testing and a numerical parity check
with torchtune yet, we did a simple E2E test from client to server:
1. set up the server with `llama stack build --template
experimental-post-training --image-type conda` and `llama stack run
experimental-post-training`
2. on the client, run `llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 post_training
supervised_fine_tune`
3. training finishes successfully: on the server side, the finetune
checkpoints appear under the output dir; on the client side, we get the
job uuid

server 
<img width="1110" alt="Screenshot 2024-12-02 at 5 52 32 PM"
src="https://github.com/user-attachments/assets/b548eb90-7a9b-4edc-a858-ee237cc4361d">

client 
<img width="807" alt="Screenshot 2024-12-02 at 5 52 37 PM"
src="https://github.com/user-attachments/assets/1138ffa8-4698-40fa-b190-3d7b99646838">

**parity check**
torchtune dataloader output and llama-stack post-training dataloader
output are the same:
<img width="1116" alt="Screenshot 2024-12-04 at 8 18 46 PM"
src="https://github.com/user-attachments/assets/5e295cdc-4c24-4ea6-82c0-ca96ef1bd6ee">

torchtune LoRA SFT and llama-stack post-training LoRA SFT on the alpaca
dataset with the llama3.2 3B instruct model match numerically.

<img width="860" alt="Screenshot 2024-12-04 at 8 17 01 PM"
src="https://github.com/user-attachments/assets/c05cf0a8-c674-4d2e-9f0a-c5d01b2dca99">

<img width="1049" alt="Screenshot 2024-12-04 at 8 17 06 PM"
src="https://github.com/user-attachments/assets/b911d4e2-e7b1-41a9-b62c-d75529b6d443">

**unit test**
2024-12-13 11:05:35 -08:00
Dinesh Yeduguru
96e158eaac
Make embedding generation go through inference (#606)
This PR does the following:
1) adds the ability to generate embeddings in all supported inference
providers.
2) moves all the memory providers to use the inference API, and improves
the memory tests to set up the inference stack correctly and use the
embedding models.

This is a merge of #589 and #598
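
A rough sketch of the new pattern on the memory-provider side; the function and field names are illustrative assumptions:

```python
import numpy as np

# hedged sketch: `inference_api.embeddings` mirrors the inference API; `index` is
# any vector index with an `add` method (e.g. faiss)
async def add_documents(inference_api, index, embedding_model: str, documents: list[str]):
    response = await inference_api.embeddings(model_id=embedding_model, contents=documents)
    index.add(np.array(response.embeddings, dtype=np.float32))
```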
2024-12-12 11:47:50 -08:00
Ashwin Bharambe
b7cb06f004
Allow using an "inline" version of Chroma using PersistentClient (#567)
The same code is used (inside providers/remote/memory/chroma/chroma.py),
but separate configurations drive which Chroma client is used. Note that
the dependencies are separate (`chromadb-client` vs `chromadb` -- the
latter is a _much_ heavier package).

```
pytest -s -v -m chroma memory/test_memory.py --env CHROMA_DB_PATH=/tmp/chroma_test
pytest -s -v -m chroma memory/test_memory.py --env CHROMA_URL=http://localhost:6001
```
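
The selection logic boils down to something like this sketch; the config field names are assumptions, while `PersistentClient` and `HttpClient` are real chromadb entry points:

```python
import chromadb

def make_chroma_client(config):
    # inline mode: the heavier `chromadb` package with local persistence
    if getattr(config, "db_path", None):
        return chromadb.PersistentClient(path=config.db_path)
    # remote mode: the lighter `chromadb-client` package talking HTTP to a server
    return chromadb.HttpClient(host=config.host, port=config.port)
```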
2024-12-11 16:02:04 -08:00
Xi Yan
a4bcfb8bba
[/scoring] add ability to define aggregation functions for scoring functions & refactors (#597)
# What does this PR do?

- Add ability to define aggregation functions for scoring functions via
`ScoringFnParams`
- Supported by `basic` / `regex_parser` / `llm_as_judge` scoring
functions
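
A hypothetical sketch of what defining an aggregation might look like; the import path and field names are assumptions, not the exact schema introduced here:

```python
# hedged sketch: class and field names are assumed for illustration
from llama_stack.apis.scoring_functions import RegexParserScoringFnParams

params = RegexParserScoringFnParams(
    parsing_regexes=[r"Answer:\s*([A-D])"],  # extract the model's choice
    aggregation_functions=["accuracy"],       # fold per-row scores into one number
)
```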


## Test Plan

```
pytest -v -s -m basic_scoring_together_inference scoring/test_scoring.py
```
<img width="855" alt="image"
src="https://github.com/user-attachments/assets/12db8e6e-2ad4-462e-b9b9-70ba6c050a6c">


```
pytest -v -s -m llm_as_judge_scoring_together_inference scoring/test_scoring.py
```
<img width="858" alt="image"
src="https://github.com/user-attachments/assets/bf806676-6f5e-456d-be9f-f81a26d1df19">



**Example Response** (`basic`)
<img width="863" alt="image"
src="https://github.com/user-attachments/assets/0e57a49c-8386-45cc-8fa9-3e61aaa9a3be">

**Example Response** (`llm-as-judge`)
<img width="854" alt="image"
src="https://github.com/user-attachments/assets/38065bc2-b724-47ed-9535-79b6099c4362">




## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-11 10:03:42 -08:00
Dinesh Yeduguru
e128f2547a
add tracing back to the lib cli (#595)
Adds back all the tracing logic removed from the library client. Also
adds back the logging to agent_instance.
2024-12-11 08:44:20 -08:00
Dinesh Yeduguru
2e3d3a62a5 Revert "add tracing to library client (#591)"
This reverts commit bc1fddf1df.
2024-12-10 08:50:20 -08:00
Dinesh Yeduguru
686f8d5b8d remove info logging in agent instance 2024-12-10 08:40:42 -08:00
Ashwin Bharambe
a4d8a6009a
Fixes for library client (#587)
The library client used _server_-side types, which was no bueno. The fix
here is not the completely correct one, but it is good enough for now
and for the demo notebook.
2024-12-09 17:14:37 -08:00
Dinesh Yeduguru
bc1fddf1df
add tracing to library client (#591) 2024-12-09 15:46:26 -08:00
Xi Yan
ab7145a04f minor refactor 2024-12-09 15:43:12 -08:00
Xi Yan
cd40a5fdbf
update template run.yaml to include openai api key for braintrust (#590)
# What does this PR do?

**Why**
- the braintrust provider needs the OpenAI API key set in its config
for the DirectClient to work

## Test Plan
```
python llama_stack/scripts/distro_codegen.py 
```

<img width="340" alt="image"
src="https://github.com/user-attachments/assets/eae38296-f880-40f0-9a9e-46a12038db64">

- set API key in client via provider_data
<img width="907" alt="image"
src="https://github.com/user-attachments/assets/3d74cd7c-dc7e-4a42-8a40-c22f19b0c534">


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-09 15:40:59 -08:00
Ashwin Bharambe
5335393fe3 Avoid deleting temp directory between agent turns
This brings up an interesting aspect -- we need to maintain
session-level tempdir state (!) since the model was told there was some
resource at a given location that it needs to maintain
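
A minimal sketch of the idea, assuming a map keyed by session id (names are illustrative, not the actual implementation):

```python
import tempfile

# hedged sketch: keep one tempdir per session instead of deleting it between turns
_session_tempdirs: dict[str, str] = {}

def get_session_tempdir(session_id: str) -> str:
    if session_id not in _session_tempdirs:
        _session_tempdirs[session_id] = tempfile.mkdtemp(prefix=f"session-{session_id}-")
    return _session_tempdirs[session_id]
```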
2024-12-08 22:25:37 -08:00
Ashwin Bharambe
e951852848 Miscellaneous fixes around telemetry, library client and run yaml autogen
Also add a `venv` image-type for llama stack build
2024-12-08 20:40:22 -08:00
Ashwin Bharambe
224e62290f kill unnecessarily large imports from telemetry init 2024-12-08 16:57:16 -08:00
Dinesh Yeduguru
c543bc0745
Console span processor improvements (#577)
Makes the console span processor output spans in a less prominent way
and highlights logs based on severity.
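
The severity highlighting amounts to something like this illustrative snippet (not the actual processor code):

```python
# hedged sketch: ANSI colors keyed by severity; the real processor is richer
SEVERITY_COLORS = {"debug": "\033[2m", "info": "\033[0m", "warn": "\033[33m", "error": "\033[31m"}
RESET = "\033[0m"

def format_log(severity: str, message: str) -> str:
    return f"{SEVERITY_COLORS.get(severity, '')}{message}{RESET}"
```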


![Screenshot 2024-12-06 at 11 26
46 AM](https://github.com/user-attachments/assets/c3a1b051-85db-4b71-b7a5-7bab5a26f072)
2024-12-06 11:46:16 -08:00
Ashwin Bharambe
084ec337af Small cleanup of console logs 2024-12-06 10:29:24 -08:00
Adrian Cole
27a27152cd
Renames otel config from jaeger to otel (#569)
# What does this PR do?

#525 introduced a telemetry configuration named jaeger, but what it
really points to is an OTLP HTTP endpoint, which is supported by most
servers in the ecosystem, including raw opentelemetry collectors,
several APMs, and even https://github.com/ymtdzzz/otel-tui

I chose to rename this to "otel" as it will bring more people into the
ecosystem vs feeling it only works with jaeger. Later, we can use the
[standard
ENV](https://opentelemetry.io/docs/specs/otel/protocol/exporter/) to
configure this if we like, so that you can override things with the
variables people might expect (see the sketch below).
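
For reference, honoring the spec-defined variable could look like this sketch; the fallback port shown is the conventional OTLP/HTTP one:

```python
import os

# OTEL_EXPORTER_OTLP_ENDPOINT is the standard variable from the OTel exporter spec
endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
```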

Note: I also added to the README that you have to install conda.
Depending on the experience level of the user, and especially with
miniforge vs other ways, I felt this helps.

## Test Plan

I would like to test this, but actually got a little lost. The previous
PRs referenced YAML which doesn't seem to be published anywhere. It
would be nice to have a pre-canned setup that uses ollama and turns on
otel, but I would also appreciate a hand with instructions meanwhile.

## Sources

https://github.com/meta-llama/llama-stack/pull/525

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.

---------

Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
2024-12-06 10:16:42 -08:00
Ashwin Bharambe
392be5f6dc Reduce log volume a bit, needs more work 2024-12-05 21:40:21 -08:00
Dinesh Yeduguru
c23363d561
Add ability to query and export spans to dataset (#574)
This PR adds two new methods to the telemetry API:
1) the ability to query spans directly, instead of first querying traces
and then using those to get spans
2) save_spans_to_dataset, which builds on span querying to save the
results to a dataset.

This gives the ability to save the spans that are part of an agent
session to a dataset.

The unique aspect of this API is that we don't require each telemetry
provider to implement these methods. Hence, they are implemented in the
protocol class itself. This required the protocol check to be slightly
modified.
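
A minimal sketch of the protocol-level default; the method names and signatures are assumptions:

```python
from typing import Any, List, Protocol

class Telemetry(Protocol):
    async def query_spans(self, attribute_filters: List[Any]) -> List[Any]:
        ...

    # hedged sketch: implemented on the protocol itself so individual telemetry
    # providers do not each have to re-implement it
    async def save_spans_to_dataset(self, attribute_filters: List[Any], dataset_id: str) -> None:
        spans = await self.query_spans(attribute_filters)
        # persist `spans` as rows of `dataset_id` via the datasets API (elided here)
```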
2024-12-05 21:07:30 -08:00
Ashwin Bharambe
cdfc98cf08 add a warning at least for when bwrap is not available for code execution 2024-12-05 20:54:28 -08:00
Ashwin Bharambe
66440e2c20 Add missing init file 2024-12-05 17:44:14 -08:00
Dinesh Yeduguru
fcd6449519
Telemetry API redesign (#525)
# What does this PR do?
Change the Telemetry API to be able to support different use cases, like
returning traces for the UI and the ability to export for Evals.
Other changes:
* Add a new trace_protocol decorator to decorate all our API methods so
that any call to them will automatically get traced across all impls.
* There is an issue with the decorator pattern of span creation when
using async generators, where there are multiple yields within the same
context. I think it's much more explicit to use the context manager
pattern with `with`. I moved the span creations in agent instance to use
`with` (see the sketch after this list).
* Inject the session id at the turn level, which should quickly give us
all traces across turns for a given session
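
A minimal sketch of the explicit context-manager pattern around an async generator; the `span` helper here is a stand-in, not the stack's real one:

```python
import contextlib

@contextlib.contextmanager
def span(name: str):
    # stand-in for the real span context manager
    print(f"start span: {name}")
    try:
        yield
    finally:
        print(f"end span: {name}")

async def stream_turn(generator):
    # one explicit span wraps the whole generator body, so multiple yields all
    # happen inside a single, clearly delimited context
    with span("stream_turn"):
        async for chunk in generator:
            yield chunk
```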

Addresses #509

## Test Plan
```
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000


 curl -X POST 'http://localhost:5000/alpha/telemetry/query-traces' \
-H 'Content-Type: application/json' \
-d '{
  "attribute_filters": [
    {
      "key": "session_id",
      "op": "eq",
      "value": "dd667b87-ca4b-4d30-9265-5a0de318fc65" }],
  "limit": 100,
  "offset": 0,
  "order_by": ["start_time"]
}' | jq .
[
  {
    "trace_id": "6902f54b83b4b48be18a6f422b13e16f",
    "root_span_id": "5f37b85543afc15a",
    "start_time": "2024-12-04T08:08:30.501587",
    "end_time": "2024-12-04T08:08:36.026463"
  },
  {
    "trace_id": "92227dac84c0615ed741be393813fb5f",
    "root_span_id": "af7c5bb46665c2c8",
    "start_time": "2024-12-04T08:08:36.031170",
    "end_time": "2024-12-04T08:08:41.693301"
  },
  {
    "trace_id": "7d578a6edac62f204ab479fba82f77b6",
    "root_span_id": "1d935e3362676896",
    "start_time": "2024-12-04T08:08:41.695204",
    "end_time": "2024-12-04T08:08:47.228016"
  },
  {
    "trace_id": "dbd767d76991bc816f9f078907dc9ff2",
    "root_span_id": "f5a7ee76683b9602",
    "start_time": "2024-12-04T08:08:47.234578",
    "end_time": "2024-12-04T08:08:53.189412"
  }
]


curl -X POST 'http://localhost:5000/alpha/telemetry/get-span-tree' \
-H 'Content-Type: application/json' \
-d '{ "span_id" : "6cceb4b48a156913", "max_depth": 2, "attributes_to_return": ["input"] }' | jq .
{
  "span_id": "6cceb4b48a156913",
  "trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
  "parent_span_id": "892a66d726c7f990",
  "name": "retrieve_rag_context",
  "start_time": "2024-12-04T09:28:21.781995",
  "end_time": "2024-12-04T09:28:21.913352",
  "attributes": {
    "input": [
      "{\"role\":\"system\",\"content\":\"You are a helpful assistant\"}",
      "{\"role\":\"user\",\"content\":\"What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.\",\"context\":null}"
    ]
  },
  "children": [
    {
      "span_id": "1a2df181854064a8",
      "trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
      "parent_span_id": "6cceb4b48a156913",
      "name": "MemoryRouter.query_documents",
      "start_time": "2024-12-04T09:28:21.787620",
      "end_time": "2024-12-04T09:28:21.906512",
      "attributes": {
        "input": null
      },
      "children": [],
      "status": "ok"
    }
  ],
  "status": "ok"
}

```

<img width="1677" alt="Screenshot 2024-12-04 at 9 42 56 AM"
src="https://github.com/user-attachments/assets/4d3cea93-05ce-415a-93d9-4b1628631bf8">
2024-12-04 11:22:45 -08:00
Xi Yan
16769256b7
[llama stack ui] add native eval & inspect distro & playground pages (#541)
# What does this PR do?

New Pages Added: 

- (1) Inspect Distro
- (2) Evaluations: 
  - (a) native evaluations (including generation)
  - (b) application evaluations (no generation, scoring only)
- (3) Playground: 
  - (a) chat
  - (b) RAG  

## Test Plan

```
streamlit run app.py
```

#### Playground

https://github.com/user-attachments/assets/6ca617e8-32ca-49b2-9774-185020ff5204

#### Inspect

https://github.com/user-attachments/assets/01d52b2d-92af-4e3a-b623-a9b8ba22ba99


#### Evaluations (Generation + Scoring)

https://github.com/user-attachments/assets/345845c7-2a2b-4095-960a-9ae40f6a93cf

#### Evaluations (Scoring)

https://github.com/user-attachments/assets/6cc1659f-eba4-49ca-a0a5-7c243557b4f5


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-04 09:47:09 -08:00
Sixian Yi
caf1dac114
unregister API for dataset (#507)
# What does this PR do?

1) Implement the `unregister_dataset(dataset_id)` API in both the llama
stack routing table and the providers: it removes the {dataset_id ->
Dataset} mapping from the routing table and removes the dataset_id
references in the provider as well (e.g. for huggingface, we use a KV
store to map dataset id => dataset; we delete that entry during
unregistering too). A rough sketch of the flow appears after this list.

2) Expose the datasets/unregister_dataset API endpoint.
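
The removal flow, sketched with plain dicts; the real routing table holds richer objects, and the names here are illustrative:

```python
# hedged sketch of the two-step removal described above
async def unregister_dataset(registry: dict, providers: dict, dataset_id: str) -> None:
    dataset = registry.pop(dataset_id)             # drop the {dataset_id -> Dataset} mapping
    provider = providers[dataset["provider_id"]]
    await provider.unregister_dataset(dataset_id)  # drop provider-side state (e.g. the KV entry)
```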

## Test Plan

**Unit test:** 

```
pytest llama_stack/providers/tests/datasetio/test_datasetio.py -m "huggingface" -v -s --tb=short --disable-warnings
```

**Test on endpoint:**
Tested llama stack using an ollama distribution template:
1) start an ollama server
2) start a llama stack server with the default ollama distribution
config + the datasets/datasetio APIs + the datasetio provider
```
---- .../ollama-run.yaml
...
apis:
- agents
- inference
- memory
- safety
- telemetry
- datasetio
- datasets
providers:
  datasetio:
  - provider_id: localfs
    provider_type: inline::localfs
    config: {}
...
```
   Saw that the new API showed up in the server startup output:
   
  ```
Serving API datasets
 GET /alpha/datasets/get
 GET /alpha/datasets/list
 POST /alpha/datasets/register
 POST /alpha/datasets/unregister
```

3) query `/alpha/datasets/unregister` through curl (since we have not implemented the unregister API in the llama stack client)

```
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets register
--dataset-id sixian --url
https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/chat.rst
--schema {}
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type    ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian     │ localfs     │ {}       │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets register
--dataset-id sixian2 --url
https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/chat.rst
--schema {}
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type    ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian     │ localfs     │ {}       │ dataset │
│ sixian2    │ localfs     │ {}       │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % curl
http://localhost:5001/alpha/datasets/unregister \
-H "Content-Type: application/json" \
-d '{"dataset_id": "sixian"}'
null%

(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type    ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian2    │ localfs     │ {}       │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % curl
http://localhost:5001/alpha/datasets/unregister \
-H "Content-Type: application/json" \
-d '{"dataset_id": "sixian2"}'
null%

(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
```


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-03 21:18:30 -08:00
Xi Yan
6e10d0b23e precommit 2024-12-03 18:52:43 -08:00
Xi Yan
fd19a8a517 add missing __init__ 2024-12-03 18:50:18 -08:00
Xi Yan
50cc165077
fixes tests & move braintrust api_keys to request headers (#535)
# What does this PR do?

- the braintrust scoring provider requires the OPENAI_API_KEY env
variable to be set
- move this so it can be set via request headers (e.g. like the together
/ fireworks api keys)
- fixes pytest with the agents dependency

## Test Plan

**E2E**
```
llama stack run 
```
```yaml
scoring:
  - provider_id: braintrust-0
    provider_type: inline::braintrust
    config: {}
```

**Client**
```python
self.client = LlamaStackClient(
    base_url=os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:5000"),
    provider_data={
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
    },
)
```
- run `llama-stack-client eval run_scoring`

**Unit Test**
```
pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
```

```
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py --env OPENAI_API_KEY=$OPENAI_API_KEY
```
<img width="745" alt="image"
src="https://github.com/user-attachments/assets/68f5cdda-f6c8-496d-8b4f-1b3dabeca9c2">

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-26 13:11:21 -08:00
Xi Yan
d3956a1d22 fix description 2024-11-25 22:02:45 -08:00
Xi Yan
bbd81231ce add missing __init__ 2024-11-25 17:23:27 -08:00
Dinesh Yeduguru
501e7c9d64
Fix opentelemetry adapter (#510)
# What does this PR do?

This PR fixes some of the issues with our telemetry setup to enable logs
to be delivered to opentelemetry and jaeger. Main fixes:
1) Updates the opentelemetry provider to use the latest OTLP exports
instead of deprecated ones.
2) Adds a tracing middleware, which injects traces into each HTTP
request that the server receives; this becomes the root trace.
Previously, we did this in the create_dynamic_route method, which is not
the actual execution flow but more of a config step, and this caused
traces to end prematurely. Through middleware, we plug in the trace
start and end at the right locations (a sketch of the idea follows this
list).
3) We manage our own methods to create traces and spans, and this does
not fit well with the OpenTelemetry SDK, since it does not provide a way
to take in traces and spans that are already created; it expects us to
use the SDK to create them. For now, I have a hacky approach of just
maintaining a map from our internal telemetry objects to the
OpenTelemetry-specific ones. This is not the ideal solution; I will
explore other ways to get around this issue, but for now, to have
something that works, I am going to keep it as is.
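
A sketch of the middleware idea in Starlette terms; `start_trace`/`end_trace` stand in for the stack's tracing helpers:

```python
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class TraceMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        await start_trace(request.url.path)  # root trace now spans the whole request
        try:
            return await call_next(request)
        finally:
            await end_trace()                # ...and ends when the response is done
```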

Addresses: #509
2024-11-22 18:18:11 -08:00
Ashwin Bharambe
c1025ebfdb Delete some dead code 2024-11-21 15:20:06 -08:00
Ashwin Bharambe
a0a00f1345 Update telemetry to have TEXT be the default log format 2024-11-21 15:18:45 -08:00
Xi Yan
945db5dac2 fix logging 2024-11-21 15:02:57 -08:00
Xi Yan
654722da7d fix model id for llm_as_judge_405b 2024-11-21 11:34:49 -08:00
Dinesh Yeduguru
6395dadc2b
use logging instead of prints (#499)
# What does this PR do?

This PR moves all print statements to use logging. Things changed:
- Had to add `await start_trace("sse_generator")` to server.py to
actually get tracing working; otherwise we were not seeing any logs
- If no telemetry provider is provided in the run.yaml, we write to
stdout
- By default, the logs are in JSON, but we expose an option to configure
human-readable output (see the sketch below)
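
The JSON-by-default idea could look like this illustrative snippet (not the actual formatter in the PR):

```python
import json
import logging

# hedged sketch: JSON records by default; swap the formatter for human-readable output
class JSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
```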
2024-11-21 11:32:53 -08:00
liyunlu0618
4e1105e563
Fix fp8 quantization script. (#500)
# What does this PR do?

Fix fp8 quantization script.

## Test Plan

```
sh run_quantize_checkpoint.sh localhost fp8 /home/yll/fp8_test/ /home/yll/fp8_test/quantized_2 /home/yll/fp8_test/tokenizer.model 1 1
```



## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

Co-authored-by: Yunlu Li <yll@meta.com>
2024-11-21 09:15:28 -08:00
Ashwin Bharambe
cd6ccb664c Integrate distro docs into the restructured docs 2024-11-20 23:20:05 -08:00
Ashwin Bharambe
2411a44833 Update more distribution docs to be simpler and partially codegen'ed 2024-11-20 22:03:44 -08:00
Ashwin Bharambe
e84d4436b5
Since we are pushing for HF repos, we should accept them in inference configs (#497)
# What does this PR do?

As the title says. 

## Test Plan

This needs
8752149f58
to also land. So the next package (0.0.54) will make this work properly.

The test is:

```bash
pytest -v -s -m "llama_3b and meta_reference" test_model_registration.py
```
2024-11-20 16:14:37 -08:00
Mengtao Yuan
1086b500f9
Support Tavily as built-in search tool. (#485)
# What does this PR do?

Add Tavily as a built-in search tool, in addition to Brave and Bing.

## Test Plan

It's tested using ollama remote, showing parity to the Brave search
tool.
- Install and run ollama with `ollama run llama3.1:8b-instruct-fp16`
- Build the ollama distribution: `llama stack build --template ollama
--image-type conda`
- Run the distribution: `llama stack run
/$USER/.llama/distributions/llamastack-ollama/ollama-run.yaml --port
5001`
- Client test command: `python -m
agents.test_agents.TestAgents.test_create_agent_turn_with_tavily_search`,
with environment variables:

MASTER_ADDR=0.0.0.0;MASTER_PORT=5001;RANK=0;REMOTE_STACK_HOST=0.0.0.0;REMOTE_STACK_PORT=5001;TAVILY_SEARCH_API_KEY=tvly-<YOUR-KEY>;WORLD_SIZE=1

The test passes for this specific case (ollama remote).

Server output: 
```
Listening on ['::', '0.0.0.0']:5001
INFO:     Started server process [7220]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
INFO:     127.0.0.1:65209 - "POST /agents/create HTTP/1.1" 200 OK
INFO:     127.0.0.1:65210 - "POST /agents/session/create HTTP/1.1" 200 OK
INFO:     127.0.0.1:65211 - "POST /agents/turn/create HTTP/1.1" 200 OK
role='user' content='What are the latest developments in quantum computing?' context=None
role='assistant' content='' stop_reason=<StopReason.end_of_turn: 'end_of_turn'> tool_calls=[ToolCall(call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'latest developments in quantum computing'})]
role='ipython' call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c' tool_name=<BuiltinTool.brave_search: 'brave_search'> content='{"query": "latest developments in quantum computing", "top_k": [{"title": "IBM Unveils 400 Qubit-Plus Quantum Processor and Next-Generation IBM ...", "url": "https://newsroom.ibm.com/2022-11-09-IBM-Unveils-400-Qubit-Plus-Quantum-Processor-and-Next-Generation-IBM-Quantum-System-Two", "content": "This system is targeted to be online by the end of 2023 and will be a building b...<more>...onnect large-scale ...", "url": "https://news.mit.edu/2023/quantum-interconnects-photon-emission-0105", "content": "Quantum computers hold the promise of performing certain tasks that are intractable even on the world\'s most powerful supercomputers. In the future, scientists anticipate using quantum computing to emulate materials systems, simulate quantum chemistry, and optimize hard tasks, with impacts potentially spanning finance to pharmaceuticals.", "score": 0.71721, "raw_content": null}]}'
Assistant: The latest developments in quantum computing include:

* IBM unveiling its 400 qubit-plus quantum processor and next-generation IBM Quantum System Two, which will be a building block of quantum-centric supercomputing.
* The development of utility-scale quantum computing, which can serve as a scientific tool to explore utility-scale classes of problems in chemistry, physics, and materials beyond brute force classical simulation of quantum mechanics.
* The introduction of advanced hardware across IBM's global fleet of 100+ qubit systems, as well as easy-to-use software that users and computational scientists can now obtain reliable results from quantum systems as they map increasingly larger and more complex problems to quantum circuits.
* Research on quantum repeaters, which use defects in diamond to interconnect quantum systems and could provide the foundation for scalable quantum networking.
* The development of a new source of quantum light, which could be used to improve the efficiency of quantum computers.
* The creation of a new mathematical "blueprint" that is accelerating fusion device development using Dyson maps.
* Research on canceling noise to improve quantum devices, with MIT researchers developing a protocol to extend the life of quantum coherence.
```

Verified with the tool response. The final model response is updated
with the search results.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

Co-authored-by: Martin Yuan <myuan@meta.com>
2024-11-19 20:59:02 -08:00
Ashwin Bharambe
2a31163178
Auto-generate distro yamls + docs (#468)
# What does this PR do?

Automatically generates
- build.yaml
- run.yaml
- run-with-safety.yaml
- parts of markdown docs

for the distributions.

## Test Plan

At this point, this only updates the YAMLs and the docs. Some testing
(especially with ollama and vllm) has been performed, but much more
testing is needed.
2024-11-18 14:57:06 -08:00
Xi Yan
0784284ab5
[Agentic Eval] add ability to run agents generation (#469)
# What does this PR do?

- add the ability to run agent generation for full eval (generate +
scoring)
- pre-register the SimpleQA benchmark llm-as-judge scoring function in
code


## Test Plan


![image](https://github.com/user-attachments/assets/b4b6f086-1be4-4c2a-8ab0-6839f0067c0a)


![image](https://github.com/user-attachments/assets/05bb7a09-2d7a-4031-8eb6-e1ca670ee439)


#### Simple QA w/ Search

![image](https://github.com/user-attachments/assets/0a51e3f3-9fc7-479b-8295-89aed63496e0)

- eval_task_config_simpleqa_search.json
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.1-405B-Instruct",
            "instructions": "Please use the search tool to answer the question.",
            "sampling_params": {
                "strategy": "greedy",
                "temperature": 1.0,
                "top_p": 0.9
            },
            "tools": [
                {
                    "type": "brave_search",
                    "engine": "brave",
                    "api_key": "API_KEY"
                }
            ],
            "tool_choice": "auto",
            "tool_prompt_format": "json",
            "input_shields": [],
            "output_shields": [],
            "enable_session_persistence": false
        }
    }
}
```

#### SimpleQA w/o Search

![image](https://github.com/user-attachments/assets/6301feef-2abb-4bee-b50c-97da1c90482b)


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-18 11:43:03 -08:00
Dinesh Yeduguru
57bafd0f8c
fix faiss serialize and serialize of index (#464)
faiss serialize_index returns a numpy object, which we first need to
save to a buffer and then write to SQLite. Since we are using JSON, we
need to base64-encode the data.

Same on the read path: we base64-decode, read into a numpy array, and
then call into deserialize_index. A round-trip sketch follows.
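
A round-trip sketch of the scheme; `serialize_index`/`deserialize_index` are real faiss APIs, while the JSON wrapper here is just an illustration of what gets written to SQLite:

```python
import base64
import json

import faiss
import numpy as np

index = faiss.IndexFlatL2(8)
index.add(np.random.rand(4, 8).astype(np.float32))

buf = faiss.serialize_index(index)  # numpy uint8 array
payload = json.dumps({"faiss_index": base64.b64encode(buf.tobytes()).decode("ascii")})

raw = base64.b64decode(json.loads(payload)["faiss_index"])
restored = faiss.deserialize_index(np.frombuffer(raw, dtype=np.uint8))
```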

tests:
```
torchrun $CONDA_PREFIX/bin/pytest -v -s -m "faiss" llama_stack/providers/tests/memory/test_memory.py
```

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-15 18:02:48 -08:00
Dinesh Yeduguru
ff99025875
await initialize in faiss (#463)
tests:
```
 torchrun $CONDA_PREFIX/bin/pytest -v -s -m "faiss" llama_stack/providers/tests/memory/test_memory.py
```

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-15 14:21:31 -08:00
Xi Yan
788411b680 categorical score for llm as judge 2024-11-14 22:33:59 -05:00
Dinesh Yeduguru
0850ad656a
unregister for memory banks and remove update API (#458)
The semantics of an Update on resources are very tricky to reason about,
especially for memory banks and models. The best way forward here is for
the user to unregister and register a new resource. We don't have a
compelling reason to support update APIs.


Tests:
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "chroma" --env CHROMA_HOST=localhost --env CHROMA_PORT=8000

pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "pgvector" --env PGVECTOR_DB=postgres --env PGVECTOR_USER=postgres --env PGVECTOR_PASSWORD=mysecretpassword --env PGVECTOR_HOST=0.0.0.0

$CONDA_PREFIX/bin/pytest -v -s -m "ollama" llama_stack/providers/tests/inference/test_model_registration.py
```

---------

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-14 17:12:11 -08:00
Xi Yan
2eab3b7ed9 skip aggregation for llm_as_judge 2024-11-14 17:50:46 -05:00
Xi Yan
58381dbe78
local persistence for eval tasks (#453)
# What does this PR do?

- add local persistence for eval tasks
- follow https://github.com/meta-llama/llama-stack/pull/375

## Test Plan

1. fresh llama stack run
2. kill server
3. restart server: llama stack run

<img width="690" alt="image"
src="https://github.com/user-attachments/assets/3d76e477-b91a-43a6-86ea-8e3ef2d04ed3">

Using run.yaml
```yaml
eval_tasks:
  - eval_task_id: meta-reference-mmlu
    provider_id: meta-reference-0
    dataset_id: mmlu
    scoring_functions:
      - basic::regex_parser_multiple_choice_answer
```

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-14 10:36:23 -05:00
Sarthak Deshpande
838b8d4fb5
PR-437-Fixed bug to allow system instructions after first turn (#440)
# What does this PR do?

This PR solves the issue where agents cannot keep track of instructions
after executing the first turn, because system instructions were not
getting appended to the messages list. It also solves the issue where
turns were not being fetched in the appropriate sequence. A minimal
sketch of the fix follows this description.

Addresses issue (#issue)
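
A minimal sketch of the fix, with message types simplified for illustration (the real code uses the stack's message classes):

```python
# hedged sketch: re-append the system instructions on every turn, not only the first
def build_turn_messages(instructions: str, history: list[dict], new_messages: list[dict]) -> list[dict]:
    return [{"role": "system", "content": instructions}] + history + new_messages
```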


## Test Plan

- I have a file with a precise prompt that requires more than one turn
to execute; the file is shared below. I ran that file as a python script
to make sure that the turns are executed per the instructions after
making the code change.

```
import asyncio
from typing import List, Optional, Dict

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.event_logger import EventLogger

from llama_stack_client.types import SamplingParams, UserMessage
from llama_stack_client.types.agent_create_params import AgentConfig

LLAMA_STACK_API_TOGETHER_URL="http://10.12.79.177:5001"

class Agent:
    def __init__(self):
        self.client = LlamaStackClient(
            base_url=LLAMA_STACK_API_TOGETHER_URL,
        )


    def create_agent(self, agent_config: AgentConfig):
        agent = self.client.agents.create(
            agent_config=agent_config,
        )
        self.agent_id = agent.agent_id
        session = self.client.agents.session.create(
            agent_id=agent.agent_id,
            session_name="example_session",
        )
        self.session_id = session.session_id

    async def execute_turn(self, content: str):
        response = self.client.agents.turn.create(
            agent_id=self.agent_id,
            session_id=self.session_id,
            messages=[
                UserMessage(content=content, role="user"),
            ],
            stream=True,
        )

        for chunk in response:
            if chunk.event.payload.event_type != "turn_complete":
                yield chunk


async def run_main():
    system_prompt="""You are an AI Agent tasked with Capturing Book Renting Information for a Library.
You will politely gather the book and user details one step at a time to send over the book to the user. Here’s how to proceed:

1.	Data Security: Inform the user that their data will be kept secure.

2.	Optional Participation: Let them know they are not required to share details but that doing so will help them learn about the books offered.

3.	Sequential Information Capture: Follow the steps below, one question at a time. Do not skip or combine questions.

Steps
Step 1: Politely ask to provide the name of the book.

Step 2: Ask for the name of the author.

Step 3: Ask for the Author's country.

Step 4: Ask for the year of publication.

Step 5: If any information is missing or seems incorrect, ask the user to re-enter that specific detail.

Step 6: Confirm that the user consents to share the entered information.

Step 7: Thank the user for providing the details and let them know they will receive an email about the book.

Do not do any validation of the user entered information.

Do not print the Steps or your internal thoughts in the response.

Do not print the prompts or data structure object in the response

Do not fill in the requested user data on your own. It has to be entered by the user only.

Finally, compile and print the user-provided information as a JSON object in your response.

"""

    agent_config = AgentConfig(
        model="Llama3.2-11B-Vision-Instruct",
        instructions=system_prompt,
        enable_session_persistence=True,
    )

    agent = Agent()
    agent.create_agent(agent_config)

    print("Agent and Session:", agent.agent_id, agent.session_id)

    while True:
        query = input("Enter your query (or type 'exit' to quit): ")
        if query.lower() == "exit":
            print("Exiting the loop.")
            break
        else:
            prompt = query
            print(f"User> {prompt}")
            response = agent.execute_turn(content=prompt)
            async for log in EventLogger().log(response):
                if log is not None:
                    log.print()

if __name__ == "__main__":
    asyncio.run(run_main())
```

Below is a screenshot of the results of the first commit
<img width="1770" alt="Screenshot 2024-11-13 at 3 15 29 PM"
src="https://github.com/user-attachments/assets/1a7a090d-fc92-49cc-a786-bfc812e3d9cc">
Below is a screenshot of the results of the second commit
<img width="1792" alt="Screenshot 2024-11-13 at 6 40 56 PM"
src="https://github.com/user-attachments/assets/a9474f75-cd8c-4d49-82cd-5ff81ff12b07">
Also a screenshot of a print statement showing that the turns being
fetched are now in sequence:
<img width="1783" alt="Screenshot 2024-11-13 at 6 42 22 PM"
src="https://github.com/user-attachments/assets/b906404e-a3e4-48a2-b893-69f36bbdcb98">



## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2024-11-13 10:34:04 -08:00