Commit graph

378 commits

Author SHA1 Message Date
Hardik Shah
a1fe3c30dd
fix: Update getting_started.ipynb (#1245)
update to install properly in system python in colab
2025-02-24 18:22:32 -08:00
Ashwin Bharambe
d6356f822a fix: remove UV_SYSTEM_PYTHON from getting started notebook since llama stack build detects notebook environment 2025-02-24 10:05:02 -08:00
Reid
1842eeb96f
docs: small fixes (#1224) 2025-02-24 07:59:58 -05:00
Yuan Tang
17162b9978
docs: Add vLLM to the list of inference providers in concepts and providers pages (#1227)
This increases visibility of the vLLM provider.
2025-02-23 20:16:30 -08:00
Francisco Arceo
19ae4b35d9
docs: Adding Provider sections to docs (#1195)
# What does this PR do?
Adding Provider sections to docs (some of these will be empty and need
updating).


This PR is still a draft while I seek feedback from other contributors.
I opened it to make the structure visible in the linked GitHub Issue.

# Closes https://github.com/meta-llama/llama-stack/issues/1189

- Providers Overview Page
![Screenshot 2025-02-21 at 12 15
09 PM](https://github.com/user-attachments/assets/e83e5a17-0d96-4de0-8251-68161799a054)

- SQLite-Vec specific page
![Screenshot 2025-02-21 at 12 15
34 PM](https://github.com/user-attachments/assets/14773900-fc8f-49e9-832a-b060b7ca010a)

## Test Plan
N/A

[//]: # (## Documentation)

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-02-22 11:59:34 -08:00
ehhuang
25fddccfd8
feat: tool outputs metadata (#1155)
Summary:

Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.

Will need to make a similar change on the client side to support
ClientTool outputting metadata.

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
2025-02-21 13:15:31 -08:00
Xi Yan
0fe071764f
feat(1/n): api: unify agents for handling server & client tools (#1178)
# Problem

Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.

1. Server Tools: a single Turn is returned including `ToolExecutionStep`
in agenst
2. Client Tools: `create_agent_turn` is called in loop with client agent
lib yielding the agent chunk

ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)

This makes it inconsistent to work with server & client tools. It also
complicates the logs to telemetry to get information about agents turn /
history for observability.

#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.

## Solution

1. `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not completed, and awaiting tool input
2. `continue_agent_turn` endpoint to update agent turn with client's
tool response.


# What does this PR do?
- Skeleton API as example

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

- Just API update, no functionality change
```
llama stack run + client-sdk test
```

<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>


[//]: # (## Documentation)
2025-02-21 11:48:27 -08:00
Ashwin Bharambe
992f865b2e
chore: move embedding deps to RAG tool where they are needed (#1210)
`EMBEDDING_DEPS` were wrongly associated with `vector_io` providers.
They are needed by
https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/memory/vector_store.py#L142
and related code and is used by the RAG tool and as such should only be
needed by the `inline::rag-runtime` provider.
2025-02-21 11:33:41 -08:00
Ashwin Bharambe
11697f85c5
fix: pull ollama embedding model if necessary (#1209)
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.

Thanks @hardikjshah for the suggestion.

Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)

## Test Plan 

First run `ollama rm all-minilm:latest`. 

Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
2025-02-21 10:35:56 -08:00
Reid
c9c4a3c921
feat: model remove cmd (#1128)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

add a subcommand, help to clean the unneeded model:
```
$ llama model --help
usage: llama model [-h] {download,list,prompt-format,describe,verify-download,remove} ...

Work with llama models

options:
  -h, --help            show this help message and exit

$ llama model remove --help
usage: llama model remove [-h] -m MODEL [-f]

Remove the downloaded llama model

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Specify the llama downloaded model name
  -f, --force           Used to forcefully remove the llama model from the storage without further confirmation

$ llama model remove -m Llama3.2-1B-Instruct:int4-qlora-eo8
Are you sure you want to remove Llama3.2-1B-Instruct:int4-qlora-eo8? (y/n): n
Removal aborted.

$ llama model remove -mLlama3.2-1B-Instruct:int4-qlora-eo8-f
Llama3.2-1B-Instruct:int4-qlora-eo8 removed.
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:05:12 -08:00
Ashwin Bharambe
81ce39a607
feat(api): Add options for supporting various embedding models (#1192)
We need to support:
- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932) 

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
2025-02-20 22:27:12 -08:00
Ashwin Bharambe
6f9d622340
fix(api): update embeddings signature so inputs and outputs list align (#1161)
See Issue #922 

The change is slightly backwards incompatible but no callsite (in our
client codebases or stack-apps) every passes a depth-2
`List[List[InterleavedContentItem]]` (which is now disallowed.)

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

Also ran `tests/client-sdk/inference/test_embeddings.py`
2025-02-20 21:43:13 -08:00
Matthew Farrellee
832c535aaf
feat(providers): add NVIDIA Inference embedding provider and tests (#935)
# What does this PR do?

add /v1/inference/embeddings implementation to NVIDIA provider

**open topics** -
- *asymmetric models*. NeMo Retriever includes asymmetric models, which
are models that embed differently depending on if the input is destined
for storage or lookup against storage. the /v1/inference/embeddings api
does not allow the user to indicate the type of embedding to perform.
see https://github.com/meta-llama/llama-stack/issues/934
- *truncation*. embedding models typically have a limited context
window, e.g. 1024 tokens is common though newer models have 8k windows.
when the input is larger than this window the endpoint cannot perform
its designed function. two options: 0. return an error so the user can
reduce the input size and retry; 1. perform truncation for the user and
proceed (common strategies are left or right truncation). many users
encounter context window size limits and will struggle to write reliable
programs. this struggle is especially acute without access to the
model's tokenizer. the /v1/inference/embeddings api does not allow the
user to delegate truncation policy. see
https://github.com/meta-llama/llama-stack/issues/933
- *dimensions*. "Matryoshka" embedding models are available. they allow
users to control the number of embedding dimensions the model produces.
this is a critical feature for managing storage constraints. embeddings
of 1024 dimensions what achieve 95% recall for an application may not be
worth the storage cost if a 512 dimensions can achieve 93% recall.
controlling embedding dimensions allows applications to determine their
recall and storage tradeoffs. the /v1/inference/embeddings api does not
allow the user to control the output dimensions. see
https://github.com/meta-llama/llama-stack/issues/932

## Test Plan

- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-20 16:59:48 -08:00
Ashwin Bharambe
9436dd570d
feat: register embedding models for ollama, together, fireworks (#1190)
# What does this PR do?

We have support for embeddings in our Inference providers, but so far we
haven't done the final step of actually registering the known embedding
models and making sure they are extremely easy to use. This is one step
towards that.

## Test Plan

Run existing inference tests.

```bash

$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

The value of the EMBEDDING_DIMENSION isn't actually used in these tests,
it is merely used by the test fixtures to check if the model is an LLM
or Embedding.
2025-02-20 15:39:08 -08:00
ehhuang
1166afdf76
fix: some telemetry APIs don't currently work (#1188)
Summary:

This bug is surfaced by using the http LS client. The issue is that
non-scalar values in 'GET' method are `body` params in fastAPI, but our
spec generation script doesn't respect that. We fix by just making them
POST method instead.

Test Plan:
Test API call with newly sync'd client
(https://github.com/meta-llama/llama-stack-client-python/pull/149)

<img width="1114" alt="image"
src="https://github.com/user-attachments/assets/7710aca5-d163-4e00-a465-14e6fcaac2b2"
/>
2025-02-20 14:09:25 -08:00
Xi Yan
ea1faae50e
chore!: deprecate eval/tasks (#1186)
# What does this PR do?
- Fully deprecate eval/tasks

[//]: # (If resolving an issue, uncomment and update the line below)
Closes #1088 

NOTE: this will be a breaking change. We have introduced the new API in
0.1.3 .

Notebook has been updated to use the new endpoints.

## Test Plan
```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb 
```
<img width="611" alt="image"
src="https://github.com/user-attachments/assets/79f6efe1-81ba-494e-bf36-1fc0c2b9bc6f"
/>



cc @SLR722  for awareness

[//]: # (## Documentation)
2025-02-20 14:06:21 -08:00
Ashwin Bharambe
07ccf908f7 ModelAlias -> ProviderModelEntry 2025-02-20 14:02:36 -08:00
Kevin Cogan
561295af76
docs: Fix Links, Add Podman Instructions, Vector DB Unregister, and Example Script (#1129)
# What does this PR do?
This PR improves the documentation in several ways:

- **Fixed incorrect link in `tools.md`** to ensure all references point
to the correct resources.
- **Added instructions for running the `code-interpreter` agent in a
Podman container**, helping users configure and execute the tool in
containerized environments.
- **Introduced an unregister command for single and multiple vector
databases**, making it easier to manage vector DBs.
- **Provided a simple example script for using the `code-interpreter`
agent**, giving users a practical reference for implementation.

These updates enhance the clarity, usability, and completeness of the
documentation.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
The following steps were performed to verify the accuracy of the
changes:

1. **Validated all fixed link** by checking their destinations to ensure
correctness.
2. **Ran the `code-interpreter` agent in a Podman container** following
the new instructions to confirm functionality.
3. **Executed the vector database unregister commands** and verified
that both single and multiple databases were correctly removed.
4. **Tested the new example script for `code-interpreter`**, ensuring it
runs without errors.

All changes were reviewed and tested successfully, improving the
documentation's accuracy and ease of use.

[//]: # (## Documentation)
2025-02-20 13:52:14 -08:00
Vladimir Ivić
f7161611c6
feat: adding endpoints for files and uploads (#1070)
Summary:
Adds spec definitions for file uploads operations.

This API focuses around two high level operations:
* Initiating and managing upload session
* Accessing uploaded file information

Usage examples:

To start a file upload session:
```
curl -X POST https://localhost:8321/v1/files \
-d '{
   "key": "image123.jpg',
   "bucket": "images",
   "mime_type": "image/jpg",
   "size": 12345
}'

# Returns
{
  “id”: <session_id>
  “url”: “https://localhost:8321/v1/files/session:<session_id>”,
  "offset": 0,
  "size": 12345
}

```

To upload file content to an existing session
```
curl -i -X POST "https://localhost:8321/v1/files/session:<session_id> \
  --data-binary @<path_to_local_file>

# Returns
{
  "key": "image123.jpg",
  "bucket": "images",
  "mime_type": "image/jpg",
  "bytes": 12345,
  "created_at": 1737492240
}

# Implementing on server side (Flask example for simplicity):
@app.route('/uploads/{upload_id}', methods=['POST'])
def upload_content_to_session(upload_id):
    try:
        # Get the binary file data from the request body
        file_data = request.data

        # Save the file to disk
        save_path = f"./uploads/{upload_id}"
        with open(save_path, 'wb') as f:
            f.write(file_data)
        return {__uploaded_file_json__}, 200
    except Exception as e:
        return 500

```

To read information about an existing upload session
```
curl -i -X GET "https://localhost:8321/v1/files/session:<session_id>

# Returns
{
  “id”: <session_id>
  “url”: “https://localhost:8321/v1/files/session:<session_id>”,
  "offset": 1024,
  "size": 12345
}
```

To list buckets
```
GET /files

# Returns
{
  "data": [
     {"name": "bucket1"},
     {"name": "bucket2"},
   ]
}
```

To list all files in a bucket
```
GET /files/{bucket}

# Returns
{
  "data": [
    {
      "key": "shiba.jpg",
      "bucket": "dogs",
      "mime_type": "image/jpg",
      "bytes": 82334,
      "created_at": 1737492240,
    },
    {
      "key": "persian_cat.jpg",
      "mime_type": "image/jpg",
      "bucket": "cats",
      "bytes": 39924,
      "created_at": 1727493440,
    },
  ]
}
```

To get specific file info
```
GET /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```

To delete specific file
```
DELETE /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```
2025-02-20 13:09:00 -08:00
Ben Browning
fbec826883
docs: Add note about distro_codegen.py and provider dependencies (#1175)
# What does this PR do?

This expands upon the existing distro_codegen.py text in the new API
provider documentation to include a note about not including
provider-specific dependencies in the code path that builds the
distribution's template.

Our distro_codegen pre-commit hook will catch this case anyway, but this
attempts to inform provider authors ahead of time about that.

## Test Plan

I built the docs website locally via the following:
```
pip install docs/requirements.txt
sphinx-build -M html docs/source docs_output
```
Then, I opened that newly generated
`docs_output/html/contributing/new_api_provider.html` in my browser and
confirmed everything rendered correctly.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-02-20 09:23:46 -08:00
Sixian Yi
531940aea9
script for running client sdk tests (#895)
# What does this PR do?
Create a script for running all client-sdk tests on Async Library
client, with the option to generate report


## Test Plan

```
python llama_stack/scripts/run_client_sdk_tests.py --templates together fireworks --report
```



## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-02-19 22:38:06 -08:00
Yuan Tang
25cdab5b28
docs: Remove unused python-openapi and json-strong-typing in openapi_generator (#1167)
This is no longer required to generated API reference after
5e7904ef6c
2025-02-19 22:06:29 -08:00
Ashwin Bharambe
d39f8de619 Pin sphinx 2025-02-19 20:20:46 -08:00
Ashwin Bharambe
89fdb2c9e9 Try a different css file API for sphinx 2025-02-19 20:14:40 -08:00
Sébastien Han
26503ca1a4
docs: fix Python llama_stack_client SDK links (#1150)
# What does this PR do?
It seems that the llama_stack_client repo and the main repo were
originally the same, causing links to point to local references. We’ve
now updated them to use the correct llama_stack_client repo links.

Signed-off-by: Sébastien Han <seb@redhat.com>

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-19 19:05:14 -08:00
Ben Browning
e9b8259cf9
fix: Get distro_codegen.py working with default deps and enabled in pre-commit hooks (#1123)
# What does this PR do?

Before this change, `distro_codegen.py` would only work if the user
manually installed multiple provider-specific dependencies (see #1122).
Now, users can run `distro_codegen.py` without any provider-specific
dependencies because we avoid importing the entire provider
implementations just to get the config needed to build the provider
template.

Concretely, this mostly means moving the
MODEL_ALIASES (and related variants) definitions to a new models.py
class within the provider implementation for those providers that
require additional dependencies. It also meant moving a couple of
imports from top-level imports to inside `get_adapter_impl` for some
providers, which follows the pattern used by multiple existing
providers.

To ensure we don't regress and accidentally add new imports that cause
distro_codegen.py to fail, the stubbed-in pre-commit hook for
distro_codegen.py was uncommented and slightly tweaked to run via `uv
run python ...` to ensure it runs with only the project's default
dependencies and to run automatically instead of manually.

Lastly, this updates distro_codegen.py itself to keep track of paths it
might have changed and to only `git diff` those specific paths when
checking for changed files instead of doing a diff on the entire working
tree. The latter was overly broad and would require a user have no other
unstaged changes in their working tree, even if those unstaged changes
were unrelated to generated code. Now it only flags uncommitted changes
for paths distro_codegen.py actually writes to.

Our generated code was also out-of-date, presumably because of these
issues, so this commit also has some updates to the generated code
purely because it was out of sync, and the pre-commit hook now enforces
things to be updated.

(Closes #1122)

## Test Plan

I manually tested distro_codegen.py and the pre-commit hook to verify
those work as expected, flagging any uncommited changes and catching any
imports that attempt to pull in provider-specific dependencies.

However, I do not have valid api keys to the impacted provider
implementations, and am unable to easily run the inference tests against
each changed provider. There are no functional changes to the provider
implementations here, but I'd appreciate a second set of eyes on the
changed import statements and moving of MODEL_ALIASES type code to a
separate models.py to ensure I didn't make any obvious errors.

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-19 18:39:20 -08:00
Alessandro Sangiorgi
9e03df983e
fix(rag-example): add provider_id to avoid llama_stack_client 400 error (#1114)
# What does this PR do?
Add provider_id to avoid errors using the rag example with
llama_stack_client

`llama_stack_client.BadRequestError: Error code: 400 - {'detail':
'Invalid value: No provider specified and multiple providers available.
Please specify a provider_id.'}`

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Co-authored-by: Xi Yan <yanxi970830@gmail.com>
2025-02-19 15:37:25 -08:00
Ashwin Bharambe
034ece0011 Ensure that deprecations for fields follow through to OpenAPI 2025-02-19 13:54:04 -08:00
Ashwin Bharambe
31a5ba5268 Add title to the json schemas 2025-02-19 13:26:39 -08:00
Ashwin Bharambe
5e7904ef6c Kill the older strong_typing code 2025-02-19 12:24:21 -08:00
ehhuang
8de7cf103b
feat: support tool_choice = {required, none, <function>} (#1059)
Summary:

titled


Test Plan:

added tests and

LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/
--safety-shield meta-llama/Llama-Guard-3-8B
2025-02-18 23:25:15 -05:00
Xi Yan
8585b95a28 rename 2025-02-18 16:02:44 -08:00
Reid
4e76d312fa
fix: modify the model id title for model list (#1095)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

Re-check and based on the doc, the download model id, actually is model
descriptor(also without `meta-llama/`).


https://llama-stack.readthedocs.io/en/latest/references/llama_cli_reference/index.html
```
$ llama download --source huggingface --model-id  Llama-Guard-3-1B:int4 --hf-token xxx  # model descriptor
Fetching 8 files:   0%|                                                                                                                   | 0/8 [00:00<?, ?it/s]
LICENSE.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.71k/7.71k [00:00<00:00, 10.5MB/s]

$ llama download --source huggingface --model-id  Llama-Guard-3-1B-INT4 --hf-token xxxx  # hugging face repo without meta-llama/
usage: llama download [-h] [--source {meta,huggingface}] [--model-id MODEL_ID] [--hf-token HF_TOKEN] [--meta-url META_URL] [--max-parallel MAX_PARALLEL]
                      [--ignore-patterns IGNORE_PATTERNS] [--manifest-file MANIFEST_FILE]
llama download: error: Model Llama-Guard-3-1B-INT4 not found <<<<---


$ llama download --source meta --model-id Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8
usage: llama download [-h] [--source {meta,huggingface}] [--model-id MODEL_ID] [--hf-token HF_TOKEN] [--meta-url META_URL] [--max-parallel MAX_PARALLEL]
                      [--ignore-patterns IGNORE_PATTERNS] [--manifest-file MANIFEST_FILE]
llama download: error: Model Llama-3.2-3B-Instruct-SpinQuant_INT4_EO8 not found

$ llama download --source meta --model-id Llama3.2-3B-Instruct:int4-spinquant-eo8
Please provide the signed URL for model Llama3.2-3B-Instruct:int4-spinquant-eo8 you received via email after visiting https://www.llama.com/llama-downloads/ (e.g., https://llama3-1.llamameta.net/*?Policy...): ^CTraceback (most recent call last):

$ llama download --source meta --model-id meta-llama/Llama3.2-3B-Instruct:int4-spinquant-eo8
usage: llama download [-h] [--source {meta,huggingface}] [--model-id MODEL_ID] [--hf-token HF_TOKEN] [--meta-url META_URL]
                      [--max-parallel MAX_PARALLEL] [--ignore-patterns IGNORE_PATTERNS] [--manifest-file MANIFEST_FILE]
llama download: error: Model meta-llama/Llama3.2-3B-Instruct:int4-spinquant-eo8 not found
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-18 10:26:41 -08:00
Reid
89d37687dd
chore: remove --no-list-templates option (#1121)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

From the code and the usage, seems cannot see that need to use
`--no-list-templates` to handle, and also make the user confused from
the help text, so try to remove it.
```
$ llama stack build --no-list-templates
> Enter a name for your Llama Stack (e.g. my-local-stack):

$ llama stack build
> Enter a name for your Llama Stack (e.g. my-local-stack):

before:
$ llama stack build --help
  --list-templates, --no-list-templates
                        Show the available templates for building a Llama Stack distribution (default: False)

after:
  --list-templates      Show the available templates for building a Llama Stack distribution
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-18 10:13:46 -08:00
Yuan Tang
6b1773d530
docs: Fix incorrect link and command for generating API reference (#1124) 2025-02-15 22:05:23 -05:00
Hardik Shah
df864ee575
Update index.md to refer to v0.1.3 2025-02-14 14:29:17 -08:00
Yuan Tang
64328bfe62
fix: enable_session_persistence in AgentConfig should be optional (#1012)
# What does this PR do?
This issue was discovered in
https://github.com/meta-llama/llama-stack/pull/1009#discussion_r1947036518.

## Test Plan

This field is no longer required after the change.

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-14 09:19:53 -08:00
Ashwin Bharambe
314ee09ae3
chore: move all Llama Stack types from llama-models to llama-stack (#1098)
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.

This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279

## Test Plan

Ensure all `llama` CLI `model` sub-commands work:

```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```

Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```

Create a fresh venv `uv venv && source .venv/bin/activate` and run
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` <-- the server runs

Also checked that the OpenAPI generator can run and there is no change
in the generated files as a result.

```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
2025-02-14 09:10:59 -08:00
Xi Yan
da53dc3f5f
fix: openapi for eval-task (#1085)
# What does this PR do?
- as title

## Test Plan
- the deprecated endpoint need to obey what it was before

[//]: # (## Documentation)
2025-02-13 17:10:45 -08:00
Xi Yan
2a8e199e10 fix notebook 2025-02-13 16:52:46 -08:00
Xi Yan
8b655e3cd2
fix!: update eval-tasks -> benchmarks (#1032)
# What does this PR do?

- Update `/eval-tasks` to `/benchmarks`
- ⚠️ Remove differentiation between `app` v.s. `benchmark` eval task
config. Now we only have `BenchmarkConfig`. The overloaded `benchmark`
is confusing and do not add any value. Backward compatibility is being
kept as the "type" is not being used anywhere.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- This change is backward compatible 
- Run notebook test with

```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

<img width="846" alt="image"
src="https://github.com/user-attachments/assets/d2fc06a7-593a-444f-bc1f-10ab9b0c843d"
/>



[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Co-authored-by: Ben Browning <ben324@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu <reid201711@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-13 16:40:58 -08:00
Xi Yan
2fa9e3c941
fix: make backslash work in GET /models/{model_id:path} (#1068) 2025-02-13 08:46:43 -08:00
Charlie Doern
025f615868
feat: add support for running in a venv (#1018)
# What does this PR do?

add --image-type to `llama stack run`. Which takes conda, container or
venv also add start_venv.sh which start the stack using a venv

resolves #1007

## Test Plan

running locally:

`llama stack build --template ollama --image-type venv`
`llama stack run --image-type venv
~/.llama/distributions/ollama/ollama-run.yaml`
...
```
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
Using run configuration: /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml
+ python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321
Using config file: /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml
Run configuration:
apis:
- agents
- datasetio
...
```

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-02-12 11:13:04 -05:00
Dinesh Yeduguru
d8a20e034b
feat: make telemetry attributes be dict[str,PrimitiveType] (#1055)
# What does this PR do?
Make attributes in telemetry be only primitive types and avoid arbitrary
nesting.

## Test Plan
```
 LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py  -k "test_builtin_tool_web_search"
# Verified that attributes still show up correclty in jaeger
```
2025-02-11 15:10:17 -08:00
Dinesh Yeduguru
ab7f802698
feat: add MetricResponseMixin to chat completion response types (#1050)
# What does this PR do?
Defines a MetricResponseMixin which can be inherited by any response
class. Adds it to chat completion response types.


This is a short term solution to allow inference API to return metrics
The ideal way to do this is to have a way for all response types to
include metrics
and all metric events logged to the telemetry API to be included with
the response
To do this, we will need to augment all response types with a metrics
field.
We have hit a blocker from stainless SDK that prevents us from doing
this.
The blocker is that if we were to augment the response types that have a
data field
in them like so
class ListModelsResponse(BaseModel):
    metrics: Optional[List[MetricEvent]] = None
    data: List[Models]
    ...
The client SDK will need to access the data by using a .data field,
which is not
ergonomic. Stainless SDK does support unwrapping the response type, but
it
requires that the response type to only have a single field.

We will need a way in the client SDK to signal that the metrics are
needed
and if they are needed, the client SDK has to return the full response
type
without unwrapping it.

## Test Plan
sh run_openapi_generator.sh ./
sh stainless_sync.sh dineshyv/dev add-metrics-to-resp-v4

LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/fireworks/fireworks-run.yaml"
pytest -v tests/client-sdk/agents/test_agents.py
2025-02-11 14:58:12 -08:00
Ellis Tarn
36d35406a7
fix: a bad newline in ollama docs (#1036)
# What does this PR do?
Catches a bug in the previous codegen which was removing newlines.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
python llama_stack/scripts/distro_codegen.py
```

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-10 14:27:17 -08:00
Ellis Tarn
afca9d92f9
fix: Readthedocs cannot parse comments, resulting in docs bugs (#1033) 2025-02-10 16:35:16 -05:00
Ellis Tarn
ab9516c789
fix: Gaps in doc codegen (#1035)
# What does this PR do?
Catches docs up to source with:
```
python llama_stack/scripts/distro_codegen.py
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Manually checked
```
sphinx-autobuild docs/source build/html
```

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-10 13:24:15 -08:00
Michael Clifford
076213165c
docs: update rag.md example code to prevent errors (#1009) 2025-02-10 09:25:30 -05:00
raghotham
7766e68e92
docs: update index.md for 0.1.2 (#1013)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-07 15:36:20 -08:00