Commit graph

62 commits

Author SHA1 Message Date
Sébastien Han
ac5fd57387
chore: remove nested imports (#2515)
# What does this PR do?

* Given that our API packages use "import *" in `__init.py__` we don't
need to do `from llama_stack.apis.models.models` but simply from
llama_stack.apis.models. The decision to use `import *` is debatable and
should probably be revisited at one point.

* Remove unneeded Ruff F401 rule
* Consolidate Ruff F403 rule in the pyprojectfrom
llama_stack.apis.models.models

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-26 08:01:05 +05:30
Charlie Doern
d12f195f56
feat: drop python 3.10 support (#2469)
# What does this PR do?

dropped python3.10, updated pyproject and dependencies, and also removed
some blocks of code with special handling for enum.StrEnum

Closes #2458

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-06-19 12:07:14 +05:30
Hardik Shah
985d0b156c
feat: Add suffix to openai_completions (#2449)
Some checks failed
Integration Tests / test-matrix (library, 3.10, inspect) (push) Failing after 9s
Integration Tests / test-matrix (http, 3.11, providers) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.10, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.10, post_training) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.10, scoring) (push) Failing after 13s
Integration Tests / test-matrix (library, 3.11, agents) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, inference) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, datasets) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, post_training) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, providers) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, scoring) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, vector_io) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, tool_runtime) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 9s
Test External Providers / test-external-providers (venv) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 14s
Unit Tests / unit-tests (3.10) (push) Failing after 19s
Unit Tests / unit-tests (3.11) (push) Failing after 20s
Unit Tests / unit-tests (3.12) (push) Failing after 18s
Unit Tests / unit-tests (3.13) (push) Failing after 16s
Update ReadTheDocs / update-readthedocs (push) Failing after 8s
Pre-commit / pre-commit (push) Successful in 58s
For code completion apps need "fill in the middle" capabilities. 
Added option of `suffix` to `openai_completion` to enable this. 
Updated ollama provider to showcase the same. 

### Test Plan 
```
pytest -sv --stack-config="inference=ollama"  tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix
```

### OpenAI Sample script
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1")

response = client.completions.create(
    model="qwen2.5-coder:1.5b",
    prompt="The capital of ",
    suffix="is Paris.",
    max_tokens=10,
)

print(response.choices[0].text)
``` 
### Output
```
France is ____.

To answer this question, we 
```
2025-06-13 16:06:06 -07:00
Hardik Shah
b21050935e
feat: New OpenAI compat embeddings API (#2314)
Some checks failed
Integration Tests / test-matrix (http, agents) (push) Failing after 9s
Integration Tests / test-matrix (http, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, inspect) (push) Failing after 9s
Integration Tests / test-matrix (library, post_training) (push) Failing after 15s
Integration Tests / test-matrix (library, providers) (push) Failing after 14s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 43s
Integration Tests / test-matrix (library, scoring) (push) Failing after 8s
Integration Tests / test-matrix (http, inference) (push) Failing after 46s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 8s
Integration Tests / test-matrix (library, agents) (push) Failing after 44s
Integration Tests / test-matrix (http, inspect) (push) Failing after 47s
Integration Tests / test-matrix (http, providers) (push) Failing after 45s
Integration Tests / test-matrix (library, datasets) (push) Failing after 45s
Integration Tests / test-matrix (http, post_training) (push) Failing after 46s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 47s
Integration Tests / test-matrix (http, datasets) (push) Failing after 49s
Test External Providers / test-external-providers (venv) (push) Failing after 6s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Unit Tests / unit-tests (3.10) (push) Failing after 8s
Unit Tests / unit-tests (3.11) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 7s
Pre-commit / pre-commit (push) Successful in 1m12s
# What does this PR do?
Adds a new endpoint that is compatible with OpenAI for embeddings api. 
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer. 


## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
2025-05-31 22:11:47 -07:00
ehhuang
5844c2da68
feat: add list responses API (#2233)
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.

## Test Plan
Added integration/unit tests.
2025-05-23 13:16:48 -07:00
ehhuang
047303e339
feat: introduce APIs for retrieving chat completion requests (#2145)
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.

Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.

Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.

The APIs are OpenAI-compliant, with an additional `input_messages`
field.


## Test Plan
This PR just adds the API and marks them provided_by_stack. S
tart stack server -> doesn't crash
2025-05-18 21:43:19 -07:00
Sébastien Han
bb5fca9521
chore: more API validators (#2165)
# What does this PR do?

We added:

* make sure docstrings are present with 'params' and 'returns'
* fail if someone sets 'returns: None'
* fix the failing APIs

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-15 11:22:51 -07:00
Sébastien Han
1a529705da
chore: more mypy fixes (#2029)
# What does this PR do?

Mainly tried to cover the entire llama_stack/apis directory, we only
have one left. Some excludes were just noop.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-06 09:52:31 -07:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Ben Browning
5b8e75b392
fix: OpenAI spec cleanup for assistant requests (#1963)
# What does this PR do?

Some of our multi-turn verification tests were failing because I had
accidentally marked content as a required field in the OpenAI chat
completion request assistant messages, but it's actually optional. It is
required for messages from other roles, but assistant is explicitly
allowed to be optional.

Similarly, the assistant message tool_calls field should default to None
instead of an empty list.

These two changes get the openai-llama-stack verification test back to
100% passing, just like it passes 100% when not behind Llama Stack. They
also increase the pass rate of some of the other providers in the
verification test, but don't get them to 100%.

## Test Plan

I started a Llama Stack server setup to run all the verification tests
(requires OPENAI_API_KEY env variable)

```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```

Then, I manually ran the verification tests to see which were failing,
fix them, and ran them again after these changes to ensure they were all
passing.

```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=openai-llama-stack
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-17 06:56:10 -07:00
Ben Browning
7641a5cd0b
fix: 100% OpenAI API verification for together and fireworks (#1946)
# What does this PR do?

TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.

This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.

As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.

The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.

With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.

And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.

## Test Plan

### OpenAI API Verification Tests

I ran the OpenAI API verification tests as below and 100% of the tests
passed.

First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.

First, ensure you have the necessary API key environment variables set:

```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```

Then, run a Llama Stack server that serves up all these providers:

```
llama stack run \
      --image-type venv \
      tests/verifications/openai-api-verification-run.yaml
```

Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.

```
python tests/verifications/generate_report.py \
      --run-tests \
      --provider \
        together \
        fireworks \
        groq \
        openai \
        together-llama-stack \
        fireworks-llama-stack \
        groq-llama-stack \
        openai-llama-stack
```

You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.

### OpenAI Completion Integration Tests with vLLM:

I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### OpenAI Completion Integration Tests with ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

### OpenAI Completion Integration Tests with together.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```

### OpenAI Completion Integration Tests with fireworks.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-14 08:56:29 -07:00
Ashwin Bharambe
8b4158169f fix: dont check protocol compliance for experimental methods 2025-04-12 16:26:32 -07:00
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Ben Browning
2b2db5fbda
feat: OpenAI-Compatible models, completions, chat/completions (#1894)
# What does this PR do?

This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.

The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)

The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.

This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.

## Test Plan

### vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```



## Documentation

Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-11 13:14:17 -07:00
Ashwin Bharambe
530d4bdfe1
refactor: move all llama code to models/llama out of meta reference (#1887)
# What does this PR do?

Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.

Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.

## Test Plan

```
LLAMA_MODELS_DEBUG=1 \
  with-proxy llama stack run meta-reference-gpu \
  --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
   --env INFERENCE_CHECKPOINT_DIR=<DIR> \
   --env MODEL_PARALLEL_SIZE=4 \
   --env QUANTIZATION_TYPE=fp8_mixed
```

Start a server with and without quantization. Point integration tests to
it using:

```
pytest -s -v  tests/integration/inference/test_text_inference.py \
   --stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-04-07 15:03:58 -07:00
Ihar Hrachyshka
66d6c2580e
chore: more mypy checks (ollama, vllm, ...) (#1777)
# What does this PR do?

- **chore: mypy for strong_typing**
- **chore: mypy for remote::vllm**
- **chore: mypy for remote::ollama**
- **chore: mypy for providers.datatype**

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-01 17:12:39 +02:00
Ihar Hrachyshka
41bd350539
chore: Don't set type variables from register_schema() (#1713)
# What does this PR do?

Don't set type variables from register_schema().

`mypy` is not happy about it since type variables are calculated at
runtime and hence the typing hints are not available during static
analysis.

Good news is there is no good reason to set the variables from the
return type.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-19 20:29:00 -07:00
ehhuang
a505bf45a3
feat(api): remove tool_name from ToolResponseMessage (#1599)
Summary:
This is not used anywhere.

closes #1421 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct --record-responses
2025-03-12 19:41:48 -07:00
Dinesh Yeduguru
58d08d100e
feat: Add back inference metrics and preserve context variables across asyncio boundary (#1552)
# What does this PR do?
This PR adds back the changes in #1300  which were reverted in  #1476 .

It also adds logic to preserve context variables across asyncio
boundary. this is needed with the library client since the async
generator logic yields control to code outside the event loop, and on
resuming, does not have the same context as before and this requires
preserving the context vars.

address #1477 
## Test Plan


```
 curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}' | jq .

{
  "metrics": [
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549084Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 10,
      "unit": "tokens"
    },
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549449Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 369,
      "unit": "tokens"
    },
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549457Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 379,
      "unit": "tokens"
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. **Rural areas:** Some humans live in rural areas, such as villages, farms, and countryside.\n5. **Islands:** Humans inhabit many islands around the world, including those in the Pacific, Indian, and Atlantic Oceans.\n6. **Mountains and highlands:** Humans live in mountainous regions, such as the Himalayas, the Andes, and the Rocky Mountains.\n7. **Deserts:** Some humans live in desert regions, such as the Sahara, the Mojave, and the Atacama.\n8. **Coastal areas:** Many humans live in coastal areas, such as beaches, ports, and coastal cities.\n9. **Underwater habitats:** A few humans live in underwater habitats, such as research stations and submarines.\n10. **Space:** A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nOverall, humans can be found living in almost every environment on Earth, from the frozen tundra to the hottest deserts, and from the highest mountains to the deepest oceans.",
    "stop_reason": "end_of_turn",
    "tool_calls": []
  },
  "logprobs": null
}

```

Orignal repro no longer showing any error:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```

client logs:
https://gist.github.com/dineshyv/047c7e87b18a5792aa660e311ea53166
server logs:
https://gist.github.com/dineshyv/97a2174099619e9916c7c490be26e559
2025-03-12 12:01:03 -07:00
Dinesh Yeduguru
60e7f3d705
fix: Revert "feat: record token usage for inference API (#1300)" (#1476)
This reverts commit b8535417e0.

Test plan:
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/together/together-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
2025-03-07 10:16:47 -08:00
Sébastien Han
803bf0e029
fix: solve ruff B008 warnings (#1444)
# What does this PR do?

The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling SamplingParams() directly in function argument defaults.
Instead, it either uses Field(default_factory=SamplingParams) for
Pydantic models or sets the default to None and instantiates
SamplingParams inside the function body when the argument is None.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-06 16:48:35 -08:00
Dinesh Yeduguru
b8535417e0
feat: record token usage for inference API (#1300)
# What does this PR do?
Inference router computes the token usage related metrics for all
providers and returns the metrics as part of response and also logs to
telemetry.

## Test Plan
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/fireworks/fireworks-run.yaml

```
curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}' | jq .
{
  "metrics": [
    {
      "trace_id": "yjv1tf0jS1evOyPm",
      "span_id": "WqYKvg0_",
      "timestamp": "2025-02-27T18:55:10.770903Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 10,
      "unit": "tokens"
    },
    {
      "trace_id": "yjv1tf0jS1evOyPm",
      "span_id": "WqYKvg0_",
      "timestamp": "2025-02-27T18:55:10.770916Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 411,
      "unit": "tokens"
    },
    {
      "trace_id": "yjv1tf0jS1evOyPm",
      "span_id": "WqYKvg0_",
      "timestamp": "2025-02-27T18:55:10.770919Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 421,
      "unit": "tokens"
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live in various parts of the world, inhabiting almost every continent, country, and region. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica (research stations only)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Regions:** Humans live in diverse regions, including:\n\t* Deserts (e.g., Sahara, Mojave)\n\t* Forests (e.g., Amazon, Congo)\n\t* Grasslands (e.g., Prairies, Steppes)\n\t* Mountains (e.g., Himalayas, Andes)\n\t* Oceans (e.g., coastal areas, islands)\n\t* Tundras (e.g., Arctic, sub-Arctic)\n4. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near:\n\t* Coastlines\n\t* Rivers\n\t* Lakes\n\t* Mountains\n5. **Rural areas:** Some humans live in rural areas, such as:\n\t* Villages\n\t* Farms\n\t* Countryside\n6. **Islands:** Humans inhabit many islands, including:\n\t* Tropical islands (e.g., Hawaii, Maldives)\n\t* Arctic islands (e.g., Greenland, Iceland)\n\t* Continental islands (e.g., Great Britain, Ireland)\n7. **Extreme environments:** Humans also live in extreme environments, such as:\n\t* High-altitude areas (e.g., Tibet, Andes)\n\t* Low-altitude areas (e.g., Death Valley, Dead Sea)\n\t* Areas with extreme temperatures (e.g., Arctic, Sahara)\n\nOverall, humans have adapted to live in a wide range of environments and ecosystems around the world.",
    "stop_reason": "end_of_turn",
    "tool_calls": []
  },
  "logprobs": null
}
```

```
 LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/inference

======================================================================== short test summary info =========================================================================
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-True] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-False] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json
FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f...
FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f...
========================================================= 4 failed, 16 passed, 23 xfailed, 17 warnings in 44.36s =========================================================
```
2025-03-05 12:41:45 -08:00
ehhuang
25fddccfd8
feat: tool outputs metadata (#1155)
Summary:

Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.

Will need to make a similar change on the client side to support
ClientTool outputting metadata.

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
2025-02-21 13:15:31 -08:00
Ashwin Bharambe
81ce39a607
feat(api): Add options for supporting various embedding models (#1192)
We need to support:
- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932) 

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
2025-02-20 22:27:12 -08:00
Ashwin Bharambe
6f9d622340
fix(api): update embeddings signature so inputs and outputs list align (#1161)
See Issue #922 

The change is slightly backwards incompatible but no callsite (in our
client codebases or stack-apps) every passes a depth-2
`List[List[InterleavedContentItem]]` (which is now disallowed.)

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

Also ran `tests/client-sdk/inference/test_embeddings.py`
2025-02-20 21:43:13 -08:00
ehhuang
8de7cf103b
feat: support tool_choice = {required, none, <function>} (#1059)
Summary:

titled


Test Plan:

added tests and

LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/
--safety-shield meta-llama/Llama-Guard-3-8B
2025-02-18 23:25:15 -05:00
Ashwin Bharambe
314ee09ae3
chore: move all Llama Stack types from llama-models to llama-stack (#1098)
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.

This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279

## Test Plan

Ensure all `llama` CLI `model` sub-commands work:

```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```

Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```

Create a fresh venv `uv venv && source .venv/bin/activate` and run
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` <-- the server runs

Also checked that the OpenAPI generator can run and there is no change
in the generated files as a result.

```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
2025-02-14 09:10:59 -08:00
Dinesh Yeduguru
ab7f802698
feat: add MetricResponseMixin to chat completion response types (#1050)
# What does this PR do?
Defines a MetricResponseMixin which can be inherited by any response
class. Adds it to chat completion response types.


This is a short term solution to allow inference API to return metrics
The ideal way to do this is to have a way for all response types to
include metrics
and all metric events logged to the telemetry API to be included with
the response
To do this, we will need to augment all response types with a metrics
field.
We have hit a blocker from stainless SDK that prevents us from doing
this.
The blocker is that if we were to augment the response types that have a
data field
in them like so
class ListModelsResponse(BaseModel):
    metrics: Optional[List[MetricEvent]] = None
    data: List[Models]
    ...
The client SDK will need to access the data by using a .data field,
which is not
ergonomic. Stainless SDK does support unwrapping the response type, but
it
requires that the response type to only have a single field.

We will need a way in the client SDK to signal that the metrics are
needed
and if they are needed, the client SDK has to return the full response
type
without unwrapping it.

## Test Plan
sh run_openapi_generator.sh ./
sh stainless_sync.sh dineshyv/dev add-metrics-to-resp-v4

LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/fireworks/fireworks-run.yaml"
pytest -v tests/client-sdk/agents/test_agents.py
2025-02-11 14:58:12 -08:00
Ashwin Bharambe
474c4bdd7a
Make a couple properties optional (#963) 2025-02-04 16:20:24 -08:00
ehhuang
c9ab72fa82
Support sys_prompt behavior in inference (#937)
# What does this PR do?

The current default system prompt for llama3.2 tends to overindex on
tool calling and doesn't work well when the prompt does not require tool
calling.

This PR adds an option to override the default system prompt, and
organizes tool-related configs into a new config object.

- [ ] Addresses issue (#issue)


## Test Plan

python -m unittest
llama_stack.providers.tests.inference.test_prompt_adapter


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/937).
* #938
* __->__ #937
2025-02-03 23:35:16 -08:00
Yuan Tang
34ab7a3b6c
Fix precommit check after moving to ruff (#927)
Lint check in main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file as well as fixing and ignoring some
additional checks.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-02 06:46:45 -08:00
Ashwin Bharambe
0d96070af9
Update OpenAPI generator to add param and field documentation (#896)
We desperately need to document our APIs. This is the basic requirement
of having a Spec :)

This PR updates the OpenAPI generator so documentation for request
parameters and object fields can be properly added to the OpenAPI specs.
From there, this should get picked by Stainless, etc.

## Test Plan:

Updated client-sdk (See
https://github.com/meta-llama/llama-stack-client-python/pull/104) and
then ran:

```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=../../llama_stack/templates/fireworks/run.yaml pytest -s -v inference/test_inference.py agents/test_agents.py
```
2025-01-29 10:04:30 -08:00
Ashwin Bharambe
e5936a8df8
Update discriminator to have the correct mapping (#881)
See
https://swagger.io/docs/specification/v3_0/data-models/inheritance-and-polymorphism/#discriminator

When specifying discriminators, mapping must be specified unless the
value of the discriminator is the subtype itself (which in our case is
not.)

The changes in the YAML are self-explanatory.
2025-01-27 09:18:13 -08:00
Dinesh Yeduguru
7fb2c1c48d
More idiomatic REST API (#765)
# What does this PR do?

This PR changes our API to follow more idiomatic REST API approaches of
having paths being resources and methods indicating the action being
performed.

Changes made to generator:
1) removed the prefix check of "get" as its not required and is actually
needed for other method types too
2) removed _ check on path since variables can have "_"



## Test Plan

LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v
tests/client-sdk/agents/test_agents.py
2025-01-15 13:20:09 -08:00
Ashwin Bharambe
aced2ce07e introduce and use a generic ContentDelta 2025-01-13 23:16:53 -08:00
Ashwin Bharambe
ee4e04804f
Rename ipython to tool (#756)
See https://github.com/meta-llama/llama-models/pull/261 for the
corresponding PR in llama-models.

Once these PRs land, you need to work `main` from llama-models (vs. from
pypi)
2025-01-13 19:11:51 -08:00
Dinesh Yeduguru
8af6951106
remove conflicting default for tool prompt format in chat completion (#742)
# What does this PR do?
We are setting a default value of json for tool prompt format, which
conflicts with llama 3.2/3.3 models since they use python list. This PR
changes the defaults to None and in the code, we infer default based on
the model.

Addresses: #695 

Tests:
❯ LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v
tests/client-sdk/inference/test_inference.py -k
"test_text_chat_completion"

 pytest llama_stack/providers/tests/inference/test_prompt_adapter.py
2025-01-10 10:41:53 -08:00
Xi Yan
3c72c034e6
[remove import *] clean up import *'s (#689)
# What does this PR do?

- as title, cleaning up `import *`'s
- upgrade tests to make them more robust to bad model outputs
- remove import *'s in llama_stack/apis/* (skip __init__ modules)
<img width="465" alt="image"
src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2"
/>

- run `sh run_openapi_generator.sh`, no types gets affected

## Test Plan

### Providers Tests

**agents**
```
pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8
```

**inference**
```bash
# meta-reference
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

# together
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py 
```

**safety**
```
pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B
```

**memory**
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384
```

**scoring**
```
pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
```


**datasetio**
```
pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py
pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py
```


**eval**
```
pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py
```

### Client-SDK Tests
```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk
```

### llama-stack-apps
```
PORT=5000
LOCALHOST=localhost

python -m examples.agents.hello $LOCALHOST $PORT
python -m examples.agents.inflation $LOCALHOST $PORT
python -m examples.agents.podcast_transcript $LOCALHOST $PORT
python -m examples.agents.rag_as_attachments $LOCALHOST $PORT
python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT
python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT
python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT

# Vision model
python -m examples.interior_design_assistant.app
python -m examples.agent_store.app $LOCALHOST $PORT
```

### CLI
```
which llama
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
llama model list
llama stack list-apis
llama stack list-providers inference

llama stack build --template ollama --image-type conda
```

### Distributions Tests
**ollama**
```
llama stack build --template ollama --image-type conda
ollama run llama3.2:1b-instruct-fp16
llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
```

**fireworks**
```
llama stack build --template fireworks --image-type conda
llama stack run ./llama_stack/templates/fireworks/run.yaml
```

**together**
```
llama stack build --template together --image-type conda
llama stack run ./llama_stack/templates/together/run.yaml
```

**tgi**
```
llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-27 15:45:44 -08:00
Ashwin Bharambe
12cbed1617 Register Message and ResponseFormat 2024-12-18 10:32:25 -08:00
Ashwin Bharambe
8de8eb03c8
Update the "InterleavedTextMedia" type (#635)
## What does this PR do?

This is a long-pending change and particularly important to get done
now.

Specifically:
- we cannot "localize" (aka download) any URLs from media attachments
anywhere near our modeling code. it must be done within llama-stack.
- `PIL.Image` is infesting all our APIs via `ImageMedia ->
InterleavedTextMedia` and that cannot be right at all. Anything in the
API surface must be "naturally serializable". We need a standard `{
type: "image", image_url: "<...>" }` which is more extensible
- `UserMessage`, `SystemMessage`, etc. are moved completely to
llama-stack from the llama-models repository.

See https://github.com/meta-llama/llama-models/pull/244 for the
corresponding PR in llama-models.

## Test Plan

```bash
cd llama_stack/providers/tests

pytest -s -v -k "fireworks or ollama or together" inference/test_vision_inference.py
pytest -s -v -k "(fireworks or ollama or together) and llama_3b" inference/test_text_inference.py
pytest -s -v -k chroma memory/test_memory.py \
  --env EMBEDDING_DIMENSION=384 --env CHROMA_DB_PATH=/tmp/foobar

pytest -s -v -k fireworks agents/test_agents.py  \
   --safety-shield=meta-llama/Llama-Guard-3-8B \
   --inference-model=meta-llama/Llama-3.1-8B-Instruct
```

Updated the client sdk (see PR ...), installed the SDK in the same
environment and then ran the SDK tests:

```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=together pytest -s -v agents/test_agents.py
LLAMA_STACK_CONFIG=ollama pytest -s -v memory/test_memory.py

# this one needed a bit of hacking in the run.yaml to ensure I could register the vision model correctly
INFERENCE_MODEL=llama3.2-vision:latest LLAMA_STACK_CONFIG=ollama pytest -s -v inference/test_inference.py
```
2024-12-17 11:18:31 -08:00
Xi Yan
78e2bfbe7a
[tests] add client-sdk pytests & delete client.py (#638)
# What does this PR do?

**Why**
- Clean up examples which we will not maintain; reduce the surface area
to the minimal showcases

**What**
- Delete `client.py` in /apis/*
- Move all scripts to unit tests
  - SDK sync in the future will just require running pytests

**Side notes**
- `bwrap` not available on Mac so code_interpreter will not work

## Test Plan

```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk
```
<img width="725" alt="image"
src="https://github.com/user-attachments/assets/36bfe537-628d-43c3-8479-dcfcfe2e4035"
/>


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-16 12:04:56 -08:00
Dinesh Yeduguru
c543bc0745
Console span processor improvements (#577)
Makes the console span processor output spans in less prominent way and
highlight the logs based on severity.


![Screenshot 2024-12-06 at 11 26
46 AM](https://github.com/user-attachments/assets/c3a1b051-85db-4b71-b7a5-7bab5a26f072)
2024-12-06 11:46:16 -08:00
Dinesh Yeduguru
fcd6449519
Telemetry API redesign (#525)
# What does this PR do?
Change the Telemetry API to be able to support different use cases like
returning traces for the UI and ability to export for Evals.
Other changes:
* Add a new trace_protocol decorator to decorate all our API methods so
that any call to them will automatically get traced across all impls.
* There is some issue with the decorator pattern of span creation when
using async generators, where there are multiple yields with in the same
context. I think its much more explicit by using the explicit context
manager pattern using with. I moved the span creations in agent instance
to be using with
* Inject session id at the turn level, which should quickly give us all
traces across turns for a given session

Addresses #509

## Test Plan
```
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000


 curl -X POST 'http://localhost:5000/alpha/telemetry/query-traces' \
-H 'Content-Type: application/json' \
-d '{
  "attribute_filters": [
    {
      "key": "session_id",
      "op": "eq",
      "value": "dd667b87-ca4b-4d30-9265-5a0de318fc65" }],
  "limit": 100,
  "offset": 0,
  "order_by": ["start_time"]
}' | jq .
[
  {
    "trace_id": "6902f54b83b4b48be18a6f422b13e16f",
    "root_span_id": "5f37b85543afc15a",
    "start_time": "2024-12-04T08:08:30.501587",
    "end_time": "2024-12-04T08:08:36.026463"
  },
  {
    "trace_id": "92227dac84c0615ed741be393813fb5f",
    "root_span_id": "af7c5bb46665c2c8",
    "start_time": "2024-12-04T08:08:36.031170",
    "end_time": "2024-12-04T08:08:41.693301"
  },
  {
    "trace_id": "7d578a6edac62f204ab479fba82f77b6",
    "root_span_id": "1d935e3362676896",
    "start_time": "2024-12-04T08:08:41.695204",
    "end_time": "2024-12-04T08:08:47.228016"
  },
  {
    "trace_id": "dbd767d76991bc816f9f078907dc9ff2",
    "root_span_id": "f5a7ee76683b9602",
    "start_time": "2024-12-04T08:08:47.234578",
    "end_time": "2024-12-04T08:08:53.189412"
  }
]


curl -X POST 'http://localhost:5000/alpha/telemetry/get-span-tree' \
-H 'Content-Type: application/json' \
-d '{ "span_id" : "6cceb4b48a156913", "max_depth": 2, "attributes_to_return": ["input"] }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   875  100   790  100    85  18462   1986 --:--:-- --:--:-- --:--:-- 20833
{
  "span_id": "6cceb4b48a156913",
  "trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
  "parent_span_id": "892a66d726c7f990",
  "name": "retrieve_rag_context",
  "start_time": "2024-12-04T09:28:21.781995",
  "end_time": "2024-12-04T09:28:21.913352",
  "attributes": {
    "input": [
      "{\"role\":\"system\",\"content\":\"You are a helpful assistant\"}",
      "{\"role\":\"user\",\"content\":\"What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.\",\"context\":null}"
    ]
  },
  "children": [
    {
      "span_id": "1a2df181854064a8",
      "trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
      "parent_span_id": "6cceb4b48a156913",
      "name": "MemoryRouter.query_documents",
      "start_time": "2024-12-04T09:28:21.787620",
      "end_time": "2024-12-04T09:28:21.906512",
      "attributes": {
        "input": null
      },
      "children": [],
      "status": "ok"
    }
  ],
  "status": "ok"
}

```

<img width="1677" alt="Screenshot 2024-12-04 at 9 42 56 AM"
src="https://github.com/user-attachments/assets/4d3cea93-05ce-415a-93d9-4b1628631bf8">
2024-12-04 11:22:45 -08:00
Ashwin Bharambe
0dc7f5fa89
Add version to REST API url (#478)
# What does this PR do? 

Adds a `/alpha/` prefix to all the REST API urls.

Also makes them all use hyphens instead of underscores as is more
standard practice.

(This is based on feedback from our partners.)

## Test Plan 

The Stack itself does not need updating. However, client SDKs and
documentation will need to be updated.
2024-11-18 22:44:14 -08:00
Dinesh Yeduguru
fdff24e77a
Inference to use provider resource id to register and validate (#428)
This PR changes the way model id gets translated to the final model name
that gets passed through the provider.
Major changes include:
1) Providers are responsible for registering an object and as part of
the registration returning the object with the correct provider specific
name of the model provider_resource_id
2) To help with the common look ups different names a new ModelLookup
class is created.



Tested all inference providers including together, fireworks, vllm,
ollama, meta reference and bedrock
2024-11-12 20:02:00 -08:00
Dinesh Yeduguru
ec644d3418
migrate model to Resource and new registration signature (#410)
* resource oriented object design for models

* add back llama_model field

* working tests

* register singature fix

* address feedback

---------

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-08 16:12:57 -08:00
Ashwin Bharambe
37b330b4ef
add dynamic clients for all APIs (#348)
* add dynamic clients for all APIs

* fix openapi generator

* inference + memory + agents tests now pass with "remote" providers

* Add docstring which fixes openapi generator :/
2024-10-31 14:46:25 -07:00
Ashwin Bharambe
eccd7dc4a9 Avoid warnings from pydantic for overriding schema
Also fix structured output in completions
2024-10-28 21:39:48 -07:00
Ashwin Bharambe
161aef0aae Small updates to quantization config 2024-10-24 12:08:56 -07:00
Ashwin Bharambe
7afe51c84d
New quantized models (#301) 2024-10-24 08:38:56 -07:00