Commit graph

62 commits

Author SHA1 Message Date
Hardik Shah
985d0b156c
feat: Add suffix to openai_completions (#2449)
Some checks failed
Integration Tests / test-matrix (library, 3.10, inspect) (push) Failing after 9s
Integration Tests / test-matrix (http, 3.11, providers) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.10, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.10, post_training) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.10, scoring) (push) Failing after 13s
Integration Tests / test-matrix (library, 3.11, agents) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, inference) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, datasets) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, post_training) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, providers) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, scoring) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, vector_io) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, tool_runtime) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 9s
Test External Providers / test-external-providers (venv) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 14s
Unit Tests / unit-tests (3.10) (push) Failing after 19s
Unit Tests / unit-tests (3.11) (push) Failing after 20s
Unit Tests / unit-tests (3.12) (push) Failing after 18s
Unit Tests / unit-tests (3.13) (push) Failing after 16s
Update ReadTheDocs / update-readthedocs (push) Failing after 8s
Pre-commit / pre-commit (push) Successful in 58s
For code completion apps need "fill in the middle" capabilities. 
Added option of `suffix` to `openai_completion` to enable this. 
Updated ollama provider to showcase the same. 

### Test Plan 
```
pytest -sv --stack-config="inference=ollama"  tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix
```

### OpenAI Sample script
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1")

response = client.completions.create(
    model="qwen2.5-coder:1.5b",
    prompt="The capital of ",
    suffix="is Paris.",
    max_tokens=10,
)

print(response.choices[0].text)
``` 
### Output
```
France is ____.

To answer this question, we 
```
2025-06-13 16:06:06 -07:00
Francisco Arceo
554ada57b0
chore: Add OpenAI compatibility for Ollama embeddings (#2440)
# What does this PR do?
This PR adds OpenAI compatibility for Ollama embeddings. Closes
https://github.com/meta-llama/llama-stack/issues/2428

Summary of changes:
- `llama_stack/providers/remote/inference/ollama/ollama.py`
- Implements the OpenAI embeddings endpoint for Ollama, replacing the
NotImplementedError with a full function that validates the model,
prepares parameters, calls the client, encodes embedding data
(optionally in base64), and returns a correctly structured response.
- Updates import statements to include the new embedding response
utilities.

- `llama_stack/providers/utils/inference/litellm_openai_mixin.py`
- Refactors the embedding data encoding logic to use a new shared
utility (`b64_encode_openai_embeddings_response`) instead of inline
base64 encoding and packing logic.
   - Cleans up imports accordingly.

- `llama_stack/providers/utils/inference/openai_compat.py`
- Adds `b64_encode_openai_embeddings_response` to handle encoding OpenAI
embedding outputs (including base64 support) in a reusable way.
- Adds `prepare_openai_embeddings_params` utility for standardizing
embedding parameter preparation.
   - Updates imports to include the new embedding data class.

- `tests/integration/inference/test_openai_embeddings.py`
- Removes `"remote::ollama"` from the list of providers that skip OpenAI
embeddings tests, since support is now implemented.

## Note

There was one minor issue, which required me to override the
`OpenAIEmbeddingsResponse.model` name with
`self._get_model(model).identifier` name, which is very unsatisfying.

## Test Plan
Unit Tests and integration tests

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-06-13 14:28:51 -04:00
Ashwin Bharambe
3251b44d8a
refactor: unify stream and non-stream impls for responses (#2388)
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 3s
Integration Tests / test-matrix (http, datasets) (push) Failing after 9s
Integration Tests / test-matrix (http, agents) (push) Failing after 10s
Integration Tests / test-matrix (http, inference) (push) Failing after 9s
Integration Tests / test-matrix (http, inspect) (push) Failing after 8s
Integration Tests / test-matrix (http, post_training) (push) Failing after 9s
Integration Tests / test-matrix (http, providers) (push) Failing after 10s
Integration Tests / test-matrix (http, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, agents) (push) Failing after 9s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 10s
Integration Tests / test-matrix (library, datasets) (push) Failing after 10s
Integration Tests / test-matrix (library, inspect) (push) Failing after 9s
Integration Tests / test-matrix (library, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, post_training) (push) Failing after 10s
Integration Tests / test-matrix (library, providers) (push) Failing after 9s
Integration Tests / test-matrix (library, scoring) (push) Failing after 9s
Test External Providers / test-external-providers (venv) (push) Failing after 7s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 11s
Unit Tests / unit-tests (3.11) (push) Failing after 8s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Unit Tests / unit-tests (3.13) (push) Failing after 9s
Unit Tests / unit-tests (3.10) (push) Failing after 30s
Pre-commit / pre-commit (push) Successful in 1m18s
The non-streaming version is just a small layer on top of the streaming
version - just pluck off the final `response.completed` event and return
that as the response!

This PR also includes a couple other changes which I ended up making
while working on it on a flight:
- changes to `ollama` so it does not pull embedding models
unconditionally
- a small fix to library client to make the stream and non-stream cases
a bit more symmetric
2025-06-05 17:48:09 +02:00
Ashwin Bharambe
cba55808ab
feat(distro): add more providers to starter distro, prefix conflicting models (#2362)
The name changes to the verifications file are unfortunate, but maybe we
don't need that @ehhuang ?

Edit: deleted the verifications template now
2025-06-03 12:10:46 -07:00
Ben Browning
e92f571f47
fix: ollama chat completion needs unique ids (#2344)
# What does this PR do?

The chat completion ids generated by Ollama are not unique enough to use
with stored chat completions as they rely on only 3 numbers of
randomness to give unique values - ie `chatcmpl-373`. This causes
frequent collisions in id values of chat completions in Ollama, which
creates issues in our SQL storage of chat completions by id where it
expects ids to actually be unique.

So, this adjusts Ollama responses to use uuids as unique ids. This does
mean we're replacing the ids generated natively by Ollama. If we don't
wish to do this, we'll either need to relax the unique constraint on our
chat completions id field in the inference storage or convince Ollama
upstream to use something closer to uuid values here.

Closes #2315

## Test Plan

I tested by running the openai completion / chat completion integration
tests in a loop. Without this change, I regularly get unique id
collisions. With this change, I do not. We sometimes see flakes from
these unique id collisions in our CI tests, and this will resolve those.

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml

while true; do; \
  INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
  pytest -s -v \
    tests/integration/inference/test_openai_completion.py \
    --stack-config=http://localhost:8321 \
    --text-model="meta-llama/Llama-3.2-3B-Instruct"; \
done
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-06-02 17:43:20 -07:00
Hardik Shah
b21050935e
feat: New OpenAI compat embeddings API (#2314)
Some checks failed
Integration Tests / test-matrix (http, agents) (push) Failing after 9s
Integration Tests / test-matrix (http, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, inspect) (push) Failing after 9s
Integration Tests / test-matrix (library, post_training) (push) Failing after 15s
Integration Tests / test-matrix (library, providers) (push) Failing after 14s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 43s
Integration Tests / test-matrix (library, scoring) (push) Failing after 8s
Integration Tests / test-matrix (http, inference) (push) Failing after 46s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 8s
Integration Tests / test-matrix (library, agents) (push) Failing after 44s
Integration Tests / test-matrix (http, inspect) (push) Failing after 47s
Integration Tests / test-matrix (http, providers) (push) Failing after 45s
Integration Tests / test-matrix (library, datasets) (push) Failing after 45s
Integration Tests / test-matrix (http, post_training) (push) Failing after 46s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 47s
Integration Tests / test-matrix (http, datasets) (push) Failing after 49s
Test External Providers / test-external-providers (venv) (push) Failing after 6s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Unit Tests / unit-tests (3.10) (push) Failing after 8s
Unit Tests / unit-tests (3.11) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 7s
Pre-commit / pre-commit (push) Successful in 1m12s
# What does this PR do?
Adds a new endpoint that is compatible with OpenAI for embeddings api. 
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer. 


## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
2025-05-31 22:11:47 -07:00
ehhuang
047303e339
feat: introduce APIs for retrieving chat completion requests (#2145)
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.

Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.

Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.

The APIs are OpenAI-compliant, with an additional `input_messages`
field.


## Test Plan
This PR just adds the API and marks them provided_by_stack. S
tart stack server -> doesn't crash
2025-05-18 21:43:19 -07:00
Ben Browning
136e6b3cf7
fix: ollama openai completion and chat completion params (#2125)
# What does this PR do?

The ollama provider was using an older variant of the code to convert
incoming parameters from the OpenAI API completions and chat completion
endpoints into requests that get sent to the backend provider over its
own OpenAI client. This updates it to use the common
`prepare_openai_completion_params` method used elsewhere, which takes
care of removing stray `None` values even for nested structures.

Without this, some other parameters, even if they have values of `None`,
make their way to ollama and actually influence its inference output as
opposed to when those parameters are not sent at all.

## Test Plan

This passes tests/integration/inference/test_openai_completion.py and
fixes the issue found in #2098, which was tested via manual curl
requests crafted a particular way.

Closes #2098

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-05-12 10:57:53 -07:00
Ben Browning
40e71758d9
fix: inference providers still using tools with tool_choice="none" (#2048)
# What does this PR do?

In our OpenAI API verification tests, some providers were still calling
tools even when `tool_choice="none"` was passed in the chat completion
requests. Because they aren't all respecting `tool_choice` properly,
this adjusts our routing implementation to remove the `tools` and
`tool_choice` from the request if `tool_choice="none"` is passed in so
that it does not attempt to call any of those tools. Adjusting this in
the router fixes this across all providers.

This also cleans up the non-streaming together.ai responses for tools,
ensuring it returns `None` instead of an empty list when there are no
tool calls, to exactly match the OpenAI API responses in that case.

## Test Plan

I observed existing failures in our OpenAI API verification suite - see

https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#together-llama-stack
for the failing `test_chat_*_tool_choice_none` tests. All streaming and
non-streaming variants were failing across all 3 tested models.

After this change, all of those 6 failing tests are now passing with no
regression in the other tests.

I verified this via:

```
llama stack run --image-type venv \
  tests/verifications/openai-api-verification-run.yaml
```

```
python -m pytest -s -v \
  'tests/verifications/openai_api/test_chat_completion.py' \
  --provider=together-llama-stack
```

The entire verification suite is not 100% on together.ai yet, but it's
getting closer.

This also increased the pass rate for fireworks.ai, and did not regress
the groq or openai tests at all.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-05-07 14:34:47 +02:00
Sébastien Han
1a529705da
chore: more mypy fixes (#2029)
# What does this PR do?

Mainly tried to cover the entire llama_stack/apis directory, we only
have one left. Some excludes were just noop.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-06 09:52:31 -07:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Matthew Farrellee
88a796ca5a
fix: allow use of models registered at runtime (#1980)
# What does this PR do?

fix a bug where models registered at runtime could not be used.

```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct

$ curl http://localhost:8321/v1/openai/v1/chat/completions \                                                        
-H "Content-Type: application/json" \
-d '{
  "model": "test-model",
  "messages": [{"role": "user", "content": "What is the weather like in Boston today?"}]
}'

=(client)=> {"detail":"Internal server error: An unexpected error occurred."}
=(server)=> TypeError: Missing required arguments; Expected either ('messages' and 'model') or ('messages', 'model' and 'stream') arguments to be given
```

*root cause:* test-model is not added to ModelRegistryHelper's
alias_to_provider_id_map.

as part of the fix, this adds tests for ModelRegistryHelper and defines
its expected behavior.

user visible behavior changes -

| action | existing behavior | new behavior |
| -- | -- | -- |
| double register | success (but no change) | error |
| register unknown | success (fail when used) | error |

existing behavior for register unknown model and double register -
```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct-unknown
Successfully registered model test-model

$ llama-stack-client models list | grep test-model
│ llm │ test-model                               │ meta/llama-3.1-70b-instruct-unknown │     │ nv… │

$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct       
Successfully registered model test-model

$ llama-stack-client models list | grep test-model
│ llm │ test-model                               │ meta/llama-3.1-70b-instruct-unknown │     │ nv… │
```

new behavior for register unknown -
```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct-unknown
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Failed to register model                                                                         │
│                                                                                                  │
│ Error Type: BadRequestError                                                                      │
│ Details: Error code: 400 - {'detail': "Invalid value: Model id                                   │
│ 'meta/llama-3.1-70b-instruct-unknown' is not supported. Supported ids are:                       │
│ meta/llama-3.1-70b-instruct, snowflake/arctic-embed-l, meta/llama-3.2-1b-instruct,               │
│ nvidia/nv-embedqa-mistral-7b-v2, meta/llama-3.2-90b-vision-instruct, meta/llama-3.2-3b-instruct, │
│ meta/llama-3.2-11b-vision-instruct, meta/llama-3.1-405b-instruct, meta/llama3-8b-instruct,       │
│ meta/llama3-70b-instruct, nvidia/llama-3.2-nv-embedqa-1b-v2, meta/llama-3.1-8b-instruct,         │
│ nvidia/nv-embedqa-e5-v5"}                                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
```

new behavior for double register -
```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct
Successfully registered model test-model

$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.2-1b-instruct 
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Failed to register model                                                                         │
│                                                                                                  │
│ Error Type: BadRequestError                                                                      │
│ Details: Error code: 400 - {'detail': "Invalid value: Model id 'test-model' is already           │
│ registered. Please use a different id or unregister it first."}                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
```


## Test Plan

```
uv run pytest -v tests/unit/providers/utils/test_model_registry.py
```
2025-05-01 12:00:58 -07:00
Ben Browning
934446ddb4
fix: ollama still using tools with tool_choice="none" (#2047)
# What does this PR do?

In our OpenAI API verification tests, ollama was still calling tools
even when `tool_choice="none"` was passed in its chat completion
requests. Because ollama isn't respecting `tool_choice` properly, this
adjusts our provider implementation to remove the `tools` from the
request if `tool_choice="none"` is passed in so that it does not attempt
to call any of those tools.

## Test Plan

I tested this with a couple of Llama models, using both our OpenAI
completions integration tests and our verification test suites.

### OpenAI Completions / Chat Completions integration tests

These all passed before, and still do.

```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \
  llama stack build --template ollama --image-type venv --run
```

```
LLAMA_STACK_CONFIG=http://localhost:8321 \
  python -m pytest -v \
  tests/integration/inference/test_openai_completion.py \
  --text-model "llama3.2:3b-instruct-fp16"
```

### OpenAI API Verification test suite

test_chat_*_tool_choice_none OpenAI API verification tests pass now,
when they failed before.

See

https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack
for an example of these failures from a recent nightly CI run.

```
INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \
  llama stack build --template ollama --image-type venv --run
```

```
cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml
base_url: http://localhost:8321/v1/openai/v1
api_key_var: OPENAI_API_KEY
models:
- llama3.3:70b-instruct-q3_K_M
model_display_names:
  llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct
test_exclusions:
  llama3.3:70b-instruct-q3_K_M:
  - test_chat_non_streaming_image
  - test_chat_streaming_image
  - test_chat_multi_turn_multiple_images
EOF
```

```
python -m pytest -s -v \
  'tests/verifications/openai_api/test_chat_completion.py' \
  --provider=ollama-llama-stack
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-29 10:45:28 +02:00
Nathan Weinberg
cf158f2cb9
feat: allow ollama to use 'latest' if available but not specified (#1903)
# What does this PR do?
ollama's CLI supports running models via commands such as 'ollama run
llama3.2' this syntax does not work with the INFERENCE_MODEL llamastack
var as currently specifying a tag such as 'latest' is required

this commit will check to see if the 'latest' model is available and use
that model if a user passes a model name without a tag but the 'latest'
is available in ollama

## Test Plan
Behavior pre-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO     2025-04-08 13:42:42,842 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking            
         connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...                                            
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 502, in <module>
    main()
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 401, in main
    impls = asyncio.run(construct_stack(config))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 222, in construct_stack
    await register_resources(run_config, impls)
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 99, in register_resources
    await method(**obj.model_dump())
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 294, in register_model
    registered_model = await self.register_object(model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 228, in register_object
    registered_obj = await register_object_with_provider(obj, p)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 77, in register_object_with_provider
    return await p.register_model(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/remote/inference/ollama/ollama.py", line 315, in register_model
    raise ValueError(
ValueError: Model 'llama3.2' is not available in Ollama. Available models: llama3.2:latest
++ error_handler 108
++ echo 'Error occurred in script at line: 108'
Error occurred in script at line: 108
++ exit 1
```

Behavior post-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO     2025-04-08 13:58:17,365 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking            
         connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...                                            
WARNING  2025-04-08 13:58:18,190 llama_stack.providers.remote.inference.ollama.ollama:317 inference: Imprecise provider 
         resource id was used but 'latest' is available in Ollama - using 'llama3.2:latest'                             
INFO     2025-04-08 13:58:18,191 llama_stack.providers.remote.inference.ollama.ollama:308 inference: Pulling embedding  
         model `all-minilm:latest` if necessary...                                                                      
INFO     2025-04-08 13:58:18,799 __main__:478 server: Listening on ['::', '0.0.0.0']:8321                               
INFO:     Started server process [28378]
INFO:     Waiting for application startup.
INFO     2025-04-08 13:58:18,803 __main__:148 server: Starting up                                                       
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
...
```

## Documentation
Did not document this anywhere but happy to do so if there is an
appropriate place

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-04-14 09:03:54 -07:00
Ben Browning
7641a5cd0b
fix: 100% OpenAI API verification for together and fireworks (#1946)
# What does this PR do?

TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.

This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.

As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.

The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.

With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.

And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.

## Test Plan

### OpenAI API Verification Tests

I ran the OpenAI API verification tests as below and 100% of the tests
passed.

First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.

First, ensure you have the necessary API key environment variables set:

```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```

Then, run a Llama Stack server that serves up all these providers:

```
llama stack run \
      --image-type venv \
      tests/verifications/openai-api-verification-run.yaml
```

Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.

```
python tests/verifications/generate_report.py \
      --run-tests \
      --provider \
        together \
        fireworks \
        groq \
        openai \
        together-llama-stack \
        fireworks-llama-stack \
        groq-llama-stack \
        openai-llama-stack
```

You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.

### OpenAI Completion Integration Tests with vLLM:

I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### OpenAI Completion Integration Tests with ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

### OpenAI Completion Integration Tests with together.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```

### OpenAI Completion Integration Tests with fireworks.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-14 08:56:29 -07:00
Sébastien Han
69554158fa
feat: add health to all providers through providers endpoint (#1418)
The `/v1/providers` now reports the health status of each
provider when implemented.

```
curl -L http://127.0.0.1:8321/v1/providers|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4072  100  4072    0     0   246k      0 --:--:-- --:--:-- --:--:--  248k
{
  "data": [
    {
      "api": "inference",
      "provider_id": "ollama",
      "provider_type": "remote::ollama",
      "config": {
        "url": "http://localhost:11434"
      },
      "health": {
        "status": "OK"
      }
    },
    {
      "api": "vector_io",
      "provider_id": "faiss",
      "provider_type": "inline::faiss",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/faiss_store.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "safety",
      "provider_id": "llama-guard",
      "provider_type": "inline::llama-guard",
      "config": {
        "excluded_categories": []
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "agents",
      "provider_id": "meta-reference",
      "provider_type": "inline::meta-reference",
      "config": {
        "persistence_store": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/agents_store.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "telemetry",
      "provider_id": "meta-reference",
      "provider_type": "inline::meta-reference",
      "config": {
        "service_name": "llama-stack",
        "sinks": "console,sqlite",
        "sqlite_db_path": "/Users/leseb/.llama/distributions/ollama/trace_store.db"
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "eval",
      "provider_id": "meta-reference",
      "provider_type": "inline::meta-reference",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/meta_reference_eval.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "datasetio",
      "provider_id": "huggingface",
      "provider_type": "remote::huggingface",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/huggingface_datasetio.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "datasetio",
      "provider_id": "localfs",
      "provider_type": "inline::localfs",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/localfs_datasetio.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "scoring",
      "provider_id": "basic",
      "provider_type": "inline::basic",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "scoring",
      "provider_id": "llm-as-judge",
      "provider_type": "inline::llm-as-judge",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "scoring",
      "provider_id": "braintrust",
      "provider_type": "inline::braintrust",
      "config": {
        "openai_api_key": "********"
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "brave-search",
      "provider_type": "remote::brave-search",
      "config": {
        "api_key": "********",
        "max_results": 3
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "tavily-search",
      "provider_type": "remote::tavily-search",
      "config": {
        "api_key": "********",
        "max_results": 3
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "code-interpreter",
      "provider_type": "inline::code-interpreter",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "rag-runtime",
      "provider_type": "inline::rag-runtime",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "model-context-protocol",
      "provider_type": "remote::model-context-protocol",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "wolfram-alpha",
      "provider_type": "remote::wolfram-alpha",
      "config": {
        "api_key": "********"
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    }
  ]
}
```

Per providers too:

```
curl -L http://127.0.0.1:8321/v1/providers/ollama
{"api":"inference","provider_id":"ollama","provider_type":"remote::ollama","config":{"url":"http://localhost:11434"},"health":{"status":"OK"}}
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-14 11:59:36 +02:00
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Ben Browning
2b2db5fbda
feat: OpenAI-Compatible models, completions, chat/completions (#1894)
# What does this PR do?

This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.

The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)

The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.

This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.

## Test Plan

### vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```



## Documentation

Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-11 13:14:17 -07:00
Matthew Farrellee
3a9be58523
fix: use ollama list to find models (#1854)
# What does this PR do?

closes #1853 

## Test Plan
```
uv run llama stack build --image-type conda --image-name ollama --config llama_stack/templates/ollama/build.yaml

ollama pull llama3.2:3b

LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest tests/integration/inference/test_text_inference.py -v --text-model=llama3.2:3b
```
2025-04-09 10:34:26 +02:00
Ihar Hrachyshka
66d6c2580e
chore: more mypy checks (ollama, vllm, ...) (#1777)
# What does this PR do?

- **chore: mypy for strong_typing**
- **chore: mypy for remote::vllm**
- **chore: mypy for remote::ollama**
- **chore: mypy for providers.datatype**

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-01 17:12:39 +02:00
Sébastien Han
7cf1e24c4e
feat(logging): implement category-based logging (#1362)
# What does this PR do?

This commit introduces a new logging system that allows loggers to be
assigned
a category while retaining the logger name based on the file name. The
log
format includes both the logger name and the category, producing output
like:

```
INFO     2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
         tavily-search
```

Key features include:

- Category-based logging: Loggers can be assigned a category (e.g.,
  "core", "server") when programming. The logger can be loaded like
  this: `logger = get_logger(name=__name__, category="server")`
- Environment variable control: Log levels can be configured
per-category using the
  `LLAMA_STACK_LOGGING` environment variable. For example:
`LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
the "server"
    and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
categories and
    third-party libraries.

This provides fine-grained control over logging levels while maintaining
a clean and
informative log format.

The formatter uses the rich library which provides nice colors better
stack traces like so:

```
ERROR    2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
         task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
         /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
         exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
         ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
         │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown                │
         │                                                                                                                │
         │   175 │   │   except asyncio.CancelledError:                                                                   │
         │   176 │   │   │   pass                                                                                         │
         │   177 │   │   finally:                                                                                         │
         │ ❱ 178 │   │   │   loop.stop()                                                                                  │
         │   179 │                                                                                                        │
         │   180 │   loop = asyncio.get_running_loop()                                                                    │
         │   181 │   loop.create_task(shutdown())                                                                         │
         ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         UnboundLocalError: local variable 'loop' referenced before assignment
```

Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO     2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml           
INFO     2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:                                                 
INFO     2025-03-03 21:55:35,928 __main__:380 [server]: apis:                                                              
         - agents                                                     
``` 
[//]: # (## Documentation)

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-07 11:34:30 -08:00
Sébastien Han
803bf0e029
fix: solve ruff B008 warnings (#1444)
# What does this PR do?

The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling SamplingParams() directly in function argument defaults.
Instead, it either uses Field(default_factory=SamplingParams) for
Pydantic models or sets the default to None and instantiates
SamplingParams inside the function body when the argument is None.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-06 16:48:35 -08:00
Ashwin Bharambe
754feba61f
feat: add a configurable category-based logger (#1352)
A self-respecting server needs good observability which starts with
configurable logging. Llama Stack had little until now. This PR adds a
`logcat` facility towards that. Callsites look like:

```python
logcat.debug("inference", f"params to ollama: {params}")
```

- the first parameter is a category. there is a static list of
categories in `llama_stack/logcat.py`
- each category can be associated with a log-level which can be
configured via the `LLAMA_STACK_LOGGING` env var.
- a value `LLAMA_STACK_LOGGING=inference=debug;server=info"` does the
obvious thing. there is a special key called `all` which is an alias for
all categories

## Test Plan

Ran with `LLAMA_STACK_LOGGING="all=debug" llama stack run fireworks` and
saw the following:


![image](https://github.com/user-attachments/assets/d24b95ab-3941-426c-9ea0-a4c62542e6f0)

Hit it with a client-sdk test case and saw this:


![image](https://github.com/user-attachments/assets/3fee8c6c-986e-4125-a09c-f5dc019682e2)
2025-03-02 18:51:14 -08:00
Ashwin Bharambe
ab54b8cd58
feat(providers): support non-llama models for inference providers (#1200)
This PR begins the process of supporting non-llama models within Llama
Stack. We start simple by adding support for this functionality within a
few existing providers: fireworks, together and ollama.

## Test Plan

```bash
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/inference/test_text_inference.py \
  --inference-model accounts/fireworks/models/phi-3-vision-128k-instruct
```

^ this passes most of the tests but as expected fails the tool calling
related tests since they are very specific to Llama models

```
inference/test_text_inference.py::test_text_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct-completion-01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet do humans live on?-Earth] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet has rings around it with a name starting w
ith letter S?-Saturn] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What's the name of the Sun in latin?-Sol] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What is the name of the US captial?-Washington] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct] ERROR
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-True] PASSED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-False] PASSED
```
2025-02-21 13:21:28 -08:00
Ashwin Bharambe
36162c8c82 fix(ollama): register model with the helper first so it gets normalized 2025-02-21 12:51:38 -08:00
Ashwin Bharambe
11697f85c5
fix: pull ollama embedding model if necessary (#1209)
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.

Thanks @hardikjshah for the suggestion.

Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)

## Test Plan 

First run `ollama rm all-minilm:latest`. 

Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
2025-02-21 10:35:56 -08:00
Ashwin Bharambe
81ce39a607
feat(api): Add options for supporting various embedding models (#1192)
We need to support:
- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932) 

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
2025-02-20 22:27:12 -08:00
Ashwin Bharambe
6f9d622340
fix(api): update embeddings signature so inputs and outputs list align (#1161)
See Issue #922 

The change is slightly backwards incompatible but no callsite (in our
client codebases or stack-apps) every passes a depth-2
`List[List[InterleavedContentItem]]` (which is now disallowed.)

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

Also ran `tests/client-sdk/inference/test_embeddings.py`
2025-02-20 21:43:13 -08:00
Ashwin Bharambe
9436dd570d
feat: register embedding models for ollama, together, fireworks (#1190)
# What does this PR do?

We have support for embeddings in our Inference providers, but so far we
haven't done the final step of actually registering the known embedding
models and making sure they are extremely easy to use. This is one step
towards that.

## Test Plan

Run existing inference tests.

```bash

$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

The value of the EMBEDDING_DIMENSION isn't actually used in these tests,
it is merely used by the test fixtures to check if the model is an LLM
or Embedding.
2025-02-20 15:39:08 -08:00
Ashwin Bharambe
07ccf908f7 ModelAlias -> ProviderModelEntry 2025-02-20 14:02:36 -08:00
Ashwin Bharambe
eddef0b2ae
chore: slight renaming of model alias stuff (#1181)
Quick test by running:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk
```
2025-02-20 11:48:46 -08:00
Ashwin Bharambe
cdcbeb005b
chore: remove llama_models.llama3.api imports from providers (#1107)
There should be a choke-point for llama3.api imports -- this is the
prompt adapter. Creating a ChatFormat() object on demand is inexpensive.
The underlying Tokenizer is a singleton anyway.
2025-02-19 19:01:29 -08:00
Ashwin Bharambe
314ee09ae3
chore: move all Llama Stack types from llama-models to llama-stack (#1098)
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.

This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279

## Test Plan

Ensure all `llama` CLI `model` sub-commands work:

```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```

Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```

Create a fresh venv `uv venv && source .venv/bin/activate` and run
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` <-- the server runs

Also checked that the OpenAPI generator can run and there is no change
in the generated files as a result.

```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
2025-02-14 09:10:59 -08:00
Sébastien Han
e4a1579e63
build: format codebase imports using ruff linter (#1028)
# What does this PR do?

- Configured ruff linter to automatically fix import sorting issues.
- Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are
applied.
- Enabled the 'I' selection to focus on import-related linting rules.
- Ran the linter, and formatted all codebase imports accordingly.
- Removed the black dep from the "dev" group since we use ruff

Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-13 10:06:21 -08:00
Xi Yan
66d7e15c93
perf: ensure ToolCall in ChatCompletionResponse is subset of ChatCompletionRequest.tools (#1041)
# What does this PR do?

**Problem**
- Using script:
https://gist.github.com/thoraxe/6163b2145ce7b1c24c6026b64cf90085

- This hits an issue on server with `code_interpreter` not found, as we
do not pass "builtin::code_interpreter" in AgentConfig's `toolgroups`.

This is a general issue where model always tries to output
`code_interpreter` in `ToolCall` even when we do not have
`code_interpreter` available for execution.

**Reproduce Deeper Problem in chat-completion**
- Use script:
https://gist.github.com/yanxi0830/163a9ad7b5db10556043fbfc7ecd7603

1. We currently always populate `code_interpreter` in `ToolCall` in
ChatCompletionResponse if the model's response begins with
`<|python_tag|>`. See
c5f5958498/models/llama3/api/chat_format.py (L200-L213)

<img width="913" alt="image"
src="https://github.com/user-attachments/assets/328d313d-0a0b-495c-8715-61cca9ccc4a6"
/>

2. This happens even if we do not pass the `code_interpreter` as a
`tools` in ChatCompletionRequest.

**This PR**

Explicitly make sure that the tools returned in
`ChatCompletionResponse.tool_calls` is always a tool requested by
`ChatCompletionRequest.tools`.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

**Before**
<img width="913" alt="image"
src="https://github.com/user-attachments/assets/328d313d-0a0b-495c-8715-61cca9ccc4a6"
/>
<img width="997" alt="image"
src="https://github.com/user-attachments/assets/d3e82b62-b142-4939-954c-62843bec7110"
/>


**After**
<img width="856" alt="image"
src="https://github.com/user-attachments/assets/2c70ce55-c8d0-45ea-b10f-f70adc50d3d9"
/>
<img width="1000" alt="image"
src="https://github.com/user-attachments/assets/b5e81826-c35b-4052-bf81-7afff93ce2ef"
/>



**Unit Test**
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```

```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/
```
<img width="1002" alt="image"
src="https://github.com/user-attachments/assets/04808517-eded-4122-97f5-7e5142de9779"
/>



**Streaming**
- Chat Completion
<img width="902" alt="image"
src="https://github.com/user-attachments/assets/f477bc86-bd38-4729-b49e-a0a6ed3f835a"
/>

- Agent
<img width="916" alt="image"
src="https://github.com/user-attachments/assets/f4cc3417-23cd-46b1-953d-3a2271e79bbb"
/>


[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-11 18:31:35 -08:00
Sébastien Han
316c43fdaf
refactor(ollama): model availability check (#986)
# What does this PR do?

Moved model availability check logic into a dedicated
check_model_availability function. Eliminated redundant code by reusing
the helper function in both embedding and non-embedding model
registration.

Signed-off-by: Sébastien Han <seb@redhat.com>

## Test Plan

Run Ollama and serve 2 models to get most the unit test pass:

```
ollama run llama3.2:3b-instruct-fp16 --keepalive 2m &
ollama run llama3.1:8b  --keepalive 2m &
```

Run the unit test:

```
uv run pytest -v -k "ollama" --inference-model=llama3.2:3b-instruct-fp16 llama_stack/providers/tests/inference/test_model_registration.py
/Users/leseb/Documents/AI/llama-stack/.venv/lib/python3.13/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================ test session starts =============================================
platform darwin -- Python 3.13.1, pytest-8.3.4, pluggy-1.5.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.13.1', 'Platform': 'macOS-15.3-arm64-arm-64bit-Mach-O', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0, nbval-0.11.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 65 items / 60 deselected / 5 selected                                                              

llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_register_unsupported_model[-ollama] PASSED [ 20%]
llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_register_nonexistent_model[-ollama] PASSED [ 40%]
llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_register_with_llama_model[-ollama] FAILED [ 60%]
llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_initialize_model_during_registering[-ollama] FAILED [ 80%]
llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_register_with_invalid_llama_model[-ollama] PASSED [100%]

================================================== FAILURES ==================================================
_______________________ TestModelRegistration.test_register_with_llama_model[-ollama] ________________________
llama_stack/providers/tests/inference/test_model_registration.py:54: in test_register_with_llama_model
    _ = await models_impl.register_model(
llama_stack/providers/utils/telemetry/trace_protocol.py:91: in async_wrapper
    result = await method(self, *args, **kwargs)
llama_stack/distribution/routers/routing_tables.py:245: in register_model
    registered_model = await self.register_object(model)
llama_stack/distribution/routers/routing_tables.py:192: in register_object
    registered_obj = await register_object_with_provider(obj, p)
llama_stack/distribution/routers/routing_tables.py:53: in register_object_with_provider
    return await p.register_model(obj)
llama_stack/providers/utils/telemetry/trace_protocol.py:91: in async_wrapper
    result = await method(self, *args, **kwargs)
llama_stack/providers/remote/inference/ollama/ollama.py:368: in register_model
    await check_model_availability(model.provider_resource_id)
llama_stack/providers/remote/inference/ollama/ollama.py:359: in check_model_availability
    raise ValueError(
E   ValueError: Model 'custom-model' is not available in Ollama. Available models: llama3.1:8b, llama3.2:3b-instruct-fp16
__________________ TestModelRegistration.test_initialize_model_during_registering[-ollama] ___________________
llama_stack/providers/tests/inference/test_model_registration.py:85: in test_initialize_model_during_registering
    mock_load_model.assert_called_once()
/opt/homebrew/Cellar/python@3.13/3.13.1/Frameworks/Python.framework/Versions/3.13/lib/python3.13/unittest/mock.py:956: in assert_called_once
    raise AssertionError(msg)
E   AssertionError: Expected 'load_model' to have been called once. Called 0 times.
-------------------------------------------- Captured stderr call --------------------------------------------
W0207 11:55:26.777000 90854 .venv/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
========================================== short test summary info ===========================================
FAILED llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_register_with_llama_model[-ollama] - ValueError: Model 'custom-model' is not available in Ollama. Available models: llama3.1:8b, llama3.2:3b-i...
FAILED llama_stack/providers/tests/inference/test_model_registration.py::TestModelRegistration::test_initialize_model_during_registering[-ollama] - AssertionError: Expected 'load_model' to have been called once. Called 0 times.
=========================== 2 failed, 3 passed, 60 deselected, 2 warnings in 1.84s ===========================
``` 

We only "care" about the `test_register_nonexistent_model` for this
code.


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-07 09:52:16 -08:00
ehhuang
c9ab72fa82
Support sys_prompt behavior in inference (#937)
# What does this PR do?

The current default system prompt for llama3.2 tends to overindex on
tool calling and doesn't work well when the prompt does not require tool
calling.

This PR adds an option to override the default system prompt, and
organizes tool-related configs into a new config object.

- [ ] Addresses issue (#issue)


## Test Plan

python -m unittest
llama_stack.providers.tests.inference.test_prompt_adapter


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/937).
* #938
* __->__ #937
2025-02-03 23:35:16 -08:00
Yuan Tang
34ab7a3b6c
Fix precommit check after moving to ruff (#927)
Lint check in main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file as well as fixing and ignoring some
additional checks.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-02 06:46:45 -08:00
Sixian Yi
e41873f268
[ez] structured output for /completion ollama & enable tests (#822)
# What does this PR do?

1) enabled structured output for ollama /completion API. It seems we
missed this one.
2) fixed ollama structured output test in client sdk - ollama does not
support list format for structured output
3) enable structured output unit test as the result was stable on
Llama-3.1-8B-Instruct and ollama, fireworks, together.


## Test Plan
1) Run `test_completion_structured_output` on /completion API with 3
providers: ollama, fireworks, together.
pytest -v -s -k "together"
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output


```
(base) sxyi@sxyi-mbp llama-stack % pytest -s -v llama_stack/providers/tests/inference --config=ci_test_config.yaml
/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================================ test session starts =================================================================================================
platform darwin -- Python 3.13.0, pytest-8.3.4, pluggy-1.5.0 -- /Library/Frameworks/Python.framework/Versions/3.13/bin/python3.13
cachedir: .pytest_cache
metadata: {'Python': '3.13.0', 'Platform': 'macOS-15.1.1-arm64-arm-64bit-Mach-O', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.24.0', 'html': '4.1.1', 'metadata': '3.1.1', 'md': '0.2.0', 'dependency': '0.6.0', 'md-report': '0.6.3', 'anyio': '4.6.2.post1'}}
rootdir: /Users/sxyi/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.24.0, html-4.1.1, metadata-3.1.1, md-0.2.0, dependency-0.6.0, md-report-0.6.3, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 85 items / 82 deselected / 3 selected                                                                                                                                                                      

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-fireworks]
PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-together] PASSED

==================================================================================== 3 passed, 82 deselected, 8 warnings in 5.67s ====================================================================================
```
2)  
` LLAMA_STACK_CONFIG="./llama_stack/templates/ollama/run.yaml"
/opt/miniconda3/envs/stack/bin/pytest -s -v tests/client-sdk/inference`

Before: 
```
________________________________________________________________________________________ test_completion_structured_output __________________________________________________________________________________________
tests/client-sdk/inference/test_inference.py:174: in test_completion_structured_output
    answer = AnswerFormat.model_validate_json(response.content)
E   pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E     Invalid JSON: expected value at line 1 column 2 [type=json_invalid, input_value=' The year he retired, he...5\n\nThe best answer is', input_type=str]
E       For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
```

After: 
test consistently passes


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-21 21:10:24 -08:00
Dinesh Yeduguru
8af6951106
remove conflicting default for tool prompt format in chat completion (#742)
# What does this PR do?
We are setting a default value of json for tool prompt format, which
conflicts with llama 3.2/3.3 models since they use python list. This PR
changes the defaults to None and in the code, we infer default based on
the model.

Addresses: #695 

Tests:
❯ LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v
tests/client-sdk/inference/test_inference.py -k
"test_text_chat_completion"

 pytest llama_stack/providers/tests/inference/test_prompt_adapter.py
2025-01-10 10:41:53 -08:00
Aidan Do
5d7b611336
Add JSON structured outputs to Ollama Provider (#680)
# What does this PR do?

Addresses issue #679

- Adds support for the response_format field for chat completions and
completions so users can get their outputs in JSON

## Test Plan

<details>

<summary>Integration tests</summary>

`pytest
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output
-k ollama -s -v`

```python
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-ollama] PASSED

================================== 2 passed, 18 deselected, 3 warnings in 41.41s ==================================
```

</details>

<details>
<summary>Manual Tests</summary>

```
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export OLLAMA_INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=5000

ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

```python
    client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

    MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
    prompt =f"""
        Create a step by step plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.
        You have 3 different operations you can perform. You can create a file, update a file, or delete a file.
        Limit your step by step plan to only these operations per step.
        Don't create more than 10 steps.

        Please ensure there's a README.md file in the root of the codebase that describes the codebase and how to run it.
        Please ensure there's a requirements.txt file in the root of the codebase that describes the dependencies of the codebase.
        """
    response = client.inference.chat_completion(
        model_id=MODEL_ID,
        messages=[
            {"role": "user", "content": prompt},
        ],
        sampling_params={
            "max_tokens": 200000,
        },
        response_format={
            "type": "json_schema",
            "json_schema": {
                "$schema": "http://json-schema.org/draft-07/schema#",
                "title": "Plan",
                "description": f"A plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.",
                "type": "object",
                "properties": {
                    "steps": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["steps"],
                "additionalProperties": False,
            }
        },
        stream=True,
    )

    content = ""
    for chunk in response:
        if chunk.event.delta:
            print(chunk.event.delta, end="", flush=True)
            content += chunk.event.delta

    try:
        plan = json.loads(content)
        print(plan)
    except Exception as e:
        print(f"Error parsing plan into JSON: {e}")
        plan = {"steps": []}
```

Outputs:

```json
{
    "steps": [
        "Update the requirements.txt file to include the updated dependencies specified in the peer's feedback, including the Google Cloud Translation API key.",
        "Update the app.py file to address the code smells and incorporate the suggested improvements, such as handling errors and exceptions, initializing the Translator object correctly, adding input validation, using type hints and docstrings, and removing unnecessary logging statements.",
        "Create a README.md file that describes the codebase and how to run it.",
        "Ensure the README.md file is up-to-date and accurate.",
        "Update the requirements.txt file to reflect any additional dependencies specified by the peer's feedback.",
        "Add documentation for each function in the app.py file using docstrings.",
        "Implement logging statements throughout the app.py file to monitor application execution.",
        "Test the API endpoint to ensure it correctly translates text from English to French and handles errors properly.",
        "Refactor the code to follow PEP 8 style guidelines and ensure consistency in naming conventions, indentation, and spacing.",
        "Create a new folder for logs and add a logging configuration file (e.g., logconfig.json) that specifies the logging level and output destination.",
        "Deploy the web server on a production environment (e.g., AWS Elastic Beanstalk or Google Cloud Platform) to make it accessible to external users."
    ]
}
```


</details>

## Sources

- Ollama api docs:
https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
- Ollama structured output docs:
https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2025-01-02 09:05:51 -08:00
Xi Yan
3c72c034e6
[remove import *] clean up import *'s (#689)
# What does this PR do?

- as title, cleaning up `import *`'s
- upgrade tests to make them more robust to bad model outputs
- remove import *'s in llama_stack/apis/* (skip __init__ modules)
<img width="465" alt="image"
src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2"
/>

- run `sh run_openapi_generator.sh`, no types gets affected

## Test Plan

### Providers Tests

**agents**
```
pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8
```

**inference**
```bash
# meta-reference
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

# together
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py 
```

**safety**
```
pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B
```

**memory**
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384
```

**scoring**
```
pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
```


**datasetio**
```
pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py
pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py
```


**eval**
```
pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py
```

### Client-SDK Tests
```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk
```

### llama-stack-apps
```
PORT=5000
LOCALHOST=localhost

python -m examples.agents.hello $LOCALHOST $PORT
python -m examples.agents.inflation $LOCALHOST $PORT
python -m examples.agents.podcast_transcript $LOCALHOST $PORT
python -m examples.agents.rag_as_attachments $LOCALHOST $PORT
python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT
python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT
python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT

# Vision model
python -m examples.interior_design_assistant.app
python -m examples.agent_store.app $LOCALHOST $PORT
```

### CLI
```
which llama
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
llama model list
llama stack list-apis
llama stack list-providers inference

llama stack build --template ollama --image-type conda
```

### Distributions Tests
**ollama**
```
llama stack build --template ollama --image-type conda
ollama run llama3.2:1b-instruct-fp16
llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
```

**fireworks**
```
llama stack build --template fireworks --image-type conda
llama stack run ./llama_stack/templates/fireworks/run.yaml
```

**together**
```
llama stack build --template together --image-type conda
llama stack run ./llama_stack/templates/together/run.yaml
```

**tgi**
```
llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-27 15:45:44 -08:00
Aidan Do
21fb92d7cf
Add 3.3 70B to Ollama inference provider (#681)
# What does this PR do?

Adds 3.3 70B support to Ollama inference provider

## Test Plan

<details>
<summary>Manual</summary>

```bash
# 42GB to download
ollama pull llama3.3:70b

ollama run llama3.3:70b --keepalive 60m

export LLAMA_STACK_PORT=5000
pip install -e . \
  && llama stack build --template ollama --image-type conda \
  && llama stack run ./distributions/ollama/run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=Llama3.3-70B-Instruct \
  --env OLLAMA_URL=http://localhost:11434

export LLAMA_STACK_PORT=5000
llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT \
  inference chat-completion \
  --model-id Llama3.3-70B-Instruct \
  --message "hello, what model are you?"
```

<img width="1221" alt="image"
src="https://github.com/user-attachments/assets/dcffbdd9-94c8-4d47-9f95-4ef6c3756294"
/>

</details>

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-25 22:15:58 -08:00
Ashwin Bharambe
b7a7caa9a8 Fix conversion to RawMessage everywhere 2024-12-17 14:00:43 -08:00
Ashwin Bharambe
8de8eb03c8
Update the "InterleavedTextMedia" type (#635)
## What does this PR do?

This is a long-pending change and particularly important to get done
now.

Specifically:
- we cannot "localize" (aka download) any URLs from media attachments
anywhere near our modeling code. it must be done within llama-stack.
- `PIL.Image` is infesting all our APIs via `ImageMedia ->
InterleavedTextMedia` and that cannot be right at all. Anything in the
API surface must be "naturally serializable". We need a standard `{
type: "image", image_url: "<...>" }` which is more extensible
- `UserMessage`, `SystemMessage`, etc. are moved completely to
llama-stack from the llama-models repository.

See https://github.com/meta-llama/llama-models/pull/244 for the
corresponding PR in llama-models.

## Test Plan

```bash
cd llama_stack/providers/tests

pytest -s -v -k "fireworks or ollama or together" inference/test_vision_inference.py
pytest -s -v -k "(fireworks or ollama or together) and llama_3b" inference/test_text_inference.py
pytest -s -v -k chroma memory/test_memory.py \
  --env EMBEDDING_DIMENSION=384 --env CHROMA_DB_PATH=/tmp/foobar

pytest -s -v -k fireworks agents/test_agents.py  \
   --safety-shield=meta-llama/Llama-Guard-3-8B \
   --inference-model=meta-llama/Llama-3.1-8B-Instruct
```

Updated the client sdk (see PR ...), installed the SDK in the same
environment and then ran the SDK tests:

```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=together pytest -s -v agents/test_agents.py
LLAMA_STACK_CONFIG=ollama pytest -s -v memory/test_memory.py

# this one needed a bit of hacking in the run.yaml to ensure I could register the vision model correctly
INFERENCE_MODEL=llama3.2-vision:latest LLAMA_STACK_CONFIG=ollama pytest -s -v inference/test_inference.py
```
2024-12-17 11:18:31 -08:00
Dinesh Yeduguru
516e1a3e59
add embedding model by default to distribution templates (#617)
# What does this PR do?
Adds the sentence transformer provider and the `all-MiniLM-L6-v2`
embedding model to the default models to register in the run.yaml for
all providers.

## Test Plan
llama stack build --template together --image-type conda
llama stack run
~/.llama/distributions/llamastack-together/together-run.yaml
2024-12-13 12:48:00 -08:00
Dinesh Yeduguru
96e158eaac
Make embedding generation go through inference (#606)
This PR does the following:
1) adds the ability to generate embeddings in all supported inference
providers.
2) Moves all the memory providers to use the inference API and improved
the memory tests to setup the inference stack correctly and use the
embedding models

This is a merge from #589 and #598
2024-12-12 11:47:50 -08:00
Ashwin Bharambe
14f973a64f
Make LlamaStackLibraryClient work correctly (#581)
This PR does a few things:

- it moves "direct client" to llama-stack repo instead of being in the
llama-stack-client-python repo
- renames it to `LlamaStackLibraryClient`
- actually makes synchronous generators work 
- makes streaming and non-streaming work properly

In many ways, this PR makes things finally "work"

## Test Plan

See a `library_client_test.py` I added. This isn't really quite a test
yet but it demonstrates that this mode now works. Here's the invocation
and the response:

```
INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct python llama_stack/distribution/tests/library_client_test.py ollama
```


![image](https://github.com/user-attachments/assets/17d4e116-4457-4755-a14e-d9a668801fe0)
2024-12-07 14:59:36 -08:00
Kai Wu
b6500974ec
removed assertion in ollama.py and fixed typo in the readme (#563)
# What does this PR do?
1. removed [incorrect
assertion](435f34b05e/llama_stack/providers/remote/inference/ollama/ollama.py (L183))
in ollama.py
2. fixed a typo in [this
line](435f34b05e/docs/source/distributions/importing_as_library.md (L24)),
as `model=` should be `model_id=` .

- [x] Addresses issue
([#issue562](https://github.com/meta-llama/llama-stack/issues/562))


## Test Plan

tested with code:

```python
import asyncio
import os

# pip install aiosqlite ollama faiss
from llama_stack_client.lib.direct.direct import LlamaStackDirectClient
from llama_stack_client.types import SystemMessage, UserMessage


async def main():
    os.environ["INFERENCE_MODEL"] = "meta-llama/Llama-3.2-1B-Instruct"
    client = await LlamaStackDirectClient.from_template("ollama")
    await client.initialize()
    response = await client.models.list()
    print(response)
    model_name = response[0].identifier
    response = await client.inference.chat_completion(
        messages=[
            SystemMessage(content="You are a friendly assistant.", role="system"),
            UserMessage(
                content="hello world, write me a 2 sentence poem about the moon",
                role="user",
            ),
        ],
        model_id=model_name,
        stream=False,
    )
    print("\nChat completion response:")
    print(response, type(response))


asyncio.run(main())

```
OUTPUT:
```
python test.py
Using template ollama with config:
apis:
- agents
- inference
- memory
- safety
- telemetry
conda_env: ollama
datasets: []
docker_image: null
eval_tasks: []
image_name: ollama
memory_banks: []
metadata_store:
  db_path: /Users/kaiwu/.llama/distributions/ollama/registry.db
  namespace: null
  type: sqlite
models:
- metadata: {}
  model_id: meta-llama/Llama-3.2-1B-Instruct
  provider_id: ollama
  provider_model_id: null
providers:
  agents:
  - config:
      persistence_store:
        db_path:
/Users/kaiwu/.llama/distributions/ollama/agents_store.db
        namespace: null
        type: sqlite
    provider_id: meta-reference
    provider_type: inline::meta-reference
  inference:
  - config:
      url: http://localhost:11434
    provider_id: ollama
    provider_type: remote::ollama
  memory:
  - config:
      kvstore:
        db_path:
/Users/kaiwu/.llama/distributions/ollama/faiss_store.db
        namespace: null
        type: sqlite
    provider_id: faiss
    provider_type: inline::faiss
  safety:
  - config: {}
    provider_id: llama-guard
    provider_type: inline::llama-guard
  telemetry:
  - config: {}
    provider_id: meta-reference
    provider_type: inline::meta-reference
scoring_fns: []
shields: []
version: '2'

[Model(identifier='meta-llama/Llama-3.2-1B-Instruct', provider_resource_id='llama3.2:1b-instruct-fp16', provider_id='ollama', type='model', metadata={})]

Chat completion response:
completion_message=CompletionMessage(role='assistant', content='Here is a short poem about the moon:\n\nThe moon glows bright in the midnight sky,\nA silver crescent shining, catching the eye.', stop_reason=<StopReason.end_of_turn: 'end_of_turn'>, tool_calls=[]) logprobs=None <class 'llama_stack.apis.inference.inference.ChatCompletionResponse'>
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-03 20:11:19 -08:00
Martin Hickey
76fc5d9f31
Update Ollama supported llama model list (#483)
# What does this PR do?

Update the llama model supported list for Ollama.

- [x] Addresses issue (#462)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
2024-11-22 21:56:43 -08:00