Commit graph

27 commits

Author SHA1 Message Date
Yuan Tang
34ab7a3b6c
Fix precommit check after moving to ruff (#927)
Lint check in main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file as well as fixing and ignoring some
additional checks.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-02 06:46:45 -08:00
Ashwin Bharambe
d78027f3b5 Move runpod provider to the correct directory
Also cleanup the test code to avoid skipping tests. Let failures be
known and public.
2025-01-23 12:25:12 -08:00
snova-edwardm
22dc684da6
Sambanova inference provider (#555)
# What does this PR do?

This PR adds SambaNova as one of the Provider

- Add SambaNova as a provider

## Test Plan
Test the functional command
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_embeddings.py llama_stack/providers/tests/inference/test_prompt_adapter.py llama_stack/providers/tests/inference/test_text_inference.py llama_stack/providers/tests/inference/test_vision_inference.py --env SAMBANOVA_API_KEY=<sambanova-api-key>
```

Test the distribution template:
```
# Docker
LLAMA_STACK_PORT=5001
docker run -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  llamastack/distribution-sambanova \
  --port $LLAMA_STACK_PORT \
  --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY

# Conda
llama stack build --template sambanova --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
```

## Source
[SambaNova API Documentation](https://cloud.sambanova.ai/apis)

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [Y] Ran pre-commit to handle lint / formatting issues.
- [Y] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [Y] Updated relevant documentation.
- [Y ] Wrote necessary unit or integration tests.

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-01-23 12:20:28 -08:00
Aidan Do
910717c1fd
Add vLLM raw completions API (#823)
# What does this PR do?

Adds raw completions API to vLLM

## Test Plan

<details>
<summary>Setup</summary>

```bash
# Run vllm server
conda create -n vllm python=3.12 -y
conda activate vllm
pip install vllm

# Run llamastack
conda create --name llamastack-vllm python=3.10
conda activate llamastack-vllm

export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct && \
pip install -e . && \
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7 && \
llama stack build --template remote-vllm --image-type conda && \
llama stack run ./distributions/remote-vllm/run.yaml \
  --port 5000 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env VLLM_URL=http://localhost:8000/v1 | tee -a llama-stack.log
```
</details>

<details>
<summary>Integration</summary>

```bash
# Run
conda activate llamastack-vllm
export VLLM_URL=http://localhost:8000/v1
pip install pytest pytest_html pytest_asyncio aiosqlite
pytest llama_stack/providers/tests/inference/test_text_inference.py -v -k vllm

# Results
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm_remote] PASSED            [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm_remote] PASSED            [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm_remote] SKIPPED  [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm_remote] SKIPPED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm_remote] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm_remote] PASSED     [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm_remote] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm_remote] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm_remote] PASSED [100%]

====================================== 7 passed, 2 skipped, 99 deselected, 1 warning in 9.80s ======================================
```
</details>

<details>
<summary>Manual</summary>

```bash
# Install
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7
```

Apply this diff
```diff
diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
index 8dbb193..95173e2 100644
--- a/llama_stack/distribution/server/server.py
+++ b/llama_stack/distribution/server/server.py
@@ -250,7 +250,7 @@ class ClientVersionMiddleware:
                     server_version_parts = tuple(
                         map(int, self.server_version.split(".")[:2])
                     )
-                    if client_version_parts != server_version_parts:
+                    if False and client_version_parts != server_version_parts:
 
                         async def send_version_error(send):
                             await send(
diff --git a/llama_stack/templates/remote-vllm/run.yaml b/llama_stack/templates/remote-vllm/run.yaml
index 4eac4da..32eb50e 100644
--- a/llama_stack/templates/remote-vllm/run.yaml
+++ b/llama_stack/templates/remote-vllm/run.yaml
@@ -94,7 +94,8 @@ metadata_store:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/registry.db
 models:
-- metadata: {}
+- metadata: 
+    llama_model: meta-llama/Llama-3.2-3B-Instruct
   model_id: ${env.INFERENCE_MODEL}
   provider_id: vllm-inference
   model_type: llm
```

Test 1:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5000",
)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Hello, world client!",
)

print(response)
```

Test 2

```
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5000",
)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Hello, world client!",
    stream=True,
)

for chunk in response:
    print(chunk.delta, end="", flush=True)
```

```
 I'm excited to introduce you to our latest project, a comprehensive guide to the best coffee shops in [City]. As a coffee connoisseur, you're in luck because we've scoured the city to bring you the top picks for the perfect cup of joe.

In this guide, we'll take you on a journey through the city's most iconic coffee shops, highlighting their unique features, must-try drinks, and insider tips from the baristas themselves. From cozy cafes to trendy cafes, we've got you covered.

**Top 5 Coffee Shops in [City]**

1. **The Daily Grind**: This beloved institution has been serving up expertly crafted pour-overs and lattes for over 10 years. Their expert baristas are always happy to guide you through their menu, which features a rotating selection of single-origin beans from around the world...
```

</details>

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-22 22:58:27 -08:00
Ashwin Bharambe
07b87365ab
[inference api] modify content types so they follow a more standard structure (#841)
Some small updates to the inference types to make them more standard

Specifically:
- image data is now located in a "image" subkey
- similarly tool call data is located in a "tool_call" subkey

The pattern followed is `dict(type="foo", foo=<...>)`
2025-01-22 12:16:18 -08:00
Sixian Yi
e41873f268
[ez] structured output for /completion ollama & enable tests (#822)
# What does this PR do?

1) enabled structured output for ollama /completion API. It seems we
missed this one.
2) fixed ollama structured output test in client sdk - ollama does not
support list format for structured output
3) enable structured output unit test as the result was stable on
Llama-3.1-8B-Instruct and ollama, fireworks, together.


## Test Plan
1) Run `test_completion_structured_output` on /completion API with 3
providers: ollama, fireworks, together.
pytest -v -s -k "together"
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output


```
(base) sxyi@sxyi-mbp llama-stack % pytest -s -v llama_stack/providers/tests/inference --config=ci_test_config.yaml
/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================================ test session starts =================================================================================================
platform darwin -- Python 3.13.0, pytest-8.3.4, pluggy-1.5.0 -- /Library/Frameworks/Python.framework/Versions/3.13/bin/python3.13
cachedir: .pytest_cache
metadata: {'Python': '3.13.0', 'Platform': 'macOS-15.1.1-arm64-arm-64bit-Mach-O', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.24.0', 'html': '4.1.1', 'metadata': '3.1.1', 'md': '0.2.0', 'dependency': '0.6.0', 'md-report': '0.6.3', 'anyio': '4.6.2.post1'}}
rootdir: /Users/sxyi/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.24.0, html-4.1.1, metadata-3.1.1, md-0.2.0, dependency-0.6.0, md-report-0.6.3, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 85 items / 82 deselected / 3 selected                                                                                                                                                                      

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-fireworks]
PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-together] PASSED

==================================================================================== 3 passed, 82 deselected, 8 warnings in 5.67s ====================================================================================
```
2)  
` LLAMA_STACK_CONFIG="./llama_stack/templates/ollama/run.yaml"
/opt/miniconda3/envs/stack/bin/pytest -s -v tests/client-sdk/inference`

Before: 
```
________________________________________________________________________________________ test_completion_structured_output __________________________________________________________________________________________
tests/client-sdk/inference/test_inference.py:174: in test_completion_structured_output
    answer = AnswerFormat.model_validate_json(response.content)
E   pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E     Invalid JSON: expected value at line 1 column 2 [type=json_invalid, input_value=' The year he retired, he...5\n\nThe best answer is', input_type=str]
E       For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
```

After: 
test consistently passes


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-21 21:10:24 -08:00
Hardik Shah
b2ac29b9da
fix provider model list test (#800)
Fixes provider tests

```
pytest -v -s -k "together or fireworks or ollama" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py 
```
```
...
.... 
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-together] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-together] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-together] PASSED

================ 21 passed, 6 skipped, 81 deselected, 5 warnings in 32.11s =================
```

Co-authored-by: Hardik Shah <hjshah@fb.com>
2025-01-16 19:27:29 -08:00
Hardik Shah
a51c8b4efc
Convert SamplingParams.strategy to a union (#767)
# What does this PR do?

Cleans up how we provide sampling params. Earlier, strategy was an enum
and all params (top_p, temperature, top_k) across all strategies were
grouped. We now have a strategy union object with each strategy (greedy,
top_p, top_k) having its corresponding params.
Earlier, 
```
class SamplingParams: 
    strategy: enum ()
    top_p, temperature, top_k and other params
```
However, the `strategy` field was not being used in any providers making
it confusing to know the exact sampling behavior purely based on the
params since you could pass temperature, top_p, top_k and how the
provider would interpret those would not be clear.

Hence we introduced -- a union where the strategy and relevant params
are all clubbed together to avoid this confusion.

Have updated all providers, tests, notebooks, readme and otehr places
where sampling params was being used to use the new format.
   

## Test Plan
`pytest llama_stack/providers/tests/inference/groq/test_groq_utils.py`
// inference on ollama, fireworks and together 
`with-proxy pytest -v -s -k "ollama"
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/inference/test_text_inference.py `
// agents on fireworks 
`pytest -v -s -k 'fireworks and create_agent'
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/agents/test_agents.py
--safety-shield="meta-llama/Llama-Guard-3-8B"`

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [X] Updated relevant documentation.
- [X] Wrote necessary unit or integration tests.

---------

Co-authored-by: Hardik Shah <hjshah@fb.com>
2025-01-15 05:38:51 -08:00
Ashwin Bharambe
d9d34433fc Update spec 2025-01-13 23:16:53 -08:00
Ashwin Bharambe
9a5803a429 move all implementations to use updated type 2025-01-13 23:16:53 -08:00
Aidan Do
fdcc74fda2
[#432] Add Groq Provider - tool calls (#630)
# What does this PR do?

Contributes to issue #432

- Adds tool calls to Groq provider
- Enables tool call integration tests

### PR Train

- https://github.com/meta-llama/llama-stack/pull/609 
- https://github.com/meta-llama/llama-stack/pull/630 👈

## Test Plan
Environment:

```shell
export GROQ_API_KEY=<api-key>

# build.yaml and run.yaml files
wget https://raw.githubusercontent.com/aidando73/llama-stack/9165502582cd7cb178bc1dcf89955b45768ab6c1/build.yaml
wget https://raw.githubusercontent.com/aidando73/llama-stack/9165502582cd7cb178bc1dcf89955b45768ab6c1/run.yaml

# Create environment if not already
conda create --prefix ./envs python=3.10
conda activate ./envs

# Build
pip install -e . && llama stack build --config ./build.yaml --image-type conda

# Activate built environment
conda activate llamastack-groq
```

<details>
<summary>Unit tests</summary>

```shell
# Setup
conda activate llamastack-groq
pytest llama_stack/providers/tests/inference/groq/test_groq_utils.py -vv -k groq -s

# Result
llama_stack/providers/tests/inference/groq/test_groq_utils.py .....................

======================================== 21 passed, 1 warning in 0.05s ========================================
```
</details>

<details>
<summary>Integration tests</summary>

```shell
# Run
conda activate llamastack-groq
pytest llama_stack/providers/tests/inference/test_text_inference.py -k groq -s

# Result
llama_stack/providers/tests/inference/test_text_inference.py .sss.s.ss.sss.s...

========================== 8 passed, 10 skipped, 180 deselected, 7 warnings in 2.73s ==========================
```
</details>

<details>
<summary>Manual</summary>

```bash
llama stack run ./run.yaml --port 5001
```

Via this Jupyter notebook:
9165502582/hello.ipynb
</details>

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation. (no relevant documentation it
seems)
- [x] Wrote necessary unit or integration tests.
2025-01-13 18:17:38 -08:00
Fred Reiss
8b2376bfb3
Add inline vLLM inference provider to regression tests and fix regressions (#662)
# What does this PR do?

This PR adds the inline vLLM inference provider to the regression tests
for inference providers. The PR also fixes some regressions in that
inference provider in order to make the tests pass.


## Test Plan

Command to run the new tests (from root of project):
```
pytest \
    -vvv \
    llama_stack/providers/tests/inference/test_text_inference.py \
    --providers inference=vllm \
    --inference-model meta-llama/Llama-3.2-3B-Instruct \
```

Output of the above command after these changes:
```
/mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=================================================================== test session starts ===================================================================
platform linux -- Python 3.12.7, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.25.0, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items                                                                                                                                         

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED                                          [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] SKIPPED (Other inference providers don't
support completion() yet)                                                                                                                           [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] SKIPPED (Other inference providers
don't support completion() yet)                                                                                                                     [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] SKIPPED (This test is not
quite robust)                                                                                                                                       [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED                       [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] SKIPPED (Other inference providers don't
support structured output yet)                                                                                                                      [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED                           [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED                   [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED         [100%]

======================================================== 5 passed, 4 skipped, 2 warnings in 25.56s ========================================================
Task was destroyed but it is pending!
task: <Task pending name='Task-6' coro=<AsyncLLMEngine.run_engine_loop() running at /mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py:848> cb=[_log_task_completion(error_callback=<bound method...7cfc479440b0>>)() at /mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py:45, shield.<locals>._inner_done_callback() at /mnt/datadisk1/freiss/llama/env/lib/python3.12/asyncio/tasks.py:905]>
[rank0]:[W1219 11:38:34.689424319 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
```

The warning about "asyncio_default_fixture_loop_scope" appears to be due
to my environment having a newer version of pytest-asyncio.

The warning about a pending task appears to be due to a bug in
`vllm.AsyncLLMEngine.shutdown_background_loop()`. It looks like that
method returns without stopping a pending task. I will look into that
issue separately.

## Sources


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [X] Wrote necessary unit or integration tests.
2025-01-10 16:35:16 -08:00
Aidan Do
e1f42eb5a5
[#432] Add Groq Provider - chat completions (#609)
# What does this PR do?

Contributes towards issue (#432)

- Groq text chat completions
- Streaming
- All the sampling params that Groq supports

A lot of inspiration taken from @mattf's good work at
https://github.com/meta-llama/llama-stack/pull/355

**What this PR does not do**

- Tool calls (Future PR)
- Adding llama-guard model
- See if we can add embeddings

### PR Train

- https://github.com/meta-llama/llama-stack/pull/609 👈 
- https://github.com/meta-llama/llama-stack/pull/630


## Test Plan

<details>

<summary>Environment</summary>

```bash
export GROQ_API_KEY=<api_key>

wget https://raw.githubusercontent.com/aidando73/llama-stack/240e6e2a9c20450ffdcfbabd800a6c0291f19288/build.yaml
wget https://raw.githubusercontent.com/aidando73/llama-stack/92c9b5297f9eda6a6e901e1adbd894e169dbb278/run.yaml

# Build and run environment
pip install -e . \
&& llama stack build --config ./build.yaml --image-type conda \
&& llama stack run ./run.yaml \
  --port 5001
```

</details>

<details>

<summary>Manual tests</summary>

Using this jupyter notebook to test manually:
2140976d76/hello.ipynb

Use this code to test passing in the api key from provider_data

```
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5001",
)

response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "Hello, world client!"},
    ],
    # Test passing in groq_api_key from the client
    # Need to comment out the groq_api_key in the run.yaml file
    x_llama_stack_provider_data='{"groq_api_key": "<api-key>"}',
    # stream=True,
)
response
```

</details>

<details>
<summary>Integration</summary>

`pytest llama_stack/providers/tests/inference/test_text_inference.py -v
-k groq`

(run in same environment)

```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-groq] PASSED                 [  6%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-groq] SKIPPED (Other inf...) [ 12%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[llama_3b-groq] SKIPPED [ 18%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-groq] PASSED [ 25%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-groq] SKIPPED (Ot...) [ 31%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-groq] PASSED  [ 37%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-groq] SKIPPED [ 43%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-groq] SKIPPED [ 50%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-groq] PASSED                 [ 56%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-groq] SKIPPED (Other inf...) [ 62%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[llama_8b-groq] SKIPPED [ 68%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-groq] PASSED [ 75%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-groq] SKIPPED (Ot...) [ 81%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-groq] PASSED  [ 87%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-groq] SKIPPED [ 93%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-groq] SKIPPED [100%]

======================================= 6 passed, 10 skipped, 160 deselected, 7 warnings in 2.05s ========================================
```
</details>

<details>
<summary>Unit tests</summary>

`pytest llama_stack/providers/tests/inference/groq/ -v`

```
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_sets_model PASSED            [  5%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_converts_user_message PASSED [ 10%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_converts_system_message PASSED [ 15%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_converts_completion_message PASSED [ 20%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_does_not_include_logprobs PASSED [ 25%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_does_not_include_response_format PASSED [ 30%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_does_not_include_repetition_penalty PASSED [ 35%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_stream PASSED       [ 40%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_n_is_1 PASSED                [ 45%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_if_max_tokens_is_0_then_it_is_not_included PASSED [ 50%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_max_tokens_if_set PASSED [ 55%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_temperature PASSED  [ 60%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_top_p PASSED        [ 65%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertNonStreamChatCompletionResponse::test_returns_response PASSED [ 70%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertNonStreamChatCompletionResponse::test_maps_stop_to_end_of_message PASSED [ 75%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertNonStreamChatCompletionResponse::test_maps_length_to_end_of_message PASSED [ 80%]
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertStreamChatCompletionResponse::test_returns_stream PASSED [ 85%]
llama_stack/providers/tests/inference/groq/test_init.py::TestGroqInit::test_raises_runtime_error_if_config_is_not_groq_config PASSED [ 90%]
llama_stack/providers/tests/inference/groq/test_init.py::TestGroqInit::test_returns_groq_adapter PASSED                            [ 95%]
llama_stack/providers/tests/inference/groq/test_init.py::TestGroqConfig::test_api_key_defaults_to_env_var PASSED                   [100%]

==================================================== 20 passed, 11 warnings in 0.08s =====================================================
```

</details>

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation
- [x] Wrote necessary unit or integration tests.
2025-01-03 08:27:49 -08:00
Aidan Do
5d7b611336
Add JSON structured outputs to Ollama Provider (#680)
# What does this PR do?

Addresses issue #679

- Adds support for the response_format field for chat completions and
completions so users can get their outputs in JSON

## Test Plan

<details>

<summary>Integration tests</summary>

`pytest
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output
-k ollama -s -v`

```python
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-ollama] PASSED

================================== 2 passed, 18 deselected, 3 warnings in 41.41s ==================================
```

</details>

<details>
<summary>Manual Tests</summary>

```
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export OLLAMA_INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=5000

ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

```python
    client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

    MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
    prompt =f"""
        Create a step by step plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.
        You have 3 different operations you can perform. You can create a file, update a file, or delete a file.
        Limit your step by step plan to only these operations per step.
        Don't create more than 10 steps.

        Please ensure there's a README.md file in the root of the codebase that describes the codebase and how to run it.
        Please ensure there's a requirements.txt file in the root of the codebase that describes the dependencies of the codebase.
        """
    response = client.inference.chat_completion(
        model_id=MODEL_ID,
        messages=[
            {"role": "user", "content": prompt},
        ],
        sampling_params={
            "max_tokens": 200000,
        },
        response_format={
            "type": "json_schema",
            "json_schema": {
                "$schema": "http://json-schema.org/draft-07/schema#",
                "title": "Plan",
                "description": f"A plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.",
                "type": "object",
                "properties": {
                    "steps": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["steps"],
                "additionalProperties": False,
            }
        },
        stream=True,
    )

    content = ""
    for chunk in response:
        if chunk.event.delta:
            print(chunk.event.delta, end="", flush=True)
            content += chunk.event.delta

    try:
        plan = json.loads(content)
        print(plan)
    except Exception as e:
        print(f"Error parsing plan into JSON: {e}")
        plan = {"steps": []}
```

Outputs:

```json
{
    "steps": [
        "Update the requirements.txt file to include the updated dependencies specified in the peer's feedback, including the Google Cloud Translation API key.",
        "Update the app.py file to address the code smells and incorporate the suggested improvements, such as handling errors and exceptions, initializing the Translator object correctly, adding input validation, using type hints and docstrings, and removing unnecessary logging statements.",
        "Create a README.md file that describes the codebase and how to run it.",
        "Ensure the README.md file is up-to-date and accurate.",
        "Update the requirements.txt file to reflect any additional dependencies specified by the peer's feedback.",
        "Add documentation for each function in the app.py file using docstrings.",
        "Implement logging statements throughout the app.py file to monitor application execution.",
        "Test the API endpoint to ensure it correctly translates text from English to French and handles errors properly.",
        "Refactor the code to follow PEP 8 style guidelines and ensure consistency in naming conventions, indentation, and spacing.",
        "Create a new folder for logs and add a logging configuration file (e.g., logconfig.json) that specifies the logging level and output destination.",
        "Deploy the web server on a production environment (e.g., AWS Elastic Beanstalk or Google Cloud Platform) to make it accessible to external users."
    ]
}
```


</details>

## Sources

- Ollama api docs:
https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
- Ollama structured output docs:
https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2025-01-02 09:05:51 -08:00
Xi Yan
3c72c034e6
[remove import *] clean up import *'s (#689)
# What does this PR do?

- as title, cleaning up `import *`'s
- upgrade tests to make them more robust to bad model outputs
- remove import *'s in llama_stack/apis/* (skip __init__ modules)
<img width="465" alt="image"
src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2"
/>

- run `sh run_openapi_generator.sh`, no types gets affected

## Test Plan

### Providers Tests

**agents**
```
pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8
```

**inference**
```bash
# meta-reference
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

# together
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py 
```

**safety**
```
pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B
```

**memory**
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384
```

**scoring**
```
pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
```


**datasetio**
```
pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py
pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py
```


**eval**
```
pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py
```

### Client-SDK Tests
```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk
```

### llama-stack-apps
```
PORT=5000
LOCALHOST=localhost

python -m examples.agents.hello $LOCALHOST $PORT
python -m examples.agents.inflation $LOCALHOST $PORT
python -m examples.agents.podcast_transcript $LOCALHOST $PORT
python -m examples.agents.rag_as_attachments $LOCALHOST $PORT
python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT
python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT
python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT

# Vision model
python -m examples.interior_design_assistant.app
python -m examples.agent_store.app $LOCALHOST $PORT
```

### CLI
```
which llama
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
llama model list
llama stack list-apis
llama stack list-providers inference

llama stack build --template ollama --image-type conda
```

### Distributions Tests
**ollama**
```
llama stack build --template ollama --image-type conda
ollama run llama3.2:1b-instruct-fp16
llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
```

**fireworks**
```
llama stack build --template fireworks --image-type conda
llama stack run ./llama_stack/templates/fireworks/run.yaml
```

**together**
```
llama stack build --template together --image-type conda
llama stack run ./llama_stack/templates/together/run.yaml
```

**tgi**
```
llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-27 15:45:44 -08:00
Matthew Farrellee
2a9b13dd52
add test for completion logprobs (#532)
# What does this PR do?

adds a test for the completion api's logprobs parameter

tbd which providers pass this test


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2024-12-12 12:19:48 -08:00
Matthew Farrellee
b52df5fe5b
add completion api support to nvidia inference provider (#533)
# What does this PR do?

add the completion api to the nvidia inference provider


## Test Plan

while running the meta/llama-3.1-8b-instruct NIM from
https://build.nvidia.com/meta/llama-3_1-8b-instruct?snippet_tab=Docker

```
➜ pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_BASE_URL=http://localhost:8000 -k test_completion --inference-model Llama3.1-8B-Instruct
=============================================== test session starts ===============================================
platform linux -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 -- /home/matt/.conda/envs/stack/bin/python
cachedir: .pytest_cache
rootdir: /home/matt/Documents/Repositories/meta-llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0, httpx-0.34.0
asyncio: mode=strict, default_loop_scope=None
collected 20 items / 18 deselected / 2 selected                                                                             

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-nvidia] SKIPPED

============================= 1 passed, 1 skipped, 18 deselected, 6 warnings in 5.40s =============================
```

the structured output functionality works but the accuracy fails

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2024-12-11 10:08:38 -08:00
Aidan Do
095125e463
[#391] Add support for json structured output for vLLM (#528)
# What does this PR do?

Addresses issue (#391)

- Adds json structured output for vLLM
- Enables structured output tests for vLLM

> Give me a recipe for Spaghetti Bolognaise:

```json
{
  "recipe_name": "Spaghetti Bolognaise",
  "preamble": "Ah, spaghetti bolognaise - the quintessential Italian dish that fills my kitchen with the aromas of childhood nostalgia. As a child, I would watch my nonna cook up a big pot of spaghetti bolognaise every Sunday, filling our small Italian household with the savory scent of simmering meat and tomatoes. The way the sauce would thicken and the spaghetti would al dente - it was love at first bite. And now, as a chef, I want to share that same love with you, so you can recreate these warm, comforting memories at home.",
  "ingredients": [
    "500g minced beef",
    "1 medium onion, finely chopped",
    "2 cloves garlic, minced",
    "1 carrot, finely chopped",
    " celery, finely chopped",
    "1 (28 oz) can whole peeled tomatoes",
    "1 tbsp tomato paste",
    "1 tsp dried basil",
    "1 tsp dried oregano",
    "1 tsp salt",
    "1/2 tsp black pepper",
    "1/2 tsp sugar",
    "1 lb spaghetti",
    "Grated Parmesan cheese, for serving",
    "Extra virgin olive oil, for serving"
  ],
  "steps": [
    "Heat a large pot over medium heat and add a generous drizzle of extra virgin olive oil.",
    "Add the chopped onion, garlic, carrot, and celery and cook until the vegetables are soft and translucent, about 5-7 minutes.",
    "Add the minced beef and cook until browned, breaking it up with a spoon as it cooks.",
    "Add the tomato paste and cook for 1-2 minutes, stirring constantly.",
    "Add the canned tomatoes, dried basil, dried oregano, salt, black pepper, and sugar. Stir well to combine.",
    "Bring the sauce to a simmer and let it cook for 20-30 minutes, stirring occasionally, until the sauce has thickened and the flavors have melded together.",
    "While the sauce cooks, bring a large pot of salted water to a boil and cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.",
    "Add the reserved pasta water to the sauce and stir to combine.",
    "Combine the cooked spaghetti and sauce, tossing to coat the pasta evenly.",
    "Serve hot, topped with grated Parmesan cheese and a drizzle of extra virgin olive oil.",
    "Enjoy!"
  ]
}
```

Generated with Llama-3.2-3B-Instruct model - pretty good for a 3B
parameter model 👍

## Test Plan

`pytest -v -s
llama_stack/providers/tests/inference/test_text_inference.py -k
llama_3b-vllm_remote`

With the following setup:

```bash
# Environment
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export INFERENCE_PORT=8000
export VLLM_URL=http://localhost:8000/v1

# vLLM server
sudo docker run --gpus all \
    -v $STORAGE_DIR/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)" \
    -p 8000:$INFERENCE_PORT \
    --ipc=host \
    --net=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model $INFERENCE_MODEL

# llama-stack server
llama stack build --template remote-vllm --image-type conda && llama stack run distributions/remote-vllm/run.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```

Results:

```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-vllm_remote] PASSED

================================ 6 passed, 2 skipped, 120 deselected, 2 warnings in 13.26s ================================
```

## Sources

- https://github.com/vllm-project/vllm/discussions/8300
- By default, vLLM uses https://github.com/dottxt-ai/outlines for
structured outputs
[[1](32e7db2536/vllm/engine/arg_utils.py (L279-L280))]

## Before submitting

[N/A] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case)

- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?

[N/A?] Updated relevant documentation. Couldn't find any relevant
documentation. Lmk if I've missed anything.

- [x] Wrote necessary unit or integration tests.
2024-12-08 15:02:51 -08:00
Henry Tu
64c6df8392
Cerebras Inference Integration (#265)
Adding Cerebras Inference as an API provider.

## Testing

### Conda
```
$ llama stack build --template cerebras --image-type conda
$ llama stack run ~/.llama/distributions/llamastack-cerebras/cerebras-run.yaml
...
Listening on ['::', '0.0.0.0']:5000
INFO:     Started server process [12443]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
```

### Chat Completion
```
$ curl --location 'http://localhost:5000/alpha/inference/chat-completion' --header 'Content-Type: application/json' --data '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is the temperature in Seattle right now?"
        }
    ],
    "stream": false,
    "sampling_params": {
        "strategy": "top_p",
        "temperature": 0.5,
        "max_tokens": 100
    },                   
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tools": [                   
        {
            "tool_name": "getTemperature",
            "description": "Gets the current temperature of a location.",
            "parameters": {                                              
                "location": {
                    "param_type": "string",
                    "description": "The name of the place to get the temperature from in degress celsius.",
                    "required": true                                                                       
                }                   
            }    
        }    
    ]    
}' 
```

#### Non-Streaming Response
```
{
  "completion_message": {
    "role": "assistant",
    "content": "",
    "stop_reason": "end_of_message",
    "tool_calls": [
      {
        "call_id": "6f42fdcc-6cbb-46ad-a17b-5d20ac64b678",
        "tool_name": "getTemperature",
        "arguments": {
          "location": "Seattle"
        }
      }
    ]
  },
  "logprobs": null
}
```

#### Streaming Response
```
data: {"event":{"event_type":"start","delta":"","logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"","parse_status":"started"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"{\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"type","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"function","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"name","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"get","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"Temperature","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"parameters","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" {\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"location","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"Seattle","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\"}}","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":{"call_id":"e742df1f-0ae9-40ad-a49e-18e5c905484f","tool_name":"getTemperature","arguments":{"location":"Seattle"}},"parse_status":"success"},"logprobs":null,"stop_reason":"end_of_message"}}
data: {"event":{"event_type":"complete","delta":"","logprobs":null,"stop_reason":"end_of_message"}}
```

### Completion
```
$ curl --location 'http://localhost:5000/alpha/inference/completion' --header 'Content-Type: application/json' --data '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "content": "1,2,3,",
    "stream": true,
    "sampling_params": {
        "strategy": "top_p",
        "temperature": 0.5,
        "max_tokens": 10
    },                   
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tools": [                   
        {
            "tool_name": "getTemperature",
            "description": "Gets the current temperature of a location.",
            "parameters": {                                              
                "location": {
                    "param_type": "string",
                    "description": "The name of the place to get the temperature from in degress celsius.",
                    "required": true                                                                       
                }                   
            }    
        }    
    ]    
}'
```

#### Non-Streaming Response
```
{
  "content": "4,5,6,7,8,",
  "stop_reason": "out_of_tokens",
  "logprobs": null
}
```

#### Streaming Response
```
data: {"delta":"4","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"5","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"6","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"7","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"8","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"","stop_reason":null,"logprobs":null}
data: {"delta":"","stop_reason":"out_of_tokens","logprobs":null}
```

### Pre-Commit Checks
```
trim trailing whitespace.................................................Passed
check python ast.........................................................Passed
check for merge conflicts................................................Passed
check for added large files..............................................Passed
fix end of files.........................................................Passed
Insert license in comments...............................................Passed
flake8...................................................................Passed
Format files with µfmt...................................................Passed
```

### Testing with `test_inference.py`
```
$ export CEREBRAS_API_KEY=<insert API key here>
$ pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -m "cerebras and llama_8b" 
/net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=================================================== test session starts ===================================================
platform linux -- Python 3.12.3, pytest-8.3.3, pluggy-1.5.0 -- /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 128 items / 120 deselected / 8 selected                                                                         

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-cerebras] Resolved 4 providers
 inner-inference => cerebras
 models => __routing_table__
 inference => __autorouted__
 inspect => __builtin__

Models: meta-llama/Llama-3.1-8B-Instruct served by cerebras

PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_8b-cerebras] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-cerebras] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-cerebras] PASSED

================================ 6 passed, 2 skipped, 120 deselected, 6 warnings in 3.95s =================================
```

I ran `python llama_stack/scripts/distro_codegen.py` to run codegen.
2024-12-03 21:15:32 -08:00
Matthew Farrellee
435f34b05e
reduce the accuracy requirements to pass the chat completion structured output test (#522)
i find `test_structured_output` to be flakey. it's both a functionality
and accuracy test -

```
        answer = AnswerFormat.model_validate_json(response.completion_message.content)
        assert answer.first_name == "Michael"
        assert answer.last_name == "Jordan"
        assert answer.year_of_birth == 1963
        assert answer.num_seasons_in_nba == 15
```

it's an accuracy test because it checks the value of first/last name,
birth year, and num seasons.

i find that -
- llama-3.1-8b-instruct and llama-3.2-3b-instruct pass the functionality
portion
- llama-3.2-3b-instruct consistently fails the accuracy portion
(thinking MJ was in the NBA for 14 seasons)
 - llama-3.1-8b-instruct occasionally fails the accuracy portion

suggestions (not mutually exclusive) -
1. turn the test into functionality only, skip the value checks
2. split the test into a functionality version and an xfail accuracy
version
3. add context to the prompt so the llm can answer without accessing
embedded memory

# What does this PR do?

implements option (3) by adding context to the system prompt.


## Test Plan


`pytest -s -v ... llama_stack/providers/tests/inference/ ... -k
structured_output`


## Before submitting

- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2024-12-03 02:55:14 -08:00
Matthew Farrellee
4e6c984c26
add NVIDIA NIM inference adapter (#355)
# What does this PR do?

this PR adds a basic inference adapter to NVIDIA NIMs

what it does -
 - chat completion api
   - tool calls
   - streaming
   - structured output
   - logprobs
 - support hosted NIM on integrate.api.nvidia.com
 - support downloaded NIM containers

what it does not do -
 - completion api
 - embedding api
 - vision models
 - builtin tools
 - have certainty that sampling strategies are correct

## Feature/Issue validation/testing/test plan

`pytest -s -v --providers inference=nvidia
llama_stack/providers/tests/inference/ --env NVIDIA_API_KEY=...`

all tests should pass. there are pydantic v1 warnings.


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests?

Thanks for contributing 🎉!
2024-11-23 15:59:00 -08:00
Ashwin Bharambe
d790be28b3 Don't skip meta-reference for the tests 2024-11-21 13:29:53 -08:00
Ashwin Bharambe
05d1ead02f Update condition in tests to handle llama-3.1 vs llama3.1 (HF names) 2024-11-19 13:25:36 -08:00
Ashwin Bharambe
12947ac19e
Kill "remote" providers and fix testing with a remote stack properly (#435)
# What does this PR do?

This PR kills the notion of "pure passthrough" remote providers. You
cannot specify a single provider you must specify a whole distribution
(stack) as remote.

This PR also significantly fixes / upgrades testing infrastructure so
you can now test against a remotely hosted stack server by just doing

```bash
pytest -s -v -m remote  test_agents.py \
  --inference-model=Llama3.1-8B-Instruct --safety-shield=Llama-Guard-3-1B \
  --env REMOTE_STACK_URL=http://localhost:5001
```

Also fixed `test_agents_persistence.py` (which was broken) and killed
some deprecated testing functions.

## Test Plan

All the tests.
2024-11-12 21:51:29 -08:00
Dinesh Yeduguru
fdff24e77a
Inference to use provider resource id to register and validate (#428)
This PR changes the way model id gets translated to the final model name
that gets passed through the provider.
Major changes include:
1) Providers are responsible for registering an object and as part of
the registration returning the object with the correct provider specific
name of the model provider_resource_id
2) To help with the common look ups different names a new ModelLookup
class is created.



Tested all inference providers including together, fireworks, vllm,
ollama, meta reference and bedrock
2024-11-12 20:02:00 -08:00
Dinesh Yeduguru
ec644d3418
migrate model to Resource and new registration signature (#410)
* resource oriented object design for models

* add back llama_model field

* working tests

* register singature fix

* address feedback

---------

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-08 16:12:57 -08:00
Ashwin Bharambe
7c340f0236 rename test_inference -> test_text_inference 2024-11-06 16:12:50 -08:00
Renamed from llama_stack/providers/tests/inference/test_inference.py (Browse further)