The `--image-name __system__` option was a hack, and a bad one at that.
The actual intent was to automatically detect the notebook environment
so that we could keep confusing options out of the `llama stack build`
command line. That detection failed, which led us to fall back on
`__system__`.
Let's just do the simple thing.
Note that I haven't changed `build_venv.sh` for now, so it still honors
the special `__system__` name; new users just should not use it.
## Test Plan
Open the notebooks from this branch in Colab (see example url below) and
ensure the builds work.
https://colab.research.google.com/github/meta-llama/llama-stack/blob/foo/docs/getting_started.ipynb
In the notebook, install llama-stack from this branch directly using:
```
!pip install -U https://github.com/meta-llama/llama-stack/archive/refs/heads/foo.zip
```
Verify that `!UV_SYSTEM_PYTHON=1 llama stack build --template together
--image-type venv` afterwards succeeds and the library client
initialization also works.
# Summary:
Right now we include toolgroup args when we encode messages with
tool_calls, which confuses the model since they are not in the function
description (see test plan for example).
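A minimal sketch of the idea, assuming the fix amounts to keeping only the arguments that the tool's function description declares before the call is encoded. The helper name `strip_undeclared_args` and its `declared_params` parameter are hypothetical and not taken from the codebase.

```python
from typing import Any, Dict, Iterable


def strip_undeclared_args(
    arguments: Dict[str, Any], declared_params: Iterable[str]
) -> Dict[str, Any]:
    """Keep only arguments that the tool's function description declares.

    Hypothetical helper: the real encoding path may filter differently.
    """
    allowed = set(declared_params)
    return {name: value for name, value in arguments.items() if name in allowed}


# Arguments as they might look after toolgroup args were merged in.
call_args = {
    "query": "Laleli Mosque and Esma Sultan Mansion same neighborhood",
    "vector_db_ids": ["829a68735d744dc3830409dcc782964a"],
}
encoded_args = strip_undeclared_args(call_args, declared_params=["query"])
print(encoded_args)
```

With this filter the encoded call carries only `query`, which matches the "After" prompt in the test plan.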
# Test Plan:
Add a print statement before raw prompt is sent to providers (no good
way to test this currently)
Before:
```
cated in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood", vector_db_ids=["829a68735d744dc3830409dcc782964a"])]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found 5 chunks:\nBEGIN of
```
Note the extra `vector_db_ids`
After
```
>user<|end_header_id|>\n\nAre the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood")]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found
```
Groq has never supported raw completions anyhow, so this makes it easier
to switch it to LiteLLM. Our entire test suite passes.
I also updated all the OpenAI-compatible providers so they work with API
keys passed from headers (`provider_data`).
## Test Plan
```bash
LLAMA_STACK_CONFIG=groq \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```
Also tested (openai, anthropic, gemini) providers. No regressions.
# What does this PR do?
The tool format depends on the model. @ehhuang introduced a
`get_default_tool_prompt_format` function for this purpose. We should
use that instead of the hacky model ID matching we had before.
Secondly, non-Llama models don't have this concept, so testing with those
models should work as is.
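To illustrate why a lookup beats string matching on model IDs, here is a rough sketch. It does not reproduce `get_default_tool_prompt_format`; `PromptFormat`, `default_prompt_format`, and the family-to-format table are all hypothetical stand-ins chosen for the example.

```python
from enum import Enum
from typing import Optional


class PromptFormat(Enum):
    JSON = "json"
    PYTHON_LIST = "python_list"


# Illustrative table only; the real per-model defaults live in the codebase.
_DEFAULTS_BY_FAMILY = {
    "llama3_1": PromptFormat.JSON,
    "llama3_2": PromptFormat.PYTHON_LIST,
}


def default_prompt_format(model_family: Optional[str]) -> Optional[PromptFormat]:
    """Return the family default, or None for models without this concept."""
    if model_family is None:
        return None
    return _DEFAULTS_BY_FAMILY.get(model_family)


print(default_prompt_format("llama3_1"))
print(default_prompt_format(None))
```

Returning `None` for unknown families lets tests with non-Llama models skip the format entirely instead of guessing from the model ID.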
## Test Plan
```bash
for distro in fireworks ollama; do
  LLAMA_STACK_CONFIG=$distro \
    pytest -s -v tests/client-sdk/inference/test_text_inference.py \
    --inference-model=meta-llama/Llama-3.2-3B-Instruct \
    --vision-inference-model=""
done
LLAMA_STACK_CONFIG=dev \
  pytest -s -v tests/client-sdk/inference/test_text_inference.py \
  --inference-model=openai/gpt-4o \
  --vision-inference-model=""
```
# What does this PR do?
Model context protocol (MCP) allows for remote tools to be connected
with Agents. The current Ollama provider does not support it. This PR
adds necessary code changes to ensure that the integration between
Ollama backend and MCP works.
This PR is an extension of #816 for Ollama.
## Test Plan
1. Run llama-stack server with the command:
```
llama stack build --template ollama --image-type conda
llama stack run ./templates/ollama/run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
```
2. Run the sample client agent with MCP tool:
```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types.shared_params.url import URL
from llama_stack_client import LlamaStackClient
from termcolor import cprint

## Start the local MCP server
# git clone https://github.com/modelcontextprotocol/python-sdk
# Follow instructions to get the env ready
# cd examples/servers/simple-tool
# uv run mcp-simple-tool --transport sse --port 8000

# Connect to the llama stack server
base_url = "http://localhost:8321"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
client = LlamaStackClient(base_url=base_url)

# Register MCP tools
client.toolgroups.register(
    toolgroup_id="mcp::filesystem",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="http://localhost:8000/sse"),
)

# Define an agent with MCP toolgroup
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    toolgroups=["mcp::filesystem"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Fetch content from https://www.google.com and print the response"
]

# Run a session with the agent
session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()
```
# Documentation
The file docs/source/distributions/self_hosted_distro/ollama.md is
updated to indicate the MCP tool runtime availability.
Signed-off-by: Shreyanand <shanand@redhat.com>
This is a follow up to:
https://github.com/meta-llama/llama-stack/pull/1140
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Avoid unnecessary GPU memory clean attempt when the GPU is not used for
training.
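As a rough sketch of the guard described above (not the actual provider code), cleanup can be gated on the device type. The function name `cleanup_after_training` and both of its parameters are hypothetical.

```python
from typing import Callable


def cleanup_after_training(
    device_type: str, free_gpu_memory: Callable[[], None]
) -> bool:
    """Run the GPU cleanup hook only when training used a CUDA device.

    Returns True if the hook ran. Both parameter names are hypothetical.
    """
    if device_type != "cuda":
        return False
    free_gpu_memory()
    return True


calls = []
ran_on_cpu = cleanup_after_training("cpu", lambda: calls.append("cleanup"))
ran_on_cuda = cleanup_after_training("cuda", lambda: calls.append("cleanup"))
print(ran_on_cpu, ran_on_cuda, calls)
```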
## Test Plan
With CPU:
```
INFO 2025-02-26 16:43:56,267 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 16:43:56,274 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```
With CUDA:
```
INFO 2025-02-26 21:39:24,314 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 21:39:24,333 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# Summary:
Our tests sometimes error out with
```
========================== 11 passed, 342 warnings in 58.86s ==========================
Error exporting span to SQLite: Cannot operate on a closed database.
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x000000012af04280)
Current thread 0x00000001fa29c240 (most recent call first):
<no Python frame>
```
We can usually reproduce this by running the tests 10 times.
The proposed fix is to use a thread-safe variable when creating the
SQLite connection, to ensure each connection is only used by one thread.
Not 100% sure this is the fix, but I am not able to reproduce the error
with it.
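One common way to get a connection per thread is `threading.local`. The sketch below assumes that approach; it is not taken from the PR, and `get_connection` is a hypothetical name.

```python
import sqlite3
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return a SQLite connection owned by the calling thread.

    The path is only used the first time a given thread asks for a connection.
    """
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        _local.conn = conn
    return conn


main_conn = get_connection(":memory:")

# A second thread gets a different connection object.
other = {}
worker = threading.Thread(
    target=lambda: other.setdefault("conn", get_connection(":memory:"))
)
worker.start()
worker.join()
print(other["conn"] is main_conn)
```

Repeated calls from the same thread reuse the cached connection, so no connection is shared across threads.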
# Test Plan:
Ran 10 times and saw no more errors:
```bash
for i in {1..10}; do
  echo "=== Starting Run $i ==="
  LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
  rc=$?  # capture pytest's exit code before another command overwrites $?
  if [[ $rc -ne 0 ]]; then
    echo "=== Run $i FAILED with exit code $rc ==="
    break
  else
    echo "=== Run $i PASSED ==="
  fi
  echo
done
```
Summary:
Lets the model decide which tool it needs to call to respond to a query.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```
Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015
# What does this PR do?
This PR makes a couple of changes required to get the test
`tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search`
passing on the remote-vllm provider.
First, we adjust agent_instance to also pass in the description and
parameters of builtin tools. We need these parameters so we can pass the
tool's expected parameters into vLLM. The meta-reference implementations
may not have needed these for builtin tools, as they are able to take
advantage of the Llama-model specific support for certain builtin tools.
However, with vLLM, our server-side chat templates for tool calling
treat all tools the same and don't separate out Llama builtin vs custom
tools. So, we need to pass the full set of parameter definitions and
list of required parameters for builtin tools as well.
Next, we adjust the vllm streaming chat completion code to fix up some
edge cases where it was returning an extra ChatCompletionResponseEvent
with an empty ToolCall with empty string call_id, tool_name, and
arguments properties. This is a bug discovered after the above fix,
where after a successful tool invocation we were sending extra chunks
back to the client with these empty ToolCalls.
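The second fix can be pictured as a filter over streamed tool calls. This is a minimal sketch under that assumption: `ToolCallChunk` and `drop_empty_tool_calls` are hypothetical stand-ins, not the types used by the vLLM provider.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ToolCallChunk:
    call_id: str
    tool_name: str
    arguments: str


def drop_empty_tool_calls(chunks: Iterable[ToolCallChunk]) -> List[ToolCallChunk]:
    """Discard chunks whose call_id, tool_name, and arguments are all empty."""
    return [c for c in chunks if c.call_id or c.tool_name or c.arguments]


streamed = [
    ToolCallChunk("call-1", "web_search", '{"query": "llama stack"}'),
    ToolCallChunk("", "", ""),  # the spurious trailing chunk described above
]
kept = drop_empty_tool_calls(streamed)
print(len(kept))
```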
## Test Plan
With these changes, the following test that previously failed now
passes:
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search \
--inference-model "meta-llama/Llama-3.2-3B-Instruct"
```
Additionally, I ran the remote-vllm client-sdk and provider inference
tests as below to ensure they all still passed with this change:
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/client-sdk/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct"
```
```
VLLM_URL="http://localhost:8000/v1" \
python -m pytest -s -v \
llama_stack/providers/tests/inference/test_text_inference.py \
--providers "inference=vllm_remote"
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
I think this got accidentally removed as part of
https://github.com/meta-llama/llama-stack/pull/1250. cc @leseb
## Test Plan
After the change, this arg is no longer required.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
## context
Currently, Llama Stack only supports running inference / eval on a
finetuned checkpoint with meta-reference as the inference provider. This
is sub-optimal since meta-reference is pretty slow.
Our vision is that developers can run inference / eval on a finetuned
checkpoint produced by the post training APIs with any inference provider
on the stack. To achieve this, we'd like to define a unified output
checkpoint format for post training providers, so that all inference
providers can respect that format for customized model inference.
By spot-checking how
[ollama](https://github.com/ollama/ollama/blob/main/docs/import.md) and
[fireworks](https://docs.fireworks.ai/models/uploading-custom-models) do
inference on a customized model, we defined the output checkpoint format
as /adapter/adapter_config.json and /adapter/adapter_model.safetensors
(as we only support LoRA post training for now, we begin with an
adapter-only checkpoint).
## test
We kicked off a post training job with the checkpoint format configured
as 'huggingface'. Output files:

We did a proof of concept with ollama to see whether ollama can run
inference on our finetuned checkpoint:
1. Create a Modelfile like
<img width="799" alt="Screenshot 2025-01-22 at 5 04 18 PM"
src="https://github.com/user-attachments/assets/7fca9ac3-a294-44f8-aab1-83852c600609"
/>
2. Create a customized model with `ollama create llama_3_2_finetuned`
and run inference successfully

This is just a proof of concept with the ollama command line. As a next
step, we'd like to wrap the logic for loading and running inference on
customized models in the inference provider implementation.
# What does this PR do?
This PR introduces more non-llama model support to llama stack.
Providers introduced: openai, anthropic and gemini. All of these
providers use essentially the same piece of code -- the implementation
works via the `litellm` library.
We will expose only specific models for providers we enable making sure
they all work well and pass tests. This setup (instead of automatically
enabling _all_ providers and models allowed by LiteLLM) ensures we can
also perform any needed prompt tuning on a per-model basis as needed
(just like we do it for llama models.)
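The allowlist idea can be sketched as a per-provider table of enabled models. The structure and the `is_supported` helper are assumptions for illustration; the model names are the ones exercised in the test plan of this PR.

```python
# Hypothetical allowlist: provider name -> models we have explicitly enabled.
SUPPORTED_MODELS = {
    "openai": {"gpt-4o"},
    "anthropic": {"claude-3-5-sonnet-latest"},
    "gemini": {"gemini-1.5-flash"},
}


def is_supported(model_id: str) -> bool:
    """Check a 'provider/model' identifier against the allowlist."""
    provider, _, name = model_id.partition("/")
    return name in SUPPORTED_MODELS.get(provider, set())


print(is_supported("openai/gpt-4o"))
print(is_supported("openai/some-untested-model"))
```

Anything not listed is rejected, even if the underlying library could route it, which keeps the tested surface explicit.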
## Test Plan
```bash
#!/bin/bash
args=("$@")
for model in openai/gpt-4o anthropic/claude-3-5-sonnet-latest gemini/gemini-1.5-flash; do
  LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/inference/test_text_inference.py \
    --embedding-model=all-MiniLM-L6-v2 \
    --vision-inference-model="" \
    --inference-model=$model "${args[@]}"
done
```
# What does this PR do?
to the llama-stack-client-swift repo - PR:
https://github.com/meta-llama/llama-stack-client-swift/pull/22
# What does this PR do?
Actually, an incorrect token will also hit `RepositoryNotFoundError`,
e.g.
```
$ llama model download --source huggingface --model-id Llama3.2-1B-Instruct:int4-qlora-eo8 --hf-token xx ### xx is incorrect token
----RepositoryNotFoundError--->
usage: llama model download [-h] [--source {meta,huggingface}] [--model-id MODEL_ID]
[--hf-token HF_TOKEN] [--meta-url META_URL]
[--max-parallel MAX_PARALLEL] [--ignore-patterns IGNORE_PATTERNS]
[--manifest-file MANIFEST_FILE]
llama model download: error: Repository 'meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' not found on the Hugging Face Hub.
so update to:
llama model download --source huggingface --model-id Llama3.2-1B-Instruct:int4-qlora-eo8 --hf-token xx
----RepositoryNotFoundError--->
usage: llama model download [-h] [--source {meta,huggingface}] [--model-id MODEL_ID]
[--hf-token HF_TOKEN] [--meta-url META_URL]
[--max-parallel MAX_PARALLEL] [--ignore-patterns IGNORE_PATTERNS]
[--manifest-file MANIFEST_FILE]
llama model download: error: Repository 'meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' not found on the Hugging Face Hub or incorrect Hugging Face token.
```
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
If `headers` is not passed, the first row is displayed empty, which
might also break the second row. This change makes the `Model` row the
`headers`.
```
Before:
$ llama model describe -m Llama3.1-70B
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ┃ ┃ <<<---------
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Model │ Llama3.1-70B │ <<<---------
├─────────────────────────────┼────────────────────────────────┤
│ Hugging Face ID │ meta-llama/Llama-3.1-70B │
├─────────────────────────────┼────────────────────────────────┤
│ Description │ Llama 3.1 70b model │
├─────────────────────────────┼────────────────────────────────┤
......
after:
$ llama model describe -m Llama3.1-70B
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Model ┃ Llama3.1-70B ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hugging Face ID │ meta-llama/Llama-3.1-70B │
├─────────────────────────────┼────────────────────────────────┤
│ Description │ Llama 3.1 70b model │
├─────────────────────────────┼────────────────────────────────┤
......
```
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
- Introduced logging in `StackRun` to replace print-based messages
- Improved error handling for config file loading and parsing
- Replaced `cprint` with `logger.error` for consistent error messaging
- Ensured logging is used in `server.py` for startup, shutdown, and
runtime messages
- Added missing exception handling for invalid providers
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Create a distribution template using Groq as inference provider.
Link to issue: https://github.com/meta-llama/llama-stack/issues/958
## Test Plan
Run `python llama_stack/scripts/distro_codegen.py` to generate run.yaml
and build.yaml
Test the newly created template by running
`llama stack build --template <template-name>`
`llama stack run <template-name>`
# What does this PR do?
Currently, build_venv.sh expects a `distribution_type` as the first
argument, but the only things ever passed are:
1. image name
2. pip dependencies
So `distribution_type` is never passed in, meaning the script errors
when calling something like:
`llama stack build --image-type venv --template ollama --image-name test`
before output:
```
llama stack build --image-type venv --template ollama --image-name venv-test
Usage: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> <env_name> <pip_dependencies> [<special_pip_deps>]
Example: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> mybuild ./my-stack-build.yaml 'numpy pandas scipy'
Failed to build target venv-test with return code 1
Run config path is empty
```
after:
```
llama stack build --image-type venv --template ollama --image-name venv-test
Environment 'venv-test' already exists, re-using it.
Using virtual environment venv-test
Using CPython 3.13.0 interpreter at: /opt/homebrew/opt/python@3.13/bin/python3.13
Creating virtual environment at: venv-test
Activate with: source venv-test/bin/activate
Using Python 3.13.0 environment at: venv-test
Resolved 55 packages in 640ms
Built fire==0.7.0
Prepared 54 packages in 1.14s
Installed 55 packages in 82ms
+ annotated-types==0.7.0
```
## Test Plan
ran locally with output above
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- Fixed type hinting and missing imports across multiple modules.
- Improved compatibility by using `TYPE_CHECKING` for conditional
imports.
- Updated `pyproject.toml` to enforce stricter linting.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When there are issues with the tool call function, an exception is
raised but the error message is not informative. This adds a clearer
message to tell users to check their functions.
```
Traceback (most recent call last):
File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/distribution/server/server.py", line 208, in sse_generator
async for item in event_gen:
File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agents.py", line 165, in _create_agent_turn_streaming
async for event in agent.create_and_execute_turn(request):
File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 197, in create_and_execute_turn
async for chunk in self.run(
File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 389, in run
async for res in self._run(
File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 811, in _run
content=tool_result.content,
AttributeError: 'NoneType' object has no attribute 'content'
```
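A minimal sketch of the kind of guard that avoids the `AttributeError` above, assuming the tool result can come back as `None`. The helper `require_tool_result` and the wording of its message are hypothetical; the PR's actual message is not quoted here.

```python
from typing import Any, Optional


def require_tool_result(tool_name: str, tool_result: Optional[Any]) -> Any:
    """Fail early with an actionable message when a tool returns nothing."""
    if tool_result is None:
        raise ValueError(
            f"Tool '{tool_name}' returned no result; "
            "check that the function runs and returns a value."
        )
    return tool_result


try:
    require_tool_result("my_tool", None)
except ValueError as err:
    message = str(err)
    print(message)
```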
## Test Plan
Ran the same script and exception is raised with clearer error message.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Summary:
kotlin SDK expects this format
Test Plan:
python prints the expected format
>>> str(datetime.now().astimezone())
'2025-02-24 22:02:58.729763-08:00'
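For comparison, the sketch below shows how a naive and a timezone-aware `datetime` differ when converted with `str`. The fixed `-08:00` offset is chosen only to reproduce the value above; `datetime.now().astimezone()` would pick up the local offset instead.

```python
from datetime import datetime, timedelta, timezone

# Same wall-clock time, without and with a UTC offset attached.
naive = datetime(2025, 2, 24, 22, 2, 58, 729763)
aware = naive.replace(tzinfo=timezone(timedelta(hours=-8)))

print(str(naive))  # 2025-02-24 22:02:58.729763
print(str(aware))  # 2025-02-24 22:02:58.729763-08:00
```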
# What does this PR do?
Now that Llama Stack supports running in venv, conda, and container
modes and the three scripts overlap a lot, combine them into one
`start_stack.sh` script.
## Test Plan
Tested this locally on venv, conda, and container.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Summary:
Currently we don't set the best tool_prompt_format according to the model
as promised.
Test Plan:
Added print around raw model input and inspected manually
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1214).
* #1234
* __->__ #1214
# What does this PR do?
When building providers in a virtual environment or containers, special
pip dependencies may not always be provided (e.g., for Ollama). The
check should only fail if the required number of arguments is missing.
Currently, two arguments are mandatory:
1. Environment name
2. Pip dependencies
Additionally, return statements were replaced with sys.exit(1) in error
conditions to ensure immediate termination on critical failures. Error
handling in the stack build process was also improved to guarantee the
program exits with status 1 when facing configuration issues or build
failures.
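The argument check and exit behavior described above can be sketched as follows. This is an illustration only: `parse_build_args` is a hypothetical name, and the usage string simply mirrors the two mandatory arguments plus the optional third.

```python
import sys
from typing import List, Optional, Tuple


def parse_build_args(argv: List[str]) -> Tuple[str, str, Optional[str]]:
    """Require env name and pip dependencies; special pip deps are optional."""
    if len(argv) < 2:
        print(
            "Usage: <env_name> <pip_dependencies> [<special_pip_deps>]",
            file=sys.stderr,
        )
        sys.exit(1)  # terminate immediately instead of returning
    env_name, pip_dependencies = argv[0], argv[1]
    special_pip_deps = argv[2] if len(argv) > 2 else None
    return env_name, pip_dependencies, special_pip_deps


print(parse_build_args(["venv-test", "numpy pandas"]))
```

Calling it with a single argument exits with status 1, while omitting the third argument is accepted.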
Signed-off-by: Sébastien Han <seb@redhat.com>
## Test Plan
This command shouldn't fail:
```
llama stack build --template ollama --image-type venv
```
# What does this PR do?
`--run` runs the stack that was just built, using the same arguments as
the build process (image-name, type, etc.).
This simplifies the workflow a lot and makes the UX better for most
local users trying to get started, rather than having to match the flags
of the two commands (build and then run).
Also, moved `ImageType` to distribution.utils since there were circular
import errors with its old location.
## Test Plan
tested locally using the following command:
`llama stack build --run --template ollama --image-type venv`
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
`llama model list` and `llama model list --show-all` list many or all of
the models, so add a `search` option to narrow the output.
```
$ llama model list --help
usage: llama model list [-h] [--show-all] [-s SEARCH]
Show available llama models
options:
-h, --help show this help message and exit
--show-all Show all models (not just defaults)
-s SEARCH, --search SEARCH
Search for the input string as a substring in the model descriptor(ID)
$ llama model list -s 70b
+-----------------------+-----------------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo | Context Length |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B | meta-llama/Llama-3.1-70B | 128K |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | 128K |
+-----------------------+-----------------------------------+----------------+
| Llama3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 128K |
+-----------------------+-----------------------------------+----------------+
$ llama model list -s 3.1-8b
+----------------------+----------------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo | Context Length |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B | meta-llama/Llama-3.1-8B | 128K |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 128K |
+----------------------+----------------------------------+----------------+
$ llama model list --show-all -s pro
+----------------------+-----------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo | Context Length |
+----------------------+-----------------------------+----------------+
| Prompt-Guard-86M | meta-llama/Prompt-Guard-86M | 2K |
+----------------------+-----------------------------+----------------+
$ llama model list -s k
Not found for search.
```
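The behavior shown above is consistent with a case-insensitive substring match on the model descriptor (`70b` matches `Llama3.1-70B`). A minimal sketch under that assumption, with `search_models` as a hypothetical name:

```python
from typing import List


def search_models(descriptors: List[str], query: str) -> List[str]:
    """Return descriptors that contain the query, ignoring case."""
    needle = query.lower()
    return [d for d in descriptors if needle in d.lower()]


models = [
    "Llama3.1-8B",
    "Llama3.1-8B-Instruct",
    "Llama3.1-70B",
    "Llama3.1-70B-Instruct",
    "Llama3.3-70B-Instruct",
    "Prompt-Guard-86M",
]
matches = search_models(models, "70b")
print(matches or "Not found for search.")
```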
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
This PR begins the process of supporting non-llama models within Llama
Stack. We start simple by adding support for this functionality within a
few existing providers: fireworks, together and ollama.
## Test Plan
```bash
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model accounts/fireworks/models/phi-3-vision-128k-instruct
```
^ this passes most of the tests but as expected fails the tool calling
related tests since they are very specific to Llama models
```
inference/test_text_inference.py::test_text_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct-completion-01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet do humans live on?-Earth] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What's the name of the Sun in latin?-Sol] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What is the name of the US captial?-Washington] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct] ERROR
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-True] PASSED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-False] PASSED
```
# What does this PR do?
- Enable mypy to run in the CI on a subset of the repository
- Fix a few mypy errors
- Run mypy from pre-commit
Signed-off-by: Sébastien Han <seb@redhat.com>