Commit graph

748 commits

Author SHA1 Message Date
Reid
56c1a50b86
fix: fix the describe table display issue (#1221)
# What does this PR do?

If `headers` is not passed, the first row displays empty and the second row may break; this change passes the `Model` row as the `headers`.
```
Before:
$ llama model describe -m Llama3.1-70B
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                             ┃                                ┃ <<<---------
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Model             │ Llama3.1-70B         │   <<<---------
├─────────────────────────────┼────────────────────────────────┤
│ Hugging Face ID             │ meta-llama/Llama-3.1-70B       │
├─────────────────────────────┼────────────────────────────────┤
│ Description                 │ Llama 3.1 70b model            │
├─────────────────────────────┼────────────────────────────────┤
......

After:
$ llama model describe -m Llama3.1-70B
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                       ┃ Llama3.1-70B                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hugging Face ID             │ meta-llama/Llama-3.1-70B       │
├─────────────────────────────┼────────────────────────────────┤
│ Description                 │ Llama 3.1 70b model            │
├─────────────────────────────┼────────────────────────────────┤
......
```
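
A minimal sketch of the header fix, using `rich` directly (the CLI's own table helper may differ); the point is that the first row becomes the table headers instead of leaving the header row blank:

```python
from rich.console import Console
from rich.table import Table

rows = [
    ("Model", "Llama3.1-70B"),            # first row doubles as the header
    ("Hugging Face ID", "meta-llama/Llama-3.1-70B"),
    ("Description", "Llama 3.1 70b model"),
]

table = Table(*rows[0])                    # headers come from the first row
for row in rows[1:]:
    table.add_row(*row)
Console().print(table)
```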

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-25 21:34:53 -08:00
Sébastien Han
929c5f0842
refactor(server): replace print statements with logger (#1250)
# What does this PR do?

- Introduced logging in `StackRun` to replace print-based messages
- Improved error handling for config file loading and parsing
- Replaced `cprint` with `logger.error` for consistent error messaging
- Ensured logging is used in `server.py` for startup, shutdown, and
runtime messages
- Added missing exception handling for invalid providers
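
A minimal sketch of the print-to-logger swap described above (illustrative names, not the exact `server.py` code):

```python
import logging

logger = logging.getLogger(__name__)


def load_run_config(path: str) -> dict:
    """Load and parse the run config, logging errors instead of printing them."""
    import yaml

    try:
        with open(path) as f:
            return yaml.safe_load(f)
    except FileNotFoundError:
        # Previously a print()/cprint() call; logger keeps error messaging consistent.
        logger.error("Config file not found: %s", path)
        raise
    except yaml.YAMLError as exc:
        logger.error("Could not parse config file %s: %s", path, exc)
        raise
```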

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-25 21:31:37 -08:00
Hardik Shah
c0c7622295
fix: dont assume SentenceTransformer is imported
as titled
2025-02-25 16:53:01 -08:00
Vladislav Bronzov
967cff4533
feat: Add Groq distribution template (#1173)
# What does this PR do?

Create a distribution template using Groq as inference provider.
Link to issue: https://github.com/meta-llama/llama-stack/issues/958


## Test Plan
Run `python llama_stack/scripts/distro_codegen.py` to generate run.yaml
and build.yaml
Test the newly created template by running
`llama stack build --template <template-name>`
`llama stack run <template-name>`
2025-02-25 14:16:56 -08:00
LESSuseLESS
3a31611486
feat: completing text /chat-completion and /completion tests (#1223)
# What does this PR do?

The goal is to have a fairly complete set of provider and e2e tests for
/chat-completion and /completion. This is the current list, generated with:
```
grep -oE "def test_[a-zA-Z_+]*" llama_stack/providers/tests/inference/test_text_inference.py | cut -d' ' -f2
```
- test_model_list
- test_text_completion_non_streaming
- test_text_completion_streaming
- test_text_completion_logprobs_non_streaming
- test_text_completion_logprobs_streaming
- test_text_completion_structured_output
- test_text_chat_completion_non_streaming
- test_text_chat_completion_structured_output
- test_text_chat_completion_streaming
- test_text_chat_completion_with_tool_calling
- test_text_chat_completion_with_tool_calling_streaming

```
grep -oE "def test_[a-zA-Z_+]*" tests/client-sdk/inference/test_text_inference.py | cut -d' ' -f2
```
- test_text_completion_non_streaming
- test_text_completion_streaming
- test_text_completion_log_probs_non_streaming
- test_text_completion_log_probs_streaming
- test_text_completion_structured_output
- test_text_chat_completion_non_streaming
- test_text_chat_completion_streaming
- test_text_chat_completion_with_tool_calling_and_non_streaming
- test_text_chat_completion_with_tool_calling_and_streaming
- test_text_chat_completion_with_tool_choice_required
- test_text_chat_completion_with_tool_choice_none
- test_text_chat_completion_structured_output
- test_text_chat_completion_tool_calling_tools_not_in_request

## Test Plan

== Set up Ollama local server
```
OLLAMA_HOST=127.0.0.1:8321 with-proxy ollama serve
OLLAMA_HOST=127.0.0.1:8321 ollama run llama3.2:3b-instruct-fp16 --keepalive 60m
```

==  Run a provider test
```
conda activate stack
OLLAMA_URL="http://localhost:8321" \
pytest -v -s -k "ollama" --inference-model="llama3.2:3b-instruct-fp16" \
llama_stack/providers/tests/inference/test_text_inference.py::TestInference
```

== Run an e2e test
```
conda activate sherpa
with-proxy pip install llama-stack
export INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=8322
with-proxy llama stack build --template ollama
with-proxy llama stack run --env OLLAMA_URL=http://localhost:8321 ollama
```
```
conda activate stack
LLAMA_STACK_PORT=8322 LLAMA_STACK_BASE_URL="http://localhost:8322" \
pytest -v -s --inference-model="llama3.2:3b-instruct-fp16" \
tests/client-sdk/inference/test_text_inference.py
```
2025-02-25 11:37:04 -08:00
Charlie Doern
9b130f96a7
fix: build_venv expects an extra argument (#1233)
# What does this PR do?


Currently, `build_venv.sh` expects a `distribution_type` as its first argument, but the only things ever passed are:

1. image name
2. pip dependencies

So `distribution_type` is never passed in, meaning the script errors when calling something like:

`llama stack build --image-type venv --template ollama --image-name
test`

Before:

```
llama stack build --image-type venv --template ollama --image-name venv-test
Usage: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> <env_name> <pip_dependencies> [<special_pip_deps>]
Example: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> mybuild ./my-stack-build.yaml 'numpy pandas scipy'
Failed to build target venv-test with return code 1
Run config path is empty
```
After:

```
llama stack build --image-type venv --template ollama --image-name venv-test
Environment 'venv-test' already exists, re-using it.
Using virtual environment venv-test
Using CPython 3.13.0 interpreter at: /opt/homebrew/opt/python@3.13/bin/python3.13
Creating virtual environment at: venv-test
Activate with: source venv-test/bin/activate
Using Python 3.13.0 environment at: venv-test
Resolved 55 packages in 640ms
      Built fire==0.7.0
Prepared 54 packages in 1.14s
Installed 55 packages in 82ms
 + annotated-types==0.7.0
 ```

## Test Plan

ran locally with output above

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-02-25 11:08:50 -08:00
Sébastien Han
c223b1862b
fix: resolve type hint issues and import dependencies (#1176)
# What does this PR do?

- Fixed type hinting and missing imports across multiple modules.
- Improved compatibility by using `TYPE_CHECKING` for conditional
imports.
- Updated `pyproject.toml` to enforce stricter linting.
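
A small sketch of the `TYPE_CHECKING` pattern referenced above (the imported module here is illustrative, not one actually touched by this PR):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers (mypy, pyright); skipped at runtime.
    from argparse import ArgumentParser


def describe(parser: "ArgumentParser") -> str:
    # The quoted annotation avoids a runtime NameError since the import above never runs.
    return parser.prog
```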

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-25 11:06:47 -08:00
Yuan Tang
1a044ef894
fix: Raise exception when tool call result is None (#1253)
# What does this PR do?

When there are issues with the tool call function, an exception is
raised but the error message is not informative. This adds a clearer
message to tell users to check their functions.
```
Traceback (most recent call last):
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/distribution/server/server.py", line 208, in sse_generator
    async for item in event_gen:
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agents.py", line 165, in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 197, in create_and_execute_turn
    async for chunk in self.run(
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 389, in run
    async for res in self._run(
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 811, in _run
    content=tool_result.content,
AttributeError: 'NoneType' object has no attribute 'content'
```
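
A minimal sketch of the kind of guard this adds (simplified; the real check lives in the agent execution code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    content: str


def unwrap_tool_result(result: Optional[ToolResult]) -> str:
    if result is None:
        # Previously surfaced as: AttributeError: 'NoneType' object has no attribute 'content'
        raise ValueError("Tool call returned None; check that your tool function returns a result.")
    return result.content
```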

## Test Plan

Ran the same script and exception is raised with clearer error message.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-25 13:10:50 -05:00
Jeff Tang
73a0c7a0e7
LocalInferenceImpl update for LS013 (#1242)
2025-02-25 09:58:34 -08:00
ehhuang
dc3c881ffe
fix: include timezone in Agent steps' timestamps (#1247)
Summary:

The Kotlin SDK expects this format.

Test Plan:

Python prints the expected format:
>>> str(datetime.now().astimezone())
'2025-02-24 22:02:58.729763-08:00'
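
A self-contained illustration of the change (standard library only): attaching the local timezone makes the serialized timestamp carry a UTC offset, which is what the Kotlin SDK parses.

```python
from datetime import datetime, timezone

naive = datetime.now()                 # e.g. 2025-02-24 22:02:58.729763 (no offset)
aware = datetime.now().astimezone()    # e.g. 2025-02-24 22:02:58.729763-08:00
explicit_utc = datetime.now(timezone.utc)

print(naive.isoformat())               # no timezone information
print(aware.isoformat())               # includes the local offset, e.g. -08:00
print(explicit_utc.isoformat())        # includes +00:00
```
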
2025-02-25 09:49:25 -08:00
Charlie Doern
4684fd3f8d
refactor: combine start scripts for each env (#1139)
# What does this PR do?

Now that llama stack supports running in venv, conda, and container modes and the three start scripts overlap a lot, combine them into one `start_stack.sh` script.

## Test Plan

Tested this locally with venv, conda, and container modes.

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-24 16:53:31 -08:00
Ashwin Bharambe
9b0f783e54
test: add a ci-tests distro template for running e2e tests (#1237) 2025-02-24 14:43:21 -08:00
ehhuang
14c38acf97
fix: set default tool_prompt_format in inference api (#1214)
Summary:
Currently we don't set the best `tool_prompt_format` according to the model, as promised.
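
A hedged sketch of what a per-model default could look like (the mapping below is illustrative, not the exact one this PR implements):

```python
def default_tool_prompt_format(model_id: str) -> str:
    # Assumption for illustration: some model families prefer a python-list style
    # tool prompt while others work best with JSON.
    if "Llama3.2" in model_id:
        return "python_list"
    return "json"
```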

Test Plan:
Added print around raw model input and inspected manually
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1214).
* #1234
* __->__ #1214
2025-02-24 12:38:37 -08:00
Sébastien Han
c4987bc349
fix: avoid failure when no special pip deps and better exit (#1228)
# What does this PR do?

When building providers in a virtual environment or containers, special
pip dependencies may not always be provided (e.g., for Ollama). The
check should only fail if the required number of arguments is missing.
Currently, two arguments are mandatory:

1. Environment name
2. Pip dependencies

Additionally, return statements were replaced with sys.exit(1) in error
conditions to ensure immediate termination on critical failures. Error
handling in the stack build process was also improved to guarantee the
program exits with status 1 when facing configuration issues or build
failures.
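
A simplified sketch of the exit-handling change (not the actual build code; names are illustrative): fail fast with status 1 instead of returning and letting the caller continue.

```python
import sys
from typing import Optional


def check_build_args(env_name: str, pip_deps: str, special_pip_deps: Optional[str] = None) -> None:
    # Only the first two arguments are mandatory; special pip deps may be absent (e.g. Ollama).
    if not env_name or not pip_deps:
        print("Usage: build <env_name> <pip_dependencies> [<special_pip_deps>]", file=sys.stderr)
        sys.exit(1)  # previously a bare return, which did not stop the build
```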

Signed-off-by: Sébastien Han <seb@redhat.com>


## Test Plan

This command shouldn't fail:

```
llama stack build --template ollama --image-type venv
```

2025-02-24 13:18:52 -05:00
Ashwin Bharambe
e8e8fe7c93 fix: add LLAMA_STACK_CLIENT_DIR mount when installing in docker from source 2025-02-24 10:00:57 -08:00
Ashwin Bharambe
641549c631 Add llama stack client overrides also; necessary for correct docker building 2025-02-24 07:51:11 -08:00
Ashwin Bharambe
0973d386e6 fix: update build_container.sh to ensure llama-models is installed first 2025-02-23 21:47:26 -08:00
Charlie Doern
34e3faa4e8
feat: add --run to llama stack build (#1156)
# What does this PR do?

`--run` runs the stack that was just built, using the same arguments from the build process (image name, type, etc.).

This simplifies the workflow a lot and makes the UX better for most local users trying to get started, since they no longer have to match the flags of the two commands (build and then run).

Also, moved `ImageType` to distribution.utils since there were circular
import errors with its old location

## Test Plan

tested locally using the following command: 

`llama stack build --run --template ollama --image-type venv`

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-02-23 22:06:09 -05:00
Ashwin Bharambe
6227e1e3b9
fix: update virtualenv building so llamastack- prefix is not added, make notebook experience easier (#1225)
Make sure venv behaves like conda (no prefix is added to image_name) and
`--image-type venv` inside a notebook "just works" without any fiddling
2025-02-23 16:57:11 -08:00
Reid
187524d4ae
feat: add substring search for model list (#1099)
# What does this PR do?

`llama model list` and `llama model list --show-all` can list many or all models, so add a `--search` option to narrow the output.
```
$ llama model list --help
usage: llama model list [-h] [--show-all] [-s SEARCH]

Show available llama models

options:
  -h, --help            show this help message and exit
  --show-all            Show all models (not just defaults)
  -s SEARCH, --search SEARCH
                        Search for the input string as a substring in the model descriptor(ID)

$ llama model list -s 70b
+-----------------------+-----------------------------------+----------------+
| Model Descriptor(ID)  | Hugging Face Repo                 | Context Length |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B          | meta-llama/Llama-3.1-70B          | 128K           |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | 128K           |
+-----------------------+-----------------------------------+----------------+
| Llama3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 128K           |
+-----------------------+-----------------------------------+----------------+

$ llama model list -s 3.1-8b
+----------------------+----------------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo                | Context Length |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B          | meta-llama/Llama-3.1-8B          | 128K           |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 128K           |
+----------------------+----------------------------------+----------------+

$ llama model list --show-all -s pro
+----------------------+-----------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo           | Context Length |
+----------------------+-----------------------------+----------------+
| Prompt-Guard-86M     | meta-llama/Prompt-Guard-86M | 2K             |
+----------------------+-----------------------------+----------------+

$ llama model list -s k
Not found for search.
```
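
A hedged sketch of the filtering logic (the actual CLI wiring differs): keep rows whose model descriptor contains the search string, case-insensitively.

```python
rows = [
    ("Llama3.1-70B", "meta-llama/Llama-3.1-70B", "128K"),
    ("Llama3.1-8B-Instruct", "meta-llama/Llama-3.1-8B-Instruct", "128K"),
    ("Prompt-Guard-86M", "meta-llama/Prompt-Guard-86M", "2K"),
]


def search_models(rows, needle):
    matches = [row for row in rows if needle.lower() in row[0].lower()]
    if not matches:
        print("Not found for search.")
    return matches


print(search_models(rows, "70b"))   # -> the Llama3.1-70B row
```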


Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 16:38:10 -08:00
Ashwin Bharambe
45ffe87d7c Kill noise from test output 2025-02-21 15:37:23 -08:00
Ashwin Bharambe
e7d261ef4a Fix test infra, sentence embeddings mixin 2025-02-21 15:11:46 -08:00
Ashwin Bharambe
ab54b8cd58
feat(providers): support non-llama models for inference providers (#1200)
This PR begins the process of supporting non-llama models within Llama
Stack. We start simple by adding support for this functionality within a
few existing providers: fireworks, together and ollama.

## Test Plan

```bash
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/inference/test_text_inference.py \
  --inference-model accounts/fireworks/models/phi-3-vision-128k-instruct
```

^ This passes most of the tests but, as expected, fails the tool-calling related tests since they are very specific to Llama models

```
inference/test_text_inference.py::test_text_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct-completion-01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet do humans live on?-Earth] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet has rings around it with a name starting w
ith letter S?-Saturn] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What's the name of the Sun in latin?-Sol] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What is the name of the US captial?-Washington] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct] ERROR
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-True] PASSED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-False] PASSED
```
2025-02-21 13:21:28 -08:00
Sébastien Han
9bbe34694d
ci: add mypy for static type checking (#1101)
# What does this PR do?

- Enable mypy to run in the CI on a subset of the repository
- Fix a few mypy errors
- Run mypy from pre-commit

Signed-off-by: Sébastien Han <seb@redhat.com>
 
2025-02-21 13:15:40 -08:00
ehhuang
25fddccfd8
feat: tool outputs metadata (#1155)
Summary:

Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.

Will need to make a similar change on the client side to support
ClientTool outputting metadata.
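
An illustrative sketch only (field names are assumptions, not the API's exact schema): a RAG-style tool returning metadata such as document IDs alongside its content.

```python
def rag_tool(query: str) -> dict:
    # Pretend retrieval results: (document_id, text) pairs.
    retrieved = [("doc-17", "First relevant chunk."), ("doc-42", "Second relevant chunk.")]
    return {
        "content": "\n".join(text for _, text in retrieved),
        "metadata": {"document_ids": [doc_id for doc_id, _ in retrieved]},
    }


result = rag_tool("what is llama stack?")
print(result["metadata"]["document_ids"])   # usable later to score recall
```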

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
2025-02-21 13:15:31 -08:00
Ashwin Bharambe
36162c8c82 fix(ollama): register model with the helper first so it gets normalized 2025-02-21 12:51:38 -08:00
Xi Yan
0fe071764f
feat(1/n): api: unify agents for handling server & client tools (#1178)
# Problem

Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.

1. Server tools: a single Turn, including `ToolExecutionStep`, is returned by the agents API
2. Client tools: `create_agent_turn` is called in a loop, with the client agent lib yielding the agent chunks

ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)

This makes working with server and client tools inconsistent. It also complicates telemetry logging, making it harder to get information about an agent's turns and history for observability.

#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.

## Solution

1. `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not completed, and awaiting tool input
2. `continue_agent_turn` endpoint to update agent turn with client's
tool response.


# What does this PR do?
- Skeleton API as example

## Test Plan

- Just API update, no functionality change
```
llama stack run + client-sdk test
```

<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>


2025-02-21 11:48:27 -08:00
Ashwin Bharambe
992f865b2e
chore: move embedding deps to RAG tool where they are needed (#1210)
`EMBEDDING_DEPS` were wrongly associated with `vector_io` providers. They are needed by
https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/memory/vector_store.py#L142
and related code, which is used by the RAG tool, so they should only be needed by the `inline::rag-runtime` provider.
2025-02-21 11:33:41 -08:00
Ashwin Bharambe
11697f85c5
fix: pull ollama embedding model if necessary (#1209)
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.

Thanks @hardikjshah for the suggestion.
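
A hedged sketch of the on-demand pull (via the `ollama` CLI rather than the provider's internal client, so this is illustrative, not the provider's actual code):

```python
import subprocess

EMBEDDING_MODEL = "all-minilm:latest"


def ensure_embedding_model() -> None:
    """Pull the embedding model only if `ollama list` doesn't already show it."""
    listed = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    if EMBEDDING_MODEL.split(":")[0] not in listed.stdout:
        print(f"Pulling embedding model `{EMBEDDING_MODEL}`")
        subprocess.run(["ollama", "pull", EMBEDDING_MODEL], check=True)
```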

Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)

## Test Plan 

First run `ollama rm all-minilm:latest`. 

Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
2025-02-21 10:35:56 -08:00
Jamie Land
840fae2259
fix: Updating images so that they are able to run without root access (#1208)
# What does this PR do?
Addresses issues where the container is unable to run without root access. Gives write access to the required folders.

Closes #1207

## Test Plan
I built locally and ran `llama stack build --template remote-vllm
--image-type container` and validated I could see my changes in the
output:

```
#11 1.186 Installed 11 packages in 61ms
#11 1.186  + llama-models==0.1.3
#11 1.186  + llama-stack==0.1.3
#11 1.186  + llama-stack-client==0.1.3
#11 1.186  + markdown-it-py==3.0.0
#11 1.186  + mdurl==0.1.2
#11 1.186  + prompt-toolkit==3.0.50
#11 1.186  + pyaml==25.1.0
#11 1.186  + pygments==2.19.1
#11 1.186  + rich==13.9.4
#11 1.186  + tiktoken==0.9.0
#11 1.186  + wcwidth==0.2.13
#11 DONE 1.6s

#12 [ 9/10] RUN mkdir -p /.llama /.cache
#12 DONE 0.3s

#13 [10/10] RUN chmod -R g+rw /app /.llama /.cache
#13 DONE 0.3s

#14 exporting to image
#14 exporting layers
#14 exporting layers 3.7s done
#14 writing image sha256:11cc8bd954db6d036037bcaf471b173ddd5261ac4b1e72074cccf85d18aefb96 done
#14 naming to docker.io/library/distribution-remote-vllm:0.1.3 done
#14 DONE 3.7s
+ set +x
Success!
```
This is what the resulting image looks like:


![image](https://github.com/user-attachments/assets/070b9c05-b40f-4e7e-aa24-fef260c395e3)

Also tagged the image as `0.1.3-test` and [pushed to
quay](https://quay.io/repository/jland/distribution-remote-vllm?tab=tags)
(note there are a bunch of critical vulnerabilities we may want to look
into)

And for good measure I deployed the resulting image on my Openshift
environment using the default Security Context and validated that there
were no issue with it coming up.

My validation was all done with the `remote-vllm` distribution, but if I am understanding everything correctly, the other distributions are just different run.yaml configs.


Please let me know if there is anything else I need to do.

Co-authored-by: Jamie Land <hokie10@gmail.com>
2025-02-21 11:32:56 -05:00
Reid
9898589f12
fix: convert back to model descriptor for model in list --downloaded (#1201)
# What does this PR do?

Currently, `model` in `--downloaded` just shows the directory name (with `:` already replaced), so convert it back to the model descriptor to match `llama model list`; the remove command also uses the descriptor.
```
before:
$ llama model list --downloaded
+-------------------------------------+----------+---------------------+
| Model                               | Size     | Modified Time       |
+-------------------------------------+----------+---------------------+
| Llama3.2-1B-Instruct-int4-qlora-eo8 | 1.53 GB  | 2025-02-20 16:32:49 |
+-------------------------------------+----------+---------------------+

after:
$ llama model list --downloaded
+-------------------------------------+----------+---------------------+
| Model                               | Size     | Modified Time       |
+-------------------------------------+----------+---------------------+
| Llama3.2-1B-Instruct:int4-qlora-eo8 | 1.53 GB  | 2025-02-20 16:32:49 |
+-------------------------------------+----------+---------------------+
```


Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:10:34 -08:00
Rashmi Pawar
da9f0b7869
test(client-sdk): Update embedding test types to use latest imports (#1203)
# What does this PR do?
- Updates ImageContentItemImageURL import
- fixes `embedding_dimensions` metadata param

## Test Plan
- Ran pytest locally, verified embedding tests pass with new types

![Screenshot 2025-02-21 at 6 54
27 PM](https://github.com/user-attachments/assets/f80e3785-04c3-415e-9276-88aa8136bf00)

cc: @dglogo @sumitb
2025-02-21 08:09:17 -08:00
Reid
d2701b0d6a
chore: remove configure subcommand (#1202)
# What does this PR do?

When I tried to use `configure`, I found it is `DEPRECATED`, and found PR https://github.com/meta-llama/llama-stack/pull/371 which deprecated it, so I'm not sure why `configure.py` wasn't removed.
```
$ llama stack configure /tmp/test.yaml
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config
llama stack configure: error:
    DEPRECATED! llama stack configure has been deprecated.
    Please use llama stack run <path/to/run.yaml> instead.
    Please see example run.yaml in /distributions folder.
```

It would be better to tell the user this when they first check how to use it with `--help`:

```
before:
$ llama stack configure --help
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config

Configure a llama stack distribution

positional arguments:

after:
$ llama stack configure --help
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config

Configure a llama stack distribution

    DEPRECATED! llama stack configure has been deprecated.
    Please use llama stack run <path/to/run.yaml> instead.
    Please see example run.yaml in /distributions folder.
```


---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:06:25 -08:00
Reid
c9c4a3c921
feat: model remove cmd (#1128)
# What does this PR do?

Add a subcommand to help clean up unneeded models:
```
$ llama model --help
usage: llama model [-h] {download,list,prompt-format,describe,verify-download,remove} ...

Work with llama models

options:
  -h, --help            show this help message and exit

$ llama model remove --help
usage: llama model remove [-h] -m MODEL [-f]

Remove the downloaded llama model

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Specify the llama downloaded model name
  -f, --force           Used to forcefully remove the llama model from the storage without further confirmation

$ llama model remove -m Llama3.2-1B-Instruct:int4-qlora-eo8
Are you sure you want to remove Llama3.2-1B-Instruct:int4-qlora-eo8? (y/n): n
Removal aborted.

$ llama model remove -m Llama3.2-1B-Instruct:int4-qlora-eo8 -f
Llama3.2-1B-Instruct:int4-qlora-eo8 removed.
```


---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:05:12 -08:00
Ashwin Bharambe
81ce39a607
feat(api): Add options for supporting various embedding models (#1192)
We need to support:
- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932) 

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
2025-02-20 22:27:12 -08:00
Ashwin Bharambe
6f9d622340
fix(api): update embeddings signature so inputs and outputs list align (#1161)
See Issue #922 

The change is slightly backwards incompatible, but no callsite (in our client codebases or stack-apps) ever passes a depth-2 `List[List[InterleavedContentItem]]` (which is now disallowed).
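
A hedged illustration of the aligned signature using the Python client SDK (the model id and base URL are placeholders): `contents` is a flat list and `response.embeddings[i]` corresponds to `contents[i]`.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

contents = ["first sentence", "second sentence"]   # depth-1 list of items
response = client.inference.embeddings(
    model_id="all-minilm:latest",                   # placeholder embedding model
    contents=contents,
)
assert len(response.embeddings) == len(contents)    # outputs align with inputs
```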

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

Also ran `tests/client-sdk/inference/test_embeddings.py`
2025-02-20 21:43:13 -08:00
ehhuang
cfa752fc92
fix: pass tool_prompt_format to chat_formatter (#1198)
Summary:

Need this to format the completion message with tool_calls correctly.
See added unittest.

Test Plan:

python -m unittest
llama_stack.providers.tests.inference.test_prompt_adapter
2025-02-20 21:38:35 -08:00
Ashwin Bharambe
dd43494847 Fix inference test fixture 2025-02-20 21:24:49 -08:00
Ben Browning
6820718b71
fix: BuiltinTool JSON serialization in remote vLLM provider (#1183)
# What does this PR do?

The `tool_name` attribute of `ToolDefinition` instances can either be a
str or a BuiltinTool enum type. This fixes the remote vLLM provider to
use the value of those BuiltinTool enums when serializing to JSON
instead of attempting to serialize the actual enum to JSON.

Reference of how this is handled in some other areas, since I followed
that same pattern for the remote vLLM provider here:
- [remote nvidia
provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/remote/inference/nvidia/openai_utils.py#L137-L140)
- [meta reference
provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L635-L636)
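
A hedged sketch of that pattern (local stand-ins for the real types, so the JSON handling is visible in isolation):

```python
import json
from enum import Enum


class BuiltinTool(Enum):
    # Stand-in for the real BuiltinTool enum used by tool definitions.
    brave_search = "brave_search"
    code_interpreter = "code_interpreter"


def tool_name_for_json(tool_name) -> str:
    # Use the enum's value when the tool name is a BuiltinTool; plain strings pass through.
    return tool_name.value if isinstance(tool_name, BuiltinTool) else tool_name


print(json.dumps({"name": tool_name_for_json(BuiltinTool.brave_search)}))  # {"name": "brave_search"}
```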

There is an opportunity to potentially reconcile the remote nvidia and remote vllm bits, since they both translate Llama Stack Inference APIs to OpenAI client requests, but that's a can of worms I didn't want to open for this bug fix.

This explicitly fixes this error when using the remote vLLM provider and
the agent tests:

```
TypeError: Object of type BuiltinTool is not JSON serializable
```

So, this is related to #1144 and addresses the immediate issue raised
there. With this fix,
`tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search`
now gets past the JSON serialization error when using the remote vLLM
provider and actually attempts to call the web search tool. I don't have
any API keys setup for the actual web search providers yet, so I cannot
verify everything works after that point.

## Test Plan

I ran the `test_builtin_tool_web_search` locally with the remote vLLM
provider like:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search --inference-model "meta-llama/Llama-3.2-3B-Instruct"
```

Before my change, that reproduced the `TypeError: Object of type
BuiltinTool is not JSON serializable` error. After my change, that error
is gone and the test actually attempts the web search. That failed for
me locally, due to lack of API key, but it gets past the JSON
serialization error.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-02-20 21:18:37 -08:00
Ashwin Bharambe
35ae0e16a1 Fix sqlite_vec config defaults 2025-02-20 17:50:33 -08:00
Matthew Farrellee
832c535aaf
feat(providers): add NVIDIA Inference embedding provider and tests (#935)
# What does this PR do?

add /v1/inference/embeddings implementation to NVIDIA provider

**open topics** -
- *asymmetric models*. NeMo Retriever includes asymmetric models, which
are models that embed differently depending on if the input is destined
for storage or lookup against storage. the /v1/inference/embeddings api
does not allow the user to indicate the type of embedding to perform.
see https://github.com/meta-llama/llama-stack/issues/934
- *truncation*. embedding models typically have a limited context
window, e.g. 1024 tokens is common though newer models have 8k windows.
when the input is larger than this window the endpoint cannot perform
its designed function. two options: 0. return an error so the user can
reduce the input size and retry; 1. perform truncation for the user and
proceed (common strategies are left or right truncation). many users
encounter context window size limits and will struggle to write reliable
programs. this struggle is especially acute without access to the
model's tokenizer. the /v1/inference/embeddings api does not allow the
user to delegate truncation policy. see
https://github.com/meta-llama/llama-stack/issues/933
- *dimensions*. "Matryoshka" embedding models are available. they allow
users to control the number of embedding dimensions the model produces.
this is a critical feature for managing storage constraints. embeddings
of 1024 dimensions that achieve 95% recall for an application may not be worth the storage cost if 512 dimensions can achieve 93% recall.
controlling embedding dimensions allows applications to determine their
recall and storage tradeoffs. the /v1/inference/embeddings api does not
allow the user to control the output dimensions. see
https://github.com/meta-llama/llama-stack/issues/932

## Test Plan

- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-20 16:59:48 -08:00
Ashwin Bharambe
2608b6074f Update embedding dimension singular 2025-02-20 16:14:46 -08:00
Ashwin Bharambe
9436dd570d
feat: register embedding models for ollama, together, fireworks (#1190)
# What does this PR do?

We have support for embeddings in our Inference providers, but so far we
haven't done the final step of actually registering the known embedding
models and making sure they are extremely easy to use. This is one step
towards that.

## Test Plan

Run existing inference tests.

```bash

$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

The value of EMBEDDING_DIMENSION isn't actually used in these tests; it is merely used by the test fixtures to check whether the model is an LLM or an embedding model.
2025-02-20 15:39:08 -08:00
Ashwin Bharambe
736560ceba Remove os.getenv() from ollama config 2025-02-20 14:30:32 -08:00
LESSuseLESS
2cbe9395b0
feat: D69478008 [llama-stack] turning tests into data-driven (#1180)
# What does this PR do?

We have several places running tests for different purposes.
- oss llama stack
  - provider tests
  - e2e tests
- provider llama stack
  - unit tests
  - e2e tests

It would be nice if they could *share the same set of test data*, so we maintain consistency between spec and implementation. That is what this diff is about: isolating test data from test code, so the same data can be reused in different places with different test code.
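
A hedged sketch of the idea (file layout and helper names are assumptions, not the repo's actual structure): test cases live in a shared JSON file and both the provider tests and the e2e tests load them.

```python
import json
import pathlib


def load_test_cases(name: str) -> list:
    """Load shared, code-free test data that any test suite can parametrize over."""
    path = pathlib.Path("tests") / "test_cases" / f"{name}.json"
    return json.loads(path.read_text())


# In either suite, for example:
# @pytest.mark.parametrize("case", load_test_cases("chat_completion"))
# def test_chat_completion(case): ...
```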

## Test Plan

== Set up Ollama local server  
==  Run a provider test
conda activate stack

OLLAMA_URL="http://localhost:8321" \
pytest -v -s -k "ollama" --inference-model="llama3.2:3b-instruct-fp16" \

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output
// test_structured_output should also work

== Run an e2e test
conda activate sherpa
with-proxy pip install llama-stack
export INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=8322
with-proxy llama stack build --template ollama
with-proxy llama stack run --env OLLAMA_URL=http://localhost:8321 ollama
  - Run test client,
LLAMA_STACK_PORT=8322 LLAMA_STACK_BASE_URL="http://localhost:8322" \
pytest -v -s --inference-model="llama3.2:3b-instruct-fp16" \

tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output
// test_text_chat_completion_structured_output should also work

## Notes

- This PR was automatically generated by oss_sync
- Please refer to D69478008 for more details.
2025-02-20 14:13:06 -08:00
ehhuang
1166afdf76
fix: some telemetry APIs don't currently work (#1188)
Summary:

This bug is surfaced by using the HTTP LS client. The issue is that non-scalar values on 'GET' methods become `body` params in FastAPI, but our spec generation script doesn't respect that. We fix this by making those endpoints POST methods instead.
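
A toy FastAPI illustration of the underlying issue (not llama-stack code): a Pydantic-model parameter is read from the request body, which most HTTP clients will not send on a GET, so the route is switched to POST.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryCondition(BaseModel):
    key: str
    op: str
    value: str


@app.post("/v1/telemetry/query")   # conceptually a "get", but the non-scalar param forces a body
def query_traces(condition: QueryCondition):
    return {"condition": condition.model_dump()}
```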

Test Plan:
Test API call with newly sync'd client
(https://github.com/meta-llama/llama-stack-client-python/pull/149)

<img width="1114" alt="image"
src="https://github.com/user-attachments/assets/7710aca5-d163-4e00-a465-14e6fcaac2b2"
/>
2025-02-20 14:09:25 -08:00
Xi Yan
ea1faae50e
chore!: deprecate eval/tasks (#1186)
# What does this PR do?
- Fully deprecate eval/tasks

Closes #1088 

NOTE: this will be a breaking change. We introduced the new API in 0.1.3.

Notebook has been updated to use the new endpoints.

## Test Plan
```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb 
```
<img width="611" alt="image"
src="https://github.com/user-attachments/assets/79f6efe1-81ba-494e-bf36-1fc0c2b9bc6f"
/>



cc @SLR722  for awareness

2025-02-20 14:06:21 -08:00
Ashwin Bharambe
07ccf908f7 ModelAlias -> ProviderModelEntry 2025-02-20 14:02:36 -08:00
Vladimir Ivić
f7161611c6
feat: adding endpoints for files and uploads (#1070)
Summary:
Adds spec definitions for file uploads operations.

This API focuses around two high level operations:
* Initiating and managing upload session
* Accessing uploaded file information

Usage examples:

To start a file upload session:
```
curl -X POST https://localhost:8321/v1/files \
-d '{
   "key": "image123.jpg',
   "bucket": "images",
   "mime_type": "image/jpg",
   "size": 12345
}'

# Returns
{
  "id": <session_id>,
  "url": "https://localhost:8321/v1/files/session:<session_id>",
  "offset": 0,
  "size": 12345
}

```

To upload file content to an existing session
```
curl -i -X POST "https://localhost:8321/v1/files/session:<session_id> \
  --data-binary @<path_to_local_file>

# Returns
{
  "key": "image123.jpg",
  "bucket": "images",
  "mime_type": "image/jpg",
  "bytes": 12345,
  "created_at": 1737492240
}

# Implementing on server side (Flask example for simplicity;
# assumes `from flask import request` and an `app = Flask(__name__)` instance):
@app.route('/uploads/<upload_id>', methods=['POST'])
def upload_content_to_session(upload_id):
    try:
        # Get the binary file data from the request body
        file_data = request.data

        # Save the file to disk
        save_path = f"./uploads/{upload_id}"
        with open(save_path, 'wb') as f:
            f.write(file_data)
        return {__uploaded_file_json__}, 200
    except Exception:
        return "", 500

```

To read information about an existing upload session
```
curl -i -X GET "https://localhost:8321/v1/files/session:<session_id>

# Returns
{
  "id": <session_id>,
  "url": "https://localhost:8321/v1/files/session:<session_id>",
  "offset": 1024,
  "size": 12345
}
```

To list buckets
```
GET /files

# Returns
{
  "data": [
     {"name": "bucket1"},
     {"name": "bucket2"},
   ]
}
```

To list all files in a bucket
```
GET /files/{bucket}

# Returns
{
  "data": [
    {
      "key": "shiba.jpg",
      "bucket": "dogs",
      "mime_type": "image/jpg",
      "bytes": 82334,
      "created_at": 1737492240,
    },
    {
      "key": "persian_cat.jpg",
      "mime_type": "image/jpg",
      "bucket": "cats",
      "bytes": 39924,
      "created_at": 1727493440,
    },
  ]
}
```

To get specific file info
```
GET /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```

To delete specific file
```
DELETE /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```
2025-02-20 13:09:00 -08:00
Ashwin Bharambe
eddef0b2ae
chore: slight renaming of model alias stuff (#1181)
Quick test by running:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk
```
2025-02-20 11:48:46 -08:00