llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-06-28 19:04:19 +00:00

Author	SHA1	Message	Date
Dinesh Yeduguru	314806cde3	Add provider data passing for library client (#750 ) # What does this PR do? This PR adds provider data passing for the library client and makes the providers' API keys unique. ## Test Plan LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-fireworks/fireworks-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py run.yaml: https://gist.github.com/dineshyv/0c10b5c7d0a2fb7ba4f0ecc8dcf860d1	2025-01-13 15:12:10 -08:00
Yuan Tang	e45592e229	Support building UBI9 base container image (#676 ) This adds support for [UBI9 (Red Hat Universal Base Image 9)](`615bcf606f`). Tested `registry.access.redhat.com/ubi9/ubi-minimal:9.5`. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-13 13:41:56 -08:00
Sarthak Deshpande	ec8601ce88	Replaced zrangebylex method in the range method (#521 ) # What does this PR do? Redis as a kvstore is currently broken because the range method uses the zrangebylex method. zrangebylex only works on sorted sets, but values are stored in Redis with the .set method, which causes an error. In addition, zrangebylex takes 3 arguments but the range method passes only 2, which causes a runtime error. This PR replaces that call with a new implementation. Addresses issue (#520 ) ## Test Plan `python llama_stack/apis/agents/client.py localhost 8001 tools_llama_3_1 meta-llama/Llama-3.1-70B-Instruct` <img width="1711" alt="Screenshot 2024-11-25 at 2 59 55 PM" src="https://github.com/user-attachments/assets/c2551555-bc73-4427-b09b-c86d6deb2956"> <img width="634" alt="Screenshot 2024-11-25 at 3 00 33 PM" src="https://github.com/user-attachments/assets/a087718f-fc2a-424b-b096-4ecad08a07bf"> Redis was also used as the persistence_store in the run.yaml file, with enable_session_persistence set to True for this test. This was also tested in a Jupyter notebook to make sure the flow works across multiple turns in the same session. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-11 22:04:34 -08:00
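The following is a minimal sketch of a range query that works when values are stored with plain SET. It is written against a redis-py style client and assumes the client exposes `scan(cursor=..., match=..., count=...)` and `mget(keys)`, as redis-py does. The function name `range_values` is hypothetical, and this is not the code merged in #521. The sketch scans keys with a glob on the common prefix of the two bounds, then filters them lexicographically in Python.

```python
import os
from typing import Any, List, Optional


def _to_str(value: Any) -> Optional[str]:
    # redis-py returns bytes unless the client was created with decode_responses=True.
    return value.decode("utf-8") if isinstance(value, bytes) else value


def range_values(client: Any, start_key: str, end_key: str) -> List[str]:
    """Return values whose keys fall in [start_key, end_key] (inclusive, lexicographic).

    Hypothetical helper: keys are plain strings written with SET, so sorted-set
    commands such as ZRANGEBYLEX do not apply. Assumes the shared prefix contains
    no glob metacharacters (*, ?, [).
    """
    prefix = os.path.commonprefix([start_key, end_key])
    matching: List[str] = []
    cursor = 0
    while True:
        # SCAN is incremental; a returned cursor of 0 means the iteration is complete.
        cursor, keys = client.scan(cursor=cursor, match=prefix + "*", count=1000)
        for key in keys:
            key_str = _to_str(key)
            if start_key <= key_str <= end_key:
                matching.append(key_str)
        if cursor == 0:
            break
    if not matching:
        return []
    # SCAN may return keys in any order and occasionally more than once, so normalize.
    matching = sorted(set(matching))
    values = client.mget(matching)
    # A key can expire between SCAN and MGET; skip the resulting None entries.
    return [_to_str(v) for v in values if v is not None]
```

Scanning touches only keys under the shared prefix. Even so, it is O(N) in the number of matching keys. A sorted set would be the alternative if range queries became hot.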
Fred Reiss	8b2376bfb3	Add inline vLLM inference provider to regression tests and fix regressions (#662 ) # What does this PR do? This PR adds the inline vLLM inference provider to the regression tests for inference providers. The PR also fixes some regressions in that inference provider in order to make the tests pass. ## Test Plan Command to run the new tests (from root of project): ``` pytest \ -vvv \ llama_stack/providers/tests/inference/test_text_inference.py \ --providers inference=vllm \ --inference-model meta-llama/Llama-3.2-3B-Instruct \ ``` Output of the above command after these changes: ``` /mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. 
Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) =================================================================== test session starts =================================================================== platform linux -- Python 3.12.7, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12 cachedir: .pytest_cache rootdir: /mnt/datadisk1/freiss/llama/llama-stack configfile: pyproject.toml plugins: asyncio-0.25.0, anyio-4.6.2.post1 asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None collected 9 items llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] SKIPPED (Other inference providers don't support completion() yet) [ 22%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] SKIPPED (Other inference providers don't support completion() yet) [ 33%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] SKIPPED (This test is not quite robust) [ 44%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] SKIPPED (Other inference providers don't support structured output yet) [ 66%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%] 
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%] ======================================================== 5 passed, 4 skipped, 2 warnings in 25.56s ======================================================== Task was destroyed but it is pending! task: <Task pending name='Task-6' coro=<AsyncLLMEngine.run_engine_loop() running at /mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py:848> cb=[_log_task_completion(error_callback=<bound method...7cfc479440b0>>)() at /mnt/datadisk1/freiss/llama/env/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py:45, shield.<locals>._inner_done_callback() at /mnt/datadisk1/freiss/llama/env/lib/python3.12/asyncio/tasks.py:905]> [rank0]:[W1219 11:38:34.689424319 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) ``` The warning about "asyncio_default_fixture_loop_scope" appears to be due to my environment having a newer version of pytest-asyncio. The warning about a pending task appears to be due to a bug in `vllm.AsyncLLMEngine.shutdown_background_loop()`. It looks like that method returns without stopping a pending task. I will look into that issue separately. ## Sources ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Ran pre-commit to handle lint / formatting issues. 
- [X] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [X] Wrote necessary unit or integration tests.	2025-01-10 16:35:16 -08:00
raghotham	ff182ff6de	rename LLAMASTACK_PORT to LLAMA_STACK_PORT for consistency with other env vars (#744 ) # What does this PR do? Rename environment var for consistency ## Test Plan No regressions ## Sources ## Before submitting - [X] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Ran pre-commit to handle lint / formatting issues. - [X] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [X] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests. --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-10 11:09:49 -08:00
Dinesh Yeduguru	8af6951106	remove conflicting default for tool prompt format in chat completion (#742 ) # What does this PR do? We were setting a default value of json for the tool prompt format, which conflicts with Llama 3.2/3.3 models since they use the python list format. This PR changes the default to None and instead infers the default from the model in the code. Addresses: #695 Tests: ❯ LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v tests/client-sdk/inference/test_inference.py -k "test_text_chat_completion" pytest llama_stack/providers/tests/inference/test_prompt_adapter.py	2025-01-10 10:41:53 -08:00
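Here is a sketch of inferring the default when no tool prompt format is passed. The enum, the function name, and the substring check on the model id are all assumptions for illustration. The actual inference in this PR may key off model metadata rather than the id string.

```python
from enum import Enum
from typing import Optional


class ToolPromptFormat(str, Enum):
    # Hypothetical enum mirroring the two formats named in the commit message.
    json = "json"
    python_list = "python_list"


def infer_tool_prompt_format(
    model_id: str, requested: Optional[ToolPromptFormat] = None
) -> ToolPromptFormat:
    """Pick a tool prompt format when the caller did not specify one (requested is None).

    Assumption: Llama 3.2 / 3.3 model ids contain "3.2" or "3.3"
    (e.g. "meta-llama/Llama-3.2-3B-Instruct"); everything else falls back to JSON.
    """
    if requested is not None:
        # An explicit choice from the caller always wins over the inferred default.
        return requested
    if "3.2" in model_id or "3.3" in model_id:
        return ToolPromptFormat.python_list
    return ToolPromptFormat.json
```

Defaulting the parameter to `None` is what makes this possible: a hardcoded `json` default cannot be told apart from an explicit request for JSON.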
Yuan Tang	24fa1adc2f	Expose LLAMASTACK_PORT in cli.stack.run (#722 ) This was missed in https://github.com/meta-llama/llama-stack/pull/706. I tested `llama_stack.distribution.server.server` but didn't test `llama stack run`. cc @ashwinb Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-10 09:13:49 -08:00
Vladislav Bronzov	96735e961d	Add persistence for localfs datasets (#557 ) # What does this PR do? Add persistence logic for the localfs datasetio provider. ## Test Plan ## Sources https://github.com/meta-llama/llama-stack/issues/539 ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-09 17:34:18 -08:00
Ashwin Bharambe	4938f2fe5d	Check version incompatibility (#738 ) When we bump up `major.minor` we want to make sure clients can immediately detect a version change and appropriately error out. It is not reasonable to keep checking for API-level backwards compatibility across such version bumps. Over time, we may base the check on the major version only. ### Test Plan Manually updated `__version__` in the client SDK to "0.1.0", which is incompatible with the server's current version "0.0.63", and got the following error: <img width="1077" alt="image" src="https://github.com/user-attachments/assets/06ae4659-0a25-4c4c-a999-ce44678d4e6f" /> Without this update, the CLI worked correctly.	2025-01-09 14:52:06 -08:00
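This sketch shows a `major.minor` compatibility check like the one described above. It assumes plain `X.Y.Z` version strings, and the function names are hypothetical. With the versions from the test plan, the sketch rejects client "0.1.0" against server "0.0.63", while any two 0.0.x versions pass.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse an 'X.Y.Z' version string into (X, Y); the patch part is ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_version_compatible(client_version: str, server_version: str) -> None:
    """Raise if client and server differ in major.minor (hypothetical helper)."""
    if major_minor(client_version) != major_minor(server_version):
        raise ValueError(
            f"Client version {client_version} is incompatible with server version "
            f"{server_version}; major.minor must match."
        )
```

Patch-level differences are tolerated here. Narrowing the check to the major version only would mean comparing just the first element of the tuple.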
Ashwin Bharambe	ffc6bd4805	Add X-LlamaStack-Client-Version, rename ProviderData -> Provider-Data (#735 ) Add another header so client SDKs can identify their versions which can be used for immediate detection of possible compatibility issues. A semver mismatch against the wrong server should be immediately flagged and requests should be denied. Also change `X-LlamaStack-ProviderData` to `X-LlamaStack-Provider-Data` since that hyphenation is better.	2025-01-09 11:51:36 -08:00
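Below is a sketch of the request headers a client might send, using the two header names from this commit. The helper name is hypothetical. The JSON encoding of provider data follows the `x_llama_stack_provider_data='{"groq_api_key": "<api-key>"}'` example in the Groq provider PR (#609).

```python
import json
from typing import Any, Dict, Optional

# Header names as given in this commit.
CLIENT_VERSION_HEADER = "X-LlamaStack-Client-Version"
PROVIDER_DATA_HEADER = "X-LlamaStack-Provider-Data"


def build_headers(
    client_version: str, provider_data: Optional[Dict[str, Any]] = None
) -> Dict[str, str]:
    """Build per-request headers (hypothetical helper)."""
    headers = {CLIENT_VERSION_HEADER: client_version}
    if provider_data:
        # Provider data (e.g. per-request API keys) travels as a JSON-encoded string.
        headers[PROVIDER_DATA_HEADER] = json.dumps(provider_data)
    return headers
```

Sending the version on every request is what lets the server reject a semver mismatch up front, before any API-level error surfaces.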
Dinesh Yeduguru	a5c57cd381	agents to use tools api (#673 ) # What does this PR do? PR #639 introduced the Tools API and the ability to invoke tools through the API just like any other resource. This PR changes the Agents to use the Tools API to invoke tools. Major changes include: 1) Ability to specify tool groups in AgentConfig 2) The agent fetches the tool definitions for the specified tools and passes them along to the model 3) Attachments are now named Documents, and their behavior is mostly unchanged from the user's perspective 4) You can specify args to inject into a tool call through the Agent config. This is especially useful for the memory tool, where you want the tool to operate on a specific memory bank. 5) You can also register tool groups with args, which the agent also injects into the tool call. 6) All tests, including the client SDK tests, have been migrated to the new tools API and fixtures 7) Telemetry works with the tools API out of the box because of our trace protocol decorator ## Test Plan ``` pytest -s -v -k fireworks llama_stack/providers/tests/agents/test_agents.py \ --safety-shield=meta-llama/Llama-Guard-3-8B \ --inference-model=meta-llama/Llama-3.1-8B-Instruct pytest -s -v -k together llama_stack/providers/tests/tools/test_tools.py \ --safety-shield=meta-llama/Llama-Guard-3-8B \ --inference-model=meta-llama/Llama-3.1-8B-Instruct LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py ``` run.yaml: https://gist.github.com/dineshyv/0365845ad325e1c2cab755788ccc5994 Notebook: https://colab.research.google.com/drive/1ck7hXQxRl6UvT-ijNRZ-gMZxH1G3cN2d?usp=sharing	2025-01-08 19:01:00 -08:00
Xi Yan	596afc6497	add --version to llama stack CLI & /version endpoint (#732 ) # What does this PR do? - add --version to llama stack CLI - add /version endpoint - run OpenAPI generator for the new endpoint ## Test Plan CLI <img width="184" alt="image" src="https://github.com/user-attachments/assets/3acb1d22-453e-4b79-baf6-e98e88d0671c" /> endpoint <img width="430" alt="image" src="https://github.com/user-attachments/assets/79cdd670-493b-40cf-8f9e-28a4ac0988ac" /> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-08 16:30:06 -08:00
Xi Yan	7a4383e4c1	add 3.3 to together inference provider (#729 ) # What does this PR do? - add llama3.3 model for together - fix fireworks distro_codegen ``` python llama_stack/scripts/distro_codegen.py ``` ## Test Plan <img width="1132" alt="image" src="https://github.com/user-attachments/assets/bf94b933-9200-4e73-878e-d1a95d450a88" /> Tests ``` pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.3-70B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py ``` <img width="1139" alt="image" src="https://github.com/user-attachments/assets/407dc98b-8de3-4841-8cb1-75e4b5128544" /> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-06 15:39:41 -08:00
Xi Yan	7a90fc5854	move DataSchemaValidatorMixin into standalone utils (#720 ) # What does this PR do? - there's no value in keeping the data schema validation logic in a DataSchemaValidatorMixin - move the data schema validation logic into standalone utils ## Test Plan ``` pytest -v -s -m llm_as_judge_scoring_together_inference scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct pytest -v -s -m basic_scoring_together_inference scoring/test_scoring.py pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio eval/test_eval.py ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-06 13:25:09 -08:00
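This sketch shows what standalone validation helpers could look like once the mixin is gone. The function names, the simplified column-to-type schema representation, and the column names in the usage are all hypothetical.

```python
from typing import Any, Dict, List

# Simplified assumption: a schema maps column name -> type name.
Schema = Dict[str, str]


def validate_dataset_schema(dataset_schema: Schema, expected_schemas: List[Schema]) -> None:
    """Raise unless the dataset's schema exactly matches one of the accepted schemas."""
    if dataset_schema not in expected_schemas:
        raise ValueError(
            f"Dataset schema {dataset_schema} does not match any of {expected_schemas}"
        )


def validate_row_schema(row: Dict[str, Any], expected_schemas: List[Schema]) -> None:
    """Raise unless the row contains every column of at least one accepted schema."""
    for schema in expected_schemas:
        if all(column in row for column in schema):
            return
    raise ValueError(f"Row {row} does not contain the columns of any expected schema")
```

As plain functions, the helpers can be called from any scoring or eval provider without inheriting from a shared base class.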
Dinesh Yeduguru	0bc5d05243	remove default logger handlers when using libcli with notebook (#718 ) # What does this PR do? Remove the default log handlers for notebook to avoid polluting logs	2025-01-06 13:06:22 -08:00
Botao Chen	e86271aeac	support llama3.1 8B instruct in post training (#698 ) ## What does this PR do? - Support the llama3.1 8B instruct model rather than the llama3 8B model, since llama3.1 8B instruct is a better model to finetune on top of - Make the file-copy logic in the checkpointer safer in case the file to be copied doesn't exist in the source path ## test Issue a post training request from the client and verify that training works as expected <img width="1101" alt="Screenshot 2025-01-02 at 12 18 45 PM" src="https://github.com/user-attachments/assets/47cc4df9-3edc-4afd-b5dd-abe1f039f1ed" /> <img width="782" alt="Screenshot 2025-01-02 at 12 18 52 PM" src="https://github.com/user-attachments/assets/b9435274-ef1d-4570-bd8e-0880c3a4b2e9" />	2025-01-03 17:33:05 -08:00
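Here is a sketch of the "copy only if it exists" behavior described for the checkpointer. The helper name and the example file names are hypothetical. The function reports which files it skipped instead of raising.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_if_exists(src_dir: Path, dst_dir: Path, filenames: Iterable[str]) -> List[str]:
    """Copy each named file that exists in src_dir; return the names that were skipped."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    skipped: List[str] = []
    for name in filenames:
        src = src_dir / name
        if src.is_file():
            # copy2 also preserves file metadata such as modification time.
            shutil.copy2(src, dst_dir / name)
        else:
            skipped.append(name)
    return skipped
```

Returning the skipped names lets the caller log which optional files were missing rather than failing the whole checkpoint.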
Aidan Do	485476c29a	Fix Groq invalid self.config reference (#719 ) # What does this PR do? Contributes towards: #432 RE: https://github.com/meta-llama/llama-stack/pull/609 I missed this one while refactoring. Fixes: ```python Traceback (most recent call last): File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 191, in endpoint return await maybe_await(value) File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 155, in maybe_await return await value File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper result = await method(self, *args, **kwargs) File "/Users/aidand/dev/llama-stack/llama_stack/distribution/routers/routers.py", line 156, in chat_completion return await provider.chat_completion(**params) File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper result = await method(self, *args, **kwargs) File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/groq/groq.py", line 127, in chat_completion response = self._get_client().chat.completions.create(**request) File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/groq/groq.py", line 143, in _get_client return Groq(api_key=self.config.api_key) AttributeError: 'GroqInferenceAdapter' object has no attribute 'config'. Did you mean: '_config'? ``` ## Test Plan Environment: ```shell export GROQ_API_KEY=<api-key> # build.yaml and run.yaml files wget https://raw.githubusercontent.com/aidando73/llama-stack/9165502582cd7cb178bc1dcf89955b45768ab6c1/build.yaml wget https://raw.githubusercontent.com/aidando73/llama-stack/9165502582cd7cb178bc1dcf89955b45768ab6c1/run.yaml # Create the environment if it doesn't already exist conda create --prefix ./envs python=3.10 conda activate ./envs # Build pip install -e . && llama stack build --config ./build.yaml --image-type conda # Activate built environment conda activate llamastack-groq ``` <details> <summary>Manual</summary> ```bash llama stack run ./run.yaml --port 5001 ``` Via this Jupyter notebook: `9165502582/hello.ipynb` </details> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [x] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-03 15:47:10 -08:00
Yuan Tang	04d5b9814f	Fix assert message and call to completion_request_to_prompt in remote:vllm (#709 ) The current assert message is incorrect, and the model arg is not needed in `completion_request_to_prompt`. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-03 13:44:49 -08:00
Yuan Tang	96d8375663	Fix incorrect entrypoint for broken `llama stack run` (#706 ) This fixes the issue when using `llama stack run` by correctly specifying entrypoint: ``` LLAMA_STACK_DIR=. llama stack run /home/yutang/.llama/distributions/llamastack-vllm/vllm-run.yaml Using config file: /home/yutang/.llama/distributions/llamastack-vllm/vllm-run.yaml + command -v selinuxenabled + selinuxenabled + DOCKER_OPTS=' --security-opt label=disable' + mounts= + '[' -n . ']' ++ readlink -f . + mounts=' -v /home/yutang/repos/llama-stack:/app/llama-stack-source' + '[' -n '' ']' + version_tag=latest + '[' -n '' ']' + '[' -n . ']' + version_tag=dev + podman run --security-opt label=disable -it -p 5000:5000 -v /home/yutang/.llama/distributions/llamastack-vllm/vllm-run.yaml:/app/config.yaml -v /home/yutang/repos/llama-stack:/app/llama-stack-source localhost/distribution-vllm:dev python -m llama_stack.distribution.server.server --yaml-config /app/config.yaml --port 5000 usage: server.py [-h] [--yaml-config YAML_CONFIG] [--template TEMPLATE] [--port PORT] [--disable-ipv6] [--env ENV] server.py: error: unrecognized arguments: python -m llama_stack.distribution.server.server ++ error_handler 88 ++ echo 'Error occurred in script at line: 88' Error occurred in script at line: 88 ++ exit 1 ``` --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-03 09:47:10 -08:00
Ashwin Bharambe	21357a6dee	Kill autocomplete slop	2025-01-03 09:29:25 -08:00
Botao Chen	4320b0ebb2	[Post training] make validation steps configurable (#715 ) ## what does this PR do? The current code hardcodes the number of validation steps to run (I forgot to change it after testing). In this PR, we make it configurable through the training config. ## test On the client side, issue a post training request with 20 validation steps; server-side logging shows that it runs 20 validation steps successfully <img width="1128" alt="Screenshot 2025-01-02 at 8 21 06 PM" src="https://github.com/user-attachments/assets/7a757516-c6ba-41d4-85c5-361a80ecf46e" />	2025-01-03 08:43:24 -08:00
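The following sketch drives the validation loop from config instead of a hardcoded count. `TrainingConfig.max_validation_steps` and `run_validation` are hypothetical names. The commit only states that the number of validation steps now comes from the training config.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Any, Callable, Iterable, List


@dataclass
class TrainingConfig:
    # Hypothetical field; stands in for whatever the real training config exposes.
    max_validation_steps: int = 1


def run_validation(
    batches: Iterable[Any], loss_fn: Callable[[Any], float], config: TrainingConfig
) -> List[float]:
    """Evaluate at most config.max_validation_steps batches and return their losses."""
    # islice stops early if the validation set has fewer batches than requested.
    return [loss_fn(batch) for batch in islice(batches, config.max_validation_steps)]
```

With `max_validation_steps=20`, the loop evaluates 20 batches, matching the request used in the test above.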
Botao Chen	f450a0fd32	Change post training run.yaml inference config (#710 ) ## Context Colab notebooks provide limited free T4 GPUs. Making the post training template work end to end in a Colab notebook on a T4 is critical for early adoption of the stack's post training APIs. However, we found that the existing LlamaModelParallelGenerator (https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/inference/meta_reference/inference.py#L82) in the meta-reference inference implementation isn't compatible with T4 machines. In this PR, we disable create_distributed_process_group for the inference API in the post training run.yaml config and set up the distributed env variables in the notebook <img width="493" alt="Screenshot 2025-01-02 at 3 48 08 PM" src="https://github.com/user-attachments/assets/dd159f70-4cff-475c-b459-1fc6e2c720ba" /> to make meta-reference inference compatible with the free T4 machine ## test Tested with the WIP post training showcase Colab notebook https://colab.research.google.com/drive/1K4Q2wZq232_Bpy2ud4zL9aRxvCWAwyQs?usp=sharing	2025-01-03 08:37:48 -08:00
Aidan Do	e1f42eb5a5	[#432 ] Add Groq Provider - chat completions (#609 ) # What does this PR do? Contributes towards issue (#432) - Groq text chat completions - Streaming - All the sampling params that Groq supports A lot of inspiration taken from @mattf's good work at https://github.com/meta-llama/llama-stack/pull/355 What this PR does not do - Tool calls (Future PR) - Adding llama-guard model - See if we can add embeddings ### PR Train - https://github.com/meta-llama/llama-stack/pull/609 👈 - https://github.com/meta-llama/llama-stack/pull/630 ## Test Plan <details> <summary>Environment</summary> ```bash export GROQ_API_KEY=<api_key> wget https://raw.githubusercontent.com/aidando73/llama-stack/240e6e2a9c20450ffdcfbabd800a6c0291f19288/build.yaml wget https://raw.githubusercontent.com/aidando73/llama-stack/92c9b5297f9eda6a6e901e1adbd894e169dbb278/run.yaml # Build and run environment pip install -e . \ && llama stack build --config ./build.yaml --image-type conda \ && llama stack run ./run.yaml \ --port 5001 ``` </details> <details> <summary>Manual tests</summary> Using this jupyter notebook to test manually: `2140976d76/hello.ipynb` Use this code to test passing in the api key from provider_data ``` from llama_stack_client import LlamaStackClient client = LlamaStackClient( base_url="http://localhost:5001", ) response = client.inference.chat_completion( model_id="Llama3.2-3B-Instruct", messages=[ {"role": "user", "content": "Hello, world client!"}, ], # Test passing in groq_api_key from the client # Need to comment out the groq_api_key in the run.yaml file x_llama_stack_provider_data='{"groq_api_key": "<api-key>"}', # stream=True, ) response ``` </details> <details> <summary>Integration</summary> `pytest llama_stack/providers/tests/inference/test_text_inference.py -v -k groq` (run in same environment) ``` llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-groq] PASSED [ 6%] 
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-groq] SKIPPED (Other inf...) [ 12%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[llama_3b-groq] SKIPPED [ 18%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-groq] PASSED [ 25%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-groq] SKIPPED (Ot...) [ 31%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-groq] PASSED [ 37%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-groq] SKIPPED [ 43%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-groq] SKIPPED [ 50%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-groq] PASSED [ 56%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-groq] SKIPPED (Other inf...) [ 62%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[llama_8b-groq] SKIPPED [ 68%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-groq] PASSED [ 75%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-groq] SKIPPED (Ot...) 
[ 81%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-groq] PASSED [ 87%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-groq] SKIPPED [ 93%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-groq] SKIPPED [100%] ======================================= 6 passed, 10 skipped, 160 deselected, 7 warnings in 2.05s ======================================== ``` </details> <details> <summary>Unit tests</summary> `pytest llama_stack/providers/tests/inference/groq/ -v` ``` llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_sets_model PASSED [ 5%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_converts_user_message PASSED [ 10%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_converts_system_message PASSED [ 15%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_converts_completion_message PASSED [ 20%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_does_not_include_logprobs PASSED [ 25%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_does_not_include_response_format PASSED [ 30%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_does_not_include_repetition_penalty PASSED [ 35%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_stream PASSED [ 40%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_n_is_1 PASSED [ 45%] 
llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_if_max_tokens_is_0_then_it_is_not_included PASSED [ 50%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_max_tokens_if_set PASSED [ 55%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_temperature PASSED [ 60%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertChatCompletionRequest::test_includes_top_p PASSED [ 65%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertNonStreamChatCompletionResponse::test_returns_response PASSED [ 70%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertNonStreamChatCompletionResponse::test_maps_stop_to_end_of_message PASSED [ 75%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertNonStreamChatCompletionResponse::test_maps_length_to_end_of_message PASSED [ 80%] llama_stack/providers/tests/inference/groq/test_groq_utils.py::TestConvertStreamChatCompletionResponse::test_returns_stream PASSED [ 85%] llama_stack/providers/tests/inference/groq/test_init.py::TestGroqInit::test_raises_runtime_error_if_config_is_not_groq_config PASSED [ 90%] llama_stack/providers/tests/inference/groq/test_init.py::TestGroqInit::test_returns_groq_adapter PASSED [ 95%] llama_stack/providers/tests/inference/groq/test_init.py::TestGroqConfig::test_api_key_defaults_to_env_var PASSED [100%] ==================================================== 20 passed, 11 warnings in 0.08s ===================================================== ``` </details> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? 
- [x] Updated relevant documentation - [x] Wrote necessary unit or integration tests.	2025-01-03 08:27:49 -08:00
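Here is a sketch of the request conversion described by the Groq unit-test names above. In this sketch, the model is set and `n` is fixed to 1. `stream` is forwarded, `max_tokens` is omitted when it is 0, and `temperature` and `top_p` are forwarded. `logprobs`, `response_format`, and `repetition_penalty` are dropped. The sketch works on plain dicts rather than llama-stack types, and the function name and model id are hypothetical.

```python
from typing import Any, Dict


def convert_chat_completion_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a llama-stack-style request dict into kwargs for an OpenAI-compatible
    chat.completions.create call (hypothetical shapes)."""
    sampling = request.get("sampling_params", {})
    params: Dict[str, Any] = {
        "model": request["model"],
        # user, system and assistant messages map one-to-one onto chat roles.
        "messages": [{"role": m["role"], "content": m["content"]} for m in request["messages"]],
        "stream": request.get("stream", False),
        "n": 1,  # only a single choice is ever requested
    }
    # A max_tokens of 0 (or None) means "no limit set", so the key is omitted.
    if sampling.get("max_tokens"):
        params["max_tokens"] = sampling["max_tokens"]
    for key in ("temperature", "top_p"):
        if sampling.get(key) is not None:
            params[key] = sampling[key]
    # Deliberately not forwarded: request["logprobs"], request["response_format"],
    # and sampling["repetition_penalty"].
    return params
```

Because only whitelisted keys are copied, unsupported options are dropped silently rather than rejected by the upstream API.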
Ashwin Bharambe	e3f187fb83	Redact sensitive information from configs when printing, etc.	2025-01-02 13:54:02 -08:00
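This sketch redacts secrets before a config is printed or logged. The key markers and the mask string are assumptions, since the commit does not say which fields it treats as sensitive. The function returns a new structure and leaves the input unchanged.

```python
from typing import Any

# Hypothetical key fragments treated as secrets (matched case-insensitively).
SENSITIVE_MARKERS = ("api_key", "api_token", "password", "secret")
MASK = "********"


def redact_sensitive(value: Any) -> Any:
    """Return a copy of a nested config with secret-looking values masked."""
    if isinstance(value, dict):
        redacted = {}
        for key, item in value.items():
            if isinstance(key, str) and any(m in key.lower() for m in SENSITIVE_MARKERS):
                redacted[key] = MASK
            else:
                redacted[key] = redact_sensitive(item)
        return redacted
    if isinstance(value, list):
        return [redact_sensitive(item) for item in value]
    # Scalars (str, int, None, ...) are returned unchanged.
    return value
```

Matching on key names keeps the sketch generic. A stricter design would mark secret fields explicitly in the config schema instead of guessing from names.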
Botao Chen	d9f75cc98f	Import from the right path (#708 ) Import BaseModel and Field from pydantic	2025-01-02 13:15:31 -08:00
Botao Chen	750604c7af	[Post Training] Fix missing import (#705 ) ## context The post training APIs broke after the `import *` refactor in https://github.com/meta-llama/llama-stack/pull/689. This PR adds the missing import back. ## Test Issue a post training request from the client; the training finishes successfully <img width="1101" alt="Screenshot 2025-01-02 at 12 18 45 PM" src="https://github.com/user-attachments/assets/8c781459-f340-4021-85e1-fc68b1dcb8c8" /> <img width="782" alt="Screenshot 2025-01-02 at 12 18 52 PM" src="https://github.com/user-attachments/assets/14b04b7d-e5c7-4662-8fa6-748446ad3511" />	2025-01-02 13:08:20 -08:00
Xi Yan	3a269c4635	[rag evals] refactor & add ability to eval retrieval + generation in agentic eval pipeline (#664 ) # What does this PR do? - See https://github.com/meta-llama/llama-stack/pull/666 & https://github.com/meta-llama/llama-stack/pull/668 - Refactor BaseScoringFn to be just a minimal interface, add new RegistrableBaseScoring - Refactor data schema check - To separately evaluate retrieval component in RAG, we will have scoring functions needing "context" column additionally. - Refactor braintrust eval (more scoring fn added & tested in following PR) ## Test Plan ``` pytest -v -s -m llm_as_judge_scoring_together_inference scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct pytest -v -s -m basic_scoring_together_inference scoring/test_scoring.py pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py ``` <img width="847" alt="image" src="https://github.com/user-attachments/assets/d099cb2d-6f9c-4bdf-9d0d-f388cf758c0f" /> ``` pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio eval/test_eval.py ``` <img width="850" alt="image" src="https://github.com/user-attachments/assets/dce28fc3-0493-4d34-820a-567260873cc8" /> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-02 11:21:33 -08:00
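The refactored data schema check has to account for retrieval scoring functions, which need an extra "context" column. Below is a minimal sketch of such a check over a list of row dicts. The column names here are assumptions for illustration, not necessarily the ones used in the scoring providers.

```python
from typing import Dict, List, Sequence

# Assumed column sets: generation-only scorers vs. retrieval-aware scorers
# that additionally require a "context" column.
GENERATION_COLUMNS = ("input_query", "generated_answer", "expected_answer")
RETRIEVAL_COLUMNS = GENERATION_COLUMNS + ("context",)


def validate_rows(rows: List[Dict[str, object]], required: Sequence[str]) -> None:
    """Raise ValueError naming the first row that lacks a required column."""
    for i, row in enumerate(rows):
        missing = [c for c in required if c not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
```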
Aidan Do	49ad168336	[#407 ] Agents: Avoid calling tools that haven't been explicitly enabled (#637 ) # What does this PR do? Contributes to issue (#407) tl;dr - @subramen was getting a 500 error because llama-stack called code_interpreter when it never was defined as a tool. Prevents failures like: <img width="544" alt="image" src="https://github.com/user-attachments/assets/392683d2-4670-414c-aaba-07ebc006d748" /> ``` # Server side Traceback (most recent call last): File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 206, in sse_generator async for item in await event_gen: File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agents.py", line 138, in _create_agent_turn_streaming async for event in agent.create_and_execute_turn(request): File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 179, in create_and_execute_turn async for chunk in self.run( File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 252, in run async for res in self._run( File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 560, in _run result_messages = await execute_tool_call_maybe( File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 824, in execute_tool_call_maybe assert name in tools_dict, f"Tool {name} not found" AssertionError: Tool code_interpreter not found ``` Instead, if the model hallucinates, we just let it hallucinate and let the client know. 
<img width="544" alt="image" src="https://github.com/user-attachments/assets/d2418583-d45a-48db-b476-45a584f2986f" /> ## Test Plan <details> <summary>pytest llama_stack/providers/tests/agents/test_agents.py -k ollama</summary> ``` llama stack build --template ollama --image-type conda conda activate llamastack-ollama ``` ``` llama_stack/providers/tests/agents/test_agents.py ..Fss [100%] ======================================================================= FAILURES ======================================================================= _________________________________________ TestAgents.test_rag_agent_as_attachments[--ollama][ollama] __________________________________________ llama_stack/providers/tests/agents/test_agents.py:261: in test_rag_agent_as_attachments turn_response = [ llama_stack/providers/tests/agents/test_agents.py:261: in <listcomp> turn_response = [ llama_stack/providers/inline/agents/meta_reference/agents.py:153: in _create_agent_turn_streaming async for event in agent.create_and_execute_turn(request): llama_stack/providers/inline/agents/meta_reference/agent_instance.py:179: in create_and_execute_turn async for chunk in self.run( llama_stack/providers/inline/agents/meta_reference/agent_instance.py:250: in run async for res in self._run( llama_stack/providers/inline/agents/meta_reference/agent_instance.py:363: in _run rag_context, bank_ids = await self._retrieve_context( llama_stack/providers/inline/agents/meta_reference/agent_instance.py:698: in _retrieve_context bank_id = await self._ensure_memory_bank(session_id) llama_stack/providers/inline/agents/meta_reference/agent_instance.py:653: in _ensure_memory_bank await self.memory_banks_api.register_memory_bank( llama_stack/providers/utils/telemetry/trace_protocol.py:101: in async_wrapper result = await method(self, *args, **kwargs) llama_stack/distribution/routers/routing_tables.py:312: in register_memory_bank raise ValueError( E ValueError: Embeddings are now served via Inference providers.
Please upgrade your run.yaml to include inline::sentence-transformer as an additional inference provider. See https://github.com/meta-llama/llama-stack/blob/main/llama_stack/templates/together/run.yaml for an example. =============================================================== short test summary info ================================================================ FAILED llama_stack/providers/tests/agents/test_agents.py::TestAgents::test_rag_agent_as_attachments[--ollama] - ValueError: Embeddings are now served via Inference providers. Please upgrade your run.yaml to include inline::sentence-transformer as an additiona... ========================================== 1 failed, 2 passed, 2 skipped, 20 deselected, 5 warnings in 14.24s ========================================== ``` Unrelated test is failing (also failing on main) </details> <details> <summary>Manual</summary> Using this client code: `7ebc257b27/client.py` <img width="544" alt="Screenshot 2024-12-16 at 17 41 31" src="https://github.com/user-attachments/assets/7425deaf-c94a-4dda-a635-922728e373f1" /> </details> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-02 09:21:35 -08:00
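This entry replaces the `assert name in tools_dict` crash with a path that reports the problem to the client. A minimal sketch of that behavior follows, assuming enabled tools are kept in a dict of callables. The `execute_tool_call` helper and its message text are hypothetical, not the agent's actual implementation.

```python
from typing import Callable, Dict


def execute_tool_call(name: str, args: dict, enabled_tools: Dict[str, Callable[..., str]]) -> str:
    """Run an enabled tool, or report (rather than assert on) a tool that was never enabled."""
    tool = enabled_tools.get(name)
    if tool is None:
        # An AssertionError here surfaced as a 500; returning a message lets the
        # client see that the model asked for a tool it was not given.
        return f"Tool {name} is not enabled for this agent; the call was not executed."
    return tool(**args)
```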
Aidan Do	5d7b611336	Add JSON structured outputs to Ollama Provider (#680 ) # What does this PR do? Addresses issue #679 - Adds support for the response_format field for chat completions and completions so users can get their outputs in JSON ## Test Plan <details> <summary>Integration tests</summary> `pytest llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output -k ollama -s -v` ```python llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-ollama] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-ollama] PASSED ================================== 2 passed, 18 deselected, 3 warnings in 41.41s ================================== ``` </details> <details> <summary>Manual Tests</summary> ``` export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct export OLLAMA_INFERENCE_MODEL=llama3.2:3b-instruct-fp16 export LLAMA_STACK_PORT=5000 ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m llama stack build --template ollama --image-type conda llama stack run ./run.yaml \ --port $LLAMA_STACK_PORT \ --env INFERENCE_MODEL=$INFERENCE_MODEL \ --env OLLAMA_URL=http://localhost:11434 ``` ```python client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}") MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct" prompt = f""" Create a step by step plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French. You have 3 different operations you can perform. You can create a file, update a file, or delete a file. Limit your step by step plan to only these operations per step. Don't create more than 10 steps. Please ensure there's a README.md file in the root of the codebase that describes the codebase and how to run it.
Please ensure there's a requirements.txt file in the root of the codebase that describes the dependencies of the codebase. """ response = client.inference.chat_completion( model_id=MODEL_ID, messages=[ {"role": "user", "content": prompt}, ], sampling_params={ "max_tokens": 200000, }, response_format={ "type": "json_schema", "json_schema": { "$schema": "http://json-schema.org/draft-07/schema#", "title": "Plan", "description": f"A plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.", "type": "object", "properties": { "steps": { "type": "array", "items": { "type": "string" } } }, "required": ["steps"], "additionalProperties": False, } }, stream=True, ) content = "" for chunk in response: if chunk.event.delta: print(chunk.event.delta, end="", flush=True) content += chunk.event.delta try: plan = json.loads(content) print(plan) except Exception as e: print(f"Error parsing plan into JSON: {e}") plan = {"steps": []} ``` Outputs: ```json { "steps": [ "Update the requirements.txt file to include the updated dependencies specified in the peer's feedback, including the Google Cloud Translation API key.", "Update the app.py file to address the code smells and incorporate the suggested improvements, such as handling errors and exceptions, initializing the Translator object correctly, adding input validation, using type hints and docstrings, and removing unnecessary logging statements.", "Create a README.md file that describes the codebase and how to run it.", "Ensure the README.md file is up-to-date and accurate.", "Update the requirements.txt file to reflect any additional dependencies specified by the peer's feedback.", "Add documentation for each function in the app.py file using docstrings.", "Implement logging statements throughout the app.py file to monitor application execution.", "Test the API endpoint to ensure it correctly translates text from English to French and handles errors 
properly.", "Refactor the code to follow PEP 8 style guidelines and ensure consistency in naming conventions, indentation, and spacing.", "Create a new folder for logs and add a logging configuration file (e.g., logconfig.json) that specifies the logging level and output destination.", "Deploy the web server on a production environment (e.g., AWS Elastic Beanstalk or Google Cloud Platform) to make it accessible to external users." ] } ``` </details> ## Sources - Ollama api docs: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion - Ollama structured output docs: https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [x] Wrote necessary unit or integration tests.	2025-01-02 09:05:51 -08:00
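The client snippet above catches a broad `Exception` when parsing the streamed plan. The sketch below uses only the standard library and is slightly tighter. It skips empty deltas, catches only `json.JSONDecodeError`, and checks that the result has the shape required by the schema (`"steps"` as an array of strings). The `collect_plan` helper is hypothetical and is not part of the llama-stack client.

```python
import json
from typing import Iterable, Optional


def collect_plan(deltas: Iterable[Optional[str]]) -> dict:
    """Concatenate streamed text deltas and parse them as a {"steps": [...]} plan."""
    content = "".join(d for d in deltas if d)  # skip empty/None deltas
    try:
        plan = json.loads(content)
    except json.JSONDecodeError:
        return {"steps": []}  # e.g. output truncated by max_tokens
    steps = plan.get("steps") if isinstance(plan, dict) else None
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        return {"steps": []}  # valid JSON, but not the requested schema
    return {"steps": steps}
```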
Yuan Tang	8146dce11e	Add missing newlines before printing the Dockerfile content (#700 ) Before: ``` Dockerfile created successfully in /tmp/tmp.qyMdb0vI8X/DockerfileFROM python:3.10-slim WORKDIR /app RUN apt-get update && apt-get install -y iputils-ping net-tools iproute2 dnsutils telnet curl wget telnet procps psmisc lsof traceroute bubblewrap && rm -rf /var/lib/apt/lists/* ``` After: ``` Dockerfile created successfully in /tmp/tmp.qyMdb0vI8X/Dockerfile FROM python:3.10-slim WORKDIR /app RUN apt-get update && apt-get install -y iputils-ping net-tools iproute2 dnsutils telnet curl wget telnet procps psmisc lsof traceroute bubblewrap && rm -rf /var/lib/apt/lists/* ``` Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-02 09:04:29 -08:00
Yuan Tang	c1987d6143	Fix failing flake8 E226 check (#701 ) This fixes the pre-commit check when running locally (not sure why this was not caught on CI check): ``` > pre-commit run --show-diff-on-failure --color=always --all-files trim trailing whitespace.................................................Passed check python ast.........................................................Passed check for merge conflicts................................................Passed check for added large files..............................................Passed fix end of files.........................................................Passed Insert license in comments...............................................Passed flake8...................................................................Failed - hook id: flake8 - exit code: 1 llama_stack/distribution/ui/page/evaluations/app_eval.py:132:65: E226 missing whitespace around arithmetic operator llama_stack/distribution/ui/page/evaluations/native_eval.py:235:61: E226 missing whitespace around arithmetic operator llama_stack/providers/utils/telemetry/trace_protocol.py:56:78: E226 missing whitespace around arithmetic operator ``` Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-02 09:04:07 -08:00
Xi Yan	a6c206ea66	[bugfix] fix prompt_adapter interleaved_content_convert_to_raw (#696 ) # What does this PR do? - fix interleaved_content_convert_to_raw in prompt_adapter to correctly convert ImageContentItem to RawMediaItem with raw data bytes ## Test Plan ``` torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py ``` Before <img width="844" alt="image" src="https://github.com/user-attachments/assets/f2784b42-2e36-4477-9041-903d5d628a68" /> After <img width="836" alt="image" src="https://github.com/user-attachments/assets/362b6e47-29f7-4119-bcf3-f75db842735f" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-30 16:40:36 -08:00
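The fix converts an image content item into a raw media item that holds the decoded bytes. Below is a minimal sketch of that conversion. The two dataclasses are hypothetical stand-ins for the llama-stack types, and this sketch does not fetch remote URLs. It only handles base64 `data` and `data:` URLs.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageContentItem:  # hypothetical stand-in for the llama-stack type
    url: Optional[str] = None
    data: Optional[str] = None  # base64-encoded image


@dataclass
class RawMediaItem:  # hypothetical stand-in holding raw bytes
    data: bytes


def to_raw(item: ImageContentItem) -> RawMediaItem:
    """Convert an image item into raw bytes (no network access in this sketch)."""
    if item.data is not None:
        return RawMediaItem(data=base64.b64decode(item.data))
    if item.url and item.url.startswith("data:"):
        # data URL layout: data:<mime>;base64,<payload>
        _, payload = item.url.split(",", 1)
        return RawMediaItem(data=base64.b64decode(payload))
    raise ValueError("remote URLs must be fetched before conversion")
```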
Xi Yan	7c1e3daa75	[bugfix] fix meta-reference agents w/ safety multiple model loading pytest (#694 ) # What does this PR do? - Fix broken pytest for meta-reference's agents - Safety model needs to be registered to a different provider id from inference model in order to be recognized ## Test Plan ``` torchrun $CONDA_PREFIX/bin/pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "meta_reference" --safety-shield meta-llama/Llama-Guard-3-1B --inference-model meta-llama/Llama-3.1-8B-Instruct ``` Before <img width="845" alt="image" src="https://github.com/user-attachments/assets/83818fe1-2179-4e9c-a753-bf1472a2f01d" /> After <img width="851" alt="image" src="https://github.com/user-attachments/assets/1cf8124b-14e2-47bf-80fd-ef8b4b3f6fd9" /> Other test not broken ``` pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8 ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-30 16:25:46 -08:00
Xi Yan	694adb1501	[bugfix] fix broken vision inference, change serialization for bytes (#693 ) # What does this PR do? - vision inference via image as binary bytes fails with serialization error - add custom serialization for "bytes" in `_URLOrData` ## Test Plan ``` pytest -v -s -k "fireworks" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming ``` Before <img width="1020" alt="image" src="https://github.com/user-attachments/assets/3803fcee-32ee-4b8e-ba46-47848e1a6247" /> After <img width="1018" alt="image" src="https://github.com/user-attachments/assets/f3e3156e-88ce-40fd-ad1b-44b87f376e03" /> <img width="822" alt="image" src="https://github.com/user-attachments/assets/1898696f-95c0-4694-8a47-8f51c7de0e86" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-30 13:57:41 -08:00
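The PR adds custom serialization for `bytes` in `_URLOrData`, which is a pydantic model. The following is a stdlib-only analogue of the same idea, not the PR's code. Bytes are emitted as base64 text so that JSON encoding no longer fails.

```python
import base64
import json
from typing import Any


class BytesAwareEncoder(json.JSONEncoder):
    """JSON encoder that writes bytes as base64 text instead of raising TypeError."""

    def default(self, o: Any) -> Any:
        if isinstance(o, (bytes, bytearray)):
            return base64.b64encode(bytes(o)).decode("ascii")
        return super().default(o)


def decode_bytes(value: str) -> bytes:
    # The receiving side reverses the encoding.
    return base64.b64decode(value)
```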
Xi Yan	3c72c034e6	[remove import *] clean up import *'s (#689 ) # What does this PR do? - as title, cleaning up `import *`'s - upgrade tests to make them more robust to bad model outputs - remove import *'s in llama_stack/apis/* (skip __init__ modules) <img width="465" alt="image" src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2" /> - run `sh run_openapi_generator.sh`, no types get affected ## Test Plan ### Providers Tests agents ``` pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8 ``` inference ```bash # meta-reference torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py # together pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py ``` safety ``` pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B ``` memory ``` pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384 ``` scoring ``` pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py pytest -v -s -m
braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py ``` datasetio ``` pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py ``` eval ``` pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py ``` ### Client-SDK Tests ``` LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk ``` ### llama-stack-apps ``` PORT=5000 LOCALHOST=localhost python -m examples.agents.hello $LOCALHOST $PORT python -m examples.agents.inflation $LOCALHOST $PORT python -m examples.agents.podcast_transcript $LOCALHOST $PORT python -m examples.agents.rag_as_attachments $LOCALHOST $PORT python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT # Vision model python -m examples.interior_design_assistant.app python -m examples.agent_store.app $LOCALHOST $PORT ``` ### CLI ``` which llama llama model prompt-format -m Llama3.2-11B-Vision-Instruct llama model list llama stack list-apis llama stack list-providers inference llama stack build --template ollama --image-type conda ``` ### Distributions Tests ollama ``` llama stack build --template ollama --image-type conda ollama run llama3.2:1b-instruct-fp16 llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct ``` fireworks ``` llama stack build --template fireworks --image-type conda llama stack run ./llama_stack/templates/fireworks/run.yaml ``` together ``` llama stack build --template together --image-type conda llama stack run ./llama_stack/templates/together/run.yaml ``` tgi ``` llama stack run ./llama_stack/templates/tgi/run.yaml 
--env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-27 15:45:44 -08:00
Aidan Do	21fb92d7cf	Add 3.3 70B to Ollama inference provider (#681 ) # What does this PR do? Adds 3.3 70B support to Ollama inference provider ## Test Plan <details> <summary>Manual</summary> ```bash # 42GB to download ollama pull llama3.3:70b ollama run llama3.3:70b --keepalive 60m export LLAMA_STACK_PORT=5000 pip install -e . \ && llama stack build --template ollama --image-type conda \ && llama stack run ./distributions/ollama/run.yaml \ --port $LLAMA_STACK_PORT \ --env INFERENCE_MODEL=Llama3.3-70B-Instruct \ --env OLLAMA_URL=http://localhost:11434 export LLAMA_STACK_PORT=5000 llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT \ inference chat-completion \ --model-id Llama3.3-70B-Instruct \ --message "hello, what model are you?" ``` <img width="1221" alt="image" src="https://github.com/user-attachments/assets/dcffbdd9-94c8-4d47-9f95-4ef6c3756294" /> </details> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-25 22:15:58 -08:00
Yuan Tang	987e651755	Add missing venv option in --image-type (#677 ) "venv" option is supported but not mentioned in the prompt. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2024-12-21 21:10:13 -08:00
Botao Chen	bae197c37e	Fix post training apis broken by torchtune release (#674 ) There was a torchtune release this morning (https://github.com/pytorch/torchtune/releases/tag/v0.5.0) that breaks the post training APIs ## test Spun up the server; post training works again after the fix <img width="1314" alt="Screenshot 2024-12-20 at 4 08 54 PM" src="https://github.com/user-attachments/assets/dfae724d-ebf0-4846-9715-096efa060cee" /> ## Note We need to think hard about how to avoid this happening again, and have a fast follow-up on this after the holidays	2024-12-20 16:12:02 -08:00
Botao Chen	06cb0c837e	[torchtune integration] post training + eval (#670 ) ## What does this PR do? - Add related Apis in experimental-post-training template to enable eval on the finetuned checkpoint in the template - A small bug fix on meta reference eval - A small error handle improvement on post training ## Test Plan From client side issued an E2E post training request https://github.com/meta-llama/llama-stack-client-python/pull/70 and get eval results successfully <img width="1315" alt="Screenshot 2024-12-20 at 12 06 59 PM" src="https://github.com/user-attachments/assets/a09bd524-59ae-490c-908f-2e36ccf27c0a" />	2024-12-20 13:43:13 -08:00
Dinesh Yeduguru	c8be0bf1c9	Tools API with brave and MCP providers (#639 ) This PR adds a new Tools API and adds two tool runtime providers: brave and MCP. Test plan: ``` curl -X POST 'http://localhost:5000/alpha/toolgroups/register' \ -H 'Content-Type: application/json' \ -d '{ "tool_group_id": "simple_tool", "tool_group": { "type": "model_context_protocol", "endpoint": {"uri": "http://localhost:56000/sse"} }, "provider_id": "model-context-protocol" }' curl -X POST 'http://localhost:5000/alpha/toolgroups/register' \ -H 'Content-Type: application/json' \ -d '{ "tool_group_id": "search", "provider_id": "brave-search", "tool_group": { "type": "user_defined", "tools": [ { "name": "brave_search", "description": "A web search tool", "parameters": [ { "name": "query", "parameter_type": "string", "description": "The query to search" } ], "metadata": {}, "tool_prompt_format": "json" } ] } }' curl -X GET http://localhost:5000/alpha/tools/list | jq . % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 662 100 662 0 0 333k 0 --:--:-- --:--:-- --:--:-- 646k [ { "identifier": "brave_search", "provider_resource_id": "brave_search", "provider_id": "brave-search", "type": "tool", "tool_group": "search", "description": "A web search tool", "parameters": [ { "name": "query", "parameter_type": "string", "description": "The query to search" } ], "metadata": {}, "tool_prompt_format": "json" }, { "identifier": "fetch", "provider_resource_id": "fetch", "provider_id": "model-context-protocol", "type": "tool", "tool_group": "simple_tool", "description": "Fetches a website and returns its content", "parameters": [ { "name": "url", "parameter_type": "string", "description": "URL to fetch" } ], "metadata": { "endpoint": "http://localhost:56000/sse" }, "tool_prompt_format": "json" } ] curl -X POST 'http://localhost:5000/alpha/tool-runtime/invoke' \ -H 'Content-Type: application/json' \ -d '{ "tool_name": "fetch", "args": { "url":
"http://google.com/" } }' curl -X POST 'http://localhost:5000/alpha/tool-runtime/invoke' \ -H 'Content-Type: application/json' -H 'X-LlamaStack-ProviderData: {"api_key": "<KEY>"}' \ -d '{ "tool_name": "brave_search", "args": { "query": "who is meta ceo" } }' ```	2024-12-19 21:25:17 -08:00
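The last curl call above passes per-request provider credentials in the `X-LlamaStack-ProviderData` header. The sketch below builds the same request with Python's standard library. The endpoint path and body shape are copied from the curl example, and `build_invoke_request` is a hypothetical helper. The request is only constructed here. Sending it with `urllib.request.urlopen(req)` requires a running server.

```python
import json
import urllib.request


def build_invoke_request(base_url: str, tool_name: str, args: dict, provider_data: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the tool-runtime invoke endpoint."""
    body = json.dumps({"tool_name": tool_name, "args": args}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/alpha/tool-runtime/invoke",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Provider credentials travel with the request, as in the curl example.
            "X-LlamaStack-ProviderData": json.dumps(provider_data),
        },
    )
```

Note that `urllib` normalizes stored header names with `str.capitalize()`, so the header is looked up as `X-llamastack-providerdata`. HTTP header names are case-insensitive on the wire.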
Aidan Do	17fdb47e5e	Add Llama 70B 3.3 to fireworks (#654 ) # What does this PR do? - Makes Llama 70B 3.3 available for fireworks ## Test Plan ```shell pip install -e . \ && llama stack build --config distributions/fireworks/build.yaml --image-type conda \ && llama stack run distributions/fireworks/run.yaml \ --port 5000 ``` ```python response = client.inference.chat_completion( model_id="Llama3.3-70B-Instruct", messages=[ {"role": "user", "content": "hello world"}, ], ) ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-19 17:32:49 -08:00
Dinesh Yeduguru	8b8d1c1ef4	fix trace starting in library client (#655 ) # What does this PR do? Because of the way the library client sets up async IO boundaries, tracing was broken with streaming. This PR fixes tracing so that it starts at the right point to correctly capture the lifetime of async generator functions. Test plan: Script ran: https://gist.github.com/yanxi0830/f6645129e55ab12de3cd6ec71564c69e Before: No spans returned for a session Now: We see spans <img width="1678" alt="Screenshot 2024-12-18 at 9 50 46 PM" src="https://github.com/user-attachments/assets/58a3b0dd-a41c-489a-b89a-075e698a2c03" />	2024-12-19 16:13:52 -08:00
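The following is a minimal stdlib sketch of why the start point matters for async generators. A plain list stands in for a tracing span. The `traced` wrapper and the `log` recorder are hypothetical and are not llama-stack's telemetry API. An async generator's body does not run when it is called. It starts on the first `__anext__`. A span opened inside the wrapper therefore begins when consumption begins and closes in `finally` once the stream is exhausted or closed.

```python
import asyncio
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")


async def traced(gen: AsyncIterator[T], name: str, log: List[str]) -> AsyncIterator[T]:
    """Wrap an async generator so the 'span' covers its whole iteration."""
    log.append(f"start {name}")  # runs on the first __anext__, not at call time
    try:
        async for item in gen:
            yield item
    finally:
        log.append(f"end {name}")  # runs when the stream ends or is closed
```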
cdgamarose-nv	ddf37ea467	Fixed imports for inference (#661 ) # What does this PR do? In short, provide a summary of what this PR does and why. Usually, the relevant context should be present in a linked issue. - [x] Addresses issue (#issue) ``` from .nvidia import NVIDIAInferenceAdapter File "/localhome/local-cdgamarose/llama-stack/llama_stack/providers/remote/inference/nvidia/nvidia.py", line 37, in <module> from .openai_utils import ( File "/localhome/local-cdgamarose/llama-stack/llama_stack/providers/remote/inference/nvidia/openai_utils.py", line 11, in <module> from llama_models.llama3.api.datatypes import ( ImportError: cannot import name 'CompletionMessage' from 'llama_models.llama3.api.datatypes' (/localhome/local-cdgamarose/.local/lib/python3.10/site-packages/llama_models/llama3/api/datatypes.py) ++ error_handler 62 ``` ## Test Plan Deploy NIM using docker from https://build.nvidia.com/meta/llama-3_1-8b-instruct?snippet_tab=Docker ``` (lsmyenv) local-cdgamarose@a4u8g-0006:~/llama-stack$ python3 -m pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_BASE_URL=http://localhost:8000 -k test_completion --inference-model Llama3.1-8B-Instruct ======================================================================================== test session starts ========================================================================================= platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /localhome/local-cdgamarose/anaconda3/envs/lsmyenv/bin/python3 cachedir: .pytest_cache rootdir: /localhome/local-cdgamarose/llama-stack configfile: pyproject.toml plugins: anyio-4.7.0, asyncio-0.25.0 asyncio: mode=strict, asyncio_default_fixture_loop_scope=None collected 24 items / 21 deselected / 3 selected llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] Initializing NVIDIAInferenceAdapter(http://localhost:8000)... Checking NVIDIA NIM health... 
Checking NVIDIA NIM health... PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-nvidia] SKIPPED (Other inference providers don't support completion() yet) llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-nvidia] SKIPPED (This test is not quite robust) ====================================================================== 1 passed, 2 skipped, 21 deselected, 2 warnings in 1.57s ======================================================================= ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [x] Wrote necessary unit or integration tests.	2024-12-19 14:19:36 -08:00
Ashwin Bharambe	540fc4d717	Fix Meta reference GPU implementation (#663 ) By performing in-place mutations, we lost. Never in life do that.	2024-12-19 14:09:45 -08:00
Ashwin Bharambe	f19eb8eee3	Update types in parallel_utils for meta-reference-gpu impl	2024-12-19 13:58:41 -08:00
Xi Yan	5be2ea37b1	fix context_retriever model->model_id	2024-12-19 12:52:00 -08:00
Dinesh Yeduguru	03607a68c7	remove unused telemetry related code for console (#659 ) # What does this PR do? Remove unused code since this now exists in the meta reference provider as a sink ## Test Plan llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml	2024-12-19 11:21:11 -08:00
Botao Chen	36b4fe02cc	[4/n][torchtune integration] support lazy load model during inference (#620 ) ## What does this PR do? In this PR, we refactor the meta reference inference logic to support - loading the model when it is registered instead of when the server spins up - inference on a finetuned model checkpoint on top of a native llama model ## Why these changes are needed To solve the existing pain points that - the user cannot lazy load the model and hot switch the inference checkpoint after spinning up the server - this blocks us from doing inference and eval on the same server for a finetuned checkpoint after post training - the user cannot do inference on a finetuned checkpoint on top of native llama models ## Expected user experience changes - The inference model won't be loaded when the server spins up. Instead, it will be loaded when the model is registered. If the user adds the model as a models resource in run.yaml, it will be registered and loaded automatically when the server starts. There is an optional flag 'skip_initialize' in model metadata to skip model loading during registration. - There is an optional flag 'llama_model' in model metadata to identify the base model of the Model class for validation and to initialize the model architecture. 
- The model identifier no longer needs to be a native llama model - the default inference model name changes from 'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct' - This aligns with the checkpoint folder name after running 'llama model download' - This aligns with the descriptor name defined in the llama-models SKU list `bf5b0c4fe7/models/datatypes.py (L95)` ## Test Run: python llama_stack/scripts/distro_codegen.py Run unit tests: - torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py - torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_model_registration.py Test the post-training experience. On the server side, run: llama stack run llama_stack/templates/experimental-post-training/run.yaml The server spins up without a model loaded [Screenshot 2024-12-17 at 1 24 50 PM](https://github.com/user-attachments/assets/ce1f606b-3b6f-452f-b48e-b3761ffd90f3) On the client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 models register Llama3.2-3B-Instruct The model registers successfully and is loaded [Screenshot 2024-12-17 at 1 26 30 PM](https://github.com/user-attachments/assets/56e02131-cf7d-4de5-8f63-fbdcb8c55c26) [Screenshot 2024-12-17 at 1 26 09 PM](https://github.com/user-attachments/assets/a83255a1-20f5-40a2-af51-55641410a115) If "skip_initialize" is added to the metadata, the model is registered but not loaded. On the client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?" 
Inference on the model succeeds [Screenshot 2024-12-17 at 1 27 33 PM](https://github.com/user-attachments/assets/8e708545-3fe7-4a73-8754-1470fa5f1e75) Test the inference experience. Run: llama stack run llama_stack/templates/meta-reference-gpu/run.yaml The model is loaded because it is in the resource list in run.yaml [Screenshot 2024-12-17 at 1 30 19 PM](https://github.com/user-attachments/assets/5c8af817-66eb-43f8-bf4c-f5e24b0a12c6) On the client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?" Inference succeeds [Screenshot 2024-12-17 at 1 31 08 PM](https://github.com/user-attachments/assets/471809aa-c65e-46dc-a37e-7094fb857f97) ## Inference on a finetuned model Register a model finetuned with the post training API (torchtune): - the model is registered and loaded successfully - the model shows up in the model list [Screenshot 2024-12-18 at 3 56 33 PM](https://github.com/user-attachments/assets/2994b4f5-4fa9-40c6-acc6-4b971479f3e2) Run inference [Screenshot 2024-12-18 at 3 57 59 PM](https://github.com/user-attachments/assets/d117abbc-b2a0-41d8-a028-1a13128787b2)	2024-12-18 16:30:53 -08:00
Ashwin Bharambe	3b4b2ea30c	fix replace_env_vars bug	2024-12-18 13:48:30 -08:00
Ashwin Bharambe	12cbed1617	Register Message and ResponseFormat	2024-12-18 10:32:25 -08:00
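
Commit 36b4fe02cc moves model loading in the meta reference provider from server startup to model registration, with two optional metadata flags: 'skip_initialize' (register without loading) and 'llama_model' (name the base model when the identifier is not a native llama model). A minimal sketch of that control flow in plain Python follows. It is an illustration under stated assumptions, not the llama-stack implementation: `Model`, `LazyInferenceProvider`, `register_model`, and `is_loaded` are hypothetical names, falling back to the identifier when 'llama_model' is absent is an assumption, and real weight loading is stubbed out as a dictionary entry.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class Model:
    # Hypothetical stand-in for a registered model resource.
    identifier: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class LazyInferenceProvider:
    """Hypothetical provider that loads weights at registration, not at startup."""

    def __init__(self, known_base_models: Iterable[str]) -> None:
        self.known_base_models = set(known_base_models)
        # identifier -> base model architecture that was "loaded"
        self.loaded: Dict[str, str] = {}

    def register_model(self, model: Model) -> Model:
        # Assumption: without 'llama_model', the identifier itself must name a base model.
        base = model.metadata.get("llama_model", model.identifier)
        if base not in self.known_base_models:
            raise ValueError(f"unknown base model: {base}")
        # 'skip_initialize' registers the model without loading any weights.
        if not model.metadata.get("skip_initialize", False):
            self._load(model.identifier, base)
        return model

    def _load(self, identifier: str, base: str) -> None:
        # Stub: a real provider would build the architecture and read the checkpoint here.
        self.loaded[identifier] = base

    def is_loaded(self, identifier: str) -> bool:
        return identifier in self.loaded
```

Because loading happens inside registration in this sketch, a server that is already running can pick up a newly finetuned checkpoint simply by registering it, which is the hot-switch scenario the commit message describes.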


527 commits