llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-21 14:42:25 +00:00

Author	SHA1	Message	Date
Xi Yan	8a8550fe9b	cli imports	2024-12-26 17:19:40 -08:00
Xi Yan	21a6bd57ea	fix imports	2024-12-26 17:17:03 -08:00
Xi Yan	c6d3fc6fb6	datatypes	2024-12-26 17:00:56 -08:00
Xi Yan	6c6b5fb091	openai_compat	2024-12-26 16:59:06 -08:00
Xi Yan	9ab0730294	kvstore	2024-12-26 16:55:40 -08:00
Xi Yan	30fee82407	vector_store	2024-12-26 16:54:33 -08:00
Xi Yan	b7bc1c6297	telemetry	2024-12-26 16:48:54 -08:00
Xi Yan	bb0a3f5c8e	remove more imports	2024-12-26 16:43:30 -08:00
Xi Yan	93ed8aa814	remove more imports	2024-12-26 16:39:31 -08:00
Xi Yan	0a0c01fbc2	test agents imports	2024-12-26 16:32:23 -08:00
Xi Yan	9bdb7236b2	Merge branch 'main' into remove_import_stars	2024-12-26 15:50:12 -08:00
Xi Yan	88c967a3e2	fix client-sdk memory/safety test	2024-12-26 15:49:15 -08:00
Xi Yan	b05d8fd956	fix client-sdk agents/inference test	2024-12-26 15:49:14 -08:00
Xi Yan	19c99e36a0	update playground doc video	2024-12-26 15:49:14 -08:00
Xi Yan	70db039ff4	fix client-sdk memory/safety test	2024-12-26 15:48:28 -08:00
Xi Yan	b6aca4c8bb	fix client-sdk agents/inference test	2024-12-26 15:44:34 -08:00
Xi Yan	da26d22f90	remove imports 1/n	2024-12-26 15:19:06 -08:00
Xi Yan	4e1d0a2fc5	update playground doc video	2024-12-26 14:50:19 -08:00
Xi Yan	28ce511986	fix --endpoint docs	2024-12-26 14:32:07 -08:00
Ikko Eltociear Ashimine	7ba95a8e74	docs: update evals_reference/index.md (#675) # What does this PR do? minor fix ## Sources Please link relevant resources if necessary. ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-26 11:32:37 -08:00
Aidan Do	21fb92d7cf	Add 3.3 70B to Ollama inference provider (#681) # What does this PR do? Adds 3.3 70B support to Ollama inference provider ## Test Plan Manual: ```bash # 42GB to download ollama pull llama3.3:70b ollama run llama3.3:70b --keepalive 60m export LLAMA_STACK_PORT=5000 pip install -e . \ && llama stack build --template ollama --image-type conda \ && llama stack run ./distributions/ollama/run.yaml \ --port $LLAMA_STACK_PORT \ --env INFERENCE_MODEL=Llama3.3-70B-Instruct \ --env OLLAMA_URL=http://localhost:11434 export LLAMA_STACK_PORT=5000 llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT \ inference chat-completion \ --model-id Llama3.3-70B-Instruct \ --message "hello, what model are you?" ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-25 22:15:58 -08:00
Yuan Tang	fa371fdc9e	Removed unnecessary CONDA_PREFIX env var in installation guide (#683) This is not needed since `conda activate stack` has already been executed.	2024-12-23 13:17:30 -08:00
Yuan Tang	987e651755	Add missing venv option in --image-type (#677) "venv" option is supported but not mentioned in the prompt. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2024-12-21 21:10:13 -08:00
Botao Chen	bae197c37e	Fix post training apis broken by torchtune release (#674) There was a torchtune release this morning (https://github.com/pytorch/torchtune/releases/tag/v0.5.0) that broke the post training APIs. ## Test After the fix, spinning up the server and running post training works again. ## Note We need to think hard about how to avoid this happening again and do a fast follow-up on this after the holidays.	2024-12-20 16:12:02 -08:00
Botao Chen	06cb0c837e	[torchtune integration] post training + eval (#670) ## What does this PR do? - Add related APIs in the experimental-post-training template to enable eval on the finetuned checkpoint in the template - A small bug fix on meta reference eval - A small error-handling improvement on post training ## Test Plan From the client side, issued an E2E post training request https://github.com/meta-llama/llama-stack-client-python/pull/70 and got eval results successfully.	2024-12-20 13:43:13 -08:00
Dinesh Yeduguru	c8be0bf1c9	Tools API with brave and MCP providers (#639) This PR adds a new Tools api and adds two tool runtime providers: brave and MCP. Test plan: ``` curl -X POST 'http://localhost:5000/alpha/toolgroups/register' \ -H 'Content-Type: application/json' \ -d '{ "tool_group_id": "simple_tool", "tool_group": { "type": "model_context_protocol", "endpoint": {"uri": "http://localhost:56000/sse"} }, "provider_id": "model-context-protocol" }' curl -X POST 'http://localhost:5000/alpha/toolgroups/register' \ -H 'Content-Type: application/json' \ -d '{ "tool_group_id": "search", "provider_id": "brave-search", "tool_group": { "type": "user_defined", "tools": [ { "name": "brave_search", "description": "A web search tool", "parameters": [ { "name": "query", "parameter_type": "string", "description": "The query to search" } ], "metadata": {}, "tool_prompt_format": "json" } ] } }' curl -X GET http://localhost:5000/alpha/tools/list | jq . [ { "identifier": "brave_search", "provider_resource_id": "brave_search", "provider_id": "brave-search", "type": "tool", "tool_group": "search", "description": "A web search tool", "parameters": [ { "name": "query", "parameter_type": "string", "description": "The query to search" } ], "metadata": {}, "tool_prompt_format": "json" }, { "identifier": "fetch", "provider_resource_id": "fetch", "provider_id": "model-context-protocol", "type": "tool", "tool_group": "simple_tool", "description": "Fetches a website and returns its content", "parameters": [ { "name": "url", "parameter_type": "string", "description": "URL to fetch" } ], "metadata": { "endpoint": "http://localhost:56000/sse" }, "tool_prompt_format": "json" } ] curl -X POST 'http://localhost:5000/alpha/tool-runtime/invoke' \ -H 'Content-Type: application/json' \ -d '{ "tool_name": "fetch", "args": { "url": "http://google.com/" } }' curl -X POST 'http://localhost:5000/alpha/tool-runtime/invoke' \ -H 'Content-Type: application/json' -H 'X-LlamaStack-ProviderData: {"api_key": "<KEY>"}' \ -d '{ "tool_name": "brave_search", "args": { "query": "who is meta ceo" } }' ```	2024-12-19 21:25:17 -08:00
Aidan Do	17fdb47e5e	Add Llama 70B 3.3 to fireworks (#654) # What does this PR do? - Makes Llama 70B 3.3 available for fireworks ## Test Plan ```shell pip install -e . \ && llama stack build --config distributions/fireworks/build.yaml --image-type conda \ && llama stack run distributions/fireworks/run.yaml \ --port 5000 ``` ```python response = client.inference.chat_completion( model_id="Llama3.3-70B-Instruct", messages=[ {"role": "user", "content": "hello world"}, ], ) ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-19 17:32:49 -08:00
Dinesh Yeduguru	8b8d1c1ef4	fix trace starting in library client (#655) # What does this PR do? Because of the way the library client sets up async IO boundaries, tracing was broken with streaming. This PR fixes tracing so that it starts at the right point to correctly capture the lifetime of async generator functions. Test plan: Script ran: https://gist.github.com/yanxi0830/f6645129e55ab12de3cd6ec71564c69e Before: No spans returned for a session Now: We see spans	2024-12-19 16:13:52 -08:00
cdgamarose-nv	ddf37ea467	Fixed imports for inference (#661) # What does this PR do? In short, provide a summary of what this PR does and why. Usually, the relevant context should be present in a linked issue. - [x] Addresses issue (#issue) ``` from .nvidia import NVIDIAInferenceAdapter File "/localhome/local-cdgamarose/llama-stack/llama_stack/providers/remote/inference/nvidia/nvidia.py", line 37, in <module> from .openai_utils import ( File "/localhome/local-cdgamarose/llama-stack/llama_stack/providers/remote/inference/nvidia/openai_utils.py", line 11, in <module> from llama_models.llama3.api.datatypes import ( ImportError: cannot import name 'CompletionMessage' from 'llama_models.llama3.api.datatypes' (/localhome/local-cdgamarose/.local/lib/python3.10/site-packages/llama_models/llama3/api/datatypes.py) ++ error_handler 62 ``` ## Test Plan Deploy NIM using docker from https://build.nvidia.com/meta/llama-3_1-8b-instruct?snippet_tab=Docker ``` (lsmyenv) local-cdgamarose@a4u8g-0006:~/llama-stack$ python3 -m pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_BASE_URL=http://localhost:8000 -k test_completion --inference-model Llama3.1-8B-Instruct ======================================================================================== test session starts ========================================================================================= platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /localhome/local-cdgamarose/anaconda3/envs/lsmyenv/bin/python3 cachedir: .pytest_cache rootdir: /localhome/local-cdgamarose/llama-stack configfile: pyproject.toml plugins: anyio-4.7.0, asyncio-0.25.0 asyncio: mode=strict, asyncio_default_fixture_loop_scope=None collected 24 items / 21 deselected / 3 selected llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] Initializing NVIDIAInferenceAdapter(http://localhost:8000)... Checking NVIDIA NIM health... Checking NVIDIA NIM health... PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-nvidia] SKIPPED (Other inference providers don't support completion() yet) llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-nvidia] SKIPPED (This test is not quite robust) ====================================================================== 1 passed, 2 skipped, 21 deselected, 2 warnings in 1.57s ======================================================================= ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [x] Wrote necessary unit or integration tests.	2024-12-19 14:19:36 -08:00
Ashwin Bharambe	540fc4d717	Fix Meta reference GPU implementation (#663) By performing in-place mutations, we lost. Never in life do that.	2024-12-19 14:09:45 -08:00
Ashwin Bharambe	f19eb8eee3	Update types in parallel_utils for meta-reference-gpu impl	2024-12-19 13:58:41 -08:00
Vladimir Ivic	b33086d632	Adding @vladimirivic to the owners file	2024-12-19 13:22:10 -08:00
Xi Yan	5be2ea37b1	fix context_retriever model->model_id	2024-12-19 12:52:00 -08:00
Dinesh Yeduguru	03607a68c7	remove unused telemetry related code for console (#659) # What does this PR do? Remove unused code since this now exists in the meta reference provider as a sink ## Test Plan llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml	2024-12-19 11:21:11 -08:00
Botao Chen	36b4fe02cc	[4/n][torchtune integration] support lazy load model during inference (#620) ## What does this PR do? In this PR, we refactor the meta reference inference logic to support - loading the model when it is registered instead of when the server spins up - running inference on a finetuned model checkpoint on top of a native Llama model ## Why are these changes needed To solve the existing pain points: - users cannot lazy load the model and hot-switch the inference checkpoint after spinning up the server - this blocks us from doing inference and eval on the same server for a finetuned checkpoint after post training - users cannot run inference on a finetuned checkpoint on top of native Llama models ## Expected user experience change - The inference model won't be loaded when the server spins up. Instead, it will be loaded when the model is registered. If the user adds the model as a models resource in run.yaml, it will be registered and loaded automatically when the server starts. There is an optional flag 'skip_initialize' in model metadata to skip model loading during registration. - There is an optional flag 'llama_model' in model metadata to identify the base model of the Model class, used for validation and to initialize the model architecture. The model identifier no longer needs to be a native Llama model - The default inference model name changes from 'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct' - It aligns with the checkpoint folder name after running 'llama model download' - It aligns with the descriptor name defined in the llama-models SKU list `bf5b0c4fe7/models/datatypes.py (L95)` ## Test Run python llama_stack/scripts/distro_codegen.py. Run unit tests: - torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py - torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_model_registration.py Test the post training experience. On the server side, run: llama stack run llama_stack/templates/experimental-post-training/run.yaml The server spins up without a model loaded. On the client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 models register Llama3.2-3B-Instruct The model registers successfully and is loaded. If "skip_initialize" is added to the metadata, the model is registered but isn't loaded. On the client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?" Inference on the model succeeds. Test the inference experience. Run: llama stack run llama_stack/templates/meta-reference-gpu/run.yaml The model is loaded since it is in the resource list in run.yaml. On the client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?" Inference succeeds. ## Inference on a finetuned model Register a model finetuned by the post training API (torchtune): - the model is registered and loaded successfully - the model shows up in the model list Then run inference.	2024-12-18 16:30:53 -08:00
Ashwin Bharambe	3b4b2ea30c	fix replace_env_vars bug	2024-12-18 13:48:30 -08:00
Ashwin Bharambe	12cbed1617	Register Message and ResponseFormat	2024-12-18 10:32:25 -08:00
Ashwin Bharambe	ceadaf1840	Don't include 3B / 1B models for bedrock since they aren't on-demand	2024-12-18 06:30:02 -08:00
Ashwin Bharambe	c39a3777b5	Make bedrock "just" work	2024-12-18 06:22:33 -08:00
Ashwin Bharambe	d6fcdefec7	Bump version to 0.0.63	2024-12-17 23:15:27 -08:00
Ashwin Bharambe	f1d6cb22d7	Update URL type to avoid string-ifying and creating complexity	2024-12-17 22:50:11 -08:00
Xi Yan	75e72cf2fc	model_type=llm for filtering available models for playground	2024-12-17 19:42:38 -08:00
Ashwin Bharambe	2f9fdb0ea7	Update notebook	2024-12-17 18:52:02 -08:00
Ashwin Bharambe	0fb4b7de6f	Add more debugging logs to when llama guard fails	2024-12-17 18:52:02 -08:00
Ashwin Bharambe	eea478618d	Bump version to 0.0.62	2024-12-17 18:19:47 -08:00
Xi Yan	af8f1b3531	model selection playground fix	2024-12-17 18:13:52 -08:00
Dinesh Yeduguru	3700022d6f	store attribute values in builtin types to avoid otel warnings (#649) # What does this PR do? Serialize objects to built-in types to avoid otel warnings ## Test Plan llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml	2024-12-17 17:10:43 -08:00
Henry Tu	0e2a99e223	Update Cerebras from Llama 3.1 to 3.3 (#645) # What does this PR do? Cerebras is rolling out support for llama 3.3 70b and deprecating llama 3.1 70b. This PR updates the documentation, config, and internal mapping to reflect this change. cc: @ashwinb @raghotham	2024-12-17 16:28:24 -08:00
Ashwin Bharambe	b7a7caa9a8	Fix conversion to RawMessage everywhere	2024-12-17 14:00:43 -08:00
Ashwin Bharambe	fbca51d6da	Fix to conda env build script	2024-12-17 12:19:34 -08:00
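
Commit 3700022d6f (#649) serializes span attribute values to built-in types to avoid OpenTelemetry warnings. OpenTelemetry accepts only a few kinds of attribute values: `str`, `bool`, `int`, `float`, or homogeneous sequences of those. The sketch below shows one way such a conversion could work. It assumes that anything else should become a JSON string. The helper names `to_otel_attribute` and `sanitize_attributes` are hypothetical and are not taken from the llama-stack code.

```python
import json
from typing import Any, Dict

# Scalar types OpenTelemetry accepts directly as attribute values.
_PRIMITIVES = (str, bool, int, float)


def to_otel_attribute(value: Any) -> Any:
    """Hypothetical helper: coerce a value into an OTel-compatible attribute value."""
    if isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)) and value:
        first_type = type(value[0])
        # Non-empty sequences whose items all share one primitive type pass through as a list.
        if first_type in _PRIMITIVES and all(type(v) is first_type for v in value):
            return list(value)
    # Everything else (None, dicts, mixed or empty lists, custom objects) becomes a JSON
    # string; objects json cannot encode fall back to their str() representation.
    return json.dumps(value, default=str)


def sanitize_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: sanitize every value in an attribute dict."""
    return {key: to_otel_attribute(val) for key, val in attributes.items()}
```

The sanitized dict could then be passed to a span through the OpenTelemetry API's `span.set_attributes(...)`. JSON is only one possible fallback; a real implementation might prefer to drop unsupported values.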

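PR #620 moves meta reference model loading from server startup to model registration. Its metadata has two optional flags. `skip_initialize` registers a model without loading it. `llama_model` names the base model, so the identifier can refer to a finetuned checkpoint. The following is a minimal sketch of that control flow under stated assumptions. The class `LazyInferenceProvider` and the `load_fn(identifier, base_model)` loader signature are hypothetical. Loading on first use after `skip_initialize` is inferred from the PR's test notes, where inference succeeds on such a model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class LazyInferenceProvider:
    """Hypothetical provider: loads weights at registration time, not at server startup."""

    # Hypothetical loader: (model identifier, base model name) -> loaded model object.
    load_fn: Callable[[str, str], Any]
    base_models: Dict[str, str] = field(default_factory=dict)
    loaded: Dict[str, Any] = field(default_factory=dict)

    def register_model(self, identifier: str, metadata: Optional[Dict[str, Any]] = None) -> None:
        metadata = metadata or {}
        # 'llama_model' names the base architecture; default to the identifier itself.
        self.base_models[identifier] = metadata.get("llama_model", identifier)
        if not metadata.get("skip_initialize", False):
            # Default path: load immediately, as registering from run.yaml would.
            self.loaded[identifier] = self.load_fn(identifier, self.base_models[identifier])

    def get_model(self, identifier: str) -> Any:
        if identifier not in self.base_models:
            raise KeyError(f"model {identifier!r} is not registered")
        if identifier not in self.loaded:
            # Registered with skip_initialize: load on first use, then reuse.
            self.loaded[identifier] = self.load_fn(identifier, self.base_models[identifier])
        return self.loaded[identifier]
```

Passing both the identifier and the base model name to the loader lets it locate a finetuned checkpoint by identifier while building the architecture from the base model. This mirrors the two purposes the PR gives for `llama_model`.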

784 commits
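
The curl test plan in #639 can also be expressed from Python. The sketch below only builds the requests with the standard library's `urllib.request.Request` and does not send them. The base URL `http://localhost:5000`, the `/alpha/...` paths, the payloads, and the `X-LlamaStack-ProviderData` header are copied from that test plan and may have changed in later releases. The helper `build_request` is hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:5000"  # port used in the #639 test plan


def build_request(
    path: str,
    payload: Optional[Dict[str, Any]] = None,
    provider_data: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a JSON request to a Llama Stack server."""
    headers = {"Content-Type": "application/json"}
    if provider_data is not None:
        # Per-request provider credentials, as in the brave_search curl example.
        headers["X-LlamaStack-ProviderData"] = json.dumps(provider_data)
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# Register the MCP tool group from the test plan.
register_mcp = build_request("/alpha/toolgroups/register", {
    "tool_group_id": "simple_tool",
    "tool_group": {"type": "model_context_protocol",
                   "endpoint": {"uri": "http://localhost:56000/sse"}},
    "provider_id": "model-context-protocol",
})
# List registered tools.
list_tools = build_request("/alpha/tools/list")
# Invoke brave_search with an API key passed through provider data.
invoke_search = build_request(
    "/alpha/tool-runtime/invoke",
    {"tool_name": "brave_search", "args": {"query": "who is meta ceo"}},
    provider_data={"api_key": "<KEY>"},
)
```

Against a running server, any of these could be sent with `urllib.request.urlopen(list_tools)`. Keeping request construction separate from sending makes the payloads easy to inspect without a server.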