llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
Ashwin Bharambe	d790be28b3	Don't skip meta-reference for the tests	2024-11-21 13:29:53 -08:00
Xi Yan	654722da7d	fix model id for llm_as_judge_405b	2024-11-21 11:34:49 -08:00
Dinesh Yeduguru	6395dadc2b	use logging instead of prints (#499) # What does this PR do? This PR moves all print statements to use logging. Things changed: - Had to add `await start_trace("sse_generator")` to server.py to actually get tracing working; otherwise no logs were showing up. - If no telemetry provider is provided in the run.yaml, we will write to stdout. - By default, the logs are going to be in JSON, but we expose an option to output them in a human-readable format.	2024-11-21 11:32:53 -08:00
liyunlu0618	4e1105e563	Fix fp8 quantization script. (#500 ) # What does this PR do? Fix fp8 quantization script. ## Test Plan ``` sh run_quantize_checkpoint.sh localhost fp8 /home/yll/fp8_test/ /home/yll/fp8_test/quantized_2 /home/yll/fp8_test/tokenizer.model 1 1 ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [x] Updated relevant documentation. - [x] Wrote necessary unit or integration tests. Co-authored-by: Yunlu Li <yll@meta.com>	2024-11-21 09:15:28 -08:00
Ashwin Bharambe	cd6ccb664c	Integrate distro docs into the restructured docs	2024-11-20 23:20:05 -08:00
Ashwin Bharambe	2411a44833	Update more distribution docs to be simpler and partially codegen'ed	2024-11-20 22:03:44 -08:00
Ashwin Bharambe	e84d4436b5	Since we are pushing for HF repos, we should accept them in inference configs (#497 ) # What does this PR do? As the title says. ## Test Plan This needs `8752149f58` to also land. So the next package (0.0.54) will make this work properly. The test is: ```bash pytest -v -s -m "llama_3b and meta_reference" test_model_registration.py ```	2024-11-20 16:14:37 -08:00
Ashwin Bharambe	068ac00a3b	Don't depend on templates.py when printing llama stack build messages (#496)	2024-11-20 15:44:49 -08:00
Ashwin Bharambe	00816cc8ef	make sure codegen doesn't cause spurious diffs for no reason	2024-11-20 13:56:30 -08:00
Ashwin Bharambe	681322731b	Make run yaml optional so dockers can start with just --env (#492 ) When running with dockers, the idea is that users be able to work purely with the `llama stack` CLI. They should not need to know about the existence of any YAMLs unless they need to. This PR enables it. The docker command now doesn't need to volume mount a yaml and can simply be: ```bash docker run -v ~/.llama/:/root/.llama \ --env A=a --env B=b ``` ## Test Plan Check with conda first (no regressions): ```bash LLAMA_STACK_DIR=. llama stack build --template ollama llama stack run ollama --port 5001 # server starts up correctly ``` Check with docker ```bash # build the docker LLAMA_STACK_DIR=. llama stack build --template ollama --image-type docker export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" docker run -it -p 5001:5001 \ -v ~/.llama:/root/.llama \ -v $PWD:/app/llama-stack-source \ localhost/distribution-ollama:dev \ --port 5001 \ --env INFERENCE_MODEL=$INFERENCE_MODEL \ --env OLLAMA_URL=http://host.docker.internal:11434 ``` Note that volume mounting to `/app/llama-stack-source` is only needed because we built the docker with uncommitted source code.	2024-11-20 13:11:40 -08:00
Dinesh Yeduguru	1d8d0593af	register with provider even if present in stack (#491 ) # What does this PR do? Remove a check which skips provider registration if a resource is already in stack registry. Since we do not reconcile state with provider, register should always call into provider's register endpoint. ## Test Plan ``` # stack run ╰─❯ llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml #register memory bank ❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0 Memory Bank Configuration: { │ 'memory_bank_type': 'vector', │ 'chunk_size_in_tokens': 512, │ 'embedding_model': 'all-MiniLM-L6-v2', │ 'overlap_size_in_tokens': 64 } #register again ❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0 Memory Bank Configuration: { │ 'memory_bank_type': 'vector', │ 'chunk_size_in_tokens': 512, │ 'embedding_model': 'all-MiniLM-L6-v2', │ 'overlap_size_in_tokens': 64 } ```	2024-11-20 11:05:50 -08:00
Dinesh Yeduguru	91e7efbc91	fall back to read from chroma/pgvector when not in cache (#489) # What does this PR do? The chroma provider maintains a cache but does not sync up with chroma on a cold start. This change adds a fallback to read from chroma on a cache miss. ## Test Plan ```bash #start stack llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml # Add documents PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000 No available shields. Disable safety. Using model: Llama3.1-8B-Instruct Created session_id=b951b14f-a9d2-43a3-8b80-d80114d58322 for Agent(0687a251-6906-4081-8d4c-f52e19db9dd7) memory_retrieval> Retrieved context from banks: ['test_bank']. ==== Here are the retrieved documents for relevant context: === START-RETRIEVED-CONTEXT === id:num-1; content:_ the template from Llama2 to better support multiturn conversations. The same text in the Lla... > inference> Based on the retrieved documentation, the top 5 topics that were explained are: ............... # Kill stack # Bootup stack llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml # Run a RAG app with just the agent flow. it discovers the previously added documents No available shields. Disable safety. Using model: Llama3.1-8B-Instruct Created session_id=7a30c1a7-c87e-4787-936c-d0306589fe5d for Agent(b30420f3-c928-498a-887b-d084f0f3806c) memory_retrieval> Retrieved context from banks: ['test_bank']. ==== Here are the retrieved documents for relevant context: === START-RETRIEVED-CONTEXT === id:num-1; content:_ the template from Llama2 to better support multiturn conversations. The same text in the Lla... > inference> Based on the provided documentation, the top 5 topics that were explained are: ..... ```	2024-11-20 10:30:23 -08:00
Ashwin Bharambe	89f5093dfc	Fix tgi doc	2024-11-19 21:06:11 -08:00
Mengtao Yuan	1086b500f9	Support Tavily as built-in search tool. (#485) # What does this PR do? Add Tavily as a built-in search tool, in addition to Brave and Bing. ## Test Plan It's tested using ollama remote, showing parity to the Brave search tool. - Install and run ollama with `ollama run llama3.1:8b-instruct-fp16` - Build ollama distribution `llama stack build --template ollama --image-type conda` - Run ollama `stack run /$USER/.llama/distributions/llamastack-ollama/ollama-run.yaml --port 5001` - Client test command: `python -m agents.test_agents.TestAgents.test_create_agent_turn_with_tavily_search`, with environment variables: MASTER_ADDR=0.0.0.0;MASTER_PORT=5001;RANK=0;REMOTE_STACK_HOST=0.0.0.0;REMOTE_STACK_PORT=5001;TAVILY_SEARCH_API_KEY=tvly-<YOUR-KEY>;WORLD_SIZE=1 Test passes on the specific case (ollama remote). Server output: ``` Listening on ['::', '0.0.0.0']:5001 INFO: Started server process [7220] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit) INFO: 127.0.0.1:65209 - "POST /agents/create HTTP/1.1" 200 OK INFO: 127.0.0.1:65210 - "POST /agents/session/create HTTP/1.1" 200 OK INFO: 127.0.0.1:65211 - "POST /agents/turn/create HTTP/1.1" 200 OK role='user' content='What are the latest developments in quantum computing?' 
context=None role='assistant' content='' stop_reason=<StopReason.end_of_turn: 'end_of_turn'> tool_calls=[ToolCall(call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'latest developments in quantum computing'})] role='ipython' call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c' tool_name=<BuiltinTool.brave_search: 'brave_search'> content='{"query": "latest developments in quantum computing", "top_k": [{"title": "IBM Unveils 400 Qubit-Plus Quantum Processor and Next-Generation IBM ...", "url": "https://newsroom.ibm.com/2022-11-09-IBM-Unveils-400-Qubit-Plus-Quantum-Processor-and-Next-Generation-IBM-Quantum-System-Two", "content": "This system is targeted to be online by the end of 2023 and will be a building b...<more>...onnect large-scale ...", "url": "https://news.mit.edu/2023/quantum-interconnects-photon-emission-0105", "content": "Quantum computers hold the promise of performing certain tasks that are intractable even on the world\'s most powerful supercomputers. In the future, scientists anticipate using quantum computing to emulate materials systems, simulate quantum chemistry, and optimize hard tasks, with impacts potentially spanning finance to pharmaceuticals.", "score": 0.71721, "raw_content": null}]}' Assistant: The latest developments in quantum computing include: * IBM unveiling its 400 qubit-plus quantum processor and next-generation IBM Quantum System Two, which will be a building block of quantum-centric supercomputing. * The development of utility-scale quantum computing, which can serve as a scientific tool to explore utility-scale classes of problems in chemistry, physics, and materials beyond brute force classical simulation of quantum mechanics. 
* The introduction of advanced hardware across IBM's global fleet of 100+ qubit systems, as well as easy-to-use software that users and computational scientists can now obtain reliable results from quantum systems as they map increasingly larger and more complex problems to quantum circuits. * Research on quantum repeaters, which use defects in diamond to interconnect quantum systems and could provide the foundation for scalable quantum networking. * The development of a new source of quantum light, which could be used to improve the efficiency of quantum computers. * The creation of a new mathematical "blueprint" that is accelerating fusion device development using Dyson maps. * Research on canceling noise to improve quantum devices, with MIT researchers developing a protocol to extend the life of quantum coherence. ``` Verified with tool response. The final model response is updated with the search requests. ## Sources ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [x] Updated relevant documentation. - [x] Wrote necessary unit or integration tests. Co-authored-by: Martin Yuan <myuan@meta.com>	2024-11-19 20:59:02 -08:00
varunfb	08be023290	Added optional md5 validate command once download is completed (#486) # What does this PR do? Adds a message at the end of a successful download describing how to optionally run the command that verifies md5 checksums. ## Test Plan <img width="2004" alt="Screenshot 2024-11-19 at 12 11 37 PM" src="https://github.com/user-attachments/assets/8d617aef-99f5-4c3b-b93c-eff3e68289ea"> ## Before submitting - [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [x] Updated relevant documentation. - [x] Wrote necessary unit or integration tests. --------- Co-authored-by: varunfb <vontimitta@devgpu004.eag5.facebook.com>	2024-11-19 17:42:43 -08:00
Ashwin Bharambe	e605d57fb7	use API version in "remote" stack client	2024-11-19 15:59:47 -08:00
Ashwin Bharambe	7bfcfe80b5	Add logs (prints :/) to dump out what URL vllm / tgi is connecting to	2024-11-19 15:50:26 -08:00
Ashwin Bharambe	887ccc2143	Ensure llama-stack-client is installed in the container with TEST_PYPI	2024-11-19 15:21:10 -08:00
Xi Yan	2da93c8835	fix 3.2-1b fireworks	2024-11-19 14:20:07 -08:00
Xi Yan	189df6358a	codegen docs	2024-11-19 14:16:00 -08:00
Xi Yan	185df4b568	fix fireworks registration	2024-11-19 14:09:00 -08:00
Ashwin Bharambe	38ba3b9f0c	Fix fireworks stream completion	2024-11-19 13:36:14 -08:00
Ashwin Bharambe	05d1ead02f	Update condition in tests to handle llama-3.1 vs llama3.1 (HF names)	2024-11-19 13:25:36 -08:00
Ashwin Bharambe	394519d68a	Add llama-stack-client as a legitimate dependency for llama-stack	2024-11-19 11:44:35 -08:00
Ashwin Bharambe	c46b462c22	Updates to docker build script	2024-11-19 11:36:53 -08:00
Ashwin Bharambe	1619d37cc6	codegen per-distro dependencies; not hooked into setup.py yet	2024-11-19 09:54:30 -08:00
Ashwin Bharambe	5e4ac1b7c1	Make sure server code uses version prefixed routes	2024-11-19 09:15:05 -08:00
Ashwin Bharambe	84d5f35a48	Update the model alias for llama guard models in ollama	2024-11-19 00:22:24 -08:00
Ashwin Bharambe	e8d3eee095	Fix docs yet again	2024-11-18 23:51:35 -08:00
Dinesh Yeduguru	02f1c47416	support adding alias for models without hf repo/sku entry (#481) # What does this PR do? Adds a new method `build_model_alias_with_just_llama_model`, which is needed for cases like ollama's quantized models that have neither a repo in HF nor an entry in the SKU list. ## Test Plan pytest -v -s -m "ollama" llama_stack/providers/tests/inference/test_text_inference.py --------- Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>	2024-11-18 23:50:18 -08:00
Ashwin Bharambe	8ed79ad0f3	Fix the pyopenapi generator to avoid potential circular imports	2024-11-18 23:37:52 -08:00
Ashwin Bharambe	d463d68e1e	Update docs	2024-11-18 23:21:25 -08:00
Ashwin Bharambe	0dc7f5fa89	Add version to REST API url (#478 ) # What does this PR do? Adds a `/alpha/` prefix to all the REST API urls. Also makes them all use hyphens instead of underscores as is more standard practice. (This is based on feedback from our partners.) ## Test Plan The Stack itself does not need updating. However, client SDKs and documentation will need to be updated.	2024-11-18 22:44:14 -08:00
Xi Yan	05e93bd2f7	together default	2024-11-18 22:39:45 -08:00
Ashwin Bharambe	7693786322	Use HF names for registering fireworks and together models	2024-11-18 22:34:47 -08:00
Xi Yan	6765fd76ff	fix llama stack build for together & llama stack build from templates (#479 ) # What does this PR do? - Fix issue w/ llama stack build using together template <img width="669" alt="image" src="https://github.com/user-attachments/assets/1cbef052-d902-40b9-98f8-37efb494d117"> - For builds from templates, copy over the `templates/<template-name>/run.yaml` file to the `~/.llama/distributions/<name>/<name>-run.yaml` instead of re-building run config. ## Test Plan ``` $ llama stack build --template together --image-type conda .. Build spec configuration saved at /opt/anaconda3/envs/llamastack-together/together-build.yaml Build Successful! Next steps: 1. Set the environment variables: LLAMASTACK_PORT, TOGETHER_API_KEY 2. `llama stack run /Users/xiyan/.llama/distributions/llamastack-together/together-run.yaml` ``` ``` $ llama stack run /Users/xiyan/.llama/distributions/llamastack-together/together-run.yaml ``` ``` $ llama-stack-client models list $ pytest -v -s -m remote agents/test_agents.py --env REMOTE_STACK_URL=http://localhost:5000 --inference-model meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo ``` <img width="764" alt="image" src="https://github.com/user-attachments/assets/b805b6c5-a316-4561-8fe3-24fc3b1f8b80"> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-11-18 22:29:16 -08:00
Ashwin Bharambe	ea52a3ee1c	minor enhancement for test fixtures	2024-11-18 22:21:17 -08:00
Matthew Farrellee	fcc2132e6f	remove pydantic namespace warnings using model_config (#470 ) # What does this PR do? remove another model_ pydantic namespace warning and convert old-style 'class Config' to new-style 'model_config' workaround. also a whitespace change to get past - flake8...................................................................Failed llama_stack/cli/download.py:296:85: E226 missing whitespace around arithmetic operator llama_stack/cli/download.py:297:54: E226 missing whitespace around arithmetic operator ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [x] Wrote necessary unit or integration tests.	2024-11-18 19:24:14 -08:00
Kai Wu	d2b7c5aeae	add quantized model ollama support (#471) # What does this PR do? Adds support for more quantized models in ollama. - [ ] Addresses issue (#issue) ## Test Plan Tested with an ollama docker container running the llama3.2 3b 4-bit model. ``` root@docker-desktop:/# ollama ps NAME ID SIZE PROCESSOR UNTIL llama3.2:3b a80c4f17acd5 3.5 GB 100% CPU 3 minutes from now ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-11-18 18:55:23 -08:00
Dinesh Yeduguru	fe19076838	get stack run config based on template name (#477 ) This PR adds a method in stack to return the stackrunconfig object based on the template name. This will be used to instantiate a direct client without the need for an explicit run.yaml --------- Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>	2024-11-18 18:05:05 -08:00
Xi Yan	50d539e6d7	update tests --inference-model to hf id	2024-11-18 17:36:58 -08:00
Ashwin Bharambe	939056e265	More documentation fixes	2024-11-18 17:06:13 -08:00
Ashwin Bharambe	e40404625b	Update to docs	2024-11-18 16:52:48 -08:00
Ashwin Bharambe	91f3009c67	No more built_at	2024-11-18 16:38:51 -08:00
Ashwin Bharambe	afa4f0b19f	Update remote vllm docs	2024-11-18 16:34:33 -08:00
Ashwin Bharambe	fb15ff4a97	Move to use argparse, fix issues with multiple --env cmdline options	2024-11-18 16:31:59 -08:00
Ashwin Bharambe	b87f3ac499	Allow server to accept --env key pairs	2024-11-18 16:17:59 -08:00
Ashwin Bharambe	1fb61137ad	Add conda_env	2024-11-18 16:08:14 -08:00
Ashwin Bharambe	b822149098	Update start conda	2024-11-18 16:07:27 -08:00
Ashwin Bharambe	47c37fd831	Fixes	2024-11-18 16:03:53 -08:00


527 commits