# What does this PR do?
This PR moves all print statements to use logging. Things changed:
- Had to add `await start_trace("sse_generator")` to server.py to
actually get tracing working. else was not seeing any logs
- If no telemetry provider is provided in the run.yaml, we will write to
stdout
- by default, the logs are going to be in JSON, but we expose an option
to configure to output in a human readable way.
# What does this PR do?
Fix fp8 quantization script.
## Test Plan
```
sh run_quantize_checkpoint.sh localhost fp8 /home/yll/fp8_test/ /home/yll/fp8_test/quantized_2 /home/yll/fp8_test/tokenizer.model 1 1
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
Co-authored-by: Yunlu Li <yll@meta.com>
# What does this PR do?
As the title says.
## Test Plan
This needs
8752149f58
to also land. So the next package (0.0.54) will make this work properly.
The test is:
```bash
pytest -v -s -m "llama_3b and meta_reference" test_model_registration.py
```
When running with dockers, the idea is that users be able to work purely
with the `llama stack` CLI. They should not need to know about the
existence of any YAMLs unless they need to. This PR enables it.
The docker command now doesn't need to volume mount a yaml and can
simply be:
```bash
docker run -v ~/.llama/:/root/.llama \
--env A=a --env B=b
```
## Test Plan
Check with conda first (no regressions):
```bash
LLAMA_STACK_DIR=. llama stack build --template ollama
llama stack run ollama --port 5001
# server starts up correctly
```
Check with docker
```bash
# build the docker
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type docker
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
docker run -it -p 5001:5001 \
-v ~/.llama:/root/.llama \
-v $PWD:/app/llama-stack-source \
localhost/distribution-ollama:dev \
--port 5001 \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
```
Note that volume mounting to `/app/llama-stack-source` is only needed
because we built the docker with uncommitted source code.
# What does this PR do?
Remove a check which skips provider registration if a resource is
already in stack registry. Since we do not reconcile state with
provider, register should always call into provider's register endpoint.
## Test Plan
```
# stack run
╰─❯ llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
#register memory bank
❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0
Memory Bank Configuration:
{
│ 'memory_bank_type': 'vector',
│ 'chunk_size_in_tokens': 512,
│ 'embedding_model': 'all-MiniLM-L6-v2',
│ 'overlap_size_in_tokens': 64
}
#register again
❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0
Memory Bank Configuration:
{
│ 'memory_bank_type': 'vector',
│ 'chunk_size_in_tokens': 512,
│ 'embedding_model': 'all-MiniLM-L6-v2',
│ 'overlap_size_in_tokens': 64
}
```
# What does this PR do?
The chroma provider maintains a cache but does not sync up with chroma
on a cold start. this change adds a fallback to read from chroma on a
cache miss.
## Test Plan
```bash
#start stack
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
# Add documents
PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000
No available shields. Disable safety.
Using model: Llama3.1-8B-Instruct
Created session_id=b951b14f-a9d2-43a3-8b80-d80114d58322 for Agent(0687a251-6906-4081-8d4c-f52e19db9dd7)
memory_retrieval> Retrieved context from banks: ['test_bank'].
====
Here are the retrieved documents for relevant context:
=== START-RETRIEVED-CONTEXT ===
id:num-1; content:_
the template from Llama2 to better support multiturn conversations. The same text
in the Lla...
>
inference> Based on the retrieved documentation, the top 5 topics that were explained are:
...............
# Kill stack
# Bootup stack
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
# Run a RAG app with just the agent flow. it discovers the previously added documents
No available shields. Disable safety.
Using model: Llama3.1-8B-Instruct
Created session_id=7a30c1a7-c87e-4787-936c-d0306589fe5d for Agent(b30420f3-c928-498a-887b-d084f0f3806c)
memory_retrieval> Retrieved context from banks: ['test_bank'].
====
Here are the retrieved documents for relevant context:
=== START-RETRIEVED-CONTEXT ===
id:num-1; content:_
the template from Llama2 to better support multiturn conversations. The same text
in the Lla...
>
inference> Based on the provided documentation, the top 5 topics that were explained are:
.....
```
# What does this PR do?
Add Tavily as a built-in search tool, in addition to Brave and Bing.
## Test Plan
It's tested using ollama remote, showing parity to the Brave search
tool.
- Install and run ollama with `ollama run llama3.1:8b-instruct-fp16`
- Build ollama distribution `llama stack build --template ollama
--image-type conda`
- Run ollama `stack run
/$USER/.llama/distributions/llamastack-ollama/ollama-run.yaml --port
5001`
- Client test command: `python - m
agents.test_agents.TestAgents.test_create_agent_turn_with_tavily_search`,
with enviroments:
MASTER_ADDR=0.0.0.0;MASTER_PORT=5001;RANK=0;REMOTE_STACK_HOST=0.0.0.0;REMOTE_STACK_PORT=5001;TAVILY_SEARCH_API_KEY=tvly-<YOUR-KEY>;WORLD_SIZE=1
Test passes on the specific case (ollama remote).
Server output:
```
Listening on ['::', '0.0.0.0']:5001
INFO: Started server process [7220]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
INFO: 127.0.0.1:65209 - "POST /agents/create HTTP/1.1" 200 OK
INFO: 127.0.0.1:65210 - "POST /agents/session/create HTTP/1.1" 200 OK
INFO: 127.0.0.1:65211 - "POST /agents/turn/create HTTP/1.1" 200 OK
role='user' content='What are the latest developments in quantum computing?' context=None
role='assistant' content='' stop_reason=<StopReason.end_of_turn: 'end_of_turn'> tool_calls=[ToolCall(call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'latest developments in quantum computing'})]
role='ipython' call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c' tool_name=<BuiltinTool.brave_search: 'brave_search'> content='{"query": "latest developments in quantum computing", "top_k": [{"title": "IBM Unveils 400 Qubit-Plus Quantum Processor and Next-Generation IBM ...", "url": "https://newsroom.ibm.com/2022-11-09-IBM-Unveils-400-Qubit-Plus-Quantum-Processor-and-Next-Generation-IBM-Quantum-System-Two", "content": "This system is targeted to be online by the end of 2023 and will be a building b...<more>...onnect large-scale ...", "url": "https://news.mit.edu/2023/quantum-interconnects-photon-emission-0105", "content": "Quantum computers hold the promise of performing certain tasks that are intractable even on the world\'s most powerful supercomputers. In the future, scientists anticipate using quantum computing to emulate materials systems, simulate quantum chemistry, and optimize hard tasks, with impacts potentially spanning finance to pharmaceuticals.", "score": 0.71721, "raw_content": null}]}'
Assistant: The latest developments in quantum computing include:
* IBM unveiling its 400 qubit-plus quantum processor and next-generation IBM Quantum System Two, which will be a building block of quantum-centric supercomputing.
* The development of utility-scale quantum computing, which can serve as a scientific tool to explore utility-scale classes of problems in chemistry, physics, and materials beyond brute force classical simulation of quantum mechanics.
* The introduction of advanced hardware across IBM's global fleet of 100+ qubit systems, as well as easy-to-use software that users and computational scientists can now obtain reliable results from quantum systems as they map increasingly larger and more complex problems to quantum circuits.
* Research on quantum repeaters, which use defects in diamond to interconnect quantum systems and could provide the foundation for scalable quantum networking.
* The development of a new source of quantum light, which could be used to improve the efficiency of quantum computers.
* The creation of a new mathematical "blueprint" that is accelerating fusion device development using Dyson maps.
* Research on canceling noise to improve quantum devices, with MIT researchers developing a protocol to extend the life of quantum coherence.
```
Verified with tool response. The final model response is updated with
the search requests.
## Sources
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
Co-authored-by: Martin Yuan <myuan@meta.com>
# What does this PR do?
Adds description at the end of successful download the optionally run
the verify md5 checksums command.
## Test Plan
<img width="2004" alt="Screenshot 2024-11-19 at 12 11 37 PM"
src="https://github.com/user-attachments/assets/8d617aef-99f5-4c3b-b93c-eff3e68289ea">
## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
---------
Co-authored-by: varunfb <vontimitta@devgpu004.eag5.facebook.com>
# What does this PR do?
adds a new method build_model_alias_with_just_llama_model which is
needed for cases like ollama's quantized models which do not really have
a repo in hf and an entry in SKU list.
## Test Plan
pytest -v -s -m "ollama"
llama_stack/providers/tests/inference/test_text_inference.py
---------
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
# What does this PR do?
Adds a `/alpha/` prefix to all the REST API urls.
Also makes them all use hyphens instead of underscores as is more
standard practice.
(This is based on feedback from our partners.)
## Test Plan
The Stack itself does not need updating. However, client SDKs and
documentation will need to be updated.
# What does this PR do?
- Fix issue w/ llama stack build using together template
<img width="669" alt="image"
src="https://github.com/user-attachments/assets/1cbef052-d902-40b9-98f8-37efb494d117">
- For builds from templates, copy over the
`templates/<template-name>/run.yaml` file to the
`~/.llama/distributions/<name>/<name>-run.yaml` instead of re-building
run config.
## Test Plan
```
$ llama stack build --template together --image-type conda
..
Build spec configuration saved at /opt/anaconda3/envs/llamastack-together/together-build.yaml
Build Successful! Next steps:
1. Set the environment variables: LLAMASTACK_PORT, TOGETHER_API_KEY
2. `llama stack run /Users/xiyan/.llama/distributions/llamastack-together/together-run.yaml`
```
```
$ llama stack run /Users/xiyan/.llama/distributions/llamastack-together/together-run.yaml
```
```
$ llama-stack-client models list
$ pytest -v -s -m remote agents/test_agents.py --env REMOTE_STACK_URL=http://localhost:5000 --inference-model meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
```
<img width="764" alt="image"
src="https://github.com/user-attachments/assets/b805b6c5-a316-4561-8fe3-24fc3b1f8b80">
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
remove another model_ pydantic namespace warning and convert old-style
'class Config' to new-style 'model_config' workaround.
also a whitespace change to get past -
flake8...................................................................Failed
llama_stack/cli/download.py:296:85: E226 missing whitespace around
arithmetic operator
llama_stack/cli/download.py:297:54: E226 missing whitespace around
arithmetic operator
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
add more quantized model support for ollama.
- [ ] Addresses issue (#issue)
## Test Plan
Tested with ollama docker that run llama3.2 3b 4bit model.
```
root@docker-desktop:/# ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.2:3b a80c4f17acd5 3.5 GB 100% CPU 3 minutes from now
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
This PR adds a method in stack to return the stackrunconfig object based
on the template name. This will be used to instantiate a direct client
without the need for an explicit run.yaml
---------
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>