# Context
For test automation, the end goal is to run a single pytest command from
the root test directory (llama_stack/providers/tests/.) such that we
execute the push-blocking tests.
The work plan:
1) trigger pytest from llama_stack/providers/tests/.
2) use a config file to determine which tests and parametrizations we
want to run
# What does this PR do?
1) consolidates the "inference-models" / "embedding-model" /
"judge-model" ... options in the root conftest.py. Without this change,
we will hit an error when trying to run `pytest
/Users/sxyi/llama-stack/llama_stack/providers/tests/.` because of
duplicated `addoption` definitions across child conftest files.
2) adds a `config` option to specify the test config in YAML (see
[`ci_test_config.yaml`](https://gist.github.com/sixianyi0721/5b37fbce4069139445c2f06f6e42f87e)
for an example config file).
For `provider_fixtures`, users can either use a default fixture
combination or define their own {api: provider} combinations:
```
memory:
  ....
  fixtures:
    provider_fixtures:
      # use the default fixture combination with param_id="ollama" defined in
      # [providers/tests/memory/conftest.py](https://fburl.com/mtjzwsmk)
      - default_fixture_param_id: ollama
      - inference: sentence_transformers
        memory: faiss
      - default_fixture_param_id: chroma
```
3) generates tests according to the config. The logic lives in two
places (see the sketch below):
a) in `{api}/conftest.py::pytest_generate_tests`, we read from the
config to do parametrization.
b) after test collection, in `pytest_collection_modifyitems`, we filter
the tests to include only the functions listed in the config.
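A minimal sketch of how those two hooks could be wired up (helper and key names such as `load_test_config`, `inference_fixtures`, and `functions` are illustrative assumptions, not the actual conftest.py implementation):
```python
# Illustrative sketch only -- helper and config key names are assumptions,
# not the actual conftest.py implementation.
import yaml


def load_test_config(path):
    """Load the YAML test config if one was provided via --config."""
    if not path:
        return None
    with open(path) as f:
        return yaml.safe_load(f)


def pytest_generate_tests(metafunc):
    config = load_test_config(metafunc.config.getoption("--config"))
    if config and "inference_fixtures" in metafunc.fixturenames:
        # Parametrize fixture combinations according to the config.
        metafunc.parametrize(
            "inference_fixtures", config["inference_fixtures"], indirect=True
        )


def pytest_collection_modifyitems(config, items):
    test_config = load_test_config(config.getoption("--config"))
    if not test_config:
        return
    allowed = set(test_config["functions"])
    # Keep only the test functions explicitly listed in the config.
    items[:] = [item for item in items if item.originalname in allowed]
```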
## Test Plan
1) `pytest /Users/sxyi/llama-stack/llama_stack/providers/tests/.
--collect-only --config=ci_test_config.yaml`
Using the `--collect-only` flag to print the tests listed in the config
file (`ci_test_config.yaml`).
output:
[gist](https://gist.github.com/sixianyi0721/05145e60d4d085c17cfb304beeb1e60e)
2) sanity check on `--inference-model` option
```
pytest -v -s -k "ollama" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
PR #639 introduced the notion of a Tools API and the ability to invoke
tools through the API just like any other resource. This PR changes the
Agents to start using the Tools API to invoke tools. Major changes include:
1) Ability to specify tool groups with AgentConfig
2) The agent gets the corresponding tool definitions for the specified
tools and passes them along to the model
3) Attachments are now named Documents and their behavior is mostly
unchanged from the user's perspective
4) You can specify args that are injected into a tool call through the
agent config. This is especially useful in the case of the memory tool,
where you want the tool to operate on a specific memory bank (see the
sketch after this list).
5) You can also register tool groups with args, which lets the agent
inject these into the tool call as well.
6) All tests, including client SDK tests, have been migrated to use the
new Tools API and fixtures
7) Telemetry just works with the Tools API because of our trace protocol
decorator
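For illustration, a rough sketch of an agent configuration that uses tool groups and injected args (the field names below are assumptions based on the description above, not the exact `AgentConfig` schema):
```python
# Illustrative only: field names (`toolgroups`, `name`, `args`) are assumptions
# based on the PR description, not the exact AgentConfig schema.
agent_config = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "instructions": "You are a helpful assistant.",
    # Tool groups the agent may invoke; the second entry also injects args so
    # the memory tool operates on a specific memory bank.
    "toolgroups": [
        "builtin::websearch",
        {"name": "builtin::memory", "args": {"memory_bank_id": "my_docs_bank"}},
    ],
}
```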
## Test Plan
```
pytest -s -v -k fireworks llama_stack/providers/tests/agents/test_agents.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
pytest -s -v -k together llama_stack/providers/tests/tools/test_tools.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py
```
run.yaml:
https://gist.github.com/dineshyv/0365845ad325e1c2cab755788ccc5994
Notebook:
https://colab.research.google.com/drive/1ck7hXQxRl6UvT-ijNRZ-gMZxH1G3cN2d?usp=sharing
# What does this PR do?
This is a long-pending change and particularly important to get done
now.
Specifically:
- we cannot "localize" (aka download) any URLs from media attachments
anywhere near our modeling code. It must be done within llama-stack.
- `PIL.Image` is infesting all our APIs via `ImageMedia ->
InterleavedTextMedia` and that cannot be right at all. Anything in the
API surface must be "naturally serializable". We need a standard `{
type: "image", image_url: "<...>" }`, which is more extensible (see the
sketch below).
- `UserMessage`, `SystemMessage`, etc. are moved completely to
llama-stack from the llama-models repository.
See https://github.com/meta-llama/llama-models/pull/244 for the
corresponding PR in llama-models.
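For illustration, a minimal sketch of the kind of naturally serializable message content this moves us towards (field names are an assumption based on the description above, not the exact new type definitions):
```python
# Illustrative only: the exact content-item field names in the new API may differ.
user_message = {
    "role": "user",
    "content": [
        {"type": "image", "image_url": "https://example.com/cat.png"},
        {"type": "text", "text": "What is in this image?"},
    ],
}
```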
## Test Plan
```bash
cd llama_stack/providers/tests
pytest -s -v -k "fireworks or ollama or together" inference/test_vision_inference.py
pytest -s -v -k "(fireworks or ollama or together) and llama_3b" inference/test_text_inference.py
pytest -s -v -k chroma memory/test_memory.py \
--env EMBEDDING_DIMENSION=384 --env CHROMA_DB_PATH=/tmp/foobar
pytest -s -v -k fireworks agents/test_agents.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
```
Updated the client SDK (see PR ...), installed the SDK in the same
environment, and then ran the SDK tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=together pytest -s -v agents/test_agents.py
LLAMA_STACK_CONFIG=ollama pytest -s -v memory/test_memory.py
# this one needed a bit of hacking in the run.yaml to ensure I could register the vision model correctly
INFERENCE_MODEL=llama3.2-vision:latest LLAMA_STACK_CONFIG=ollama pytest -s -v inference/test_inference.py
```
# What does this PR do?
Adds the sentence transformer provider and the `all-MiniLM-L6-v2`
embedding model to the default models to register in the run.yaml for
all providers.
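For reference, a minimal sketch of generating embeddings with this model via the sentence-transformers library directly (not the provider's actual code path):
```python
# Standalone sketch using the sentence-transformers library directly;
# the provider's actual implementation may differ.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["What is the capital of France?"])
print(embeddings.shape)  # (1, 384) -- 384-dimensional embeddings
```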
## Test Plan
```bash
llama stack build --template together --image-type conda
llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml
```
This PR does the following:
1) Adds the ability to generate embeddings in all supported inference
providers.
2) Moves all the memory providers to use the inference API and improves
the memory tests to set up the inference stack correctly and use the
embedding models.
This is a merge of #589 and #598.
The same code is used (inside providers/remote/memory/chroma/chroma.py),
but it is driven by separate configurations that determine which Chroma
client to use. Note that the dependencies are separate
(`chromadb-client` vs `chromadb` -- the latter is a _much_ heavier
package).
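A minimal sketch of the local vs. remote client distinction the configuration drives (not the provider's actual implementation; the constructor names follow the public chromadb API):
```python
# Sketch of the local vs. remote Chroma client choice described above;
# not the provider's actual implementation.
import chromadb


def make_chroma_client(db_path: str | None = None, url: str | None = None):
    if db_path:
        # Inline/local mode: needs the full `chromadb` package.
        return chromadb.PersistentClient(path=db_path)
    # Remote mode: the lightweight `chromadb-client` package suffices.
    host, port = url.removeprefix("http://").split(":")
    return chromadb.HttpClient(host=host, port=int(port))
```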
```
pytest -s -v -m chroma memory/test_memory.py --env CHROMA_DB_PATH=/tmp/chroma_test
pytest -s -v -m chroma memory/test_memory.py --env CHROMA_URL=http://localhost:6001
```
# What does this PR do?
Addresses issue #342
- PDFs uploaded from a URL were being loaded into the vector DB as raw
bytes
- Instead, this PR extracts text from the PDF if the mime_type is
"application/pdf" (see the sketch below)
- Adds tests to cover the new cases
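A minimal sketch of the kind of extraction logic this adds, using pypdf (the function name is hypothetical; the PR's actual code may be structured differently):
```python
# Hypothetical sketch of the behavior described above, using pypdf;
# not the PR's exact implementation.
import io
from pypdf import PdfReader


def content_from_data(data: bytes, mime_type: str) -> str:
    """Return extracted text for PDFs instead of raw bytes."""
    if mime_type == "application/pdf":
        reader = PdfReader(io.BytesIO(data))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    # Fall back to treating the payload as UTF-8 text.
    return data.decode("utf-8", errors="ignore")
```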
## Test Plan
Ran these unit tests:
```bash
llama stack build --template meta-reference-gpu --image-type conda
conda activate llamastack-meta-reference-gpu
pip install pytest pytest-asyncio pypdf
pytest llama_stack/providers/tests/memory/test_vector_store.py -v
```
```
platform linux -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 -- /home/ubuntu/1xa100-2/llama-stack/envs/bin/python
cachedir: .pytest_cache
rootdir: /home/ubuntu/1xa100-2/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0, httpx-0.35.0
asyncio: mode=strict, default_loop_scope=None
collected 3 items
llama_stack/providers/tests/memory/test_vector_store.py::TestVectorStore::test_returns_content_from_pdf_data_uri PASSED [ 33%]
llama_stack/providers/tests/memory/test_vector_store.py::TestVectorStore::test_downloads_pdf_and_returns_content PASSED [ 66%]
llama_stack/providers/tests/memory/test_vector_store.py::TestVectorStore::test_downloads_pdf_and_returns_content_with_url_object PASSED [100%]
======================================================= 3 passed, 1 warning in 0.62s =======================================================
```
Tested manually via [this
script](afc8f8bebf/init.py)
to initialize and [this
script](afc8f8bebf/query.py)
to query
```bash
# Ran with meta-reference-gpu with safety
llama stack build --template meta-reference-gpu --image-type conda && llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
--port 5001 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-11B-Vision-Instruct
# Run init.py script
wget https://raw.githubusercontent.com/aidando73/llama-stack/afc8f8bebf70e1ad065d87e84692e1a3a45d9e19/init.py
pip install httpx==0.27.2 # Due to issue https://github.com/meta-llama/llama-stack-client-python/issues/54
python init.py
# Run query.py script
wget https://raw.githubusercontent.com/aidando73/llama-stack/afc8f8bebf70e1ad065d87e84692e1a3a45d9e19/query.py
python query.py
```
Should output valid text chunks
```
Chunk(content=' that it has a significantly\nlower violation rate than the competing standalone open source model, trading off a higher false refusal rate.\nLong-context safety. Long-context models are vulnerable to many-shot jailbreaking attacks without targeted\nmitigation (Anil et al., 2024). To address this, we finetune our models on SFT datasets that include examples\nof safe behavior in the presence of demonstrations of unsafe behavior in context. We develop a scalable\nmitigation strategy that significantly reduces VR, effectively neutralizing the impact of longer context attacks\neven for 256-shot attacks. This approach shows little to no impact on FRR and most helpfulness metrics.\nTo quantify the effectiveness of our long context safety mitigations, we use two additional benchmarking\nmethods: DocQA and Many-shot. For DocQA, short for “document question answering,” we use long documents\nwith information that could be utilized in adversarial ways. Models are provided both the document and a set\nof prompts related to the document in order to test whether the questions being related to information in the\ndocument affected the model’s ability to respond safely to the prompts. For Many-shot, following Anil et al.\n(2024), we construct a synthetic chat history composed of unsafe prompt-response pairs. A final prompt,\nunrelated to previous messages, is used to test whether the unsafe behavior in-context influenced the model\n45\nto response unsafely. The violation and false refusal rates for both DocQA and Many-shot are shown in\nFigure 20. We see that Llama 405B (with and without Llama Guard) is Pareto-better than the Comp. 2\nsystem across both violation rates and false refusal rates, across both DocQA and Many-shot. Relative to\nComp. 1, we find that Llama 405B is significantly safer, while coming at a trade off on false refusal.\nTool usage safety. The diversity of possible tools and the implementation of the tool usage call and integration\ninto the model make tool usage a challenging capability to fully mitigate (Wallace et al., 2024). We focus on\nthe search usecase. Violation and false refusal rates are shown in Figure 20. We tested against the Comp. 1\nsystem, where we find that Llama 405B is significantly safer, though has a slightly higher false refusal rate.\n5.4.5 Cybersecurity and Chemical/Biological Weapons Safety\nCyberSecurity evaluation results. To evaluate cybersecurity risk, we leverage the Cyber', document_id='num-0', token_count=512)0.7354530813978312
Chunk(content='.\nThrough careful ablations, we observe that mixing0.1% of synthetically generated long-context data with the\noriginal short-context data optimizes the performance across both short-context and long-context benchmarks.\nDPO. We observe that using only short context training data in DPO did not negatively impact long-context\nperformance as long as the SFT model is high quality in long context tasks. We suspect this is due to the\nfact that our DPO recipe has fewer optimizer steps than SFT. Given this finding, we keep the standard\nshort-context recipe for DPO on top of our long-context SFT checkpoints.\n4.3.5 Tool Use\nTeaching LLMs to use tools such as search engines or code interpreters hugely expands the range of tasks\nthey can solve, transforming them from pure chat models into more general assistants (Nakano et al., 2021;\nThoppilan et al., 2022; Parisi et al., 2022; Gao et al., 2023; Mialon et al., 2023a; Schick et al., 2024). We train\nLlama 3 to interact with the following tools:\n• Search engine. Llama 3 is trained to use Brave Search7 to answer questions about recent events that go\nbeyond its knowledge cutoff or that require retrieving a particular piece of information from the web.\n• Python interpreter. Llama 3 can generate and execute code to perform complex computations, read files\nuploaded by the user and solve tasks based on them such as question answering, summarization, data\nanalysis or visualization.\n7https://brave.com/search/api/\n24\n• Mathematical computational engine. Llama 3 can use the Wolfram Alpha API8 to more accurately solve\nmath, science problems, or retrieve accurate information from Wolfram’s database.\nThe resulting model is able to use these tools in a chat setup to solve the user’s queries, including in multi-turn\ndialogs. If a query requires multiple tool calls, the model can write a step-by-step plan, call the tools in\nsequence, and do reasoning after each tool call.\nWe also improve Llama 3’s zero-shot tool use capabilities — given in-context, potentially unseen tool definitions\nand a user query, we train the model to generate the correct tool call.\nImplementation. We implement our core tools as Python objects with different methods. Zero-shot tools can\nbe implemented as Python functions with descriptions, documentation (i.e., examples for', document_id='num-0', token_count=512)0.7350672465928054
Chunk(content=' Embeddings RoPE (θ = 500, 000)\nTable 3 Overview of the key hyperparameters of Llama 3. We display settings for 8B, 70B, and 405B language models.\n• We use a vocabulary with 128K tokens. Our token vocabulary combines 100K tokens from thetiktoken3\ntokenizer with 28K additional tokens to better support non-English languages. Compared to the Llama\n2 tokenizer, our new tokenizer improves compression rates on a sample of English data from 3.17 to\n3.94 characters per token. This enables the model to “read” more text for the same amount of training\ncompute. We also found that adding 28K tokens from select non-English languages improved both\ncompression ratios and downstream performance, with no impact on English tokenization.\n• We increase the RoPE base frequency hyperparameter to 500,000. This enables us to better support\nlonger contexts; Xiong et al. (2023) showed this value to be effective for context lengths up to 32,768.\nLlama 3 405B uses an architecture with 126 layers, a token representation dimension of 16,384, and 128\nattention heads; see Table 3 for details. This leads to a model size that is approximately compute-optimal\naccording to scaling laws on our data for our training budget of3.8 × 1025 FLOPs.\n3.2.1 Scaling Laws\nWe develop scaling laws (Hoffmann et al., 2022; Kaplan et al., 2020) to determine the optimal model size for\nour flagship model given our pre-training compute budget. In addition to determining the optimal model size,\na major challenge is to forecast the flagship model’s performance on downstream benchmark tasks, due to a\ncouple of issues: (1) Existing scaling laws typically predict only next-token prediction loss rather than specific\nbenchmark performance. (2) Scaling laws can be noisy and unreliable because they are developed based on\npre-training runs conducted with small compute budgets (Wei et al., 2022b).\nTo address these challenges, we implement a two-stage methodology to develop scaling laws that accurately\npredict downstream benchmark performance:\n1. We first establish a correlation between the compute-optimal model’s negative log-likelihood on down-\nstream tasks and the training FLOPs.\n2. Next, we correlate the negative log-likelihood on downstream tasks with task accuracy, utilizing both', document_id='num-0', token_count=512)0.7172908346230037
```
## Before submitting
- [x] N/A - This PR fixes a typo or improves the docs (you can dismiss
the other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] N/A - Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
The semantics of an Update on resources are very tricky to reason about,
especially for memory banks and models. The best way forward here is for
the user to unregister and register a new resource. We don't have a
compelling reason to support update APIs.
Tests:
```bash
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "chroma" \
  --env CHROMA_HOST=localhost --env CHROMA_PORT=8000
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "pgvector" \
  --env PGVECTOR_DB=postgres --env PGVECTOR_USER=postgres \
  --env PGVECTOR_PASSWORD=mysecretpassword --env PGVECTOR_HOST=0.0.0.0
$CONDA_PREFIX/bin/pytest -v -s -m "ollama" \
  llama_stack/providers/tests/inference/test_model_registration.py
```
---------
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
# What does this PR do?
This PR kills the notion of "pure passthrough" remote providers. You
cannot specify a single provider; you must specify a whole distribution
(stack) as remote.
This PR also significantly fixes / upgrades the testing infrastructure
so you can now test against a remotely hosted stack server by just doing
```bash
pytest -s -v -m remote test_agents.py \
--inference-model=Llama3.1-8B-Instruct --safety-shield=Llama-Guard-3-1B \
--env REMOTE_STACK_URL=http://localhost:5001
```
Also fixed `test_agents_persistence.py` (which was broken) and killed
some deprecated testing functions.
## Test Plan
All the tests.
# What does this PR do?
This PR brings back the facility to not force registration of resources
onto the user. This is not just annoying but sometimes actually not
feasible. For example, you may have a Stack which boots up with private
providers for inference for models A and B. There is no way for the user
to actually know which models are being served by these providers (to be
able to register them).
How will this avoid users needing to do registration? In a follow-up
diff, I will make sure I update the sample run.yaml files so they list
the models served by the distributions explicitly. So when users do
`llama stack build --template <...>` and run it, their distributions
come up with the right set of models they expect.
For self-hosted distributions, it also gives us a place to explicitly
list the models that need to be served to make the "complete" stack
(including safety, for example).
## Test Plan
Started ollama locally with two lightweight models: Llama3.2-3B-Instruct
and Llama-Guard-3-1B.
Updated all the tests, including agents. Here are the tests I ran so far:
```bash
pytest -s -v -m "fireworks and llama_3b" test_text_inference.py::TestInference \
--env FIREWORKS_API_KEY=...
pytest -s -v -m "ollama and llama_3b" test_text_inference.py::TestInference
pytest -s -v -m ollama test_safety.py
pytest -s -v -m faiss test_memory.py
pytest -s -v -m ollama test_agents.py \
--inference-model=Llama3.2-3B-Instruct --safety-model=Llama-Guard-3-1B
```
Found a few pre-existing bugs here and there that these test runs helped fix.
* Significantly simpler and malleable test setup
* convert memory tests
* refactor fixtures and add support for composable fixtures
* Fix memory to use the newer fixture organization
* Get agents tests working
* Safety tests work
* yet another refactor to make this more general; it now also accepts --inference-model and --safety-model options
* get multiple providers working for meta-reference (for inference + safety)
* Add README.md
---------
Co-authored-by: Ashwin Bharambe <ashwin@meta.com>
This PR adds support for Qdrant (https://qdrant.tech/) to be used as a vector memory.
I've unit-tested the methods to confirm that they work as intended.
To run Qdrant:
```
docker run -p 6333:6333 qdrant/qdrant
```
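For a quick connectivity check against that container, a minimal sketch using the qdrant-client package (not part of this PR's test suite):
```python
# Quick connectivity check against the local Qdrant container started above;
# not part of the provider's own tests.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # expect an empty collection list on a fresh instance
```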
This PR makes several core changes to the developer experience surrounding Llama Stack.
Background: PR #92 introduced the notion of "routing" to the Llama Stack. It introduced three object types: (1) models, (2) shields and (3) memory banks. Each of these objects can be associated with a distinct provider. So you can have model A served locally while models B and C are served remotely, for example.
However, this had a few drawbacks:
- You could not address the provider instances -- i.e., if you configured "meta-reference" with a given model, you could not assign an identifier to this instance which you could re-use later.
- The above meant that you could not register a "routing_key" (e.g. a model) dynamically and say "please use this existing provider I have already configured" for a new model.
- The terms "routing_table" and "routing_key" were exposed directly to the user. In my view, this is way too much overhead for a new user (which almost everyone is). People come to the stack wanting to do ML and encounter a completely unexpected term.
What this PR does: This PR structures the run config with only a single prominent key:
- providers
Providers are instances of configured provider types. Here's an example showing two instances of the remote::tgi provider serving two different models:
```
providers:
  inference:
    - provider_id: foo
      provider_type: remote::tgi
      config: { ... }
    - provider_id: bar
      provider_type: remote::tgi
      config: { ... }
```
Secondly, the PR adds dynamic registration of { models | shields | memory_banks } to the API surface. The distribution still acts like a "routing table" (as previously) except that it asks the backing providers for a listing of these objects. For example, it asks a TGI or Ollama inference adapter what models it is serving. Only the models that are actually being served can be requested by the user for inference. Otherwise, the Stack server will throw an error.
When dynamically registering these objects, you can use the provider IDs shown above. Info about providers can be obtained using the Api.inspect set of endpoints (/providers, /routes, etc.)
The above example shows the correspondence between inference providers and model registry items. Things work similarly for the safety <=> shields and memory <=> memory_banks pairs.
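Purely as an illustration of the kind of dynamic registration call this enables (the client method names and parameters below are assumptions about the client SDK, not a verified signature):
```python
# Illustrative only: method and parameter names are assumptions about the
# client SDK, not a verified signature.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")
# Register a model against one of the configured provider instances ("foo" above).
client.models.register(model_id="Llama3.1-8B-Instruct", provider_id="foo")
```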
Registry: This PR also makes it so that Providers need to implement additional methods for registering and listing objects. For example, each Inference provider is now expected to implement the `ModelsProtocolPrivate` protocol (naming is not great!) which consists of two methods:
- `register_model`
- `list_models`
The goal is to inform the provider that a certain model needs to be supported so the provider can make any relevant backend changes if needed (or throw an error if the model cannot be supported).
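A rough sketch of what such a private protocol might look like (illustrative; the actual definition and signatures in the repo may differ):
```python
# Rough sketch only -- the actual ModelsProtocolPrivate definition and
# signatures in the repo may differ.
from typing import Any, List, Protocol


class ModelsProtocolPrivate(Protocol):
    async def register_model(self, model: Any) -> None:
        """Tell the provider a model must be served; it may prepare backend
        state or raise if the model cannot be supported."""
        ...

    async def list_models(self) -> List[Any]:
        """Return the models this provider is currently serving."""
        ...
```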
There are many other cleanups included, some of which are detailed in a follow-up comment.