llama-stack/llama_stack/providers/tests
Botao Chen 36b4fe02cc
[4/n][torchtune integration] support lazy load model during inference (#620)
## What does this PR do?
In this PR, we refactor the meta reference inference logic to support:
- loading the model when it is registered instead of when the server
spins up
- running inference on a finetuned model checkpoint built on top of a
native llama model

## Why are these changes needed?
They address the following pain points:
- users cannot lazy load the model or hot switch the inference
checkpoint after spinning up the server
- this blocks us from running inference and eval on the same server for a
finetuned checkpoint after post training
- users cannot run inference on a finetuned checkpoint on top of native
llama models

## Expected user experience changes
- The inference model is no longer loaded when the server spins up. Instead,
it is loaded when the model is registered. If the user adds the model as a
models resource in run.yaml, it is registered and loaded automatically
when the server starts. An optional flag 'skip_initialize' in the
model metadata skips model loading during registration.
- An optional flag 'llama_model' in the model metadata identifies
the base model of the Model class, which is used for validation and to
initialize the model arch. The model identifier no longer needs to be a
native llama model.
- The default inference model name changes from
'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct'
- It aligns with the checkpoint folder name after running 'llama model
download'
- It aligns with the descriptor name defined in the llama-models SKU list
bf5b0c4fe7/models/datatypes.py (L95)
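
The effect of the two metadata flags can be sketched as follows. This is a hypothetical illustration, not the provider's code: the function name `plan_registration` and the returned dictionary are invented here, and falling back to the model identifier when 'llama_model' is absent is an assumption.

```python
def plan_registration(model_id, metadata=None):
    """Hypothetical helper: decide how a model registration would be handled."""
    metadata = metadata or {}
    # Assumption: without 'llama_model', the identifier itself names the base model.
    base_model = metadata.get("llama_model", model_id)
    # 'skip_initialize' turns off loading at registration time.
    load_now = not metadata.get("skip_initialize", False)
    return {"model_id": model_id, "base_model": base_model, "load_now": load_now}


# A native model is loaded as soon as it is registered.
print(plan_registration("Llama3.2-3B-Instruct"))

# A finetuned checkpoint names its base model and defers loading.
print(
    plan_registration(
        "my-finetuned-checkpoint",  # hypothetical identifier
        {"llama_model": "Llama3.2-3B-Instruct", "skip_initialize": True},
    )
)
```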


## test
run python llama_stack/scripts/distro_codegen.py


**run unit test**
- torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference"
--inference-model="Llama3.1-8B-Instruct"
./llama_stack/providers/tests/inference/test_text_inference.py
- torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference"
--inference-model="Llama3.1-8B-Instruct"
./llama_stack/providers/tests/inference/test_model_registration.py


**test post training experience**
on server side run: llama stack run
llama_stack/templates/experimental-post-training/run.yaml
The server spins up without a model loaded

<img width="812" alt="Screenshot 2024-12-17 at 1 24 50 PM"
src="https://github.com/user-attachments/assets/ce1f606b-3b6f-452f-b48e-b3761ffd90f3"
/>

on client side, run: llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 models register
Llama3.2-3B-Instruct
The model is registered successfully and loaded
<img width="1111" alt="Screenshot 2024-12-17 at 1 26 30 PM"
src="https://github.com/user-attachments/assets/56e02131-cf7d-4de5-8f63-fbdcb8c55c26"
/>


<img width="1541" alt="Screenshot 2024-12-17 at 1 26 09 PM"
src="https://github.com/user-attachments/assets/a83255a1-20f5-40a2-af51-55641410a115"
/>

If "skip_initialize" is added to the metadata, the model is registered but
is not loaded

on client side, run: llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 inference chat-completion
--message "hello, what model are you?"

Inference on the model succeeds
<img width="1121" alt="Screenshot 2024-12-17 at 1 27 33 PM"
src="https://github.com/user-attachments/assets/8e708545-3fe7-4a73-8754-1470fa5f1e75"
/>

**test inference experience**
run: llama stack run llama_stack/templates/meta-reference-gpu/run.yaml
The model is loaded because it is in the resource list in run.yaml
<img width="1537" alt="Screenshot 2024-12-17 at 1 30 19 PM"
src="https://github.com/user-attachments/assets/5c8af817-66eb-43f8-bf4c-f5e24b0a12c6"
/>

on client side, run: llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 inference chat-completion
--message "hello, what model are you?"
Inference succeeds
<img width="1123" alt="Screenshot 2024-12-17 at 1 31 08 PM"
src="https://github.com/user-attachments/assets/471809aa-c65e-46dc-a37e-7094fb857f97"
/>



## inference on a finetuned model
**register a model that was finetuned by the post training api
(torchtune)**
- the model is registered and loaded successfully
- the model shows up in the model list
<img width="974" alt="Screenshot 2024-12-18 at 3 56 33 PM"
src="https://github.com/user-attachments/assets/2994b4f5-4fa9-40c6-acc6-4b971479f3e2"
/>

**run inference**

<img width="977" alt="Screenshot 2024-12-18 at 3 57 59 PM"
src="https://github.com/user-attachments/assets/d117abbc-b2a0-41d8-a028-1a13128787b2"
/>
2024-12-18 16:30:53 -08:00
agents Update the "InterleavedTextMedia" type (#635) 2024-12-17 11:18:31 -08:00
datasetio [1/n] torchtune <> llama-stack integration skeleton (#540) 2024-12-13 11:05:35 -08:00
eval refactor scoring/eval pytests (#607) 2024-12-11 10:47:37 -08:00
inference [4/n][torchtune integration] support lazy load model during inference (#620) 2024-12-18 16:30:53 -08:00
memory Update the "InterleavedTextMedia" type (#635) 2024-12-17 11:18:31 -08:00
post_training Update the "InterleavedTextMedia" type (#635) 2024-12-17 11:18:31 -08:00
safety Update the "InterleavedTextMedia" type (#635) 2024-12-17 11:18:31 -08:00
scoring refactor scoring/eval pytests (#607) 2024-12-11 10:47:37 -08:00
__init__.py Remove "routing_table" and "routing_key" concepts for the user (#201) 2024-10-10 10:24:13 -07:00
conftest.py [1/n] torchtune <> llama-stack integration skeleton (#540) 2024-12-13 11:05:35 -08:00
env.py Significantly simpler and malleable test setup (#360) 2024-11-04 17:36:43 -08:00
README.md update tests --inference-model to hf id 2024-11-18 17:36:58 -08:00
resolver.py Auto-generate distro yamls + docs (#468) 2024-11-18 14:57:06 -08:00

Testing Llama Stack Providers

The Llama Stack is designed as a collection of Lego blocks -- various APIs -- which are composable and can be used to quickly and reliably build an app. We need a testing setup which is relatively flexible to enable easy combinations of these providers.

We use pytest and all of its dynamism to enable the features needed. Specifically:

  • We use pytest_addoption to add CLI options allowing you to override providers, models, etc.

  • We use pytest_generate_tests to dynamically parametrize our tests. This allows us to support a default set of (providers, models, etc.) combinations but retain the flexibility to override them via the CLI if needed.

  • We use pytest_configure to make sure we dynamically add appropriate marks based on the fixtures we make.
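
As a rough illustration of how these three hooks fit together, here is a minimal conftest.py sketch. It is not the repository's actual file: the `inference_model` fixture name, the `DEFAULT_MODELS` list, and the help text are assumptions made for the example. Only the pytest hook names and the calls on `parser`, `config`, and `metafunc` are real pytest APIs.

```python
# conftest.py (illustrative sketch, not the repository's actual file)

DEFAULT_MODELS = ["llama_8b", "llama_3b"]  # assumed default parametrization


def pytest_addoption(parser):
    # Register a CLI option so the model can be overridden at invocation time.
    parser.addoption(
        "--inference-model",
        action="store",
        default=None,
        help="Override the model used by inference tests",
    )


def pytest_configure(config):
    # Register one marker name per default model so pytest knows about it.
    for name in DEFAULT_MODELS:
        config.addinivalue_line("markers", f"{name}: tests parametrized with {name}")


def pytest_generate_tests(metafunc):
    # Only parametrize tests that request the (hypothetical) inference_model fixture.
    if "inference_model" not in metafunc.fixturenames:
        return
    override = metafunc.config.getoption("--inference-model")
    models = [override] if override else DEFAULT_MODELS
    metafunc.parametrize("inference_model", models)
```

The sketch only registers the marker names. Attaching each mark to its parameter, so that -m can select it, would typically be done by wrapping the values in pytest.param(..., marks=...), which is omitted here to keep the example short.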

Common options

All tests support a --providers option which can be a string of the form api1=provider_fixture1,api2=provider_fixture2. So, when testing safety (which needs the inference and safety APIs) you can use --providers inference=together,safety=meta_reference to use these fixtures in concert.
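
To make the option format concrete, the following sketch parses such a string into a mapping from API name to fixture name. The helper `parse_providers` is hypothetical and only mirrors the format described above. The test suite's own parsing may differ.

```python
from __future__ import annotations


def parse_providers(spec: str) -> dict[str, str]:
    """Parse 'api1=fixture1,api2=fixture2' into {'api1': 'fixture1', ...}."""
    providers = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray or trailing commas
        api, sep, fixture = item.partition("=")
        api, fixture = api.strip(), fixture.strip()
        if not sep or not api or not fixture:
            raise ValueError(f"expected api=provider_fixture, got {item!r}")
        providers[api] = fixture
    return providers


print(parse_providers("inference=together,safety=meta_reference"))
# {'inference': 'together', 'safety': 'meta_reference'}
```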

Depending on the API, there are custom options enabled. For example, inference tests allow for an --inference-model override, etc.

By default, we disable warnings and enable short tracebacks. You can override them using pytest's flags as appropriate.

Some providers need special API keys or other configuration options to work. You can check out the individual fixtures (located in tests/<api>/fixtures.py) for what these keys are. These can be specified using the --env CLI option. You can also set them in the environment (by exporting them in your shell) or put them in a .env file in the directory from which you run the tests. For example, to use the Together fixture you can use --env TOGETHER_API_KEY=<...>
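
One way to picture what --env does is the sketch below, which applies KEY=VALUE strings to an environment mapping. The helper `apply_env_overrides` is hypothetical, and the precedence between --env, exported variables, and the .env file is not stated above, so none is assumed here.

```python
import os


def apply_env_overrides(pairs, environ=None):
    """Apply KEY=VALUE strings (as passed via --env) to an environment mapping."""
    environ = os.environ if environ is None else environ
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        # Values may themselves contain '=', so only the first one splits.
        environ[key] = value
    return environ


# Use a plain dict here so the example does not touch the real environment.
env = apply_env_overrides(["TOGETHER_API_KEY=example-key"], environ={})
print(env)
# {'TOGETHER_API_KEY': 'example-key'}
```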

Inference

We have the following orthogonal parametrizations (pytest "marks") for inference tests:

  • providers: (meta_reference, together, fireworks, ollama)
  • models: (llama_8b, llama_3b)

If you want to run a test with the llama_8b model with fireworks, you can use:

pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m "fireworks and llama_8b" \
  --env FIREWORKS_API_KEY=<...>

You can make it more complex to run both llama_8b and llama_3b on Fireworks, but only llama_3b with Ollama:

pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m "fireworks or (ollama and llama_3b)" \
  --env FIREWORKS_API_KEY=<...>
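
To see which (provider, model) combinations the two -m expressions above select, the sketch below enumerates every combination and evaluates the expression against its marks. It assumes each parametrized test carries exactly one provider mark and one model mark. Python's eval stands in for pytest's own mark-expression matcher, which happens to share the and/or/not syntax. This is not how pytest implements selection.

```python
from itertools import product

PROVIDERS = ["meta_reference", "together", "fireworks", "ollama"]
MODELS = ["llama_8b", "llama_3b"]


def selected(expression):
    """Return the (provider, model) pairs whose marks satisfy the expression."""
    picks = []
    for provider, model in product(PROVIDERS, MODELS):
        # A mark name is true only if this combination carries that mark.
        marks = {name: name in (provider, model) for name in PROVIDERS + MODELS}
        if eval(expression, {"__builtins__": {}}, marks):
            picks.append((provider, model))
    return picks


print(selected("fireworks and llama_8b"))
# [('fireworks', 'llama_8b')]
print(selected("fireworks or (ollama and llama_3b)"))
# [('fireworks', 'llama_8b'), ('fireworks', 'llama_3b'), ('ollama', 'llama_3b')]
```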

Finally, you can override the model completely by doing:

pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m fireworks \
  --inference-model "meta-llama/Llama3.1-70B-Instruct" \
  --env FIREWORKS_API_KEY=<...>

Agents

The Agents API composes three other APIs underneath:

  • Inference
  • Safety
  • Memory

Given that each of these has several fixtures, the set of combinations is large. We provide a default set of combinations (see tests/agents/conftest.py) with easy to use "marks":

  • meta_reference -- uses all the meta_reference fixtures for the dependent APIs
  • together -- uses Together for inference, and meta_reference for the rest
  • ollama -- uses Ollama for inference, and meta_reference for the rest
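
The three marks above can be read as shorthand for a full provider assignment. The sketch below spells each one out as the equivalent --providers string. The names `combination`, `DEFAULT_COMBINATIONS`, and `providers_option` are made up for illustration, and the fixture names are taken literally from the list above. The actual definitions live in tests/agents/conftest.py and may differ.

```python
DEPENDENT_APIS = ("inference", "safety", "memory")


def combination(inference_provider, default="meta_reference"):
    # Use the given provider for inference and the default for the rest.
    return {
        api: (inference_provider if api == "inference" else default)
        for api in DEPENDENT_APIS
    }


DEFAULT_COMBINATIONS = {
    mark: combination(mark) for mark in ("meta_reference", "together", "ollama")
}


def providers_option(mark):
    """Render a mark's combination in the api=fixture,... form."""
    combo = DEFAULT_COMBINATIONS[mark]
    return ",".join(f"{api}={fixture}" for api, fixture in combo.items())


print(providers_option("together"))
# inference=together,safety=meta_reference,memory=meta_reference
```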

An example test with Together:

pytest -s -m together llama_stack/providers/tests/agents/test_agents.py  \
 --env TOGETHER_API_KEY=<...>

If you want to override the inference model or safety model used, you can use the --inference-model or --safety-shield CLI options as appropriate.

If you want to test a remotely hosted stack, you can use -m remote as follows:

pytest -s -m remote llama_stack/providers/tests/agents/test_agents.py \
  --env REMOTE_STACK_URL=<...>