Commit graph

6 commits

Author SHA1 Message Date
Botao Chen
36b4fe02cc
[4/n][torchtune integration] support lazy load model during inference (#620)
## What does this PR do?
In this PR, we refactor the meta reference inference logic to support 
- load the model during registering model instead of during spinning up
server
- support inference finetuned model checkpoint on top of native llama
model

## Why need these changes
To solve the existing pain points that 
- user cannot lazy load the model and hot switch the inference
checkpoint after spinning up the server
- this blocks us doing inference and eval on the same sever for a
finetuned checkpoint after post training
- user cannot do inference on a finetuned checkpoint on top of native
llama models

## Expect user experience change
- The inference model won't be loaded when spinning up server. Instead,
it will be loaded during register model. If user add the model as models
resource in run.yaml, it will be registered and loaded automatically
when starting server. There is an optional flag 'skip_initialize' in
model metadata to skip model loading during registration.
- There is an optional flag 'llama_model' in model metadata to identify
the base model of the Model class for validation and initialize model
arch. model identifier no longer needs to be a native llama model
- the default inference model name updates from
'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct'
- It aligns with the checkpoint folder name after running 'llama model
download'
- It aligns with the descriptor name defined in llama-models SKU list
bf5b0c4fe7/models/datatypes.py (L95)


## test
run python llama_stack/scripts/distro_codegen.py


**run unit test**
- torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference"
--inference-model="Llama3.1-8B-Instruct"
./llama_stack/providers/tests/inference/test_text_inference.py
- torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference"
--inference-model="Llama3.1-8B-Instruct"
./llama_stack/providers/tests/inference/test_model_registration.py


**test post training experience**
on server side run: llama stack run
llama_stack/templates/experimental-post-training/run.yaml
server is spinning up without model loaded

<img width="812" alt="Screenshot 2024-12-17 at 1 24 50 PM"
src="https://github.com/user-attachments/assets/ce1f606b-3b6f-452f-b48e-b3761ffd90f3"
/>

on client side, run: llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 models register
Llama3.2-3B-Instruct
register model successfully and the model is loaded 
<img width="1111" alt="Screenshot 2024-12-17 at 1 26 30 PM"
src="https://github.com/user-attachments/assets/56e02131-cf7d-4de5-8f63-fbdcb8c55c26"
/>


<img width="1541" alt="Screenshot 2024-12-17 at 1 26 09 PM"
src="https://github.com/user-attachments/assets/a83255a1-20f5-40a2-af51-55641410a115"
/>

if add "skip_initialize" in metadata, model is registered but isn't
loaded

on client side, run: llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 inference chat-completion
--message "hello, what model are you?"

Inference the model succesfully
<img width="1121" alt="Screenshot 2024-12-17 at 1 27 33 PM"
src="https://github.com/user-attachments/assets/8e708545-3fe7-4a73-8754-1470fa5f1e75"
/>

**test inference experience**
run: llama stack run llama_stack/templates/meta-reference-gpu/run.yaml
model is loaded since the model is in resouce list in run.yaml 
<img width="1537" alt="Screenshot 2024-12-17 at 1 30 19 PM"
src="https://github.com/user-attachments/assets/5c8af817-66eb-43f8-bf4c-f5e24b0a12c6"
/>

on client side, run: llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 inference chat-completion
--message "hello, what model are you?"
inference successfully 
<img width="1123" alt="Screenshot 2024-12-17 at 1 31 08 PM"
src="https://github.com/user-attachments/assets/471809aa-c65e-46dc-a37e-7094fb857f97"
/>



## inference on a finetuned model
**register a finetuned model that finetuned by post training api
(torchtune)**
- the model is registered and loaded successfully 
- the model is shown up in the model list 
<img width="974" alt="Screenshot 2024-12-18 at 3 56 33 PM"
src="https://github.com/user-attachments/assets/2994b4f5-4fa9-40c6-acc6-4b971479f3e2"
/>

**run inference**

<img width="977" alt="Screenshot 2024-12-18 at 3 57 59 PM"
src="https://github.com/user-attachments/assets/d117abbc-b2a0-41d8-a028-1a13128787b2"
/>
2024-12-18 16:30:53 -08:00
Ashwin Bharambe
e84d4436b5
Since we are pushing for HF repos, we should accept them in inference configs (#497)
# What does this PR do?

As the title says. 

## Test Plan

This needs
8752149f58
to also land. So the next package (0.0.54) will make this work properly.

The test is:

```bash
pytest -v -s -m "llama_3b and meta_reference" test_model_registration.py
```
2024-11-20 16:14:37 -08:00
Dinesh Yeduguru
57a9b4d57f
Allow models to be registered as long as llama model is provided (#472)
This PR allows models to be registered with provider as long as the user
specifies a llama model, even though the model does not match our
prebuilt provider specific mapping.
Test:
pytest -v -s
llama_stack/providers/tests/inference/test_model_registration.py -m
"together" --env TOGETHER_API_KEY=<KEY>

---------

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-18 15:05:29 -08:00
Dinesh Yeduguru
0850ad656a
unregister for memory banks and remove update API (#458)
The semantics of an Update on resources is very tricky to reason about
especially for memory banks and models. The best way to go forward here
is for the user to unregister and register a new resource. We don't have
a compelling reason to support update APIs.


Tests:
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m
"chroma" --env CHROMA_HOST=localhost --env CHROMA_PORT=8000

pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m
"pgvector" --env PGVECTOR_DB=postgres --env PGVECTOR_USER=postgres --env
PGVECTOR_PASSWORD=mysecretpassword --env PGVECTOR_HOST=0.0.0.0

$CONDA_PREFIX/bin/pytest -v -s -m "ollama"
llama_stack/providers/tests/inference/test_model_registration.py

---------

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-14 17:12:11 -08:00
Dinesh Yeduguru
efe791bab7
Support model resource updates and deletes (#452)
# What does this PR do?
* Changes the registry to store only one RoutableObject per identifier.
Before it was a list, which is not really required.
* Adds impl for updates and deletes
* Updates routing table to handle updates correctly



## Test Plan
```
❯ llama-stack-client models list
+------------------------+---------------+------------------------------------+------------+
| identifier             | provider_id   | provider_resource_id               | metadata   |
+========================+===============+====================================+============+
| Llama3.1-405B-Instruct | fireworks-0   | fireworks/llama-v3p1-405b-instruct | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.1-8B-Instruct   | fireworks-0   | fireworks/llama-v3p1-8b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.2-3B-Instruct   | fireworks-0   | fireworks/llama-v3p2-1b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
❯ llama-stack-client models register dineshyv-model --provider-model-id=fireworks/llama-v3p1-70b-instruct
Successfully registered model dineshyv-model
❯ llama-stack-client models list
+------------------------+---------------+------------------------------------+------------+
| identifier             | provider_id   | provider_resource_id               | metadata   |
+========================+===============+====================================+============+
| Llama3.1-405B-Instruct | fireworks-0   | fireworks/llama-v3p1-405b-instruct | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.1-8B-Instruct   | fireworks-0   | fireworks/llama-v3p1-8b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.2-3B-Instruct   | fireworks-0   | fireworks/llama-v3p2-1b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
| dineshyv-model         | fireworks-0   | fireworks/llama-v3p1-70b-instruct  | {}         |
+------------------------+---------------+------------------------------------+------------+
❯ llama-stack-client models update dineshyv-model --provider-model-id=fireworks/llama-v3p1-405b-instruct
Successfully updated model dineshyv-model
❯ llama-stack-client models list
+------------------------+---------------+------------------------------------+------------+
| identifier             | provider_id   | provider_resource_id               | metadata   |
+========================+===============+====================================+============+
| Llama3.1-405B-Instruct | fireworks-0   | fireworks/llama-v3p1-405b-instruct | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.1-8B-Instruct   | fireworks-0   | fireworks/llama-v3p1-8b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.2-3B-Instruct   | fireworks-0   | fireworks/llama-v3p2-1b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
| dineshyv-model         | fireworks-0   | fireworks/llama-v3p1-405b-instruct | {}         |
+------------------------+---------------+------------------------------------+------------+
llama-stack-client models delete dineshyv-model
❯ llama-stack-client models list
+------------------------+---------------+------------------------------------+------------+
| identifier             | provider_id   | provider_resource_id               | metadata   |
+========================+===============+====================================+============+
| Llama3.1-405B-Instruct | fireworks-0   | fireworks/llama-v3p1-405b-instruct | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.1-8B-Instruct   | fireworks-0   | fireworks/llama-v3p1-8b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+
| Llama3.2-3B-Instruct   | fireworks-0   | fireworks/llama-v3p2-1b-instruct   | {}         |
+------------------------+---------------+------------------------------------+------------+

```

---------

Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
2024-11-13 21:55:41 -08:00
Dinesh Yeduguru
787e2034b7
model registration in ollama and vllm check against the available models in the provider (#446)
tests:
pytest -v -s -m "ollama"
llama_stack/providers/tests/inference/test_text_inference.py

pytest -v -s -m vllm_remote
llama_stack/providers/tests/inference/test_text_inference.py --env
VLLM_URL="http://localhost:9798/v1"

---------
2024-11-13 13:04:06 -08:00