Commit graph

14 commits

Author SHA1 Message Date
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred on recent Python
releases.)
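
For illustration, a minimal sketch (not the actual `llama_stack/strong_typing/schema.py` code) of how reflection can recognize PEP 604 unions and `collections.abc.AsyncIterator`, assuming Python 3.10+:

```python
# Minimal sketch, assuming Python 3.10+; not the actual strong_typing code.
import collections.abc
import types
import typing


def describe(annotation: object) -> str:
    origin = typing.get_origin(annotation)
    # `int | None` reports types.UnionType; typing.Optional[int] reports typing.Union.
    if origin in (types.UnionType, typing.Union):
        members = ", ".join(describe(arg) for arg in typing.get_args(annotation))
        return f"union[{members}]"
    # Both typing.AsyncIterator[T] and collections.abc.AsyncIterator[T] land here.
    if origin is collections.abc.AsyncIterator:
        (item,) = typing.get_args(annotation)
        return f"async_iterator[{describe(item)}]"
    return getattr(annotation, "__name__", repr(annotation))


print(describe(int | None))                          # union[int, NoneType]
print(describe(collections.abc.AsyncIterator[str]))  # async_iterator[str]
```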

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth noting can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Ashwin Bharambe
530d4bdfe1
refactor: move all llama code to models/llama out of meta reference (#1887)
# What does this PR do?

Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.

Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.

## Test Plan

```
LLAMA_MODELS_DEBUG=1 \
  with-proxy llama stack run meta-reference-gpu \
  --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --env INFERENCE_CHECKPOINT_DIR=<DIR> \
  --env MODEL_PARALLEL_SIZE=4 \
  --env QUANTIZATION_TYPE=fp8_mixed
```

Start a server with and without quantization. Point integration tests to
it using:

```
pytest -s -v  tests/integration/inference/test_text_inference.py \
   --stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-04-07 15:03:58 -07:00
Ihar Hrachyshka
0cbb7f7f21
chore: fix mypy violations in post_training modules (#1548)
# What does this PR do?

Fixes a number of mypy violations in the post_training modules.

Note: this patch touches every file except post_training.py, which will be
significantly changed by #1437, so it is left out of the picture for now.
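
For context, a hypothetical illustration (not taken from this patch) of the kind of annotation fix mypy typically asks for in modules like these:

```python
# Hypothetical example only; the actual violations fixed here differ.
from typing import Optional


# Before (rejected under strict settings): untyped signature, implicit Optional.
# def get_job_status(job_uuid, default=None):
#     return default

def get_job_status(job_uuid: str, default: Optional[str] = None) -> Optional[str]:
    """Placeholder lookup: always falls back to `default` in this sketch."""
    return default
```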


## Test Plan

Testing with https://github.com/meta-llama/llama-stack/pull/1543

Also checked that GPU training works with the change:

```
INFO:     ::1:53316 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 200 OK
INFO:     ::1:53316 - "GET /v1/post-training/job/status?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
INFO:     ::1:53316 - "GET /v1/post-training/job/artifacts?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
21:24:01.161 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (32526.75ms)
 21:23:28.769 [DEBUG] Setting manual seed to local seed 3918872849. Local seed is seed + rank = 3918872849 + 0
 21:23:28.996 [INFO] Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
 21:23:29.933 [INFO] Memory stats after model init:
        GPU peak memory allocation: 6.05 GiB
        GPU peak memory reserved: 6.10 GiB
        GPU peak memory active: 6.05 GiB
 21:23:29.934 [INFO] Model is initialized with precision torch.bfloat16.
 21:23:30.115 [INFO] Tokenizer is initialized.
 21:23:30.118 [INFO] Optimizer is initialized.
 21:23:30.119 [INFO] Loss is initialized.
 21:23:30.896 [INFO] Dataset and Sampler are initialized.
 21:23:30.898 [INFO] Learning rate scheduler is initialized.
 21:23:31.618 [INFO] Memory stats after model init:
        GPU peak memory allocation: 6.24 GiB
        GPU peak memory reserved: 6.30 GiB
        GPU peak memory active: 6.24 GiB
 21:23:31.620 [INFO] Starting checkpoint save...
 21:23:59.428 [INFO] Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
 21:23:59.445 [INFO] Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth

```


Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-18 14:58:16 -07:00
Botao Chen
123fb9eb24
feat: [post training] support save hf safetensor format checkpoint (#845)
## context

Today, llama stack only supports inference / eval of a finetuned
checkpoint with meta-reference as the inference provider. This is
sub-optimal since meta-reference is pretty slow.

Our vision is that developers can run inference / eval on a finetuned
checkpoint produced by the post training APIs with any inference provider
on the stack. To achieve this, we'd like to define a unified output
checkpoint format for post training providers, so that every inference
provider can consume that format for customized model inference.

By spot-checking how
[ollama](https://github.com/ollama/ollama/blob/main/docs/import.md) and
[fireworks](https://docs.fireworks.ai/models/uploading-custom-models) do
inference on a customized model, we defined the output checkpoint format
as /adapter/adapter_config.json and /adapter/adapter_model.safetensors
(since we only support LoRA post training today, we begin with adapter-only
checkpoints).
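
For illustration, a minimal sketch of how an inference provider could consume that layout (the output directory path is a placeholder, and config keys such as `peft_type`/`r` follow common PEFT conventions rather than anything defined in this PR):

```python
# Sketch only: read the adapter-only checkpoint layout defined above.
import json
from pathlib import Path

from safetensors.torch import load_file

adapter_dir = Path("/path/to/output_checkpoint") / "adapter"  # placeholder path

config = json.loads((adapter_dir / "adapter_config.json").read_text())
weights = load_file(adapter_dir / "adapter_model.safetensors")

print(config.get("peft_type"), config.get("r"))  # e.g. LORA and the LoRA rank
print(sorted(weights)[:3])                       # a few LoRA tensor names
```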

## test
We kicked off a post training job with the checkpoint format configured as
'huggingface'. Output files:
![Screenshot 2025-02-24 at 11 54
33 PM](https://github.com/user-attachments/assets/fb45a5d7-f288-4d30-82f8-b7a8da2859be)



We did a proof of concept with ollama to see whether ollama can run
inference on our finetuned checkpoint.
1. Create a Modelfile like:

<img width="799" alt="Screenshot 2025-01-22 at 5 04 18 PM"
src="https://github.com/user-attachments/assets/7fca9ac3-a294-44f8-aab1-83852c600609"
/>

2. Create a customized model with `ollama create llama_3_2_finetuned` and
run inference successfully:

![Screenshot 2025-02-24 at 11 55
17 PM](https://github.com/user-attachments/assets/1abe7c52-c6a7-491a-b07c-b7a8e3fd1ddd)


This is just a proof of concept with the ollama command line. As a next
step, we'd like to wrap the loading / inference logic for customized models
in the inference provider implementation.
2025-02-25 23:29:08 -08:00
Ashwin Bharambe
314ee09ae3
chore: move all Llama Stack types from llama-models to llama-stack (#1098)
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.

This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279

## Test Plan

Ensure all `llama` CLI `model` sub-commands work:

```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```

Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```

Create a fresh venv (`uv venv && source .venv/bin/activate`) and run
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv`; the server runs.

Also checked that the OpenAPI generator can run and there is no change
in the generated files as a result.

```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
2025-02-14 09:10:59 -08:00
Sébastien Han
e4a1579e63
build: format codebase imports using ruff linter (#1028)
# What does this PR do?

- Configured ruff linter to automatically fix import sorting issues.
- Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are
applied.
- Enabled the 'I' selection to focus on import-related linting rules.
- Ran the linter, and formatted all codebase imports accordingly.
- Removed the black dep from the "dev" group since we use ruff

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-13 10:06:21 -08:00
Yuan Tang
34ab7a3b6c
Fix precommit check after moving to ruff (#927)
The lint check in the main branch is failing. This fixes the lint check
after we moved to ruff in
https://github.com/meta-llama/llama-stack/pull/921. We need to move to a
`ruff.toml` file as well as fix and ignore some additional checks.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-02 06:46:45 -08:00
Botao Chen
25c1d9b037
[post training] define llama stack post training dataset format (#717)
## context
In this PR, we defined 2 llama stack dataset formats (instruct, dialog);
illustrative rows for both are sketched below.

- For the instruct dataset format, the column schema is
[chat_completion_input, expected_answer], which is consistent with the
eval data format. This dataset format is the abstraction for single-turn
QA style post training data.
- For the dialog dataset format, the column schema is [dialog], a list of
interleaved user and assistant messages. During training, the whole list
is the model input and the loss is calculated on assistant messages only.
This dataset format is the abstraction for multi-turn chat style post
training data.
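
Hypothetical rows for the two formats (only the column names come from this PR; the message encoding inside `dialog` is an assumption):

```python
# Made-up example rows; only the column names are defined by this PR.
instruct_row = {
    "chat_completion_input": "What is the capital of France?",
    "expected_answer": "Paris",
}

dialog_row = {
    "dialog": [
        {"role": "user", "content": "Suggest a name for a hiking app."},
        {"role": "assistant", "content": "How about TrailMate?"},  # loss computed here
        {"role": "user", "content": "Something shorter, please."},
        {"role": "assistant", "content": "Trek."},                 # and here
    ],
}
```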

## changes
- defined the 2 llama stack dataset formats
- an adapter to convert llama stack dataset format to torchtune dataset
format
- move dataset format validation to post training level instead of
torchtune level since it's not specific to torchtune
- add localfs as datasetio provider


## test 
instruct format
- use https://huggingface.co/datasets/llamastack/evals as dataset and
the training works as expected
<img width="1443" alt="Screenshot 2025-01-09 at 5 15 14 PM"
src="https://github.com/user-attachments/assets/2c37a936-c67a-4726-90e0-23fa0ba7000f"
/>

- use my generated local dataset and the training works as expected

<img width="1617" alt="Screenshot 2025-01-09 at 5 19 11 PM"
src="https://github.com/user-attachments/assets/0bdccbbf-bac2-472a-a365-15213e49bbfa"
/>


dialog format
- use my generated local dataset and the training works as expected
<img width="1588" alt="Screenshot 2025-01-09 at 5 23 16 PM"
src="https://github.com/user-attachments/assets/893915ba-41a3-4d51-948b-e872060ecede"
/>
2025-01-14 12:48:49 -08:00
Botao Chen
747683a8a2
Add init files to post training folders (#711)
Add init files to the post training folders so the package build picks up
those files.

## Test
WIP colab notebook
https://colab.research.google.com/drive/1K4Q2wZq232_Bpy2ud4zL9aRxvCWAwyQs?usp=sharing
to showcase the post training APIs
2025-01-13 20:19:18 -08:00
Botao Chen
e86271aeac
support llama3.1 8B instruct in post training (#698)
## What does this PR do?
- Switch to the llama3.1 8B instruct model instead of the llama3 8B model,
since the llama3.1 8B instruct model is a better base to finetune on top
of.
- Make the file copy logic in the checkpointer safer in case a file to be
copied doesn't exist in the source path (see the sketch below).
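
For illustration, a sketch of the defensive copy pattern described above (a hypothetical helper, not the actual checkpointer code):

```python
# Hypothetical helper illustrating the safer copy behavior; not the real code.
import shutil
from pathlib import Path


def copy_if_present(src_dir: Path, dst_dir: Path, filename: str) -> bool:
    """Copy src_dir/filename into dst_dir, silently skipping missing files."""
    src = src_dir / filename
    if not src.exists():
        return False  # tolerate a missing source file instead of raising
    dst_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst_dir / filename)
    return True
```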

## test
Issue a post training request from the client and verify training works as
expected:
<img width="1101" alt="Screenshot 2025-01-02 at 12 18 45 PM"
src="https://github.com/user-attachments/assets/47cc4df9-3edc-4afd-b5dd-abe1f039f1ed"
/>

<img width="782" alt="Screenshot 2025-01-02 at 12 18 52 PM"
src="https://github.com/user-attachments/assets/b9435274-ef1d-4570-bd8e-0880c3a4b2e9"
/>
2025-01-03 17:33:05 -08:00
Botao Chen
d9f75cc98f
Import from the right path (#708)
Import BaseModel and Field from pydantic
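
In other words, the module now uses the top-level pydantic exports (the previous, incorrect import path isn't shown in this log, so it is omitted here):

```python
# BaseModel and Field are re-exported at the top level of pydantic.
from pydantic import BaseModel, Field
```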
2025-01-02 13:15:31 -08:00
Botao Chen
750604c7af
[Post Training] Fix missing import (#705)
## context
Post training APIs are broken after the import * refactor in
https://github.com/meta-llama/llama-stack/pull/689. This PR adds the
missing import back.

## Test
Issue a post training request from the client and verify the training
finishes successfully:

<img width="1101" alt="Screenshot 2025-01-02 at 12 18 45 PM"
src="https://github.com/user-attachments/assets/8c781459-f340-4021-85e1-fc68b1dcb8c8"
/>

<img width="782" alt="Screenshot 2025-01-02 at 12 18 52 PM"
src="https://github.com/user-attachments/assets/14b04b7d-e5c7-4662-8fa6-748446ad3511"
/>
2025-01-02 13:08:20 -08:00
Xi Yan
3c72c034e6
[remove import *] clean up import *'s (#689)
# What does this PR do?

- as title, cleaning up `import *`'s (see the sketch after this list)
- upgrade tests to make them more robust to bad model outputs
- remove import *'s in llama_stack/apis/* (skip __init__ modules)
<img width="465" alt="image"
src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2"
/>

- ran `sh run_openapi_generator.sh`; no types are affected
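
A hypothetical sketch of the refactor pattern (the module and symbol names are illustrative, not necessarily the ones this PR touched):

```python
# Before: a wildcard import hides where each name comes from.
#   from llama_stack.apis.inference import *

# After: the dependency is explicit and unused names are easy to spot.
from llama_stack.apis.inference import ChatCompletionRequest

print(ChatCompletionRequest.__name__)
```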

## Test Plan

### Providers Tests

**agents**
```
pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8
```

**inference**
```bash
# meta-reference
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

# together
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py 
```

**safety**
```
pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B
```

**memory**
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384
```

**scoring**
```
pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
```


**datasetio**
```
pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py
pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py
```


**eval**
```
pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py
```

### Client-SDK Tests
```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk
```

### llama-stack-apps
```
PORT=5000
LOCALHOST=localhost

python -m examples.agents.hello $LOCALHOST $PORT
python -m examples.agents.inflation $LOCALHOST $PORT
python -m examples.agents.podcast_transcript $LOCALHOST $PORT
python -m examples.agents.rag_as_attachments $LOCALHOST $PORT
python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT
python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT
python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT

# Vision model
python -m examples.interior_design_assistant.app
python -m examples.agent_store.app $LOCALHOST $PORT
```

### CLI
```
which llama
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
llama model list
llama stack list-apis
llama stack list-providers inference

llama stack build --template ollama --image-type conda
```

### Distributions Tests
**ollama**
```
llama stack build --template ollama --image-type conda
ollama run llama3.2:1b-instruct-fp16
llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
```

**fireworks**
```
llama stack build --template fireworks --image-type conda
llama stack run ./llama_stack/templates/fireworks/run.yaml
```

**together**
```
llama stack build --template together --image-type conda
llama stack run ./llama_stack/templates/together/run.yaml
```

**tgi**
```
llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

2024-12-27 15:45:44 -08:00
Botao Chen
c294a01c4b
[2/n][torchtune integration] implement job management and return training artifacts (#593)
### Context 
In this PR, we:
- Implement the post training job management and training artifacts APIs
(see the sketch below):
  - get_training_jobs
  - get_training_job_status
  - get_training_job_artifacts
- Delete get_training_job_logstream, since the trace can be accessed
directly in the Jaeger UI:
https://llama-stack.readthedocs.io/en/latest/building_applications/telemetry.html#jaeger-to-visualize-traces
- Refactor the post training and training type definitions to make them
more intuitive.
- Rewrite the checkpointer so it is compatible with the llama-stack file
system layout and its checkpoints can be recognized during inference.
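
For illustration, a minimal sketch of polling the new job APIs over plain HTTP (the base URL and job_uuid are placeholders; the routes match the server log excerpted under the post_training mypy commit (#1548) earlier in this list):

```python
# Sketch only: poll job status / artifacts over HTTP with placeholder values.
import requests

BASE_URL = "http://localhost:8321"  # placeholder llama-stack server address
JOB_UUID = "test-job-1234"          # placeholder job id

status = requests.get(
    f"{BASE_URL}/v1/post-training/job/status", params={"job_uuid": JOB_UUID}
).json()
artifacts = requests.get(
    f"{BASE_URL}/v1/post-training/job/artifacts", params={"job_uuid": JOB_UUID}
).json()

print(status)     # e.g. scheduled / in_progress / completed
print(artifacts)  # e.g. checkpoint paths produced by the job
```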


### Test
Unit test
`pytest llama_stack/providers/tests/post_training/test_post_training.py
-m "torchtune_post_training_huggingface_datasetio" -v -s --tb=short
--disable-warnings`

<img width="1506" alt="Screenshot 2024-12-10 at 4 06 17 PM"
src="https://github.com/user-attachments/assets/16225029-bdb7-48c4-9d13-e580cc769c0a">


e2e test with client side call

<img width="888" alt="Screenshot 2024-12-10 at 4 09 44 PM"
src="https://github.com/user-attachments/assets/de375e4c-ef67-4dcc-a045-4037d9489191">
2024-12-13 15:00:04 -08:00