llama-stack-mirror/llama_stack/templates/experimental-post-training/run.yaml
Botao Chen f450a0fd32
Change post training run.yaml inference config (#710)
## Context
Colab notebooks provide limited free T4 GPU time.

Making the post-training template work end-to-end on a Colab T4 notebook is
critical for early adoption of the stack's post-training APIs. However, we
found that the existing LlamaModelParallelGenerator
(https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/inference/meta_reference/inference.py#L82)
in the meta-reference inference implementation isn't compatible with a T4
machine.

In this PR, we disable `create_distributed_process_group` for the inference
API in the post-training run.yaml config and set the distributed environment
variables in the notebook instead,
<img width="493" alt="Screenshot 2025-01-02 at 3 48 08 PM"
src="https://github.com/user-attachments/assets/dd159f70-4cff-475c-b459-1fc6e2c720ba"
/>
so that meta-reference inference works on the free T4 machine.
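
A minimal sketch of that notebook env setup, assuming the usual
torch.distributed variables for a single-process, single-GPU run (the exact
names and values in the screenshot are not reproduced here):

```python
import os

# Assumed torch.distributed env vars for a single-process, single-GPU run;
# with create_distributed_process_group disabled, the provider relies on
# these being set instead of spawning its own process group.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"  # any free port
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "0"
```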

## Test
Tested with the WIP post-training showcase Colab notebook:
https://colab.research.google.com/drive/1K4Q2wZq232_Bpy2ud4zL9aRxvCWAwyQs?usp=sharing
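
For context, a rough sketch of how a notebook might then bring the stack up
in-process with this template; the library-client usage below is an
assumption based on llama-stack's `LlamaStackAsLibraryClient`, not code
copied from the notebook:

```python
# Hypothetical in-process setup; assumes the env vars above are already set.
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

# Build the stack from the experimental-post-training template's run.yaml.
client = LlamaStackAsLibraryClient("experimental-post-training")
client.initialize()
```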
2025-01-03 08:37:48 -08:00


version: '2'
image_name: experimental-post-training
docker_image: null
conda_env: experimental-post-training
apis:
- agents
- datasetio
- eval
- inference
- memory
- safety
- scoring
- telemetry
- post_training
providers:
  inference:
  - provider_id: meta-reference-inference
    provider_type: inline::meta-reference
    config:
      max_seq_len: 4096
      checkpoint_dir: null
      create_distributed_process_group: False
  eval:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config: {}
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  datasetio:
  - provider_id: huggingface-0
    provider_type: remote::huggingface
    config: {}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config: {}
  post_training:
  - provider_id: torchtune-post-training
    provider_type: inline::torchtune
    config: {}
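  # db_path entries below use ${env.VAR:default} substitution: set
  # SQLITE_STORE_DIR to relocate the sqlite stores, otherwise the
  # ~/.llama default path is used.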
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-gpu}/agents_store.db
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  memory:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-gpu}/faiss_store.db
metadata_store:
  namespace: null
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-gpu}/registry.db
models: []
shields: []
memory_banks: []
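# Register the tatsu-lab/alpaca dataset from Hugging Face for post-training;
# dataset_schema mirrors the alpaca columns (instruction/input/output/text).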
datasets:
- dataset_id: alpaca
  provider_id: huggingface-0
  url:
    uri: https://huggingface.co/datasets/tatsu-lab/alpaca
  metadata:
    path: tatsu-lab/alpaca
    name:
    split: train
  dataset_schema:
    instruction:
      type: string
    input:
      type: string
    output:
      type: string
    text:
      type: string
scoring_fns: []
eval_tasks: []