docs: provider and distro codegen migration (#3531)

# What does this PR do?    - Updates provider and distro codegen to handle the new format - Migrates provider and distro files to the new format ## Test Plan - Manual testing
2025-10-11 05:38:38 +00:00 · 2025-09-24 14:01:29 -07:00 · 2025-09-24 14:01:29 -07:00 · d23865757f
commit d23865757f
parent 45da31801c
103 changed files with 1796 additions and 423 deletions
--- a/docs/docs/providers/post_training/index.mdx
+++ b/docs/docs/providers/post_training/index.mdx
@ -0,0 +1,17 @@
+---
+sidebar_label: Post Training
+title: Post_Training
+---
+
+# Post_Training
+
+## Overview
+
+This section contains documentation for all available providers for the **post_training** API.
+
+## Providers
+
+- [Huggingface-Gpu](./inline_huggingface-gpu)
+- [Torchtune-Cpu](./inline_torchtune-cpu)
+- [Torchtune-Gpu](./inline_torchtune-gpu)
+- [Remote - Nvidia](./remote_nvidia)
--- a/docs/docs/providers/post_training/inline_huggingface-cpu.mdx
+++ b/docs/docs/providers/post_training/inline_huggingface-cpu.mdx
@ -0,0 +1,40 @@
+# inline::huggingface-cpu
+
+## Description
+
+HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `device` | `<class 'str'>` | No | cuda |  |
+| `distributed_backend` | `Literal['fsdp', 'deepspeed'` | No |  |  |
+| `checkpoint_format` | `Literal['full_state', 'huggingface'` | No | huggingface |  |
+| `chat_template` | `<class 'str'>` | No | <|user|>
+{input}
+<|assistant|>
+{output} |  |
+| `model_specific_config` | `<class 'dict'>` | No | {'trust_remote_code': True, 'attn_implementation': 'sdpa'} |  |
+| `max_seq_length` | `<class 'int'>` | No | 2048 |  |
+| `gradient_checkpointing` | `<class 'bool'>` | No | False |  |
+| `save_total_limit` | `<class 'int'>` | No | 3 |  |
+| `logging_steps` | `<class 'int'>` | No | 10 |  |
+| `warmup_ratio` | `<class 'float'>` | No | 0.1 |  |
+| `weight_decay` | `<class 'float'>` | No | 0.01 |  |
+| `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
+| `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: huggingface
+distributed_backend: null
+device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output
+
+```
--- a/docs/docs/providers/post_training/inline_huggingface-gpu.mdx
+++ b/docs/docs/providers/post_training/inline_huggingface-gpu.mdx
@ -0,0 +1,42 @@
+---
+description: "HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem."
+sidebar_label: Huggingface-Gpu
+title: inline::huggingface-gpu
+---
+
+# inline::huggingface-gpu
+
+## Description
+
+HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `device` | `<class 'str'>` | No | cuda |  |
+| `distributed_backend` | `Literal['fsdp', 'deepspeed'` | No |  |  |
+| `checkpoint_format` | `Literal['full_state', 'huggingface'` | No | huggingface |  |
+| `chat_template` | `<class 'str'>` | No | &lt;|user|&gt;&lt;br/&gt;&#123;input&#125;&lt;br/&gt;&lt;|assistant|&gt;&lt;br/&gt;&#123;output&#125; |  |
+| `model_specific_config` | `<class 'dict'>` | No | &#123;'trust_remote_code': True, 'attn_implementation': 'sdpa'&#125; |  |
+| `max_seq_length` | `<class 'int'>` | No | 2048 |  |
+| `gradient_checkpointing` | `<class 'bool'>` | No | False |  |
+| `save_total_limit` | `<class 'int'>` | No | 3 |  |
+| `logging_steps` | `<class 'int'>` | No | 10 |  |
+| `warmup_ratio` | `<class 'float'>` | No | 0.1 |  |
+| `weight_decay` | `<class 'float'>` | No | 0.01 |  |
+| `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
+| `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: huggingface
+distributed_backend: null
+device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output
+```
--- a/docs/docs/providers/post_training/inline_huggingface.mdx
+++ b/docs/docs/providers/post_training/inline_huggingface.mdx
@ -0,0 +1,40 @@
+# inline::huggingface
+
+## Description
+
+HuggingFace-based post-training provider for fine-tuning models using the HuggingFace ecosystem.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `device` | `<class 'str'>` | No | cuda |  |
+| `distributed_backend` | `Literal['fsdp', 'deepspeed'` | No |  |  |
+| `checkpoint_format` | `Literal['full_state', 'huggingface'` | No | huggingface |  |
+| `chat_template` | `<class 'str'>` | No | <|user|>
+{input}
+<|assistant|>
+{output} |  |
+| `model_specific_config` | `<class 'dict'>` | No | {'trust_remote_code': True, 'attn_implementation': 'sdpa'} |  |
+| `max_seq_length` | `<class 'int'>` | No | 2048 |  |
+| `gradient_checkpointing` | `<class 'bool'>` | No | False |  |
+| `save_total_limit` | `<class 'int'>` | No | 3 |  |
+| `logging_steps` | `<class 'int'>` | No | 10 |  |
+| `warmup_ratio` | `<class 'float'>` | No | 0.1 |  |
+| `weight_decay` | `<class 'float'>` | No | 0.01 |  |
+| `dataloader_num_workers` | `<class 'int'>` | No | 4 |  |
+| `dataloader_pin_memory` | `<class 'bool'>` | No | True |  |
+| `dpo_beta` | `<class 'float'>` | No | 0.1 |  |
+| `use_reference_model` | `<class 'bool'>` | No | True |  |
+| `dpo_loss_type` | `Literal['sigmoid', 'hinge', 'ipo', 'kto_pair'` | No | sigmoid |  |
+| `dpo_output_dir` | `<class 'str'>` | No |  |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: huggingface
+distributed_backend: null
+device: cpu
+dpo_output_dir: ~/.llama/dummy/dpo_output
+
+```
--- a/docs/docs/providers/post_training/inline_torchtune-cpu.mdx
+++ b/docs/docs/providers/post_training/inline_torchtune-cpu.mdx
@ -0,0 +1,24 @@
+---
+description: "TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework."
+sidebar_label: Torchtune-Cpu
+title: inline::torchtune-cpu
+---
+
+# inline::torchtune-cpu
+
+## Description
+
+TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `torch_seed` | `int \| None` | No |  |  |
+| `checkpoint_format` | `Literal['meta', 'huggingface'` | No | meta |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: meta
+```
--- a/docs/docs/providers/post_training/inline_torchtune-gpu.mdx
+++ b/docs/docs/providers/post_training/inline_torchtune-gpu.mdx
@ -0,0 +1,24 @@
+---
+description: "TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework."
+sidebar_label: Torchtune-Gpu
+title: inline::torchtune-gpu
+---
+
+# inline::torchtune-gpu
+
+## Description
+
+TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `torch_seed` | `int \| None` | No |  |  |
+| `checkpoint_format` | `Literal['meta', 'huggingface'` | No | meta |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: meta
+```
--- a/docs/docs/providers/post_training/inline_torchtune.md
+++ b/docs/docs/providers/post_training/inline_torchtune.md
@ -0,0 +1,20 @@
+# inline::torchtune
+
+## Description
+
+TorchTune-based post-training provider for fine-tuning and optimizing models using Meta's TorchTune framework.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `torch_seed` | `int \| None` | No |  |  |
+| `checkpoint_format` | `Literal['meta', 'huggingface'` | No | meta |  |
+
+## Sample Configuration
+
+```yaml
+checkpoint_format: meta
+
+```
+
--- a/docs/docs/providers/post_training/remote_nvidia.mdx
+++ b/docs/docs/providers/post_training/remote_nvidia.mdx
@ -0,0 +1,32 @@
+---
+description: "NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform."
+sidebar_label: Remote - Nvidia
+title: remote::nvidia
+---
+
+# remote::nvidia
+
+## Description
+
+NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform.
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No |  | The NVIDIA API key. |
+| `dataset_namespace` | `str \| None` | No | default | The NVIDIA dataset namespace. |
+| `project_id` | `str \| None` | No | test-example-model@v1 | The NVIDIA project ID. |
+| `customizer_url` | `str \| None` | No |  | Base URL for the NeMo Customizer API |
+| `timeout` | `<class 'int'>` | No | 300 | Timeout for the NVIDIA Post Training API |
+| `max_retries` | `<class 'int'>` | No | 3 | Maximum number of retries for the NVIDIA Post Training API |
+| `output_model_dir` | `<class 'str'>` | No | test-example-model@v1 | Directory to save the output model |
+
+## Sample Configuration
+
+```yaml
+api_key: ${env.NVIDIA_API_KEY:=}
+dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
+project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
+customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
+```