fix: update dangling references to llama download command (#3763)

## Summary

After the model management CLI was removed in #3700, this PR updates the remaining references to the old `llama download` command to point at current tooling, mainly `huggingface-cli download`.

## Changes

- Updated error messages in `meta_reference/common.py` to recommend `llama-model download --source meta` (from https://github.com/meta-llama/llama-models)
- Updated error messages in `torchtune/recipes/lora_finetuning_single_device.py` to use `huggingface-cli download`
- Updated post-training notebook to use `huggingface-cli download` instead of `llama download`
- Fixed typo: "you model" -> "your model"

## Test Plan

- Verified error messages provide correct guidance for users
- Checked that notebook instructions are up-to-date with current tooling
2025-12-03 09:53:45 +00:00 · 2025-10-09 18:35:02 -07:00 · ebae0385bb
commit ebae0385bb
parent 8fe4a216b5
3 changed files with 6369 additions and 6410 deletions
--- a/docs/notebooks/Alpha_Llama_Stack_Post_Training.ipynb
+++ b/docs/notebooks/Alpha_Llama_Stack_Post_Training.ipynb
@@ -4236,24 +4236,7 @@
   "metadata": {
    "id": "RWa220T5sjbR"
   },
-      "source": [
-        "# 2. Start Post Training\n",
-        "Currenty, Llama stack post training APIs support [Supervised Fine-tune](https://cameronrwolfe.substack.com/p/understanding-and-using-supervised) which is a straightfoard and effective way to boost model performance on specific tasks.\n",
-        "\n",
-        "We start from [LoRA finetune algorithm](https://pytorch.org/torchtune/main/tutorials/lora_finetune.html#what-is-lora) that can significantly reduce finetune GPU memory usage as well as needs less data\n",
-        "\n",
-        "\n",
-        "#### 2.0. Download the base model\n",
-        "Download the Llama model that will be used with [the downloading model CLI](https://llama-stack.readthedocs.io/en/latest/references/llama_cli_reference/download_models.html).\n",
-        "\n",
-        "Since ollama takes huggingface safetensor format checkpoint, we need to output the finetuned checkpoint in hugging face format. We download the model checkpoint from huggingface source.\n",
-        "\n",
-        "> You need to get a huggingface token from [here](https://huggingface.co/) and replace the \"HF_TOKEN\"\n",
-        "\n",
-        "\n",
-        "\n",
-        "\n"
-      ]
+   "source": "# 2. Start Post Training\nCurrently, Llama stack post training APIs support [Supervised Fine-tune](https://cameronrwolfe.substack.com/p/understanding-and-using-supervised) which is a straightforward and effective way to boost model performance on specific tasks.\n\nWe start from [LoRA finetune algorithm](https://pytorch.org/torchtune/main/tutorials/lora_finetune.html#what-is-lora) that can significantly reduce finetune GPU memory usage as well as needs less data\n\n\n#### 2.0. Download the base model\nDownload the Llama model using the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/guides/cli).\n\nSince ollama takes huggingface safetensor format checkpoint, we need to output the finetuned checkpoint in hugging face format. We download the model checkpoint from huggingface source.\n\n> You need to authenticate with Hugging Face by getting your token from [here](https://huggingface.co/settings/tokens) and running `huggingface-cli login`"
  },
  {
   "cell_type": "code",
@@ -4266,33 +4249,8 @@
    "id": "yF50MtwcsogU",
    "outputId": "92ba3b3a-63a0-4ab8-c8cd-5437365128fc"
   },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            ".gitattributes: 100% 1.52k/1.52k [00:00<00:00, 12.1MB/s]\n",
-            "LICENSE.txt: 100% 7.71k/7.71k [00:00<00:00, 33.3MB/s]\n",
-            "README.md: 100% 41.7k/41.7k [00:00<00:00, 56.9MB/s]\n",
-            "USE_POLICY.md: 100% 6.02k/6.02k [00:00<00:00, 32.4MB/s]\n",
-            "config.json: 100% 878/878 [00:00<00:00, 6.94MB/s]\n",
-            "generation_config.json: 100% 189/189 [00:00<00:00, 1.71MB/s]\n",
-            "model.safetensors.index.json: 100% 20.9k/20.9k [00:00<00:00, 87.0MB/s]\n",
-            "consolidated.00.pth: 100% 6.43G/6.43G [00:18<00:00, 353MB/s]\n",
-            "original%2Forig_params.json: 100% 220/220 [00:00<00:00, 1.69MB/s]\n",
-            "original%2Fparams.json: 100% 220/220 [00:00<00:00, 1.64MB/s]\n",
-            "tokenizer.model: 100% 2.18M/2.18M [00:00<00:00, 44.8MB/s]\n",
-            "special_tokens_map.json: 100% 296/296 [00:00<00:00, 2.69MB/s]\n",
-            "tokenizer.json: 100% 9.09M/9.09M [00:01<00:00, 8.57MB/s]\n",
-            "tokenizer_config.json: 100% 54.5k/54.5k [00:00<00:00, 172MB/s]\n",
-            "\n",
-            "Successfully downloaded model to /root/.llama/checkpoints/Llama3.2-3B-Instruct\n"
-          ]
-        }
-      ],
-      "source": [
-        "!llama download --source huggingface --model-id Llama3.2-3B-Instruct --hf-token \"HF_TOKEN\""
-      ]
+   "outputs": [],
+   "source": "!huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir ~/.llama/Llama-3.2-3B-Instruct"
  },
  {
   "cell_type": "markdown",
--- a/llama_stack/providers/inline/inference/meta_reference/common.py
+++ b/llama_stack/providers/inline/inference/meta_reference/common.py
@@ -18,7 +18,7 @@ def model_checkpoint_dir(model_id) -> str:

    assert checkpoint_dir.exists(), (
        f"Could not find checkpoints in: {model_local_dir(model_id)}. "
-        f"If you try to use the native llama model, Please download model using `llama download --model-id {model_id}`"
-        f"Otherwise, please save you model checkpoint under {model_local_dir(model_id)}"
+        f"If you try to use the native llama model, please download the model using `llama-model download --source meta --model-id {model_id}` (see https://github.com/meta-llama/llama-models). "
+        f"Otherwise, please save your model checkpoint under {model_local_dir(model_id)}"
    )
    return str(checkpoint_dir)
--- a/llama_stack/providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py
+++ b/llama_stack/providers/inline/post_training/torchtune/recipes/lora_finetuning_single_device.py
@@ -104,9 +104,10 @@ class LoraFinetuningSingleDevice:
            if not any(p.exists() for p in paths):
                checkpoint_dir = checkpoint_dir / "original"

+            hf_repo = model.huggingface_repo or f"meta-llama/{model.descriptor()}"
            assert checkpoint_dir.exists(), (
                f"Could not find checkpoints in: {model_local_dir(model.descriptor())}. "
-                f"Please download model using `llama download --model-id {model.descriptor()}`"
+                f"Please download the model using `huggingface-cli download {hf_repo} --local-dir ~/.llama/{model.descriptor()}`"
            )
            return str(checkpoint_dir)
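
The repo fallback and download hint added in the `lora_finetuning_single_device.py` hunk can be sketched in isolation as below. This is a minimal illustration, not the project's code: `ModelStub`, `download_hint`, and `require_checkpoint_dir` are hypothetical names, and the stub only mimics the two members the diff relies on (`huggingface_repo` and `descriptor()`). The sketch also raises `FileNotFoundError` where the diff uses `assert`, which is a deliberate substitution here.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ModelStub:
    """Hypothetical stand-in for the model object used by the recipe."""

    name: str
    huggingface_repo: Optional[str] = None

    def descriptor(self) -> str:
        # Assumption: the descriptor is just the model's short name.
        return self.name


def download_hint(model: ModelStub) -> str:
    """Build the suggested download command, mirroring the diff's fallback."""
    # Prefer the explicit repo; otherwise assume it lives under meta-llama/.
    hf_repo = model.huggingface_repo or f"meta-llama/{model.descriptor()}"
    return (
        f"huggingface-cli download {hf_repo} "
        f"--local-dir ~/.llama/{model.descriptor()}"
    )


def require_checkpoint_dir(checkpoint_dir: Path, model: ModelStub) -> str:
    """Return the directory as a string, or fail with an actionable message."""
    if not checkpoint_dir.exists():
        raise FileNotFoundError(
            f"Could not find checkpoints in: {checkpoint_dir}. "
            f"Please download the model using `{download_hint(model)}`"
        )
    return str(checkpoint_dir)


# With an explicit repo the hint uses it; without one it falls back.
print(download_hint(ModelStub("Llama3.2-3B-Instruct", "meta-llama/Llama-3.2-3B-Instruct")))
print(download_hint(ModelStub("Llama3.2-3B-Instruct")))
```

The fallback is plain string concatenation, so whether `meta-llama/<descriptor>` names a real Hugging Face repository depends on the descriptor matching the repo name. An explicit exception is used in the sketch because `assert` statements are skipped when Python runs with `-O`.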