forked from phoenix-oss/llama-stack-mirror
merge
commit a54d757ade
197 changed files with 9392 additions and 3089 deletions
		|  | @ -56,9 +56,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ./run.yaml:/root/my-run.yaml \ | ||||
|   llamastack/distribution-nvidia \ | ||||
|  | @ -72,7 +73,7 @@ docker run \ | |||
| ```bash | ||||
| llama stack build --template nvidia --image-type conda | ||||
| llama stack run ./run.yaml \ | ||||
|   --port 5001 \ | ||||
|   --port 8321 \ | ||||
|   --env NVIDIA_API_KEY=$NVIDIA_API_KEY \ | ||||
|   --env INFERENCE_MODEL=$INFERENCE_MODEL | ||||
| ``` | ||||
|  |  | |||
|  | @ -26,7 +26,7 @@ The `llamastack/distribution-bedrock` distribution consists of the following pro | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| 
 | ||||
| ### Models | ||||
| 
 | ||||
|  | @ -51,9 +51,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   llamastack/distribution-bedrock \ | ||||
|   --port $LLAMA_STACK_PORT \ | ||||
|  |  | |||
|  | @ -18,7 +18,7 @@ The `llamastack/distribution-cerebras` distribution consists of the following pr | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `CEREBRAS_API_KEY`: Cerebras API Key (default: ``) | ||||
| 
 | ||||
| ### Models | ||||
|  | @ -43,9 +43,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ./run.yaml:/root/my-run.yaml \ | ||||
|   llamastack/distribution-cerebras \ | ||||
|  | @ -59,6 +60,6 @@ docker run \ | |||
| ```bash | ||||
| llama stack build --template cerebras --image-type conda | ||||
| llama stack run ./run.yaml \ | ||||
|   --port 5001 \ | ||||
|   --port 8321 \ | ||||
|   --env CEREBRAS_API_KEY=$CEREBRAS_API_KEY | ||||
| ``` | ||||
|  |  | |||
|  | @ -53,7 +53,7 @@ docker compose down | |||
| 
 | ||||
| #### Start Dell-TGI server locally | ||||
| ``` | ||||
| docker run -it --shm-size 1g -p 80:80 --gpus 4 \ | ||||
| docker run -it --pull always --shm-size 1g -p 80:80 --gpus 4 \ | ||||
| -e NUM_SHARD=4 \ | ||||
| -e MAX_BATCH_PREFILL_TOKENS=32768 \ | ||||
| -e MAX_INPUT_TOKENS=8000 \ | ||||
|  | @ -65,7 +65,7 @@ registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1 | |||
| #### Start Llama Stack server pointing to TGI server | ||||
| 
 | ||||
| ``` | ||||
| docker run --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml | ||||
| docker run --pull always --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml | ||||
| ``` | ||||
| 
 | ||||
| Make sure that in your `run.yaml` file, your inference provider is pointing to the correct TGI server endpoint. E.g. | ||||
|  |  | |||
|  | @ -55,6 +55,7 @@ export CUDA_VISIBLE_DEVICES=0 | |||
| export LLAMA_STACK_PORT=8321 | ||||
| 
 | ||||
| docker run --rm -it \ | ||||
|   --pull always \ | ||||
|   --network host \ | ||||
|   -v $HOME/.cache/huggingface:/data \ | ||||
|   -e HF_TOKEN=$HF_TOKEN \ | ||||
|  | @ -78,6 +79,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B | |||
| export CUDA_VISIBLE_DEVICES=1 | ||||
| 
 | ||||
| docker run --rm -it \ | ||||
|   --pull always \ | ||||
|   --network host \ | ||||
|   -v $HOME/.cache/huggingface:/data \ | ||||
|   -e HF_TOKEN=$HF_TOKEN \ | ||||
|  | @ -120,6 +122,7 @@ This method allows you to get started quickly without having to build the distri | |||
| 
 | ||||
| ```bash | ||||
| docker run -it \ | ||||
|   --pull always \ | ||||
|   --network host \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v $HOME/.llama:/root/.llama \ | ||||
|  | @ -147,6 +150,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B | |||
| 
 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v $HOME/.llama:/root/.llama \ | ||||
|   -v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \ | ||||
|  |  | |||
|  | @ -28,7 +28,7 @@ The `llamastack/distribution-fireworks` distribution consists of the following p | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `FIREWORKS_API_KEY`: Fireworks.AI API Key (default: ``) | ||||
| 
 | ||||
| ### Models | ||||
|  | @ -61,9 +61,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   llamastack/distribution-fireworks \ | ||||
|   --port $LLAMA_STACK_PORT \ | ||||
|  |  | |||
|  | @ -28,7 +28,7 @@ The `llamastack/distribution-groq` distribution consists of the following provid | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `GROQ_API_KEY`: Groq API Key (default: ``) | ||||
| 
 | ||||
| ### Models | ||||
|  | @ -56,9 +56,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   llamastack/distribution-groq \ | ||||
|   --port $LLAMA_STACK_PORT \ | ||||
|  |  | |||
|  | @ -30,7 +30,7 @@ Note that you need access to nvidia GPUs to run this distribution. This distribu | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`) | ||||
| - `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`) | ||||
| - `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`) | ||||
|  | @ -75,9 +75,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   llamastack/distribution-meta-reference-gpu \ | ||||
|  | @ -90,6 +91,7 @@ If you are using Llama Stack Safety / Shield APIs, use: | |||
| ```bash | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   llamastack/distribution-meta-reference-gpu \ | ||||
|  | @ -105,7 +107,7 @@ Make sure you have done `uv pip install llama-stack` and have the Llama Stack CL | |||
| ```bash | ||||
| llama stack build --template meta-reference-gpu --image-type conda | ||||
| llama stack run distributions/meta-reference-gpu/run.yaml \ | ||||
|   --port 5001 \ | ||||
|   --port 8321 \ | ||||
|   --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct | ||||
| ``` | ||||
| 
 | ||||
|  | @ -113,7 +115,7 @@ If you are using Llama Stack Safety / Shield APIs, use: | |||
| 
 | ||||
| ```bash | ||||
| llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \ | ||||
|   --port 5001 \ | ||||
|   --port 8321 \ | ||||
|   --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \ | ||||
|   --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B | ||||
| ``` | ||||
|  |  | |||
|  | @ -32,7 +32,7 @@ Note that you need access to nvidia GPUs to run this distribution. This distribu | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`) | ||||
| - `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`) | ||||
| 
 | ||||
|  | @ -75,9 +75,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   llamastack/distribution-meta-reference-quantized-gpu \ | ||||
|  | @ -90,6 +91,7 @@ If you are using Llama Stack Safety / Shield APIs, use: | |||
| ```bash | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   llamastack/distribution-meta-reference-quantized-gpu \ | ||||
|  |  | |||
|  | @ -15,7 +15,7 @@ The `llamastack/distribution-nvidia` distribution consists of the following prov | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `NVIDIA_API_KEY`: NVIDIA API Key (default: ``) | ||||
| 
 | ||||
| ### Models | ||||
|  | @ -39,9 +39,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ./run.yaml:/root/my-run.yaml \ | ||||
|   llamastack/distribution-nvidia \ | ||||
|  | @ -55,6 +56,6 @@ docker run \ | |||
| ```bash | ||||
| llama stack build --template nvidia --image-type conda | ||||
| llama stack run ./run.yaml \ | ||||
|   --port 5001 \ | ||||
|   --port 8321 \ | ||||
|   --env NVIDIA_API_KEY=$NVIDIA_API_KEY | ||||
| ``` | ||||
|  |  | |||
|  | @ -30,7 +30,7 @@ You should use this distribution if you have a regular desktop machine without v | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`) | ||||
| - `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`) | ||||
| - `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`) | ||||
|  | @ -69,9 +69,10 @@ Now you are ready to run Llama Stack with Ollama as the inference provider. You | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| export LLAMA_STACK_PORT=5001 | ||||
| export LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   llamastack/distribution-ollama \ | ||||
|  | @ -89,6 +90,7 @@ cd /path/to/llama-stack | |||
| 
 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   -v ./llama_stack/templates/ollama/run-with-safety.yaml:/root/my-run.yaml \ | ||||
|  | @ -105,7 +107,7 @@ docker run \ | |||
| Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available. | ||||
| 
 | ||||
| ```bash | ||||
| export LLAMA_STACK_PORT=5001 | ||||
| export LLAMA_STACK_PORT=8321 | ||||
| 
 | ||||
| llama stack build --template ollama --image-type conda | ||||
| llama stack run ./run.yaml \ | ||||
|  |  | |||
|  | @ -28,7 +28,7 @@ The `llamastack/distribution-passthrough` distribution consists of the following | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `PASSTHROUGH_API_KEY`: Passthrough API Key (default: ``) | ||||
| - `PASSTHROUGH_URL`: Passthrough URL (default: ``) | ||||
| 
 | ||||
|  |  | |||
|  | @ -29,7 +29,7 @@ You can use this distribution if you have GPUs and want to run an independent vL | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`) | ||||
| - `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`) | ||||
| - `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`) | ||||
|  | @ -47,6 +47,7 @@ export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct | |||
| export CUDA_VISIBLE_DEVICES=0 | ||||
| 
 | ||||
| docker run \ | ||||
|     --pull always \ | ||||
|     --runtime nvidia \ | ||||
|     --gpus $CUDA_VISIBLE_DEVICES \ | ||||
|     -v ~/.cache/huggingface:/root/.cache/huggingface \ | ||||
|  | @ -59,6 +60,8 @@ docker run \ | |||
|     --port $INFERENCE_PORT | ||||
| ``` | ||||
| 
 | ||||
| Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html). | ||||
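
As a rough illustration of the note above, the two flags could be appended to the vLLM server arguments as sketched below. The parser name `llama3_json` is an assumption about which parser fits the model (the linked vLLM page lists the supported parsers), and the `vllm_args` array is a hypothetical helper for this sketch, not part of the documented command. The script only prints the arguments, so it runs without Docker or a GPU.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Defaults taken from the surrounding docs; override via the environment.
INFERENCE_MODEL="${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}"
INFERENCE_PORT="${INFERENCE_PORT:-8000}"

# Arguments that would follow the vLLM image name in `docker run`.
# Assumption: `llama3_json` is the appropriate tool-call parser for this model.
vllm_args=(
  --model "$INFERENCE_MODEL"
  --port "$INFERENCE_PORT"
  --enable-auto-tool-choice
  --tool-call-parser llama3_json
)

# Print the arguments on one line instead of launching the server.
printf '%s\n' "${vllm_args[*]}"
```

Some models may also need a matching chat template; the vLLM tool-calling documentation is the place to confirm that.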
| 
 | ||||
| If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like: | ||||
| 
 | ||||
| ```bash | ||||
|  | @ -67,6 +70,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B | |||
| export CUDA_VISIBLE_DEVICES=1 | ||||
| 
 | ||||
| docker run \ | ||||
|     --pull always \ | ||||
|     --runtime nvidia \ | ||||
|     --gpus $CUDA_VISIBLE_DEVICES \ | ||||
|     -v ~/.cache/huggingface:/root/.cache/huggingface \ | ||||
|  | @ -90,10 +94,11 @@ This method allows you to get started quickly without having to build the distri | |||
| ```bash | ||||
| export INFERENCE_PORT=8000 | ||||
| export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct | ||||
| export LLAMA_STACK_PORT=5001 | ||||
| export LLAMA_STACK_PORT=8321 | ||||
| 
 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ./run.yaml:/root/my-run.yaml \ | ||||
|   llamastack/distribution-remote-vllm \ | ||||
|  | @ -115,6 +120,7 @@ cd /path/to/llama-stack | |||
| 
 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   -v ./llama_stack/templates/remote-vllm/run-with-safety.yaml:/root/my-run.yaml \ | ||||
|  | @ -135,7 +141,7 @@ Make sure you have done `uv pip install llama-stack` and have the Llama Stack CL | |||
| ```bash | ||||
| export INFERENCE_PORT=8000 | ||||
| export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct | ||||
| export LLAMA_STACK_PORT=5001 | ||||
| export LLAMA_STACK_PORT=8321 | ||||
| 
 | ||||
| cd distributions/remote-vllm | ||||
| llama stack build --template remote-vllm --image-type conda | ||||
|  |  | |||
|  | @ -27,7 +27,7 @@ The `llamastack/distribution-sambanova` distribution consists of the following p | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `SAMBANOVA_API_KEY`: SambaNova.AI API Key (default: ``) | ||||
| 
 | ||||
| ### Models | ||||
|  | @ -59,9 +59,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   llamastack/distribution-sambanova \ | ||||
|   --port $LLAMA_STACK_PORT \ | ||||
|  |  | |||
|  | @ -31,7 +31,7 @@ You can use this distribution if you have GPUs and want to run an independent TG | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`) | ||||
| - `TGI_URL`: URL of the TGI server with the main inference model (default: `http://127.0.0.1:8080/v1`) | ||||
| - `TGI_SAFETY_URL`: URL of the TGI server with the safety model (default: `http://127.0.0.1:8081/v1`) | ||||
|  | @ -48,6 +48,7 @@ export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct | |||
| export CUDA_VISIBLE_DEVICES=0 | ||||
| 
 | ||||
| docker run --rm -it \ | ||||
|   --pull always \ | ||||
|   -v $HOME/.cache/huggingface:/data \ | ||||
|   -p $INFERENCE_PORT:$INFERENCE_PORT \ | ||||
|   --gpus $CUDA_VISIBLE_DEVICES \ | ||||
|  | @ -68,6 +69,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B | |||
| export CUDA_VISIBLE_DEVICES=1 | ||||
| 
 | ||||
| docker run --rm -it \ | ||||
|   --pull always \ | ||||
|   -v $HOME/.cache/huggingface:/data \ | ||||
|   -p $SAFETY_PORT:$SAFETY_PORT \ | ||||
|   --gpus $CUDA_VISIBLE_DEVICES \ | ||||
|  | @ -88,9 +90,10 @@ Now you are ready to run Llama Stack with TGI as the inference provider. You can | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   llamastack/distribution-tgi \ | ||||
|   --port $LLAMA_STACK_PORT \ | ||||
|  | @ -107,6 +110,7 @@ cd /path/to/llama-stack | |||
| 
 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   -v ~/.llama:/root/.llama \ | ||||
|   -v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \ | ||||
|  |  | |||
|  | @ -28,7 +28,7 @@ The `llamastack/distribution-together` distribution consists of the following pr | |||
| 
 | ||||
| The following environment variables can be configured: | ||||
| 
 | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`) | ||||
| - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`) | ||||
| - `TOGETHER_API_KEY`: Together.AI API Key (default: ``) | ||||
| 
 | ||||
| ### Models | ||||
|  | @ -62,9 +62,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image. | |||
| This method allows you to get started quickly without having to build the distribution code. | ||||
| 
 | ||||
| ```bash | ||||
| LLAMA_STACK_PORT=5001 | ||||
| LLAMA_STACK_PORT=8321 | ||||
| docker run \ | ||||
|   -it \ | ||||
|   --pull always \ | ||||
|   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ | ||||
|   llamastack/distribution-together \ | ||||
|   --port $LLAMA_STACK_PORT \ | ||||
|  |  | |||
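
The recurring change in this diff (default port `5001` to `8321`, plus `--pull always` on `docker run`) can be summarized in one small sketch. The function `build_run_cmd` is a hypothetical helper introduced only for illustration; it prints a command line rather than running it, and it omits the provider-specific `--env` flags (such as API keys) that the real commands in the docs pass afterwards.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fall back to the new documented default when the variable is unset.
LLAMA_STACK_PORT="${LLAMA_STACK_PORT:-8321}"

# Hypothetical helper (not part of the llama CLI): print the docker command
# for a given distribution name, e.g. "together" or "fireworks".
build_run_cmd() {
  local distro="$1"
  printf 'docker run -it --pull always -p %s:%s llamastack/distribution-%s --port %s\n' \
    "$LLAMA_STACK_PORT" "$LLAMA_STACK_PORT" "$distro" "$LLAMA_STACK_PORT"
}

build_run_cmd together
```

Using `${LLAMA_STACK_PORT:-8321}` keeps the host port mapping and the server's `--port` argument in sync while still letting an existing environment value take precedence.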