forked from phoenix-oss/llama-stack-mirror
		
	Restructure docs (#494)
Rendered docs at: https://llama-stack.readthedocs.io/en/doc-simplify/
parent 068ac00a3b
commit b3f9e8b2f2
20 changed files with 586 additions and 200 deletions
				
			
		
							
								
								
									
docs/source/distributions/self_hosted_distro/tgi.md (133 lines added, normal file)
									
								
							
							
						
						
									
@@ -0,0 +1,133 @@
# TGI Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-tgi` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| inference | `remote::tgi` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| telemetry | `inline::meta-reference` |


You can use this distribution if you have GPUs and want to run inference in a separate TGI server container.

### Environment Variables

The following environment variables can be configured:

- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `TGI_URL`: URL of the TGI server with the main inference model (default: `http://127.0.0.1:8080/v1`)
- `TGI_SAFETY_URL`: URL of the TGI server with the safety model (default: `http://127.0.0.1:8081/v1`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)


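As a minimal sketch, the defaults listed above can be captured in a small shell snippet so that later commands see consistent values. The `${VAR:-default}` form keeps any value that is already set. Whether the distribution reads these from the shell environment or only from `--env` flags is not stated here, so treat this as a convenience for your own scripts rather than required setup.

```bash
# Fall back to the documented defaults unless a value is already set.
export LLAMASTACK_PORT="${LLAMASTACK_PORT:-5001}"
export INFERENCE_MODEL="${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}"
export TGI_URL="${TGI_URL:-http://127.0.0.1:8080/v1}"
export TGI_SAFETY_URL="${TGI_SAFETY_URL:-http://127.0.0.1:8081/v1}"
export SAFETY_MODEL="${SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}"

# Print the effective configuration for a quick sanity check.
env | grep -E '^(LLAMASTACK_PORT|INFERENCE_MODEL|TGI_URL|TGI_SAFETY_URL|SAFETY_MODEL)='
```
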
## Setting up TGI server

See the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to set up a TGI endpoint. Here is a sample script to start a TGI server locally via Docker:

```bash
export INFERENCE_PORT=8080
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0

# "device=N" selects a GPU by index; a bare number would be read as a GPU count.
docker run --rm -it \
  -v $HOME/.cache/huggingface:/data \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT
```

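Loading the model can take a while, so it may help to wait until the server answers before starting Llama Stack. The following is a minimal sketch of a polling helper. The function name `wait_for_url` and its retry defaults are hypothetical, and it assumes `curl` is installed and that you point it at a TGI route that answers once the model is ready, such as `/health`.

```bash
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP error responses.
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_url "http://127.0.0.1:$INFERENCE_PORT/health" && echo "TGI is up"` blocks for at most about a minute with the defaults shown.
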
If you are using the Llama Stack Safety / Shield APIs, you also need to run a second TGI instance with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

# "device=N" selects a GPU by index; a bare number would be read as a GPU count.
docker run --rm -it \
  -v $HOME/.cache/huggingface:/data \
  -p $SAFETY_PORT:$SAFETY_PORT \
  --gpus "device=$CUDA_VISIBLE_DEVICES" \
  ghcr.io/huggingface/text-generation-inference:2.3.1 \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --model-id $SAFETY_MODEL \
  --port $SAFETY_PORT
```

## Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (which builds the distribution code) or via Docker (which uses a pre-built image).

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=5001
docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-tgi \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://host.docker.internal:$INFERENCE_PORT
```

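The `host.docker.internal` name is resolved automatically by Docker Desktop, but on a plain Linux Docker Engine it usually has to be mapped explicitly with `--add-host=host.docker.internal:host-gateway`. The sketch below shows one way to add that flag only when needed. The helper name `host_gateway_args` is hypothetical, and the OS check is a simplifying assumption, since the flag is harmless where the name already resolves.

```bash
# Hypothetical helper: print the extra `docker run` argument that maps
# host.docker.internal to the host on Linux; print nothing elsewhere.
host_gateway_args() {
  if [ "$(uname -s)" = "Linux" ]; then
    printf '%s\n' "--add-host=host.docker.internal:host-gateway"
  fi
}
```

The output can be spliced into the command above, for example `docker run $(host_gateway_args) -it ...`, leaving the remaining arguments unchanged.
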
If you are using the Llama Stack Safety / Shield APIs, use:

```bash
docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-tgi \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://host.docker.internal:$INFERENCE_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env TGI_SAFETY_URL=http://host.docker.internal:$SAFETY_PORT
```

### Via Conda

Make sure you have run `pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --template tgi --image-type conda
llama stack run ./run.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://127.0.0.1:$INFERENCE_PORT
```

If you are using the Llama Stack Safety / Shield APIs, use:

```bash
llama stack run ./run-with-safety.yaml \
  --port 5001 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env TGI_URL=http://127.0.0.1:$INFERENCE_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env TGI_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT
```
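
The run commands above rely on variables exported by the earlier TGI scripts, which are lost if you open a new shell. As a minimal sketch, a pre-flight check can catch that before the server starts. The helper name `require_env` is hypothetical, and it only sees exported variables because it relies on `printenv`.

```bash
# Hypothetical helper: report every named variable that is unset or empty
# in the environment, and return non-zero if any is missing.
require_env() {
  local missing=0
  local name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_env INFERENCE_MODEL INFERENCE_PORT && llama stack run ./run.yaml ...` only starts the server when both values are present.
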