forked from phoenix-oss/llama-stack-mirror
merge

commit a54d757ade
197 changed files with 9392 additions and 3089 deletions

@@ -14,7 +14,7 @@ Agents are configured using the `AgentConfig` class, which includes:
 - **Safety Shields**: Guardrails to ensure responsible AI behavior

 ```python
-from llama_stack_client.lib.agents.agent import Agent
+from llama_stack_client import Agent


 # Create the agent
@@ -44,14 +44,14 @@ Each interaction with an agent is called a "turn" and consists of:
 - **Output Message**: The agent's response

 ```python
-from llama_stack_client.lib.agents.event_logger import EventLogger
+from llama_stack_client import AgentEventLogger

 # Create a turn with streaming response
 turn_response = agent.create_turn(
     session_id=session_id,
     messages=[{"role": "user", "content": "Tell me about Llama models"}],
 )
-for log in EventLogger().log(turn_response):
+for log in AgentEventLogger().log(turn_response):
     log.print()
 ```
 ### Non-Streaming

@@ -67,9 +67,7 @@ sequenceDiagram
 Each step in this process can be monitored and controlled through configurations. Here's an example that demonstrates monitoring the agent's execution:

 ```python
-from llama_stack_client import LlamaStackClient
-from llama_stack_client.lib.agents.agent import Agent
-from llama_stack_client.lib.agents.event_logger import EventLogger
+from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
 from rich.pretty import pprint

 # Replace host and port
@@ -113,7 +111,7 @@ response = agent.create_turn(
 )

 # Monitor each step of execution
-for log in EventLogger().log(response):
+for log in AgentEventLogger().log(response):
     log.print()

 # Using non-streaming API, the response contains input, steps, and output.

@@ -23,9 +23,7 @@ In this example, we will show you how to:

 ##### Building a Search Agent
 ```python
-from llama_stack_client import LlamaStackClient
-from llama_stack_client.lib.agents.agent import Agent
-from llama_stack_client.lib.agents.event_logger import EventLogger
+from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

 client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

@@ -54,7 +52,7 @@ for prompt in user_prompts:
         session_id=session_id,
     )

-    for log in EventLogger().log(response):
+    for log in AgentEventLogger().log(response):
         log.print()
 ```


@@ -55,11 +55,11 @@ chunks_response = client.vector_io.query(
 A better way to ingest documents is to use the RAG Tool. This tool allows you to ingest documents from URLs, files, etc. and automatically chunks them into smaller pieces.

 ```python
-from llama_stack_client.types import Document
+from llama_stack_client import RAGDocument

 urls = ["memory_optimizations.rst", "chat.rst", "llama3.rst"]
 documents = [
-    Document(
+    RAGDocument(
         document_id=f"num-{i}",
         content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
         mime_type="text/plain",
@@ -86,7 +86,7 @@ results = client.tool_runtime.rag_tool.query(
 One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:

 ```python
-from llama_stack_client.lib.agents.agent import Agent
+from llama_stack_client import Agent

 # Create agent with memory
 agent = Agent(
@@ -140,9 +140,9 @@ response = agent.create_turn(

 You can print the response with below.
 ```python
-from llama_stack_client.lib.agents.event_logger import EventLogger
+from llama_stack_client import AgentEventLogger

-for log in EventLogger().log(response):
+for log in AgentEventLogger().log(response):
     log.print()
 ```


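The import changes in the hunks above follow one mechanical pattern: `Agent`, `AgentEventLogger`, and `RAGDocument` now come from the top-level `llama_stack_client` package, and `EventLogger` and `Document` are renamed along the way. The same rewrite could be applied to user code with a small helper. This is a minimal sketch only: `migrate_imports` and its two lookup tables are hypothetical names, not part of `llama_stack_client`, and the helper handles only the exact import lines that appear in this diff.

```python
import re

# Old import line -> new import line, copied from the hunks in this diff.
IMPORT_REWRITES = {
    "from llama_stack_client.lib.agents.agent import Agent":
        "from llama_stack_client import Agent",
    "from llama_stack_client.lib.agents.event_logger import EventLogger":
        "from llama_stack_client import AgentEventLogger",
    "from llama_stack_client.types import Document":
        "from llama_stack_client import RAGDocument",
}

# Names that were renamed together with their import path.
NAME_REWRITES = {
    "EventLogger": "AgentEventLogger",
    "Document": "RAGDocument",
}


def migrate_imports(source: str) -> str:
    """Rewrite old-style imports and renamed names, line by line."""
    migrated = []
    for line in source.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # keep the original line ending
        stripped = body.strip()
        if stripped in IMPORT_REWRITES:
            indent = body[: len(body) - len(body.lstrip())]
            body = indent + IMPORT_REWRITES[stripped]
        else:
            # Word boundaries keep already-migrated names untouched.
            for old, new in NAME_REWRITES.items():
                body = re.sub(rf"\b{old}\b", new, body)
        migrated.append(body + ending)
    return "".join(migrated)
```

Because the name rewrite uses word boundaries, running the helper twice leaves already-migrated code unchanged. It is deliberately naive: the renamed identifiers are also replaced inside strings and comments, so the result should be reviewed by hand.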
@@ -57,7 +57,7 @@ The `otel` sink works with any service compatible with the OpenTelemetry collect
 Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:

 ```bash
-$ docker run --rm --name jaeger \
+$ docker run --pull always --rm --name jaeger \
   -p 16686:16686 -p 4318:4318 \
   jaegertracing/jaeger:2.1.0
 ```

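The two ports in the hunk above map to two URLs that a client or browser would use. The sketch below derives them and offers a simple reachability probe. It assumes Jaeger is running on `localhost` and that traces go to the conventional OTLP/HTTP path `/v1/traces`; neither detail is stated in the diff, and the helper names are hypothetical.

```python
import socket

# Ports published by the `docker run` command in the hunk above.
JAEGER_PORTS = {"otlp_http": 4318, "ui": 16686}


def jaeger_endpoints(host: str = "localhost") -> dict:
    """Build the trace-ingest and UI URLs for a local Jaeger instance."""
    return {
        # `/v1/traces` is the conventional OTLP/HTTP traces path (assumption).
        "otlp_traces": f"http://{host}:{JAEGER_PORTS['otlp_http']}/v1/traces",
        "ui": f"http://{host}:{JAEGER_PORTS['ui']}",
    }


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open("localhost", 4318)` before starting the stack is one way to confirm that the container is accepting connections.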
@@ -110,10 +110,18 @@ MCP tools are special tools that can interact with llama stack over model contex

 Refer to [https://github.com/modelcontextprotocol/servers](https://github.com/modelcontextprotocol/servers) for available MCP servers.

+```shell
+# start your MCP server
+mkdir /tmp/content
+touch /tmp/content/foo
+touch /tmp/content/bar
+npx -y supergateway --port 8000 --stdio 'npx -y @modelcontextprotocol/server-filesystem /tmp/content'
+```
+
+Then register the MCP server as a tool group,
 ```python
 # Register MCP tools
 client.toolgroups.register(
-    toolgroup_id="builtin::filesystem",
+    toolgroup_id="mcp::filesystem",
     provider_id="model-context-protocol",
     mcp_endpoint=URL(uri="http://localhost:8000/sse"),
 )
@@ -181,7 +189,7 @@ group_tools = client.tools.list_tools(toolgroup_id="search_tools")
 ## Simple Example: Using an Agent with the Code-Interpreter Tool

 ```python
-from llama_stack_client.lib.agents.agent import Agent
+from llama_stack_client import Agent

 # Instantiate the AI agent with the given configuration
 agent = Agent(

@@ -55,7 +55,7 @@ llama stack run llama_stack/templates/open-benchmark/run.yaml
 There are 3 necessary inputs to run a benchmark eval
 - `list of benchmark_ids`: The list of benchmark ids to run evaluation on
 - `model-id`: The model id to evaluate on
-- `utput_dir`: Path to store the evaluate results
+- `output_dir`: Path to store the evaluate results
 ```
 llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
 --model_id <model id to evaluate on> \
@@ -69,7 +69,7 @@ llama-stack-client eval run-benchmark help
 to see the description of all the flags that eval run-benchmark has


-In the output log, you can find the file path that has your evaluation results. Open that file and you can see you aggrgate
+In the output log, you can find the file path that has your evaluation results. Open that file and you can see you aggregate
 evaluation results over there.



@@ -56,9 +56,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ./run.yaml:/root/my-run.yaml \
   llamastack/distribution-nvidia \
@@ -72,7 +73,7 @@ docker run \
 ```bash
 llama stack build --template nvidia --image-type conda
 llama stack run ./run.yaml \
-  --port 5001 \
+  --port 8321 \
   --env NVIDIA_API_KEY=$NVIDIA_API_KEY
   --env INFERENCE_MODEL=$INFERENCE_MODEL
 ```

@@ -26,7 +26,7 @@ The `llamastack/distribution-bedrock` distribution consists of the following pro

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)

 ### Models

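The hunk above changes the documented default port from `5001` to `8321`. Client code that builds its own base URL may want to resolve the port the same way the docs describe it: read the environment variable, then fall back to the new default. The sketch below is an illustration under stated assumptions, not the server's own logic. `resolve_stack_port` and `stack_base_url` are hypothetical helpers, and the fallback to `LLAMASTACK_PORT` is there only because some distribution pages elsewhere in this commit spell the variable that way.

```python
import os

DEFAULT_PORT = 8321  # new documented default; the old docs said 5001


def resolve_stack_port(env=None) -> int:
    """Return the configured Llama Stack port, or the documented default."""
    env = os.environ if env is None else env
    # Both spellings appear in the distribution docs touched by this commit.
    for name in ("LLAMA_STACK_PORT", "LLAMASTACK_PORT"):
        value = env.get(name)
        if value:
            return int(value)
    return DEFAULT_PORT


def stack_base_url(host: str = "localhost", env=None) -> str:
    """Build a base URL of the form used with LlamaStackClient(base_url=...)."""
    return f"http://{host}:{resolve_stack_port(env)}"
```

Passing a plain dictionary as `env` keeps the helper easy to test; in normal use it reads `os.environ`.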
@@ -51,9 +51,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   llamastack/distribution-bedrock \
   --port $LLAMA_STACK_PORT \

@@ -18,7 +18,7 @@ The `llamastack/distribution-cerebras` distribution consists of the following pr

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `CEREBRAS_API_KEY`: Cerebras API Key (default: ``)

 ### Models
@@ -43,9 +43,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ./run.yaml:/root/my-run.yaml \
   llamastack/distribution-cerebras \
@@ -59,6 +60,6 @@ docker run \
 ```bash
 llama stack build --template cerebras --image-type conda
 llama stack run ./run.yaml \
-  --port 5001 \
+  --port 8321 \
   --env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
 ```

@@ -53,7 +53,7 @@ docker compose down

 #### Start Dell-TGI server locally
 ```
-docker run -it --shm-size 1g -p 80:80 --gpus 4 \
+docker run -it --pull always --shm-size 1g -p 80:80 --gpus 4 \
 -e NUM_SHARD=4
 -e MAX_BATCH_PREFILL_TOKENS=32768 \
 -e MAX_INPUT_TOKENS=8000 \
@@ -65,7 +65,7 @@ registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1
 #### Start Llama Stack server pointing to TGI server

 ```
-docker run --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
+docker run --pull always --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
 ```

 Make sure in you `run.yaml` file, you inference provider is pointing to the correct TGI server endpoint. E.g.

@@ -55,6 +55,7 @@ export CUDA_VISIBLE_DEVICES=0
 export LLAMA_STACK_PORT=8321

 docker run --rm -it \
+  --pull always \
   --network host \
   -v $HOME/.cache/huggingface:/data \
   -e HF_TOKEN=$HF_TOKEN \
@@ -78,6 +79,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
 export CUDA_VISIBLE_DEVICES=1

 docker run --rm -it \
+  --pull always \
   --network host \
   -v $HOME/.cache/huggingface:/data \
   -e HF_TOKEN=$HF_TOKEN \
@@ -120,6 +122,7 @@ This method allows you to get started quickly without having to build the distri

 ```bash
 docker run -it \
+  --pull always \
   --network host \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v $HOME/.llama:/root/.llama \
@@ -147,6 +150,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v $HOME/.llama:/root/.llama \
   -v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \

@@ -28,7 +28,7 @@ The `llamastack/distribution-fireworks` distribution consists of the following p

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `FIREWORKS_API_KEY`: Fireworks.AI API Key (default: ``)

 ### Models
@@ -61,9 +61,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   llamastack/distribution-fireworks \
   --port $LLAMA_STACK_PORT \

@@ -28,7 +28,7 @@ The `llamastack/distribution-groq` distribution consists of the following provid

 The following environment variables can be configured:

-- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `GROQ_API_KEY`: Groq API Key (default: ``)

 ### Models
@@ -56,9 +56,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   llamastack/distribution-groq \
   --port $LLAMA_STACK_PORT \

@@ -30,7 +30,7 @@ Note that you need access to nvidia GPUs to run this distribution. This distribu

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
 - `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
 - `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
@@ -75,9 +75,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-meta-reference-gpu \
@@ -90,6 +91,7 @@ If you are using Llama Stack Safety / Shield APIs, use:
 ```bash
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-meta-reference-gpu \
@@ -105,7 +107,7 @@ Make sure you have done `uv pip install llama-stack` and have the Llama Stack CL
 ```bash
 llama stack build --template meta-reference-gpu --image-type conda
 llama stack run distributions/meta-reference-gpu/run.yaml \
-  --port 5001 \
+  --port 8321 \
   --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
 ```

@@ -113,7 +115,7 @@ If you are using Llama Stack Safety / Shield APIs, use:

 ```bash
 llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
-  --port 5001 \
+  --port 8321 \
   --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
   --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
 ```

@@ -32,7 +32,7 @@ Note that you need access to nvidia GPUs to run this distribution. This distribu

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
 - `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)

@@ -75,9 +75,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-meta-reference-quantized-gpu \
@@ -90,6 +91,7 @@ If you are using Llama Stack Safety / Shield APIs, use:
 ```bash
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-meta-reference-quantized-gpu \

@@ -15,7 +15,7 @@ The `llamastack/distribution-nvidia` distribution consists of the following prov

 The following environment variables can be configured:

-- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)

 ### Models
@@ -39,9 +39,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ./run.yaml:/root/my-run.yaml \
   llamastack/distribution-nvidia \
@@ -55,6 +56,6 @@ docker run \
 ```bash
 llama stack build --template nvidia --image-type conda
 llama stack run ./run.yaml \
-  --port 5001 \
+  --port 8321 \
   --env NVIDIA_API_KEY=$NVIDIA_API_KEY
 ```

@@ -30,7 +30,7 @@ You should use this distribution if you have a regular desktop machine without v

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`)
 - `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`)
 - `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`)
@@ -69,9 +69,10 @@ Now you are ready to run Llama Stack with Ollama as the inference provider. You
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-export LLAMA_STACK_PORT=5001
+export LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-ollama \
@@ -89,6 +90,7 @@ cd /path/to/llama-stack

 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   -v ./llama_stack/templates/ollama/run-with-safety.yaml:/root/my-run.yaml \
@@ -105,7 +107,7 @@ docker run \
 Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

 ```bash
-export LLAMA_STACK_PORT=5001
+export LLAMA_STACK_PORT=8321

 llama stack build --template ollama --image-type conda
 llama stack run ./run.yaml \

@@ -28,7 +28,7 @@ The `llamastack/distribution-passthrough` distribution consists of the following

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `PASSTHROUGH_API_KEY`: Passthrough API Key (default: ``)
 - `PASSTHROUGH_URL`: Passthrough URL (default: ``)


@@ -29,7 +29,7 @@ You can use this distribution if you have GPUs and want to run an independent vL

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
 - `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
 - `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
@@ -47,6 +47,7 @@ export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
 export CUDA_VISIBLE_DEVICES=0

 docker run \
+  --pull always \
   --runtime nvidia \
   --gpus $CUDA_VISIBLE_DEVICES \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
@@ -59,6 +60,8 @@ docker run \
   --port $INFERENCE_PORT
 ```

+Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
+
 If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:

 ```bash
@@ -67,6 +70,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
 export CUDA_VISIBLE_DEVICES=1

 docker run \
+  --pull always \
   --runtime nvidia \
   --gpus $CUDA_VISIBLE_DEVICES \
   -v ~/.cache/huggingface:/root/.cache/huggingface \
@@ -90,10 +94,11 @@ This method allows you to get started quickly without having to build the distri
 ```bash
 export INFERENCE_PORT=8000
 export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export LLAMA_STACK_PORT=5001
+export LLAMA_STACK_PORT=8321

 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ./run.yaml:/root/my-run.yaml \
   llamastack/distribution-remote-vllm \
@@ -115,6 +120,7 @@ cd /path/to/llama-stack

 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   -v ./llama_stack/templates/remote-vllm/run-with-safety.yaml:/root/my-run.yaml \
@@ -135,7 +141,7 @@ Make sure you have done `uv pip install llama-stack` and have the Llama Stack CL
 ```bash
 export INFERENCE_PORT=8000
 export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export LLAMA_STACK_PORT=5001
+export LLAMA_STACK_PORT=8321

 cd distributions/remote-vllm
 llama stack build --template remote-vllm --image-type conda

@@ -27,7 +27,7 @@ The `llamastack/distribution-sambanova` distribution consists of the following p

 The following environment variables can be configured:

-- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `SAMBANOVA_API_KEY`: SambaNova.AI API Key (default: ``)

 ### Models
@@ -59,9 +59,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   llamastack/distribution-sambanova \
   --port $LLAMA_STACK_PORT \

@@ -31,7 +31,7 @@ You can use this distribution if you have GPUs and want to run an independent TG

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
 - `TGI_URL`: URL of the TGI server with the main inference model (default: `http://127.0.0.1:8080/v1`)
 - `TGI_SAFETY_URL`: URL of the TGI server with the safety model (default: `http://127.0.0.1:8081/v1`)
@@ -48,6 +48,7 @@ export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
 export CUDA_VISIBLE_DEVICES=0

 docker run --rm -it \
+  --pull always \
   -v $HOME/.cache/huggingface:/data \
   -p $INFERENCE_PORT:$INFERENCE_PORT \
   --gpus $CUDA_VISIBLE_DEVICES \
@@ -68,6 +69,7 @@ export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
 export CUDA_VISIBLE_DEVICES=1

 docker run --rm -it \
+  --pull always \
   -v $HOME/.cache/huggingface:/data \
   -p $SAFETY_PORT:$SAFETY_PORT \
   --gpus $CUDA_VISIBLE_DEVICES \
@@ -88,9 +90,10 @@ Now you are ready to run Llama Stack with TGI as the inference provider. You can
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   llamastack/distribution-tgi \
   --port $LLAMA_STACK_PORT \
@@ -107,6 +110,7 @@ cd /path/to/llama-stack

 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   -v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \

@@ -28,7 +28,7 @@ The `llamastack/distribution-together` distribution consists of the following pr

 The following environment variables can be configured:

-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
 - `TOGETHER_API_KEY`: Together.AI API Key (default: ``)

 ### Models
@@ -62,9 +62,10 @@ You can do this via Conda (build code) or Docker which has a pre-built image.
 This method allows you to get started quickly without having to build the distribution code.

 ```bash
-LLAMA_STACK_PORT=5001
+LLAMA_STACK_PORT=8321
 docker run \
   -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   llamastack/distribution-together \
   --port $LLAMA_STACK_PORT \

@@ -54,6 +54,7 @@ mkdir -p ~/.llama
 Then you can start the server using the container tool of your choice. For example, if you are running Docker you can use the following command:
 ```bash
 docker run -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-ollama \
@@ -74,6 +75,7 @@ Docker containers run in their own isolated network namespaces on Linux. To allo
 Linux users having issues running the above command should instead try the following:
 ```bash
 docker run -it \
+  --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   --network=host \
@@ -197,9 +199,7 @@ import os
 import uuid
 from termcolor import cprint

-from llama_stack_client.lib.agents.agent import Agent
-from llama_stack_client.lib.agents.event_logger import EventLogger
-from llama_stack_client.types import Document
+from llama_stack_client import Agent, AgentEventLogger, RAGDocument


 def create_http_client():
@@ -225,7 +225,7 @@ client = (
 # Documents to be used for RAG
 urls = ["chat.rst", "llama3.rst", "memory_optimizations.rst", "lora_finetune.rst"]
 documents = [
-    Document(
+    RAGDocument(
         document_id=f"num-{i}",
         content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
         mime_type="text/plain",
@@ -284,7 +284,7 @@ for prompt in user_prompts:
         messages=[{"role": "user", "content": prompt}],
         session_id=session_id,
     )
-    for log in EventLogger().log(response):
+    for log in AgentEventLogger().log(response):
         log.print()
 ```


@@ -15,8 +15,6 @@ Llama Stack defines and standardizes the core building blocks needed to bring ge
 - **Multiple developer interfaces** like CLI and SDKs for Python, Node, iOS, and Android
 - **Standalone applications** as examples for how to build production-grade AI applications with Llama Stack

-We focus on making it easy to build production applications with the Llama model family - from the latest Llama 3.3 to specialized models like Llama Guard for safety.
-
 ```{image} ../_static/llama-stack.png
 :alt: Llama Stack
 :width: 400px

@@ -48,7 +48,7 @@ Llama Stack addresses these challenges through a service-oriented, API-first app

 **Robust Ecosystem**
 - Llama Stack is already integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies).
-- Ecosystem offers tailored infrastructure, software, and services for deploying Llama models.
+- Ecosystem offers tailored infrastructure, software, and services for deploying a variety of models.


 ### Our Philosophy
@@ -57,7 +57,6 @@ Llama Stack addresses these challenges through a service-oriented, API-first app
 - **Composability**: Every component is independent but works together seamlessly
 - **Production Ready**: Built for real-world applications, not just demos
 - **Turnkey Solutions**: Easy to deploy built in solutions for popular deployment scenarios
-- **Llama First**: Explicit focus on Meta's Llama models and partnering ecosystem


 With Llama Stack, you can focus on building your application while we handle the infrastructure complexity, essential capabilities, and provider integrations.

@@ -118,6 +118,7 @@ Playground can also be started in a docker image:
 export LLAMA_STACK_URL=http://localhost:11434

 docker run \
+  --pull always \
   -p 8501:8501 \
   -e LLAMA_STACK_ENDPOINT=$LLAMA_STACK_URL \
   quay.io/jland/llama-stack-playground