Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-06-28 02:53:30 +00:00)

# More Updates to Read the Docs (#856)

Parent: 8a686270e9
Commit: 74e933cbfd

8 changed files with 405 additions and 730 deletions
docs/source/building_applications/agent_execution_loop.md (new file, 133 lines)

# Agent Execution Loop

Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.

Each agent turn follows these key steps:

1. **Initial Safety Check**: The user's input is first screened through the configured safety shields.

2. **Context Retrieval**:
   - If RAG is enabled, the agent queries relevant documents from memory banks.
   - New documents are first inserted into the memory bank.
   - Retrieved context is added to the user's prompt.

3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives the augmented prompt (with context and/or previous tool outputs).
   - The LLM generates a response, potentially with tool calls.
   - If tool calls are present:
     - Tool inputs are safety-checked.
     - Tools are executed (e.g., web search, code execution).
     - Tool responses are fed back to the LLM for synthesis.
   - The loop continues until:
     - The LLM provides a final response without tool calls,
     - the maximum number of iterations is reached, or
     - the token limit is exceeded.

4. **Final Safety Check**: The agent's final response is screened through the safety shields.

```{mermaid}
sequenceDiagram
    participant U as User
    participant E as Executor
    participant M as Memory Bank
    participant L as LLM
    participant T as Tools
    participant S as Safety Shield

    Note over U,S: Agent Turn Start
    U->>S: 1. Submit Prompt
    activate S
    S->>E: Input Safety Check
    deactivate S

    E->>M: 2.1 Query Context
    M-->>E: 2.2 Retrieved Documents

    loop Inference Loop
        E->>L: 3.1 Augment with Context
        L-->>E: 3.2 Response (with/without tool calls)

        alt Has Tool Calls
            E->>S: Check Tool Input
            S->>T: 4.1 Execute Tool
            T-->>E: 4.2 Tool Response
            E->>L: 5.1 Tool Response
            L-->>E: 5.2 Synthesized Response
        end

        opt Stop Conditions
            Note over E: Break if:
            Note over E: - No tool calls
            Note over E: - Max iterations reached
            Note over E: - Token limit exceeded
        end
    end

    E->>S: Output Safety Check
    S->>U: 6. Final Response
```

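To make the control flow concrete, here is an illustrative, self-contained Python sketch of a single turn. It is not the Llama Stack implementation; the callables (`run_shield`, `query_context`, `generate`, `execute_tool`) are placeholders for the numbered steps above:

```python
from typing import Any, Callable


def run_turn(
    user_input: str,
    run_shield: Callable[[str, Any], None],   # placeholder: raises or logs on a safety violation
    query_context: Callable[[str], str],      # placeholder: memory bank / RAG lookup
    generate: Callable[[str], Any],           # placeholder: LLM call returning an object with .tool_calls
    execute_tool: Callable[[Any], str],       # placeholder: tool dispatch (web search, code execution, ...)
    max_infer_iters: int = 5,
) -> Any:
    run_shield("input", user_input)                          # 1. initial safety check
    prompt = user_input + "\n" + query_context(user_input)   # 2. context retrieval and augmentation

    response = None
    for _ in range(max_infer_iters):                         # 3. inference loop
        response = generate(prompt)
        if not response.tool_calls:                          # stop: final answer without tool calls
            break
        for call in response.tool_calls:
            run_shield("tool_input", call)                   # tool inputs are safety-checked
            prompt += "\n" + execute_tool(call)              # tool output is fed back for synthesis

    run_shield("output", response)                           # 4. final safety check
    return response
```
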
Each step in this process can be monitored and controlled through configuration. Here's an example that demonstrates monitoring the agent's execution:

```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

client = LlamaStackClient(base_url="http://localhost:5001")

agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Enable both RAG and tool usage
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [{
                "type": "vector",
                "bank_id": "my_docs"
            }],
            "max_tokens_in_context": 4096
        },
        {
            "type": "code_interpreter",
            "enable_inline_code_execution": True
        }
    ],
    # Configure safety
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    # Control the inference loop
    max_infer_iters=5,
    sampling_params={
        "strategy": {
            "type": "top_p",
            "temperature": 0.7,
            "top_p": 0.95
        },
        "max_tokens": 2048
    }
)

agent = Agent(client, agent_config)
session_id = agent.create_session("monitored_session")

# Stream the agent's execution steps
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    attachments=[{
        "content": "https://raw.githubusercontent.com/example/code.py",
        "mime_type": "text/plain"
    }],
    session_id=session_id
)

# Monitor each step of execution
for log in EventLogger().log(response):
    if log.event.step_type == "memory_retrieval":
        print("Retrieved context:", log.event.retrieved_context)
    elif log.event.step_type == "inference":
        print("LLM output:", log.event.model_response)
    elif log.event.step_type == "tool_execution":
        print("Tool call:", log.event.tool_call)
        print("Tool response:", log.event.tool_response)
    elif log.event.step_type == "shield_call":
        if log.event.violation:
            print("Safety violation:", log.event.violation)
```

docs/source/building_applications/evaluation.md (new file, 36 lines)

## Testing & Evaluation

Llama Stack provides built-in tools for evaluating your applications:

1. **Benchmarking**: Test against standard datasets
2. **Application Evaluation**: Score your application's outputs
3. **Custom Metrics**: Define your own evaluation criteria

Here's how to set up basic evaluation:

```python
# Create an evaluation task
response = client.eval_tasks.register(
    eval_task_id="my_eval",
    dataset_id="my_dataset",
    scoring_functions=["accuracy", "relevance"]
)

# Run evaluation
job = client.eval.run_eval(
    task_id="my_eval",
    task_config={
        "type": "app",
        "eval_candidate": {
            "type": "agent",
            "config": agent_config
        }
    }
)

# Get results
result = client.eval.job_result(
    task_id="my_eval",
    job_id=job.job_id
)
```
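The `run_eval` call returns a job handle, so results may not be available immediately. A minimal polling sketch, assuming `job_result` simply returns nothing until the job has finished (check the client's job-status API for the authoritative way to do this):

```python
import time

# Poll until the evaluation job has produced a result.
# Assumption: job_result returns None (or an empty result) while the job is still running.
result = None
while result is None:
    result = client.eval.job_result(task_id="my_eval", job_id=job.job_id)
    if result is None:
        time.sleep(5)  # wait before checking again

print(result)
```
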
@@ -1,446 +1,26 @@

This page was reduced to a short index; its memory/RAG, safety, agent execution loop, and evaluation sections moved into the new files above.

# Building AI Applications

[Open in Colab](https://colab.research.google.com/drive/1F2ksmkoGQPa4pzRjMOE6BXWeOxWFIW6n?usp=sharing)

Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide walks you through how to use these components effectively. The best way to get started is the notebook below, which walks through the various APIs (from basic inference to RAG agents) and how to use them.

**Notebook**: [Building AI Applications](docs/notebooks/Llama_Stack_Building_AI_Applications.ipynb)

## Agentic Concepts

- **[Agent Execution Loop](agent_execution_loop)**
- **[RAG](rag)**
- **[Safety](safety)**
- **[Tools](tools)**
- **[Telemetry](telemetry)**

## Basic Inference

The foundation of any AI application is the ability to interact with LLM models. Llama Stack provides a simple interface for both completion and chat-based inference:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

# List available models
models = client.models.list()

# Simple chat completion
response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"}
    ]
)
print(response.completion_message.content)
```
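The inference API also supports plain (non-chat) completions. A short sketch, assuming the `completion` endpoint mirrors the parameters shown above; check the client reference for the exact signature:

```python
# Plain text completion (no chat roles); parameter names are an assumption
response = client.inference.completion(
    model_id="Llama3.2-3B-Instruct",
    content="Complete this sentence: The capital of France is",
)
print(response)
```
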
## Building Agents

Llama Stack provides a high-level agent framework:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

# Configure an agent
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [],
            "query_generator_config": {
                "type": "default",
                "sep": " "
            }
        }
    ],
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    enable_session_persistence=True
)

# Create an agent
agent = Agent(client, agent_config)
session_id = agent.create_session("my_session")

# Run agent turns
response = agent.create_turn(
    messages=[{"role": "user", "content": "Your question here"}],
    session_id=session_id
)
```

### Adding Tools to Agents

```{toctree}
:hidden:
:maxdepth: 1

agent_execution_loop
rag
safety
tools
```

Agents can be enhanced with various tools. For detailed information about available tools, their configuration, and providers, see the [Tools](tools.md) documentation.

Tools are configured through the `toolgroups` parameter in the agent configuration. Each tool group can be specified either as a string or with additional arguments:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Configure tool groups
    toolgroups=[
        # Simple string format
        "builtin::code_interpreter",
        # With arguments format
        {
            "name": "builtin::websearch",
            "args": {
                "max_results": 5
            }
        }
    ],
    tool_choice="auto",
    tool_prompt_format="json",
    # Optional safety configuration
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    # Control the inference loop
    max_infer_iters=10,
    sampling_params={
        "strategy": {
            "type": "top_p",
            "temperature": 0.7,
            "top_p": 0.95
        },
        "max_tokens": 2048
    }
)

agent = Agent(client, agent_config)
```

For details on available tool groups, providers, and their configuration options, refer to the [Tools](tools.md) documentation.

## Debugging & Monitoring

Llama Stack includes comprehensive telemetry for debugging and monitoring your applications:

1. **Tracing**: Track request flows across components
2. **Metrics**: Measure performance and usage
3. **Logging**: Debug issues and track behavior

The telemetry system supports multiple output formats:

- OpenTelemetry for visualization in tools like Jaeger
- SQLite for local storage and querying
- Console output for development

Example of querying traces:

```python
# Query traces for a session
traces = client.telemetry.query_traces(
    attribute_filters=[{
        "key": "session_id",
        "op": "eq",
        "value": session_id
    }]
)

# Get spans within the root span, indexed by ID
# Use parent_span_id to build a tree out of it
spans_by_id = client.telemetry.get_span_tree(
    span_id=traces[0].root_span_id
)
```

For details on how to use the telemetry system to debug your applications, export traces to a dataset, and run evaluations, see the [Telemetry](telemetry) section.

```{toctree}
:hidden:
:maxdepth: 3

telemetry
```

docs/source/building_applications/rag.md (new file, 92 lines)

## Memory & RAG

Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:

1. **Vector Memory Banks**: For semantic search and retrieval
2. **Key-Value Memory Banks**: For structured data storage
3. **Keyword Memory Banks**: For basic text search
4. **Graph Memory Banks**: For relationship-based retrieval

Here's how to set up a vector memory bank for RAG:

```python
# Register a memory bank
bank_id = "my_documents"
response = client.memory_banks.register(
    memory_bank_id=bank_id,
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512
    }
)

# Insert documents
documents = [
    {
        "document_id": "doc1",
        "content": "Your document text here",
        "mime_type": "text/plain"
    }
]
client.memory.insert(bank_id, documents)

# Query documents
results = client.memory.query(
    bank_id=bank_id,
    query="What do you know about...",
)
```
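For example, to ingest a directory of local text files, you can build the document list programmatically before calling the same `insert` API (a minimal sketch; the folder path and file extension are placeholders for your own data):

```python
from pathlib import Path

# Build one document per .txt file in a local folder (hypothetical path)
docs_dir = Path("./my_docs")
documents = [
    {
        "document_id": path.stem,          # e.g. "notes" for notes.txt
        "content": path.read_text(),
        "mime_type": "text/plain",
    }
    for path in sorted(docs_dir.glob("*.txt"))
]

client.memory.insert(bank_id, documents)

# Sanity-check retrieval against the freshly inserted documents
results = client.memory.query(bank_id=bank_id, query="summarize the main topics")
```
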

### Building RAG-Enhanced Agents

One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types import Attachment
from llama_stack_client.types.agent_create_params import AgentConfig

# Create attachments from documents
attachments = [
    Attachment(
        content="https://raw.githubusercontent.com/example/doc.rst",
        mime_type="text/plain"
    )
]

# Configure agent with memory
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[{
        "type": "memory",
        "memory_bank_configs": [],
        "query_generator_config": {"type": "default", "sep": " "},
        "max_tokens_in_context": 4096,
        "max_chunks": 10
    }],
    enable_session_persistence=True
)

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# Initial document ingestion
response = agent.create_turn(
    messages=[{
        "role": "user",
        "content": "I am providing some documents for reference."
    }],
    attachments=attachments,
    session_id=session_id
)

# Query with RAG
response = agent.create_turn(
    messages=[{
        "role": "user",
        "content": "What are the key topics in the documents?"
    }],
    session_id=session_id
)
```

docs/source/building_applications/safety.md (new file, 21 lines)

## Safety Guardrails

Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:

```python
# Register a safety shield
shield_id = "content_safety"
client.shields.register(
    shield_id=shield_id,
    provider_shield_id="llama-guard-basic"
)

# Run content through shield
response = client.safety.run_shield(
    shield_id=shield_id,
    messages=[{"role": "user", "content": "User message here"}]
)

if response.violation:
    print(f"Safety violation detected: {response.violation.user_message}")
```
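The same shield can guard both sides of an inference call. A minimal sketch combining the `run_shield` and `chat_completion` calls used in this guide (the wrapper function is illustrative, not part of the client; passing the assistant reply back through the shield is an assumption about how you want to apply it):

```python
def guarded_chat(client, shield_id, model_id, user_message):
    # Screen the user's input before it reaches the model
    check = client.safety.run_shield(
        shield_id=shield_id,
        messages=[{"role": "user", "content": user_message}],
    )
    if check.violation:
        return f"Input blocked: {check.violation.user_message}"

    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[{"role": "user", "content": user_message}],
    )
    answer = response.completion_message.content

    # Screen the model's output before returning it to the user
    check = client.safety.run_shield(
        shield_id=shield_id,
        messages=[{"role": "assistant", "content": answer}],
    )
    if check.violation:
        return f"Output blocked: {check.violation.user_message}"
    return answer


print(guarded_chat(client, "content_safety", "Llama3.2-3B-Instruct", "Tell me about Llama Stack"))
```
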
@@ -13,24 +13,94 @@ In order to build your own distribution, we recommend you clone the `llama-stack`
git clone git@github.com:meta-llama/llama-stack.git
cd llama-stack
pip install -e .

llama stack build -h
```

Use the CLI to build your distribution. The main points to consider are:

1. **Image Type** - Do you want a Conda / venv environment or a Container (e.g. Docker)?
2. **Template** - Do you want to use a template to build your distribution, or start from scratch?
3. **Config** - Do you want to use a pre-existing config file to build your distribution?

We will start building our distribution (in the form of a Conda environment or a Container image). In this step, we specify the following (an illustrative `<name>-build.yaml` follows the list):
- `name`: the name for our distribution (e.g. `my-stack`)
- `image_type`: our build image type (`conda | container`)
- `distribution_spec`: our distribution spec for specifying API providers
  - `description`: a short description of the configuration for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint
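For orientation, a build config with these fields might look roughly like the sketch below. This is illustrative only; the generated `<name>-build.yaml` is the authoritative reference, and the provider names are taken from the template listing later in this guide:

```yaml
# my-stack-build.yaml (illustrative sketch; field names follow the list above)
name: my-stack
distribution_spec:
  description: Local stack using Ollama for inference
  providers:
    inference: ["remote::ollama"]
    memory: ["inline::faiss"]
    safety: ["inline::llama-guard"]
    agents: ["inline::meta-reference"]
    telemetry: ["inline::meta-reference"]
image_type: conda
```
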
```
llama stack build -h

usage: llama stack build [-h] [--config CONFIG] [--template TEMPLATE] [--list-templates | --no-list-templates] [--image-type {conda,container,venv}] [--image-name IMAGE_NAME]

Build a Llama stack container

options:
  -h, --help            show this help message and exit
  --config CONFIG       Path to a config file to use for the build. You can find example configs in llama_stack/distribution/**/build.yaml.
                        If this argument is not provided, you will be prompted to enter information interactively
  --template TEMPLATE   Name of the example template config to use for build. You may use `llama stack build --list-templates` to check out the available templates
  --list-templates, --no-list-templates
                        Show the available templates for building a Llama Stack distribution (default: False)
  --image-type {conda,container,venv}
                        Image Type to use for the build. This can be either conda or container or venv. If not specified, will use the image type from the template config.
  --image-name IMAGE_NAME
                        [for image-type=conda] Name of the conda environment to use for the build. If
                        not specified, currently active Conda environment will be used. If no Conda
                        environment is active, you must specify a name.
```

After this step is complete, a file named `<name>-build.yaml` and a template file `<name>-run.yaml` will be generated and saved at the output file path specified at the end of the command.

::::{tab-set}
:::{tab-item} Building from a template
To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.

The following command will show the available templates and their corresponding providers:
```
llama stack build --list-templates
```

```
+------------------------------+-----------------------------------------------------------------------------+
| Template Name                | Description                                                                 |
+------------------------------+-----------------------------------------------------------------------------+
| hf-serverless                | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| together                     | Use Together.AI for running LLM inference                                   |
+------------------------------+-----------------------------------------------------------------------------+
| vllm-gpu                     | Use a built-in vLLM engine for running LLM inference                        |
+------------------------------+-----------------------------------------------------------------------------+
| experimental-post-training   | Experimental template for post training                                     |
+------------------------------+-----------------------------------------------------------------------------+
| remote-vllm                  | Use (an external) vLLM server for running LLM inference                     |
+------------------------------+-----------------------------------------------------------------------------+
| fireworks                    | Use Fireworks.AI for running LLM inference                                  |
+------------------------------+-----------------------------------------------------------------------------+
| tgi                          | Use (an external) TGI server for running LLM inference                      |
+------------------------------+-----------------------------------------------------------------------------+
| bedrock                      | Use AWS Bedrock for running LLM inference and safety                        |
+------------------------------+-----------------------------------------------------------------------------+
| meta-reference-gpu           | Use Meta Reference for running LLM inference                                |
+------------------------------+-----------------------------------------------------------------------------+
| nvidia                       | Use NVIDIA NIM for running LLM inference                                    |
+------------------------------+-----------------------------------------------------------------------------+
| meta-reference-quantized-gpu | Use Meta Reference with fp8, int4 quantization for running LLM inference    |
+------------------------------+-----------------------------------------------------------------------------+
| cerebras                     | Use Cerebras for running LLM inference                                      |
+------------------------------+-----------------------------------------------------------------------------+
| ollama                       | Use (an external) Ollama server for running LLM inference                   |
+------------------------------+-----------------------------------------------------------------------------+
| hf-endpoint                  | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
```

You may then pick a template to build your distribution with providers fitted to your liking.

For example, to build a distribution with TGI as the inference provider, you can run:
```
$ llama stack build --template tgi
...
You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
```
:::
:::{tab-item} Building from Scratch

If the provided templates do not fit your use case, you can run `llama stack build`, which starts an interactive wizard that prompts you for the build configuration.

It is best to start with a template and understand the structure of the config file and the various concepts (APIs, providers, resources, etc.) before starting from scratch.
```
llama stack build
@@ -57,272 +127,6 @@ You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack
```
:::

:::{tab-item} Building from a template
- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.

The following command will allow you to see the available templates and their corresponding providers.
```
llama stack build --list-templates
```

```
+------------------------------+------------------------------------+------------------------------------------------------------------------------+
| Template Name                | Inference Provider(s)              | Description                                                                  |
+------------------------------+------------------------------------+------------------------------------------------------------------------------+
| tgi                          | remote::tgi                        | Use (an external) TGI server for running LLM inference                      |
| remote-vllm                  | remote::vllm                       | Use (an external) vLLM server for running LLM inference                     |
| vllm-gpu                     | inline::vllm                       | Use a built-in vLLM engine for running LLM inference                        |
| meta-reference-quantized-gpu | inline::meta-reference-quantized   | Use Meta Reference with fp8, int4 quantization for running LLM inference    |
| meta-reference-gpu           | inline::meta-reference             | Use Meta Reference for running LLM inference                                |
| hf-serverless                | remote::hf::serverless             | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
| together                     | remote::together                   | Use Together.AI for running LLM inference                                   |
| ollama                       | remote::ollama                     | Use (an external) Ollama server for running LLM inference                   |
| bedrock                      | remote::bedrock                    | Use AWS Bedrock for running LLM inference and safety                        |
| hf-endpoint                  | remote::hf::endpoint               | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
| fireworks                    | remote::fireworks                  | Use Fireworks.AI for running LLM inference                                  |
| cerebras                     | remote::cerebras                   | Use Cerebras for running LLM inference                                      |
+------------------------------+------------------------------------+------------------------------------------------------------------------------+
```

Unless noted otherwise, every template above also includes memory (`inline::faiss`, `remote::chromadb`, `remote::pgvector`), safety (`inline::llama-guard`), agents (`inline::meta-reference`), and telemetry (`inline::meta-reference`) providers. The exceptions are bedrock, which uses `remote::bedrock` for safety, and cerebras, which lists `inline::meta-reference` for memory.

You may then pick a template to build your distribution with providers fitted to your liking.

For example, to build a distribution with TGI as the inference provider, you can run:
```
llama stack build --template tgi
```

```
$ llama stack build --template tgi
...
You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
```
:::

:::{tab-item} Building from a pre-existing build config file
- In addition to templates, you may customize the build to your liking by editing config files and building from a config file with the following command:

@@ -377,6 +181,10 @@ After this step is successful, you should be able to find the built container im
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.

```
# Start using template name
llama stack run tgi

# Start using config file
llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
```
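Once the server is up, you can sanity-check it from another terminal with the client used throughout these docs (the port below is the one assumed elsewhere in this guide; use whichever port your run config exposes):

```python
from llama_stack_client import LlamaStackClient

# Connect to the freshly started distribution server
client = LlamaStackClient(base_url="http://localhost:5001")

# Listing models exercises the full request path end to end
print(client.models.list())
```
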

@@ -412,4 +220,4 @@ INFO: 2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200

### Troubleshooting

If you encounter any issues, ask questions in our Discord, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
@@ -70,20 +70,27 @@ Next up is the most critical part: the set of providers that the stack will use
```yaml
providers:
  inference:
    # provider_id is a string you can choose freely
    - provider_id: ollama
      # provider_type is a string that specifies the type of provider.
      # in this case, the provider for inference is ollama and it is run remotely (outside of the distribution)
      provider_type: remote::ollama
      # config is a dictionary that contains the configuration for the provider.
      # in this case, the configuration is the url of the ollama server
      config:
        url: ${env.OLLAMA_URL:http://localhost:11434}
```
A few things to note:
- A _provider instance_ is identified with an (id, type, configuration) triplet.
- The id is a string you can choose freely.
- You can instantiate any number of provider instances of the same type.
- The configuration dictionary is provider-specific.
- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
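For example, to point the Ollama provider above at a different host at startup (the run-config path is a placeholder for your own):

```
# OLLAMA_URL falls back to http://localhost:11434 unless overridden
llama stack run ~/.llama/distributions/my-stack/my-stack-run.yaml \
  --env OLLAMA_URL=http://my-server:11434
```
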

## Resources

Finally, let's look at the `models` section:

```yaml
models:
  - metadata: {}
@@ -1,11 +1,20 @@
# Using Llama Stack as a Library

If you are planning to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library. This avoids the overhead of setting up a server.

```bash
# setup
pip install llama-stack
llama stack build --template together --image-type venv
```

```python
import os

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient(
    "ollama",
    # provider_data is optional, but if you need to pass in any provider specific data, you can do so here.
    provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
)
await client.initialize()
```

@@ -14,23 +23,12 @@ This will parse your config and set up any inline implementations and remote cli
Then, you can access the APIs like `models` and `inference` on the client and call their methods directly:

```python
response = client.models.list()
print(response)
```
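Inference works the same way through the library client. A short sketch, assuming the synchronous client exposes the same `chat_completion` signature used in the server-based examples in this guide:

```python
# Chat completion through the library client (no server round-trip)
response = client.inference.chat_completion(
    model_id="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=False,
)
print("\nChat completion response:")
print(response)
```
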

If you've created a [custom distribution](https://llama-stack.readthedocs.io/en/latest/distributions/building_distro.html), you can also use the run.yaml configuration file directly:

```python
client = LlamaStackAsLibraryClient(config_path)
client.initialize()
```