fix: revert back to llama3.2:3b and use "starter" instead of "ollama"
- requires enabling Ollama when using the starter template
- need to explicitly set --model-id when running inference
- update response example

Signed-off-by: Wen Zhou <wenzhou@redhat.com>
parent 214b1fe1ae
commit 2b1bfab131
2 changed files with 88 additions and 66 deletions
@@ -393,17 +393,17 @@ llama stack list
 ```
 
 ```
 +------------------------------+------------------------------------------------+--------------+------------+
 | Stack Name                   | Path                                           | Build Config | Run Config |
 +------------------------------+------------------------------------------------+--------------+------------+
-| together                     | /home/wenzhou/.llama/distributions/together    | Yes          | No         |
+| together                     | ~/.llama/distributions/together                | Yes          | No         |
 +------------------------------+------------------------------------------------+--------------+------------+
-| bedrock                      | /home/wenzhou/.llama/distributions/bedrock     | Yes          | No         |
+| bedrock                      | ~/.llama/distributions/bedrock                 | Yes          | No         |
 +------------------------------+------------------------------------------------+--------------+------------+
-| starter                      | /home/wenzhou/.llama/distributions/starter     | No           | No         |
+| starter                      | ~/.llama/distributions/starter                 | Yes          | Yes        |
 +------------------------------+------------------------------------------------+--------------+------------+
-| remote-vllm                  | /home/wenzhou/.llama/distributions/remote-vllm | Yes          | Yes        |
+| remote-vllm                  | ~/.llama/distributions/remote-vllm             | Yes          | Yes        |
 +------------------------------+------------------------------------------------+--------------+------------+
 ```
 
 ### Removing a Distribution
@@ -42,7 +42,7 @@ powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | ie
 Setup your virtual environment.
 
 ```bash
-uv sync --python 3.10
+uv sync --python 3.12
 source .venv/bin/activate
 ```
 ## Step 2: Run Llama Stack
@@ -56,9 +56,10 @@ You can use Python to build and run the Llama Stack server, which is useful for
 Llama Stack uses a [YAML configuration file](../distributions/configuration.md) to specify the stack setup,
 which defines the providers and their settings.
 Now let's build and run the Llama Stack config for Ollama.
+We use the `starter` template. By default all providers are disabled, so we need to enable Ollama explicitly by passing environment variables.
 
 ```bash
-INFERENCE_MODEL=llama3.2:3b llama stack build --template starter --image-type venv --run
+ENABLE_OLLAMA=ollama OLLAMA_INFERENCE_MODEL="llama3.2:3b" llama stack build --template starter --image-type venv --run
 ```
 :::
 :::{tab-item} Using `conda`
@@ -69,17 +70,18 @@ which defines the providers and their settings.
 Now let's build and run the Llama Stack config for Ollama.
 
 ```bash
-INFERENCE_MODEL=llama3.2:3b llama stack build --template starter --image-type conda --image-name llama3-3b-conda --run
+ENABLE_OLLAMA=ollama INFERENCE_MODEL="llama3.2:3b" llama stack build --template starter --image-type conda --run
 ```
 :::
 :::{tab-item} Using a Container
 You can use a container image to run the Llama Stack server. We provide several container images for the server
 component that works with different inference providers out of the box. For this guide, we will use
-`llamastack/distribution-ollama` as the container image. If you'd like to build your own image or customize the
+`llamastack/distribution-starter` as the container image. If you'd like to build your own image or customize the
 configurations, please check out [this guide](../references/index.md).
 First lets setup some environment variables and create a local directory to mount into the container’s file system.
 ```bash
 export INFERENCE_MODEL="llama3.2:3b"
+export ENABLE_OLLAMA=ollama
 export LLAMA_STACK_PORT=8321
 mkdir -p ~/.llama
 ```
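Before starting the container, it can help to confirm that Ollama is actually serving models at the URL the stack will point to. Below is a minimal sketch (not part of this commit), assuming a default local Ollama install and its standard `/api/tags` listing endpoint:

```python
import json
import urllib.request

# Default local Ollama endpoint; adjust if OLLAMA_URL points elsewhere.
OLLAMA_URL = "http://localhost:11434"

# /api/tags lists the models Ollama has pulled locally.
with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
    tags = json.load(resp)

print("Locally available Ollama models:")
for model in tags.get("models", []):
    print(" -", model.get("name"))
```

If `llama3.2:3b` does not appear in the list, pull it with Ollama before launching the stack.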
@@ -90,7 +92,7 @@ docker run -it \
   --pull always \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
-  llamastack/distribution-ollama \
+  llamastack/distribution-starter \
   --port $LLAMA_STACK_PORT \
   --env INFERENCE_MODEL=$INFERENCE_MODEL \
   --env OLLAMA_URL=http://host.docker.internal:11434
@@ -112,7 +114,7 @@ docker run -it \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   --network=host \
-  llamastack/distribution-ollama \
+  llamastack/distribution-starter \
   --port $LLAMA_STACK_PORT \
   --env INFERENCE_MODEL=$INFERENCE_MODEL \
   --env OLLAMA_URL=http://localhost:11434
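Once the container is running, a quick reachability check saves debugging later. A sketch assuming the stack exposes a `/v1/health` route on the port exported above (verify the route against your server version):

```python
import urllib.request

LLAMA_STACK_PORT = 8321  # matches the export above

# Assumed health route; any 200 response means the server is accepting requests.
url = f"http://localhost:{LLAMA_STACK_PORT}/v1/health"
with urllib.request.urlopen(url) as resp:
    print(resp.status, resp.read().decode())
```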
@@ -146,7 +148,7 @@ source .venv/bin/activate
 
 :::{tab-item} Install with `venv`
 ```bash
-uv venv client --python 3.10
+uv venv client --python 3.12
 source client/bin/activate
 pip install llama-stack-client
 ```
@@ -154,7 +156,7 @@ pip install llama-stack-client
 
 :::{tab-item} Install with `conda`
 ```bash
-yes | conda create -n stack-client python=3.10
+yes | conda create -n stack-client python=3.12
 conda activate stack-client
 pip install llama-stack-client
 ```
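With the client installed, a short smoke test confirms it can reach the server from Step 2 and see the Ollama-backed models. This sketch only uses calls that appear elsewhere in this guide and assumes the server is listening on port 8321:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Print every registered model and the provider that serves it.
for m in client.models.list():
    print(f"{m.model_type:>10}  {m.identifier}  (provider: {m.provider_id})")
```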
@@ -177,37 +179,56 @@ List the models
 llama-stack-client models list
 Available Models
 
 ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
 ┃ model_type ┃ identifier                       ┃ provider_resource_id      ┃ metadata                       ┃ provider_id ┃
 ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ embedding  │ all-MiniLM-L6-v2                 │ all-minilm:l6-v2          │ {'embedding_dimension': 384.0} │ ollama      │
+│ embedding  │ ollama/all-minilm:l6-v2          │ all-minilm:l6-v2          │ {'embedding_dimension': 384.0} │ ollama      │
 ├────────────┼──────────────────────────────────┼───────────────────────────┼────────────────────────────────┼─────────────┤
-│ llm        │ meta-llama/Llama-3.2-3B-Instruct │ llama3.2:3b-instruct-fp16 │                                │ ollama      │
+│ ...        │ ...                              │ ...                       │                                │ ...         │
+├────────────┼──────────────────────────────────┼───────────────────────────┼────────────────────────────────┼─────────────┤
+│ llm        │ ollama/Llama-3.2:3b              │ llama3.2:3b               │                                │ ollama      │
 └────────────┴──────────────────────────────────┴───────────────────────────┴────────────────────────────────┴─────────────┘
-Total models: 2
 
 ```
 You can test basic Llama inference completion using the CLI.
 
 ```bash
-llama-stack-client inference chat-completion --message "tell me a joke"
+llama-stack-client inference chat-completion --model-id "ollama/llama3.2:3b" --message "tell me a joke"
 
 ```
 Sample output:
 ```python
-ChatCompletionResponse(
-    completion_message=CompletionMessage(
-        content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
-        role="assistant",
-        stop_reason="end_of_turn",
-        tool_calls=[],
-    ),
-    logprobs=None,
-    metrics=[
-        Metric(metric="prompt_tokens", value=14.0, unit=None),
-        Metric(metric="completion_tokens", value=27.0, unit=None),
-        Metric(metric="total_tokens", value=41.0, unit=None),
-    ],
-)
+OpenAIChatCompletion(
+    id="chatcmpl-08d7b2be-40f3-47ed-8f16-a6f29f2436af",
+    choices=[
+        OpenAIChatCompletionChoice(
+            finish_reason="stop",
+            index=0,
+            message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
+                role="assistant",
+                content="Why couldn't the bicycle stand up by itself?\n\nBecause it was two-tired.",
+                name=None,
+                tool_calls=None,
+                refusal=None,
+                annotations=None,
+                audio=None,
+                function_call=None,
+            ),
+            logprobs=None,
+        )
+    ],
+    created=1751725254,
+    model="llama3.2:3b",
+    object="chat.completion",
+    service_tier=None,
+    system_fingerprint="fp_ollama",
+    usage={
+        "completion_tokens": 18,
+        "prompt_tokens": 29,
+        "total_tokens": 47,
+        "completion_tokens_details": None,
+        "prompt_tokens_details": None,
+    },
+)
 ```
 
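The same request can also be made from Python through the client API used later in this guide. A minimal sketch, assuming the `ollama/llama3.2:3b` identifier shown in the model listing above is what your server registered:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Mirrors the CLI call above; --model-id becomes the `model` argument.
response = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "tell me a joke"}],
)

# The OpenAI-style response keeps the generated text under choices[0].message.content.
print(response.choices[0].message.content)
```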
@@ -233,19 +254,19 @@ client = LlamaStackClient(base_url="http://localhost:8321")
 models = client.models.list()
 
 # Select the first LLM
-llm = next(m for m in models if m.model_type == "llm")
+llm = next(m for m in models if m.model_type == "llm" and m.provider_id == "ollama")
 model_id = llm.identifier
 
 print("Model:", model_id)
 
-response = client.inference.chat_completion(
-    model_id=model_id,
+response = client.chat.completions.create(
+    model=model_id,
     messages=[
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Write a haiku about coding"},
     ],
 )
-print(response.completion_message.content)
+print(response)
 ```
 
 ### ii. Run the Script
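Because the script goes through the OpenAI-compatible `chat.completions` surface, a streaming variant is a natural follow-up. This is only a sketch, under the assumption that the endpoint honors the OpenAI-style `stream=True` flag and returns incremental delta chunks:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Assumption: stream=True yields OpenAI-style chunks with partial text in delta.content.
stream = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```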
@@ -255,12 +276,8 @@ uv run python inference.py
 ```
 Which will output:
 ```
-Model: llama3.2:3b
-Here is a haiku about coding:
-
-Lines of code unfold
-Logic flows through digital night
-Beauty in the bits
+Model: ollama/llama3.2:3b
+OpenAIChatCompletion(id='chatcmpl-30cd0f28-a2ad-4b6d-934b-13707fc60ebf', choices=[OpenAIChatCompletionChoice(finish_reason='stop', index=0, message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(role='assistant', content="Lines of code unfold\nAlgorithms dance with ease\nLogic's gentle kiss", name=None, tool_calls=None, refusal=None, annotations=None, audio=None, function_call=None), logprobs=None)], created=1751732480, model='llama3.2:3b', object='chat.completion', service_tier=None, system_fingerprint='fp_ollama', usage={'completion_tokens': 16, 'prompt_tokens': 37, 'total_tokens': 53, 'completion_tokens_details': None, 'prompt_tokens_details': None})
 ```
 :::
 
@@ -278,7 +295,7 @@ import uuid
 client = LlamaStackClient(base_url=f"http://localhost:8321")
 
 models = client.models.list()
-llm = next(m for m in models if m.model_type == "llm")
+llm = next(m for m in models if m.model_type == "llm" and m.provider_id == "ollama")
 model_id = llm.identifier
 
 agent = Agent(client, model=model_id, instructions="You are a helpful assistant.")
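One caveat with this selection pattern: `next()` raises `StopIteration` if no Ollama-served LLM is registered, for example when `ENABLE_OLLAMA` was not set at build time. A small defensive sketch that fails with a clearer message, using only the client calls shown above:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()

# Fall back to None instead of raising StopIteration when nothing matches.
llm = next(
    (m for m in models if m.model_type == "llm" and m.provider_id == "ollama"),
    None,
)
if llm is None:
    raise RuntimeError(
        "No Ollama LLM registered; check ENABLE_OLLAMA / INFERENCE_MODEL and "
        "re-run `llama-stack-client models list`."
    )
print("Using model:", llm.identifier)
```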
@@ -315,19 +332,20 @@ uv run python agent.py
 
 ```{dropdown} 👋 Click here to see the sample output
 Non-streaming ...
-agent> I'm an artificial intelligence designed to assist and communicate with users like you. I don't have a personal identity, but I'm here to provide information, answer questions, and help with tasks to the best of my abilities.
+agent> I'm an artificial intelligence designed to assist and communicate with users like you. I don't have a personal identity, but I can provide information, answer questions, and help with tasks to the best of my abilities.
 
-I can be used for a wide range of purposes, such as:
+I'm a large language model, which means I've been trained on a massive dataset of text from various sources, allowing me to understand and respond to a wide range of topics and questions. My purpose is to provide helpful and accurate information, and I'm constantly learning and improving my responses based on the interactions I have with users like you.
 
+I can help with:
+
+* Answering questions on various subjects
 * Providing definitions and explanations
 * Offering suggestions and ideas
-* Helping with language translation
-* Assisting with writing and proofreading
-* Generating text or responses to questions
-* Playing simple games or chatting about topics of interest
+* Assisting with language-related tasks, such as proofreading and editing
+* Generating text and content
+* And more!
 
-I'm constantly learning and improving my abilities, so feel free to ask me anything, and I'll do my best to help!
-
+Feel free to ask me anything, and I'll do my best to help!
 Streaming ...
 AgentTurnResponseStreamChunk(
 │ event=TurnResponseEvent(
@@ -421,15 +439,15 @@ uv run python agent.py
 
 
 Streaming with print helper...
-inference> Déjà vu!
+inference> Déjà vu! You're asking me again!
 
-As I mentioned earlier, I'm an artificial intelligence language model. I don't have a personal identity or consciousness like humans do. I exist solely to process and respond to text-based inputs, providing information and assistance on a wide range of topics.
+As I mentioned earlier, I'm a computer program designed to simulate conversation and answer questions. I don't have a personal identity or consciousness like a human would. I exist solely as a digital entity, running on computer servers and responding to inputs from users like you.
 
-I'm a computer program designed to simulate human-like conversations, using natural language processing (NLP) and machine learning algorithms to understand and generate responses. My purpose is to help users like you with their questions, provide information, and engage in conversation.
+I'm a type of artificial intelligence (AI) called a large language model, which means I've been trained on a massive dataset of text from various sources. This training allows me to understand and respond to a wide range of questions and topics.
 
-Think of me as a virtual companion, a helpful tool designed to make your interactions more efficient and enjoyable. I don't have personal opinions, emotions, or biases, but I'm here to provide accurate and informative responses to the best of my abilities.
+My purpose is to provide helpful and accurate information, answer questions, and assist users like you with tasks and conversations. I don't have personal preferences, emotions, or opinions like humans do. My goal is to be informative, neutral, and respectful in my responses.
 
-So, who am I? I'm just a computer program designed to help you!
+So, that's me in a nutshell!
 ```
 :::
 
@@ -483,7 +501,11 @@ client.tool_runtime.rag_tool.insert(
 )
 
 # Get the model being served
-llm = next(m for m in client.models.list() if m.model_type == "llm")
+llm = next(
+    m
+    for m in client.models.list()
+    if m.model_type == "llm" and m.provider_id == "ollama"
+)
 model = llm.identifier
 
 # Create the RAG agent