forked from phoenix-oss/llama-stack-mirror
Made changes to readme and pinning to llamastack v0.0.61 (#624)
# What does this PR do?

Pinning zero2hero to 0.0.61 and updated readme.

## Test Plan

Please describe:
- Did an end-to-end test on the server and inference for 0.0.61.

Server output:

<img width="670" alt="image" src="https://github.com/user-attachments/assets/66515adf-102d-466d-b0ac-fa91568fcee6" />

## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [x] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
This commit is contained in:
parent
49ad168336
commit
8e5b336792
2 changed files with 36 additions and 44 deletions
@@ -358,7 +358,7 @@
" if not stream:\n",
" cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
" else:\n",
-" async for log in EventLogger().log(response):\n",
+" for log in EventLogger().log(response):\n",
" log.print()\n",
"\n",
"# In a Jupyter Notebook cell, use `await` to call the function\n",
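For context: the hunk above swaps the notebook's `async for` loop for a plain `for`, i.e. the synchronous client's streaming interface. A minimal standalone sketch of that pattern (the `EventLogger` import path and the `termcolor` dependency are assumptions drawn from the notebook, not part of this diff):

```python
# Sketch of the non-streaming vs. streaming branches shown in the notebook hunk.
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger  # assumed import path
from termcolor import cprint  # assumed dependency; the notebook already uses cprint

client = LlamaStackClient(base_url="http://localhost:5001")

def chat(message: str, stream: bool = True):
    response = client.inference.chat_completion(
        messages=[{"role": "user", "content": message}],
        model_id="meta-llama/Llama-3.2-3B-Instruct",  # mirrors INFERENCE_MODEL in the README
        stream=stream,
    )
    if not stream:
        cprint(f"> Response: {response.completion_message.content}", "cyan")
    else:
        # With the sync client a plain `for` loop is enough; no `async for` needed.
        for log in EventLogger().log(response):
            log.print()

chat("Write me a 2-sentence poem about the moon")
```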
@@ -366,16 +366,6 @@
"# To run it in a python file, use this line instead\n",
"# asyncio.run(run_main())\n"
]
-},
-{
-"cell_type": "code",
-"execution_count": 11,
-"id": "9399aecc",
-"metadata": {},
-"outputs": [],
-"source": [
-"#fin"
-]
}
],
"metadata": {
@@ -45,7 +45,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next

---

## Install Dependencies and Set Up Environment

1. **Create a Conda Environment**:
Create a new Conda environment with Python 3.10:
@@ -73,7 +73,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
Open a new terminal and install `llama-stack`:
```bash
conda activate ollama
-pip install llama-stack==0.0.55
+pip install llama-stack==0.0.61
```

---
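Since the point of the pin is a reproducible environment, it can be useful to confirm the resolved versions at runtime. A small sketch using only the standard library (not part of this PR):

```python
# Print the installed versions of the pinned packages, if present.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("llama-stack", "llama-stack-client"):
    try:
        print(f"{pkg}: {version(pkg)}")  # expect 0.0.61 for llama-stack after this change
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```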
@@ -96,7 +96,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
3. **Set the ENV variables by exporting them to the terminal**:
```bash
export OLLAMA_URL="http://localhost:11434"
-export LLAMA_STACK_PORT=5051
+export LLAMA_STACK_PORT=5001
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
```
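These exports are what the Python snippets later in the README read back. A sketch of how they might be consumed; the fallback defaults are assumptions mirroring the values above:

```python
# Read the environment set up in the previous step.
import os

llama_stack_port = int(os.environ.get("LLAMA_STACK_PORT", "5001"))
inference_model = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")
safety_model = os.environ.get("SAFETY_MODEL", "meta-llama/Llama-Guard-3-1B")
ollama_url = os.environ.get("OLLAMA_URL", "http://localhost:11434")

print(f"Expecting Llama Stack at http://localhost:{llama_stack_port}, serving {inference_model}")
```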
@@ -104,34 +104,29 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
3. **Run the Llama Stack**:
Run the stack with command shared by the API from earlier:
```bash
llama stack run ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env OLLAMA_URL=$OLLAMA_URL
```
Note: Everytime you run a new model with `ollama run`, you will need to restart the llama stack. Otherwise it won't see the new model.

-The server will start and listen on `http://localhost:5051`.
+The server will start and listen on `http://localhost:5001`.

---
## Test with `llama-stack-client` CLI
-After setting up the server, open a new terminal window and install the llama-stack-client package.
+After setting up the server, open a new terminal window and configure the llama-stack-client.

-1. Install the llama-stack-client package
+1. Configure the CLI to point to the llama-stack server.
```bash
-conda activate ollama
-pip install llama-stack-client
-```
-2. Configure the CLI to point to the llama-stack server.
-```bash
-llama-stack-client configure --endpoint http://localhost:5051
+llama-stack-client configure --endpoint http://localhost:5001
```
**Expected Output:**
```bash
-Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5051
+Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5001
```
-3. Test the CLI by running inference:
+2. Test the CLI by running inference:
```bash
llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
```
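Before running the CLI tests it can help to confirm that something is actually listening on the configured port. A standard-library-only sketch, not part of this PR:

```python
# Readiness check: the port default mirrors LLAMA_STACK_PORT from the README.
import os
import socket

port = int(os.environ.get("LLAMA_STACK_PORT", "5001"))
try:
    with socket.create_connection(("localhost", port), timeout=2):
        print(f"Llama Stack server is accepting connections on port {port}")
except OSError:
    print(f"Nothing is listening on port {port}; is `llama stack run` still up?")
```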
@@ -153,16 +148,18 @@ After setting up the server, open a new terminal window and install the llama-st
After setting up the server, open a new terminal window and verify it's working by sending a `POST` request using `curl`:

```bash
-curl http://localhost:$LLAMA_STACK_PORT/inference/chat_completion \
+curl http://localhost:$LLAMA_STACK_PORT/alpha/inference/chat-completion \
-H "Content-Type: application/json" \
--d '{
-"model": "Llama3.2-3B-Instruct",
+-d @- <<EOF
+{
+"model_id": "$INFERENCE_MODEL",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2-sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
-}'
+}
+EOF
```

You can check the available models with the command `llama-stack-client models list`.
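The same request can be issued from Python. A sketch assuming the third-party `requests` package is installed; the payload mirrors the new curl body, with `model_id` taken from the `INFERENCE_MODEL` environment variable:

```python
# POST the chat-completion request shown above from Python instead of curl.
import os
import requests

port = os.environ.get("LLAMA_STACK_PORT", "5001")
payload = {
    "model_id": os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct"),
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"},
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
}

resp = requests.post(
    f"http://localhost:{port}/alpha/inference/chat-completion",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```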
@@ -186,16 +183,12 @@ You can check the available models with the command `llama-stack-client models l

You can also interact with the Llama Stack server using a simple Python script. Below is an example:

-### 1. Activate Conda Environment and Install Required Python Packages
-The `llama-stack-client` library offers a robust and efficient python methods for interacting with the Llama Stack server.
+### 1. Activate Conda Environment

```bash
conda activate ollama
-pip install llama-stack-client
```

+Note, the client library gets installed by default if you install the server library

### 2. Create Python Script (`test_llama_stack.py`)
```bash
touch test_llama_stack.py
```
@@ -206,19 +199,28 @@ touch test_llama_stack.py
In `test_llama_stack.py`, write the following code:

```python
-from llama_stack_client import LlamaStackClient
+import os
+from llama_stack_client import LlamaStackClient

-# Initialize the client
-client = LlamaStackClient(base_url="http://localhost:5051")
+# Get the model ID from the environment variable
+INFERENCE_MODEL = os.environ.get("INFERENCE_MODEL")

-# Create a chat completion request
+# Check if the environment variable is set
+if INFERENCE_MODEL is None:
+    raise ValueError("The environment variable 'INFERENCE_MODEL' is not set.")
+
+# Initialize the client
+client = LlamaStackClient(base_url="http://localhost:5001")
+
+# Create a chat completion request
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
-    model_id=MODEL_NAME,
+    model_id=INFERENCE_MODEL,
)

# Print the response
print(response.completion_message.content)
```
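A possible follow-on to `test_llama_stack.py` that streams the reply, reusing the synchronous `EventLogger` pattern from the notebook hunk earlier in this diff. The `EventLogger` import path is an assumption based on llama-stack-client examples, not something this PR touches:

```python
# Streaming variant of the test script: print tokens as they arrive.
import os

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger  # assumed import path

model_id = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")
client = LlamaStackClient(base_url="http://localhost:5001")

response = client.inference.chat_completion(
    messages=[{"role": "user", "content": "Write a two-sentence poem about llama."}],
    model_id=model_id,
    stream=True,  # tokens arrive incrementally
)

# With the sync client a plain `for` loop is enough (no `async for`).
for log in EventLogger().log(response):
    log.print()
```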