Made changes to readme and pinning to llamastack v0.0.61 (#624)

# What does this PR do?

Pins zero2hero to llama-stack 0.0.61 and updates the README.


## Test Plan
Please describe:
 - Ran an end-to-end test of the server and inference with 0.0.61
 
Server output:
<img width="670" alt="image"
src="https://github.com/user-attachments/assets/66515adf-102d-466d-b0ac-fa91568fcee6"
/>


## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
Justin Lee 2025-01-03 03:18:07 +08:00 committed by GitHub
parent 49ad168336
commit 8e5b336792
2 changed files with 36 additions and 44 deletions


@@ -358,7 +358,7 @@
" if not stream:\n",
" cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
" else:\n",
" async for log in EventLogger().log(response):\n",
" for log in EventLogger().log(response):\n",
" log.print()\n",
"\n",
"# In a Jupyter Notebook cell, use `await` to call the function\n",
@@ -366,16 +366,6 @@
"# To run it in a python file, use this line instead\n",
"# asyncio.run(run_main())\n"
]
-},
-{
-"cell_type": "code",
-"execution_count": 11,
-"id": "9399aecc",
-"metadata": {},
-"outputs": [],
-"source": [
-"#fin"
-]
}
],
"metadata": {

@@ -45,7 +45,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
---
## Install Dependencies and Set Up Environment
1. **Create a Conda Environment**:
Create a new Conda environment with Python 3.10:
@@ -73,7 +73,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
Open a new terminal and install `llama-stack`:
```bash
conda activate ollama
-pip install llama-stack==0.0.55
+pip install llama-stack==0.0.61
```
---
@@ -96,7 +96,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
3. **Set the ENV variables by exporting them to the terminal**:
```bash
export OLLAMA_URL="http://localhost:11434"
-export LLAMA_STACK_PORT=5051
+export LLAMA_STACK_PORT=5001
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
```
@@ -104,34 +104,29 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
3. **Run the Llama Stack**:
Run the stack with command shared by the API from earlier:
```bash
llama stack run ollama \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env OLLAMA_URL=$OLLAMA_URL
```
Note: Every time you run a new model with `ollama run`, you will need to restart the llama stack. Otherwise it won't see the new model.
-The server will start and listen on `http://localhost:5051`.
+The server will start and listen on `http://localhost:5001`.
---
## Test with `llama-stack-client` CLI
-After setting up the server, open a new terminal window and install the llama-stack-client package.
+After setting up the server, open a new terminal window and configure the llama-stack-client.
-1. Install the llama-stack-client package
+1. Configure the CLI to point to the llama-stack server.
```bash
conda activate ollama
-pip install llama-stack-client
-```
-2. Configure the CLI to point to the llama-stack server.
-```bash
-llama-stack-client configure --endpoint http://localhost:5051
+llama-stack-client configure --endpoint http://localhost:5001
```
**Expected Output:**
```bash
-Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5051
+Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5001
```
-3. Test the CLI by running inference:
+2. Test the CLI by running inference:
```bash
llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
```
@@ -153,16 +148,18 @@ After setting up the server, open a new terminal window and install the llama-st
After setting up the server, open a new terminal window and verify it's working by sending a `POST` request using `curl`:
```bash
-curl http://localhost:$LLAMA_STACK_PORT/inference/chat_completion \
--H "Content-Type: application/json" \
--d '{
-"model": "Llama3.2-3B-Instruct",
+curl http://localhost:$LLAMA_STACK_PORT/alpha/inference/chat-completion \
+-H "Content-Type: application/json" \
+-d @- <<EOF
+{
+"model_id": "$INFERENCE_MODEL",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2-sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
-}'
+}
+EOF
```
You can check the available models with the command `llama-stack-client models list`.
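As an aside, the updated `curl` example maps directly onto a plain HTTP call from Python. The sketch below mirrors the endpoint path and payload shown in this hunk using `requests`; the environment-variable defaults, the error handling, and the exact shape of the JSON response are assumptions, not part of the diff.

```python
# Rough Python equivalent of the curl example above; assumes the server from
# this guide is running and the `requests` package is installed.
import os
import requests

port = os.environ.get("LLAMA_STACK_PORT", "5001")
model_id = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")

resp = requests.post(
    f"http://localhost:{port}/alpha/inference/chat-completion",
    headers={"Content-Type": "application/json"},
    json={
        "model_id": model_id,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write me a 2-sentence poem about the moon"},
        ],
        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
    },
)
resp.raise_for_status()
print(resp.json())  # response schema depends on the server version
```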
@@ -186,16 +183,12 @@ You can check the available models with the command `llama-stack-client models l
You can also interact with the Llama Stack server using a simple Python script. Below is an example:
-### 1. Activate Conda Environment and Install Required Python Packages
The `llama-stack-client` library offers robust and efficient Python methods for interacting with the Llama Stack server.
+### 1. Activate Conda Environment
```bash
conda activate ollama
-pip install llama-stack-client
```
+Note: the client library gets installed by default if you install the server library.
### 2. Create Python Script (`test_llama_stack.py`)
```bash
touch test_llama_stack.py
@@ -206,19 +199,28 @@ touch test_llama_stack.py
In `test_llama_stack.py`, write the following code:
```python
-from llama_stack_client import LlamaStackClient
+import os
+from llama_stack_client import LlamaStackClient
-# Initialize the client
-client = LlamaStackClient(base_url="http://localhost:5051")
+# Get the model ID from the environment variable
+INFERENCE_MODEL = os.environ.get("INFERENCE_MODEL")
-# Create a chat completion request
+# Check if the environment variable is set
+if INFERENCE_MODEL is None:
+    raise ValueError("The environment variable 'INFERENCE_MODEL' is not set.")
+# Initialize the client
+client = LlamaStackClient(base_url="http://localhost:5001")
+# Create a chat completion request
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
-    model_id=MODEL_NAME,
+    model_id=INFERENCE_MODEL,
)
# Print the response
print(response.completion_message.content)
```
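Finally, a quick way to sanity-check the pinned setup from Python is to list the models the server has registered, mirroring the `llama-stack-client models list` command mentioned above. The attribute names on the returned objects (for example `identifier`) are assumptions and may differ between client versions, so treat this as a sketch.

```python
# Sketch: list the models registered with the server via the Python client.
# `identifier` is an assumed attribute name; fall back to printing the object.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")
for model in client.models.list():
    print(getattr(model, "identifier", model))
```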