Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-06-27 18:50:41 +00:00)
Made changes to readme and pinning to llamastack v0.0.61 (#624)
# What does this PR do?

Pinning zero2hero to 0.0.61 and updated readme.

## Test Plan

- Did an end-to-end test on the server and inference for 0.0.61.

Server output:

<img width="670" alt="image" src="https://github.com/user-attachments/assets/66515adf-102d-466d-b0ac-fa91568fcee6" />

## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [x] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
This commit is contained in:
parent 49ad168336, commit 8e5b336792
2 changed files with 36 additions and 44 deletions
@@ -358,7 +358,7 @@
     "    if not stream:\n",
     "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
     "    else:\n",
-    "        async for log in EventLogger().log(response):\n",
+    "        for log in EventLogger().log(response):\n",
     "            log.print()\n",
     "\n",
     "# In a Jupyter Notebook cell, use `await` to call the function\n",
@@ -366,16 +366,6 @@
     "# To run it in a python file, use this line instead\n",
     "# asyncio.run(run_main())\n"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": 11,
-   "id": "9399aecc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#fin"
-   ]
-  }
  ],
  "metadata": {
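The two notebook hunks above drop the `async for` in favor of a plain `for` over the `EventLogger` stream and delete the trailing `#fin` cell. As a self-contained reference, here is a minimal sketch of the kind of helper that cell defines after the change; the `EventLogger` import path, the default model name, and the port are assumptions based on the rest of this guide, not a verbatim copy of the notebook.

```python
# A minimal sketch of the kind of streaming helper the edited notebook cell
# defines. Assumptions: the server runs on port 5001, INFERENCE_MODEL may be
# exported as in the README, and EventLogger lives at the import path below.
import os

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger
from termcolor import cprint

client = LlamaStackClient(base_url="http://localhost:5001")
MODEL_NAME = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")


async def run_main(stream: bool = True):
    response = client.inference.chat_completion(
        messages=[{"role": "user", "content": "Write me a 2-sentence poem about the moon"}],
        model_id=MODEL_NAME,
        stream=stream,
    )
    if not stream:
        # Non-streaming: the full completion is available on the response.
        cprint(f"> Response: {response.completion_message.content}", "cyan")
    else:
        # After this PR the event stream is consumed with a plain for loop.
        for log in EventLogger().log(response):
            log.print()

# In a Jupyter notebook cell: await run_main()
# In a plain Python file:     import asyncio; asyncio.run(run_main())
```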
@@ -45,7 +45,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next

 ---

 ## Install Dependencies and Set Up Environment

 1. **Create a Conda Environment**:
    Create a new Conda environment with Python 3.10:
@@ -73,7 +73,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
    Open a new terminal and install `llama-stack`:
    ```bash
    conda activate ollama
-   pip install llama-stack==0.0.55
+   pip install llama-stack==0.0.61
    ```

 ---
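Since the guide now pins `llama-stack==0.0.61`, a quick way to confirm what actually landed in the environment is to query the installed distribution from Python. This is an optional check, not part of the guide; the distribution name `llama_stack` is an assumption about how the package registers itself.

```python
# Optional sanity check (not part of the guide): confirm that the pinned
# release is what the active environment actually has. The distribution name
# "llama_stack" is an assumption about how the package registers itself.
from importlib.metadata import version

installed = version("llama_stack")
print(f"llama-stack {installed}")
if installed != "0.0.61":
    print("Warning: this guide was written against llama-stack 0.0.61")
```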
@@ -96,7 +96,7 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
 3. **Set the ENV variables by exporting them to the terminal**:
    ```bash
    export OLLAMA_URL="http://localhost:11434"
-   export LLAMA_STACK_PORT=5051
+   export LLAMA_STACK_PORT=5001
    export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
    export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
    ```
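The Python examples later in this README read `INFERENCE_MODEL` from the environment. A small sketch like the one below, with variable names taken from the exports above and defaults matching this guide, is one way to keep a script in sync with the server configuration; treat it as an illustration rather than part of the official docs.

```python
# Sketch: pick up the same environment variables exported above, with
# fallbacks matching the values used in this README.
import os

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434")
LLAMA_STACK_PORT = os.environ.get("LLAMA_STACK_PORT", "5001")
INFERENCE_MODEL = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")
SAFETY_MODEL = os.environ.get("SAFETY_MODEL", "meta-llama/Llama-Guard-3-1B")

# Base URL a local client would use to reach the stack started in the next step.
BASE_URL = f"http://localhost:{LLAMA_STACK_PORT}"
print(BASE_URL, INFERENCE_MODEL, SAFETY_MODEL, OLLAMA_URL)
```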
@@ -104,34 +104,29 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
 3. **Run the Llama Stack**:
    Run the stack with the command shared by the API from earlier:
    ```bash
-   llama stack run ollama \
-    --port $LLAMA_STACK_PORT \
-    --env INFERENCE_MODEL=$INFERENCE_MODEL \
-    --env SAFETY_MODEL=$SAFETY_MODEL \
+   llama stack run ollama \
+    --port $LLAMA_STACK_PORT \
+    --env INFERENCE_MODEL=$INFERENCE_MODEL \
+    --env SAFETY_MODEL=$SAFETY_MODEL \
+    --env OLLAMA_URL=$OLLAMA_URL
    ```
    Note: Every time you run a new model with `ollama run`, you will need to restart the llama stack. Otherwise it won't see the new model.

-   The server will start and listen on `http://localhost:5051`.
+   The server will start and listen on `http://localhost:5001`.

 ---
 ## Test with `llama-stack-client` CLI
-After setting up the server, open a new terminal window and install the llama-stack-client package.
+After setting up the server, open a new terminal window and configure the llama-stack-client.

-1. Install the llama-stack-client package
+1. Configure the CLI to point to the llama-stack server.
    ```bash
-   conda activate ollama
-   pip install llama-stack-client
-   ```
-2. Configure the CLI to point to the llama-stack server.
-   ```bash
-   llama-stack-client configure --endpoint http://localhost:5051
+   llama-stack-client configure --endpoint http://localhost:5001
    ```
    **Expected Output:**
    ```bash
-   Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5051
+   Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5001
    ```
-3. Test the CLI by running inference:
+2. Test the CLI by running inference:
    ```bash
    llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
    ```
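Roughly the same smoke test can be run from Python with the `llama-stack-client` package. The `models.list()` call and the `identifier` attribute are assumptions inferred from the `llama-stack-client models list` CLI command referenced in this README, so treat this as a sketch rather than the guide's canonical example.

```python
# Sketch: Python counterpart of the CLI smoke test above, assuming the server
# from the previous step is listening on port 5001.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

# Rough equivalent of `llama-stack-client models list` (attribute name assumed).
for model in client.models.list():
    print(model.identifier)

# Rough equivalent of the chat-completion smoke test above.
response = client.inference.chat_completion(
    messages=[{"role": "user", "content": "Write me a 2-sentence poem about the moon"}],
    model_id="meta-llama/Llama-3.2-3B-Instruct",
)
print(response.completion_message.content)
```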
@@ -153,16 +148,18 @@ After setting up the server, open a new terminal window and install the llama-st
 After setting up the server, open a new terminal window and verify it's working by sending a `POST` request using `curl`:

 ```bash
-curl http://localhost:$LLAMA_STACK_PORT/inference/chat_completion \
--H "Content-Type: application/json" \
--d '{
-    "model": "Llama3.2-3B-Instruct",
+curl http://localhost:$LLAMA_STACK_PORT/alpha/inference/chat-completion \
+-H "Content-Type: application/json" \
+-d @- <<EOF
+{
+    "model_id": "$INFERENCE_MODEL",
     "messages": [
         {"role": "system", "content": "You are a helpful assistant."},
         {"role": "user", "content": "Write me a 2-sentence poem about the moon"}
     ],
     "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
-}'
+}
+EOF
 ```

 You can check the available models with the command `llama-stack-client models list`.
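For readers who prefer Python over `curl`, a sketch along these lines sends the same payload to the same route; the URL and request body are copied from the command above, while the use of the `requests` library is an editorial assumption, not something the guide prescribes.

```python
# Sketch: the same request as the curl command above, sent with the requests
# library. The route and payload mirror this README.
import os

import requests

port = os.environ.get("LLAMA_STACK_PORT", "5001")
url = f"http://localhost:{port}/alpha/inference/chat-completion"

payload = {
    "model_id": os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct"),
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"},
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```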
@@ -186,16 +183,12 @@ You can check the available models with the command `llama-stack-client models l
 You can also interact with the Llama Stack server using a simple Python script. Below is an example:

-### 1. Activate Conda Environment and Install Required Python Packages
-The `llama-stack-client` library offers a robust and efficient python methods for interacting with the Llama Stack server.
+### 1. Activate Conda Environment

 ```bash
 conda activate ollama
-pip install llama-stack-client
 ```

+Note: the client library gets installed by default if you install the server library.

 ### 2. Create Python Script (`test_llama_stack.py`)
 ```bash
 touch test_llama_stack.py
 ```
@@ -206,19 +199,28 @@
 In `test_llama_stack.py`, write the following code:

 ```python
-from llama_stack_client import LlamaStackClient
+import os
+from llama_stack_client import LlamaStackClient

-# Initialize the client
-client = LlamaStackClient(base_url="http://localhost:5051")
+# Get the model ID from the environment variable
+INFERENCE_MODEL = os.environ.get("INFERENCE_MODEL")

-# Create a chat completion request
+# Check if the environment variable is set
+if INFERENCE_MODEL is None:
+    raise ValueError("The environment variable 'INFERENCE_MODEL' is not set.")
+
+# Initialize the client
+client = LlamaStackClient(base_url="http://localhost:5001")
+
+# Create a chat completion request
 response = client.inference.chat_completion(
     messages=[
         {"role": "system", "content": "You are a friendly assistant."},
         {"role": "user", "content": "Write a two-sentence poem about llama."}
     ],
-    model_id=MODEL_NAME,
+    model_id=INFERENCE_MODEL,
 )

 # Print the response
 print(response.completion_message.content)
 ```
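As an optional follow-up to `test_llama_stack.py`, the same request can be made with streaming enabled, mirroring the notebook change earlier in this PR. The `EventLogger` import path is an assumption carried over from the notebook cell and is not shown in this README.

```python
# Optional streaming variant of test_llama_stack.py (a sketch, not part of the
# guide). The EventLogger import path is assumed from the notebook cell above.
import os

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger

client = LlamaStackClient(base_url="http://localhost:5001")

response = client.inference.chat_completion(
    messages=[{"role": "user", "content": "Write a two-sentence poem about llama."}],
    model_id=os.environ["INFERENCE_MODEL"],
    stream=True,
)

# Print tokens as they arrive instead of waiting for the full completion.
for log in EventLogger().log(response):
    log.print()
```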