# Llama Stack Inference Guide

This guide shows how to use Llama Stack's `chat_completion` method to generate text with the `Llama-3.2-3B-Instruct` model.

Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llamastack.github.io/latest/getting_started/index.html).


### Table of Contents
1. [Quickstart](#quickstart)
2. [Building Effective Prompts](#building-effective-prompts)
3. [Conversation Loop](#conversation-loop)
4. [Conversation History](#conversation-history)
5. [Streaming Responses](#streaming-responses)


## Quickstart

This section walks through each step to set up and make a simple text generation request.



### 0. Configuration
Set up your connection parameters:

In [1]:
HOST = "localhost"  # Replace with your host
PORT = 8321  # Replace with your port
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"

### 1. Set Up the Client

Begin by importing the client class from Llama Stack's client library and pointing it at your server:

In [2]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

### 2. Create a Chat Completion Request

Call `chat_completion` with a list of messages that defines the conversation context. Each message is a dictionary with a `role` (here `system` or `user`) and a `content` string:

In [3]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)

print(response.completion_message.content)

Here is a two-sentence poem about a llama:

With soft fur and gentle eyes, the llama roams free,
A majestic creature, wild and carefree.
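
Since each message is a plain dictionary with a role and content, a small helper can catch a mistyped role before a request is sent. The sketch below is illustrative only: `make_message` and `ALLOWED_ROLES` are hypothetical names, not part of `llama_stack_client`, and the role set is deliberately limited to the roles this guide uses.

```python
# Hypothetical helper, not part of llama_stack_client.
# Assumption: only the roles that appear in this guide are accepted.
ALLOWED_ROLES = {"system", "user", "assistant"}


def make_message(role: str, content: str) -> dict:
    """Build a chat message dictionary, rejecting unknown roles and empty content."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    if not content.strip():
        raise ValueError("Message content must not be empty")
    return {"role": role, "content": content}


messages = [
    make_message("system", "You are a friendly assistant."),
    make_message("user", "Write a two-sentence poem about llama."),
]
```

The resulting `messages` list has the same shape as the one passed to `chat_completion` in the request above.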


## Building Effective Prompts

Effective prompt creation (often called 'prompt engineering') is essential for quality responses. The sample below keeps the user message from the Quickstart and changes only the system message, which is enough to change the style of the reply:

### Sample Prompt

In [4]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are Shakespeare."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)
print(response.completion_message.content)

"O, fair llama, with thy gentle eyes so bright,
In Andean hills, thou dost enthrall with soft delight."
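
To compare how different system messages steer the same task, it can help to generate the message lists from a template. This is a minimal sketch under stated assumptions: `build_messages` is a hypothetical helper, not a Llama Stack API, and the "You are ..." wording simply mirrors the two system prompts used in this guide.

```python
# Hypothetical helper for comparing system prompts; not a Llama Stack API.
def build_messages(persona: str, task: str) -> list[dict]:
    """Pair a system message describing a persona with a single user task."""
    return [
        {"role": "system", "content": f"You are {persona}."},
        {"role": "user", "content": task},
    ]


task = "Write a two-sentence poem about llama."
variants = {
    persona: build_messages(persona, task)
    for persona in ("a friendly assistant", "Shakespeare")
}

# Show which system message each variant would send.
for persona, messages in variants.items():
    print(persona, "->", messages[0]["content"])
```

Each list in `variants` can be passed as the `messages` argument of `chat_completion`, so the replies can be compared side by side.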


## Conversation Loop

To create a continuous conversation loop in which users can send multiple messages in one session, use the following structure. This example runs the loop as a coroutine and ends when the user types `exit`, `quit`, or `bye`. Each turn sends only the latest user message, so the model does not see earlier turns.

In [6]:
import asyncio
from llama_stack_client import LlamaStackClient
from termcolor import cprint

client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

async def chat_loop():
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        message = {"role": "user", "content": user_input}
        response = client.inference.chat_completion(
            messages=[message],
            model_id=MODEL_NAME
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

# Run the chat loop in a Jupyter Notebook cell using await
await chat_loop()
# To run it in a python file, use this line instead
# asyncio.run(chat_loop())


> Response: How can I assist you today?
> Response: In South American hills, they roam and play,
The llama's gentle eyes gaze out each day.
Their soft fur coats in shades of white and gray,
Inviting all to come and stay.

With ears that listen, ears so fine,
They hear the whispers of the Andean mine.
Their footsteps quiet on the mountain slope,
As they graze on grasses, a peaceful hope.

In Incas' time, they were revered as friends,
Their packs they bore, until the very end.
The Spanish came, with guns and strife,
But llamas stood firm, for life.

Now, they roam free, in fields so wide,
A symbol of resilience, side by side.
With people's lives, a bond so strong,
Together they thrive, all day long.

Their soft hums echo through the air,
As they wander, without a care.
In their gentle hearts, a wisdom lies,
A testament to the Andean skies.

So here they'll stay, in this land of old,
The llama's spirit, forever to hold.
Ending conversation. Goodbye!

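The loop above is declared with `async def`, but `input()` and the client call are ordinary blocking calls, so nothing else can run on the event loop while they wait. If that matters for your application, one option is to hand blocking work to a worker thread. The sketch below uses a stand-in function, `slow_reply`, in place of a real inference request; it is a hypothetical name used only for illustration.

```python
import asyncio
import time


def slow_reply(prompt: str) -> str:
    """Stand-in for a blocking call such as a synchronous inference request."""
    time.sleep(0.05)
    return f"echo: {prompt}"


async def ask(prompt: str) -> str:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(slow_reply, prompt)


async def main() -> list[str]:
    # The two calls overlap instead of running one after the other.
    return await asyncio.gather(ask("hello"), ask("llama"))


# In a Jupyter Notebook cell, use `replies = await main()` instead.
replies = asyncio.run(main())
print(replies)
```

`asyncio.to_thread` requires Python 3.9 or later.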

## Conversation History

Maintaining a conversation history lets the model see context from previous turns. Accumulate the messages in a list and send the whole list with every request so the chat stays continuous.

In [8]:
async def chat_loop():
    conversation_history = []
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=MODEL_NAME,
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

        # Append the assistant's reply so the next request includes it as context
        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
            # Assumption: some Llama Stack versions also require stop_reason here
            "stop_reason": response.completion_message.stop_reason,
        }
        conversation_history.append(assistant_message)

# Use `await` in the Jupyter Notebook cell to call the function
await chat_loop()
# To run it in a python file, use this line instead
# asyncio.run(chat_loop())


> Response: How can I help you today?
> Response: Here's a little poem about llamas:

In Andean highlands, they roam and play,
Their soft fur shining in the sunny day.
With ears so long and eyes so bright,
They watch with gentle curiosity, taking flight.

Their llama voices hum, a soothing sound,
As they wander through the mountains all around.
Their padded feet barely touch the ground,
As they move with ease, without a single bound.

In packs or alone, they make their way,
Carrying burdens, come what may.
Their gentle spirit, a sight to see,
A symbol of peace, for you and me.

With llamas calm, our souls take flight,
In their presence, all is right.
So let us cherish these gentle friends,
And honor their beauty that never ends.
Ending conversation. Goodbye!

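Because the whole history is resent on every turn, the list grows with each exchange, and a model can only accept a limited amount of context. One simple way to bound it is to keep any leading system message plus the most recent messages. This is a minimal sketch, assuming a fixed message count is an acceptable limit; `trim_history` is a hypothetical helper and the cap used here is an arbitrary example value.

```python
# Hypothetical helper; the default cap of 6 messages is an arbitrary example.
def trim_history(history: list[dict], max_messages: int = 6) -> list[dict]:
    """Keep a leading system message (if any) plus the most recent messages."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    return system + rest[-max_messages:]


# Build a sample history: one system message followed by five exchanges.
history = [{"role": "system", "content": "You are a friendly assistant."}]
for turn in range(1, 6):
    history.append({"role": "user", "content": f"question {turn}"})
    history.append({"role": "assistant", "content": f"answer {turn}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

In the chat loop, the trimmed list would be passed as `messages` in place of the full `conversation_history`. Counting messages is a rough proxy; a limit based on token counts would track the model's context window more closely.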

## Streaming Responses

The `chat_completion` method accepts a `stream` parameter. With `stream=True`, partial responses are returned progressively as they are generated, so users see output immediately instead of waiting for the entire response to be processed.

In [9]:
from llama_stack_client.lib.inference.event_logger import EventLogger

async def run_main(stream: bool = True):
    client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

    message = {
        "role": "user",
        "content": 'Write me a 3 sentence poem about llama'
    }
    cprint(f'User> {message["content"]}', 'green')

    response = client.inference.chat_completion(
        messages=[message],
        model_id=MODEL_NAME,
        stream=stream,
    )

    if not stream:
        cprint(f'> Response: {response.completion_message.content}', 'cyan')
    else:
        for log in EventLogger().log(response):
            log.print()

# In a Jupyter Notebook cell, use `await` to call the function
await run_main()
# To run it in a python file, use this line instead
# asyncio.run(run_main())


User> Write me a 3 sentence poem about llama
Assistant> Here is a 3 sentence poem about a llama:

With soft and fuzzy fur so bright,
The llama roams through the Andean light,
A gentle giant, a wondrous sight.

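The general pattern behind streaming is to display each chunk as it arrives while also collecting the chunks, so the full reply is available afterwards (for example, to append to a conversation history). The sketch below illustrates that pattern with a stand-in generator, `fake_stream`, which yields plain strings. It does not reproduce the event objects that Llama Stack returns, and both function names are hypothetical.

```python
from typing import Iterator


def fake_stream(text: str) -> Iterator[str]:
    """Stand-in for a streaming response: yields the text one word at a time."""
    for word in text.split(" "):
        yield word + " "


def consume(stream: Iterator[str]) -> str:
    """Print each chunk as soon as it arrives and return the assembled text."""
    parts = []
    for chunk in stream:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts).rstrip()


full_text = consume(fake_stream("A gentle giant, a wondrous sight."))
```

With a real streaming response, the loop body would extract the text from each event before printing and collecting it.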