Switching between Local and Cloud Models with Llama Stack
This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack's `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.
Pre-requisite
Before you begin, please ensure Llama Stack is installed and the distributions are set up by following the Getting Started Guide. You will need to run two distributions, one local and one cloud, for this demo to work.
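Optionally, before wiring up the fallback logic, you can confirm that both distributions are actually reachable. The snippet below is a minimal sanity-check sketch, assuming the distributions run on ports 5000 and 5001 and expose the same `/health` endpoint used later in this guide:

```python
import httpx

# Quick sanity check: probe each distribution's /health endpoint.
# Adjust the URLs if your distributions listen on different ports (assumption).
for name, base_url in [("local", "http://localhost:5000"), ("cloud", "http://localhost:5001")]:
    try:
        status = httpx.get(f"{base_url}/health", timeout=5.0).status_code
        print(f"{name} distribution at {base_url}: HTTP {status}")
    except httpx.RequestError as exc:
        print(f"{name} distribution at {base_url} is unreachable: {exc}")
```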
Implementation
1. **Set Up Local and Cloud Clients**

   Initialize both clients, specifying the `base_url` each instance was initialized with. In this case, the local distribution is running on `http://localhost:5000` and the cloud distribution on `http://localhost:5001`.

   ```python
   from llama_stack_client import LlamaStackClient

   # Configure local and cloud clients
   local_client = LlamaStackClient(base_url="http://localhost:5000")
   cloud_client = LlamaStackClient(base_url="http://localhost:5001")
   ```
2. **Client Selection with Fallback**

   The `select_client` function checks whether the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client. (A variant that adds an explicit timeout to this health check is sketched after these steps.)

   ```python
   import httpx
   from termcolor import cprint

   async def select_client() -> LlamaStackClient:
       """Use local client if available; otherwise, switch to cloud client."""
       try:
           async with httpx.AsyncClient() as http_client:
               response = await http_client.get(f"{local_client.base_url}/health")
               if response.status_code == 200:
                   cprint("Using local client.", "yellow")
                   return local_client
       except httpx.RequestError:
           pass
       cprint("Local client unavailable. Switching to cloud client.", "yellow")
       return cloud_client
   ```
3. **Generate a Response**

   After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.

   ```python
   from llama_stack_client.lib.inference.event_logger import EventLogger
   from llama_stack_client.types import UserMessage

   async def get_llama_response(stream: bool = True):
       client = await select_client()  # Selects the available client
       message = UserMessage(
           content="hello world, write me a 2 sentence poem about the moon", role="user"
       )
       cprint(f"User> {message.content}", "green")

       response = client.inference.chat_completion(
           messages=[message],
           model="Llama3.2-11B-Vision-Instruct",
           stream=stream,
       )

       if not stream:
           cprint(f"> Response: {response}", "cyan")
       else:
           # Stream tokens progressively
           async for log in EventLogger().log(response):
               log.print()
   ```
4. **Run the Asynchronous Response Generation**

   Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.

   ```python
   import asyncio

   # Initiate the response generation process
   asyncio.run(get_llama_response())
   ```
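If the local endpoint hangs instead of failing fast, the `/health` probe in step 2 can stall the whole request. The following is a minimal sketch of one possible refinement, not part of the original guide: it passes an explicit timeout to `httpx.AsyncClient` (the two-second value is an arbitrary illustration) so the fallback to the cloud client happens promptly. It assumes `local_client` and `cloud_client` are defined as in step 1.

```python
import httpx
from termcolor import cprint

async def select_client_with_timeout(timeout_seconds: float = 2.0) -> LlamaStackClient:
    """Variant of select_client that bounds the /health probe with a timeout."""
    try:
        async with httpx.AsyncClient(timeout=timeout_seconds) as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        # httpx raises TimeoutException (a RequestError subclass) on timeout,
        # so unreachable and slow local servers both fall through to the cloud client.
        pass
    cprint("Local client unavailable or too slow. Switching to cloud client.", "yellow")
    return cloud_client
```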
Complete code
Summing it up, here is the complete code for the local-cloud fallback implementation with Llama Stack:
```python
import asyncio
import httpx
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger
from llama_stack_client.types import UserMessage
from termcolor import cprint

local_client = LlamaStackClient(base_url="http://localhost:5000")
cloud_client = LlamaStackClient(base_url="http://localhost:5001")


async def select_client() -> LlamaStackClient:
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client


async def get_llama_response(stream: bool = True):
    client = await select_client()
    message = UserMessage(
        content="hello world, write me a 2 sentence poem about the moon", role="user"
    )
    cprint(f"User> {message.content}", "green")

    response = client.inference.chat_completion(
        messages=[message],
        model="Llama3.2-11B-Vision-Instruct",
        stream=stream,
    )

    if not stream:
        cprint(f"> Response: {response}", "cyan")
    else:
        async for log in EventLogger().log(response):
            log.print()


asyncio.run(get_llama_response())
```
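If you prefer to print the full response at once instead of streaming tokens, call the helper with `stream=False` (the flag is already handled inside `get_llama_response` above):

```python
# Print the complete response in one shot instead of streaming tokens
asyncio.run(get_llama_response(stream=False))
```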
With these fundamentals, you should be well on your way to leveraging Llama Stack’s text generation capabilities! For more advanced features, refer to the Llama Stack Documentation.