# Switching between Local and Cloud Models with Llama Stack

This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack’s `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.

### Prerequisites
Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local and a cloud distribution, for this demo to work.

### Implementation

#### 1. Configuration
Set up your connection parameters:

In [1]:
HOST = "localhost" # Replace with your host
LOCAL_PORT = 8321 # Replace with your local distro port
CLOUD_PORT = 8322 # Replace with your cloud distro port
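
If the host and ports differ between machines, one option is to read them from environment variables and fall back to the defaults above. The sketch below is only an illustration: the variable names (`LLAMA_STACK_HOST`, `LLAMA_STACK_LOCAL_PORT`, `LLAMA_STACK_CLOUD_PORT`) and the helper `load_endpoints` are hypothetical and are not defined by Llama Stack.

```python
import os
from typing import Mapping, Optional


def load_endpoints(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build local and cloud base URLs from environment variables.

    The variable names are illustrative assumptions, not Llama Stack settings.
    """
    env = os.environ if env is None else env
    host = env.get("LLAMA_STACK_HOST", "localhost")
    # Ports default to the values used in this guide.
    local_port = int(env.get("LLAMA_STACK_LOCAL_PORT", "8321"))
    cloud_port = int(env.get("LLAMA_STACK_CLOUD_PORT", "8322"))
    return {
        "local": f"http://{host}:{local_port}",
        "cloud": f"http://{host}:{cloud_port}",
    }
```

The returned URLs can then be passed as `base_url` when constructing each client.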

#### 2. Set Up Local and Cloud Clients

Initialize both clients, specifying the `base_url` for each instance. In this case, the local distribution runs on `http://localhost:8321` and the cloud distribution runs on `http://localhost:8322`, matching the ports configured in step 1.


In [2]:
from llama_stack_client import LlamaStackClient

# Configure local and cloud clients
local_client = LlamaStackClient(base_url=f'http://{HOST}:{LOCAL_PORT}')
cloud_client = LlamaStackClient(base_url=f'http://{HOST}:{CLOUD_PORT}')

#### 3. Client Selection with Fallback

The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.


In [3]:
import httpx
from termcolor import cprint

async def check_client_health(client, client_name: str) -> bool:
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get(f'{client.base_url}/health')
            if response.status_code == 200:
                cprint(f'Using {client_name} client.', 'yellow')
                return True
            else:
                cprint(f'{client_name} client health check failed.', 'red')
                return False
    except httpx.RequestError:
        cprint(f'Failed to connect to {client_name} client.', 'red')
        return False

async def select_client(use_local: bool) -> LlamaStackClient:
    if use_local and await check_client_health(local_client, 'local'):
        return local_client

    if await check_client_health(cloud_client, 'cloud'):
        return cloud_client

    raise ConnectionError('Unable to connect to any client.')

# Example usage: pass True for local, False for cloud
client = await select_client(use_local=True)


Using local client.
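
The fallback order can also be exercised without running either distribution by separating the selection logic from the HTTP probe. The following is a minimal sketch and not part of the notebook's code: `pick_first_healthy` and `stub_probe` are hypothetical names, and the probe argument stands in for a check such as `check_client_health`. The timeout value is an arbitrary assumption.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple


async def pick_first_healthy(
    candidates: Sequence[Tuple[str, str]],
    probe: Callable[[str], Awaitable[bool]],
    timeout: float = 2.0,
) -> Tuple[str, str]:
    """Return the first (name, base_url) pair whose probe succeeds in time."""
    for name, base_url in candidates:
        try:
            healthy = await asyncio.wait_for(probe(base_url), timeout=timeout)
        except (asyncio.TimeoutError, OSError):
            # Treat a slow or unreachable endpoint as unhealthy.
            healthy = False
        if healthy:
            return name, base_url
    raise ConnectionError("Unable to connect to any client.")


async def stub_probe(base_url: str) -> bool:
    """Stand-in probe (assumption): pretend only the endpoint on port 8322 is up."""
    return base_url.endswith(":8322")
```

Candidates are tried in the order given, so listing the local endpoint first reproduces the local-then-cloud preference. In a notebook cell the call would be `await pick_first_healthy(candidates, stub_probe)`.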


#### 4. Generate a Response

After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.


In [4]:
from termcolor import cprint
from llama_stack_client.lib.inference.event_logger import EventLogger

async def get_llama_response(stream: bool = True, use_local: bool = True):
    client = await select_client(use_local)  # Selects the available client
    message = {
        "role": "user",
        "content": 'hello world, write me a 2 sentence poem about the moon'
    }
    cprint(f'User> {message["content"]}', 'green')

    response = client.inference.chat_completion(
        messages=[message],
        model='Llama3.2-11B-Vision-Instruct',
        stream=stream,
    )

    if not stream:
        cprint(f'> Response: {response.completion_message.content}', 'cyan')
    else:
        async for log in EventLogger().log(response):
            log.print()
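
A health check only shows that a server was reachable when the client was selected; the request itself can still fail afterwards. One way to cover that case is to try the local call first and retry on the cloud client if it raises. The sketch below is an illustration under stated assumptions: `call_with_fallback` is a hypothetical helper, the two callables stand in for non-streaming `chat_completion` calls on each client, and the default exception types are Python built-ins rather than the exceptions the client library actually raises.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retry_on: Tuple[type, ...] = (ConnectionError, TimeoutError),
) -> Tuple[str, T]:
    """Run `primary`; if it raises one of `retry_on`, run `fallback` instead.

    Returns a (source, result) pair so the caller can tell which one answered.
    Exceptions outside `retry_on` propagate unchanged.
    """
    try:
        return "primary", primary()
    except retry_on:
        return "fallback", fallback()
```

The exception types are passed in because they depend on the client library in use; check which connection errors your installed client raises before relying on the defaults.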


#### 5. Run with Cloud Model

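Call `get_llama_response` with `use_local=False` to send the request to the cloud distribution. In a Jupyter notebook, which already runs an event loop, call the function with `await`; in a standalone Python script, wrap the call in `asyncio.run()`.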


In [7]:
import asyncio


# Run this function directly in a Jupyter Notebook cell with `await`
await get_llama_response(use_local=False)
# To run it in a Python file, use this line instead
# asyncio.run(get_llama_response(use_local=False))

Using cloud client.
User> hello world, write me a 2 sentence poem about the moon
Assistant> Silver crescent in the midnight sky,
A gentle glow that whispers, "I'm passing by."


#### 6. Run with Local Model


In [8]:
import asyncio

await get_llama_response(use_local=True)

Using local client.
User> hello world, write me a 2 sentence poem about the moon
Assistant> Silver crescent in the midnight sky,
A gentle glow that whispers, "I'm passing by."


Thanks for checking out this notebook! 

The next notebook is a guide on [Prompt Engineering](./02_Prompt_Engineering101.ipynb). Please continue learning!