# Oracle Cloud Infrastructure (OCI) with Llama Stack

This notebook shows how to get started with OCI Generative AI models through Llama Stack.

## Prerequisites

1. **Install required packages:**
   ```bash
   pip install llama-stack-client oci
   ```

2. **Configure OCI credentials:**
   - Set up `~/.oci/config` with your OCI credentials
   - Set the `OCI_COMPARTMENT_OCID` environment variable
   - Set the `OCI_REGION` environment variable

3. **Start Llama Stack server:**
   ```bash
   llama stack run /oci/[your_oci_config].yaml
   ```
   Make sure to set OCI as your inference provider in your configuration file as shown here:
```yaml
providers:
  inference:
  - provider_id: oci
    provider_type: remote::oci
    config:
      oci_auth_type: ${env.OCI_AUTH_TYPE:=instance_principal}
      oci_config_file_path: ${env.OCI_CONFIG_FILE_PATH:=~/.oci/config}
      oci_config_profile: ${env.OCI_CLI_PROFILE:=DEFAULT}
      oci_region: ${env.OCI_REGION:=us-ashburn-1}
      oci_compartment_id: ${env.OCI_COMPARTMENT_OCID:=}
```
4. **Verify server is running:**
   - Server should be accessible at `http://localhost:8321`
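
The `${env.NAME:=default}` placeholders in the configuration above can be read as "use the environment variable if it is set, otherwise fall back to the default". The sketch below is only an illustration of that reading, not Llama Stack's own resolver; the function name `resolve_placeholders` is hypothetical, and the real implementation may treat empty values differently.

```python
import re

# Matches placeholders of the form ${env.NAME:=default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*):=([^}]*)\}")


def resolve_placeholders(value: str, env: dict) -> str:
    """Replace each ${env.NAME:=default} with env[NAME], or the default if NAME is unset."""

    def _substitute(match):
        name, default = match.group(1), match.group(2)
        return env.get(name, default)

    return _PLACEHOLDER.sub(_substitute, value)


# Illustrative environment only; the OCID below is a made-up placeholder.
example_env = {"OCI_COMPARTMENT_OCID": "ocid1.compartment.oc1..example"}

# OCI_REGION is not in example_env, so the default after ':=' is used.
print(resolve_placeholders("${env.OCI_REGION:=us-ashburn-1}", example_env))
# OCI_COMPARTMENT_OCID is present, so its value replaces the placeholder.
print(resolve_placeholders("${env.OCI_COMPARTMENT_OCID:=}", example_env))
```

Passing `dict(os.environ)` as `env` would show which values your own shell contributes before you start the server.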

In [1]:
# Optional: use a specific venv that has the 0.4.0 client installed.
# Skip this cell if your active environment already has the right packages.
import sys
sys.path.insert(0, 'oci/venv/lib/python3.12/site-packages')
print("Python path updated to use venv")

Python path updated to use venv


In [2]:
# Import required libraries
from llama_stack_client import LlamaStackClient
import os

# Check if environment variable is set
if not os.getenv("OCI_COMPARTMENT_OCID"):
    print("⚠️  WARNING: OCI_COMPARTMENT_OCID environment variable not set")
    print("Please set it with: export OCI_COMPARTMENT_OCID='ocid1.compartment.oc1..xxx'")
else:
    print("✅ OCI_COMPARTMENT_OCID is set")

✅ OCI_COMPARTMENT_OCID is set


In [3]:
# Initialize the Llama Stack client
# Make sure the server is running at http://localhost:8321

client = LlamaStackClient(base_url="http://localhost:8321")
print("✅ Connected to Llama Stack server")

✅ Connected to Llama Stack server


## 1. List Available Models

First, let's see what OCI models are available through Llama Stack.

In [4]:
# List all available models
models = client.models.list()

print(f"Found {len(models)} models:\n")
for model in models:
    print(f" {model.id}")
    print(f"   Provider: {model.owned_by}")
    if hasattr(model, "custom_metadata") and model.custom_metadata:
        print(f"   Metadata: {model.custom_metadata}")
    print()

INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"


Found 11 models:

 oci/google.gemini-2.5-flash
   Provider: llama_stack
   Metadata: {'model_type': 'llm', 'provider_id': 'oci', 'provider_resource_id': 'google.gemini-2.5-flash'}

 oci/google.gemini-2.5-pro
   Provider: llama_stack
   Metadata: {'model_type': 'llm', 'provider_id': 'oci', 'provider_resource_id': 'google.gemini-2.5-pro'}

 oci/google.gemini-2.5-flash-lite
   Provider: llama_stack
   Metadata: {'model_type': 'llm', 'provider_id': 'oci', 'provider_resource_id': 'google.gemini-2.5-flash-lite'}

 oci/xai.grok-4-fast-non-reasoning
   Provider: llama_stack
   Metadata: {'model_type': 'llm', 'provider_id': 'oci', 'provider_resource_id': 'xai.grok-4-fast-non-reasoning'}

 oci/xai.grok-4-fast-reasoning
   Provider: llama_stack
   Metadata: {'model_type': 'llm', 'provider_id': 'oci', 'provider_resource_id': 'xai.grok-4-fast-reasoning'}

 oci/xai.grok-code-fast-1
   Provider: llama_stack
   Metadata: {'model_type': 'llm', 'provider_id': 'oci', 'provider_resource_id': 'xai.grok-cod
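
The model IDs listed above follow the pattern `oci/<vendor>.<model-name>`. As a small sketch that assumes this pattern holds for every ID, the hypothetical helper `group_by_vendor` groups IDs by vendor so that a long list is easier to scan. It works on plain strings, so you could pass `[m.id for m in models]`.

```python
from collections import defaultdict


def group_by_vendor(model_ids):
    """Group IDs like 'oci/google.gemini-2.5-flash' under their vendor ('google')."""
    groups = defaultdict(list)
    for model_id in model_ids:
        # Drop the provider prefix ('oci/'), then take the text before the first dot.
        resource_id = model_id.split("/", 1)[-1]
        vendor = resource_id.split(".", 1)[0]
        groups[vendor].append(model_id)
    return dict(groups)


# Sample IDs copied from the listing above.
sample_ids = [
    "oci/google.gemini-2.5-flash",
    "oci/google.gemini-2.5-pro",
    "oci/xai.grok-4-fast-reasoning",
]
for vendor, ids in group_by_vendor(sample_ids).items():
    print(f"{vendor}: {len(ids)} model(s)")
```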

## 2. Non-Streaming Chat Completion

Let's run a simple chat completion request (non-streaming).

In [5]:
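# Select the first available model; stop early so later cells never see an undefined model_id
if len(models) == 0:
    raise RuntimeError("No models available! Check the OCI provider configuration.")
model_id = models[0].id
print(f"Using model: {model_id}")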

Using model: oci/google.gemini-2.5-flash


In [6]:
# Run a simple chat completion
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "user", "content": "What is Oracle Cloud Infrastructure?"}
    ],
    temperature=0.7,
    max_tokens=4096,
)

print("\n Response:")
print("=" * 80)
print(response.choices[0].message.content)
print("=" * 80)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"



 Response:
**Oracle Cloud Infrastructure (OCI)** is a suite of cloud computing services that runs on a global network of Oracle-managed data centers. It provides a complete range of highly automated, high-performance, and cost-effective services, including compute, storage, networking, databases, analytics, machine learning, IoT, and more.

Essentially, OCI is Oracle's public cloud offering, designed to compete with industry giants like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

Here's a breakdown of what OCI is and what makes it stand out:

1.  **Cloud Computing Model:**
    *   **Infrastructure as a Service (IaaS):** Provides fundamental computing resources (virtual machines, bare metal servers, storage, networking) over the internet. Users manage operating systems, applications, and data.
    *   **Platform as a Service (PaaS):** Offers a platform for customers to develop, run, and manage applications without the complexity of building and maintai
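
The request above can be wrapped in a small helper so that later cells only pass a prompt. This is a minimal sketch: the name `ask` is hypothetical, and the stub client exists only so the example runs without a server. With the real `client` from this notebook, the call would be `ask(client, model_id, "...")`.

```python
from types import SimpleNamespace


def ask(client, model_id, prompt, temperature=0.7, max_tokens=4096):
    """Send one user message and return the reply text ('' if nothing came back)."""
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    if not response.choices:
        return ""
    return response.choices[0].message.content or ""


# Stand-in client for offline testing: it echoes the last user message.
def _fake_create(**kwargs):
    reply = SimpleNamespace(content=f"echo: {kwargs['messages'][-1]['content']}")
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
print(ask(fake_client, "oci/google.gemini-2.5-flash", "hello"))
```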

## 3. Streaming Chat Completion

Now let's try streaming: the response is printed piece by piece as each chunk arrives.

In [7]:
# Run a streaming chat completion
print(" Streaming Response:")
print("=" * 80)

stream = client.chat.completions.create(
    model=model_id,
    messages=[
        {
            "role": "user",
            "content": "List 3 benefits of using OCI for AI workloads."
        }
    ],
    temperature=0.7,
    max_tokens=4096,
    stream=True,
)

# Print tokens as they arrive
for chunk in stream:
    if hasattr(chunk, "choices") and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        if hasattr(delta, "content") and delta.content:
            print(delta.content, end="", flush=True)

print("\n" + "=" * 80)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


 Streaming Response:
Here are 3 key benefits of using Oracle Cloud Infrastructure (OCI) for AI workloads:

1.  **High-Performance Compute with Leading GPUs:** OCI offers powerful NVIDIA GPUs (such as A100s and H100s) on bare metal and high-core count virtual machines. This provides the raw, uncompromised compute power essential for rapidly training complex deep learning models, running large-scale simulations, and performing high-throughput inference, significantly reducing model development and deployment times.

2.  **Cost-Effectiveness and Flexible Pricing:** OCI is often recognized for its competitive pricing compared to other major cloud providers, especially for high-performance resources like GPUs. It also typically features lower data egress fees, which can lead to substantial cost savings for data-intensive AI workloads that frequently move large datasets in and out of the cloud. Flexible consumption models further help optimize spending.

3.  **Integrated AI/ML Services and M
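
If you also want the full text once streaming finishes, the same loop can accumulate the deltas instead of only printing them. The sketch below assumes chunks shaped like the ones handled above (`choices[0].delta.content`); `collect_stream` is a hypothetical helper, and the fake chunks stand in for a live stream.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        choices = getattr(chunk, "choices", None)
        if not choices:
            continue  # some chunks carry no choices
        content = getattr(choices[0].delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


def _chunk(text):
    """Build a fake chunk with the same attribute layout as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [_chunk("Hello"), _chunk(None), SimpleNamespace(choices=[]), _chunk(", OCI")]
print(collect_stream(fake_stream))
```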

## 4. Try Different Models

You can experiment with different OCI models. Here are some examples:

In [8]:
# List all model IDs for easy reference
print("Available models:")
for i, model in enumerate(models, 1):
    print(f"{i}. {model.id}")

Available models:
1. oci/google.gemini-2.5-flash
2. oci/google.gemini-2.5-pro
3. oci/google.gemini-2.5-flash-lite
4. oci/xai.grok-4-fast-non-reasoning
5. oci/xai.grok-4-fast-reasoning
6. oci/xai.grok-code-fast-1
7. oci/xai.grok-4
8. oci/xai.grok-3-mini-fast
9. oci/xai.grok-3-fast
10. oci/xai.grok-3
11. oci/xai.grok-3-mini


In [9]:
# Try a different model (change the index to try different models)
if len(models) > 1:
    model_id = models[1].id  # Try the second model
    print(f"Switching to: {model_id}\n")

    response = client.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "user", "content": "Write a poem about cloud computing."}
        ],
        temperature=0.9,
        max_tokens=4096,
    )

    print(" Response:")
    print("=" * 80)
    print(response.choices[0].message.content)
    print("=" * 80)
else:
    print("Only one model available")

Switching to: oci/google.gemini-2.5-pro



INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


 Response:
No floppy disk, no silver sphere,
No heavy drive you hold so dear.
Your data’s gone, it flew away
To live and breathe a brighter day.

It rests within a nebulous haze,
Through sunlit and through moonlit days.
A wisp of thought, a digital stream,
The substance of a modern dream.

You pull it down on phone or screen,
A distant file, a long-lost scene.
A document, a shared design,
No longer solely yours or mine.

But this soft cloud is not of rain,
It's built on a terrestrial plane.
Of humming racks in cooled, vast halls,
Behind secure and fireproof walls.

A million lights that blink and gleam,
A flowing, cool, electric stream.
A silent army, code and wire,
That serves the world's immense desire.

It’s more than storage, safe and deep,
While all our local systems sleep.
It’s rented power, brain, and brawn,
To calculate from dusk till dawn.

So tap your key and make the call,
The cloud provides and serves us all.
A weightless vault, beyond the blue,
That holds the work, the wor
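
Selecting by index depends on the order in which the server returns models. A possible alternative, sketched here with the hypothetical helper `find_model`, is to select by a fragment of the model ID. The IDs are copied from the listing above; with a live client you would pass `[m.id for m in models]`.

```python
def find_model(model_ids, keyword):
    """Return the first model ID containing `keyword` (case-insensitive), else None."""
    needle = keyword.lower()
    for model_id in model_ids:
        if needle in model_id.lower():
            return model_id
    return None


available = [
    "oci/google.gemini-2.5-flash",
    "oci/google.gemini-2.5-pro",
    "oci/xai.grok-4",
]
print(find_model(available, "gemini-2.5-pro"))
print(find_model(available, "not-a-model"))
```

Because the first match wins, a short keyword such as `gemini` returns whichever matching ID comes first in the list.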

## 5. Multi-turn Conversation

You can maintain conversation context by including previous messages.

In [10]:
# Multi-turn conversation example
conversation = [
    {"role": "user", "content": "What are the main features of OCI?"},
]

# First turn
response1 = client.chat.completions.create(
    model=models[0].id,
    messages=conversation,
    temperature=0.7,
    max_tokens=4096,
)

first_response = response1.choices[0].message.content
print(" Turn 1:")
print(first_response)
print()

# Add assistant response to conversation
conversation.append({"role": "assistant", "content": first_response})
conversation.append({"role": "user", "content": "Can you elaborate on the compute features?"})

# Second turn
response2 = client.chat.completions.create(
    model=models[0].id,
    messages=conversation,
    temperature=0.7,
    max_tokens=4096,
)

print(" Turn 2:")
print(response2.choices[0].message.content)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


 Turn 1:
Oracle Cloud Infrastructure (OCI) is a suite of cloud computing services that runs on the Oracle Cloud Infrastructure platform. It aims to provide enterprise-grade performance, security, and cost-effectiveness for a wide range of workloads.

Here are the main features of OCI, categorized for clarity:

1.  **Core Infrastructure Services:**
    *   **Compute:**
        *   **Virtual Machines (VMs):** Flexible and scalable virtual servers.
        *   **Bare Metal Instances:** Dedicated physical servers for high-performance workloads, offering direct access to hardware resources without virtualization overhead.
        *   **Container Engine for Kubernetes (OKE):** A fully managed Kubernetes service for deploying, managing, and scaling containerized applications.
        *   **Functions:** Serverless computing platform that allows you to run code without provisioning or managing servers.
    *   **Storage:**
        *   **Block Volume:** High-performance, persistent block storage

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


 Turn 2:
OCI's compute features are designed to provide a flexible, high-performance, and cost-effective foundation for running a wide variety of workloads, from traditional enterprise applications to modern cloud-native and high-performance computing (HPC) tasks.

Here's an elaboration on the main compute features:

1.  **Virtual Machines (VMs):**
    *   **Description:** Standard virtual servers that run on shared underlying physical hardware. They offer a balance of flexibility, scalability, and cost-effectiveness.
    *   **Key Characteristics:**
        *   **Instance Shapes:** OCI offers a wide array of VM shapes, including:
            *   **Standard Shapes (e.g., VM.Standard.E4.Flex, VM.Standard.A1.Flex):** General-purpose shapes with different CPU architectures (Intel, AMD, Ampere ARM A1) and the ability to customize CPU and memory resources independently (Flex shapes), allowing for precise resource allocation and cost optimization.
            *   **Optimized Shapes:** Shapes
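
Because every turn resends the whole history, long conversations grow quickly. One option, shown here as a sketch rather than anything Llama Stack requires, is to keep only the most recent messages. Both helper names are hypothetical, and the window size is an arbitrary choice.

```python
def add_exchange(history, user_text, assistant_text):
    """Return a new history with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


def trim_history(history, max_messages):
    """Keep at most the last `max_messages` messages, starting on a user turn."""
    trimmed = history[-max_messages:] if max_messages > 0 else []
    # Drop leading non-user messages so the window starts with a user turn.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


history = []
history = add_exchange(history, "What are the main features of OCI?", "(answer 1)")
history = add_exchange(history, "Can you elaborate on the compute features?", "(answer 2)")
history = add_exchange(history, "And storage?", "(answer 3)")

for message in trim_history(history, 3):
    print(message["role"], "->", message["content"])
```

The trimmed list can be passed as `messages` in the same way as `conversation` above, with the next user message appended.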

## 6. Adjusting Parameters

You can control the model's behavior with various parameters:

- **temperature** (0.0-2.0): Controls randomness. Lower = more focused, Higher = more creative
- **max_tokens**: Maximum length of the response
- **stream**: Enable/disable streaming

In [11]:
# Example: Creative response with high temperature
print(" Creative response (temperature=1.5):\n")
response = client.chat.completions.create(
    model=models[0].id,
    messages=[
        {"role": "user", "content": "Tell me a creative story about cloud computing in 50 words."}
    ],
    temperature=1.5,
    max_tokens=4096,
)
print(response.choices[0].message.content)

print("\n" + "-" * 80 + "\n")

# Example: Focused response with low temperature
print(" Focused response (temperature=0.3):\n")
response = client.chat.completions.create(
    model=models[0].id,
    messages=[
        {"role": "user", "content": "What is cloud computing? Be concise."}
    ],
    temperature=0.3,
    max_tokens=4096,
)
print(response.choices[0].message.content)

 Creative response (temperature=1.5):



INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Data ascended into the luminous, silent sky – billions of bits housed within ethereal data banks: the Cloud. Resources scaled, programs ran anywhere, like magic. Want to save a dream, stream a cosmos, or build worlds? With an invisible whisper, your processing wishes flowed freely from the vast collective.

--------------------------------------------------------------------------------

 Focused response (temperature=0.3):



INFO:httpx:HTTP Request: POST http://localhost:8321/v1/chat/completions "HTTP/1.1 200 OK"


Cloud computing delivers on-demand computing services—including servers, storage, databases, networking, software, analytics, and intelligence—over the Internet ("the cloud") on a pay-as-you-go basis.
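
To keep these parameters consistent across calls, they can be validated once and passed as keyword arguments. This is a sketch: `sampling_kwargs` is a hypothetical helper, and the 0.0 to 2.0 temperature range is the one stated in this section, which individual models may narrow.

```python
def sampling_kwargs(temperature=0.7, max_tokens=4096, stream=False):
    """Validate the three parameters and return them as keyword arguments."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    if max_tokens <= 0:
        raise ValueError(f"max_tokens must be positive, got {max_tokens}")
    return {"temperature": temperature, "max_tokens": max_tokens, "stream": bool(stream)}


focused = sampling_kwargs(temperature=0.3)
creative = sampling_kwargs(temperature=1.5)
print(focused)
print(creative)
```

With the client from this notebook, the result would be unpacked into the call, for example `client.chat.completions.create(model=model_id, messages=[...], **focused)`.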


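## 7. Basic Agent

Finally, let's create a simple agent with the `Agent` class, open a session, and stream one turn.

In [12]: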
from llama_stack_client import Agent  # LlamaStackClient was already imported above
# Create a basic agent using the Agent class
agent = Agent(
    client=client,
    model=models[0].id,
    instructions="You are a helpful AI assistant that can answer questions and help with tasks.",
)

print("✅ Created agent successfully")

✅ Created agent successfully


In [13]:
# Create agent session
basic_session_id = agent.create_session(session_name="basic_example_session")

print(f"✅ Created session: {basic_session_id}")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/conversations "HTTP/1.1 200 OK"


✅ Created session: conv_3661f0b6f4a504617c6e47e6b3273687383173bc456b1826


In [14]:
# Send a message to the agent with streaming
query = "What is the capital of England?"

print(f"User: {query}\n")
print("Assistant: ", end='')

# Create a turn with streaming
response = agent.create_turn(
    session_id=basic_session_id,
    messages=[
        {"role": "user", "content": query}
    ],
    stream=True,
)

# Stream the response
output_text = ""
for chunk in response:
    if chunk.event.event_type == "turn_completed":
        output_text = chunk.event.final_text
        break
    elif chunk.event.event_type == "step_progress":
        # Print text deltas as they arrive
        if hasattr(chunk.event.delta, 'text'):
            print(chunk.event.delta.text, end='', flush=True)

print(f"\n✅ Response captured: {len(output_text)} characters")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/responses "HTTP/1.1 200 OK"


User: What is the capital of England?

Assistant: The capital of England is **London**.
✅ Response captured: 37 characters
