Commit graph

644 commits

Author SHA1 Message Date
AN YU (安宇)
dd62a2388c
docs: add notes to websearch tool and two extra example scripts (#1354)
# What does this PR do?

- Adds a note about unexpected Brave Search output appearing even when
Tavily Search is called. This behavior is expected for now and is being
tracked in https://github.com/meta-llama/llama-stack/issues/1229. The
note aims to clear up any confusion for new users.
- Adds two example scripts demonstrating how to build an agent using:
    1. WebSearch tool
    2. WolframAlpha tool
These examples provide new users with an instant understanding of how to
integrate these tools.
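
For reference, here is a minimal sketch of what the WebSearch example script might look like (assuming the `llama_stack_client` `Agent` helper, the `builtin::websearch` tool group, and a Tavily API key passed via `provider_data`; treat the exact names as illustrative):

```python
import os

from llama_stack_client import Agent, LlamaStackClient
from llama_stack_client.lib.agents.event_logger import EventLogger

# Assumes a Llama Stack server on port 8321 and a Tavily Search API key.
client = LlamaStackClient(
    base_url="http://localhost:8321",
    provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
)

agent = Agent(
    client,
    model=os.environ["INFERENCE_MODEL"],
    instructions="You are a helpful assistant. Use the web search tool when needed.",
    tools=["builtin::websearch"],
)

session_id = agent.create_session("websearch-demo")
response = agent.create_turn(
    messages=[{"role": "user", "content": "What is the latest Llama model release?"}],
    session_id=session_id,
    stream=True,
)
for log in EventLogger().log(response):
    log.print()
```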

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
Tested these example scripts using the following steps:

Step 1: `ollama run llama3.2:3b-instruct-fp16 --keepalive 60m`

Step 2:
```
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
```

Step 3: `llama stack run --image-type conda ~/llama-stack/llama_stack/templates/ollama/run.yaml`

Step 4: Run the example script with your API keys.

expected output:

![image](https://github.com/user-attachments/assets/308ddb17-a087-4cf2-8622-b085174ea0ab)

![image](https://github.com/user-attachments/assets/639f239f-8966-433d-943c-ee6b304c0d71)


[//]: # (## Documentation)
2025-04-17 20:20:52 -04:00
Sébastien Han
cb874287a4
fix: resync api spec (#1987) 2025-04-17 11:36:04 -04:00
Alexey Rybak
326cbba579
feat(agents): add agent naming functionality (#1922)
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.

Closes #1832

## Test Plan

- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly

## Use Example
```
agent = Agent(
    llama_stack_client, 
    model=text_model_id, 
    name="CustomerSupportAgent",  # New parameter
    instructions="You are a helpful customer support assistant"
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```

## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent

---------

Co-authored-by: raghotham <raghotham@gmail.com>
2025-04-17 07:02:47 -07:00
Ben Browning
5b8e75b392
fix: OpenAI spec cleanup for assistant requests (#1963)
# What does this PR do?

Some of our multi-turn verification tests were failing because I had
accidentally marked `content` as a required field in the OpenAI chat
completion request assistant messages, but it's actually optional. It is
required for messages from other roles, but for assistant messages it is
explicitly allowed to be optional.

Similarly, the assistant message tool_calls field should default to None
instead of an empty list.
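
For illustration, the shape of the change in the assistant-message request model looks roughly like this (class and field names here are approximations of the actual spec types):

```python
from pydantic import BaseModel

class OpenAIAssistantMessageParam(BaseModel):
    role: str = "assistant"
    # content is optional for assistant messages, unlike user/system messages
    content: str | None = None
    # tool_calls defaults to None rather than an empty list
    tool_calls: list[dict] | None = None
```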

These two changes get the openai-llama-stack verification test back to
100% passing, just like it passes 100% when not behind Llama Stack. They
also increase the pass rate of some of the other providers in the
verification test, but don't get them to 100%.

## Test Plan

I started a Llama Stack server setup to run all the verification tests
(requires OPENAI_API_KEY env variable)

```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```

Then, I manually ran the verification tests to see which were failing,
fix them, and ran them again after these changes to ensure they were all
passing.

```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=openai-llama-stack
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-17 06:56:10 -07:00
Matthew Farrellee
4205376653
chore: add meta/llama-3.3-70b-instruct as supported nvidia inference provider model (#1985)
see https://build.nvidia.com/meta/llama-3_3-70b-instruct
2025-04-17 06:50:40 -07:00
Jash Gulabrai
2ae1d7f4e6
docs: Add NVIDIA platform distro docs (#1971)
# What does this PR do?
Add NVIDIA platform docs that serve as a starting point for Llama Stack
users and explains all supported microservices.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-17 05:54:30 -07:00
Francisco Arceo
00b232c282
chore: Fix to persist the theme preference across page navigation. (#1974)
# What does this PR do?
This PR persists the theme preference across page navigation.

Currently, if the default theme is detected, it is used. 

But if a user flips **_the default theme_** and goes to a new page, the
theme will switch back to the default.

This resolves that issue.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-16 13:58:25 -07:00
Chirag Modi
fb8ff77ff2
docs: 0.2.2 doc updates (#1961)
Add updates to android site readme for 0.2.2
2025-04-15 13:26:17 -07:00
Dmitry Rogozhkin
71ed47ea76
docs: add example for intel gpu in vllm remote (#1952)
# What does this PR do?

PR adds instructions to setup vLLM remote endpoint for vllm-remote llama
stack distribution.

## Test Plan

* Verified with manual tests of the configured vllm-remote against vllm
endpoint running on the system with Intel GPU
* Also verified with ci pytests (see cmdline below). Test passes in the
same capacity as it does on the A10 Nvidia setup (some tests do fail
which seems to be known issues with vllm remote llama stack
distribution)

```
pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=http://localhost:5001 \
   --text-model=meta-llama/Llama-3.2-3B-Instruct
```

CC: @ashwinb

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-04-15 07:56:23 -07:00
Ben Browning
7641a5cd0b
fix: 100% OpenAI API verification for together and fireworks (#1946)
# What does this PR do?

TLDR: Changes needed to get the OpenAI API verification tests passing at
100% when run against Llama Stack with the `together`, `fireworks`, and
`openai` providers. And `groq` is better than before, at 88% passing.

This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.

As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.

The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
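
For context, a simplified sketch of the kind of recursive cleanup described above (not the exact implementation of `prepare_openai_completion_params`):

```python
from typing import Any

from pydantic import BaseModel

def _clean(value: Any) -> Any:
    # Recursively drop None entries and dump Pydantic models to plain dicts
    if isinstance(value, BaseModel):
        return _clean(value.model_dump(exclude_none=True))
    if isinstance(value, dict):
        return {k: _clean(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [_clean(v) for v in value]
    return value

def prepare_params(**params: Any) -> dict[str, Any]:
    # Keep only parameters that were actually provided
    return {k: _clean(v) for k, v in params.items() if v is not None}
```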

With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.

And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.

## Test Plan

### OpenAI API Verification Tests

I ran the OpenAI API verification tests as below and 100% of the tests
passed.

First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's no template set
up to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` for it.

Before starting the server, ensure you have the necessary API key
environment variables set:

```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```

Then, run a Llama Stack server that serves up all these providers:

```
llama stack run \
      --image-type venv \
      tests/verifications/openai-api-verification-run.yaml
```

Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.

```
python tests/verifications/generate_report.py \
      --run-tests \
      --provider \
        together \
        fireworks \
        groq \
        openai \
        together-llama-stack \
        fireworks-llama-stack \
        groq-llama-stack \
        openai-llama-stack
```

You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.

### OpenAI Completion Integration Tests with vLLM:

I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) against several of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here are the tests I ran, all passing:

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### OpenAI Completion Integration Tests with ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

### OpenAI Completion Integration Tests with together.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```

### OpenAI Completion Integration Tests with fireworks.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
```

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-14 08:56:29 -07:00
Sébastien Han
68eeacec0e
docs: resync missing nvidia doc (#1947)
# What does this PR do?

Resync doc.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-14 15:09:16 +02:00
Sébastien Han
69554158fa
feat: add health to all providers through providers endpoint (#1418)
The `/v1/providers` endpoint now reports the health status of each
provider, when implemented.

```
curl -L http://127.0.0.1:8321/v1/providers|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4072  100  4072    0     0   246k      0 --:--:-- --:--:-- --:--:--  248k
{
  "data": [
    {
      "api": "inference",
      "provider_id": "ollama",
      "provider_type": "remote::ollama",
      "config": {
        "url": "http://localhost:11434"
      },
      "health": {
        "status": "OK"
      }
    },
    {
      "api": "vector_io",
      "provider_id": "faiss",
      "provider_type": "inline::faiss",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/faiss_store.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "safety",
      "provider_id": "llama-guard",
      "provider_type": "inline::llama-guard",
      "config": {
        "excluded_categories": []
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "agents",
      "provider_id": "meta-reference",
      "provider_type": "inline::meta-reference",
      "config": {
        "persistence_store": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/agents_store.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "telemetry",
      "provider_id": "meta-reference",
      "provider_type": "inline::meta-reference",
      "config": {
        "service_name": "llama-stack",
        "sinks": "console,sqlite",
        "sqlite_db_path": "/Users/leseb/.llama/distributions/ollama/trace_store.db"
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "eval",
      "provider_id": "meta-reference",
      "provider_type": "inline::meta-reference",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/meta_reference_eval.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "datasetio",
      "provider_id": "huggingface",
      "provider_type": "remote::huggingface",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/huggingface_datasetio.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "datasetio",
      "provider_id": "localfs",
      "provider_type": "inline::localfs",
      "config": {
        "kvstore": {
          "type": "sqlite",
          "namespace": null,
          "db_path": "/Users/leseb/.llama/distributions/ollama/localfs_datasetio.db"
        }
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "scoring",
      "provider_id": "basic",
      "provider_type": "inline::basic",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "scoring",
      "provider_id": "llm-as-judge",
      "provider_type": "inline::llm-as-judge",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "scoring",
      "provider_id": "braintrust",
      "provider_type": "inline::braintrust",
      "config": {
        "openai_api_key": "********"
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "brave-search",
      "provider_type": "remote::brave-search",
      "config": {
        "api_key": "********",
        "max_results": 3
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "tavily-search",
      "provider_type": "remote::tavily-search",
      "config": {
        "api_key": "********",
        "max_results": 3
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "code-interpreter",
      "provider_type": "inline::code-interpreter",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "rag-runtime",
      "provider_type": "inline::rag-runtime",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "model-context-protocol",
      "provider_type": "remote::model-context-protocol",
      "config": {},
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    },
    {
      "api": "tool_runtime",
      "provider_id": "wolfram-alpha",
      "provider_type": "remote::wolfram-alpha",
      "config": {
        "api_key": "********"
      },
      "health": {
        "status": "Not Implemented",
        "message": "Provider does not implement health check"
      }
    }
  ]
}
```

Per provider, too:

```
curl -L http://127.0.0.1:8321/v1/providers/ollama
{"api":"inference","provider_id":"ollama","provider_type":"remote::ollama","config":{"url":"http://localhost:11434"},"health":{"status":"OK"}}
```
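
The same check from Python might look roughly like this (assuming the `providers` resource exposed by `llama_stack_client`):

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List all providers along with their reported health
for provider in client.providers.list():
    print(provider.provider_id, provider.health)

# Or inspect a single provider
print(client.providers.retrieve("ollama").health)
```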

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-14 11:59:36 +02:00
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, set up routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
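
A rough sketch of how a client might call the new methods (parameter and field names such as `messages_batch` and `response.batch` are taken from the description above and should be treated as assumptions):

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Each inner list is one independent conversation; all are evaluated in one batch
response = client.inference.batch_chat_completion(
    model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages_batch=[
        [{"role": "user", "content": "What is the capital of France?"}],
        [{"role": "user", "content": "Write a haiku about batching."}],
    ],
)
for completion in response.batch:
    print(completion.completion_message.content)
```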

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Nathan Weinberg
854c2ad264
fix: misleading help text for 'llama stack build' and 'llama stack run' (#1910)
# What does this PR do?
The current help text for 'llama stack build' and 'llama stack run' says
that if no argument is passed to '--image-name', the active Conda
environment will be used.

In reality, the active environment is used whether it comes from conda,
virtualenv, etc.

## Test Plan
N/A

## Documentation
N/A

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-04-12 01:19:11 -07:00
Charlie Doern
0751a960a5
feat: make training config fields optional (#1861)
# What does this PR do?

Today, `supervised_fine_tune` itself and the `TrainingConfig` class have a
bunch of required fields that a provider implementation might not need.

For example, if a provider wants to handle hyperparameters in its
configuration, as well as any type of dataset retrieval, optimizer or
LoRA config, a user will still need to pass in a virtually empty
`DataConfig`, `OptimizerConfig` and `AlgorithmConfig` in some cases.

Many of these fields are intended specifically for Llama models and for
knobs used when customizing the inline providers.

Adding remote post_training providers will require loosening these
arguments, or else forcing users to pass in empty objects just to
satisfy the Pydantic models.
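
As a hedged illustration of the intent, a remote provider that manages data and optimizer settings in its own configuration could then be driven with a minimal call like the sketch below (argument names follow the current client API but should be treated as assumptions):

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# No DataConfig/OptimizerConfig/AlgorithmConfig needed once those fields are optional
job = client.post_training.supervised_fine_tune(
    job_uuid="my-finetune-job",
    model="meta-llama/Llama-3.2-3B-Instruct",
    training_config={"n_epochs": 1},
    hyperparam_search_config={},
    logger_config={},
)
print(job)
```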

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-04-12 01:13:45 -07:00
Aidan Reilly
51492bd9b6
docs: Update docs and fix warning in start-stack.sh (#1937)
Small docs update and an update for `start-stack.sh` to fix missing color
and if-statement logic.

# What does this PR do?
1. Makes a small change to start-stack.sh to resolve this error:
```cmd
/home/aireilly/.local/lib/python3.13/site-packages/llama_stack/distribution/start_stack.sh: line 76: [: missing ]'
```
2. Adds a missing $GREEN colour to start-stack.sh
3. Updates `docs/source/getting_started/detailed_tutorial.md` with some
small changes and corrections.

## Test Plan
Procedures described in
`docs/source/getting_started/detailed_tutorial.md` were verified on
Linux Fedora 41.
2025-04-11 16:26:17 -07:00
raghotham
ed58a94b30
docs: fixes to quick start (#1943)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Co-authored-by: Francisco Arceo <farceo@redhat.com>
2025-04-11 13:41:23 -07:00
Ben Browning
2b2db5fbda
feat: OpenAI-Compatible models, completions, chat/completions (#1894)
# What does this PR do?

This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.

The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)

The goal is to support this for every inference provider, proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.

This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.

## Test Plan

### vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```



## Documentation

Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
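
For example, with the official `openai` Python client (the model id is just whichever model your distribution serves):

```python
from openai import OpenAI

# Point an ordinary OpenAI client at the Llama Stack compatibility endpoint
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # any model registered with your stack
    messages=[{"role": "user", "content": "Hello from an OpenAI client!"}],
)
print(response.choices[0].message.content)
```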

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-11 13:14:17 -07:00
Francisco Arceo
24d70cedca
docs: Updated docs to show minimal RAG example and some other minor changes (#1935)
# What does this PR do?
Incorporating some feedback into the docs.

- **`docs/source/getting_started/index.md`:**
    - Demo actually does RAG now
    - Simplified the installation command for dependencies.
    - Updated demo script examples to align with the latest API changes.
- Replaced manual document manipulation with `RAGDocument` for clarity
and maintainability.
- Introduced new logic for model and embedding selection using the Llama
Stack Client SDK.
- Enhanced examples to showcase proper agent initialization and logging.
- **`docs/source/getting_started/detailed_tutorial.md`:**
- Updated the section for listing models to include proper code
formatting with `bash`.
    - Removed and reorganized the "Run the Demos" section for clarity.
- Adjusted tab-item structures and added new instructions for demo
scripts.
- **`docs/_static/css/my_theme.css`:**
- Updated heading styles to include `h2`, `h3`, and `h4` for consistent
font weight.
- Added a new style for `pre` tags to wrap text and break long words,
which is particularly useful for rendering long output from generation.

    
## Test Plan
Tested locally. Screenshot for reference:

<img width="1250" alt="Screenshot 2025-04-10 at 10 12 12 PM"
src="https://github.com/user-attachments/assets/ce1c8986-e072-4c6f-a697-ed0d8fb75b34"
/>

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-11 11:50:36 -07:00
Mark Campbell
6aa459b00c
docs: fix errors in kubernetes deployment guide (#1914)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
Fixes a couple of errors in PVC/Secret setup and adds context for
expected Hugging Face token
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-04-11 13:04:13 +02:00
Francisco Arceo
49955a06b1
docs: Update quickstart page to structure things a little more for the novices (#1873)
# What does this PR do?
Another doc enhancement for
https://github.com/meta-llama/llama-stack/issues/1818

Summary of changes:
- `docs/source/distributions/configuration.md`
   - Updated dropdown title to include a more user-friendly description.

- `docs/_static/css/my_theme.css`
   - Added styling for `<h3>` elements to set a normal font weight.

- `docs/source/distributions/starting_llama_stack_server.md`
- Changed section headers from bold text to proper markdown headers
(e.g., `##`).
- Improved descriptions for starting Llama Stack server using different
methods (library, container, conda, Kubernetes).
- Enhanced clarity and structure by converting instructions into
markdown headers and improved formatting.

- `docs/source/getting_started/index.md`
   - Major restructuring of the "Quick Start" guide:
- Added new introductory section for Llama Stack and its capabilities.
- Reorganized steps into clearer subsections with proper markdown
headers.
- Replaced dropdowns with tabbed content for OS-specific instructions.
- Added detailed steps for setting up and running the Llama Stack server
and client.
- Introduced new sections for running basic inference and building
agents.
- Enhanced readability and visual structure with emojis, admonitions,
and examples.

- `docs/source/providers/index.md`
   - Updated the list of LLM inference providers to include "Ollama."
   - Expanded the list of vector databases to include "SQLite-Vec."

Let me know if you need further details!

## Test Plan
Renders locally, included screenshot.

# Documentation

For https://github.com/meta-llama/llama-stack/issues/1818

<img width="1332" alt="Screenshot 2025-04-09 at 11 07 12 AM"
src="https://github.com/user-attachments/assets/c106efb9-076c-4059-a4e0-a30fa738585b"
/>

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-10 14:09:00 -07:00
Francisco Arceo
09a83b1ec1
docs: Updating background color for code in darkmode (#1930)
# What does this PR do?
A small quality of life adjustment to make the code background for
darkmode black. Makes it much easier to differentiate between code and
non-code text.

From:
<img width="1250" alt="Screenshot 2025-04-10 at 9 22 23 AM"
src="https://github.com/user-attachments/assets/3a3aea8b-e540-4e76-a7db-6c276e389cc2"
/>
To:
<img width="1273" alt="Screenshot 2025-04-10 at 9 22 43 AM"
src="https://github.com/user-attachments/assets/6ada2cb1-2c33-4a95-be88-7b4c65d4ba93"
/>

The CSS was sourced from here:
https://github.com/MrDogeBro/sphinx_rtd_dark_mode/blob/main/sphinx_rtd_dark_mode/static/dark_mode_css/dark.css

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-10 09:38:57 -07:00
Sébastien Han
1f2df59ece
docs: fix model name (#1926)
# What does this PR do?

Use llama3.2:3b for consistency.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-10 09:37:48 -07:00
Yuan Tang
1be66d754e
docs: Redirect instructions for additional hardware accelerators for remote vLLM provider (#1923)
# What does this PR do?

vLLM website just added a [new index page for installing for different
hardware
accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html).
This PR adds a link to that page with additional edits to make sure
readers are aware that the use of GPUs on this page are for
demonstration purposes only.

This closes https://github.com/meta-llama/llama-stack/issues/1813.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-10 10:04:17 +02:00
Yuan Tang
712c6758c6
docs: Avoid bash script syntax highlighting for dark mode (#1918)
See
https://github.com/meta-llama/llama-stack/pull/1913#issuecomment-2790153778

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-09 15:43:43 -07:00
Sébastien Han
770b38f8b5
chore: simplify running the demo UI (#1907)
# What does this PR do?

* Manage UI deps in pyproject
* Use a new "ui" dep group to pull the deps with "uv"
* Simplify the run command
* Bump versions in requirements.txt

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-09 11:22:29 -07:00
Francisco Arceo
b93318e40b
chore: Detect browser setting for dark/light mode and set default to light mode (#1913)
# What does this PR do?

1. Adding some lightweight JS to detect the default browser setting for
dark/light mode
2. Setting the default screen setting to light mode so as not to change
the default behavior.

From the docs: https://github.com/MrDogeBro/sphinx_rtd_dark_mode

>This lets you choose which theme the user sees when they load the docs
for the first time ever. After the first time however, this setting has
no effect as the users preference is stored in local storage within
their browser. This option accepts a boolean for the value. If this
option is true (the default option), users will start in dark mode when
first visiting the site. If this option is false, users will start in
light mode when they first visit the site.

# Closes #1915 

## Test Plan
Tested locally on my Mac on Safari and Chrome.

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-09 12:40:56 -04:00
Matthew Farrellee
a2cf299906
fix: update getting started guide to use ollama pull (#1855)
# What does this PR do?

Download the getting-started-with-Ollama model (via `ollama pull`)
instead of downloading and running it.

Directly running it was necessary before
https://github.com/meta-llama/llama-stack/pull/1854

## Test Plan

Run the code on the page.
2025-04-09 10:35:19 +02:00
Sébastien Han
389767010b
feat: ability to execute external providers (#1672)
# What does this PR do?

Providers that live outside of the llama-stack codebase are now
supported.
A new property `external_providers_dir` has been added to the main
config and can be configured as follows:

```
external_providers_dir: /etc/llama-stack/providers.d/
```

Where the expected structure is:

```
providers.d/
  inference/
    custom_ollama.yaml
    vllm.yaml
  vector_io/
    qdrant.yaml
```

Where `custom_ollama.yaml` is:

```
adapter:
  adapter_type: custom_ollama
  pip_packages: ["ollama", "aiohttp"]
  config_class: llama_stack_ollama_provider.config.OllamaImplConfig
  module: llama_stack_ollama_provider
api_dependencies: []
optional_api_dependencies: []
```

Obviously the package must be installed on the system; here is the
`llama_stack_ollama_provider` example:

```
$ uv pip show llama-stack-ollama-provider
Using Python 3.10.16 environment at: /Users/leseb/Documents/AI/llama-stack/.venv
Name: llama-stack-ollama-provider
Version: 0.1.0
Location: /Users/leseb/Documents/AI/llama-stack/.venv/lib/python3.10/site-packages
Editable project location: /private/var/folders/mq/rnm5w_7s2d3fxmtkx02knvhm0000gn/T/tmp.ZBHU5Ezxg4/ollama/llama-stack-ollama-provider
Requires:
Required-by:
```

Closes: https://github.com/meta-llama/llama-stack/issues/658

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-09 10:30:41 +02:00
AlexHe99
983f6feeb8
docs: Update remote-vllm.md with AMD GPU vLLM server supported. (#1858)
Add content on using an AMD GPU for the vLLM server. Split the original
section into two sub-chapters:
1. AMD vLLM server
2. NVIDIA vLLM server (original)

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Signed-off-by: Alex He <alehe@amd.com>
2025-04-08 21:35:32 -07:00
ehhuang
7b4eb0967e
test: verification on provider's OAI endpoints (#1893)
# What does this PR do?


## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
2025-04-07 23:06:28 -07:00
Matthew Farrellee
c52ccc4bbd
docs: update importing_as_library.md (#1863)
LlamaStackAsLibraryClient.initialize is not async and cannot be awaited.
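
In other words, the call should be made synchronously, e.g.:

```python
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("ollama")
client.initialize()  # plain call, not `await client.initialize()`
```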
2025-04-07 12:31:04 +02:00
ehhuang
378f0de439
docs: llama4 getting started nb (#1878)
# What does this PR do?


## Test Plan
2025-04-06 18:51:34 -07:00
raghotham
fd7ab37c14
docs: fixing sphinx imports (#1884)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-04-05 14:21:45 -07:00
Ashwin Bharambe
b8f1561956
feat: introduce llama4 support (#1877)
As the title says. Details are in the README and elsewhere.
2025-04-05 11:53:35 -07:00
Francisco Arceo
23a99a4b22
docs: Minor updates to docs to make them a little friendlier to new users (#1871)
# What does this PR do?
This PR modifies some of the docs to help them (1) map to the mental
model of software engineers building AI applications, starting with RAG
and then moving to Agents, and (2) align the navbar somewhat more
closely with the diagram on the home page.

## Test Plan
N/A Tested locally.

# Documentation
Take a look at the screenshots below, before and after.
## Before 
![Screenshot 2025-04-03 at 10 39
32 PM](https://github.com/user-attachments/assets/c4dc9998-3e46-43b0-8425-892c94ec3a6a)

## After
![Screenshot 2025-04-03 at 10 38
37 PM](https://github.com/user-attachments/assets/05670fcd-e56b-42dd-8af2-07b81f941d40)

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-04 08:10:35 -04:00
Francisco Arceo
19f504e9e2
docs: Updating docs to source from CONTRIBUTING.md (#1850)
# What does this PR do?
Another for https://github.com/meta-llama/llama-stack/issues/1815

This links the `CONTRIBUTING.md` file directly so that we don't have to
maintain two different files.

Also I updated the title for RAG under Building AI Applications.

## Changes 
Here is what the Contributing page looks like, as proof that it sources
directly from the markdown file.

![Screenshot 2025-04-01 at 12 43
51 AM](https://github.com/user-attachments/assets/f7021d29-eec3-44ad-a5b3-55c4480ea9ac)

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-01 14:50:04 +02:00
Ihar Hrachyshka
0a895c70d1
fix(api): don't return list for runtime tools (#1686)
# What does this PR do?

Don't return a bare list for runtime tools. Instead, return a Response
object, for pagination and consistency with other APIs.

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-01 09:53:11 +02:00
Sébastien Han
2ffa2b77ed
refactor: extract pagination logic into shared helper function (#1770)
# What does this PR do?

Move pagination logic from LocalFS and HuggingFace implementations into
a common helper function to ensure consistent pagination behavior across
providers. This reduces code duplication and centralizes pagination
logic in one place.
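
A simplified sketch of the kind of shared helper this describes (the actual function name and response type in the tree may differ):

```python
from typing import Any

def paginate_records(
    records: list[dict[str, Any]],
    start_index: int | None = None,
    limit: int | None = None,
) -> dict[str, Any]:
    # Apply the same start_index/limit semantics for every datasetio provider
    start = start_index or 0
    end = len(records) if limit is None or limit < 0 else start + limit
    page = records[start:end]
    return {"data": page, "has_more": end < len(records)}
```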


## Test Plan

Run this script:

```
from llama_stack_client import LlamaStackClient

# Initialize the client
client = LlamaStackClient(base_url="http://localhost:8321")

# Register a dataset
response = client.datasets.register(
    purpose="eval/messages-answer",  # or "eval/question-answer" or "post-training/messages"
    source={"type": "uri", "uri": "huggingface://datasets/llamastack/simpleqa?split=train"},
    dataset_id="my_dataset",  # optional, will be auto-generated if not provided
    metadata={"description": "My evaluation dataset"},  # optional
)

# Verify the dataset was registered by listing all datasets
datasets = client.datasets.list()
print(f"Registered datasets: {[d.identifier for d in datasets]}")

# You can then access the data using the datasetio API
# rows = client.datasets.iterrows(dataset_id="my_dataset", start_index=1, limit=2)
rows = client.datasets.iterrows(dataset_id="my_dataset")
print(f"Data: {rows.data}")
```

And play with `start_index` and `limit`.

[//]: # (## Documentation)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-31 13:08:29 -07:00
Francisco Arceo
d495922949
docs: Updated documentation and Sphinx configuration (#1845)
# What does this PR do?

The goal of this PR is to make the pages easier to navigate by surfacing
the child pages on the navbar, updating some of the copy, moving some of
the files around.

Some changes:
1. Clarifying Titles
2. Restructuring "Distributions" more formally in its own page to be
consistent with Providers and adding some clarity to the child pages to
surface them and make them easier to navigate
3. Updated sphinx config to not collapse navigation by default
4. Updated copyright year to be calculated dynamically 
5. Moved `docs/source/distributions/index.md` ->
`docs/source/distributions/starting_llama_stack_server.md`

Another for https://github.com/meta-llama/llama-stack/issues/1815

## Test Plan
Tested locally and pages build (screen shots for example).

## Documentation
###  Before:
![Screenshot 2025-03-31 at 1 09
21 PM](https://github.com/user-attachments/assets/98e34f76-f0d9-4055-8e2c-441b1e7d8f6a)

### After:
![Screenshot 2025-03-31 at 1 08
52 PM](https://github.com/user-attachments/assets/dfb6b8ad-3a1d-46b6-8f54-0c553664093f)

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-03-31 13:08:05 -07:00
Francisco Arceo
9b478f3756
docs: Adding darkmode to documentation (#1843)
# What does this PR do?
docs: Adding darkmode to documentation


## Test Plan
Tested locally. 

Here's the look:
![Screenshot 2025-03-31 at 9 43
05 AM](https://github.com/user-attachments/assets/5989dbc8-ba03-4710-ad8d-6d4b9ac79786)


## Issues

Related to https://github.com/meta-llama/llama-stack/issues/1815 

Closes https://github.com/meta-llama/llama-stack/issues/1844

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-03-31 08:31:53 -07:00
Anamika
d8a8a734b5
fix: update sink name for traces and metrics in LlamaStack 0.1.8 (#1836)
# What does this PR do?
This PR updates the sink name configuration for traces and metrics in
LlamaStack to align with the latest changes introduced in version 0.1.8.
Previously, when using the `otel` sink along with other sinks (like
`console` and `sqlite`), the system threw a **ValueError**, with the
message:

```shell
Value error, 'otel' is not a valid TelemetrySink [type=value_error, input_value='console,otel,sqlite', input_type=str]
For further information visit https://errors.pydantic.dev/2.10/v/value_error
``` 

## Test Plan
- **Test 1:**  
Ran the LlamaStack server with a configuration containing
`console,otel,sqlite` as sinks.
   - **Expected result:** No errors related to invalid sink names.
   - **Result:** The system ran without throwing a `ValueError`.

- **Test 2:**  
Verified that the `otel_trace` and `otel_metric` sinks now work in
combination with other sinks (`console`, `sqlite`).
- **Expected result:** Telemetry data is correctly sent to all specified
sinks without errors.
- **Result:** All telemetry data was successfully sent to the specified
sinks.
2025-03-29 10:09:08 -07:00
Francisco Arceo
37b6da37ba
docs: Document sqlite-vec faiss comparison (#1821)
# What does this PR do?
This PR documents and benchmarks the performance tradeoffs between
sqlite-vec and FAISS inline VectorDB providers.

# Closes https://github.com/meta-llama/llama-stack/issues/1165

## Test Plan

The test was run using this script:

<details>
<summary>CLICK TO SHOW SCRIPT 👋  </summary>

```python

import cProfile
import os
import uuid
import time
import random
import string
import matplotlib.pyplot as plt
import pandas as pd
from termcolor import cprint
from llama_stack_client.types import Document
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
from memory_profiler import profile
from line_profiler import LineProfiler

os.environ["INFERENCE_MODEL"] = "llama3.2:3b-instruct-fp16"
os.environ["LLAMA_STACK_CONFIG"] = "ollama"

def generate_random_chars(count=400):
    return ''.join(random.choices(string.ascii_letters, k=count))

def generate_documents(num_docs: int, num_chars: int):
    documents = [
        Document(
            document_id=f"doc-{i}",
            content=f"Document content for document {i} - {generate_random_chars(count=num_chars)}",
            mime_type="text/plain",
            metadata={},
        )
        for i in range(num_docs)
    ]
    return documents


@profile
def benchmark_write(client, vector_db_id, documents, batch_size=100):
    write_times = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        start_time = time.time()
        client.tool_runtime.rag_tool.insert(
            documents=batch,
            vector_db_id=vector_db_id,
            chunk_size_in_tokens=512,
        )
        end_time = time.time()
        write_times.append(end_time - start_time)

    return write_times

@profile
def benchmark_read(client, provider_id, vector_db_id, user_prompts):
    response_times = []
    for prompt in user_prompts:
        start_time = time.time()
        response = client.vector_io.query(
            vector_db_id=vector_db_id,
            query=prompt,
        )
        end_time = time.time()
        response_times.append(end_time - start_time)
    return response_times

def profile_functions():
    profiler = LineProfiler()
    profiler.add_function(benchmark_write)
    profiler.add_function(benchmark_read)
    return profiler


def plot_results(output, batch_size):
    # Create a DataFrame for easy manipulation
    df_sqlite = pd.DataFrame(output['sqlite-vec'])
    df_faiss = pd.DataFrame(output['faiss'])

    df_sqlite['write_times'] *= 1000
    df_faiss['write_times'] *= 1000

    avg_write_sqlite = df_sqlite['write_times'].mean()
    avg_write_faiss = df_faiss['write_times'].mean()
    avg_read_sqlite = df_sqlite['read_times'].mean()
    avg_read_faiss = df_faiss['read_times'].mean()

    plt.figure(figsize=(12, 6))
    plt.hist(df_sqlite['write_times'], bins=10, alpha=0.5, color='blue', label='sqlite-vec Write Times')
    plt.hist(df_faiss['write_times'], bins=10, alpha=0.5, color='red', label='faiss Write Times')
    plt.axvline(avg_write_sqlite, color='blue', linestyle='--',
                label=f'Average Write Time (sqlite-vec): {avg_write_sqlite:.3f} ms')
    plt.axvline(avg_write_faiss, color='red', linestyle='--',
                label=f'Average Write Time (faiss): {avg_write_faiss:.3f} ms')
    plt.title(f'Histogram of Write Times for sqlite-vec and faiss\nn = {df_faiss.shape[0]} with batch size = {batch_size}')
    plt.xlabel('Time (milliseconds)')
    plt.ylabel('Density')
    plt.legend()
    plt.savefig('write_time_comparison.png')
    plt.close()

    plt.figure(figsize=(12, 6))
    plt.hist(df_sqlite['read_times'], bins=10, alpha=0.5, color='blue', label='sqlite-vec Read Times')
    plt.hist(df_faiss['read_times'], bins=10, alpha=0.5, color='red', label='faiss Read Times')
    plt.axvline(avg_read_sqlite, color='blue', linestyle='--',
                label=f'Average Read Time (sqlite-vec): {avg_read_sqlite:.3f} ms')
    plt.axvline(avg_read_faiss, color='red', linestyle='--',
                label=f'Average Read Time (faiss): {avg_read_faiss:.3f} ms')
    plt.title(f'Histogram of Read Times for sqlite-vec and faiss\nn = {df_faiss.shape[0]}')
    plt.xlabel('Time (milliseconds)')
    plt.ylabel('Density')
    plt.legend()
    plt.savefig('read_time_comparison.png')
    plt.close()

    plt.figure(figsize=(12, 6))
    plt.plot(df_sqlite.index, df_sqlite['write_times'],
             marker='o', markersize=4, linestyle='-', color='blue',
             label='sqlite-vec Write Times')
    plt.plot(df_faiss.index, df_faiss['write_times'],
             marker='x', markersize=4, linestyle='-', color='red',
             label='faiss Write Times')

    plt.title(f'Write Times by Operation Sequence\n(batch size = {batch_size})')
    plt.xlabel('Write Operation Sequence')
    plt.ylabel('Time (milliseconds)')
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.tight_layout()
    plt.savefig('write_time_sequence.png')
    plt.close()
    # Print out the summary table
    print("\nPerformance Summary for sqlite-vec:")
    print(df_sqlite)

    # Print out the summary table
    print("\nPerformance Summary for faiss:")
    print(df_faiss)


def main():
    # Initialize the client
    client = LlamaStackAsLibraryClient("ollama")
    vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
    _ = client.initialize()

    # Generate a large dataset
    num_chars = 50
    num_docs = 100
    num_writes = 100
    write_batch_size = 100
    num_reads = 100

    documents = generate_documents(num_docs * write_batch_size, num_chars)
    user_prompts = [
        f"Tell me about document {i}" for i in range(1, num_reads + 1)
    ]

    providers = ["sqlite-vec", "faiss"]
    output = {
        provider_id: {"write_times": None, "read_times": None} for provider_id in providers
    }

    # Benchmark writes and reads for SQLite and Faiss
    for provider_id in providers:
        cprint(f"Benchmarking provider: {provider_id}", "yellow")
        client.vector_dbs.register(
            provider_id=provider_id,
            vector_db_id=vector_db_id,
            embedding_model="all-MiniLM-L6-v2",
            embedding_dimension=384,
        )
        write_times = benchmark_write(client, vector_db_id, documents, write_batch_size)

        average_write_time_ms = sum(write_times) / len(write_times) * 1000.
        cprint(f"Average write time for {provider_id} is {average_write_time_ms:.2f} milliseconds for {num_writes} runs", "blue")

        cprint(f"Benchmarking reads for provider: {provider_id}", "yellow")
        read_times = benchmark_read(client, provider_id, vector_db_id, user_prompts)

        average_read_time_ms = sum(read_times) / len(read_times) * 1000.
        cprint(f"Average read time for {provider_id} is {average_read_time_ms:.2f} milliseconds for {num_reads} runs", "blue")

        client.vector_dbs.unregister(vector_db_id=vector_db_id)
        output[provider_id]['write_times'] = write_times
        output[provider_id]['read_times'] = read_times
    # Generate plots and summary
    plot_results(output, write_batch_size)


if __name__ == "__main__":
    cProfile.run('main()', 'profile_output.prof')
```
</details>

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-03-28 17:41:33 +01:00
Ihar Hrachyshka
18bac27d4e
fix: Use CONDA_DEFAULT_ENV presence as a flag to use conda mode (#1555)
# What does this PR do?

This is the second attempt to switch to system packages by default, now
with a hack to detect a conda environment, in which case the conda
image-type is used.

Note: Conda will only be used when --image-name is unset *and*
CONDA_DEFAULT_ENV is set. This means that users without conda will
correctly fall back to using system packages when no --image-* arguments
are passed at all.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Uses virtualenv:

```
$ llama stack build --template ollama --image-type venv
$ llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using virtual environment: /home/ec2-user/src/llama-stack/schedule/.local
[...]
```

Uses system packages (virtualenv already initialized):

```
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
INFO     2025-03-27 20:46:22,882 llama_stack.cli.stack.run:142 server: No image type or image name provided. Assuming environment packages.
[...]
```

Attempt to run from environment packages without necessary packages
installed:
```
$ python -m venv barebones
$ . ./barebones/bin/activate
$ pip install -e . # to install llama command
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
ModuleNotFoundError: No module named 'fastapi'
```

^ failed as expected because the environment doesn't have necessary
packages installed.

Now install some packages in the new environment:

```
$ pip install fastapi opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp aiosqlite ollama openai datasets faiss-cpu mcp autoevals
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```

Now see if setting CONDA_DEFAULT_ENV will change what happens by
default:

```
$ export CONDA_DEFAULT_ENV=base
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using conda environment: base
Conda environment base does not exist.
[...]
```

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-27 17:13:22 -04:00
Xi Yan
b5c27f77ad
chore: clean up distro doc (#1804)
# What does this PR do?
- hide distro doc (docker needs to be thoroughly tested). 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- docs

[//]: # (## Documentation)
2025-03-27 12:12:14 -07:00
Ihar Hrachyshka
81393afb35
chore: require data field for all List*Response models (#1799)
# What does this PR do?

No violators are currently in-tree. This is just hardening the api specs
for future consistency.
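
For illustration, the convention being enforced looks roughly like this (a hypothetical model; the real response classes live in the API specs):

```python
from pydantic import BaseModel

class ToolDef(BaseModel):
    name: str
    description: str | None = None

class ListToolDefsResponse(BaseModel):
    # Every List*Response wraps its items in a required `data` field
    data: list[ToolDef]
```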

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-27 18:15:16 +01:00
Dmitry Rogozhkin
935e706b15
docs: fix remote-vllm instructions (#1805)
# What does this PR do?

* Fix location of `run.yaml` relative to the cloned llama stack
repository
* Drop `-it` from `docker run` commands as it's not needed when running
services

## Test Plan

* Verified running the llama stack following updated instruction

CC: @ashwinb

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-03-27 10:19:51 -04:00
Hardik Shah
f8445b0d69
fix: update mcp commands in getting_started.ipynb (#1800)
as titled
2025-03-26 14:47:32 -07:00
Hardik Shah
e8d5959048
fix: update getting_started.ipynb (#1797)
using simple `pip install llama-stack-client`
2025-03-26 12:54:21 -07:00
Hardik Shah
cb2a9784ab
fix: multiple issues with getting_started notebook (#1795)
Fixes multiple issues.

1. `llama stack build` of dependencies was breaking with incompatible
numpy / pandas when importing datasets.

Moved the notebook to start a local server instead of using the library
as a client. This way the setup is cleaner since it's all contained, and
by using `uv run --with` we can also test the server setup process in CI
and at release time.

2. The change in [1] surfaced some other issues:
- running `llama stack run` was defaulting to the conda env name
- provider data was not being managed properly
- some notebook cells (telemetry for evals) were not updated with the
latest changes

Fixed all the issues and updated the notebook.

### Test 

1. Manually run it all in local env 
2. `pytest -v -s --nbval-lax docs/getting_started.ipynb`
2025-03-26 10:59:12 -07:00