diff --git a/docs/source/building_applications/tools.md b/docs/source/building_applications/tools.md
index c7af17bfa..b19be888c 100644
--- a/docs/source/building_applications/tools.md
+++ b/docs/source/building_applications/tools.md
@@ -9,29 +9,24 @@ When instantiating an agent, you can provide it a list of tool groups that it ha
 Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.
-## Types of Tool Group providers
+## Server-side vs. client-side tool execution
-There are three types of providers for tool groups that are supported by Llama Stack.
+Llama Stack allows you to use both server-side and client-side tools. With server-side tools, `agent.create_turn` executes the tool calls emitted by the model
+transparently and returns the final answer to the user. If client-side tools are provided, the tool call is sent back to the user for execution
+and optional continuation using the `agent.resume_turn` method.
-1. Built-in providers
-2. Model Context Protocol (MCP) providers
-3. Client provided tools
-### Built-in providers
+### Server-side tools
-Built-in providers come packaged with Llama Stack. These providers provide common functionalities like web search, code interpretation, and computational capabilities.
+Llama Stack provides built-in providers for some common tools. These include web search, math, and RAG capabilities.
-#### Web Search providers
-There are three web search providers that are supported by Llama Stack.
+#### Web Search
-1. Brave Search
-2. Bing Search
-3. Tavily Search
+You have three providers to execute the web search tool calls generated by a model: Brave Search, Bing Search, and Tavily Search.
-Example client SDK call to register a "websearch" toolgroup that is provided by brave-search.
+To indicate that the web search tool calls should be executed by brave-search, you can point the "builtin::websearch" toolgroup to the "brave-search" provider.
 ```python
-# Register Brave Search tool group
 client.toolgroups.register(
     toolgroup_id="builtin::websearch",
     provider_id="brave-search",
@@ -39,17 +34,17 @@ client.toolgroups.register(
 )
 ```
-The tool requires an API key which can be provided either in the configuration or through the request header `X-LlamaStack-Provider-Data`. The format of the header is `{"_api_key": }`.
-
-> **NOTE:** When using Tavily Search and Bing Search, the inference output will still display "Brave Search." This is because Llama models have been trained with Brave Search as a built-in tool. Tavily and bing is just being used in lieu of Brave search.
+The tool requires an API key, which can be provided either in the configuration or through the request header `X-LlamaStack-Provider-Data`. The format of the header is:
+```
+{"_api_key": }
+```
-#### WolframAlpha
+#### Math
 The WolframAlpha tool provides access to computational knowledge through the WolframAlpha API.
 ```python
-# Register WolframAlpha tool group
 client.toolgroups.register(
     toolgroup_id="builtin::wolfram_alpha", provider_id="wolfram-alpha"
 )
@@ -83,11 +78,49 @@ Features:
 > **Note:** By default, llama stack run.yaml defines toolgroups for web search, wolfram alpha and rag, that are provided by tavily-search, wolfram-alpha and rag providers.
-## Model Context Protocol (MCP) Tools
+## Model Context Protocol (MCP)
-MCP tools are special tools that can interact with llama stack over model context protocol. These tools are dynamically discovered from an MCP endpoint and can be used to extend the agent's capabilities.
+[MCP](https://github.com/modelcontextprotocol) is an emerging, popular standard for tool discovery and execution. It is a protocol that allows tools to be dynamically discovered
+from an MCP endpoint and used to extend the agent's capabilities.
-Refer to [https://github.com/modelcontextprotocol/servers](https://github.com/modelcontextprotocol/servers) for available MCP servers.
+
+### Using Remote MCP Servers
+
+You can find some popular remote MCP servers [here](https://github.com/jaw9c/awesome-remote-mcp-servers). You can register them as toolgroups in the same way as local providers.
+
+```python
+client.toolgroups.register(
+    toolgroup_id="mcp::deepwiki",
+    provider_id="model-context-protocol",
+    mcp_endpoint=URL(uri="https://mcp.deepwiki.com/sse"),
+)
+```
+
+Note that most of the more useful MCP servers require you to authenticate with them, typically via OAuth 2.0. You can provide authorization headers to send to the MCP server
+using the "Provider Data" abstraction provided by Llama Stack. When making an agent call:
+
+```python
+agent = Agent(
+    ...,
+    tools=["mcp::deepwiki"],
+    extra_headers={
+        "X-LlamaStack-Provider-Data": json.dumps(
+            {
+                "mcp_headers": {
+                    "https://mcp.deepwiki.com/sse": {
+                        "Authorization": "Bearer ",
+                    },
+                },
+            }
+        ),
+    },
+)
+agent.create_turn(...)
+```
+
+### Running your own MCP server
+
+Here's an example of how to run a simple MCP server that exposes a filesystem as a set of tools to the Llama Stack agent.
 ```shell
 # start your MCP server
@@ -106,13 +139,9 @@ client.toolgroups.register(
 )
 ```
-MCP tools require:
-- A valid MCP endpoint URL
-- The endpoint must implement the Model Context Protocol
-- Tools are discovered dynamically from the endpoint
-## Adding Custom Tools
+## Adding Custom (Client-side) Tools
 When you want to use tools other than the built-in tools, you just need to implement a python function with a docstring. The content of the docstring will be used to describe the tool and the parameters and passed along to the generative model.
diff --git a/docs/source/concepts/api_providers.md b/docs/source/concepts/api_providers.md
new file mode 100644
index 000000000..6e6502c0c
--- /dev/null
+++ b/docs/source/concepts/api_providers.md
@@ -0,0 +1,12 @@
+## API Providers
+
+The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples of these include:
+- LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, etc.),
+- Vector databases (e.g., ChromaDB, Weaviate, Qdrant, Milvus, FAISS, PGVector, etc.),
+- Safety providers (e.g., Meta's Llama Guard, AWS Bedrock Guardrails, etc.)
+
+Providers come in two flavors:
+- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code.
+- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full-fledged implementation within Llama Stack.
+
+Most importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.
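+
+Because every provider implements the same API, application code does not change when you swap providers. As a rough illustration (the endpoint URL and model id below are assumptions; use whatever your distribution actually serves), the same Inference call works no matter which provider backs it:
+
+```python
+from llama_stack_client import LlamaStackClient
+
+# Point the client at any running distribution, remotely hosted or local.
+client = LlamaStackClient(base_url="http://localhost:8321")  # assumed local endpoint
+
+# The identical call works whether the model is served by Fireworks, Together,
+# vLLM, Ollama, or an inline provider; only the distribution's config differs.
+response = client.inference.chat_completion(
+    model_id="meta-llama/Llama-3.3-70B-Instruct",  # assumed; use a model registered in your stack
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+print(response.completion_message.content)
+```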
diff --git a/docs/source/concepts/apis.md b/docs/source/concepts/apis.md
new file mode 100644
index 000000000..38c6a7a73
--- /dev/null
+++ b/docs/source/concepts/apis.md
@@ -0,0 +1,18 @@
+## APIs
+
+A Llama Stack API is described as a collection of REST endpoints. We currently support the following APIs:
+
+- **Inference**: run inference with an LLM
+- **Safety**: apply safety policies to the output at a system (not only model) level
+- **Agents**: run multi-step agentic workflows with LLMs, including tool usage, memory (RAG), etc.
+- **DatasetIO**: interface with datasets and data loaders
+- **Scoring**: evaluate outputs of the system
+- **Eval**: generate outputs (via Inference or Agents) and perform scoring
+- **VectorIO**: perform operations on vector stores, such as adding documents, searching, and deleting documents
+- **Telemetry**: collect telemetry data from the system
+
+We are working on adding a few more APIs to complete the application lifecycle. These will include:
+- **Batch Inference**: run inference on a dataset of inputs
+- **Batch Agents**: run agents on a dataset of inputs
+- **Post Training**: fine-tune a Llama model
+- **Synthetic Data Generation**: generate synthetic data for model development
diff --git a/docs/source/concepts/distributions.md b/docs/source/concepts/distributions.md
new file mode 100644
index 000000000..c3be12d93
--- /dev/null
+++ b/docs/source/concepts/distributions.md
@@ -0,0 +1,9 @@
+## Distributions
+
+While there is a lot of flexibility to mix-and-match providers, users will often work with a specific set of providers (hardware support, contractual obligations, etc.). We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of a Distro as a specific, pre-packaged version of the Llama Stack. Here are some examples:
+
+**Remotely Hosted Distro**: These are the simplest to consume from a user perspective. You can simply obtain the API key for these providers, point to a URL and have _all_ Llama Stack APIs working out of the box. Currently, [Fireworks](https://fireworks.ai/) and [Together](https://together.xyz/) provide such easy-to-consume Llama Stack distributions.
+
+**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
+
+**On-device Distro**: To run Llama Stack directly on an edge device (a mobile phone or tablet), we provide Distros for [iOS](https://llama-stack.readthedocs.io/en/latest/distributions/ondevice_distro/ios_sdk.html) and [Android](https://llama-stack.readthedocs.io/en/latest/distributions/ondevice_distro/android_sdk.html).
diff --git a/docs/source/concepts/evaluation_concepts.md b/docs/source/concepts/evaluation_concepts.md
index 14390c0a2..3f03d098f 100644
--- a/docs/source/concepts/evaluation_concepts.md
+++ b/docs/source/concepts/evaluation_concepts.md
@@ -1,4 +1,4 @@
-# Evaluation Concepts
+## Evaluation Concepts
 The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
@@ -10,11 +10,7 @@ We introduce a set of APIs in Llama Stack for supporting running evaluations of
 This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. Checkout our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
-## Evaluation Concepts
-
-The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for better high-level understanding.
-
-![Eval Concepts](../references/evals_reference/resources/eval-concept.png)
+The Evaluation APIs are associated with a set of Resources. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for a better high-level understanding.
 - **DatasetIO**: defines interface with datasets and data loaders.
   - Associated with `Dataset` resource.
@@ -24,9 +20,9 @@ The Evaluation APIs are associated with a set of Resources as shown in the follo
   - Associated with `Benchmark` resource.
-## Open-benchmark Eval
+### Open-benchmark Eval
-### List of open-benchmarks Llama Stack support
+#### List of open-benchmarks Llama Stack supports
 Llama stack pre-registers several popular open-benchmarks to easily evaluate model perfomance via CLI.
@@ -39,7 +35,7 @@ The list of open-benchmarks we currently support:
 You can follow this [contributing guide](https://llama-stack.readthedocs.io/en/latest/references/evals_reference/index.html#open-benchmark-contributing-guide) to add more open-benchmarks to Llama Stack
-### Run evaluation on open-benchmarks via CLI
+#### Run evaluation on open-benchmarks via CLI
 We have built-in functionality to run the supported open-benckmarks using llama-stack-client CLI
@@ -74,7 +70,7 @@ evaluation results over there.
-## What's Next?
+#### What's Next?
 - Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP).
 - Check out our [Building Applications - Evaluation](../building_applications/evals.md) guide for more details on how to use the Evaluation APIs to evaluate your applications.
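+
+If you prefer the Python client over the CLI, the open benchmarks can also be inspected and launched programmatically. The snippet below is only a sketch, assuming a `client` connected to your distribution; the benchmark id, model id, and the exact shape of `benchmark_config` are assumptions and may differ between Llama Stack versions.
+
+```python
+# List the benchmarks that the distribution has pre-registered.
+for benchmark in client.benchmarks.list():
+    print(benchmark.identifier)
+
+# Kick off an evaluation job against one of them (field names are assumptions).
+job = client.eval.run_eval(
+    benchmark_id="meta-reference-mmlu",
+    benchmark_config={
+        "eval_candidate": {
+            "type": "model",
+            "model": "meta-llama/Llama-3.3-70B-Instruct",
+            "sampling_params": {"strategy": {"type": "greedy"}, "max_tokens": 512},
+        },
+    },
+)
+print(client.eval.jobs.status(job_id=job.job_id, benchmark_id="meta-reference-mmlu"))
+```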
diff --git a/docs/source/concepts/index.md b/docs/source/concepts/index.md index a94511a0d..1c31dc232 100644 --- a/docs/source/concepts/index.md +++ b/docs/source/concepts/index.md @@ -1,74 +1,23 @@ # Core Concepts - -```{toctree} -:maxdepth: 1 -:hidden: - -evaluation_concepts -``` - Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks. - -## APIs - -A Llama Stack API is described as a collection of REST endpoints. We currently support the following APIs: - -- **Inference**: run inference with a LLM -- **Safety**: apply safety policies to the output at a Systems (not only model) level -- **Agents**: run multi-step agentic workflows with LLMs with tool usage, memory (RAG), etc. -- **DatasetIO**: interface with datasets and data loaders -- **Scoring**: evaluate outputs of the system -- **Eval**: generate outputs (via Inference or Agents) and perform scoring -- **VectorIO**: perform operations on vector stores, such as adding documents, searching, and deleting documents -- **Telemetry**: collect telemetry data from the system - -We are working on adding a few more APIs to complete the application lifecycle. These will include: -- **Batch Inference**: run inference on a dataset of inputs -- **Batch Agents**: run agents on a dataset of inputs -- **Post Training**: fine-tune a Llama model -- **Synthetic Data Generation**: generate synthetic data for model development - -## API Providers - -The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Examples for these include: -- LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, Groq, Cerebras, SambaNova, vLLM, etc.), -- Vector databases (e.g., ChromaDB, Weaviate, Qdrant, Milvus, FAISS, PGVector, etc.), -- Safety providers (e.g., Meta's Llama Guard, AWS Bedrock Guardrails, etc.) - -Providers come in two flavors: -- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code. -- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full fledged implementation within Llama Stack. - -Most importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally. -## Resources - -Some of these APIs are associated with a set of **Resources**. Here is the mapping of APIs to resources: - -- **Inference**, **Eval** and **Post Training** are associated with `Model` resources. -- **Safety** is associated with `Shield` resources. -- **Tool Runtime** is associated with `ToolGroup` resources. -- **DatasetIO** is associated with `Dataset` resources. -- **VectorIO** is associated with `VectorDB` resources. -- **Scoring** is associated with `ScoringFunction` resources. -- **Eval** is associated with `Model` and `Benchmark` resources. - -Furthermore, we allow these resources to be **federated** across multiple providers. For example, you may have some Llama models served by Fireworks while others are served by AWS Bedrock. Regardless, they will all work seamlessly with the same uniform Inference API provided by Llama Stack. 
- -```{admonition} Registering Resources -:class: tip - -Given this architecture, it is necessary for the Stack to know which provider to use for a given resource. This means you need to explicitly _register_ resources (including models) before you can use them with the associated APIs. +```{include} apis.md +:start-after: ## APIs ``` -## Distributions +```{include} api_providers.md +:start-after: ## API Providers +``` -While there is a lot of flexibility to mix-and-match providers, often users will work with a specific set of providers (hardware support, contractual obligations, etc.) We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of it as specific pre-packaged versions of the Llama Stack. Here are some examples: +```{include} resources.md +:start-after: ## Resources +``` -**Remotely Hosted Distro**: These are the simplest to consume from a user perspective. You can simply obtain the API key for these providers, point to a URL and have _all_ Llama Stack APIs working out of the box. Currently, [Fireworks](https://fireworks.ai/) and [Together](https://together.xyz/) provide such easy-to-consume Llama Stack distributions. +```{include} distributions.md +:start-after: ## Distributions +``` -**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) or [NVIDIA NIM](https://build.nvidia.com/nim?filters=nimType%3Anim_type_run_anywhere&q=llama) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros. - - -**On-device Distro**: To run Llama Stack directly on an edge device (mobile phone or a tablet), we provide Distros for [iOS](https://llama-stack.readthedocs.io/en/latest/distributions/ondevice_distro/ios_sdk.html) and [Android](https://llama-stack.readthedocs.io/en/latest/distributions/ondevice_distro/android_sdk.html) +```{include} evaluation_concepts.md +:start-after: ## Evaluation Concepts +``` diff --git a/docs/source/concepts/resources.md b/docs/source/concepts/resources.md new file mode 100644 index 000000000..0cdc9a227 --- /dev/null +++ b/docs/source/concepts/resources.md @@ -0,0 +1,19 @@ +## Resources + +Some of these APIs are associated with a set of **Resources**. Here is the mapping of APIs to resources: + +- **Inference**, **Eval** and **Post Training** are associated with `Model` resources. +- **Safety** is associated with `Shield` resources. +- **Tool Runtime** is associated with `ToolGroup` resources. +- **DatasetIO** is associated with `Dataset` resources. +- **VectorIO** is associated with `VectorDB` resources. +- **Scoring** is associated with `ScoringFunction` resources. +- **Eval** is associated with `Model` and `Benchmark` resources. + +Furthermore, we allow these resources to be **federated** across multiple providers. For example, you may have some Llama models served by Fireworks while others are served by AWS Bedrock. Regardless, they will all work seamlessly with the same uniform Inference API provided by Llama Stack. 
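+
+Each resource type can be inspected through its corresponding API. As a quick sketch (assuming a `client` already connected to your distribution), you can list a few of these resource collections to see what a running stack knows about:
+
+```python
+# Each list call returns the resources currently registered with the stack,
+# regardless of which provider actually serves them.
+print(client.models.list())       # Model resources
+print(client.shields.list())      # Shield resources
+print(client.toolgroups.list())   # ToolGroup resources
+print(client.vector_dbs.list())   # VectorDB resources
+```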
+ +```{admonition} Registering Resources +:class: tip + +Given this architecture, it is necessary for the Stack to know which provider to use for a given resource. This means you need to explicitly _register_ resources (including models) before you can use them with the associated APIs. +```
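+
+For example, registering a model might look like the following sketch. The provider ids and model ids here are assumptions; use the ones defined by your distribution's configuration.
+
+```python
+# Register a model and tell the stack which provider serves it.
+client.models.register(
+    model_id="meta-llama/Llama-3.3-70B-Instruct",
+    provider_id="fireworks",  # assumed provider id from the distribution's config
+)
+
+# A second model can be registered against a different provider; both are then
+# available through the same Inference API (federation in practice).
+client.models.register(
+    model_id="meta-llama/Llama-3.1-8B-Instruct",
+    provider_id="ollama",  # assumed provider id
+)
+```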