From 42104361a378a361b5097daa71cd2a4d14c82318 Mon Sep 17 00:00:00 2001 From: Xi Yan Date: Tue, 29 Oct 2024 14:04:41 -0700 Subject: [PATCH] refactor structure --- distributions/ollama/README.md | 13 + docs/source/api_providers/index.md | 13 + .../new_api_provider.md | 0 .../index.md} | 307 ++---------------- .../distribution_dev/building_distro.md | 43 +-- docs/source/getting_started.md | 247 -------------- docs/source/getting_started/conda.md | 2 - .../developer_cookbook.md | 2 +- .../getting_started/distributions/index.md | 9 + .../distributions/meta-reference-gpu.md | 111 +++++++ docs/source/getting_started/docker.md | 3 - docs/source/getting_started/index.md | 81 +++++ docs/source/index.md | 24 +- 13 files changed, 293 insertions(+), 562 deletions(-) create mode 100644 docs/source/api_providers/index.md rename docs/source/{providers_dev => api_providers}/new_api_provider.md (100%) rename docs/source/{cli_reference.md => cli_reference/index.md} (50%) delete mode 100644 docs/source/getting_started.md delete mode 100644 docs/source/getting_started/conda.md rename docs/source/{ => getting_started}/developer_cookbook.md (93%) create mode 100644 docs/source/getting_started/distributions/index.md create mode 100644 docs/source/getting_started/distributions/meta-reference-gpu.md delete mode 100644 docs/source/getting_started/docker.md create mode 100644 docs/source/getting_started/index.md diff --git a/distributions/ollama/README.md b/distributions/ollama/README.md index 0d2ce6973..f969d86ec 100644 --- a/distributions/ollama/README.md +++ b/distributions/ollama/README.md @@ -92,6 +92,19 @@ llama stack run ./gpu/run.yaml ### Model Serving +#### Downloading model via Ollama + +You can use ollama for managing model downloads. + +``` +ollama pull llama3.1:8b-instruct-fp16 +ollama pull llama3.1:70b-instruct-fp16 +``` + +> [!NOTE] +> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/adapters/inference/ollama/ollama.py) for the supported Ollama models. + + To serve a new model with `ollama` ``` ollama run diff --git a/docs/source/api_providers/index.md b/docs/source/api_providers/index.md new file mode 100644 index 000000000..f4352b043 --- /dev/null +++ b/docs/source/api_providers/index.md @@ -0,0 +1,13 @@ +# API Providers + +A Provider is what makes the API real -- they provide the actual implementation backing the API. + +As an example, for Inference, we could have the implementation be backed by open source libraries like `[ torch | vLLM | TensorRT ]` as possible options. + +A provider can also be just a pointer to a remote REST service -- for example, cloud providers or dedicated inference providers could serve these APIs. + +```{toctree} +:maxdepth: 2 + +new_api_provider +``` diff --git a/docs/source/providers_dev/new_api_provider.md b/docs/source/api_providers/new_api_provider.md similarity index 100% rename from docs/source/providers_dev/new_api_provider.md rename to docs/source/api_providers/new_api_provider.md diff --git a/docs/source/cli_reference.md b/docs/source/cli_reference/index.md similarity index 50% rename from docs/source/cli_reference.md rename to docs/source/cli_reference/index.md index 39ac99615..f87ff0f72 100644 --- a/docs/source/cli_reference.md +++ b/docs/source/cli_reference/index.md @@ -1,11 +1,35 @@ # Llama CLI Reference -The `llama` CLI tool helps you setup and use the Llama Stack & agentic systems. It should be available on your path after installing the `llama-stack` package. 
+The `llama` CLI tool helps you setup and use the Llama Stack. It should be available on your path after installing the `llama-stack` package. -### Subcommands +## Installation + +You have two ways to install Llama Stack: + +1. **Install as a package**: + You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command: + ```bash + pip install llama-stack + ``` + +2. **Install from source**: + If you prefer to install from the source code, follow these steps: + ```bash + mkdir -p ~/local + cd ~/local + git clone git@github.com:meta-llama/llama-stack.git + + conda create -n myenv python=3.10 + conda activate myenv + + cd llama-stack + $CONDA_PREFIX/bin/pip install -e . + + +## `llama` subcommands 1. `download`: `llama` cli tools supports downloading the model from Meta or Hugging Face. 2. `model`: Lists available models and their properties. -3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](cli_reference.md#step-3-building-and-configuring-llama-stack-distributions). +3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](../distribution_dev/building_distro.md). ### Sample Usage @@ -24,7 +48,7 @@ subcommands: {download,model,stack} -## Step 1. Get the models +## Downloading models You first need to have models downloaded locally. @@ -129,18 +153,6 @@ llama download --source huggingface --model-id Prompt-Guard-86M --ignore-pattern > **Tip:** Default for `llama download` is to run with `--ignore-patterns *.safetensors` since we use the `.pth` files in the `original` folder. For Llama Guard and Prompt Guard, however, we need safetensors. Hence, please run with `--ignore-patterns original` so that safetensors are downloaded and `.pth` files are ignored. -#### Downloading via Ollama - -If you're already using ollama, we also have a supported Llama Stack distribution `local-ollama` and you can continue to use ollama for managing model downloads. - -``` -ollama pull llama3.1:8b-instruct-fp16 -ollama pull llama3.1:70b-instruct-fp16 -``` - -> [!NOTE] -> Only the above two models are currently supported by Ollama. - ## Step 2: Understand the models The `llama model` command helps you explore the model’s interface. @@ -215,271 +227,10 @@ You can even run `llama model prompt-format` see all of the templates and their ``` llama model prompt-format -m Llama3.2-3B-Instruct ``` -![alt text](resources/prompt-format.png) +![alt text](../resources/prompt-format.png) You will be shown a Markdown formatted description of the model interface and how prompts / messages are formatted for various scenarios. **NOTE**: Outputs in terminal are color printed to show special tokens. - - -## Step 3: Building, and Configuring Llama Stack Distributions - -- Please see our [Getting Started](getting_started.md) guide for more details on how to build and start a Llama Stack distribution. - -### Step 3.1 Build -In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `tgi` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify: -- `name`: the name for our distribution (e.g. 
`tgi`) -- `image_type`: our build image type (`conda | docker`) -- `distribution_spec`: our distribution specs for specifying API providers - - `description`: a short description of the configurations for the distribution - - `providers`: specifies the underlying implementation for serving each API endpoint - - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of Docker image or Conda environment. - - -At the end of build command, we will generate `-build.yaml` file storing the build configurations. - -After this step is complete, a file named `-build.yaml` will be generated and saved at the output file path specified at the end of the command. - -#### Building from scratch -- For a new user, we could start off with running `llama stack build` which will allow you to a interactively enter wizard where you will be prompted to enter build configurations. -``` -llama stack build -``` - -Running the command above will allow you to fill in the configuration to build your Llama Stack distribution, you will see the following outputs. - -``` -> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-llama-stack -> Enter the image type you want your distribution to be built with (docker or conda): conda - - Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs. -> Enter the API provider for the inference API: (default=meta-reference): meta-reference -> Enter the API provider for the safety API: (default=meta-reference): meta-reference -> Enter the API provider for the agents API: (default=meta-reference): meta-reference -> Enter the API provider for the memory API: (default=meta-reference): meta-reference -> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference - - > (Optional) Enter a short description for your Llama Stack distribution: - -Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/my-local-llama-stack-build.yaml -``` - -#### Building from templates -- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers. - -The following command will allow you to see the available templates and their corresponding providers. -``` -llama stack build --list-templates -``` - -![alt text](resources/list-templates.png) - -You may then pick a template to build your distribution with providers fitted to your liking. - -``` -llama stack build --template tgi --image-type conda -``` - -``` -$ llama stack build --template tgi --image-type conda -... -... -Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml -You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml` -``` - -#### Building from config file -- In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command. - -- The config file will be of contents like the ones in `llama_stack/templates/`. 
- -``` -$ cat build.yaml - -name: ollama -distribution_spec: - description: Like local, but use ollama for running LLM inference - providers: - inference: remote::ollama - memory: meta-reference - safety: meta-reference - agents: meta-reference - telemetry: meta-reference -image_type: conda -``` - -``` -llama stack build --config build.yaml -``` - -#### How to build distribution with Docker image - -To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type. - -``` -llama stack build --template tgi --image-type docker -``` - -Alternatively, you may use a config file and set `image_type` to `docker` in our `-build.yaml` file, and run `llama stack build -build.yaml`. The `-build.yaml` will be of contents like: - -``` -name: local-docker-example -distribution_spec: - description: Use code from `llama_stack` itself to serve all llama stack APIs - docker_image: null - providers: - inference: meta-reference - memory: meta-reference-faiss - safety: meta-reference - agentic_system: meta-reference - telemetry: console -image_type: docker -``` - -The following command allows you to build a Docker image with the name `` -``` -llama stack build --config -build.yaml - -Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/DockerfileFROM python:3.10-slim -WORKDIR /app -... -... -You can run it with: podman run -p 8000:8000 llamastack-docker-local -Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml -``` - - -### Step 3.2 Configure -After our distribution is built (either in form of docker or conda environment), we will run the following command to -``` -llama stack configure [ | ] -``` -- For `conda` environments: would be the generated build spec saved from Step 1. -- For `docker` images downloaded from Dockerhub, you could also use as the argument. - - Run `docker images` to check list of available images on your machine. - -``` -$ llama stack configure ~/.llama/distributions/conda/tgi-build.yaml - -Configuring API: inference (meta-reference) -Enter value for model (existing: Llama3.1-8B-Instruct) (required): -Enter value for quantization (optional): -Enter value for torch_seed (optional): -Enter value for max_seq_len (existing: 4096) (required): -Enter value for max_batch_size (existing: 1) (required): - -Configuring API: memory (meta-reference-faiss) - -Configuring API: safety (meta-reference) -Do you want to configure llama_guard_shield? (y/n): y -Entering sub-configuration for llama_guard_shield: -Enter value for model (default: Llama-Guard-3-1B) (required): -Enter value for excluded_categories (default: []) (required): -Enter value for disable_input_check (default: False) (required): -Enter value for disable_output_check (default: False) (required): -Do you want to configure prompt_guard_shield? (y/n): y -Entering sub-configuration for prompt_guard_shield: -Enter value for model (default: Prompt-Guard-86M) (required): - -Configuring API: agentic_system (meta-reference) -Enter value for brave_search_api_key (optional): -Enter value for bing_search_api_key (optional): -Enter value for wolfram_api_key (optional): - -Configuring API: telemetry (console) - -YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml -``` - -After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/8b-instruct-run.yaml` with the following contents. You may edit this file to change the settings. 
- -As you can see, we did basic configuration above and configured: -- inference to run on model `Llama3.1-8B-Instruct` (obtained from `llama model list`) -- Llama Guard safety shield with model `Llama-Guard-3-1B` -- Prompt Guard safety shield with model `Prompt-Guard-86M` - -For how these configurations are stored as yaml, checkout the file printed at the end of the configuration. - -Note that all configurations as well as models are stored in `~/.llama` - - -### Step 3.3 Run -Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack configure` step. - -``` -llama stack run ~/.llama/builds/conda/tgi-run.yaml -``` - -You should see the Llama Stack server start and print the APIs that it is supporting - -``` -$ llama stack run ~/.llama/builds/local/conda/tgi-run.yaml - -> initializing model parallel with size 1 -> initializing ddp with size 1 -> initializing pipeline with size 1 -Loaded in 19.28 seconds -NCCL version 2.20.5+cuda12.4 -Finished model load YES READY -Serving POST /inference/batch_chat_completion -Serving POST /inference/batch_completion -Serving POST /inference/chat_completion -Serving POST /inference/completion -Serving POST /safety/run_shield -Serving POST /agentic_system/memory_bank/attach -Serving POST /agentic_system/create -Serving POST /agentic_system/session/create -Serving POST /agentic_system/turn/create -Serving POST /agentic_system/delete -Serving POST /agentic_system/session/delete -Serving POST /agentic_system/memory_bank/detach -Serving POST /agentic_system/session/get -Serving POST /agentic_system/step/get -Serving POST /agentic_system/turn/get -Listening on :::5000 -INFO: Started server process [453333] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) -``` - -> [!NOTE] -> Configuration is in `~/.llama/builds/local/conda/tgi-run.yaml`. Feel free to increase `max_seq_len`. - -> [!IMPORTANT] -> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines. - -> [!TIP] -> You might need to use the flag `--disable-ipv6` to Disable IPv6 support - -This server is running a Llama model locally. - -### Step 3.4 Test with Client -Once the server is setup, we can test it with a client to see the example outputs. -``` -cd /path/to/llama-stack -conda activate # any environment containing the llama-stack pip package will work - -python -m llama_stack.apis.inference.client localhost 5000 -``` - -This will run the chat completion client and query the distribution’s /inference/chat_completion API. - -Here is an example output: -``` -User>hello world, write me a 2 sentence poem about the moon -Assistant> Here's a 2-sentence poem about the moon: - -The moon glows softly in the midnight sky, -A beacon of wonder, as it passes by. -``` - -Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by: - -``` -python -m llama_stack.apis.safety.client localhost 5000 -``` - -You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo. 
diff --git a/docs/source/distribution_dev/building_distro.md b/docs/source/distribution_dev/building_distro.md index 234c553da..49daba71c 100644 --- a/docs/source/distribution_dev/building_distro.md +++ b/docs/source/distribution_dev/building_distro.md @@ -240,31 +240,24 @@ This server is running a Llama model locally. ## Step 4. Test with Client Once the server is setup, we can test it with a client to see the example outputs. -``` -cd /path/to/llama-stack -conda activate # any environment containing the llama-stack pip package will work - -python -m llama_stack.apis.inference.client localhost 5000 -``` - -This will run the chat completion client and query the distribution’s /inference/chat_completion API. - -Here is an example output: -``` -User>hello world, write me a 2 sentence poem about the moon -Assistant> Here's a 2-sentence poem about the moon: - -The moon glows softly in the midnight sky, -A beacon of wonder, as it passes by. -``` - -Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by: ``` -python -m llama_stack.apis.safety.client localhost 5000 +curl http://localhost:5000/inference/chat_completion \ +-H "Content-Type: application/json" \ +-d '{ + "model": "Llama3.1-8B-Instruct", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write me a 2 sentence poem about the moon"} + ], + "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512} +}' + +Output: +{'completion_message': {'role': 'assistant', + 'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.', + 'stop_reason': 'out_of_tokens', + 'tool_calls': []}, + 'logprobs': null} + ``` - - -Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications. - -You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo. diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md deleted file mode 100644 index 3725c540f..000000000 --- a/docs/source/getting_started.md +++ /dev/null @@ -1,247 +0,0 @@ -# Getting Started with Llama Stack - -At the end of the guide, you will have learnt how to: -- get a Llama Stack server up and running -- get a agent (with tool-calling, vector stores) which works with the above server - -To see more example apps built using Llama Stack, see [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main). - -## Installation - -You have two ways to install Llama Stack: - -1. **Install as a package**: - You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command: - ```bash - pip install llama-stack - ``` - -2. **Install from source**: - If you prefer to install from the source code, follow these steps: - ```bash - mkdir -p ~/local - cd ~/local - git clone git@github.com:meta-llama/llama-stack.git - - conda create -n myenv python=3.10 - conda activate myenv - - cd llama-stack - $CONDA_PREFIX/bin/pip install -e . 
- - -## Starting Up Llama Stack Server - -There are two ways to start a Llama Stack: - -- **Docker**: we provide a number of pre-built Docker containers allowing you to get started instantly. If you are focused on application development, we recommend this option. -- **Conda**: the `llama` CLI provides a simple set of commands to build, configure and run a Llama Stack server containing the exact combination of providers you wish. We have provided various templates to make getting started easier. - -Both of these provide options to run model inference using our reference implementations, Ollama, TGI, vLLM or even remote providers like Fireworks, Together, Bedrock, etc. - -### Docker - -Running inference of the underlying Llama model is one of the most critical requirements. Depending on what hardware you have available, you have various options: - -**Do you have access to a machine with powerful GPUs?** -If so, we suggest... - -**Are you running on a "regular" desktop machine?** -In that case, we suggest ollama - -**Do you have access to a remote inference provider like Fireworks, Togther, etc.?** -... - -We provide pre-built Docker image of Llama Stack distribution, which can be found in the following links in the [distributions](../distributions/) folder. - -> [!NOTE] -> For GPU inference, you need to set these environment variables for specifying local directory containing your model checkpoints, and enable GPU inference to start running docker container. -``` -export LLAMA_CHECKPOINT_DIR=~/.llama -``` - -> [!NOTE] -> `~/.llama` should be the path containing downloaded weights of Llama models. - -To download llama models, use -``` -llama download --model-id Llama3.1-8B-Instruct -``` - -To download and start running a pre-built docker container, you may use the following commands: - -``` -cd llama-stack/distributions/meta-reference-gpu -docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml -``` - -> [!TIP] -> Pro Tip: We may use `docker compose up` for starting up a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can checkout [these scripts](../distributions/) to help you get started. - - -### Conda - - You can use this method to build a Llama Stack distribution from scratch. This is useful when you intend to hack on the Llama Stack server codebase (or just want to understand.) - - **`llama stack build`** - - You'll be prompted to enter build information interactively. - ``` - llama stack build - - > Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack - > Enter the image type you want your distribution to be built with (docker or conda): conda - - Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs. 
- > Enter the API provider for the inference API: (default=meta-reference): meta-reference - > Enter the API provider for the safety API: (default=meta-reference): meta-reference - > Enter the API provider for the agents API: (default=meta-reference): meta-reference - > Enter the API provider for the memory API: (default=meta-reference): meta-reference - > Enter the API provider for the telemetry API: (default=meta-reference): meta-reference - - > (Optional) Enter a short description for your Llama Stack distribution: - - Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml - You can now run `llama stack configure my-local-stack` - ``` - - **`llama stack configure`** - - Run `llama stack configure ` with the name you have previously defined in `build` step. - ``` - llama stack configure - ``` - - You will be prompted to enter configurations for your Llama Stack - - ``` - $ llama stack configure my-local-stack - - Configuring API `inference`... - === Configuring provider `meta-reference` for API inference... - Enter value for model (default: Llama3.1-8B-Instruct) (required): - Do you want to configure quantization? (y/n): n - Enter value for torch_seed (optional): - Enter value for max_seq_len (default: 4096) (required): - Enter value for max_batch_size (default: 1) (required): - - Configuring API `safety`... - === Configuring provider `meta-reference` for API safety... - Do you want to configure llama_guard_shield? (y/n): n - Do you want to configure prompt_guard_shield? (y/n): n - - Configuring API `agents`... - === Configuring provider `meta-reference` for API agents... - Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite): - - Configuring SqliteKVStoreConfig: - Enter value for namespace (optional): - Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required): - - Configuring API `memory`... - === Configuring provider `meta-reference` for API memory... - > Please enter the supported memory bank type your provider has for memory: vector - - Configuring API `telemetry`... - === Configuring provider `meta-reference` for API telemetry... - - > YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml. - You can now run `llama stack run my-local-stack --port PORT` - ``` - - **`llama stack run`** - - Run `llama stack run ` with the name you have previously defined. - ``` - llama stack run my-local-stack - - ... - > initializing model parallel with size 1 - > initializing ddp with size 1 - > initializing pipeline with size 1 - ... 
- Finished model load YES READY - Serving POST /inference/chat_completion - Serving POST /inference/completion - Serving POST /inference/embeddings - Serving POST /memory_banks/create - Serving DELETE /memory_bank/documents/delete - Serving DELETE /memory_banks/drop - Serving GET /memory_bank/documents/get - Serving GET /memory_banks/get - Serving POST /memory_bank/insert - Serving GET /memory_banks/list - Serving POST /memory_bank/query - Serving POST /memory_bank/update - Serving POST /safety/run_shield - Serving POST /agentic_system/create - Serving POST /agentic_system/session/create - Serving POST /agentic_system/turn/create - Serving POST /agentic_system/delete - Serving POST /agentic_system/session/delete - Serving POST /agentic_system/session/get - Serving POST /agentic_system/step/get - Serving POST /agentic_system/turn/get - Serving GET /telemetry/get_trace - Serving POST /telemetry/log_event - Listening on :::5000 - INFO: Started server process [587053] - INFO: Waiting for application startup. - INFO: Application startup complete. - INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) - ``` - - -## Testing with client -Once the server is setup, we can test it with a client to see the example outputs. -``` -cd /path/to/llama-stack -conda activate # any environment containing the llama-stack pip package will work - -python -m llama_stack.apis.inference.client localhost 5000 -``` - -This will run the chat completion client and query the distribution’s `/inference/chat_completion` API. - -Here is an example output: -``` -User>hello world, write me a 2 sentence poem about the moon -Assistant> Here's a 2-sentence poem about the moon: - -The moon glows softly in the midnight sky, -A beacon of wonder, as it passes by. -``` - -You may also send a POST request to the server: -``` -curl http://localhost:5000/inference/chat_completion \ --H "Content-Type: application/json" \ --d '{ - "model": "Llama3.1-8B-Instruct", - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Write me a 2 sentence poem about the moon"} - ], - "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512} -}' - -Output: -{'completion_message': {'role': 'assistant', - 'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.', - 'stop_reason': 'out_of_tokens', - 'tool_calls': []}, - 'logprobs': null} - -``` - - -Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by: - -``` -python -m llama_stack.apis.safety.client localhost 5000 -``` - - -Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications. - -You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo. - - -## Advanced Guides -Please see our [Building a LLama Stack Distribution](./building_distro.md) guide for more details on how to assemble your own Llama Stack Distribution. 
diff --git a/docs/source/getting_started/conda.md b/docs/source/getting_started/conda.md deleted file mode 100644 index 3ae6c5f43..000000000 --- a/docs/source/getting_started/conda.md +++ /dev/null @@ -1,2 +0,0 @@ -# Conda -WIP diff --git a/docs/source/developer_cookbook.md b/docs/source/getting_started/developer_cookbook.md similarity index 93% rename from docs/source/developer_cookbook.md rename to docs/source/getting_started/developer_cookbook.md index f5dedafbf..a85055a39 100644 --- a/docs/source/developer_cookbook.md +++ b/docs/source/getting_started/developer_cookbook.md @@ -19,7 +19,7 @@ Based on your developer needs, below are references to guides to help you get st * Developer need: I want a Llama Stack distribution with a remote provider. * Effort: 10min * Guide - - Please see our [Distributions Guide](../distributions/) on starting up distributions with remote providers. + - Please see our [Distributions Guide](../../../distributions/) on starting up distributions with remote providers. ### On-Device (iOS) Llama Stack diff --git a/docs/source/getting_started/distributions/index.md b/docs/source/getting_started/distributions/index.md new file mode 100644 index 000000000..94c676611 --- /dev/null +++ b/docs/source/getting_started/distributions/index.md @@ -0,0 +1,9 @@ +# Llama Stack Distribution + +A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications. + +```{toctree} +:maxdepth: 2 + +meta-reference-gpu +``` diff --git a/docs/source/getting_started/distributions/meta-reference-gpu.md b/docs/source/getting_started/distributions/meta-reference-gpu.md new file mode 100644 index 000000000..5c576122f --- /dev/null +++ b/docs/source/getting_started/distributions/meta-reference-gpu.md @@ -0,0 +1,111 @@ +# Meta Reference Distribution + +The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations. + + +| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** | +|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- | +| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference | + + +### Prerequisite +Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide]() here to download the models. + +``` +$ ls ~/.llama/checkpoints +Llama3.1-8B Llama3.2-11B-Vision-Instruct Llama3.2-1B-Instruct Llama3.2-90B-Vision-Instruct Llama-Guard-3-8B +Llama3.1-8B-Instruct Llama3.2-1B Llama3.2-3B-Instruct Llama-Guard-3-1B Prompt-Guard-86M +``` + +### Start the Distribution (Single Node GPU) + +``` +$ cd distributions/meta-reference-gpu +$ ls +build.yaml compose.yaml README.md run.yaml +$ docker compose up +``` + +> [!NOTE] +> This assumes you have access to GPU to start a local server with access to your GPU. 
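+
+If you want to confirm that Docker can actually see your GPU before bringing up the distribution, a quick smoke test is to run `nvidia-smi` from inside a CUDA container (a sketch; it assumes the NVIDIA Container Toolkit is installed, and the CUDA image tag is only an example; use any tag available to you):
+
+```
+docker run --rm --gpus=all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
+```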
+
+
+> [!NOTE]
+> `~/.llama` should be the path containing downloaded weights of Llama models.
+
+
+This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
+
+```
+docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
+```
+
+### Alternative (Build and start distribution locally via conda)
+- You may check out the [Getting Started](../index.md) guide for more details on building locally via conda and starting up a meta-reference distribution.
+
+### Start Distribution With pgvector/chromadb Memory Provider
+##### pgvector
+1. Start running the pgvector server:
+
+```
+docker run --network host --name mypostgres -it -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres pgvector/pgvector:pg16
+```
+
+2. Edit the `run.yaml` file to point to the pgvector server.
+```
+memory:
+  - provider_id: pgvector
+    provider_type: remote::pgvector
+    config:
+      host: 127.0.0.1
+      port: 5432
+      db: postgres
+      user: postgres
+      password: mysecretpassword
+```
+
+> [!NOTE]
+> If you get a `RuntimeError: Vector extension is not installed.` error, you will need to run `CREATE EXTENSION IF NOT EXISTS vector;` to install the vector extension. E.g.
+
+```
+docker exec -it mypostgres ./bin/psql -U postgres
+postgres=# CREATE EXTENSION IF NOT EXISTS vector;
+postgres=# SELECT extname from pg_extension;
+ extname
+```
+
+3. Run `docker compose up` with the updated `run.yaml` file.
+
+##### chromadb
+1. Start running the chromadb server:
+```
+docker run -it --network host --name chromadb -p 6000:6000 -v ./chroma_vdb:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest
+```
+
+2. Edit the `run.yaml` file to point to the chromadb server.
+```
+memory:
+  - provider_id: remote::chromadb
+    provider_type: remote::chromadb
+    config:
+      host: localhost
+      port: 6000
+```
+
+3. Run `docker compose up` with the updated `run.yaml` file.
+
+### Serving a new model
+You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in `~/.llama`.
+```
+inference:
+  - provider_id: meta0
+    provider_type: meta-reference
+    config:
+      model: Llama3.2-11B-Vision-Instruct
+      quantization: null
+      torch_seed: null
+      max_seq_len: 4096
+      max_batch_size: 1
+```
+
+Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
diff --git a/docs/source/getting_started/docker.md b/docs/source/getting_started/docker.md
deleted file mode 100644
index bcadec9ab..000000000
--- a/docs/source/getting_started/docker.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# Docker
-
-WIP
diff --git a/docs/source/getting_started/index.md b/docs/source/getting_started/index.md
new file mode 100644
index 000000000..fbde781a6
--- /dev/null
+++ b/docs/source/getting_started/index.md
@@ -0,0 +1,81 @@
+# Getting Started with Llama Stack
+
+By the end of this guide, you will have learned how to:
+- get a Llama Stack server up and running
+- get an agent (with tool-calling, vector stores) which works with the above server
+
+To see more example apps built using Llama Stack, see [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main).
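+
+If you have not yet installed the `llama` CLI or downloaded a model checkpoint, a minimal starting point looks like the sketch below (see the [CLI Reference](../cli_reference/index.md) for the full set of installation and download options; the model ID is just an example):
+
+```
+# Install the Llama Stack package, which provides the `llama` CLI
+pip install llama-stack
+
+# Download a model to serve; checkpoints are stored under ~/.llama
+llama download --model-id Llama3.1-8B-Instruct
+```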
+
+## Starting Up Llama Stack Server
+
+### Decide Your Setup: Docker or Conda
+There are two ways to start a Llama Stack:
+
+- **Docker**: we provide a number of pre-built Docker containers allowing you to get started instantly. If you are focused on application development, we recommend this option.
+- **Conda**: the `llama` CLI provides a simple set of commands to build, configure and run a Llama Stack server containing the exact combination of providers you wish. We have provided various templates to make getting started easier.
+
+Both of these provide options to run model inference using our reference implementations, Ollama, TGI, vLLM, or even remote providers like Fireworks, Together, Bedrock, etc.
+
+### Decide Your Inference Provider
+
+Running inference of the underlying Llama model is one of the most critical requirements. Depending on what hardware you have available, you have various options:
+
+- **Do you have access to a machine with powerful GPUs?**
+If so, we suggest:
+  - `distribution-meta-reference-gpu`:
+    - [Docker]()
+    - [Conda]()
+  - `distribution-tgi`:
+    - [Docker]()
+    - [Conda]()
+
+- **Are you running on a "regular" desktop machine?**
+If so, we suggest:
+  - `distribution-ollama`:
+    - [Docker]()
+    - [Conda]()
+
+- **Do you have access to a remote inference provider like Fireworks, Together, etc.?**
+If so, we suggest:
+  - `distribution-fireworks`:
+    - [Docker]()
+    - [Conda]()
+  - `distribution-together`:
+    - [Docker]()
+    - [Conda]()
+
+## Testing with a client
+Once the server is set up, we can test it with a client to see example outputs. The request below queries the distribution’s `/inference/chat_completion` API. Send a POST request to the server:
+
+```
+curl http://localhost:5000/inference/chat_completion \
+-H "Content-Type: application/json" \
+-d '{
+    "model": "Llama3.1-8B-Instruct",
+    "messages": [
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
+    ],
+    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
+}'
+
+Output:
+{'completion_message': {'role': 'assistant',
+  'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
+  'stop_reason': 'out_of_tokens',
+  'tool_calls': []},
+ 'logprobs': null}
+
+```
+
+Check out our client SDKs for connecting to the Llama Stack server in your preferred language; you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
+
+You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
+
+
+```{toctree}
+:hidden:
+:maxdepth: 2
+
+developer_cookbook
+distributions/index
+```
diff --git a/docs/source/index.md b/docs/source/index.md
index def2356bc..1093caceb 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -39,13 +39,25 @@ A provider can also be just a pointer to a remote REST service -- for example, c
 A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote.
 As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.
 
+## Llama Stack Client SDK
+
+| **Language** | **Client SDK** | **Package** |
+| :----: | :----: | :----: |
+| Python | [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [![PyPI version](https://img.shields.io/pypi/v/llama_stack_client.svg)](https://pypi.org/project/llama_stack_client/)
+| Swift | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift) | [![Swift Package Index](https://img.shields.io/endpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fmeta-llama%2Fllama-stack-client-swift%2Fbadge%3Ftype%3Dswift-versions)](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
+| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
+| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) |
+
+Check out our client SDKs for connecting to the Llama Stack server in your preferred language; you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
+
+You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
+
+
 ```{toctree}
 :hidden:
-:maxdepth: 1
+:maxdepth: 2
 
-cli_reference
-getting_started
-
-getting_started/conda
-getting_started/docker
+getting_started/index
+cli_reference/index
+api_providers/index
 ```
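+
+As a quick illustration of the Python SDK listed above, the sketch below sends the same chat completion request shown in the Getting Started guide. It assumes a server listening on `localhost:5000`; the exact method and parameter names may differ between `llama-stack-client` versions, so treat this as a shape rather than a contract.
+
+```python
+# Minimal sketch: call the Llama Stack server through the Python client SDK.
+# Assumes `pip install llama-stack-client` and a running server on port 5000.
+from llama_stack_client import LlamaStackClient
+
+client = LlamaStackClient(base_url="http://localhost:5000")
+
+response = client.inference.chat_completion(
+    model="Llama3.1-8B-Instruct",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Write me a 2 sentence poem about the moon"},
+    ],
+)
+
+# The completion message mirrors the REST response shown in the curl example.
+print(response.completion_message.content)
+```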