diff --git a/README.md b/README.md
index 4b001ed2c..a76393047 100644
--- a/README.md
+++ b/README.md
@@ -37,7 +37,7 @@ A provider can also be just a pointer to a remote REST service -- for example, c

 ## Llama Stack Distribution

-A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by inline code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.
+A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well, always using the same uniform set of APIs for developing Generative AI applications.

 ## Installation

diff --git a/docs/cli_reference.md b/docs/cli_reference.md
index 07effdbbb..0aab7893e 100644
--- a/docs/cli_reference.md
+++ b/docs/cli_reference.md
@@ -208,7 +208,7 @@ $ llama distribution list
 +---------------+---------------------------------------------+----------------------------------------------------------------------+
 | Spec ID       | ProviderSpecs                               | Description                                                          |
 +---------------+---------------------------------------------+----------------------------------------------------------------------+
-| inline        | {                                           | Use code from `llama_toolchain` itself to serve all llama stack APIs |
+| local         | {                                           | Use code from `llama_toolchain` itself to serve all llama stack APIs |
 |               |   "inference": "meta-reference",            |                                                                      |
 |               |   "safety": "meta-reference",               |                                                                      |
 |               |   "agentic_system": "meta-reference"        |                                                                      |
@@ -220,7 +220,7 @@ $ llama distribution list
 |               |   "agentic_system": "agentic_system-remote" |                                                                      |
 |               | }                                           |                                                                      |
 +---------------+---------------------------------------------+----------------------------------------------------------------------+
-| ollama-inline | {                                           | Like local-source, but use ollama for running LLM inference         |
+| local-ollama  | {                                           | Like local, but use ollama for running LLM inference                |
 |               |   "inference": "meta-ollama",               |                                                                      |
 |               |   "safety": "meta-reference",               |                                                                      |
 |               |   "agentic_system": "meta-reference"        |                                                                      |
@@ -229,16 +229,16 @@
 ```

-As you can see above, each “spec” details the “providers” that make up that spec. For eg. The inline uses the “meta-reference” provider for inference while the ollama-inline relies on a different provider ( ollama ) for inference.
+As you can see above, each “spec” details the “providers” that make up that spec. For example, the `local` spec uses the “meta-reference” provider for inference, while the `local-ollama` spec relies on a different provider (ollama) for inference.

-Lets install the fully local implementation of the llama-stack – named `inline` above.
+Let's install the fully local implementation of the llama-stack – named `local` above.

 To install a distro, we run a simple command providing 2 inputs –

 - **Spec Id** of the distribution that we want to install ( as obtained from the list command )
 - A **Name** by which this installation will be known locally.

 ```
-llama distribution install --spec inline --name inline_llama_8b
+llama distribution install --spec local --name local_llama_8b
 ```

 This will create a new conda environment (name can be passed optionally) and install dependencies (via pip) as required by the distro.
@@ -246,12 +246,12 @@ This will create a new conda environment (name can be passed optionally) and ins
 Once it runs successfully , you should see some outputs in the form

 ```
-$ llama distribution install --spec inline --name inline_llama_8b
+$ llama distribution install --spec local --name local_llama_8b
 ....
 ....
 Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3

-Distribution `inline_llama_8b` (with spec inline) has been installed successfully!
+Distribution `local_llama_8b` (with spec local) has been installed successfully!
 ```

 Next step is to configure the distribution that you just installed. We provide a simple CLI tool to enable simple configuration.
@@ -260,12 +260,12 @@ It will ask for some details like model name, paths to models, etc.
 NOTE: You will have to download the models if not done already. Follow instructions here on how to download using the llama cli

 ```
-llama distribution configure --name inline_llama_8b
+llama distribution configure --name local_llama_8b
 ```

 Here is an example screenshot of how the cli will guide you to fill the configuration
 ```
-$ llama distribution configure --name inline_llama_8b
+$ llama distribution configure --name local_llama_8b

 Configuring API surface: inference
 Enter value for model (required): Meta-Llama3.1-8B-Instruct
@@ -278,7 +278,7 @@ Do you want to configure llama_guard_shield? (y/n): n
 Do you want to configure prompt_guard_shield? (y/n): n

 Configuring API surface: agentic_system
-YAML configuration has been written to ~/.llama/distributions/inline0/config.yaml
+YAML configuration has been written to ~/.llama/distributions/local0/config.yaml
 ```

 As you can see, we did basic configuration above and configured inference to run on model Meta-Llama3.1-8B-Instruct ( obtained from the llama model list command ).
@@ -290,12 +290,12 @@ For how these configurations are stored as yaml, checkout the file printed at th
 Now let’s start the distribution using the cli.

 ```
-llama distribution start --name inline_llama_8b --port 5000
+llama distribution start --name local_llama_8b --port 5000
 ```

 You should see the distribution start and print the APIs that it is supporting,
 ```
-$ llama distribution start --name inline_llama_8b --port 5000
+$ llama distribution start --name local_llama_8b --port 5000

 > initializing model parallel with size 1
 > initializing ddp with size 1
@@ -329,7 +329,7 @@ Lets test with a client

 ```
 cd /path/to/llama-toolchain
-conda activate <env>  # ( Eg. local_inline in above example )
+conda activate <env>  # ( Eg. local_llama_8b in above example )
 python -m llama_toolchain.inference.client localhost 5000
 ```

diff --git a/llama_toolchain/cli/distribution/install.py b/llama_toolchain/cli/distribution/install.py
index 68d42938d..a056dba36 100644
--- a/llama_toolchain/cli/distribution/install.py
+++ b/llama_toolchain/cli/distribution/install.py
@@ -36,7 +36,7 @@ class DistributionInstall(Subcommand):
         self.parser.add_argument(
             "--spec",
             type=str,
-            help="Distribution spec to install (try ollama-inline)",
+            help="Distribution spec to install (try local-ollama)",
             required=True,
             choices=[d.spec_id for d in available_distribution_specs()],
         )
diff --git a/llama_toolchain/data/default_inference_config.yaml b/llama_toolchain/data/default_inference_config.yaml
deleted file mode 100644
index eda4c9b47..000000000
--- a/llama_toolchain/data/default_inference_config.yaml
+++ /dev/null
@@ -1,14 +0,0 @@
-inference_config:
-  impl_config:
-    impl_type: "inline"
-    checkpoint_config:
-      checkpoint:
-        checkpoint_type: "pytorch"
-        checkpoint_dir: {checkpoint_dir}/
-        tokenizer_path: {checkpoint_dir}/tokenizer.model
-        model_parallel_size: {model_parallel_size}
-        quantization_format: bf16
-    quantization: null
-    torch_seed: null
-    max_seq_len: 16384
-    max_batch_size: 1
diff --git a/llama_toolchain/distribution/install_distribution.sh b/llama_toolchain/distribution/install_distribution.sh
index f0c66c99f..6ae74392c 100755
--- a/llama_toolchain/distribution/install_distribution.sh
+++ b/llama_toolchain/distribution/install_distribution.sh
@@ -96,7 +96,7 @@ ensure_conda_env_python310() {

 if [ "$#" -ne 3 ]; then
   echo "Usage: $0 <env_name> <distribution_name> <pip_dependencies>" >&2
-  echo "Example: $0 my_env local-inline 'numpy pandas scipy'" >&2
+  echo "Example: $0 my_env local-llama-8b 'numpy pandas scipy'" >&2
   exit 1
 fi

diff --git a/llama_toolchain/distribution/registry.py b/llama_toolchain/distribution/registry.py
index a60b3cd4f..b208abf9c 100644
--- a/llama_toolchain/distribution/registry.py
+++ b/llama_toolchain/distribution/registry.py
@@ -28,7 +28,7 @@ def available_distribution_specs() -> List[DistributionSpec]:
     providers = api_providers()
     return [
         DistributionSpec(
-            spec_id="inline",
+            spec_id="local",
             description="Use code from `llama_toolchain` itself to serve all llama stack APIs",
             provider_specs={
                 Api.inference: providers[Api.inference]["meta-reference"],
@@ -42,8 +42,8 @@ def available_distribution_specs() -> List[DistributionSpec]:
             provider_specs={x: remote_spec(x) for x in providers},
         ),
         DistributionSpec(
-            spec_id="ollama-inline",
-            description="Like local-source, but use ollama for running LLM inference",
+            spec_id="local-ollama",
+            description="Like local, but use ollama for running LLM inference",
             provider_specs={
                 Api.inference: providers[Api.inference]["meta-ollama"],
                 Api.safety: providers[Api.safety]["meta-reference"],
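
Reviewer note: a quick way to sanity-check the rename is to list the spec IDs the registry exposes. The sketch below is only illustrative, not part of the patch; it assumes `llama_toolchain` from this branch is installed in the active environment and that `available_distribution_specs` is importable from the module touched above (`llama_toolchain/distribution/registry.py`).

```
# Sketch only: enumerate distribution specs after the rename.
# Assumes llama_toolchain from this branch is installed; import path mirrors
# the file edited in the diff above.
from llama_toolchain.distribution.registry import available_distribution_specs

for spec in available_distribution_specs():
    # Expected IDs now include "local" and "local-ollama"
    # (formerly "inline" and "ollama-inline").
    print(f"{spec.spec_id}: {spec.description}")
```

This is the same list that `llama distribution install --spec ...` validates against via the `choices` argument in `install.py`, so the CLI help and the registry stay in sync after the rename.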