Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-10-16 14:57:20 +00:00)

refactor structure

parent 9ddc28eca7
commit 42104361a3
13 changed files with 293 additions and 562 deletions

@ -1,2 +0,0 @@
# Conda
WIP
41
docs/source/getting_started/developer_cookbook.md
Normal file

@ -0,0 +1,41 @@
# Llama Stack Developer Cookbook

Based on your developer needs, the guides below will help you get started.

### Hosted Llama Stack Endpoint
* Developer Need: I want to connect to a Llama Stack endpoint to build my applications.
* Effort: 1min
* Guide:
  - Check out our [DeepLearning.AI course](https://www.deeplearning.ai/short-courses/introducing-multimodal-llama-3-2) on building Llama Stack apps on a pre-hosted Llama Stack endpoint.

### Local meta-reference Llama Stack Server
* Developer Need: I want to start a local Llama Stack server with my GPU using meta-reference implementations.
* Effort: 5min
* Guide:
  - Please see our [Getting Started Guide](./getting_started.md) on starting up a meta-reference Llama Stack server.

### Llama Stack Server with Remote Providers
* Developer Need: I want a Llama Stack distribution with a remote provider.
* Effort: 10min
* Guide:
  - Please see our [Distributions Guide](../../../distributions/) on starting up distributions with remote providers.

### On-Device (iOS) Llama Stack
* Developer Need: I want to use Llama Stack on-device.
* Effort: 1.5hr
* Guide:
  - Please see our [iOS Llama Stack SDK](./ios_setup.md) implementations.

### Assemble your own Llama Stack Distribution
* Developer Need: I want to assemble my own distribution with API providers to my liking.
* Effort: 30min
* Guide:
  - Please see our [Building Distribution](./building_distro.md) guide for assembling your own Llama Stack distribution with your choice of API providers.

### Adding a New API Provider
* Developer Need: I want to add a new API provider to Llama Stack.
* Effort: 3hr
* Guide:
  - Please see our [Adding a New API Provider](./new_api_provider.md) guide for adding a new API provider.
9
docs/source/getting_started/distributions/index.md
Normal file

@ -0,0 +1,9 @@
# Llama Stack Distribution

A Distribution is where APIs and Providers are assembled to give the end application developer a consistent whole. You can mix and match providers: some can be backed by local code and some can be remote. As a hobbyist, you can serve a small model locally and use a cloud provider for a large model. Either way, the higher-level APIs your app works with do not need to change. You can even move across the server / mobile-device boundary while using the same uniform set of APIs for developing Generative AI applications.

```{toctree}
:maxdepth: 2

meta-reference-gpu
```
111
docs/source/getting_started/distributions/meta-reference-gpu.md
Normal file

@ -0,0 +1,111 @@
# Meta Reference Distribution

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.

| **API**         | **Inference**  | **Agents**     | **Memory**                                       | **Safety**     | **Telemetry**  |
|-----------------|----------------|----------------|--------------------------------------------------|----------------|----------------|
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |

### Prerequisite
Please make sure you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide]() to download the models.

```
$ ls ~/.llama/checkpoints
Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
```
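Before starting the distribution, it can help to confirm that the checkpoints you plan to serve are actually on disk. The following is a minimal Python sketch, not part of Llama Stack: the helper name `missing_checkpoints` is hypothetical, and it assumes checkpoints live in one directory per model under `~/.llama/checkpoints`, as in the listing above.

```python
from pathlib import Path


def missing_checkpoints(required, root=Path.home() / ".llama" / "checkpoints"):
    """Return the names in `required` that have no directory under `root`.

    Hypothetical helper: assumes one sub-directory per model, named after the model.
    """
    root = Path(root)
    return [name for name in required if not (root / name).is_dir()]


# Model names taken from the example listing; adjust to the models you serve.
missing = missing_checkpoints(["Llama3.1-8B-Instruct", "Llama-Guard-3-8B"])
if missing:
    print("Missing checkpoints:", ", ".join(missing))
else:
    print("All required checkpoints found.")
```

Passing `root=` explicitly lets the same check run against a non-default weights directory.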

### Start the Distribution (Single Node GPU)

```
$ cd distributions/meta-reference-gpu
$ ls
build.yaml  compose.yaml  README.md  run.yaml
$ docker compose up
```

> [!NOTE]
> This assumes you have a GPU available for the local server to use.

> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.

This will download and start running a pre-built Docker container. Alternatively, you may use the following command:

```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
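Once the container is up, the server listens on the port published above (5000 in this example). A minimal Python sketch for waiting until that port accepts TCP connections follows. The helper `wait_for_port` is hypothetical, and it only checks that something is listening, not that the Llama Stack API itself is healthy.

```python
import socket
import time


def wait_for_port(host="localhost", port=5000, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds pass.

    Returns True if the port accepted a connection, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port(timeout=120)` can be called from a script before it sends its first request, since loading model weights may take a while.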

### Alternative (Build and start distribution locally via conda)
- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on building locally via conda and starting up a meta-reference distribution.

### Start Distribution With pgvector/chromadb Memory Provider
##### pgvector
1. Start running the pgvector server:

```
docker run --network host --name mypostgres -it -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres pgvector/pgvector:pg16
```

2. Edit the `run.yaml` file to point to the pgvector server.
```
memory:
  - provider_id: pgvector
    provider_type: remote::pgvector
    config:
      host: 127.0.0.1
      port: 5432
      db: postgres
      user: postgres
      password: mysecretpassword
```

> [!NOTE]
> If you get a `RuntimeError: Vector extension is not installed.`, you will need to run `CREATE EXTENSION IF NOT EXISTS vector;` to include the vector extension. E.g.

```
docker exec -it mypostgres ./bin/psql -U postgres
postgres=# CREATE EXTENSION IF NOT EXISTS vector;
postgres=# SELECT extname from pg_extension;
 extname
```

3. Run `docker compose up` with the updated `run.yaml` file.

##### chromadb
1. Start running the chromadb server:
```
docker run -it --network host --name chromadb -p 6000:6000 -v ./chroma_vdb:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest
```

2. Edit the `run.yaml` file to point to the chromadb server.
```
memory:
  - provider_id: remote::chromadb
    provider_type: remote::chromadb
    config:
      host: localhost
      port: 6000
```

3. Run `docker compose up` with the updated `run.yaml` file.
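
A typo in the `memory` entry only shows up once the stack starts, so it can be worth checking the entry first. The sketch below is a hypothetical helper, not a Llama Stack API: `REQUIRED_KEYS` is inferred solely from the two example configurations above and may not match the providers' full schemas, and the entry is passed in as a plain dict (for instance, one loaded from `run.yaml` with a YAML parser of your choice).

```python
# Keys each provider's `config` is assumed to need, inferred from the examples
# in this guide only; the real provider schemas may differ.
REQUIRED_KEYS = {
    "remote::pgvector": {"host", "port", "db", "user", "password"},
    "remote::chromadb": {"host", "port"},
}


def check_memory_provider(entry):
    """Return a list of human-readable problems found in one `memory` entry."""
    problems = []
    # Provider types without an entry in REQUIRED_KEYS only get the port check.
    required = REQUIRED_KEYS.get(entry.get("provider_type"), set())
    config = entry.get("config") or {}
    for key in sorted(required - set(config)):
        problems.append(f"missing config key: {key}")
    port = config.get("port")
    if port is not None and not (isinstance(port, int) and 0 < port < 65536):
        problems.append(f"port must be an integer in 1..65535, got {port!r}")
    return problems


# The chromadb entry from above, as a dict.
entry = {
    "provider_id": "remote::chromadb",
    "provider_type": "remote::chromadb",
    "config": {"host": "localhost", "port": 6000},
}
print(check_memory_provider(entry))  # an empty list means no problems were found
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in a single pass.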

### Serving a new model
You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
```
inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.2-11B-Vision-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
```

Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.

@ -1,3 +0,0 @@
# Docker

WIP
81
docs/source/getting_started/index.md
Normal file

@ -0,0 +1,81 @@
# Getting Started with Llama Stack

By the end of this guide, you will have learned how to:
- get a Llama Stack server up and running
- get an agent (with tool-calling, vector stores) which works with the above server

To see more example apps built using Llama Stack, see [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main).

## Starting Up Llama Stack Server

### Decide your
There are two ways to start a Llama Stack:

- **Docker**: we provide a number of pre-built Docker containers allowing you to get started instantly. If you are focused on application development, we recommend this option.
- **Conda**: the `llama` CLI provides a simple set of commands to build, configure and run a Llama Stack server containing the exact combination of providers you wish. We have provided various templates to make getting started easier.

Both of these provide options to run model inference using our reference implementations, Ollama, TGI, vLLM or even remote providers like Fireworks, Together, Bedrock, etc.

### Decide Your Inference Provider

Running inference of the underlying Llama model is one of the most critical requirements. Depending on what hardware you have available, you have various options:

- **Do you have access to a machine with powerful GPUs?**
  If so, we suggest:
  - `distribution-meta-reference-gpu`:
    - [Docker]()
    - [Conda]()
  - `distribution-tgi`:
    - [Docker]()
    - [Conda]()

- **Are you running on a "regular" desktop machine?**
  If so, we suggest:
  - `distribution-ollama`:
    - [Docker]()
    - [Conda]()

- **Do you have access to a remote inference provider like Fireworks, Together, etc.?** If so, we suggest:
  - `distribution-fireworks`:
    - [Docker]()
    - [Conda]()
  - `distribution-together`:
    - [Docker]()
    - [Conda]()

## Testing with client
Once the server is set up, we can test it with a client to see example outputs. The request below queries the distribution's `/inference/chat_completion` API. Send a POST request to the server:

```
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
    "model": "Llama3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'

Output:
{'completion_message': {'role': 'assistant',
  'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
  'stop_reason': 'out_of_tokens',
  'tool_calls': []},
 'logprobs': null}
```
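The same request can be sent from Python with only the standard library. This is a minimal sketch mirroring the curl call above: the helper names `build_chat_request` and `chat_completion` are hypothetical, the URL and payload fields are copied from the example rather than from an API reference, and it assumes the server replies with a single JSON body (a streaming response would need different handling).

```python
import json
import urllib.request


def build_chat_request(user_message, model="Llama3.1-8B-Instruct",
                       base_url="http://localhost:5000"):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
    }
    return urllib.request.Request(
        f"{base_url}/inference/chat_completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_completion(user_message, **kwargs):
    """Send the request and decode the reply, assuming a single JSON body."""
    request = build_chat_request(user_message, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `chat_completion("Write me a 2 sentence poem about the moon")` sends the same payload as the curl command. Keeping request construction separate from sending makes the payload easy to inspect without a live server.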

Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.

You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.

```{toctree}
:hidden:
:maxdepth: 2

developer_cookbook
distributions/index
```