refactor structure

Xi Yan 2024-10-29 14:04:41 -07:00
parent 9ddc28eca7
commit 42104361a3
13 changed files with 293 additions and 562 deletions


@@ -1,2 +0,0 @@
# Conda
WIP


@@ -0,0 +1,41 @@
# Llama Stack Developer Cookbook
Based on your developer needs, below are references to guides to help you get started.
### Hosted Llama Stack Endpoint
* Developer Need: I want to connect to a Llama Stack endpoint to build my applications.
* Effort: 1min
* Guide:
- Check out our [DeepLearning course](https://www.deeplearning.ai/short-courses/introducing-multimodal-llama-3-2) on building with Llama Stack apps on a pre-hosted Llama Stack endpoint.
### Local meta-reference Llama Stack Server
* Developer Need: I want to start a local Llama Stack server with my GPU using meta-reference implementations.
* Effort: 5min
* Guide:
- Please see our [Getting Started Guide](./getting_started.md) on starting up a meta-reference Llama Stack server.
### Llama Stack Server with Remote Providers
* Developer Need: I want a Llama Stack distribution with a remote provider.
* Effort: 10min
* Guide:
- Please see our [Distributions Guide](../../../distributions/) on starting up distributions with remote providers.
### On-Device (iOS) Llama Stack
* Developer Need: I want to use Llama Stack on-device.
* Effort: 1.5hr
* Guide:
- Please see our [iOS Llama Stack SDK](./ios_setup.md) implementation guide.
### Assemble your own Llama Stack Distribution
* Developer Need: I want to assemble my own distribution with API providers to my liking.
* Effort: 30min
* Guide:
- Please see our [Building Distribution](./building_distro.md) guide for assembling your own Llama Stack distribution with your choice of API providers.
### Adding a New API Provider
* Developer Need: I want to add a new API provider to Llama Stack.
* Effort: 3hr
* Guide:
- Please see our [Adding a New API Provider](./new_api_provider.md) guide for adding a new API provider.


@@ -0,0 +1,9 @@
# Llama Stack Distribution
A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally but choose a cloud provider for a large model. Regardless, the higher-level APIs your app works with don't need to change at all. You can even imagine moving across the server / mobile-device boundary while always using the same uniform set of APIs for developing Generative AI applications.
```{toctree}
:maxdepth: 2
meta-reference-gpu
```


@@ -0,0 +1,111 @@
# Meta Reference Distribution
The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
### Prerequisite
Please make sure you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide]() for downloading the models.
```
$ ls ~/.llama/checkpoints
Llama3.1-8B Llama3.2-11B-Vision-Instruct Llama3.2-1B-Instruct Llama3.2-90B-Vision-Instruct Llama-Guard-3-8B
Llama3.1-8B-Instruct Llama3.2-1B Llama3.2-3B-Instruct Llama-Guard-3-1B Prompt-Guard-86M
```
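If you still need to fetch checkpoints, the `llama model list` and `llama model download` commands referenced at the end of this page can be used. A minimal sketch follows; the `--source` and `--model-id` flags are assumptions and may differ across CLI versions, so verify with `llama model download --help`.
```
# Sketch only: list downloadable models, then fetch one into ~/.llama
# (--source/--model-id are assumed flags; verify with `llama model download --help`)
llama model list
llama model download --source meta --model-id Llama3.2-11B-Vision-Instruct
```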
### Start the Distribution (Single Node GPU)
```
$ cd distributions/meta-reference-gpu
$ ls
build.yaml compose.yaml README.md run.yaml
$ docker compose up
```
> [!NOTE]
> This assumes you have access to a GPU and will start a local server with access to it.
> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.
This will download and start running a pre-built docker container. Alternatively, you may use the following command:
```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
### Alternative (Build and start distribution locally via conda)
- You may check out the [Getting Started](../../docs/getting_started.md) guide for more details on building locally via conda and starting up a meta-reference distribution.
### Start Distribution With pgvector/chromadb Memory Provider
##### pgvector
1. Start running the pgvector server:
```
docker run --network host --name mypostgres -it -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres pgvector/pgvector:pg16
```
2. Edit the `run.yaml` file to point to the pgvector server.
```
memory:
- provider_id: pgvector
provider_type: remote::pgvector
config:
host: 127.0.0.1
port: 5432
db: postgres
user: postgres
password: mysecretpassword
```
> [!NOTE]
> If you get a `RuntimeError: Vector extension is not installed.` error, you will need to run `CREATE EXTENSION IF NOT EXISTS vector;` to install the vector extension. E.g.
```
docker exec -it mypostgres ./bin/psql -U postgres
postgres=# CREATE EXTENSION IF NOT EXISTS vector;
postgres=# SELECT extname from pg_extension;
extname
```
3. Run `docker compose up` with the updated `run.yaml` file.
##### chromadb
1. Start running the chromadb server:
```
docker run -it --network host --name chromadb -p 6000:6000 -v ./chroma_vdb:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest
```
2. Edit the `run.yaml` file to point to the chromadb server.
```
memory:
- provider_id: chromadb
provider_type: remote::chromadb
config:
host: localhost
port: 6000
```
3. Run `docker compose up` with the updated `run.yaml` file.
### Serving a new model
You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
```
inference:
- provider_id: meta0
provider_type: meta-reference
config:
model: Llama3.2-11B-Vision-Instruct
quantization: null
torch_seed: null
max_seq_len: 4096
max_batch_size: 1
```
Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.


@@ -1,3 +0,0 @@
# Docker
WIP


@@ -0,0 +1,81 @@
# Getting Started with Llama Stack
At the end of the guide, you will have learned how to:
- get a Llama Stack server up and running
- get an agent (with tool-calling and vector stores) that works with the above server
To see more example apps built using Llama Stack, see [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main).
## Starting Up Llama Stack Server
### Decide Your Setup: Docker or Conda
There are two ways to start a Llama Stack server:
- **Docker**: we provide a number of pre-built Docker containers allowing you to get started instantly. If you are focused on application development, we recommend this option.
- **Conda**: the `llama` CLI provides a simple set of commands to build, configure and run a Llama Stack server containing the exact combination of providers you wish. We have provided various templates to make getting started easier.
Both of these provide options to run model inference using our reference implementations, Ollama, TGI, vLLM or even remote providers like Fireworks, Together, Bedrock, etc.
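For the Conda path, a minimal sketch of the build-and-run flow is shown below. The template name and flags here are assumptions; see the [Building Distribution](./building_distro.md) guide for the exact commands for your setup.
```
# Sketch only: build a server from a template, then start it
# (template name and flags are assumptions; run `llama stack build --help` to confirm)
llama stack build --template meta-reference-gpu --image-type conda
llama stack run ./run.yaml --port 5000
```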
### Decide Your Inference Provider
Running inference of the underlying Llama model is one of the most critical requirements. Depending on what hardware you have available, you have various options; a Docker example for one of them is sketched after this list:
- **Do you have access to a machine with powerful GPUs?**
If so, we suggest:
- `distribution-meta-reference-gpu`:
- [Docker]()
- [Conda]()
- `distribution-tgi`:
- [Docker]()
- [Conda]()
- **Are you running on a "regular" desktop machine?**
If so, we suggest:
- `distribution-ollama`:
- [Docker]()
- [Conda]()
- **Do you have access to a remote inference provider like Fireworks, Together, etc.?** If so, we suggest:
- `distribution-fireworks`:
- [Docker]()
- [Conda]()
- `distribution-together`:
- [Docker]()
- [Conda]()
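As a concrete illustration of the Docker route, below is a hedged sketch of running the `distribution-ollama` option; the `llamastack/distribution-ollama` image name, port, and flags are assumptions patterned after the meta-reference command in these docs, and the `run.yaml` should come from the matching folder under `distributions/`.
```
# Sketch only: start a pre-built distribution container
# (image name, port, and --yaml_config flag are assumptions; adapt run.yaml from distributions/)
docker run -it -p 5000:5000 \
  -v ~/.llama:/root/.llama \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-ollama \
  --yaml_config /root/my-run.yaml
```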
## Testing with a Client
Once the server is set up, we can test it with a client to see example outputs. The example below queries the distribution's `/inference/chat_completion` API directly. Send a POST request to the server:
```
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
"model": "Llama3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2 sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
Output:
{'completion_message': {'role': 'assistant',
'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
'stop_reason': 'out_of_tokens',
'tool_calls': []},
'logprobs': null}
```
Check out our client SDKs for connecting to the Llama Stack server in your preferred language. You can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
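As a quick start with the Python SDK, a minimal sketch is below; the `llama-stack-client` package name is an assumption, so see the [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) repo for current install instructions.
```
# Sketch only: install the Python client SDK (package name assumed),
# then follow the repo examples to connect to http://localhost:5000
pip install llama-stack-client
```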
```{toctree}
:hidden:
:maxdepth: 2
developer_cookbook
distributions/index
```