Commit graph

610 commits

Author SHA1 Message Date
Ashwin Bharambe
c7bfac5382 Add a section for run.yamls 2024-11-22 22:58:39 -08:00
Ashwin Bharambe
1e6006c599 More simplification of the "Starting a Llama Stack" doc 2024-11-22 22:38:53 -08:00
Martin Hickey
76fc5d9f31
Update Ollama supported llama model list (#483)
# What does this PR do?

Update the list of supported Llama models for Ollama.

- [x] Addresses issue (#462)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
2024-11-22 21:56:43 -08:00
Xi Yan
039e303707 docs fix 2024-11-22 21:15:21 -08:00
Xi Yan
988f424c9c
[docs] evals (#511)
# What does this PR do?

- add evals docs

## Test Plan



https://github.com/user-attachments/assets/7a1bcfcc-2c37-4cd2-9a72-bf43c2321022



## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-22 21:09:39 -08:00
Ashwin Bharambe
0481fa9540 Fix broken links with docs 2024-11-22 20:42:17 -08:00
Ashwin Bharambe
36938b716c broken reference link 2024-11-22 18:32:32 -08:00
Ashwin Bharambe
00c59b7e39 Add Python SDK reference 2024-11-22 18:27:30 -08:00
Dinesh Yeduguru
501e7c9d64
Fix opentelemetry adapter (#510)
# What does this PR do?

This PR fixes some of the issues with our telemetry setup to enable logs
to be delivered to OpenTelemetry and Jaeger. Main fixes:
1) Updates the OpenTelemetry provider to use the latest OTLP exporters
instead of deprecated ones.
2) Adds a tracing middleware, which injects a trace into each HTTP
request that the server receives; this becomes the root trace.
Previously, we did this in the create_dynamic_route method, which is
part of configuration rather than the actual execution flow, so traces
ended prematurely. Through middleware, we plug in the trace start and
end at the right location.
3) We manage our own methods to create traces and spans, and this does
not fit well with the OpenTelemetry SDK since it does not provide a way
to take in traces and spans that are already created; it expects us to
use the SDK to create them. For now, I have a hacky approach of just
maintaining a map from our internal telemetry objects to the
OpenTelemetry-specific ones. This is not the ideal solution. I will
explore other ways to get around this issue; for now, to have something
that works, I am going to keep this as is.

Addresses: #509
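
To make point (2) concrete, here is a minimal sketch of a per-request tracing middleware, assuming a Starlette/FastAPI-style ASGI server; `start_trace`/`end_trace` below are placeholder helpers, not the stack's actual telemetry API:

```python
import logging
import uuid

logger = logging.getLogger(__name__)

async def start_trace(name: str) -> str:
    """Placeholder for the stack's internal trace-start helper."""
    trace_id = uuid.uuid4().hex
    logger.info("trace started: %s (%s)", name, trace_id)
    return trace_id

async def end_trace(trace_id: str) -> None:
    """Placeholder for the stack's internal trace-end helper."""
    logger.info("trace ended: %s", trace_id)

class TracingMiddleware:
    """Opens a root trace when an HTTP request arrives and closes it only
    after the response has been sent, instead of tying tracing to route setup."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        trace_id = await start_trace(scope.get("path", "/"))
        try:
            await self.app(scope, receive, send)
        finally:
            await end_trace(trace_id)
```

With a FastAPI app this kind of middleware would be registered via `app.add_middleware(TracingMiddleware)`.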
2024-11-22 18:18:11 -08:00
dltn
beab798a1d Add initial direct client docs 2024-11-22 18:04:48 -08:00
Ashwin Bharambe
31e983ab68 Simplify feature request ISSUE template 2024-11-22 18:02:39 -08:00
Xi Yan
d97cfaa9d9
[docs] add openapi spec to docs (#508)
# What does this PR do?
- modify the OpenAPI generator to add a "coming soon" tag for unimplemented APIs
- add the sphinx-redocs extension so the OpenAPI spec renders on the Read the Docs page
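
For reference, a `conf.py` fragment along these lines embeds an OpenAPI spec into a Sphinx/Read the Docs build; the exact extension (`sphinxcontrib-redoc` here) and the spec path are assumptions, not necessarily what this PR uses:

```python
# docs/conf.py (illustrative fragment)
extensions = [
    "sphinxcontrib.redoc",  # renders an OpenAPI spec with ReDoc
]

redoc = [
    {
        "name": "Llama Stack API",
        "page": "api",                             # produces api.html
        "spec": "_static/llama-stack-spec.yaml",   # hypothetical spec location
        "embed": True,                             # inline the spec into the page
    }
]
```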

## Test Plan



https://github.com/user-attachments/assets/b4c7eebc-2361-4198-a987-dbfbcff914cf






## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-22 17:54:32 -08:00
Ashwin Bharambe
1b2b32f959 Minor updates to docs 2024-11-22 17:44:05 -08:00
Ashwin Bharambe
6229562760 Organize references 2024-11-22 16:46:45 -08:00
Ashwin Bharambe
6fbf526d5c Move gitignore from docs/ to the main gitignore 2024-11-22 15:55:34 -08:00
Ashwin Bharambe
526a8dcfe0 Minor edit to zero_to_hero_guide 2024-11-22 15:52:56 -08:00
Ashwin Bharambe
0bd774716c Kill pancakes logo 2024-11-22 15:51:11 -08:00
Ashwin Bharambe
5acb15d2bf Make quickstart.md -> README.md so it shows up as default 2024-11-22 15:50:25 -08:00
Justin Lee
9928405e2c
Docs improvement v3 (#433)
# What does this PR do?

- updated the notebooks to reflect past changes up to llama-stack 0.0.53
- updated the README to provide accurate and up-to-date info
- improved the current Zero to Hero guide by integrating an example using the
Together API


## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.

---------

Co-authored-by: Sanyam Bhutani <sanyambhutani@meta.com>
2024-11-22 15:43:31 -08:00
Ashwin Bharambe
97dc5b68e5 model -> model_id for TGI 2024-11-22 15:40:08 -08:00
Ashwin Bharambe
c2c53d0272 More doc cleanup 2024-11-22 14:37:22 -08:00
Ashwin Bharambe
900b0556e7 Much more documentation work, things are getting a bit consumable right now 2024-11-22 14:06:18 -08:00
Ashwin Bharambe
98e213e96c More docs work 2024-11-22 14:06:18 -08:00
Ashwin Bharambe
eb2063bc3d Updates to the main doc page 2024-11-22 14:06:18 -08:00
dltn
eaf4fbef75 another print -> log fix 2024-11-22 13:35:34 -08:00
dltn
302a0145e5 we do want prints in print_pip_install_help 2024-11-22 13:32:54 -08:00
Dalton Flanagan
b007b062f3
Fix llama stack build in 0.0.54 (#505)
# What does this PR do?

Safety provider `inline::meta-reference` is now deprecated. However, we

* aren't checking / printing the deprecation message in `llama stack
build`
* still make the deprecated (unusable) provider the default

So I (1) added checking and (2) made `inline::llama-guard` the default
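
A minimal sketch of the added check, adapted from the traceback below (the field and exception names follow what the trace shows; this is not the verbatim source):

```python
class InvalidProviderError(Exception):
    pass

def check_provider_deprecation(provider) -> None:
    # Refuse to generate a run config for a provider whose spec carries a
    # deprecation_error, and surface that message to the user.
    if getattr(provider, "deprecation_error", None):
        raise InvalidProviderError(provider.deprecation_error)
```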

## Test Plan

Before

```
Traceback (most recent call last):
  File "/home/dalton/.conda/envs/nov22/bin/llama", line 8, in <module>
    sys.exit(main())
  File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 46, in main
    parser.run(args)
  File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 40, in run
    args.func(args)
  File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 177, in _run_stack_build_command
    self._run_stack_build_command_from_build_config(build_config)
  File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 305, in _run_stack_build_command_from_build_config
    self._generate_run_config(build_config, build_dir)
  File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 226, in _generate_run_config
    config_type = instantiate_class_type(
  File "/home/dalton/all/llama-stack/llama_stack/distribution/utils/dynamic.py", line 12, in instantiate_class_type
    module = importlib.import_module(module_name)
  File "/home/dalton/.conda/envs/nov22/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'llama_stack.providers.inline.safety.meta_reference'
```

After

```
Traceback (most recent call last):
  File "/home/dalton/.conda/envs/nov22/bin/llama", line 8, in <module>
    sys.exit(main())
  File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 46, in main
    parser.run(args)
  File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 40, in run
    args.func(args)
  File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 177, in _run_stack_build_command
    self._run_stack_build_command_from_build_config(build_config)
  File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 309, in _run_stack_build_command_from_build_config
    self._generate_run_config(build_config, build_dir)
  File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 228, in _generate_run_config
    raise InvalidProviderError(p.deprecation_error)
llama_stack.distribution.resolver.InvalidProviderError: 
Provider `inline::meta-reference` for API `safety` does not work with the latest Llama Stack.
- if you are using Llama Guard v3, please use the `inline::llama-guard` provider instead.
- if you are using Prompt Guard, please use the `inline::prompt-guard` provider instead.
- if you are using Code Scanner, please use the `inline::code-scanner` provider instead.
```

<img width="469" alt="Screenshot 2024-11-22 at 4 10 24 PM"
src="https://github.com/user-attachments/assets/8c2e09fe-379a-4504-b246-7925f80a6ed6">

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-22 16:23:44 -05:00
Ashwin Bharambe
2137b0af40 Bump version to 0.0.54 2024-11-21 16:28:30 -08:00
Ashwin Bharambe
c1025ebfdb Delete some dead code 2024-11-21 15:20:06 -08:00
Ashwin Bharambe
a0a00f1345 Update telemetry to have TEXT be the default log format 2024-11-21 15:18:45 -08:00
Xi Yan
945db5dac2 fix logging 2024-11-21 15:02:57 -08:00
Ashwin Bharambe
d790be28b3 Don't skip meta-reference for the tests 2024-11-21 13:29:53 -08:00
Ashwin Bharambe
55c55b9f51 Update Quick Start significantly 2024-11-21 13:20:55 -08:00
Xi Yan
654722da7d fix model id for llm_as_judge_405b 2024-11-21 11:34:49 -08:00
Dinesh Yeduguru
6395dadc2b
use logging instead of prints (#499)
# What does this PR do?

This PR moves all print statements to use logging. Things changed:
- Had to add `await start_trace("sse_generator")` to server.py to
actually get tracing working; otherwise no logs were showing up.
- If no telemetry provider is provided in the run.yaml, we will write to
stdout.
- By default, the logs are emitted as JSON, but we expose an option to
output them in a human-readable format.
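
To illustrate that last point, a formatter switch along these lines (a sketch, not the stack's actual implementation) lets the same logger emit either JSON or human-readable text on stdout:

```python
import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    """Minimal JSON log formatter (illustrative only)."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "name": record.name,
            "message": record.getMessage(),
        })

def configure_logging(log_format: str = "json") -> None:
    # Write to stdout when no telemetry provider is configured.
    handler = logging.StreamHandler(sys.stdout)
    if log_format == "json":
        handler.setFormatter(JSONFormatter())
    else:
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)
```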
2024-11-21 11:32:53 -08:00
liyunlu0618
4e1105e563
Fix fp8 quantization script. (#500)
# What does this PR do?

Fix fp8 quantization script.

## Test Plan

```
sh run_quantize_checkpoint.sh localhost fp8 /home/yll/fp8_test/ /home/yll/fp8_test/quantized_2 /home/yll/fp8_test/tokenizer.model 1 1
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

Co-authored-by: Yunlu Li <yll@meta.com>
2024-11-21 09:15:28 -08:00
Ashwin Bharambe
cf079a22a0 Plurals 2024-11-20 23:24:59 -08:00
Ashwin Bharambe
cd6ccb664c Integrate distro docs into the restructured docs 2024-11-20 23:20:05 -08:00
Ashwin Bharambe
2411a44833 Update more distribution docs to be simpler and partially codegen'ed 2024-11-20 22:03:44 -08:00
Ashwin Bharambe
e84d4436b5
Since we are pushing for HF repos, we should accept them in inference configs (#497)
# What does this PR do?

As the title says. 
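
In practice this means an inference config can name a model either by its Llama descriptor or by its Hugging Face repo; a hypothetical normalization helper (the mapping and function below are illustrations, not the llama_stack implementation) could look like:

```python
# Hypothetical illustration of accepting either identifier form.
DESCRIPTOR_TO_HF_REPO = {
    "Llama3.2-3B-Instruct": "meta-llama/Llama-3.2-3B-Instruct",
}

def normalize_model_id(model_id: str) -> str:
    """Return the HF repo form whether given a descriptor or an HF repo."""
    if "/" in model_id:  # already looks like an org/name HF repo
        return model_id
    return DESCRIPTOR_TO_HF_REPO.get(model_id, model_id)
```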

## Test Plan

This needs
8752149f58
to also land. So the next package (0.0.54) will make this work properly.

The test is:

```bash
pytest -v -s -m "llama_3b and meta_reference" test_model_registration.py
```
2024-11-20 16:14:37 -08:00
Dinesh Yeduguru
b3f9e8b2f2
Restructure docs (#494)
Rendered docs at: https://llama-stack.readthedocs.io/en/doc-simplify/
2024-11-20 15:54:47 -08:00
Ashwin Bharambe
068ac00a3b
Don't depend on templates.py when print llama stack build messages (#496) 2024-11-20 15:44:49 -08:00
Ashwin Bharambe
00816cc8ef make sure codegen doesn't cause spurious diffs for no reason 2024-11-20 13:56:30 -08:00
Ashwin Bharambe
681322731b
Make run yaml optional so dockers can start with just --env (#492)
When running with Docker, the idea is that users should be able to work
purely with the `llama stack` CLI. They should not need to know about
the existence of any YAMLs unless they want to. This PR enables that.

The docker command now doesn't need to volume-mount a YAML and can
simply be:
```bash
docker run -v ~/.llama/:/root/.llama \
  --env A=a --env B=b
```

## Test Plan

Check with conda first (no regressions):
```bash
LLAMA_STACK_DIR=. llama stack build --template ollama
llama stack run ollama --port 5001

# server starts up correctly
```

Check with docker
```bash
# build the docker
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type docker

export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"

docker run -it  -p 5001:5001 \
  -v ~/.llama:/root/.llama \
  -v $PWD:/app/llama-stack-source \
  localhost/distribution-ollama:dev \
  --port 5001 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.docker.internal:11434
```

Note that volume mounting to `/app/llama-stack-source` is only needed
because we built the docker with uncommitted source code.
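
A rough sketch of the fallback behavior described above (names and paths are illustrative guesses, not the actual CLI code): if no run config is supplied, fall back to the config bundled with the template and let `--env` values drive substitution.

```python
import os
from pathlib import Path
from typing import Optional

BUNDLED_CONFIG_DIR = Path("/opt/llama-stack/distributions")  # placeholder location

def resolve_run_config(config_path: Optional[str], template: str, env: dict) -> Path:
    # Make --env overrides visible to ${env.VAR} substitution in the config.
    for key, value in env.items():
        os.environ[key] = value
    if config_path:
        return Path(config_path)
    # Fall back to the run.yaml baked into the distribution image/template.
    return BUNDLED_CONFIG_DIR / template / "run.yaml"
```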
2024-11-20 13:11:40 -08:00
Dinesh Yeduguru
1d8d0593af
register with provider even if present in stack (#491)
# What does this PR do?

Remove a check which skips provider registration if a resource is
already in the stack registry. Since we do not reconcile state with the
provider, register should always call into the provider's register
endpoint.
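
A before/after sketch of the registration path (names are illustrative, not the actual llama_stack code):

```python
async def register_object(registry, provider, obj):
    # Previously: `if await registry.get(obj.identifier): return` short-circuited
    # here, so the provider never saw the registration again after a restart.
    # Now registration is always forwarded, since registry and provider state
    # are not reconciled.
    await provider.register(obj)
    await registry.set(obj.identifier, obj)
    return obj
```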


## Test Plan
```
# stack run
╰─❯ llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml

#register memory bank
❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0

Memory Bank Configuration:
{
│   'memory_bank_type': 'vector',
│   'chunk_size_in_tokens': 512,
│   'embedding_model': 'all-MiniLM-L6-v2',
│   'overlap_size_in_tokens': 64
}

#register again
❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0

Memory Bank Configuration:
{
│   'memory_bank_type': 'vector',
│   'chunk_size_in_tokens': 512,
│   'embedding_model': 'all-MiniLM-L6-v2',
│   'overlap_size_in_tokens': 64
}
```
2024-11-20 11:05:50 -08:00
Dinesh Yeduguru
91e7efbc91
fall to back to read from chroma/pgvector when not in cache (#489)
# What does this PR do?

The Chroma provider maintains a cache but does not sync up with Chroma
on a cold start. This change adds a fallback to read from Chroma on a
cache miss.
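
The shape of the fallback, sketched with hypothetical names (not the actual provider code):

```python
class ChromaMemoryAdapter:
    def __init__(self, client):
        self.client = client  # a chromadb async client
        self.cache = {}       # bank_id -> collection handle

    async def _get_bank_collection(self, bank_id: str):
        # Serve from the in-memory cache when possible.
        if bank_id in self.cache:
            return self.cache[bank_id]
        # Cache miss (e.g. cold start after a restart): read the collection
        # back from Chroma and repopulate the cache.
        collection = await self.client.get_or_create_collection(name=bank_id)
        self.cache[bank_id] = collection
        return collection
```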


## Test Plan
```bash
#start stack
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
# Add documents
PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000

No available shields. Disable safety.
Using model: Llama3.1-8B-Instruct
Created session_id=b951b14f-a9d2-43a3-8b80-d80114d58322 for Agent(0687a251-6906-4081-8d4c-f52e19db9dd7)
memory_retrieval> Retrieved context from banks: ['test_bank'].
====
Here are the retrieved documents for relevant context:
=== START-RETRIEVED-CONTEXT ===
 id:num-1; content:_
the template from Llama2 to better support multiturn conversations. The same text
in the Lla...
>
inference> Based on the retrieved documentation, the top 5 topics that were explained are:
...............

# Kill stack
# Bootup stack
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
# Run a RAG app with just the agent flow. it discovers the previously added documents
No available shields. Disable safety.
Using model: Llama3.1-8B-Instruct
Created session_id=7a30c1a7-c87e-4787-936c-d0306589fe5d for Agent(b30420f3-c928-498a-887b-d084f0f3806c)
memory_retrieval> Retrieved context from banks: ['test_bank'].
====
Here are the retrieved documents for relevant context:
=== START-RETRIEVED-CONTEXT ===
 id:num-1; content:_
the template from Llama2 to better support multiturn conversations. The same text
in the Lla...
>
inference> Based on the provided documentation, the top 5 topics that were explained are:
.....
```
2024-11-20 10:30:23 -08:00
Justin Lee
ae49a4cb97
Reorganizing Zero to Hero Folder structure (#447)
Putting the Zero to Hero Guide at the repo root for increased visibility
2024-11-20 10:27:29 -08:00
Ashwin Bharambe
89f5093dfc Fix tgi doc 2024-11-19 21:06:11 -08:00
Mengtao Yuan
1086b500f9
Support Tavily as built-in search tool. (#485)
# What does this PR do?

Add Tavily as a built-in search tool, in addition to Brave and Bing.
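
For context, Tavily exposes a simple REST search endpoint; a minimal standalone call, independent of the stack's tool plumbing and using the same `TAVILY_SEARCH_API_KEY` variable as the test plan below, looks roughly like this (the request/response shape is approximate; see the Tavily docs for the authoritative API):

```python
import os
import requests

def tavily_search(query: str, max_results: int = 3) -> list:
    resp = requests.post(
        "https://api.tavily.com/search",
        json={
            "api_key": os.environ["TAVILY_SEARCH_API_KEY"],
            "query": query,
            "max_results": max_results,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Each result typically carries a title, url, content snippet and score.
    return resp.json().get("results", [])
```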

## Test Plan

It's tested using ollama remote, showing parity to the Brave search
tool.
- Install and run ollama with `ollama run llama3.1:8b-instruct-fp16`
- Build the ollama distribution: `llama stack build --template ollama
--image-type conda`
- Run the stack: `llama stack run
/$USER/.llama/distributions/llamastack-ollama/ollama-run.yaml --port
5001`
- Client test command: `python -m
agents.test_agents.TestAgents.test_create_agent_turn_with_tavily_search`,
with environment variables:

MASTER_ADDR=0.0.0.0;MASTER_PORT=5001;RANK=0;REMOTE_STACK_HOST=0.0.0.0;REMOTE_STACK_PORT=5001;TAVILY_SEARCH_API_KEY=tvly-<YOUR-KEY>;WORLD_SIZE=1

Test passes on the specific case (ollama remote).

Server output: 
```
Listening on ['::', '0.0.0.0']:5001
INFO:     Started server process [7220]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
INFO:     127.0.0.1:65209 - "POST /agents/create HTTP/1.1" 200 OK
INFO:     127.0.0.1:65210 - "POST /agents/session/create HTTP/1.1" 200 OK
INFO:     127.0.0.1:65211 - "POST /agents/turn/create HTTP/1.1" 200 OK
role='user' content='What are the latest developments in quantum computing?' context=None
role='assistant' content='' stop_reason=<StopReason.end_of_turn: 'end_of_turn'> tool_calls=[ToolCall(call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'latest developments in quantum computing'})]
role='ipython' call_id='fc92ccb8-1039-4ce8-ba5e-8f2b0147661c' tool_name=<BuiltinTool.brave_search: 'brave_search'> content='{"query": "latest developments in quantum computing", "top_k": [{"title": "IBM Unveils 400 Qubit-Plus Quantum Processor and Next-Generation IBM ...", "url": "https://newsroom.ibm.com/2022-11-09-IBM-Unveils-400-Qubit-Plus-Quantum-Processor-and-Next-Generation-IBM-Quantum-System-Two", "content": "This system is targeted to be online by the end of 2023 and will be a building b...<more>...onnect large-scale ...", "url": "https://news.mit.edu/2023/quantum-interconnects-photon-emission-0105", "content": "Quantum computers hold the promise of performing certain tasks that are intractable even on the world\'s most powerful supercomputers. In the future, scientists anticipate using quantum computing to emulate materials systems, simulate quantum chemistry, and optimize hard tasks, with impacts potentially spanning finance to pharmaceuticals.", "score": 0.71721, "raw_content": null}]}'
Assistant: The latest developments in quantum computing include:

* IBM unveiling its 400 qubit-plus quantum processor and next-generation IBM Quantum System Two, which will be a building block of quantum-centric supercomputing.
* The development of utility-scale quantum computing, which can serve as a scientific tool to explore utility-scale classes of problems in chemistry, physics, and materials beyond brute force classical simulation of quantum mechanics.
* The introduction of advanced hardware across IBM's global fleet of 100+ qubit systems, as well as easy-to-use software that users and computational scientists can now obtain reliable results from quantum systems as they map increasingly larger and more complex problems to quantum circuits.
* Research on quantum repeaters, which use defects in diamond to interconnect quantum systems and could provide the foundation for scalable quantum networking.
* The development of a new source of quantum light, which could be used to improve the efficiency of quantum computers.
* The creation of a new mathematical "blueprint" that is accelerating fusion device development using Dyson maps.
* Research on canceling noise to improve quantum devices, with MIT researchers developing a protocol to extend the life of quantum coherence.
```

Verified with tool response. The final model response is updated with
the search requests.

## Sources

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

Co-authored-by: Martin Yuan <myuan@meta.com>
2024-11-19 20:59:02 -08:00
varunfb
08be023290
Added optional md5 validate command once download is completed (#486)
# What does this PR do?

Adds a message at the end of a successful download describing the
optional command for verifying md5 checksums.
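
The verification itself boils down to hashing each downloaded file and comparing against the published checksums; a generic sketch (not the `llama` CLI's exact implementation):

```python
import hashlib
from pathlib import Path

def md5_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the md5 of a file in streaming fashion."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checksums(model_dir: Path, expected: dict) -> bool:
    """Return True when every file matches its expected md5 (name -> checksum)."""
    return all(md5_of(model_dir / name) == md5 for name, md5 in expected.items())
```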

## Test Plan
<img width="2004" alt="Screenshot 2024-11-19 at 12 11 37 PM"
src="https://github.com/user-attachments/assets/8d617aef-99f5-4c3b-b93c-eff3e68289ea">

## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

---------

Co-authored-by: varunfb <vontimitta@devgpu004.eag5.facebook.com>
2024-11-19 17:42:43 -08:00