Commit graph

844 commits

Author SHA1 Message Date
Henry Tu
64c6df8392
Cerebras Inference Integration (#265)
Adding Cerebras Inference as an API provider.

## Testing

### Conda
```
$ llama stack build --template cerebras --image-type conda
$ llama stack run ~/.llama/distributions/llamastack-cerebras/cerebras-run.yaml
...
Listening on ['::', '0.0.0.0']:5000
INFO:     Started server process [12443]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
```

### Chat Completion
```
$ curl --location 'http://localhost:5000/alpha/inference/chat-completion' --header 'Content-Type: application/json' --data '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is the temperature in Seattle right now?"
        }
    ],
    "stream": false,
    "sampling_params": {
        "strategy": "top_p",
        "temperature": 0.5,
        "max_tokens": 100
    },                   
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tools": [                   
        {
            "tool_name": "getTemperature",
            "description": "Gets the current temperature of a location.",
            "parameters": {                                              
                "location": {
                    "param_type": "string",
                    "description": "The name of the place to get the temperature from in degress celsius.",
                    "required": true                                                                       
                }                   
            }    
        }    
    ]    
}' 
```

#### Non-Streaming Response
```
{
  "completion_message": {
    "role": "assistant",
    "content": "",
    "stop_reason": "end_of_message",
    "tool_calls": [
      {
        "call_id": "6f42fdcc-6cbb-46ad-a17b-5d20ac64b678",
        "tool_name": "getTemperature",
        "arguments": {
          "location": "Seattle"
        }
      }
    ]
  },
  "logprobs": null
}
```

#### Streaming Response
```
data: {"event":{"event_type":"start","delta":"","logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"","parse_status":"started"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"{\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"type","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"function","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"name","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"get","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"Temperature","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"parameters","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" {\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"location","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"Seattle","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\"}}","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":{"call_id":"e742df1f-0ae9-40ad-a49e-18e5c905484f","tool_name":"getTemperature","arguments":{"location":"Seattle"}},"parse_status":"success"},"logprobs":null,"stop_reason":"end_of_message"}}
data: {"event":{"event_type":"complete","delta":"","logprobs":null,"stop_reason":"end_of_message"}}
```

### Completion
```
$ curl --location 'http://localhost:5000/alpha/inference/completion' --header 'Content-Type: application/json' --data '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "content": "1,2,3,",
    "stream": true,
    "sampling_params": {
        "strategy": "top_p",
        "temperature": 0.5,
        "max_tokens": 10
    },                   
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tools": [                   
        {
            "tool_name": "getTemperature",
            "description": "Gets the current temperature of a location.",
            "parameters": {                                              
                "location": {
                    "param_type": "string",
                    "description": "The name of the place to get the temperature from in degress celsius.",
                    "required": true                                                                       
                }                   
            }    
        }    
    ]    
}'
```

#### Non-Streaming Response
```
{
  "content": "4,5,6,7,8,",
  "stop_reason": "out_of_tokens",
  "logprobs": null
}
```

#### Streaming Response
```
data: {"delta":"4","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"5","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"6","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"7","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"8","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"","stop_reason":null,"logprobs":null}
data: {"delta":"","stop_reason":"out_of_tokens","logprobs":null}
```

### Pre-Commit Checks
```
trim trailing whitespace.................................................Passed
check python ast.........................................................Passed
check for merge conflicts................................................Passed
check for added large files..............................................Passed
fix end of files.........................................................Passed
Insert license in comments...............................................Passed
flake8...................................................................Passed
Format files with µfmt...................................................Passed
```

### Testing with `test_inference.py`
```
$ export CEREBRAS_API_KEY=<insert API key here>
$ pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -m "cerebras and llama_8b" 
/net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=================================================== test session starts ===================================================
platform linux -- Python 3.12.3, pytest-8.3.3, pluggy-1.5.0 -- /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 128 items / 120 deselected / 8 selected                                                                         

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-cerebras] Resolved 4 providers
 inner-inference => cerebras
 models => __routing_table__
 inference => __autorouted__
 inspect => __builtin__

Models: meta-llama/Llama-3.1-8B-Instruct served by cerebras

PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_8b-cerebras] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-cerebras] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-cerebras] PASSED

================================ 6 passed, 2 skipped, 120 deselected, 6 warnings in 3.95s =================================
```

I ran `python llama_stack/scripts/distro_codegen.py` to run codegen.
2024-12-03 21:15:32 -08:00
Kai Wu
b6500974ec
removed assertion in ollama.py and fixed typo in the readme (#563)
# What does this PR do?
1. removed [incorrect
assertion](435f34b05e/llama_stack/providers/remote/inference/ollama/ollama.py (L183))
in ollama.py
2. fixed a typo in [this
line](435f34b05e/docs/source/distributions/importing_as_library.md (L24)),
as `model=` should be `model_id=` .

- [x] Addresses issue
([#issue562](https://github.com/meta-llama/llama-stack/issues/562))


## Test Plan

tested with code:

```python
import asyncio
import os

# pip install aiosqlite ollama faiss
from llama_stack_client.lib.direct.direct import LlamaStackDirectClient
from llama_stack_client.types import SystemMessage, UserMessage


async def main():
    os.environ["INFERENCE_MODEL"] = "meta-llama/Llama-3.2-1B-Instruct"
    client = await LlamaStackDirectClient.from_template("ollama")
    await client.initialize()
    response = await client.models.list()
    print(response)
    model_name = response[0].identifier
    response = await client.inference.chat_completion(
        messages=[
            SystemMessage(content="You are a friendly assistant.", role="system"),
            UserMessage(
                content="hello world, write me a 2 sentence poem about the moon",
                role="user",
            ),
        ],
        model_id=model_name,
        stream=False,
    )
    print("\nChat completion response:")
    print(response, type(response))


asyncio.run(main())

```
OUTPUT:
```
python test.py
Using template ollama with config:
apis:
- agents
- inference
- memory
- safety
- telemetry
conda_env: ollama
datasets: []
docker_image: null
eval_tasks: []
image_name: ollama
memory_banks: []
metadata_store:
  db_path: /Users/kaiwu/.llama/distributions/ollama/registry.db
  namespace: null
  type: sqlite
models:
- metadata: {}
  model_id: meta-llama/Llama-3.2-1B-Instruct
  provider_id: ollama
  provider_model_id: null
providers:
  agents:
  - config:
      persistence_store:
        db_path:
/Users/kaiwu/.llama/distributions/ollama/agents_store.db
        namespace: null
        type: sqlite
    provider_id: meta-reference
    provider_type: inline::meta-reference
  inference:
  - config:
      url: http://localhost:11434
    provider_id: ollama
    provider_type: remote::ollama
  memory:
  - config:
      kvstore:
        db_path:
/Users/kaiwu/.llama/distributions/ollama/faiss_store.db
        namespace: null
        type: sqlite
    provider_id: faiss
    provider_type: inline::faiss
  safety:
  - config: {}
    provider_id: llama-guard
    provider_type: inline::llama-guard
  telemetry:
  - config: {}
    provider_id: meta-reference
    provider_type: inline::meta-reference
scoring_fns: []
shields: []
version: '2'

[Model(identifier='meta-llama/Llama-3.2-1B-Instruct', provider_resource_id='llama3.2:1b-instruct-fp16', provider_id='ollama', type='model', metadata={})]

Chat completion response:
completion_message=CompletionMessage(role='assistant', content='Here is a short poem about the moon:\n\nThe moon glows bright in the midnight sky,\nA silver crescent shining, catching the eye.', stop_reason=<StopReason.end_of_turn: 'end_of_turn'>, tool_calls=[]) logprobs=None <class 'llama_stack.apis.inference.inference.ChatCompletionResponse'>
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-03 20:11:19 -08:00
Xi Yan
6e10d0b23e precommit 2024-12-03 18:52:43 -08:00
Xi Yan
fd19a8a517 add missing __init__ 2024-12-03 18:50:18 -08:00
Matthew Farrellee
435f34b05e
reduce the accuracy requirements to pass the chat completion structured output test (#522)
i find `test_structured_output` to be flakey. it's both a functionality
and accuracy test -

```
        answer = AnswerFormat.model_validate_json(response.completion_message.content)
        assert answer.first_name == "Michael"
        assert answer.last_name == "Jordan"
        assert answer.year_of_birth == 1963
        assert answer.num_seasons_in_nba == 15
```

it's an accuracy test because it checks the value of first/last name,
birth year, and num seasons.

i find that -
- llama-3.1-8b-instruct and llama-3.2-3b-instruct pass the functionality
portion
- llama-3.2-3b-instruct consistently fails the accuracy portion
(thinking MJ was in the NBA for 14 seasons)
 - llama-3.1-8b-instruct occasionally fails the accuracy portion

suggestions (not mutually exclusive) -
1. turn the test into functionality only, skip the value checks
2. split the test into a functionality version and an xfail accuracy
version
3. add context to the prompt so the llm can answer without accessing
embedded memory

# What does this PR do?

implements option (3) by adding context to the system prompt.


## Test Plan


`pytest -s -v ... llama_stack/providers/tests/inference/ ... -k
structured_output`


## Before submitting

- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
2024-12-03 02:55:14 -08:00
dltn
4c7b1a8fb3 Bump version to 0.0.57 2024-12-02 19:48:46 -08:00
Dinesh Yeduguru
1e2faa461f
update client cli docs (#560)
Test plan: 
make html
sphinx-autobuild source build/html


![Screenshot 2024-12-02 at 3 32
18 PM](https://github.com/user-attachments/assets/061d5ca6-178f-463a-854c-acb96ca3bb0d)
2024-12-02 16:10:16 -08:00
Aidan Do
6bcd1bd9f1
Fix broken Ollama link (#554)
# What does this PR do?

Fixes a broken Ollama link and formatting on this page:
https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/ollama.html

<img width="714" alt="Screenshot 2024-12-02 at 21 04 17"
src="https://github.com/user-attachments/assets/ada893c3-e1bd-4f04-826f-9ce1a11330a3">

<img width="822" alt="image"
src="https://github.com/user-attachments/assets/ab47cec3-3fcc-4671-92ae-febbc5003e6f">

To:

<img width="714" alt="Screenshot 2024-12-02 at 21 05 07"
src="https://github.com/user-attachments/assets/07a41653-1978-4472-bfa0-5f65dbf5cab5">

<img width="616" alt="image"
src="https://github.com/user-attachments/assets/dd0022e6-3468-4de0-bd55-c4ce2840c7d6">


## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).

Co-authored-by: Aidan Do <aidand@canva.com>
2024-12-02 11:06:20 -08:00
Dinesh Yeduguru
fe48b9fb8c Bump version to 0.0.56 2024-11-30 12:27:31 -08:00
raghotham
8a3887c7eb
Guide readme fix (#552)
# What does this PR do?
Fixes readme to remove redundant information and added
llama-stack-client cli instructions.

## Before submitting

- [ X] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ X] Ran pre-commit to handle lint / formatting issues.
- [ X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ X] Updated relevant documentation.
2024-11-30 12:28:03 -06:00
Jeffrey Lind
2fc1c16d58
Fix Zero to Hero README.md Formatting (#546)
# What does this PR do?
The formatting shown in the picture below in the Zero to Hero README.md
was fixed with this PR (also shown in a picture below).

**Before**
<img width="1014" alt="Screenshot 2024-11-28 at 1 47 32 PM"
src="https://github.com/user-attachments/assets/02d2281e-83ae-43eb-a1c7-702bc2365120">

**After**
<img width="1014" alt="Screenshot 2024-11-28 at 1 50 19 PM"
src="https://github.com/user-attachments/assets/03e54f40-c347-4737-8b91-197eee70a52f">
2024-11-29 10:12:53 -06:00
Jeffrey Lind
5fc2ee6f77
Fix URLs to Llama Stack Read the Docs Webpages (#547)
# What does this PR do?

Many of the URLs pointing to the Llama Stack's Read The Docs webpages
were broken, presumably due to recent refactor of the documentation.
This PR fixes all effected URLs throughout the repository.
2024-11-29 10:11:50 -06:00
Sean
9088206eda
fix[documentation]: Update links to point to correct pages (#549)
# What does this PR do?

In short, provide a summary of what this PR does and why. Usually, the
relevant context should be present in a linked issue.

- [x] Addresses issue (#548)


## Test Plan

Please describe:
No automated tests. Clicked on each link to ensure I was directed to the
right page.

## Sources


## Before submitting

- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [x] Updated relevant documentation.
- [ ] ~Wrote necessary unit or integration tests.~
2024-11-29 07:43:56 -06:00
Xi Yan
b1a63df8cd
move playground ui to llama-stack repo (#536)
# What does this PR do?

- Move Llama Stack Playground UI to llama-stack repo under
llama_stack/distribution
- Original PR in llama-stack-apps:
https://github.com/meta-llama/llama-stack-apps/pull/127

## Test Plan
```
cd llama-stack/llama_stack/distribution/ui
streamlit run app.py
```


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-26 22:04:21 -08:00
Matthew Farrellee
060b4eb776
allow env NVIDIA_BASE_URL to set NVIDIAConfig.url (#531)
# What does this PR do?

this allows setting an NVIDIA_BASE_URL variable to control the
NVIDIAConfig.url option


## Test Plan

`pytest -s -v --providers inference=nvidia
llama_stack/providers/tests/inference/ --env
NVIDIA_BASE_URL=http://localhost:8000`


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-26 17:46:44 -08:00
Xi Yan
50cc165077
fixes tests & move braintrust api_keys to request headers (#535)
# What does this PR do?

- braintrust scoring provider requires OPENAI_API_KEY env variable to be
set
- move this to be able to be set as request headers (e.g. like together
/ fireworks api keys)
- fixes pytest with agents dependency

## Test Plan

**E2E**
```
llama stack run 
```
```yaml
scoring:
  - provider_id: braintrust-0
    provider_type: inline::braintrust
    config: {}
```

**Client**
```python
self.client = LlamaStackClient(
    base_url=os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:5000"),
    provider_data={
        "openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
    },
)
```
- run `llama-stack-client eval run_scoring`

**Unit Test**
```
pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
```

```
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py --env OPENAI_API_KEY=$OPENAI_API_KEY
```
<img width="745" alt="image"
src="https://github.com/user-attachments/assets/68f5cdda-f6c8-496d-8b4f-1b3dabeca9c2">

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-26 13:11:21 -08:00
Xi Yan
d3956a1d22 fix description 2024-11-25 22:02:45 -08:00
Xi Yan
2936133f95 precommit 2024-11-25 18:55:54 -08:00
Xi Yan
bbd81231ce add missing __init__ 2024-11-25 17:23:27 -08:00
Dinesh Yeduguru
de7af28756
Tgi fixture (#519)
# What does this PR do?

* Add a test fixture for tgi
* Fixes the logic to correctly pass the llama model for chat completion

Fixes #514

## Test Plan

pytest -k "tgi"
llama_stack/providers/tests/inference/test_text_inference.py --env
TGI_URL=http://localhost:$INFERENCE_PORT --env TGI_API_TOKEN=$HF_TOKEN
2024-11-25 13:17:02 -08:00
Xi Yan
60cb7f64af add missing __init__ 2024-11-25 09:42:46 -08:00
Ashwin Bharambe
34be07e0df Ensure model_local_dir does not mangle "C:\" on Windows 2024-11-24 14:18:59 -08:00
Ashwin Bharambe
9ddda91180 Add Safety section for Configuration 2024-11-23 21:36:41 -08:00
Matthew Farrellee
4e6c984c26
add NVIDIA NIM inference adapter (#355)
# What does this PR do?

this PR adds a basic inference adapter to NVIDIA NIMs

what it does -
 - chat completion api
   - tool calls
   - streaming
   - structured output
   - logprobs
 - support hosted NIM on integrate.api.nvidia.com
 - support downloaded NIM containers

what it does not do -
 - completion api
 - embedding api
 - vision models
 - builtin tools
 - have certainty that sampling strategies are correct

## Feature/Issue validation/testing/test plan

`pytest -s -v --providers inference=nvidia
llama_stack/providers/tests/inference/ --env NVIDIA_API_KEY=...`

all tests should pass. there are pydantic v1 warnings.


## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Was this discussed/approved via a Github issue? Please add a link
      to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests?

Thanks for contributing 🎉!
2024-11-23 15:59:00 -08:00
Ashwin Bharambe
2cfc41e13b Mark some pages as not-in-toctree explicitly 2024-11-23 15:27:44 -08:00
Ashwin Bharambe
358db3c5b6 No need to use os.path.relpath() when Path() knows everything anyway 2024-11-23 11:45:47 -08:00
Ashwin Bharambe
a23960663d Upgrade README a bit 2024-11-23 09:36:54 -08:00
Ashwin Bharambe
45fd73218a Bump version to 0.0.55 2024-11-23 09:03:58 -08:00
Ashwin Bharambe
359effd534 Update DirectClient docs for 0.0.55 2024-11-23 09:02:11 -08:00
Ashwin Bharambe
707da55c23 Fix TGI register_model() issue 2024-11-23 08:47:05 -08:00
Ashwin Bharambe
4b94cd313c Simplify Docs intro even further 2024-11-23 00:14:16 -08:00
Ashwin Bharambe
0758eb9cb1 Try new llama stack image 2024-11-23 00:03:42 -08:00
Ashwin Bharambe
03efc89267 Make a new llama stack image 2024-11-22 23:49:22 -08:00
Ashwin Bharambe
fc8ace50af Add stub for Building Applications 2024-11-22 23:05:17 -08:00
Ashwin Bharambe
c7bfac5382 Add a section for run.yamls 2024-11-22 22:58:39 -08:00
Ashwin Bharambe
1e6006c599 More simplification of the "Starting a Llama Stack" doc 2024-11-22 22:38:53 -08:00
Martin Hickey
76fc5d9f31
Update Ollama supported llama model list (#483)
# What does this PR do?

Update the llama model supported list for Ollama.

- [x] Addresses issue (#462)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
2024-11-22 21:56:43 -08:00
Xi Yan
039e303707 docs fix 2024-11-22 21:15:21 -08:00
Xi Yan
988f424c9c
[docs] evals (#511)
# What does this PR do?

- add evals docs

## Test Plan



https://github.com/user-attachments/assets/7a1bcfcc-2c37-4cd2-9a72-bf43c2321022



## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-22 21:09:39 -08:00
Ashwin Bharambe
0481fa9540 Fix broken links with docs 2024-11-22 20:42:17 -08:00
Ashwin Bharambe
36938b716c broken reference link 2024-11-22 18:32:32 -08:00
Ashwin Bharambe
00c59b7e39 Add Python SDK reference 2024-11-22 18:27:30 -08:00
Dinesh Yeduguru
501e7c9d64
Fix opentelemetry adapter (#510)
# What does this PR do?

This PR fixes some of the issues with our telemetry setup to enable logs
to be delivered to opentelemetry and jaeger. Main fixes
1) Updates the open telemetry provider to use the latest oltp exports
instead of deprected ones.
2) Adds a tracing middleware, which injects traces into each HTTP
request that the server recieves and this is going to be the root trace.
Previously, we did this in the create_dynamic_route method, which is
actually not the actual exectuion flow, but more of a config and this
causes the traces to end prematurely. Through middleware, we plugin the
trace start and end at the right location.
3) We manage our own methods to create traces and spans and this does
not fit well with Opentelemetry SDK since it does not support provide a
way to take in traces and spans that are already created. it expects us
to use the SDK to create them. For now, I have a hacky approach of just
maintaining a map from our internal telemetry objects to the open
telemetry specfic ones. This is not the ideal solution. I will explore
other ways to get around this issue. for now, to have something that
works, i am going to keep this as is.

Addresses: #509
2024-11-22 18:18:11 -08:00
dltn
beab798a1d Add initial direct client docs 2024-11-22 18:04:48 -08:00
Ashwin Bharambe
31e983ab68 Simplify feature request ISSUE template 2024-11-22 18:02:39 -08:00
Xi Yan
d97cfaa9d9
[docs] add openapi spec to docs (#508)
# What does this PR do?
- modify openapi generator to add coming soon tag for unimplemented api
- sphinx-redocs extension for openapi spec to readthedocs page

## Test Plan



https://github.com/user-attachments/assets/b4c7eebc-2361-4198-a987-dbfbcff914cf






## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-22 17:54:32 -08:00
Ashwin Bharambe
1b2b32f959 Minor updates to docs 2024-11-22 17:44:05 -08:00
Ashwin Bharambe
6229562760 Organize references 2024-11-22 16:46:45 -08:00
Ashwin Bharambe
6fbf526d5c Move gitignore from docs/ to the main gitignore 2024-11-22 15:55:34 -08:00
Ashwin Bharambe
526a8dcfe0 Minor edit to zero_to_hero_guide 2024-11-22 15:52:56 -08:00