Ashwin Bharambe
ed899a5dec
Convert TGI to work with openai_compat
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
05e73d12b3
introduce openai_compat with the completions (not chat-completions) API
...
This keeps the prompt encoding layer in our control (see the
`chat_completion_request_to_prompt()` method)
2024-10-08 17:23:42 -07:00
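The commit above describes keeping prompt encoding on our side and calling a provider's raw completions endpoint. A minimal sketch of that pattern (names and token layout here are illustrative, based on the Llama 3 chat format; the real encoder is `chat_completion_request_to_prompt()` in llama_stack):

```python
# Sketch: render chat messages into a raw prompt string ourselves, so the
# provider's plain /completions endpoint can be used instead of its
# /chat/completions endpoint (which would encode the prompt its own way).
from dataclasses import dataclass


@dataclass
class Message:
    role: str     # "system", "user", or "assistant"
    content: str


def chat_messages_to_prompt(messages: list[Message]) -> str:
    # Llama 3 style special tokens; illustrative only.
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m.role}<|end_header_id|>\n\n{m.content}<|eot_id|>"
        )
    # Cue the model to generate the assistant turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = chat_messages_to_prompt([Message("user", "hello world")])
```

The resulting string is sent as the `prompt` field of an ordinary completions request, keeping full control over special tokens and turn boundaries.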
Ashwin Bharambe
0c9eb3341c
Separate chat_completion stream and non-stream implementations
...
This is a pretty important requirement. The streaming response type is
an AsyncGenerator while the non-stream one is a single object. So far
this has worked _sometimes_ due to various pre-existing hacks (and in
some cases, just failed).
2024-10-08 17:23:40 -07:00
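The type mismatch the commit above describes can be shown in a few lines (function names hypothetical, not the repo's actual signatures): an `async def` containing `yield` returns an AsyncGenerator immediately and must be iterated, while a plain coroutine returns a single awaitable value, so one code path cannot serve both cleanly.

```python
# Why stream and non-stream need separate implementations: the two return
# types are consumed in fundamentally different ways.
import asyncio
from typing import AsyncGenerator


async def chat_completion(prompt: str) -> str:
    # Non-stream: one awaitable producing a single object.
    return f"response to: {prompt}"


async def chat_completion_stream(prompt: str) -> AsyncGenerator[str, None]:
    # Stream: calling this returns an AsyncGenerator right away; the body
    # only runs as the caller iterates it.
    for token in ["res", "ponse"]:
        yield token


async def main() -> None:
    full = await chat_completion("hi")                         # await once
    chunks = [c async for c in chat_completion_stream("hi")]   # must iterate
    print(full, chunks)


asyncio.run(main())
```

Awaiting the streaming variant (or iterating the non-stream one) raises a TypeError, which is why a single implementation that "sometimes worked" depended on hacks.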
Ashwin Bharambe
f8752ab8dc
weaviate fixes, test now passes
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
f21ad1173e
improve memory test, but it fails on chromadb :/
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4ab6e1b81a
Add really basic testing for memory API
...
weaviate does not work; the cluster URL seems malformed
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
dba7caf1d0
Fix fireworks and update the test
...
Sadly, don't look for eom_id / eot_id since providers don't return the
last token
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
bbd3a02615
Make Together inference work using the raw completions API
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
3ae2b712e8
Add inference test
...
Run it as:
```
PROVIDER_ID=test-remote \
PROVIDER_CONFIG=$PWD/llama_stack/providers/tests/inference/provider_config_example.yaml \
pytest -s llama_stack/providers/tests/inference/test_inference.py \
--tb=auto \
--disable-warnings
```
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4fa467731e
Fix a bug in meta-reference inference when stream=False
...
Also introduce a gross hack (to cover a grosser(?) hack) to ensure
non-stream requests don't send back responses in SSE format. Not sure
which of these hacks is grosser.
2024-10-08 17:23:02 -07:00
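The bug class the commit above fixes can be illustrated with a hypothetical sketch (not the repo's actual code): a server that frames every response as server-sent events, even when the client asked for a single JSON object with `stream=False`. The fix is simply to branch on the flag.

```python
# Sketch: non-stream responses must be a plain JSON body, while streaming
# responses frame each chunk as an SSE "data:" event.
import json


def format_response(payload: dict, stream: bool) -> str:
    if stream:
        # SSE frames each chunk as a "data: ..." line followed by a blank line.
        return f"data: {json.dumps(payload)}\n\n"
    # Non-stream requests get one plain JSON body.
    return json.dumps(payload)
```

A client parsing the non-stream body with `json.loads()` breaks on the `data:` prefix, which is how this kind of mismatch typically surfaces.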
Ashwin Bharambe
353c7dc82a
A few bug fixes for covering corner cases
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
a05599c67a
Weaviate "should" work (i.e., is code-complete) but not tested
2024-10-08 17:23:02 -07:00
Zain Hasan
118c0ef105
Partial cleanup of weaviate
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
862f8ddb8d
more memory-related fixes; memory.client now works
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
3725e74906
memory bank registration fixes
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
099a95b614
slight upgrade to CLI
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
1550187cd8
cleanup
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
91e0063593
Introduce model_store, shield_store, memory_bank_store
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
e45a417543
more fixes, plug shutdown handlers
...
still, FastAPI's SIGINT handler is not calling ours
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
60dead6196
apis_to_serve -> apis
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
59302a86df
inference registry updates
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4215cc9331
Push registration methods onto the backing providers
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
5a7b01d292
Significantly upgrade the interactive configuration experience
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
8d157a8197
rename
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
f3923e3f0b
Redo the { models, shields, memory_banks } typeset
2024-10-08 17:23:02 -07:00
Xi Yan
4d5f7459aa
[bugfix] Fix logprobs on meta-reference impl (#213)
...
* fix log probs
* add back LogProbsConfig
* error handling
* bugfix
2024-10-07 19:42:39 -07:00
Mindaugas
53d440e952
Fix ValueError in case chunks are empty (#206)
2024-10-07 08:55:06 -07:00
Russell Bryant
a4e775c465
download: improve help text (#204)
2024-10-07 08:40:04 -07:00
Ashwin Bharambe
4263764493
Fix adapter_id -> adapter_type for Weaviate
2024-10-07 06:46:32 -07:00
Zain Hasan
f4f7618120
add Weaviate memory adapter (#95)
2024-10-06 22:21:50 -07:00
Xi Yan
27587f32bc
fix db path
2024-10-06 11:46:08 -07:00
Xi Yan
cfe3ad33b3
fix db path
2024-10-06 11:45:35 -07:00
Prithu Dasgupta
7abab7604b
add databricks provider (#83)
...
* add databricks provider
* update provider and test
2024-10-05 23:35:54 -07:00
Russell Bryant
f73e247ba1
Inline vLLM inference provider (#181)
...
This is just like `local` using `meta-reference` for everything except
it uses `vllm` for inference.
Docker works, but so far `conda` is a bit easier to use with the vllm
provider. The default container base image does not include all the
necessary libraries for all vllm features; more CUDA dependencies are
needed.
I started changing this base image used in this template, but it also
required changes to the Dockerfile, so it was getting too involved to
include in the first PR.
Working so far:
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True`
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False`
Example:
```
$ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False
User>hello world, write me a 2 sentence poem about the moon
Assistant>
The moon glows bright in the midnight sky
A beacon of light,
```
I have only tested these models:
* `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4)
* `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)
2024-10-05 23:34:16 -07:00
Mindaugas
9d16129603
Add 'url' property to Redis KV config (#192)
2024-10-05 11:26:26 -07:00
Dalton Flanagan
441052b0fd
avoid jq since non-standard on macOS
2024-10-04 10:11:43 -04:00
AshleyT3
734f59d3b8
Check that the model is found before use. (#182)
2024-10-03 23:24:47 -07:00
Ashwin Bharambe
f913b57397
fix fp8 imports
2024-10-03 14:40:21 -07:00
Ashwin Bharambe
7f49315822
Kill a derpy import
2024-10-03 11:25:58 -07:00
Xi Yan
62d266f018
[CLI] avoid configure twice (#171)
...
* avoid configure twice
* cleanup tmp config
* update output msg
* address comment
* update msg
* script update
2024-10-03 11:20:54 -07:00
Russell Bryant
06db9213b1
inference: Add model option to client (#170)
...
I was running this client for testing purposes and being able to
specify which model to use is a convenient addition. This change makes
that possible.
2024-10-03 11:18:57 -07:00
Ashwin Bharambe
210b71b0ba
fix prompt guard (#177)
...
Several other fixes to configure. Add support for 1b/3b models in ollama.
2024-10-03 11:07:53 -07:00
Xi Yan
b9b1e8b08b
[bugfix] conda path lookup (#179)
...
* fix conda lookup
* comments
2024-10-03 10:45:16 -07:00
Ashwin Bharambe
e9f6150588
A bit cleanup to avoid breakages
2024-10-02 21:31:09 -07:00
Ashwin Bharambe
988a9cada3
Don't ask for Api.inspect in stack build
2024-10-02 21:10:56 -07:00
Ashwin Bharambe
19ce6bf009
Don't validate prompt-guard anymore
2024-10-02 20:43:57 -07:00
Xi Yan
703ab9385f
fix routing table key list
2024-10-02 18:23:31 -07:00
Ashwin Bharambe
8d049000e3
Add an introspection "Api.inspect" API
2024-10-02 15:41:14 -07:00
Adrian Cole
01d93be948
Adds markdown-link-check and fixes a broken link (#165)
...
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2024-10-02 14:26:20 -07:00
Ashwin Bharambe
fe4aabd690
provider_id => provider_type, adapter_id => adapter_type
2024-10-02 14:05:59 -07:00