Ashwin Bharambe
216e7eb4d5
Move async with SEMAPHORE
inside the async methods
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
4540d8bd87
move codeshield into an independent safety provider
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
380b9dab90
regen openapi specs
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
7f1160296c
Updates to server.py to clean up streaming vs non-streaming stuff
...
Also make sure agent turn create is correctly marked
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
640c5c54f7
rename augment_messages
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
336cf7a674
update vllm; not quite tested yet
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
ed899a5dec
Convert TGI to work with openai_compat
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
05e73d12b3
introduce openai_compat with the completions (not chat-completions) API
...
This keeps the prompt encoding layer in our control (see
`chat_completion_request_to_prompt()` method)
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
0c9eb3341c
Separate chat_completion stream and non-stream implementations
...
This is a pretty important requirement. The streaming response type is
an AsyncGenerator while the non-stream one is a single object. So far
this has worked _sometimes_ due to various pre-existing hacks (and in
some cases, just failed.)
2024-10-08 17:23:40 -07:00
Ashwin Bharambe
f8752ab8dc
weaviate fixes, test now passes
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
f21ad1173e
improve memory test, but it fails on chromadb :/
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4ab6e1b81a
Add really basic testing for memory API
...
weaviate does not work; the cluster URL seems malformed
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
dba7caf1d0
Fix fireworks and update the test
...
Don't look for eom_id / eot_id sadly since providers don't return the
last token
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
bbd3a02615
Make Together inference work using the raw completions API
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
3ae2b712e8
Add inference test
...
Run it as:
```
PROVIDER_ID=test-remote \
PROVIDER_CONFIG=$PWD/llama_stack/providers/tests/inference/provider_config_example.yaml \
pytest -s llama_stack/providers/tests/inference/test_inference.py \
--tb=auto \
--disable-warnings
```
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4fa467731e
Fix a bug in meta-reference inference when stream=False
...
Also introduce a gross hack (to cover grosser(?) hack) to ensure
non-stream requests don't send back responses in SSE format. Not sure
which of these hacks is grosser.
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
353c7dc82a
A few bug fixes for covering corner cases
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
a05599c67a
Weaviate "should" work (i.e., is code-complete) but not tested
2024-10-08 17:23:02 -07:00
Zain Hasan
118c0ef105
Partial cleanup of weaviate
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
862f8ddb8d
more memory related fixes; memory.client now works
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
3725e74906
memory bank registration fixes
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
099a95b614
slight upgrade to CLI
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
1550187cd8
cleanup
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
91e0063593
Introduce model_store, shield_store, memory_bank_store
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
e45a417543
more fixes, plug shutdown handlers
...
still, FastAPIs sigint handler is not calling ours
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
60dead6196
apis_to_serve -> apis
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
59302a86df
inference registry updates
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4215cc9331
Push registration methods onto the backing providers
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
5a7b01d292
Significantly upgrade the interactive configuration experience
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
8d157a8197
rename
2024-10-08 17:23:02 -07:00
Ashwin Bharambe
f3923e3f0b
Redo the { models, shields, memory_banks } typeset
2024-10-08 17:23:02 -07:00
Xi Yan
6b094b72d3
Update cli_reference.md
2024-10-08 15:32:06 -07:00
Xi Yan
ce70d21f65
Add files via upload
2024-10-08 15:29:19 -07:00
Dalton Flanagan
2d4f7d8acf
Create SECURITY.md
2024-10-08 13:30:40 -04:00
Yuan Tang
48d0d2001e
Add classifiers in setup.py ( #217 )
...
* Add classifiers in setup.py
* Update setup.py
* Update setup.py
2024-10-08 06:55:16 -07:00
Xi Yan
4d5f7459aa
[bugfix] Fix logprobs on meta-reference impl ( #213 )
...
* fix log probs
* add back LogProbsConfig
* error handling
* bugfix
2024-10-07 19:42:39 -07:00
Yuan Tang
e4ae09d090
Add .idea to .gitignore ( #216 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-10-07 19:38:43 -07:00
Xi Yan
16ba0fa06f
Update README.md
2024-10-07 11:24:27 -07:00
Russell Bryant
996efa9b42
README.md: Add vLLM to providers table ( #207 )
...
Signed-off-by: Russell Bryant <russell.bryant@gmail.com>
2024-10-07 10:26:52 -07:00
Xi Yan
2366e18873
refactor docs ( #209 )
2024-10-07 10:21:26 -07:00
Mindaugas
53d440e952
Fix ValueError in case chunks are empty ( #206 )
2024-10-07 08:55:06 -07:00
Russell Bryant
a4e775c465
download: improve help text ( #204 )
2024-10-07 08:40:04 -07:00
Ashwin Bharambe
4263764493
Fix adapter_id -> adapter_type for Weaviate
2024-10-07 06:46:32 -07:00
Zain Hasan
f4f7618120
add Weaviate memory adapter ( #95 )
2024-10-06 22:21:50 -07:00
Xi Yan
27587f32bc
fix db path
2024-10-06 11:46:08 -07:00
Xi Yan
cfe3ad33b3
fix db path
2024-10-06 11:45:35 -07:00
Prithu Dasgupta
7abab7604b
add databricks provider ( #83 )
...
* add databricks provider
* update provider and test
2024-10-05 23:35:54 -07:00
Russell Bryant
f73e247ba1
Inline vLLM inference provider ( #181 )
...
This is just like `local` using `meta-reference` for everything except
it uses `vllm` for inference.
Docker works, but So far, `conda` is a bit easier to use with the vllm
provider. The default container base image does not include all the
necessary libraries for all vllm features. More cuda dependencies are
necessary.
I started changing this base image used in this template, but it also
required changes to the Dockerfile, so it was getting too involved to
include in the first PR.
Working so far:
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True`
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False`
Example:
```
$ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False
User>hello world, write me a 2 sentence poem about the moon
Assistant>
The moon glows bright in the midnight sky
A beacon of light,
```
I have only tested these models:
* `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4)
* `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)
2024-10-05 23:34:16 -07:00
Xi Yan
29138a5167
Update getting_started.md
2024-10-05 12:28:02 -07:00
Xi Yan
6d4013ac99
Update getting_started.md
2024-10-05 12:14:59 -07:00