llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-07-29 15:23:51 +00:00)

Author	SHA1	Message	Date
Ashwin Bharambe	216e7eb4d5	Move `async with SEMAPHORE` inside the async methods	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	4540d8bd87	move codeshield into an independent safety provider	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	380b9dab90	regen openapi specs	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	7f1160296c	Updates to server.py to clean up streaming vs non-streaming stuff. Also make sure agent turn create is correctly marked	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	640c5c54f7	rename augment_messages	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	336cf7a674	update vllm; not quite tested yet	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	ed899a5dec	Convert TGI to work with openai_compat	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	05e73d12b3	introduce openai_compat with the completions (not chat-completions) API. This keeps the prompt encoding layer in our control (see `chat_completion_request_to_prompt()` method)	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	0c9eb3341c	Separate chat_completion stream and non-stream implementations. This is a pretty important requirement. The streaming response type is an AsyncGenerator while the non-stream one is a single object. So far this has worked _sometimes_ due to various pre-existing hacks (and in some cases, just failed.)	2024-10-08 17:23:40 -07:00
Ashwin Bharambe	f8752ab8dc	weaviate fixes, test now passes	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	f21ad1173e	improve memory test, but it fails on chromadb :/	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	4ab6e1b81a	Add really basic testing for memory API. Weaviate does not work; the cluster URL seems malformed	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	dba7caf1d0	Fix fireworks and update the test. Don't look for eom_id / eot_id sadly since providers don't return the last token	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	bbd3a02615	Make Together inference work using the raw completions API	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	3ae2b712e8	Add inference test Run it as: ``` PROVIDER_ID=test-remote \ PROVIDER_CONFIG=$PWD/llama_stack/providers/tests/inference/provider_config_example.yaml \ pytest -s llama_stack/providers/tests/inference/test_inference.py \ --tb=auto \ --disable-warnings ```	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	4fa467731e	Fix a bug in meta-reference inference when stream=False. Also introduce a gross hack (to cover grosser(?) hack) to ensure non-stream requests don't send back responses in SSE format. Not sure which of these hacks is grosser.	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	353c7dc82a	A few bug fixes for covering corner cases	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	a05599c67a	Weaviate "should" work (i.e., is code-complete) but not tested	2024-10-08 17:23:02 -07:00
Zain Hasan	118c0ef105	Partial cleanup of weaviate	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	862f8ddb8d	more memory related fixes; memory.client now works	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	3725e74906	memory bank registration fixes	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	099a95b614	slight upgrade to CLI	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	1550187cd8	cleanup	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	91e0063593	Introduce model_store, shield_store, memory_bank_store	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	e45a417543	more fixes, plug shutdown handlers still, FastAPIs sigint handler is not calling ours	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	60dead6196	apis_to_serve -> apis	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	59302a86df	inference registry updates	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	4215cc9331	Push registration methods onto the backing providers	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	5a7b01d292	Significantly upgrade the interactive configuration experience	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	8d157a8197	rename	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	f3923e3f0b	Redo the { models, shields, memory_banks } typeset	2024-10-08 17:23:02 -07:00
Xi Yan	6b094b72d3	Update cli_reference.md	2024-10-08 15:32:06 -07:00
Xi Yan	ce70d21f65	Add files via upload	2024-10-08 15:29:19 -07:00
Dalton Flanagan	2d4f7d8acf	Create SECURITY.md	2024-10-08 13:30:40 -04:00
Yuan Tang	48d0d2001e	Add classifiers in setup.py (#217) * Add classifiers in setup.py * Update setup.py * Update setup.py	2024-10-08 06:55:16 -07:00
Xi Yan	4d5f7459aa	[bugfix] Fix logprobs on meta-reference impl (#213) * fix log probs * add back LogProbsConfig * error handling * bugfix	2024-10-07 19:42:39 -07:00
Yuan Tang	e4ae09d090	Add .idea to .gitignore (#216) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2024-10-07 19:38:43 -07:00
Xi Yan	16ba0fa06f	Update README.md	2024-10-07 11:24:27 -07:00
Russell Bryant	996efa9b42	README.md: Add vLLM to providers table (#207) Signed-off-by: Russell Bryant <russell.bryant@gmail.com>	2024-10-07 10:26:52 -07:00
Xi Yan	2366e18873	refactor docs (#209)	2024-10-07 10:21:26 -07:00
Mindaugas	53d440e952	Fix ValueError in case chunks are empty (#206)	2024-10-07 08:55:06 -07:00
Russell Bryant	a4e775c465	download: improve help text (#204)	2024-10-07 08:40:04 -07:00
Ashwin Bharambe	4263764493	Fix adapter_id -> adapter_type for Weaviate	2024-10-07 06:46:32 -07:00
Zain Hasan	f4f7618120	add Weaviate memory adapter (#95)	2024-10-06 22:21:50 -07:00
Xi Yan	27587f32bc	fix db path	2024-10-06 11:46:08 -07:00
Xi Yan	cfe3ad33b3	fix db path	2024-10-06 11:45:35 -07:00
Prithu Dasgupta	7abab7604b	add databricks provider (#83) * add databricks provider * update provider and test	2024-10-05 23:35:54 -07:00
Russell Bryant	f73e247ba1	Inline vLLM inference provider (#181) This is just like `local` using `meta-reference` for everything except it uses `vllm` for inference. Docker works, but So far, `conda` is a bit easier to use with the vllm provider. The default container base image does not include all the necessary libraries for all vllm features. More cuda dependencies are necessary. I started changing this base image used in this template, but it also required changes to the Dockerfile, so it was getting too involved to include in the first PR. Working so far: * `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True` * `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False` Example: ``` $ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False User>hello world, write me a 2 sentence poem about the moon Assistant> The moon glows bright in the midnight sky A beacon of light, ``` I have only tested these models: * `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4) * `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)	2024-10-05 23:34:16 -07:00
Xi Yan	29138a5167	Update getting_started.md	2024-10-05 12:28:02 -07:00
Xi Yan	6d4013ac99	Update getting_started.md	2024-10-05 12:14:59 -07:00

311 commits