forked from phoenix-oss/llama-stack-mirror
Cerebras Inference Integration (#265)
Adding Cerebras Inference as an API provider. ## Testing ### Conda ``` $ llama stack build --template cerebras --image-type conda $ llama stack run ~/.llama/distributions/llamastack-cerebras/cerebras-run.yaml ... Listening on ['::', '0.0.0.0']:5000 INFO: Started server process [12443] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit) ``` ### Chat Completion ``` $ curl --location 'http://localhost:5000/alpha/inference/chat-completion' --header 'Content-Type: application/json' --data '{ "model_id": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ { "role": "user", "content": "What is the temperature in Seattle right now?" } ], "stream": false, "sampling_params": { "strategy": "top_p", "temperature": 0.5, "max_tokens": 100 }, "tool_choice": "auto", "tool_prompt_format": "json", "tools": [ { "tool_name": "getTemperature", "description": "Gets the current temperature of a location.", "parameters": { "location": { "param_type": "string", "description": "The name of the place to get the temperature from in degress celsius.", "required": true } } } ] }' ``` #### Non-Streaming Response ``` { "completion_message": { "role": "assistant", "content": "", "stop_reason": "end_of_message", "tool_calls": [ { "call_id": "6f42fdcc-6cbb-46ad-a17b-5d20ac64b678", "tool_name": "getTemperature", "arguments": { "location": "Seattle" } } ] }, "logprobs": null } ``` #### Streaming Response ``` data: {"event":{"event_type":"start","delta":"","logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"","parse_status":"started"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"{\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"type","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"function","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"name","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"get","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"Temperature","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"parameters","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" {\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"location","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"Seattle","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\"}}","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":{"call_id":"e742df1f-0ae9-40ad-a49e-18e5c905484f","tool_name":"getTemperature","arguments":{"location":"Seattle"}},"parse_status":"success"},"logprobs":null,"stop_reason":"end_of_message"}} data: {"event":{"event_type":"complete","delta":"","logprobs":null,"stop_reason":"end_of_message"}} ``` ### Completion ``` $ curl --location 'http://localhost:5000/alpha/inference/completion' --header 'Content-Type: application/json' --data '{ "model_id": "meta-llama/Llama-3.1-8B-Instruct", "content": "1,2,3,", "stream": true, "sampling_params": { "strategy": "top_p", "temperature": 0.5, "max_tokens": 10 }, "tool_choice": "auto", "tool_prompt_format": "json", "tools": [ { "tool_name": "getTemperature", "description": "Gets the current temperature of a location.", "parameters": { "location": { "param_type": "string", "description": "The name of the place to get the temperature from in degress celsius.", "required": true } } } ] }' ``` #### Non-Streaming Response ``` { "content": "4,5,6,7,8,", "stop_reason": "out_of_tokens", "logprobs": null } ``` #### Streaming Response ``` data: {"delta":"4","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"5","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"6","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"7","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"8","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"","stop_reason":null,"logprobs":null} data: {"delta":"","stop_reason":"out_of_tokens","logprobs":null} ``` ### Pre-Commit Checks ``` trim trailing whitespace.................................................Passed check python ast.........................................................Passed check for merge conflicts................................................Passed check for added large files..............................................Passed fix end of files.........................................................Passed Insert license in comments...............................................Passed flake8...................................................................Passed Format files with µfmt...................................................Passed ``` ### Testing with `test_inference.py` ``` $ export CEREBRAS_API_KEY=<insert API key here> $ pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -m "cerebras and llama_8b" /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) =================================================== test session starts =================================================== platform linux -- Python 3.12.3, pytest-8.3.3, pluggy-1.5.0 -- /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/bin/python3.12 cachedir: .pytest_cache rootdir: /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack configfile: pyproject.toml plugins: anyio-4.6.2.post1, asyncio-0.24.0 asyncio: mode=Mode.STRICT, default_loop_scope=None collected 128 items / 120 deselected / 8 selected llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-cerebras] Resolved 4 providers inner-inference => cerebras models => __routing_table__ inference => __autorouted__ inspect => __builtin__ Models: meta-llama/Llama-3.1-8B-Instruct served by cerebras PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_8b-cerebras] SKIPPED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-cerebras] SKIPPED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-cerebras] PASSED ================================ 6 passed, 2 skipped, 120 deselected, 6 warnings in 3.95s ================================= ``` I ran `python llama_stack/scripts/distro_codegen.py` to run codegen.
This commit is contained in:
parent
b6500974ec
commit
64c6df8392
19 changed files with 1018 additions and 292 deletions
|
@ -66,121 +66,247 @@ llama stack build --list-templates
|
|||
```
|
||||
|
||||
```
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| Template Name | Providers | Description |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| hf-serverless | { | Like local, but use Hugging Face Inference API (serverless) for running LLM |
|
||||
| | "inference": "remote::hf::serverless", | inference. |
|
||||
| | "memory": "meta-reference", | See https://hf.co/docs/api-inference. |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| together | { | Use Together.ai for running LLM inference |
|
||||
| | "inference": "remote::together", | |
|
||||
| | "memory": [ | |
|
||||
| | "meta-reference", | |
|
||||
| | "remote::weaviate" | |
|
||||
| | ], | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| fireworks | { | Use Fireworks.ai for running LLM inference |
|
||||
| | "inference": "remote::fireworks", | |
|
||||
| | "memory": [ | |
|
||||
| | "meta-reference", | |
|
||||
| | "remote::weaviate", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| databricks | { | Use Databricks for running LLM inference |
|
||||
| | "inference": "remote::databricks", | |
|
||||
| | "memory": "meta-reference", | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| vllm | { | Like local, but use vLLM for running LLM inference |
|
||||
| | "inference": "vllm", | |
|
||||
| | "memory": "meta-reference", | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| tgi | { | Use TGI for running LLM inference |
|
||||
| | "inference": "remote::tgi", | |
|
||||
| | "memory": [ | |
|
||||
| | "meta-reference", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| bedrock | { | Use Amazon Bedrock APIs. |
|
||||
| | "inference": "remote::bedrock", | |
|
||||
| | "memory": "meta-reference", | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| meta-reference-gpu | { | Use code from `llama_stack` itself to serve all llama stack APIs |
|
||||
| | "inference": "meta-reference", | |
|
||||
| | "memory": [ | |
|
||||
| | "meta-reference", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| meta-reference-quantized-gpu | { | Use code from `llama_stack` itself to serve all llama stack APIs |
|
||||
| | "inference": "meta-reference-quantized", | |
|
||||
| | "memory": [ | |
|
||||
| | "meta-reference", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| ollama | { | Use ollama for running LLM inference |
|
||||
| | "inference": "remote::ollama", | |
|
||||
| | "memory": [ | |
|
||||
| | "meta-reference", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
| hf-endpoint | { | Like local, but use Hugging Face Inference Endpoints for running LLM inference. |
|
||||
| | "inference": "remote::hf::endpoint", | See https://hf.co/docs/api-endpoints. |
|
||||
| | "memory": "meta-reference", | |
|
||||
| | "safety": "meta-reference", | |
|
||||
| | "agents": "meta-reference", | |
|
||||
| | "telemetry": "meta-reference" | |
|
||||
| | } | |
|
||||
+------------------------------+--------------------------------------------+----------------------------------------------------------------------------------+
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| Template Name | Providers | Description |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| tgi | { | Use (an external) TGI server for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::tgi" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| remote-vllm | { | Use (an external) vLLM server for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::vllm" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| vllm-gpu | { | Use a built-in vLLM engine for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "inline::vllm" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| meta-reference-quantized-gpu | { | Use Meta Reference with fp8, int4 quantization for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "inline::meta-reference-quantized" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| meta-reference-gpu | { | Use Meta Reference for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| hf-serverless | { | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::hf::serverless" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| together | { | Use Together.AI for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::together" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| ollama | { | Use (an external) Ollama server for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::ollama" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| bedrock | { | Use AWS Bedrock for running LLM inference and safety |
|
||||
| | "inference": [ | |
|
||||
| | "remote::bedrock" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "remote::bedrock" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| hf-endpoint | { | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::hf::endpoint" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| fireworks | { | Use Fireworks.AI for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::fireworks" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| cerebras | { | Use Cerebras for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::cerebras" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
You may then pick a template to build your distribution with providers fitted to your liking.
|
||||
|
|
61
docs/source/distributions/self_hosted_distro/cerebras.md
Normal file
61
docs/source/distributions/self_hosted_distro/cerebras.md
Normal file
|
@ -0,0 +1,61 @@
|
|||
# Cerebras Distribution
|
||||
|
||||
The `llamastack/distribution-cerebras` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| inference | `remote::cerebras` |
|
||||
| memory | `inline::meta-reference` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
|
||||
- `CEREBRAS_API_KEY`: Cerebras API Key (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `meta-llama/Llama-3.1-8B-Instruct (llama3.1-8b)`
|
||||
- `meta-llama/Llama-3.1-70B-Instruct (llama3.1-70b)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a Cerebras API Key. You can get one by visiting [cloud.cerebras.ai](https://cloud.cerebras.ai/).
|
||||
|
||||
|
||||
## Running Llama Stack with Cerebras
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=5001
|
||||
docker run \
|
||||
-it \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ./run.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-cerebras \
|
||||
--yaml-config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template cerebras --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port 5001 \
|
||||
--env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
|
||||
```
|
|
@ -45,6 +45,7 @@ Llama Stack already has a number of "adapters" available for some popular Infere
|
|||
| **API Provider** | **Environments** | **Agents** | **Inference** | **Memory** | **Safety** | **Telemetry** |
|
||||
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
|
||||
| Meta Reference | Single Node | Y | Y | Y | Y | Y |
|
||||
| Cerebras | Single Node | | Y | | | |
|
||||
| Fireworks | Hosted | Y | Y | Y | | |
|
||||
| AWS Bedrock | Hosted | | Y | | Y | |
|
||||
| Together | Hosted | Y | Y | | Y | |
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue