forked from phoenix/litellm-mirror
docs add arch diagram
This commit is contained in:
parent 54db564529
commit c2c63e4dbe
1 changed file with 17 additions and 11 deletions
@@ -13,19 +13,25 @@ import TabItem from '@theme/TabItem';
1. **User Sends Request**: The process begins when a user sends a request to the LiteLLM Proxy Server (Gateway).

-2. [**Virtual Keys**](../virtual_keys): The request first passes through the Virtual Keys component
+2. [**Virtual Keys**](../virtual_keys): At this stage the `Bearer` token in the request is checked to ensure it is valid and under its budget.

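For illustration, a minimal sketch of step 2 from the client's side, assuming the proxy runs locally on port 4000 and `sk-1234` is a virtual key previously issued by the gateway (both values are placeholders):

```python
# Call the LiteLLM Proxy with a virtual key as the Bearer token.
# The proxy validates the key and checks its budget before routing the request.
from openai import OpenAI

client = OpenAI(
    api_key="sk-1234",               # hypothetical virtual key issued by the gateway
    base_url="http://0.0.0.0:4000",  # assumed local proxy address
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response.choices[0].message.content)
```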
-3. **Rate Limiting**: The MaxParallelRequestsHandler applies rate limiting to manage the flow of requests.
+3. **Rate Limiting**: The [MaxParallelRequestsHandler](https://github.com/BerriAI/litellm/blob/main/litellm/proxy/hooks/parallel_request_limiter.py) checks the **rate limit (rpm/tpm)** for the following components:
+   - Global Server Rate Limit
+   - Virtual Key Rate Limit
+   - User Rate Limit
+   - Team Limit

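To sketch where the per-key limits in step 3 come from, a virtual key can be issued with its own rpm/tpm limits and budget through the proxy's key-management endpoint; the address, master key, and exact field names below are assumptions:

```python
# Issue a virtual key with per-key rate limits (rpm/tpm) and a budget.
# The MaxParallelRequestsHandler enforces these limits on each request.
import requests

resp = requests.post(
    "http://0.0.0.0:4000/key/generate",           # assumed local proxy address
    headers={"Authorization": "Bearer sk-1234"},  # hypothetical master key
    json={
        "rpm_limit": 100,       # requests per minute allowed for this key
        "tpm_limit": 100_000,   # tokens per minute allowed for this key
        "max_budget": 10.0,     # USD budget checked in the Virtual Keys step
    },
)
print(resp.json()["key"])  # the generated virtual key
```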
-4. **Proxy Server Processing**: The request is then processed by the LiteLLM proxy_server.py, which handles the core logic of the proxy.
+4. **LiteLLM `proxy_server.py`**: Contains the `/chat/completions` and `/embeddings` endpoints. Requests to these endpoints are sent through the LiteLLM Router.

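As an illustration of step 4, both endpoints accept OpenAI-format payloads over plain HTTP; the address and key are placeholders:

```python
# The proxy exposes OpenAI-compatible /chat/completions and /embeddings routes;
# both hand the validated request off to the LiteLLM Router.
import requests

BASE = "http://0.0.0.0:4000"                   # assumed local proxy address
HEADERS = {"Authorization": "Bearer sk-1234"}  # hypothetical virtual key

chat = requests.post(
    f"{BASE}/chat/completions",
    headers=HEADERS,
    json={
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "ping"}],
    },
)

emb = requests.post(
    f"{BASE}/embeddings",
    headers=HEADERS,
    json={"model": "text-embedding-ada-002", "input": ["hello world"]},
)

print(chat.json()["choices"][0]["message"]["content"])
print(len(emb.json()["data"]))
```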
-5. [**LiteLLM Router**](../routing): The LiteLLM Router determines where to send the request based on the configuration and request parameters.
+5. [**LiteLLM Router**](../routing): The LiteLLM Router handles load balancing, fallbacks, and retries for LLM API deployments.

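A minimal sketch of the load balancing, fallbacks, and retries described in step 5, using the Router from the litellm SDK directly; the deployment names, endpoints, and credentials are placeholders:

```python
# Two deployments share the model_name "gpt-3.5-turbo"; the Router load-balances
# between them, retries transient failures, and falls back to gpt-4 if both fail.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "azure/chatgpt-v-2",                      # placeholder Azure deployment
                "api_key": "azure-key-1",
                "api_base": "https://endpoint-1.openai.azure.com",
            },
        },
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "openai-key-1"},
        },
        {
            "model_name": "gpt-4",
            "litellm_params": {"model": "gpt-4", "api_key": "openai-key-1"},
        },
    ],
    num_retries=2,
    fallbacks=[{"gpt-3.5-turbo": ["gpt-4"]}],
)

response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
```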
-6. **Model Interaction**: The request is sent to the appropriate model API (litellm.completion() or litellm.embedding()) for processing.
+6. [**litellm.completion() / litellm.embedding()**](../index#litellm-python-sdk): The litellm Python SDK is used to call the LLM in the OpenAI API format (translation and parameter mapping).

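Step 6 is the SDK call underneath the Router; a sketch of the OpenAI-format translation it performs, with provider names and credentials as placeholders:

```python
# litellm.completion() accepts OpenAI-style arguments, maps them to the
# provider-specific API, and returns an OpenAI-format response object.
import os
import litellm

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # placeholder credential
os.environ["OPENAI_API_KEY"] = "sk-..."         # placeholder credential

response = litellm.completion(
    model="anthropic/claude-3-haiku-20240307",  # provider/model in LiteLLM format
    messages=[{"role": "user", "content": "Say hi in one word."}],
)
print(response.choices[0].message.content)      # same shape as an OpenAI response

embedding = litellm.embedding(
    model="text-embedding-ada-002",
    input=["hello world"],
)
```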
-7. **Response**: The model's response is sent back through the same components to the user.
-
-8. **Post-Request Processing**: After the response is sent, several asynchronous operations occur:
-   - The _PROXY_track_cost_callback updates spend in the database.
-   - Logging to LangFuse for analytics and monitoring.
-   - The MaxParallelRequestsHandler updates virtual key usage and performs post-request cleanup.
+7. **Post-Request Processing**: After the response is sent back to the client, the following **asynchronous** tasks are performed:
+   - [Logging to LangFuse (logging destination is configurable)](./logging)
+   - The [MaxParallelRequestsHandler](https://github.com/BerriAI/litellm/blob/main/litellm/proxy/hooks/parallel_request_limiter.py) updates the rpm/tpm usage for the following:
+     - Global Server Rate Limit
+     - Virtual Key Rate Limit
+     - User Rate Limit
+     - Team Limit
+   - The `_PROXY_track_cost_callback` updates spend / usage in the LiteLLM database.
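For the post-request tasks in step 7, a rough SDK-level sketch: the proxy's Langfuse logging and spend tracking build on the same callback and cost utilities the litellm SDK exposes, so `completion_cost` below only stands in for what `_PROXY_track_cost_callback` persists in the database; all credentials are placeholders:

```python
# After a response is returned, litellm can fire async success callbacks
# (e.g. Langfuse) and compute the request's cost from the usage in the response.
import os
import litellm
from litellm import completion, completion_cost

os.environ["OPENAI_API_KEY"] = "sk-..."          # placeholder credential
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."  # placeholder Langfuse credentials
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
litellm.success_callback = ["langfuse"]          # logging destination is configurable

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)

# Roughly the figure the spend-tracking callback records per request.
print(completion_cost(completion_response=response))
```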