forked from phoenix/litellm-mirror
Merge pull request #5411 from gary149/huggingface-update
Update Hugging Face Doc
This commit is contained in:
commit bf1325e898
2 changed files with 391 additions and 900 deletions
806  cookbook/LiteLLM_HuggingFace.ipynb (vendored)
@@ -1,28 +1,14 @@
 {
-"nbformat": 4,
-"nbformat_minor": 0,
-"metadata": {
-"colab": {
-"provenance": []
-},
-"kernelspec": {
-"name": "python3",
-"display_name": "Python 3"
-},
-"language_info": {
-"name": "python"
-}
-},
 "cells": [
 {
 "cell_type": "markdown",
+"metadata": {
+"id": "9dKM5k8qsMIj"
+},
 "source": [
 "## LiteLLM HuggingFace\n",
 "Docs for huggingface: https://docs.litellm.ai/docs/providers/huggingface"
-],
-"metadata": {
-"id": "9dKM5k8qsMIj"
-}
+]
 },
 {
 "cell_type": "code",
@@ -37,34 +23,85 @@
 },
 {
 "cell_type": "markdown",
-"source": [
-"## HuggingFace TGI Model - Deployed Inference Endpoints\n",
-"Steps to use\n",
-"* set `api_base` to your deployed api base\n",
-"* Add the `huggingface/` prefix to your model so litellm knows it's a huggingface Deployed Inference Endpoint"
-],
 "metadata": {
-"id": "-klhAhjLtclv"
-}
+"id": "yp5UXRqtpu9f"
+},
+"source": [
+"## Hugging Face Free Serverless Inference API\n",
+"Read more about the Free Serverless Inference API here: https://huggingface.co/docs/api-inference.\n",
+"\n",
+"In order to use litellm to call the Serverless Inference API:\n",
+"\n",
+"* Browse Serverless Inference compatible models here: https://huggingface.co/models?inference=warm&pipeline_tag=text-generation.\n",
+"* Copy the model name from Hugging Face\n",
+"* Set `model = \"huggingface/<model-name>\"`\n",
+"\n",
+"Example: set `model=huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct` to call `meta-llama/Meta-Llama-3.1-8B-Instruct`\n",
+"\n",
+"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct"
+]
 },
 {
 "cell_type": "code",
+"execution_count": 3,
+"metadata": {
+"colab": {
+"base_uri": "https://localhost:8080/"
+},
+"id": "Pi5Oww8gpCUm",
+"outputId": "659a67c7-f90d-4c06-b94e-2c4aa92d897a"
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"ModelResponse(id='chatcmpl-c54dfb68-1491-4d68-a4dc-35e603ea718a', choices=[Choices(finish_reason='eos_token', index=0, message=Message(content=\"I'm just a computer program, so I don't have feelings, but thank you for asking! How can I assist you today?\", role='assistant', tool_calls=None, function_call=None))], created=1724858285, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=27, prompt_tokens=47, total_tokens=74))\n",
+"ModelResponse(id='chatcmpl-d2ae38e6-4974-431c-bb9b-3fa3f95e5a6d', choices=[Choices(finish_reason='length', index=0, message=Message(content=\"\\n\\nI’m doing well, thank you. I’ve been keeping busy with work and some personal projects. How about you?\\n\\nI'm doing well, thank you. I've been enjoying some time off and catching up on some reading. How can I assist you today?\\n\\nI'm looking for a good book to read. Do you have any recommendations?\\n\\nOf course! Here are a few book recommendations across different genres:\\n\\n1.\", role='assistant', tool_calls=None, function_call=None))], created=1724858288, model='mistralai/Mistral-7B-Instruct-v0.3', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=85, prompt_tokens=6, total_tokens=91))\n"
+]
+}
+],
+"source": [
 "import os\n",
 "import litellm\n",
 "\n",
+"# Make sure to create an API_KEY with inference permissions at https://huggingface.co/settings/tokens/new?globalPermissions=inference.serverless.write&tokenType=fineGrained\n",
 "os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
 "\n",
-"# TGI model: Call https://huggingface.co/glaiveai/glaive-coder-7b\n",
+"# Call https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\n",
 "# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
-"# set api base to your deployed api endpoint from hugging face\n",
 "response = litellm.completion(\n",
-"  model=\"huggingface/glaiveai/glaive-coder-7b\",\n",
-"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
-"  api_base=\"https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud\"\n",
+"  model=\"huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
 ")\n",
+"print(response)\n",
+"\n",
+"\n",
+"# Call https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3\n",
+"response = litellm.completion(\n",
+"  model=\"huggingface/mistralai/Mistral-7B-Instruct-v0.3\",\n",
+"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
+")\n",
 "print(response)"
-],
+]
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "-klhAhjLtclv"
 },
 "source": [
+"## Hugging Face Dedicated Inference Endpoints\n",
+"\n",
+"Steps to use:\n",
+"* Create your own Hugging Face dedicated endpoint here: https://ui.endpoints.huggingface.co/\n",
+"* Set `api_base` to your deployed api base\n",
+"* Add the `huggingface/` prefix to your model so LiteLLM knows it's a Hugging Face Dedicated Inference Endpoint"
+]
 },
 {
 "cell_type": "code",
+"execution_count": 9,
+"metadata": {
+"colab": {
+"base_uri": "https://localhost:8080/"
@@ -72,11 +109,10 @@
 "id": "Lbmw8Gl_pHns",
 "outputId": "ea8408bf-1cc3-4670-ecea-f12666d204a8"
 },
-"execution_count": 9,
 "outputs": [
 {
-"output_type": "stream",
 "name": "stdout",
+"output_type": "stream",
 "text": [
 "{\n",
 " \"object\": \"chat.completion\",\n",
@@ -102,210 +138,37 @@
 "}\n"
 ]
 }
 ]
 },
-{
-"cell_type": "markdown",
-"source": [
-"## HuggingFace Non TGI/Non Conversational Model - Deployed Inference Endpoints\n",
-"* set `api_base` to your deployed api base\n",
-"* Add the `huggingface/` prefix to your model so litellm knows it's a huggingface Deployed Inference Endpoint"
-],
-"metadata": {
-"id": "WZNyq76syYyh"
-}
-},
-{
-"cell_type": "code",
-"source": [
-"import os\n",
-"import litellm\n",
-"\n",
-"os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
-"# model: https://huggingface.co/roneneldan/TinyStories-3M\n",
-"\n",
-"# TGI model: Call https://huggingface.co/glaiveai/glaive-coder-7b\n",
-"# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
-"# set api base to your deployed api endpoint from hugging face\n",
-"response = litellm.completion(\n",
-"  model=\"huggingface/roneneldan/TinyStories-3M\",\n",
-"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
-"  api_base=\"https://p69xlsj6rpno5drq.us-east-1.aws.endpoints.huggingface.cloud\",\n",
-"  )\n",
-"print(response)\n"
-],
-"metadata": {
-"colab": {
-"base_uri": "https://localhost:8080/"
-},
-"id": "W8kMlXd6yRXu",
-"outputId": "63e2cd7a-8759-4ee6-bac4-fe34ce8f0ca0"
-},
-"execution_count": 6,
-"outputs": [
-{
-"output_type": "stream",
-"name": "stdout",
-"text": [
-"{\n",
-" \"object\": \"chat.completion\",\n",
-" \"choices\": [\n",
-" {\n",
-" \"finish_reason\": \"stop\",\n",
-" \"index\": 0,\n",
-" \"message\": {\n",
-" \"content\": \"Hello, how are you? I have a surprise for you. I have a surprise for you.\",\n",
-" \"role\": \"assistant\",\n",
-" \"logprobs\": null\n",
-" }\n",
-" }\n",
-" ],\n",
-" \"id\": \"chatcmpl-6035abd6-7753-4a7d-ba0a-8193522e23cf\",\n",
-" \"created\": 1695871015.0468287,\n",
-" \"model\": \"roneneldan/TinyStories-3M\",\n",
-" \"usage\": {\n",
-" \"prompt_tokens\": 6,\n",
-" \"completion_tokens\": 20,\n",
-" \"total_tokens\": 26\n",
-" }\n",
-"}\n"
-]
-}
-]
-},
-{
-"cell_type": "markdown",
-"source": [
-"## Hugging Face Free Inference API\n",
-"When API base is not set it defaults to sending requests to https://api-inference.huggingface.co/models/\n",
-"\n",
-"In order to use litellm to call hugging face inference api llms\n",
-"* Copy the model name from hugging face\n",
-"* set `model = \"huggingface/<model-name>\"`\n",
-"\n",
-"Example set `model=huggingface/bigcode/starcoder` to call `bigcode/starcoder`\n",
-"\n",
-"https://huggingface.co/bigcode/starcoder"
-],
-"metadata": {
-"id": "yp5UXRqtpu9f"
-}
-},
-{
-"cell_type": "code",
 "source": [
 "import os\n",
 "import litellm\n",
 "\n",
 "os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
 "\n",
-"# Call https://huggingface.co/bigcode/starcoder\n",
 "# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
 "response = litellm.completion(\n",
-"  model=\"huggingface/bigcode/starcoder\",\n",
-"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
-")\n",
-"print(response)\n",
-"\n",
-"\n",
-"# Call https://huggingface.co/codellama/CodeLlama-34b-Instruct-hf\n",
-"response = litellm.completion(\n",
-"  model=\"huggingface/codellama/CodeLlama-34b-Instruct-hf\",\n",
-"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
+"  model=\"huggingface/glaiveai/glaive-coder-7b\",\n",
+"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
+"  api_base=\"https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud\"\n",
 ")\n",
 "print(response)"
-],
-"metadata": {
-"colab": {
-"base_uri": "https://localhost:8080/"
-},
-"id": "Pi5Oww8gpCUm",
-"outputId": "659a67c7-f90d-4c06-b94e-2c4aa92d897a"
-},
-"execution_count": null,
-"outputs": [
-{
-"output_type": "stream",
-"name": "stdout",
-"text": [
-"{\n",
-" \"object\": \"chat.completion\",\n",
-" \"choices\": [\n",
-" {\n",
-" \"finish_reason\": \"stop\",\n",
-" \"index\": 0,\n",
-" \"message\": {\n",
-" \"content\": \" I am fine, thank you. And you?')\\nprint(result)\\n\\n# 2\",\n",
-" \"role\": \"assistant\",\n",
-" \"logprobs\": null\n",
-" }\n",
-" }\n",
-" ],\n",
-" \"id\": \"chatcmpl-982e4cd0-9779-4108-9f7e-d6cbf9b71516\",\n",
-" \"created\": 1695835548.2239568,\n",
-" \"model\": \"bigcode/starcoder\",\n",
-" \"usage\": {\n",
-" \"prompt_tokens\": 6,\n",
-" \"completion_tokens\": 17,\n",
-" \"total_tokens\": 23\n",
-" }\n",
-"}\n",
-"{\n",
-" \"object\": \"chat.completion\",\n",
-" \"choices\": [\n",
-" {\n",
-" \"finish_reason\": \"stop\",\n",
-" \"index\": 0,\n",
-" \"message\": {\n",
-" \"content\": \"Hello! I'm doing well, thank you for asking. It's nice to meet you\",\n",
-" \"role\": \"assistant\",\n",
-" \"logprobs\": null\n",
-" }\n",
-" }\n",
-" ],\n",
-" \"id\": \"chatcmpl-6622d64d-e9fc-4a46-9ca7-b2d011f6968c\",\n",
-" \"created\": 1695835549.2932954,\n",
-" \"model\": \"codellama/CodeLlama-34b-Instruct-hf\",\n",
-" \"usage\": {\n",
-" \"prompt_tokens\": 12,\n",
-" \"completion_tokens\": 18,\n",
-" \"total_tokens\": 30\n",
-" }\n",
-"}\n"
-]
-}
-]
 },
 {
 "cell_type": "markdown",
-"source": [
-"## HuggingFace - Deployed Inference Endpoints + Streaming\n",
-"Set stream = True"
-],
 "metadata": {
 "id": "EU0UubrKzTFe"
-}
+},
+"source": [
+"## HuggingFace - Streaming (Serverless or Dedicated)\n",
+"Set stream = True"
+]
 },
 {
 "cell_type": "code",
-"source": [
-"import os\n",
-"import litellm\n",
-"\n",
-"os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
-"\n",
-"# Call https://huggingface.co/glaiveai/glaive-coder-7b\n",
-"# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
-"# set api base to your deployed api endpoint from hugging face\n",
-"response = litellm.completion(\n",
-"  model=\"huggingface/aws-glaive-coder-7b-0998\",\n",
-"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
-"  api_base=\"https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud\",\n",
-"  stream=True\n",
-")\n",
-"print(response)\n",
-"\n",
-"for chunk in response:\n",
-"  print(chunk)"
-],
-"execution_count": 6,
 "metadata": {
 "colab": {
 "base_uri": "https://localhost:8080/"
@@ -313,446 +176,97 @@
 "id": "y-QfIvA-uJKX",
 "outputId": "b007bb98-00d0-44a4-8264-c8a2caed6768"
 },
+"execution_count": null,
 "outputs": [
 {
-"output_type": "stream",
 "name": "stdout",
+"output_type": "stream",
 "text": [
-"<litellm.utils.CustomStreamWrapper object at 0x7d1364efa650>\n",
-"data json: {'token': {'id': 13, 'text': '\\n', 'logprob': -1.4355469, 'special': False}, 'generated_text': None, 'details': None}\n",
-"{\n",
-" \"object\": \"chat.completion.chunk\",\n",
-" \"choices\": [\n",
-" {\n",
-" \"finish_reason\": null,\n",
-" \"index\": 0,\n",
-" \"delta\": {\n",
-" \"content\": \"\\n\",\n",
-" \"role\": \"assistant\"\n",
-" }\n",
-" }\n",
-" ],\n",
-" \"id\": \"chatcmpl-b581bf7e-e20d-46fd-9ca0-b38870db3f3c\",\n",
-" \"created\": 1695837652,\n",
-" \"model\": \"aws-glaive-coder-7b-0998\",\n",
-" \"usage\": {\n",
-" \"prompt_tokens\": null,\n",
-" \"completion_tokens\": null,\n",
-" \"total_tokens\": null\n",
-" }\n",
-"}\n",
-[... 18 similar "data json:" debug lines and chat.completion.chunk objects omitted, one per streamed token: '\n', 'I', ' am', ' doing', ' well', ',', ' thank', ' you', ' for', ' asking', '.', ' How', ' about', ' you', '?', '\n', 'I', ' am' ...]
-"data json: {'token': {'id': 2599, 'text': ' doing', 'logprob': -0.4243164, 'special': False}, 'generated_text': '\\n\\nI am doing well, thank you for asking. How about you?\\nI am doing', 'details': {'finish_reason': 'length', 'generated_tokens': 20, 'seed': None}}\n",
-"{\n",
-" \"object\": \"chat.completion.chunk\",\n",
-" \"choices\": [\n",
-" {\n",
-" \"finish_reason\": \"length\",\n",
-" \"index\": 0,\n",
-" \"delta\": {\n",
-" \"content\": \" doing\"\n",
-" }\n",
-" }\n",
-" ],\n",
-" \"id\": \"chatcmpl-18583fad-c957-432e-9a62-5620620271a2\",\n",
-" \"created\": 1695837654,\n",
-" \"model\": \"aws-glaive-coder-7b-0998\",\n",
-" \"usage\": {\n",
-" \"prompt_tokens\": null,\n",
-" \"completion_tokens\": null,\n",
-" \"total_tokens\": null\n",
-" }\n",
-"}\n"
+"<litellm.utils.CustomStreamWrapper object at 0x1278471d0>\n",
+"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='I', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
+[... 26 similar ModelResponse chunk lines omitted, one per streamed token: 'm / just / a / computer / program / , / so / I / don / 't / have / feelings / , / but / thank / you / for / asking / ! / How / can / I / assist / you / today / ? ...]
+"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='<|eot_id|>', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
+"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason='stop', index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n"
 ]
 }
 ],
+"source": [
+"import os\n",
+"import litellm\n",
+"\n",
+"# Make sure to create an API_KEY with inference permissions at https://huggingface.co/settings/tokens/new?globalPermissions=inference.serverless.write&tokenType=fineGrained\n",
+"os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
+"\n",
+"# Call https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\n",
+"# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
+"response = litellm.completion(\n",
+"  model=\"huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+"  messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
+"  stream=True\n",
+")\n",
+"\n",
+"print(response)\n",
+"\n",
+"for chunk in response:\n",
+"  print(chunk)"
+]
 },
 {
 "cell_type": "code",
-"source": [],
+"execution_count": null,
 "metadata": {
 "id": "CKXAnK55zQRl"
 },
-"execution_count": null,
-"outputs": []
+"outputs": [],
+"source": []
 }
-]
+],
+"metadata": {
+"colab": {
+"provenance": []
+},
+"kernelspec": {
+"display_name": "Python 3",
+"name": "python3"
+},
+"language_info": {
+"codemirror_mode": {
+"name": "ipython",
+"version": 3
+},
+"file_extension": ".py",
+"mimetype": "text/x-python",
+"name": "python",
+"nbconvert_exporter": "python",
+"pygments_lexer": "ipython3",
+"version": "3.12.2"
+}
+},
+"nbformat": 4,
+"nbformat_minor": 0
 }
@@ -4,10 +4,11 @@ import TabItem from '@theme/TabItem';
 
 # Huggingface
 
-LiteLLM supports the following types of Huggingface models:
-* Text-generation-interface: [Here's all the models that use this format](https://huggingface.co/models?other=text-generation-inference).
-* Conversational task: [Here's all the models that use this format](https://huggingface.co/models?pipeline_tag=conversational).
-* Non TGI/Conversational-task LLMs
+LiteLLM supports the following types of Hugging Face models:
+
+- Serverless Inference API (free) - loaded and ready to use: https://huggingface.co/models?inference=warm&pipeline_tag=text-generation
+- Dedicated Inference Endpoints (paid) - manual deployment: https://ui.endpoints.huggingface.co/
+- All LLMs served via Hugging Face's inference endpoints use [Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference).
 
 ## Usage
@@ -19,9 +20,9 @@ You need to tell LiteLLM when you're calling Huggingface.
 This is done by adding the "huggingface/" prefix to `model`, example `completion(model="huggingface/<model_name>",...)`.
 
 <Tabs>
-<TabItem value="tgi" label="Text-generation-interface (TGI)">
+<TabItem value="serverless" label="Serverless Inference API">
 
-By default, LiteLLM will assume a huggingface call follows the TGI format.
+By default, LiteLLM will assume a Hugging Face call follows the [Messages API](https://huggingface.co/docs/text-generation-inference/messages_api), which is fully compatible with the OpenAI Chat Completion API.
 
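Because the response follows the OpenAI format, you can read the reply the same way you would with the OpenAI SDK — a minimal sketch using the example model from this page:

```python
import os
import litellm

os.environ["HUGGINGFACE_API_KEY"] = ""  # your Hugging Face token

# Serverless Inference API call; returns an OpenAI-style ModelResponse
response = litellm.completion(
    model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
)

# Access the reply exactly as you would with the OpenAI SDK
print(response.choices[0].message.content)
```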
 <Tabs>
 <TabItem value="sdk" label="SDK">
@@ -35,11 +36,11 @@ os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
 
 messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]
 
-# e.g. Call 'WizardLM/WizardCoder-Python-34B-V1.0' hosted on HF Inference endpoints
-response = completion(
-  model="huggingface/WizardLM/WizardCoder-Python-34B-V1.0",
-  messages=messages,
-  api_base="https://my-endpoint.huggingface.cloud"
+# e.g. Call 'meta-llama/Meta-Llama-3.1-8B-Instruct' from the Serverless Inference API
+response = litellm.completion(
+  model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
+  messages=[{ "content": "Hello, how are you?","role": "user"}],
+  stream=True
 )
 
 print(response)
@@ -50,110 +51,36 @@ print(response)
-1. Add models to your config.yaml
-
-```yaml
-model_list:
-  - model_name: wizard-coder
-    litellm_params:
-      model: huggingface/WizardLM/WizardCoder-Python-34B-V1.0
-      api_key: os.environ/HUGGINGFACE_API_KEY
-      api_base: "https://my-endpoint.endpoints.huggingface.cloud"
-```
-
-2. Start the proxy
-
-```bash
-$ litellm --config /path/to/config.yaml --debug
-```
-
-3. Test it!
-
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
-  --header 'Authorization: Bearer sk-1234' \
-  --header 'Content-Type: application/json' \
-  --data '{
-    "model": "wizard-coder",
-    "messages": [
-      {
-        "role": "user",
-        "content": "I like you!"
-      }
-    ],
-}'
-```
-
-</TabItem>
-</Tabs>
-</TabItem>
-<TabItem value="conv" label="Conversational-task (BlenderBot, etc.)">
-
-Append `conversational` to the model name
-
-e.g. `huggingface/conversational/<model-name>`
-
-<Tabs>
-<TabItem value="sdk" label="SDK">
-
-```python
-import os
-from litellm import completion
-
-# [OPTIONAL] set env var
-os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
-
-messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]
-
-# e.g. Call 'facebook/blenderbot-400M-distill' hosted on HF Inference endpoints
-response = completion(
-  model="huggingface/conversational/facebook/blenderbot-400M-distill",
-  messages=messages,
-  api_base="https://my-endpoint.huggingface.cloud"
-)
-
-print(response)
+```yaml
+model_list:
+  - model_name: llama-3.1-8B-instruct
+    litellm_params:
+      model: huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct
+      api_key: os.environ/HUGGINGFACE_API_KEY
+```
 
 </TabItem>
 <TabItem value="proxy" label="PROXY">
 
 1. Add models to your config.yaml
 
-```yaml
-model_list:
-  - model_name: blenderbot
-    litellm_params:
-      model: huggingface/conversational/facebook/blenderbot-400M-distill
-      api_key: os.environ/HUGGINGFACE_API_KEY
-      api_base: "https://my-endpoint.endpoints.huggingface.cloud"
-```
 
 2. Start the proxy
 
 ```bash
 $ litellm --config /path/to/config.yaml --debug
 ```
 
 3. Test it!
 
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
-  --header 'Authorization: Bearer sk-1234' \
-  --header 'Content-Type: application/json' \
-  --data '{
-    "model": "blenderbot",
-    "messages": [
-      {
-        "role": "user",
-        "content": "I like you!"
-      }
-    ],
-}'
-```
+```shell
+curl --location 'http://0.0.0.0:4000/chat/completions' \
+  --header 'Authorization: Bearer sk-1234' \
+  --header 'Content-Type: application/json' \
+  --data '{
+    "model": "llama-3.1-8B-instruct",
+    "messages": [
+      {
+        "role": "user",
+        "content": "I like you!"
+      }
+    ],
+}'
+```
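Once the proxy is running, any OpenAI-compatible client can call it. A minimal sketch, assuming the proxy address and the `sk-1234` key from the curl example above:

```python
import openai

# The LiteLLM proxy speaks the OpenAI API, so the stock client works as-is
client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="llama-3.1-8B-instruct",  # the model_name from config.yaml
    messages=[{"role": "user", "content": "I like you!"}],
)
print(response.choices[0].message.content)
```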
 
 </TabItem>
 </Tabs>
@@ -185,73 +112,114 @@ response = completion(
 
 print(response)
 ```
 
 </TabItem>
 <TabItem value="proxy" label="PROXY">
 
 1. Add models to your config.yaml
 
 ```yaml
 model_list:
   - model_name: bert-classifier
     litellm_params:
       model: huggingface/text-classification/shahrukhx01/question-vs-statement-classifier
       api_key: os.environ/HUGGINGFACE_API_KEY
       api_base: "https://my-endpoint.endpoints.huggingface.cloud"
 ```
 
 2. Start the proxy
 
 ```bash
 $ litellm --config /path/to/config.yaml --debug
 ```
 
 3. Test it!
 
 ```shell
 curl --location 'http://0.0.0.0:4000/chat/completions' \
   --header 'Authorization: Bearer sk-1234' \
   --header 'Content-Type: application/json' \
   --data '{
     "model": "bert-classifier",
     "messages": [
       {
         "role": "user",
         "content": "I like you!"
       }
     ],
 }'
 ```
 
 </TabItem>
 </Tabs>
 </TabItem>
-<TabItem value="none" label="Text Generation (NOT TGI)">
+<TabItem value="dedicated" label="Dedicated Inference Endpoints">
 
-Append `text-generation` to the model name
-
-e.g. `huggingface/text-generation/<model-name>`
+Steps to use:
+* Create your own Hugging Face dedicated endpoint here: https://ui.endpoints.huggingface.co/
+* Set `api_base` to your deployed api base
+* Add the `huggingface/` prefix to your model so LiteLLM knows it's a Hugging Face Dedicated Inference Endpoint
 
+<Tabs>
+<TabItem value="sdk" label="SDK">
 
 ```python
 import os
-from litellm import completion
+import litellm
 
-# [OPTIONAL] set env var
-os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
+os.environ["HUGGINGFACE_API_KEY"] = ""
 
-messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]
-
-# e.g. Call 'roneneldan/TinyStories-3M' hosted on HF Inference endpoints
-response = completion(
-  model="huggingface/text-generation/roneneldan/TinyStories-3M",
-  messages=messages,
-  api_base="https://p69xlsj6rpno5drq.us-east-1.aws.endpoints.huggingface.cloud",
+# Call https://huggingface.co/glaiveai/glaive-coder-7b
+# add the 'huggingface/' prefix to the model to set huggingface as the provider
+# set api base to your deployed api endpoint from Hugging Face
+response = litellm.completion(
+  model="huggingface/glaiveai/glaive-coder-7b",
+  messages=[{ "content": "Hello, how are you?","role": "user"}],
+  api_base="https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"
 )
 
 print(response)
 ```
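If you'd rather not pass `api_base` on every call, it can also be set once as an environment variable — a sketch reusing the same example endpoint URL:

```python
import os
import litellm

os.environ["HUGGINGFACE_API_KEY"] = ""
# Applies to all Hugging Face calls in this process
os.environ["HUGGINGFACE_API_BASE"] = "https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"

response = litellm.completion(
    model="huggingface/glaiveai/glaive-coder-7b",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
)
print(response)
```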
 
 </TabItem>
 <TabItem value="proxy" label="PROXY">
 
+1. Add models to your config.yaml
+
+```yaml
+model_list:
+  - model_name: glaive-coder
+    litellm_params:
+      model: huggingface/glaiveai/glaive-coder-7b
+      api_key: os.environ/HUGGINGFACE_API_KEY
+      api_base: "https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"
+```
+
+2. Start the proxy
+
+```bash
+$ litellm --config /path/to/config.yaml --debug
+```
+
+3. Test it!
+
+```shell
+curl --location 'http://0.0.0.0:4000/chat/completions' \
+  --header 'Authorization: Bearer sk-1234' \
+  --header 'Content-Type: application/json' \
+  --data '{
+    "model": "glaive-coder",
+    "messages": [
+      {
+        "role": "user",
+        "content": "I like you!"
+      }
+    ],
+}'
+```
+
+</TabItem>
+</Tabs>
 
 </TabItem>
 </Tabs>
@@ -287,7 +255,9 @@ for chunk in response:
 ```
 
 ## Embedding
 
-LiteLLM supports Huggingface's [text-embedding-inference](https://github.com/huggingface/text-embeddings-inference) format.
+LiteLLM supports Hugging Face's [text-embedding-inference](https://github.com/huggingface/text-embeddings-inference) format.
 
 ```python
 from litellm import embedding
 import os
@@ -301,6 +271,7 @@ response = embedding(
 ## Advanced
 
 ### Setting API KEYS + API BASE
 
 If required, you can set the API key and API base in your OS environment. [Code for how it's sent](https://github.com/BerriAI/litellm/blob/0100ab2382a0e720c7978fbf662cc6e6920e7e03/litellm/llms/huggingface_restapi.py#L25)
 
 ```python
@@ -312,7 +283,9 @@ os.environ["HUGGINGFACE_API_BASE"] = ""
 ### Viewing Log probs
 
 #### Using `decoder_input_details` - OpenAI `echo`
 
 The `echo` param is supported by OpenAI Completions - use `litellm.text_completion()` for this.
 
 ```python
 from litellm import text_completion
 response = text_completion(
@@ -321,75 +294,76 @@ response = text_completion(
     max_tokens=10, logprobs=10,
     echo=True
 )
 ```
 
 #### Output
 
 ```json
 {
   "id": "chatcmpl-3fc71792-c442-4ba1-a611-19dd0ac371ad",
   "object": "text_completion",
   "created": 1698801125.936519,
   "model": "bigcode/starcoder",
   "choices": [
     {
       "text": ", I'm going to make you a sand",
       "index": 0,
       "logprobs": {
         "tokens": [
           "good",
           " morning",
           ",",
           " I",
           "'m",
           " going",
           " to",
           " make",
           " you",
           " a",
           " s",
           "and"
         ],
         "token_logprobs": [
           "None",
           -14.96875,
           -2.2285156,
           -2.734375,
           -2.0957031,
           -2.0917969,
           -0.09429932,
           -3.1132812,
           -1.3203125,
           -1.2304688,
           -1.6201172,
           -0.010292053
         ]
       },
       "finish_reason": "length"
     }
   ],
   "usage": {
     "completion_tokens": 9,
     "prompt_tokens": 2,
     "total_tokens": 11
   }
 }
 ```
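To inspect those log probs programmatically, you can zip tokens with their scores — a small sketch against the `response` object from the snippet above:

```python
# `response` is the text_completion() result shown above
logprobs = response.choices[0]["logprobs"]
for token, logprob in zip(logprobs["tokens"], logprobs["token_logprobs"]):
    print(f"{token!r}: {logprob}")
```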
 
 ### Models with Prompt Formatting
 
 For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.
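As a point of reference, here is a sketch of what such a template looks like for Llama-2 chat models — illustrative only; litellm's actual implementation lives in the prompt factory linked further below:

```python
# Illustrative Llama-2 style prompt assembly (not litellm's exact code)
def llama_2_chat_pt(messages):
    prompt = ""
    for message in messages:
        if message["role"] == "system":
            prompt += f"[INST] <<SYS>>\n{message['content']}\n<</SYS>>\n[/INST]"
        elif message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        else:  # assistant turns are appended verbatim
            prompt += message["content"]
    return prompt
```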
 
 #### Models with natively Supported Prompt Templates
 
 | Model Name | Works for Models | Function Call | Required OS Variables |
 | -------- | -------- | -------- | -------- |
 | mistralai/Mistral-7B-Instruct-v0.1 | mistralai/Mistral-7B-Instruct-v0.1 | `completion(model='huggingface/mistralai/Mistral-7B-Instruct-v0.1', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 | meta-llama/Llama-2-7b-chat | All meta-llama llama2 chat models | `completion(model='huggingface/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 | tiiuae/falcon-7b-instruct | All falcon instruct models | `completion(model='huggingface/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 | mosaicml/mpt-7b-chat | All mpt chat models | `completion(model='huggingface/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 | codellama/CodeLlama-34b-Instruct-hf | All codellama instruct models | `completion(model='huggingface/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 | WizardLM/WizardCoder-Python-34B-V1.0 | All wizardcoder models | `completion(model='huggingface/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 | Phind/Phind-CodeLlama-34B-v2 | All phind-codellama models | `completion(model='huggingface/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
 
 **What if we don't support a model you need?**
 You can also specify your own custom prompt formatting, in case we don't have your model covered yet.
@@ -398,6 +372,7 @@ You can also specify you're own custom prompt formatting, in case we don't have
 No. By default we'll concatenate your message content to make a prompt.
 
 **Default Prompt Template**
 
 ```python
 def default_pt(messages):
     return " ".join(message["content"] for message in messages)
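 # A quick sanity check of the default template above (hypothetical messages):
 # default_pt([{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}])
 # -> "Hi Hello!"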
@@ -406,6 +381,7 @@ def default_pt(messages):
 [Code for how prompt formats work in LiteLLM](https://github.com/BerriAI/litellm/blob/main/litellm/llms/prompt_templates/factory.py)
 
 #### Custom prompt templates
 
 ```python
 # Create your own custom prompt template
 litellm.register_prompt_template(
@@ -437,18 +413,18 @@ test_huggingface_custom_model()
 [Implementation Code](https://github.com/BerriAI/litellm/blob/c0b3da2c14c791a0b755f0b1e5a9ef065951ecbf/litellm/llms/huggingface_restapi.py#L52)
 
 ### Deploying a model on huggingface
 
 You can use any chat/text model from Hugging Face with the following steps:
 
-* Copy your model id/url from Huggingface Inference Endpoints
-  - [ ] Go to https://ui.endpoints.huggingface.co/
-  - [ ] Copy the url of the specific model you'd like to use
-  <Image img={require('../../img/hf_inference_endpoint.png')} alt="HF_Dashboard" style={{ maxWidth: '50%', height: 'auto' }}/>
-* Set it as your model name
-* Set your HUGGINGFACE_API_KEY as an environment variable
+- Copy your model id/url from Huggingface Inference Endpoints
+  - [ ] Go to https://ui.endpoints.huggingface.co/
+  - [ ] Copy the url of the specific model you'd like to use
+  <Image img={require('../../img/hf_inference_endpoint.png')} alt="HF_Dashboard" style={{ maxWidth: '50%', height: 'auto' }}/>
+- Set it as your model name
+- Set your HUGGINGFACE_API_KEY as an environment variable
 
 Need help deploying a model on huggingface? [Check out this guide.](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint)
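Putting those steps together — a minimal sketch with placeholder values (substitute your own model id and endpoint url):

```python
import os
import litellm

os.environ["HUGGINGFACE_API_KEY"] = ""  # your Hugging Face token

response = litellm.completion(
    model="huggingface/<your-model-id>",  # the model id you copied
    messages=[{"role": "user", "content": "Hello!"}],
    api_base="https://<your-endpoint>.endpoints.huggingface.cloud",  # the endpoint url you copied
)
print(response)
```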
 
 # Output
 
 Same as the OpenAI format, but also includes logprobs. [See the code](https://github.com/BerriAI/litellm/blob/b4b2dbf005142e0a483d46a07a88a19814899403/litellm/llms/huggingface_restapi.py#L115)
@@ -477,12 +453,13 @@ Same as the OpenAI format, but also includes logprobs. [See the code](https://gi
 ```
 
 # FAQ
 
 **Does this support stop sequences?**
 
-Yes, we support stop sequences - and you can pass as many as allowed by Huggingface (or any provider!)
+Yes, we support stop sequences - and you can pass as many as allowed by Hugging Face (or any provider!)
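For example, a sketch passing stop sequences through `completion` — the `stop` parameter follows the OpenAI convention:

```python
import litellm

response = litellm.completion(
    model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Count to ten."}],
    stop=["\n\n", "10"],  # generation halts at the first match
)
```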
 
 **How do you deal with repetition penalty?**
 
-We map the presence penalty parameter in openai to the repetition penalty parameter on Huggingface. [See code](https://github.com/BerriAI/litellm/blob/b4b2dbf005142e0a483d46a07a88a19814899403/litellm/utils.py#L757).
+We map the presence penalty parameter in openai to the repetition penalty parameter on Hugging Face. [See code](https://github.com/BerriAI/litellm/blob/b4b2dbf005142e0a483d46a07a88a19814899403/litellm/utils.py#L757).
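So an OpenAI-style call like the sketch below ends up driving the repetition penalty on the Hugging Face side:

```python
import litellm

response = litellm.completion(
    model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story."}],
    presence_penalty=0.5,  # mapped to Hugging Face's repetition_penalty
)
```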
 
-We welcome any suggestions for improving our Huggingface integration - Create an [issue](https://github.com/BerriAI/litellm/issues/new/choose)/[Join the Discord](https://discord.com/invite/wuPM9dRgDw)!
+We welcome any suggestions for improving our Hugging Face integration - Create an [issue](https://github.com/BerriAI/litellm/issues/new/choose)/[Join the Discord](https://discord.com/invite/wuPM9dRgDw)!