From 6a5015d7d4eebf15aaa7f6de01968d5d89561a19 Mon Sep 17 00:00:00 2001 From: moritzlaurer Date: Wed, 3 Jul 2024 12:35:51 +0200 Subject: [PATCH] implemented suggestions from reviews --- .../en/enterprise_dedicated_endpoints.ipynb | 368 +++++++----------- 1 file changed, 143 insertions(+), 225 deletions(-) diff --git a/notebooks/en/enterprise_dedicated_endpoints.ipynb b/notebooks/en/enterprise_dedicated_endpoints.ipynb index 0c079d37..d58b4877 100644 --- a/notebooks/en/enterprise_dedicated_endpoints.ipynb +++ b/notebooks/en/enterprise_dedicated_endpoints.ipynb @@ -7,7 +7,7 @@ "source": [ "# Inference Endpoints (dedicated) \n", "\n", - "Did you ever want to create your own machine learning API? That's what we will do in this recipe with the [HF Dedicated Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index). Inference Endpoints enable you to pick any of the hundreds of thousands of models on the HF Hub and create your own API in a few clicks in a deployment you control, on hardware you choose. \n", + "Have you ever wanted to create your own machine learning API? That's what we will do in this recipe with the [HF Dedicated Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index). Inference Endpoints enable you to pick any of the hundreds of thousands of models on the HF Hub, create your own API on a deployment platform you control, and on hardware you choose.\n", "\n", "[Serverless Inference APIs](link-to-recipe) are great for initial testing, but they are limited to a pre-configured selection of popular models and they are rate limited, because the serverless API's hardware is used by many users at the same time. With a Dedicated Inference Endpoint, you can customize the deployment of your model and the hardware is exclusively dedicated to you. \n", "\n", @@ -23,10 +23,9 @@ "metadata": {}, "source": [ "## Install and login\n", - "In case you don't have a HF Account, you can create your account [here](https://huggingface.co/join). If you work in a larger team, you can also create a [HF Organization](https://huggingface.co/organizations) and manage all your models, datasets and endpoints via this organization. Dedicated Inference Endpoints are a paid service and you will therefore need to add a credit card the [billing settings](https://huggingface.co/settings/billing) of your personal HF account, or of your HF organization. \n", + "In case you don't have a HF Account, you can create your account [here](https://huggingface.co/join). If you work in a larger team, you can also create a [HF Organization](https://huggingface.co/organizations) and manage all your models, datasets and endpoints via this organization. Dedicated Inference Endpoints are a paid service and you will therefore need to add a credit card to the [billing settings](https://huggingface.co/settings/billing) of your personal HF account, or of your HF organization. \n", "\n", - "You can then create a user access token [here](https://huggingface.co/docs/hub/security-tokens). A token with `read` or `write` permissions will work for this guide, but we encourage the use of fine-grained tokens for increased security. For this notebook, you'll need a fine-grained token with `User Permissions > Inference > Make calls to Inference Endpoints & Manage Inference Endpoints` and `Repository permissions > google/gemma-1.1-2b-it & HuggingFaceM4/idefics2-8b-chatty`.\n", - " " + "You can then create a user access token [here](https://huggingface.co/docs/hub/security-tokens). 
A token with `read` or `write` permissions will work for this guide, but we encourage the use of fine-grained tokens for increased security. For this notebook, you'll need a fine-grained token with `User Permissions > Inference > Make calls to Inference Endpoints & Manage Inference Endpoints` and `Repository permissions > google/gemma-1.1-2b-it & HuggingFaceM4/idefics2-8b-chatty`." ] }, { @@ -69,7 +68,7 @@ "- **Model Repository**: Here you can insert the identifier of any model on the HF Hub. For this initial demonstration, we use [google/gemma-1.1-2b-it](https://huggingface.co/google/gemma-1.1-2b-it), a small generative LLM (2.5B parameters). \n", "- **Endpoint Name**: The Endpoint Name is automatically generated based on the model identifier, but you are free to change the name. Valid Endpoint names must only contain lower-case characters, numbers or hyphens (\"-\") and are between 4 to 32 characters long.\n", "- **Instance Configuration**: Here you can choose from a wide range of CPUs or GPUs from all major cloud platforms. You can also adjust the region, for example if you need to host your endpoint in the EU. \n", - "- **Automatic Scale-to-Zero**: You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero Endpoints are not billed anymore. Note that restarting the endpoint requires the model to be re-loaded into memory (and potentially re-downloaded), which can take saveral minutes for large models. \n", + "- **Automatic Scale-to-Zero**: You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero Endpoints are not billed anymore. Note that restarting the endpoint requires the model to be re-loaded into memory (and potentially re-downloaded), which can take several minutes for large models. \n", "- **Endpoint Security Level**: The standard security level is `Protected`, which requires an authorized HF token for accessing the endpoint. `Public` Endpoints are accessible by anyone without token authentification. `Private` Endpoints are only available through an intra-region secured AWS or Azure PrivateLink connection.\n", "- **Advanced configuration**: Here you can select some advanced options like the Docker container type. As Gemma is compatible with [Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/index) containers, the system automatically selects TGI as the container type and other good default values.\n", "\n", @@ -100,6 +99,132 @@ "\n" ] }, + { + "cell_type": "markdown", + "id": "2612aa56", + "metadata": {}, + "source": [ + "### Creating and managing endpoints programmatically\n", + "\n", + "When moving into production, you don't always want to manually start, stop and modify your Endpoints. The `huggingface_hub` library provides good functionality for managing your endpoints programmatically. See the docs [here](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) and details on all functions [here](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_endpoints). 
Here are some key functions:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c7cd4fd", + "metadata": {}, + "outputs": [], + "source": [ + "# list all your inference endpoints\n", + "huggingface_hub.list_inference_endpoints()\n", + "\n", + "# get an existing endpoint and check it's status\n", + "endpoint = huggingface_hub.get_inference_endpoint(\n", + " name=\"gemma-1-1-2b-it-yci\", # the name of the endpoint \n", + " namespace=\"MoritzLaurer\" # your user name or organization name\n", + ")\n", + "print(endpoint)\n", + "\n", + "# Pause endpoint to stop billing\n", + "endpoint.pause()\n", + "\n", + "# Resume and wait until the endpoint is ready\n", + "#endpoint.resume()\n", + "#endpoint.wait()\n", + "\n", + "# Update the endpoint to a different GPU\n", + "# You can find the correct arguments for different hardware types in this table: https://huggingface.co/docs/inference-endpoints/pricing#gpu-instances\n", + "#endpoint.update(\n", + "# instance_size=\"x1\",\n", + "# instance_type=\"nvidia-a100\", # nvidia-a10g\n", + "#)" + ] + }, + { + "cell_type": "markdown", + "id": "3b62976e", + "metadata": {}, + "source": [ + "You can also create an inference endpoint programmatically. Let's recreate the same `gemma` LLM endpoint as the one created with the UI." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67ebf5c6", + "metadata": {}, + "outputs": [], + "source": [ + "from huggingface_hub import create_inference_endpoint\n", + "\n", + "\n", + "model_id = \"google/gemma-1.1-2b-it\"\n", + "endpoint_name = \"gemma-1-1-2b-it-001\" # Valid Endpoint names must only contain lower-case characters, numbers or hyphens (\"-\") and are between 4 to 32 characters long.\n", + "namespace = \"MoritzLaurer\" # your user or organization name\n", + "\n", + "\n", + "# check if endpoint with this name already exists from previous tests\n", + "available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]\n", + "if endpoint_name in available_endpoints_names:\n", + " endpoint_exists = True\n", + "else: \n", + " endpoint_exists = False\n", + "print(\"Does the endpoint already exist?\", endpoint_exists)\n", + " \n", + "\n", + "# create new endpoint\n", + "if not endpoint_exists:\n", + " endpoint = create_inference_endpoint(\n", + " endpoint_name,\n", + " repository=model_id,\n", + " namespace=namespace,\n", + " framework=\"pytorch\",\n", + " task=\"text-generation\",\n", + " # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing\n", + " accelerator=\"gpu\",\n", + " vendor=\"aws\",\n", + " region=\"us-east-1\",\n", + " instance_size=\"x1\",\n", + " instance_type=\"nvidia-a10g\",\n", + " min_replica=0,\n", + " max_replica=1,\n", + " type=\"protected\",\n", + " # since the LLM is compatible with TGI, we specify that we want to use the latest TGI image\n", + " custom_image={\n", + " \"health_route\": \"/health\",\n", + " \"env\": {\n", + " \"MODEL_ID\": \"/repository\"\n", + " },\n", + " \"url\": \"ghcr.io/huggingface/text-generation-inference:latest\",\n", + " },\n", + " )\n", + " print(\"Waiting for endpoint to be created\")\n", + " endpoint.wait()\n", + " print(\"Endpoint ready\")\n", + "\n", + "# if endpoint with this name already exists, get and resume existing endpoint\n", + "else:\n", + " endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)\n", + " if endpoint.status in [\"paused\", \"scaledToZero\"]:\n", + " print(\"Resuming endpoint\")\n", + 
" endpoint.resume()\n", + " print(\"Waiting for endpoint to start\")\n", + " endpoint.wait()\n", + " print(\"Endpoint ready\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2552904e", + "metadata": {}, + "outputs": [], + "source": [ + "# access the endpoint url for API calls\n", + "print(endpoint.url)" + ] + }, { "cell_type": "markdown", "id": "70921600-a20f-455e-b53a-561311b6beda", @@ -109,7 +234,7 @@ "source": [ "## Querying your Endpoint\n", "\n", - "Now let's query this endpoint like any other LLM API. First copy the Endpoint URL from the interface and assign it to `API_URL` below. We then use the standardised messages format for the text inputs, i.e. a dictionary of user and assistant messages, which you might know from other LLM API services. We then need to apply the chat template to the messages, which LLMs like Gemma, Llama-3 etc. have been trained to expect (see details on in the [docs](https://huggingface.co/docs/transformers/main/en/chat_templating)). For most recent generative LLMs, it is essential to apply this chat template, otherwise the model's performance will degrade without throwing an error. \n" + "Now let's query this endpoint like any other LLM API. First copy the Endpoint URL from the interface (or use `endpoint.url`) and assign it to `API_URL` below. We then use the standardised messages format for the text inputs, i.e. a dictionary of user and assistant messages, which you might know from other LLM API services. We then need to apply the chat template to the messages, which LLMs like Gemma, Llama-3 etc. have been trained to expect (see details on in the [docs](https://huggingface.co/docs/transformers/main/en/chat_templating)). For most recent generative LLMs, it is essential to apply this chat template, otherwise the model's performance will degrade without throwing an error. " ] }, { @@ -136,8 +261,8 @@ "import requests\n", "from transformers import AutoTokenizer\n", "\n", - "# paste your endpoint URL here\n", - "API_URL = \"https://dz07884a53qjqb98.us-east-1.aws.endpoints.huggingface.cloud\" \n", + "# paste your endpoint URL here or reuse endpoint.url if you created the endpoint programmatically\n", + "API_URL = endpoint.url # or paste link like \"https://dz07884a53qjqb98.us-east-1.aws.endpoints.huggingface.cloud\" \n", "HEADERS = {\"Authorization\": f\"Bearer {huggingface_hub.get_token()}\"}\n", "\n", "# function for standard http requests\n", @@ -201,7 +326,7 @@ "source": [ "That's it, you've made the first request to your Endpoint - your very own API!\n", "\n", - "If you want the endpoint to handle the chat template automatically and if your LLM runs on a TGI container, you can also use the [messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) by appending the `/v1/chat/completions` path to the URL. With the `/v1/chat/completions` path, the [TGI](https://huggingface.co/docs/text-generation-inference/index) container running on the endpoint endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure for easier interoperability. See the [TGI Swagger UI](https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/chat_completions) for all available parameters. Note that the parameters accepted by the default `/` path and by the `/v1/chat/completions` path are slightly different. 
Here is the slighly modified code for using the messages API:" + "If you want the endpoint to handle the chat template automatically and if your LLM runs on a TGI container, you can also use the [messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) by appending the `/v1/chat/completions` path to the URL. With the `/v1/chat/completions` path, the [TGI](https://huggingface.co/docs/text-generation-inference/index) container running on the endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure for easier interoperability. See the [TGI Swagger UI](https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/chat_completions) for all available parameters. Note that the parameters accepted by the default `/` path and by the `/v1/chat/completions` path are slightly different. Here is the slightly modified code for using the messages API:" ] }, { @@ -244,7 +369,8 @@ "metadata": {}, "source": [ "### Simplified Endpoint usage with the InferenceClient\n", - "To simplify the sending of requests to your endpoint, you can take advantage of the [`InferenceClient`](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient), a convenient utility available in the `huggingface_hub` Python library that allows you to easily make calls to both [Dedicated Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and the [Serverless Inference API](https://huggingface.co/docs/api-inference/index). See the [docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#inference) for details. \n", + "\n", + "You can also use the [`InferenceClient`](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient) to easily send requests to your endpoint. The client is a convenient utility available in the `huggingface_hub` Python library that allows you to easily make calls to both [Dedicated Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and the [Serverless Inference API](https://huggingface.co/docs/api-inference/index). See the [docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#inference) for details. \n", "\n", "This is the most succinct way of sending requests to your endpoint:" ] @@ -272,222 +398,15 @@ "print(output)" ] }, - { - "cell_type": "markdown", - "id": "be56dc50-8bea-4561-80ae-a578f93efb50", - "metadata": { - "tags": [] - }, - "source": [ - "## Creating and managing endpoints programmatically\n", - "\n", - "When moving into production, you don't always want to manually start, stop and modify your Endpoints. The `huggingface_hub` library provides good functionality for managing your endpoints programmatically. See the docs [here](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) and details on all functions [here](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_endpoints). 
Here are some key functions:\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "ecbb6541-4542-43e3-95f1-96e9ff8e8d22", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "InferenceEndpoint(name='gemma-1-1-2b-it-001', namespace='MoritzLaurer', repository='google/gemma-1.1-2b-it', status='running', url='https://dz07884a53qjqb98.us-east-1.aws.endpoints.huggingface.cloud')\n" - ] - }, - { - "data": { - "text/plain": [ - "InferenceEndpoint(name='gemma-1-1-2b-it-001', namespace='MoritzLaurer', repository='google/gemma-1.1-2b-it', status='paused', url=None)" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# list all your inference endpoints\n", - "huggingface_hub.list_inference_endpoints()\n", - "\n", - "# get an existing endpoint and check it's status\n", - "endpoint = huggingface_hub.get_inference_endpoint(\n", - " name=\"gemma-1-1-2b-it-yci\", # the name of the endpoint \n", - " namespace=\"MoritzLaurer\" # your user name or organization name\n", - ")\n", - "print(endpoint)\n", - "\n", - "# Pause endpoint to stop billing\n", - "endpoint.pause()\n", - "\n", - "# Resume and wait until the endpoint is ready\n", - "#endpoint.resume()\n", - "#endpoint.wait()\n", - "\n", - "# Update the endpoint to a different GPU\n", - "# You can find the correct arguments for different hardware types in this table: https://huggingface.co/docs/inference-endpoints/pricing#gpu-instances\n", - "#endpoint.update(\n", - "# instance_size=\"x1\",\n", - "# instance_type=\"nvidia-a100\", # nvidia-a10g\n", - "#)" - ] - }, - { - "cell_type": "markdown", - "id": "63e08d86-9642-4129-bafc-f30e46b328de", - "metadata": {}, - "source": [ - "You can also create an inference endpoint programmatically. Let's recreate the same `gemma` LLM endpoint as the one created with the UI.\n" - ] - }, { "cell_type": "code", - "execution_count": 11, - "id": "f84839f7-d336-4ea7-b241-bed640f718b7", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Does the endpoint already exist? 
True\n", - "Resuming endpoint\n", - "Waiting for endpoint to start\n", - "Endpoint ready\n" - ] - } - ], - "source": [ - "from huggingface_hub import create_inference_endpoint\n", - "\n", - "\n", - "model_id = \"google/gemma-1.1-2b-it\"\n", - "endpoint_name = \"gemma-1-1-2b-it-001\" # Valid Endpoint names must only contain lower-case characters, numbers or hyphens (\"-\") and are between 4 to 32 characters long.\n", - "namespace = \"MoritzLaurer\" # your user or organization name\n", - "\n", - "\n", - "# check if endpoint with this name already exists from previous tests\n", - "available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]\n", - "if endpoint_name in available_endpoints_names:\n", - " endpoint_exists = True\n", - "else: \n", - " endpoint_exists = False\n", - "print(\"Does the endpoint already exist?\", endpoint_exists)\n", - " \n", - "\n", - "# create new endpoint\n", - "if not endpoint_exists:\n", - " endpoint = create_inference_endpoint(\n", - " endpoint_name,\n", - " repository=model_id,\n", - " namespace=namespace,\n", - " framework=\"pytorch\",\n", - " task=\"text-generation\",\n", - " # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing\n", - " accelerator=\"gpu\",\n", - " vendor=\"aws\",\n", - " region=\"us-east-1\",\n", - " instance_size=\"x1\",\n", - " instance_type=\"nvidia-a10g\",\n", - " min_replica=0,\n", - " max_replica=1,\n", - " type=\"protected\",\n", - " # since the LLM is compatible with TGI, we specify that we want to use the latest TGI image\n", - " custom_image={\n", - " \"health_route\": \"/health\",\n", - " \"env\": {\n", - " \"MODEL_ID\": \"/repository\"\n", - " },\n", - " \"url\": \"ghcr.io/huggingface/text-generation-inference:latest\",\n", - " },\n", - " )\n", - " print(\"Waiting for endpoint to be created\")\n", - " endpoint.wait()\n", - " print(\"Endpoint ready\")\n", - "\n", - "# if endpoint with this name already exists, get existing endpoint\n", - "else:\n", - " endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)\n", - " if endpoint.status in [\"paused\", \"scaledToZero\"]:\n", - " print(\"Resuming endpoint\")\n", - " endpoint.resume()\n", - " print(\"Waiting for endpoint to start\")\n", - " endpoint.wait()\n", - " print(\"Endpoint ready\")" - ] - }, - { - "cell_type": "markdown", - "id": "edd5acbb-7c44-48f3-a2ea-8d7e483dcaf0", - "metadata": {}, - "source": [ - "Once the new endpoint is ready, we can make requests just like before" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "7032eb44-8caa-4820-ba5a-f947160b7da7", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The output from your API/Endpoint call with the InferenceClient:\n", - "\n", - "ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content=\"**Open Source**\\n\\nA codebase freely given,\\nShared with all, a helping hand.\\nOpen source, a shining light,\\nGuiding progress, shining bright.\\n\\nContributions welcome, big and small,\\nIdeas shared, knowledge enthrall.\\nCollaboration thrives, a vibrant scene,\\nInnovation's seeds, readily seen.\\n\\nFrom servers vast to software fine,\\nOpen source empowers one and all.\\nA tapestry of knowledge shared,\\nConnecting minds, a vibrant parade.\", name=None, tool_calls=None), logprobs=None)], created=1718284548, id='', 
model='/repository', object='text_completion', system_fingerprint='2.0.5-dev0-sha-90184df', usage=ChatCompletionOutputUsage(completion_tokens=100, prompt_tokens=20, total_tokens=120))\n" - ] - } - ], - "source": [ - "output = client.chat_completion(\n", - " messages, # the chat template is applied automatically, if your endpoint uses a TGI container\n", - " model=endpoint.url, \n", - " temperature=0.2, max_tokens=100, seed=42,\n", - ")\n", - "\n", - "print(\"The output from your API/Endpoint call with the InferenceClient:\\n\")\n", - "print(output)" - ] - }, - { - "cell_type": "markdown", - "id": "feb3e272-803f-4dd8-b70b-051cd854947d", + "execution_count": null, + "id": "56fdd862", "metadata": {}, + "outputs": [], "source": [ - "Let's pause the endpoint for the rest of the guide to stop billing." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "c3f4db29-d51a-4a9c-a7fd-223275f03e5a", - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "InferenceEndpoint(name='gemma-1-1-2b-it-001', namespace='MoritzLaurer', repository='google/gemma-1.1-2b-it', status='paused', url=None)" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "endpoint.pause()" + "# pause the endpoint to stop billing\n", + "#endpoint.pause()" ] }, { @@ -667,7 +586,7 @@ "source": [ "### Vision Language Models: Reasoning over text and images\n", "\n", - "Now let's try a vision language models (VLMs). VLMs are very similar to LLMs, only that they can take both text and images as input simultaneously. Their output is then autoregressively generated text, just like for a standard LLM. This allows them to tackle many tasks from visual question answering to image captioning. For this example, we use [Idefics2](https://huggingface.co/blog/idefics2), a powerful 8B parameter VLM. \n", + "Now let's create an endpoint for a vision language model (VLM). VLMs are very similar to LLMs, only that they can take both text and images as input simultaneously. Their output is autoregressively generated text, just like for a standard LLM. VLMs can tackle many tasks from visual question answering to document understanding. For this example, we use [Idefics2](https://huggingface.co/blog/idefics2), a powerful 8B parameter VLM. \n", "\n", "We first need to convert our PIL image generated with Stable Diffusion to a `base64` encoded string so that we can send it to the model over the network." ] @@ -875,8 +794,7 @@ "id": "9ad607d1-e214-4f7b-8d07-d406980af2b8", "metadata": {}, "source": [ - "### Additional information\n", - "\n", + "## Additional information\n", "- When creating several endpoints, you will probably get an error message that your GPU quota has been reached. Don't hesitate to send a message to the email address in the error message and we will most likely increase your GPU quota.\n", "- What is the difference between `paused` and `scaled-to-zero` endpoints? `scaled-to-zero` endpoints can be flexibly woken up and scaled up by user requests, while `paused` endpoints need to be unpaused by the creator of the endpoint. Moreover, `scaled-to-zero` endpoints count towards your GPU quota (with the maximum possible replica it could be scaled up to), while `paused` endpoints do not. A simple way of freeing up your GPU quota is therefore to pause some endpoints. " ]
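
For readers of this patch, the querying section above relies on applying the model's chat template before calling the default `/` route, but that cell is unchanged context and does not appear in the diff. A minimal sketch of that step, assuming `google/gemma-1.1-2b-it` and an illustrative user message (the notebook's own cell may differ slightly):

```python
# Minimal sketch of the chat-template step, assuming google/gemma-1.1-2b-it;
# the user message below is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")

messages = [{"role": "user", "content": "Please write a short poem about open source for me."}]

# format the messages the way Gemma was trained to expect them
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return a formatted string instead of token ids
    add_generation_prompt=True,  # append the assistant turn so the model starts generating
)
print(prompt)
```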
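
The vision-language section converts the PIL image generated with Stable Diffusion to a base64 string before sending it to the Idefics2 endpoint; that cell is also unchanged context. A minimal sketch of the conversion, using a placeholder image in place of the generated one (wrapping the string in a data URI is one common way to embed it in the request payload):

```python
# Minimal sketch: serialize a PIL image to a base64 string for sending over the network.
# The placeholder image below stands in for the Stable Diffusion output used in the notebook.
import base64
import io

from PIL import Image

image = Image.new("RGB", (512, 512), color="white")  # placeholder for the generated image

buffer = io.BytesIO()
image.save(buffer, format="PNG")  # write the image into an in-memory buffer
image_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

# a common way to pass the image to a VLM endpoint is as a data URI embedded in the prompt
image_uri = f"data:image/png;base64,{image_b64}"
print(image_uri[:64], "...")
```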
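
Since paused endpoints are neither billed nor counted towards the GPU quota, a small loop over the endpoints returned for your account is a convenient way to clean up after experimenting. A minimal sketch using only the `huggingface_hub` functions already shown in the notebook:

```python
# Minimal sketch: pause every running endpoint listed for your account to stop
# billing and free up GPU quota.
import huggingface_hub

for ep in huggingface_hub.list_inference_endpoints():
    if ep.status == "running":
        ep.pause()  # paused endpoints are not billed and do not count towards the quota
        print(f"Paused endpoint: {ep.name}")
```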