Commit

harmonized "Endpoint" capitalization
MoritzLaurer committed Jul 3, 2024
1 parent 6a5015d commit e08dc90
Showing 1 changed file with 30 additions and 24 deletions.
notebooks/en/enterprise_dedicated_endpoints.ipynb (54 changes: 30 additions & 24 deletions)
@@ -23,7 +23,7 @@
"metadata": {},
"source": [
"## Install and login\n",
"In case you don't have a HF Account, you can create your account [here](https://huggingface.co/join). If you work in a larger team, you can also create a [HF Organization](https://huggingface.co/organizations) and manage all your models, datasets and endpoints via this organization. Dedicated Inference Endpoints are a paid service and you will therefore need to add a credit card to the [billing settings](https://huggingface.co/settings/billing) of your personal HF account, or of your HF organization. \n",
"In case you don't have a HF Account, you can create your account [here](https://huggingface.co/join). If you work in a larger team, you can also create a [HF Organization](https://huggingface.co/organizations) and manage all your models, datasets and Endpoints via this organization. Dedicated Inference Endpoints are a paid service and you will therefore need to add a credit card to the [billing settings](https://huggingface.co/settings/billing) of your personal HF account, or of your HF organization. \n",
"\n",
"You can then create a user access token [here](https://huggingface.co/docs/hub/security-tokens). A token with `read` or `write` permissions will work for this guide, but we encourage the use of fine-grained tokens for increased security. For this notebook, you'll need a fine-grained token with `User Permissions > Inference > Make calls to Inference Endpoints & Manage Inference Endpoints` and `Repository permissions > google/gemma-1.1-2b-it & HuggingFaceM4/idefics2-8b-chatty`."
]
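The install and login cells themselves are collapsed in this view; a minimal sketch of the setup could look like the following (the package list and the interactive `login()` flow are assumptions, not the notebook's exact cell):

```python
# Install the libraries used throughout this guide (pin versions as needed).
# !pip install --upgrade huggingface_hub transformers

from huggingface_hub import login

# Opens an interactive prompt; paste the fine-grained token described above.
login()
```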
@@ -61,15 +61,15 @@
"id": "2b0d609d-60eb-42d9-9524-7be8917509e6",
"metadata": {},
"source": [
"## Creating your first endpoint\n",
"## Creating your first Endpoint\n",
"\n",
"With this initial setup out of the way, we can now create our first Endpoint. Navigate to https://ui.endpoints.huggingface.co/ and click on `+ New` next to `Dedicated Endpoints`. You will then see the interface for creating a new Endpoint with the following options (see image below):\n",
"\n",
"- **Model Repository**: Here you can insert the identifier of any model on the HF Hub. For this initial demonstration, we use [google/gemma-1.1-2b-it](https://huggingface.co/google/gemma-1.1-2b-it), a small generative LLM (2.5B parameters). \n",
"- **Endpoint Name**: The Endpoint Name is automatically generated based on the model identifier, but you are free to change the name. Valid Endpoint names must only contain lower-case characters, numbers or hyphens (\"-\") and are between 4 to 32 characters long.\n",
"- **Instance Configuration**: Here you can choose from a wide range of CPUs or GPUs from all major cloud platforms. You can also adjust the region, for example if you need to host your endpoint in the EU. \n",
"- **Automatic Scale-to-Zero**: You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero Endpoints are not billed anymore. Note that restarting the endpoint requires the model to be re-loaded into memory (and potentially re-downloaded), which can take several minutes for large models. \n",
"- **Endpoint Security Level**: The standard security level is `Protected`, which requires an authorized HF token for accessing the endpoint. `Public` Endpoints are accessible by anyone without token authentification. `Private` Endpoints are only available through an intra-region secured AWS or Azure PrivateLink connection.\n",
"- **Instance Configuration**: Here you can choose from a wide range of CPUs or GPUs from all major cloud platforms. You can also adjust the region, for example if you need to host your Endpoint in the EU. \n",
"- **Automatic Scale-to-Zero**: You can configure your Endpoint to scale to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero Endpoints are not billed anymore. Note that restarting the Endpoint requires the model to be re-loaded into memory (and potentially re-downloaded), which can take several minutes for large models. \n",
"- **Endpoint Security Level**: The standard security level is `Protected`, which requires an authorized HF token for accessing the Endpoint. `Public` Endpoints are accessible by anyone without token authentification. `Private` Endpoints are only available through an intra-region secured AWS or Azure PrivateLink connection.\n",
"- **Advanced configuration**: Here you can select some advanced options like the Docker container type. As Gemma is compatible with [Text Generation Inference (TGI)](https://huggingface.co/docs/text-generation-inference/index) containers, the system automatically selects TGI as the container type and other good default values.\n",
"\n",
"For this guide, select the options in the image below and click on `Create Endpoint`. \n"
@@ -90,7 +90,7 @@
"id": "cdb85ae5-316d-4628-aa12-0e1628485c27",
"metadata": {},
"source": [
"After roughly one minute, your endpoint will be created and you will see a page similar to the image below. \n",
"After roughly one minute, your Endpoint will be created and you will see a page similar to the image below. \n",
"\n",
"On the Endpoint's `Overview` page, will find the URL for querying the Endpoint, a Playground for testing the model and additional tabs on `Analytics`, `Usage & Cost`, `Logs`and `Settings`. \n",
"\n",
@@ -104,9 +104,9 @@
"id": "2612aa56",
"metadata": {},
"source": [
"### Creating and managing endpoints programmatically\n",
"### Creating and managing Endpoints programmatically\n",
"\n",
"When moving into production, you don't always want to manually start, stop and modify your Endpoints. The `huggingface_hub` library provides good functionality for managing your endpoints programmatically. See the docs [here](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) and details on all functions [here](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_endpoints). Here are some key functions:\n"
"When moving into production, you don't always want to manually start, stop and modify your Endpoints. The `huggingface_hub` library provides good functionality for managing your Endpoints programmatically. See the docs [here](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) and details on all functions [here](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_endpoints). Here are some key functions:\n"
]
},
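The code cell with these functions is collapsed here; as a rough sketch, the key helpers are `list_inference_endpoints`, `get_inference_endpoint` and the lifecycle methods on the returned `InferenceEndpoint` objects (the Endpoint name below is a placeholder):

```python
from huggingface_hub import list_inference_endpoints, get_inference_endpoint

# List all Endpoints in your personal namespace (pass namespace="your-org" for an organization).
for ep in list_inference_endpoints():
    print(ep.name, ep.status)

# Fetch a single Endpoint by its (placeholder) name and manage its lifecycle.
endpoint = get_inference_endpoint("gemma-1-1-2b-it-demo")
endpoint.pause()     # stops billing; must be resumed manually
endpoint.resume()    # starts the paused Endpoint again
# endpoint.delete()  # removes the Endpoint entirely
```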
{
@@ -146,7 +146,7 @@
"id": "3b62976e",
"metadata": {},
"source": [
"You can also create an inference endpoint programmatically. Let's recreate the same `gemma` LLM endpoint as the one created with the UI."
"You can also create an inference Endpoint programmatically. Let's recreate the same `gemma` LLM Endpoint as the one created with the UI."
]
},
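The creation cell itself is collapsed in this view; a hedged sketch of such a `create_inference_endpoint` call could look as follows (name, vendor, region and instance values are illustrative assumptions, not necessarily what the notebook uses):

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    name="gemma-1-1-2b-it-demo",     # placeholder name
    repository="google/gemma-1.1-2b-it",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                    # assumed cloud vendor
    region="us-east-1",              # assumed region
    instance_size="x1",              # assumed instance size
    instance_type="nvidia-a10g",     # assumed GPU type
    type="protected",
)

endpoint.wait()      # block until the Endpoint is deployed
print(endpoint.url)
```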
{
@@ -234,7 +234,7 @@
"source": [
"## Querying your Endpoint\n",
"\n",
"Now let's query this endpoint like any other LLM API. First copy the Endpoint URL from the interface (or use `endpoint.url`) and assign it to `API_URL` below. We then use the standardised messages format for the text inputs, i.e. a dictionary of user and assistant messages, which you might know from other LLM API services. We then need to apply the chat template to the messages, which LLMs like Gemma, Llama-3 etc. have been trained to expect (see details on in the [docs](https://huggingface.co/docs/transformers/main/en/chat_templating)). For most recent generative LLMs, it is essential to apply this chat template, otherwise the model's performance will degrade without throwing an error. "
"Now let's query this Endpoint like any other LLM API. First copy the Endpoint URL from the interface (or use `endpoint.url`) and assign it to `API_URL` below. We then use the standardised messages format for the text inputs, i.e. a dictionary of user and assistant messages, which you might know from other LLM API services. We then need to apply the chat template to the messages, which LLMs like Gemma, Llama-3 etc. have been trained to expect (see details on in the [docs](https://huggingface.co/docs/transformers/main/en/chat_templating)). For most recent generative LLMs, it is essential to apply this chat template, otherwise the model's performance will degrade without throwing an error. "
]
},
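The query cell is collapsed; a sketch of this pattern, with a placeholder `API_URL` and illustrative generation parameters, might look like this:

```python
import requests
from huggingface_hub import get_token
from transformers import AutoTokenizer

# Placeholder URL: copy your own from the UI or use endpoint.url.
API_URL = "https://your-endpoint-id.us-east-1.aws.endpoints.huggingface.cloud"
HEADERS = {"Authorization": f"Bearer {get_token()}", "Content-Type": "application/json"}

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")
messages = [{"role": "user", "content": "Write a haiku about dedicated Endpoints."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

payload = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 200, "temperature": 0.7, "return_full_text": False},
}
response = requests.post(API_URL, headers=HEADERS, json=payload)
print(response.json())
```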
{
@@ -326,7 +326,7 @@
"source": [
"That's it, you've made the first request to your Endpoint - your very own API!\n",
"\n",
"If you want the endpoint to handle the chat template automatically and if your LLM runs on a TGI container, you can also use the [messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) by appending the `/v1/chat/completions` path to the URL. With the `/v1/chat/completions` path, the [TGI](https://huggingface.co/docs/text-generation-inference/index) container running on the endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure for easier interoperability. See the [TGI Swagger UI](https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/chat_completions) for all available parameters. Note that the parameters accepted by the default `/` path and by the `/v1/chat/completions` path are slightly different. Here is the slightly modified code for using the messages API:"
"If you want the Endpoint to handle the chat template automatically and if your LLM runs on a TGI container, you can also use the [messages API](https://huggingface.co/docs/text-generation-inference/en/messages_api) by appending the `/v1/chat/completions` path to the URL. With the `/v1/chat/completions` path, the [TGI](https://huggingface.co/docs/text-generation-inference/index) container running on the Endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure for easier interoperability. See the [TGI Swagger UI](https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/chat_completions) for all available parameters. Note that the parameters accepted by the default `/` path and by the `/v1/chat/completions` path are slightly different. Here is the slightly modified code for using the messages API:"
]
},
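A sketch of that variant, reusing the placeholder `API_URL` and `HEADERS` from above (the `model` value is just a label here, since a dedicated Endpoint serves exactly one model):

```python
import requests

chat_payload = {
    "model": "tgi",  # label only; the Endpoint already determines which model is served
    "messages": [{"role": "user", "content": "Write a haiku about dedicated Endpoints."}],
    "max_tokens": 200,
    "temperature": 0.7,
}
response = requests.post(API_URL + "/v1/chat/completions", headers=HEADERS, json=chat_payload)
print(response.json()["choices"][0]["message"]["content"])
```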
{
@@ -370,9 +370,9 @@
"source": [
"### Simplified Endpoint usage with the InferenceClient\n",
"\n",
"You can also use the [`InferenceClient`](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient) to easily send requests to your endpoint. The client is a convenient utility available in the `huggingface_hub` Python library that allows you to easily make calls to both [Dedicated Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and the [Serverless Inference API](https://huggingface.co/docs/api-inference/index). See the [docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#inference) for details. \n",
"You can also use the [`InferenceClient`](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#huggingface_hub.InferenceClient) to easily send requests to your Endpoint. The client is a convenient utility available in the `huggingface_hub` Python library that allows you to easily make calls to both [Dedicated Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and the [Serverless Inference API](https://huggingface.co/docs/api-inference/index). See the [docs](https://huggingface.co/docs/huggingface_hub/en/package_reference/inference_client#inference) for details. \n",
"\n",
"This is the most succinct way of sending requests to your endpoint:"
"This is the most succinct way of sending requests to your Endpoint:"
]
},
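A sketch of how short this can be, assuming the chat-templated `prompt` and placeholder `API_URL` from above:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model=API_URL)  # point the client at the dedicated Endpoint instead of the serverless API
output = client.text_generation(prompt, max_new_tokens=200)
print(output)
```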
{
@@ -414,8 +414,8 @@
"id": "a2f92df9-9ea9-4831-b167-9e7bce5e4179",
"metadata": {},
"source": [
"## Creating endpoints for a wide variety of models\n",
"Following the same process, you can create endpoints for any of the models on the HF Hub. Let's illustrate some other use-cases."
"## Creating Endpoints for a wide variety of models\n",
"Following the same process, you can create Endpoints for any of the models on the HF Hub. Let's illustrate some other use-cases."
]
},
{
@@ -424,7 +424,7 @@
"metadata": {},
"source": [
"### Image generation with Stable Diffusion\n",
"We can create an image generation endpoint with almost the exact same code as for the LLM. The only difference is that we do not use the TGI container in this case, as TGI is only designed for LLMs (and vision LMs). "
"We can create an image generation Endpoint with almost the exact same code as for the LLM. The only difference is that we do not use the TGI container in this case, as TGI is only designed for LLMs (and vision LMs). "
]
},
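A hedged sketch of this step (the model repository, instance values and prompt are illustrative assumptions):

```python
from huggingface_hub import create_inference_endpoint, InferenceClient

endpoint_sd = create_inference_endpoint(
    name="stable-diffusion-demo",                            # placeholder name
    repository="stabilityai/stable-diffusion-xl-base-1.0",   # assumed text-to-image model
    framework="pytorch",
    task="text-to-image",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
)
endpoint_sd.wait()

client = InferenceClient(model=endpoint_sd.url)
image = client.text_to_image("A watercolor painting of a lighthouse at sunrise")  # returns a PIL.Image
image.save("image.png")
```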
{
@@ -566,7 +566,7 @@
"id": "0c5558d0-b21e-4b04-bb81-910d398d8f4d",
"metadata": {},
"source": [
"We pause the endpoint again to stop billing. "
"We pause the Endpoint again to stop billing. "
]
},
{
@@ -586,7 +586,7 @@
"source": [
"### Vision Language Models: Reasoning over text and images\n",
"\n",
"Now let's create an endpoint for a vision language model (VLM). VLMs are very similar to LLMs, only that they can take both text and images as input simultaneously. Their output is autoregressively generated text, just like for a standard LLM. VLMs can tackle many tasks from visual question answering to document understanding. For this example, we use [Idefics2](https://huggingface.co/blog/idefics2), a powerful 8B parameter VLM. \n",
"Now let's create an Endpoint for a vision language model (VLM). VLMs are very similar to LLMs, only that they can take both text and images as input simultaneously. Their output is autoregressively generated text, just like for a standard LLM. VLMs can tackle many tasks from visual question answering to document understanding. For this example, we use [Idefics2](https://huggingface.co/blog/idefics2), a powerful 8B parameter VLM. \n",
"\n",
"We first need to convert our PIL image generated with Stable Diffusion to a `base64` encoded string so that we can send it to the model over the network."
]
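A sketch of that conversion, assuming `image` is the PIL object returned by the Stable Diffusion Endpoint above:

```python
import base64
import io

buffer = io.BytesIO()
image.save(buffer, format="JPEG")                  # serialize the PIL image to bytes
image_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
image_url = f"data:image/jpeg;base64,{image_b64}"  # data URI that can be sent to the VLM Endpoint
```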
@@ -673,7 +673,7 @@
"id": "296b4569-bf7c-4f72-85fb-cfcf05c9bca9",
"metadata": {},
"source": [
"Several VLMs like Idefics2 are also supported by TGI (see [list of supported models](https://huggingface.co/docs/text-generation-inference/supported_models)), so we use the TGI container again when creating the endpoint. \n"
"Several VLMs like Idefics2 are also supported by TGI (see [list of supported models](https://huggingface.co/docs/text-generation-inference/supported_models)), so we use the TGI container again when creating the Endpoint. \n"
]
},
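A hedged sketch of the creation call; the GPU type, TGI image tag and environment variables are assumptions, with the `custom_image` layout following the `huggingface_hub` documentation:

```python
from huggingface_hub import create_inference_endpoint

endpoint_vlm = create_inference_endpoint(
    name="idefics2-8b-chatty-demo",       # placeholder name
    repository="HuggingFaceM4/idefics2-8b-chatty",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a100",          # assumption: an 8B VLM needs a larger GPU
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",  # assumed TGI image tag
        "env": {"MODEL_ID": "/repository"},
    },
)
endpoint_vlm.wait()
```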
{
@@ -795,8 +795,8 @@
"metadata": {},
"source": [
"## Additional information\n",
"- When creating several endpoints, you will probably get an error message that your GPU quota has been reached. Don't hesitate to send a message to the email address in the error message and we will most likely increase your GPU quota.\n",
"- What is the difference between `paused` and `scaled-to-zero` endpoints? `scaled-to-zero` endpoints can be flexibly woken up and scaled up by user requests, while `paused` endpoints need to be unpaused by the creator of the endpoint. Moreover, `scaled-to-zero` endpoints count towards your GPU quota (with the maximum possible replica it could be scaled up to), while `paused` endpoints do not. A simple way of freeing up your GPU quota is therefore to pause some endpoints. "
"- When creating several Endpoints, you will probably get an error message that your GPU quota has been reached. Don't hesitate to send a message to the email address in the error message and we will most likely increase your GPU quota.\n",
"- What is the difference between `paused` and `scaled-to-zero` Endpoints? `scaled-to-zero` Endpoints can be flexibly woken up and scaled up by user requests, while `paused` Endpoints need to be unpaused by the creator of the Endpoint. Moreover, `scaled-to-zero` Endpoints count towards your GPU quota (with the maximum possible replica it could be scaled up to), while `paused` Endpoints do not. A simple way of freeing up your GPU quota is therefore to pause some Endpoints. "
]
},
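In code, the difference comes down to which `InferenceEndpoint` method you call (a sketch; `endpoint` is any Endpoint object from above):

```python
endpoint.pause()          # frees your GPU quota; must be resumed explicitly by the Endpoint's creator
# endpoint.resume()       # restarts a paused Endpoint

endpoint.scale_to_zero()  # wakes up automatically on the next request, but still counts towards the quota
```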
{
@@ -806,16 +806,22 @@
"source": [
"## Conclusion and next steps\n",
"\n",
"That's it, you've created three different endpoints (your own APIs!) for text-to-text, text-to-image, and image-to-text generation and the same is possible for many other models and tasks. \n",
"That's it, you've created three different Endpoints (your own APIs!) for text-to-text, text-to-image, and image-to-text generation and the same is possible for many other models and tasks. \n",
"\n",
"We encourage you to read the Dedicated Inference Endpoint [docs](https://huggingface.co/docs/inference-endpoints/index) to learn more. If you are using generative LLMs and VLMs, we also recommend reading the TGI [docs](https://huggingface.co/docs/text-generation-inference/index), as the most popular LLMs/VLMs are also supported by TGI, which makes your endpoints significantly more efficient. \n",
"We encourage you to read the Dedicated Inference Endpoint [docs](https://huggingface.co/docs/inference-endpoints/index) to learn more. If you are using generative LLMs and VLMs, we also recommend reading the TGI [docs](https://huggingface.co/docs/text-generation-inference/index), as the most popular LLMs/VLMs are also supported by TGI, which makes your Endpoints significantly more efficient. \n",
"\n",
"You can, for example, use **JSON-mode or function calling** with open-source models via [TGI Guidance](https://huggingface.co/docs/text-generation-inference/basic_tutorials/using_guidance) (see also this [recipe](https://huggingface.co/learn/cookbook/structured_generation) for an example for RAG with structured generation). \n",
"\n",
"When moving your endpoints into production, you will want to make several additional improvements to make your setup more efficient. When using TGI, you should send batches of requests to the endpoint with asynchronous function calls to fully utilize the endpoint's hardware and you can adapt several container parameters to optimize latency and throughput for your use-case. We will cover these optimizations in another recipe. \n",
"When moving your Endpoints into production, you will want to make several additional improvements to make your setup more efficient. When using TGI, you should send batches of requests to the Endpoint with asynchronous function calls to fully utilize the Endpoint's hardware and you can adapt several container parameters to optimize latency and throughput for your use-case. We will cover these optimizations in another recipe. \n",
"\n",
"\n"
]
},
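As a very rough sketch of the asynchronous pattern mentioned above (the dedicated recipe will cover this properly; prompts and parameters are illustrative):

```python
import asyncio
from huggingface_hub import AsyncInferenceClient

async_client = AsyncInferenceClient(model=API_URL)  # placeholder API_URL from earlier

async def generate_batch(prompts):
    # Fire all requests concurrently so TGI's continuous batching can keep the GPU busy.
    tasks = [async_client.text_generation(p, max_new_tokens=100) for p in prompts]
    return await asyncio.gather(*tasks)

# In a notebook: outputs = await generate_batch([prompt_1, prompt_2, prompt_3])
# In a script:   outputs = asyncio.run(generate_batch([prompt_1, prompt_2, prompt_3]))
```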
+{
+"cell_type": "markdown",
+"id": "fbe4dfb7",
+"metadata": {},
+"source": []
+}
],
"metadata": {