Open-source Large Language Models (LLMs) such as Falcon, (Open-)LLaMA, X-Gen, StarCoder, and RedPajama have advanced significantly and now rival proprietary models like ChatGPT or GPT-4 for specific applications. Despite their capabilities, deploying these models efficiently and optimally remains a complex task.
This article demonstrates how to deploy open-source LLMs using Hugging Face Inference Endpoints, a managed SaaS solution designed to simplify model deployment. It also covers streaming responses and testing endpoint performance.
Before proceeding, a brief overview of Inference Endpoints is helpful.
What is Hugging Face Inference Endpoints?
Hugging Face Inference Endpoints provides a straightforward and secure method for deploying Machine Learning models into production environments. This service enables developers and data scientists to build AI applications without the burden of infrastructure management. It streamlines the deployment process to just a few clicks, incorporating features like autoscaling to manage high request volumes, scale-to-zero for cost reduction, and robust security measures.
Key features relevant to LLM deployment include:
- Easy Deployment: Models can be deployed as production-ready APIs with minimal effort, removing the need for infrastructure or MLOps management.
- Cost Efficiency: Automatic scale-to-zero functionality helps reduce costs by scaling down infrastructure when an endpoint is inactive, with billing based on endpoint uptime.
- Enterprise Security: Models can be deployed in secure offline endpoints, accessible only via direct VPC connections. The service is SOC2 Type 2 certified and offers BAA and GDPR data processing agreements for enhanced security and compliance.
- LLM Optimization: The platform is optimized for LLMs, offering high throughput with Paged Attention and low latency through custom transformers code and Flash Attention, powered by Text Generation Inference.
- Comprehensive Task Support: Out-of-the-box support is available for 🤗 Transformers, Sentence-Transformers, and Diffusers tasks and models, with easy customization for advanced tasks like speaker diarization or other Machine Learning tasks and libraries.
1. How to deploy Falcon 40B instruct
To begin, access Inference Endpoints at https://ui.endpoints.huggingface.co after logging in with a User or Organization account that has a payment method on file (payment methods can be added in the account's billing settings).
Next, select “New endpoint”. Choose the desired repository, cloud provider, and region. Adjust the instance and security settings, then proceed to deploy a model, such as tiiuae/falcon-40b-instruct in this example.

Inference Endpoints recommends an instance type suitable for the model size, typically 4x NVIDIA T4 GPUs for this model. For optimal LLM performance, it is recommended to change the instance to GPU [xlarge] · 1x Nvidia A100.
Note: If the desired instance type is unavailable, a quota increase request can be submitted by contacting the support team.
The model can then be deployed by clicking “Create Endpoint”. The endpoint should come online and be ready to serve requests roughly 10 minutes after creation.
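The same deployment can also be scripted. Recent versions of the huggingface_hub library expose the Inference Endpoints API programmatically; the sketch below assumes such a version is installed and uses placeholder values for the endpoint name, instance_size, and instance_type (verify them against the options the UI offers for your account).

from huggingface_hub import create_inference_endpoint

# Minimal sketch: create the endpoint programmatically instead of via the UI.
# The name, instance_size, and instance_type values are placeholders; check
# them against what the Inference Endpoints UI shows for your account.
endpoint = create_inference_endpoint(
    "falcon-40b-instruct-demo",              # hypothetical endpoint name
    repository="tiiuae/falcon-40b-instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",                      # placeholder, check the UI
    instance_type="nvidia-a100",             # placeholder, check the UI
    token="hf_YOUR_TOKEN",
)

# Wait until the endpoint is running, then print its URL
endpoint.wait()
print(endpoint.url)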
2. Test the LLM endpoint
The Endpoint overview includes an Inference Widget for sending manual requests. This feature allows for quick testing of the endpoint with various inputs and easy sharing with team members. These widgets do not support parameters, which can result in shorter generations.
The widget also provides a cURL command for testing. Users can add their hf_xxx token and test the endpoint.
curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
-X POST \
-d '{"inputs":"Once upon a time,"}' \
-H "Authorization: Bearer <hf_token>" \
-H "Content-Type: application/json"
Generation can be controlled using various parameters defined within the payload’s parameters attribute (a request sketch using several of them follows the list below). Currently, the following parameters are supported:
- temperature: Adjusts the randomness of the model’s output. Lower values lead to more deterministic results, while higher values increase randomness. The default is 1.0.
- max_new_tokens: Sets the maximum number of tokens to generate. The default is 20, with a maximum of 512.
- repetition_penalty: Influences the likelihood of token repetition. The default is null.
- seed: Specifies the seed for random generation. The default is null.
- stop: A list of tokens that will halt the generation process if encountered.
- top_k: Determines the number of highest probability vocabulary tokens to consider for top-k-filtering. The default is null, which disables this filtering.
- top_p: Represents the cumulative probability of the highest probability vocabulary tokens to retain for nucleus sampling. The default is null.
- do_sample: A boolean indicating whether to use sampling or greedy decoding. The default is false.
- best_of: Generates a specified number of sequences and returns the one with the highest token logprobs. The default is null.
- details: A boolean indicating whether to return generation details. The default is false.
- return_full_text: A boolean indicating whether to return the complete text or only the generated portion. The default is false.
- truncate: A boolean indicating whether to truncate the input to the model’s maximum length. The default is true.
- typical_p: The typical probability of a token. The default is null.
- watermark: A boolean indicating whether to use a watermark for generation. The default is false.
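To illustrate how these parameters fit into a request, the following minimal Python sketch sends the same payload as the cURL command above but adds a parameters object. The endpoint URL and token are placeholders, and the parameter values are purely illustrative.

import requests

# Placeholders: replace with your endpoint URL and hf_xxx token
API_URL = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
HEADERS = {
    "Authorization": "Bearer hf_YOUR_TOKEN",
    "Content-Type": "application/json",
}

# Generation settings go into the payload's "parameters" attribute
payload = {
    "inputs": "Once upon a time,",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "stop": ["\nUser:"],
    },
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
# Inspect the raw JSON; for text generation it typically contains the generated text
print(response.json())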
3. Stream responses in JavaScript and Python
Generating text with LLMs can be an iterative and time-consuming process. Streaming tokens to the user as they are generated significantly enhances the user experience. The following examples demonstrate how to stream tokens using Python and JavaScript. The Python example uses the InferenceClient from the huggingface_hub library, while the JavaScript example uses the HuggingFace.js library.
Streaming requests with Python
First, install the huggingface_hub library:
pip install -U huggingface_hub
An InferenceClient can be created by providing the endpoint URL and credentials, along with the desired hyperparameters.
from huggingface_hub import InferenceClient
# HF Inference Endpoints parameter
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
hf_token = "hf_YOUR_TOKEN"
# Streaming Client
client = InferenceClient(endpoint_url, token=hf_token)
# generation parameter
gen_kwargs = dict(
    max_new_tokens=512,
    top_k=30,
    top_p=0.9,
    temperature=0.2,
    repetition_penalty=1.02,
    stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
)
# prompt
prompt = "What can you do in Nuremberg, Germany? Give me 3 Tips"
stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
# yield each generated token
for r in stream:
    # skip special tokens
    if r.token.special:
        continue
    # stop if we encounter a stop sequence
    if r.token.text in gen_kwargs["stop_sequences"]:
        break
    # yield the generated token
    print(r.token.text, end="")
    # yield r.token.text
Replace the print command with a yield statement or a function designed to stream the tokens.
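For instance, the loop can be wrapped in a generator so that a web framework or UI layer can consume tokens as they arrive. The following minimal sketch reuses the client and gen_kwargs defined above; stream_tokens is a hypothetical helper name.

def stream_tokens(prompt: str):
    """Hypothetical helper that yields generated tokens one at a time."""
    stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
    for r in stream:
        # skip special tokens
        if r.token.special:
            continue
        # stop if we encounter a stop sequence
        if r.token.text in gen_kwargs["stop_sequences"]:
            break
        yield r.token.text

# usage: consume the generator, e.g. from a request handler
for token in stream_tokens("What can you do in Nuremberg, Germany? Give me 3 Tips"):
    print(token, end="")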
Streaming requests with JavaScript
First, install the @huggingface/inference library:
npm install @huggingface/inference
A HfInferenceEndpoint can be created by providing the endpoint URL and credentials, along with the desired hyperparameters.
import { HfInferenceEndpoint } from '@huggingface/inference'
const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')
// generation parameters
const gen_kwargs = {
  max_new_tokens: 512,
  top_k: 30,
  top_p: 0.9,
  temperature: 0.2,
  repetition_penalty: 1.02,
  stop_sequences: ['\nUser:', '<|endoftext|>', '</s>'],
}
// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'
const stream = hf.textGenerationStream({ inputs: prompt, parameters: gen_kwargs })
for await (const r of stream) {
  // skip special tokens
  if (r.token.special) {
    continue
  }
  // stop if we encounter a stop sequence
  if (gen_kwargs['stop_sequences'].includes(r.token.text)) {
    break
  }
  // yield the generated token
  process.stdout.write(r.token.text)
}
Replace the process.stdout call with a yield statement or a function designed to stream the tokens.
Conclusion
This article demonstrated how to deploy open-source LLMs using Hugging Face Inference Endpoints, control text generation with advanced parameters, and stream responses to Python or JavaScript clients for an improved user experience. Hugging Face Inference Endpoints enable the deployment of models as production-ready APIs with ease, offer cost reduction through automatic scale-to-zero, and provide secure offline endpoints backed by SOC2 Type 2 certification.

