Open-source Large Language Models (LLMs) such as Falcon, (Open-)LLaMA, X-Gen, StarCoder, and RedPajama have advanced significantly and now rival proprietary models like ChatGPT or GPT-4 for specific applications. Despite their capabilities, deploying these models efficiently and optimally remains a complex task.
This article demonstrates how to deploy open-source LLMs using Hugging Face Inference Endpoints, a managed SaaS solution designed to simplify model deployment. It also covers streaming responses and testing endpoint performance.
Before proceeding, a brief overview of Inference Endpoints is helpful.
What is Hugging Face Inference Endpoints?
Hugging Face Inference Endpoints provides a straightforward and secure method for deploying Machine Learning models into production environments. This service enables developers and data scientists to build AI applications without the burden of infrastructure management. It streamlines the deployment process to just a few clicks, incorporating features like autoscaling to manage high request volumes, scale-to-zero for cost reduction, and robust security measures.
Key features relevant to LLM deployment include:
- Easy Deployment: Models can be deployed as production-ready APIs with minimal effort, removing the need for infrastructure or MLOps management.
- Cost Efficiency: Automatic scale-to-zero functionality helps reduce costs by scaling down infrastructure when an endpoint is inactive, with billing based on endpoint uptime.
- Enterprise Security: Models can be deployed in secure offline endpoints, accessible only via direct VPC connections. The service is SOC2 Type 2 certified and offers BAA and GDPR data processing agreements for enhanced security and compliance.
- LLM Optimization: The platform is optimized for LLMs, offering high throughput with Paged Attention and low latency through custom transformers code and Flash Attention, powered by Text Generation Inference.
- Comprehensive Task Support: Out-of-the-box support is available for 🤗 Transformers, Sentence-Transformers, and Diffusers tasks and models, with easy customization for advanced tasks like speaker diarization or other Machine Learning tasks and libraries.
1. How to deploy Falcon 40B instruct
To begin, access Inference Endpoints at https://ui.endpoints.huggingface.co after logging in with a User or Organization account that has a payment method on file (payment methods can be added in the account's billing settings).
Next, select “New endpoint”. Choose the desired repository, cloud provider, and region. Adjust the instance and security settings, then proceed to deploy a model, such as tiiuae/falcon-40b-instruct in this example.

Inference Endpoints recommends an instance type suitable for the model size, typically 4x NVIDIA T4 GPUs for this model. For optimal LLM performance, it is recommended to change the instance to GPU [xlarge] · 1x Nvidia A100.
Note: If the desired instance type is unavailable, a quota increase request can be submitted by contacting the support team.
The model can then be deployed by clicking “Create Endpoint”. The endpoint should come online and be ready to serve requests roughly 10 minutes after creation.
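The same deployment can also be scripted. Recent versions of the huggingface_hub library expose the Inference Endpoints API programmatically; the sketch below assumes such a version is installed and uses placeholder values for the endpoint name, instance_size, and instance_type (verify them against the options the UI offers for your account).

from huggingface_hub import create_inference_endpoint

# Minimal sketch: create the endpoint programmatically instead of via the UI.
# The name, instance_size, and instance_type values are placeholders; check
# them against what the Inference Endpoints UI shows for your account.
endpoint = create_inference_endpoint(
    "falcon-40b-instruct-demo",              # hypothetical endpoint name
    repository="tiiuae/falcon-40b-instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",                      # placeholder, check the UI
    instance_type="nvidia-a100",             # placeholder, check the UI
    token="hf_YOUR_TOKEN",
)

# Wait until the endpoint is running, then print its URL
endpoint.wait()
print(endpoint.url)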
2. Test the LLM endpoint
The Endpoint overview includes an Inference Widget for sending manual requests. This feature allows for quick testing of the endpoint with various inputs and easy sharing with team members. These widgets do not support parameters, which can result in shorter generations.
The widget also provides a cURL command for testing. Users can add their hf_xxx token and test the endpoint.
curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
-X POST \
-d '{"inputs":"Once upon a time,"}' \
-H "Authorization: Bearer <hf_token>" \
-H "Content-Type: application/json"
Generation can be controlled using various parameters defined within the payload’s parameters attribute (a request sketch using several of them follows the list below). Currently, the following parameters are supported:
- temperature: Adjusts the randomness of the model’s output. Lower values lead to more deterministic results, while higher values increase randomness. The default is 1.0.
- max_new_tokens: Sets the maximum number of tokens to generate. The default is 20, with a maximum of 512.
- repetition_penalty: Influences the likelihood of token repetition. The default is null.
- seed: Specifies the seed for random generation. The default is null.
- stop: A list of tokens that will halt the generation process if encountered.
- top_k: Determines the number of highest probability vocabulary tokens to consider for top-k-filtering. The default is null, which disables this filtering.
- top_p: Represents the cumulative probability of the highest probability vocabulary tokens to retain for nucleus sampling. The default is null.
- do_sample: A boolean indicating whether to use sampling or greedy decoding. The default is false.
- best_of: Generates a specified number of sequences and returns the one with the highest token logprobs. The default is null.
- details: A boolean indicating whether to return generation details. The default is false.
- return_full_text: A boolean indicating whether to return the complete text or only the generated portion. The default is false.
- truncate: A boolean indicating whether to truncate the input to the model’s maximum length. The default is true.
- typical_p: The typical probability of a token. The default is null.
- watermark: A boolean indicating whether to use a watermark for generation. The default is false.
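To illustrate how these parameters fit into a request, the following minimal Python sketch sends the same payload as the cURL command above but adds a parameters object. The endpoint URL and token are placeholders, and the parameter values are purely illustrative.

import requests

# Placeholders: replace with your endpoint URL and hf_xxx token
API_URL = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
HEADERS = {
    "Authorization": "Bearer hf_YOUR_TOKEN",
    "Content-Type": "application/json",
}

# Generation settings go into the payload's "parameters" attribute
payload = {
    "inputs": "Once upon a time,",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
        "stop": ["\nUser:"],
    },
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
# Inspect the raw JSON; for text generation it typically contains the generated text
print(response.json())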
3. Stream responses in JavaScript and Python
Generating text with LLMs can be an iterative and time-consuming process. Streaming tokens to the user as they are generated significantly enhances the user experience. The following examples demonstrate how to stream tokens using Python and JavaScript. The Python example uses the InferenceClient from the huggingface_hub library, while the JavaScript example uses the HuggingFace.js library.
Streaming requests with Python
First, install the huggingface_hub library:
pip install -U huggingface_hub
An InferenceClient can be created by providing the endpoint URL and credentials, along with the desired hyperparameters.
from huggingface_hub import InferenceClient
# HF Inference Endpoints parameter
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
hf_token = "hf_YOUR_TOKEN"
# Streaming Client
client = InferenceClient(endpoint_url, token=hf_token)
# generation parameter
gen_kwargs = dict(
    max_new_tokens=512,
    top_k=30,
    top_p=0.9,
    temperature=0.2,
    repetition_penalty=1.02,
    stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
)
# prompt
prompt = "What can you do in Nuremberg, Germany? Give me 3 Tips"
stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
# yield each generated token
for r in stream:
    # skip special tokens
    if r.token.special:
        continue
    # stop if we encounter a stop sequence
    if r.token.text in gen_kwargs["stop_sequences"]:
        break
    # yield the generated token
    print(r.token.text, end="")
    # yield r.token.text
Replace the print command with a yield statement or a function designed to stream the tokens.
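For instance, the loop can be wrapped in a generator so that a web framework or UI layer can consume tokens as they arrive. The following minimal sketch reuses the client and gen_kwargs defined above; stream_tokens is a hypothetical helper name.

def stream_tokens(prompt: str):
    """Hypothetical helper that yields generated tokens one at a time."""
    stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
    for r in stream:
        # skip special tokens
        if r.token.special:
            continue
        # stop if we encounter a stop sequence
        if r.token.text in gen_kwargs["stop_sequences"]:
            break
        yield r.token.text

# usage: consume the generator, e.g. from a request handler
for token in stream_tokens("What can you do in Nuremberg, Germany? Give me 3 Tips"):
    print(token, end="")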
Streaming requests with JavaScript
First, install the @huggingface/inference library:
npm install @huggingface/inference
A HfInferenceEndpoint can be created by providing the endpoint URL and credentials, along with the desired hyperparameters.
import { HfInferenceEndpoint } from '@huggingface/inference'
const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')
// generation parameters
const gen_kwargs = {
  max_new_tokens: 512,
  top_k: 30,
  top_p: 0.9,
  temperature: 0.2,
  repetition_penalty: 1.02,
  stop_sequences: ['\nUser:', '<|endoftext|>', '</s>'],
}
// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'
const stream = hf.textGenerationStream({ inputs: prompt, parameters: gen_kwargs })
for await (const r of stream) {
  // skip special tokens
  if (r.token.special) {
    continue
  }
  // stop if we encounter a stop sequence
  if (gen_kwargs['stop_sequences'].includes(r.token.text)) {
    break
  }
  // yield the generated token
  process.stdout.write(r.token.text)
}
Replace the process.stdout call with a yield statement or a function designed to stream the tokens.
Conclusion
This article demonstrated how to deploy open-source LLMs using Hugging Face Inference Endpoints, control text generation with advanced parameters, and stream responses to Python or JavaScript clients for an improved user experience. Hugging Face Inference Endpoints enable the deployment of models as production-ready APIs with ease, offer cost reduction through automatic scale-to-zero, and provide secure offline endpoints backed by SOC2 Type 2 certification.

