
    Deploy LLMs with Hugging Face Inference Endpoints

    By Samuel Alejandro · February 12, 2026

    Open-source Large Language Models (LLMs) such as Falcon, (Open-)LLaMA, X-Gen, StarCoder, and RedPajama have advanced significantly and now rival proprietary models like ChatGPT or GPT-4 for specific applications. Despite their capabilities, deploying these models efficiently and optimally remains a complex task.

    This article demonstrates how to deploy open-source LLMs using Hugging Face Inference Endpoints, a managed SaaS solution designed to simplify model deployment. It also covers streaming responses and testing endpoint performance.

    1. How to deploy Falcon 40B instruct
    2. Test the LLM endpoint
    3. Stream responses in JavaScript and Python

    Before proceeding, a brief overview of Inference Endpoints is helpful.

    What is Hugging Face Inference Endpoints?

    Hugging Face Inference Endpoints provides a straightforward and secure method for deploying Machine Learning models into production environments. This service enables developers and data scientists to build AI applications without the burden of infrastructure management. It streamlines the deployment process to just a few clicks, incorporating features like autoscaling to manage high request volumes, scale-to-zero for cost reduction, and robust security measures.

    Key features relevant to LLM deployment include:

    1. Easy Deployment: Models can be deployed as production-ready APIs with minimal effort, removing the need for infrastructure or MLOps management.
    2. Cost Efficiency: Automatic scale-to-zero functionality helps reduce costs by scaling down infrastructure when an endpoint is inactive, with billing based on endpoint uptime.
    3. Enterprise Security: Models can be deployed in secure offline endpoints, accessible only via direct VPC connections. The service is SOC2 Type 2 certified and offers BAA and GDPR data processing agreements for enhanced security and compliance.
    4. LLM Optimization: The platform is optimized for LLMs, offering high throughput with Paged Attention and low latency through custom transformers code and Flash Attention, powered by Text Generation Inference.
    5. Comprehensive Task Support: Out-of-the-box support is available for 🤗 Transformers, Sentence-Transformers, and Diffusers tasks and models, with easy customization for advanced tasks like speaker diarization or other Machine Learning tasks and libraries.

    1. How to deploy Falcon 40B instruct

    To begin, access Inference Endpoints at https://ui.endpoints.huggingface.co after logging in with a User or Organization account that has a payment method on file (one can be added in the account's billing settings).

    Next, select “New endpoint”. Choose the desired repository, cloud provider, and region. Adjust the instance and security settings, then proceed to deploy a model, such as tiiuae/falcon-40b-instruct in this example.

    (Screenshot: Select Hugging Face Repository)

    Inference Endpoints recommends an instance type suitable for the model size, typically 4x NVIDIA T4 GPUs for this model. For optimal LLM performance, it is recommended to change the instance to GPU [xlarge] · 1x Nvidia A100.

    Note: If the desired instance type is unavailable, a quota increase request can be submitted by contacting the support team.

    The model can then be deployed by clicking “Create Endpoint”. The endpoint typically comes online and is ready to serve requests approximately 10 minutes after creation.
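
    The steps above use the web UI, but endpoint creation can also be scripted. Below is a minimal sketch using create_inference_endpoint from the huggingface_hub library; the endpoint name and the instance_size/instance_type identifiers are assumptions here and should be matched to the options your account actually offers in the Endpoints UI.

    from huggingface_hub import create_inference_endpoint

    # Minimal sketch of programmatic endpoint creation (alternative to the UI).
    # NOTE: the endpoint name and instance identifiers below are assumptions;
    # check the values offered in the Inference Endpoints UI for your account.
    endpoint = create_inference_endpoint(
        "falcon-40b-instruct-demo",
        repository="tiiuae/falcon-40b-instruct",
        framework="pytorch",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        type="protected",
        instance_size="x1",           # assumed identifier for a single-GPU instance
        instance_type="nvidia-a100",  # assumed identifier for an A100 instance
        token="hf_YOUR_TOKEN",
    )

    # Block until the endpoint reports it is running (roughly 10 minutes for this model).
    endpoint.wait()
    print(endpoint.url)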

    2. Test the LLM endpoint

    The Endpoint overview includes an Inference Widget for sending manual requests. This feature allows for quick testing of the endpoint with various inputs and easy sharing with team members. These widgets do not support parameters, which can result in shorter generations.

    The widget also provides a cURL command for testing. Users can add their hf_xxx token and test the endpoint.

    curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
    -X POST \
    -d '{"inputs":"Once upon a time,"}' \
    -H "Authorization: Bearer <hf_token>" \
    -H "Content-Type: application/json"
    

    Generation can be controlled using various parameters defined within the payload’s parameters attribute; a complete example payload is shown after the list below. Currently, the following parameters are supported:

    • temperature: Adjusts the randomness of the model’s output. Lower values lead to more deterministic results, while higher values increase randomness. The default is 1.0.
    • max_new_tokens: Sets the maximum number of tokens to generate. The default is 20, with a maximum of 512.
    • repetition_penalty: Influences the likelihood of token repetition. The default is null.
    • seed: Specifies the seed for random generation. The default is null.
    • stop: A list of tokens that will halt the generation process if encountered.
    • top_k: Determines the number of highest probability vocabulary tokens to consider for top-k-filtering. The default is null, which disables this filtering.
    • top_p: Represents the cumulative probability of the highest probability vocabulary tokens to retain for nucleus sampling. The default is null.
    • do_sample: A boolean indicating whether to use sampling or greedy decoding. The default is false.
    • best_of: Generates a specified number of sequences and returns the one with the highest token logprobs. The default is null.
    • details: A boolean indicating whether to return generation details. The default is false.
    • return_full_text: A boolean indicating whether to return the complete text or only the generated portion. The default is false.
    • truncate: A boolean indicating whether to truncate the input to the model’s maximum length. The default is true.
    • typical_p: The typical probability of a token. The default is null.
    • watermark: A boolean indicating whether to use a watermark for generation. The default is false.
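
    As mentioned above, the sketch below shows one example request body combining several of these parameters; the specific values are illustrative, not recommendations.

    import requests

    # Example payload combining several generation parameters; values are illustrative.
    payload = {
        "inputs": "Once upon a time,",
        "parameters": {
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_p": 0.9,
            "do_sample": True,
            "repetition_penalty": 1.1,
            "stop": ["\nUser:"],
            "return_full_text": False,
        },
    }

    response = requests.post(
        "https://YOUR_ENDPOINT.endpoints.huggingface.cloud",
        headers={"Authorization": "Bearer hf_YOUR_TOKEN", "Content-Type": "application/json"},
        json=payload,
    )
    print(response.json())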

    3. Stream responses in JavaScript and Python

    Generating text with LLMs can be an iterative and time-consuming process. Streaming tokens to the user as they are generated significantly enhances the user experience. The following examples demonstrate how to stream tokens using Python and JavaScript. The Python example uses the InferenceClient from the huggingface_hub library, while the JavaScript example uses the HuggingFace.js library.

    Streaming requests with Python

    First, install the huggingface_hub library:

    pip install -U huggingface_hub
    

    An InferenceClient can be created by providing the endpoint URL and credentials, along with the desired hyperparameters.

    from huggingface_hub import InferenceClient
    
    # HF Inference Endpoints parameter
    endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
    hf_token = "hf_YOUR_TOKEN"
    
    # Streaming Client
    client = InferenceClient(endpoint_url, token=hf_token)
    
    # generation parameters
    gen_kwargs = dict(
        max_new_tokens=512,
        top_k=30,
        top_p=0.9,
        temperature=0.2,
        repetition_penalty=1.02,
        stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
    )
    # prompt
    prompt = "What can you do in Nuremberg, Germany? Give me 3 Tips"
    
    stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
    
    # yield each generated token
    for r in stream:
        # skip special tokens
        if r.token.special:
            continue
        # stop if we encounter a stop sequence
        if r.token.text in gen_kwargs["stop_sequences"]:
            break
        # yield the generated token
        print(r.token.text, end="")
        # yield r.token.text
    

    Replace the print command with a yield statement or a function designed to stream the tokens.
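
    For example, a minimal generator wrapper (an illustrative helper, not part of the library) could look like the following, reusing the client and gen_kwargs defined above:

    def stream_tokens(prompt: str):
        """Yield generated tokens one at a time; illustrative helper."""
        stream = client.text_generation(prompt, stream=True, details=True, **gen_kwargs)
        for r in stream:
            # skip special tokens
            if r.token.special:
                continue
            # stop if a stop sequence is generated
            if r.token.text in gen_kwargs["stop_sequences"]:
                break
            yield r.token.text

    # Usage: forward each token to the UI or HTTP response as it arrives.
    for token in stream_tokens("What can you do in Nuremberg, Germany? Give me 3 Tips"):
        print(token, end="", flush=True)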

    Streaming requests with JavaScript

    First, install the @huggingface/inference library:

    npm install @huggingface/inference
    

    An HfInferenceEndpoint can be created by providing the endpoint URL and credentials, along with the desired hyperparameters.

    import { HfInferenceEndpoint } from '@huggingface/inference'
    
    const hf = new HfInferenceEndpoint('https://YOUR_ENDPOINT.endpoints.huggingface.cloud', 'hf_YOUR_TOKEN')
    
    // generation parameters
    const gen_kwargs = {
      max_new_tokens: 512,
      top_k: 30,
      top_p: 0.9,
      temperature: 0.2,
      repetition_penalty: 1.02,
      stop_sequences: ['\nUser:', '<|endoftext|>', '</s>'],
    }
    // prompt
    const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'
    
    const stream = hf.textGenerationStream({ inputs: prompt, parameters: gen_kwargs })
    for await (const r of stream) {
      // skip special tokens
      if (r.token.special) {
        continue
      }
      // stop if we encounter a stop sequence
      if (gen_kwargs['stop_sequences'].includes(r.token.text)) {
        break
      }
      // yield the generated token
      process.stdout.write(r.token.text)
    }
    

    Replace the process.stdout call with a yield statement or a function designed to stream the tokens.

    Conclusion

    This article demonstrated how to deploy open-source LLMs using Hugging Face Inference Endpoints, control text generation with advanced parameters, and stream responses to Python or JavaScript clients for an improved user experience. Hugging Face Inference Endpoints enable the deployment of models as production-ready APIs with ease, offer cost reduction through automatic scale-to-zero, and provide secure offline endpoints backed by SOC2 Type 2 certification.
