    Welcome PaliGemma 2 – New vision language models by Google

    By Samuel Alejandro | February 11, 2026

    Google has introduced PaliGemma 2, a new generation of vision language models, building upon the original PaliGemma. This updated model integrates the robust SigLIP for its vision component, while upgrading to the advanced Gemma 2 for its text decoding capabilities.

    PaliGemma 2 offers new pre-trained (pt) models in 3B, 10B, and 28B parameter sizes. These models support multiple input resolutions, including 224×224, 448×448, and 896×896. This range of options allows users to balance quality and efficiency for diverse applications, a significant improvement over the previous PaliGemma, which was limited to a 3B variant.

    The pre-trained models are designed for straightforward fine-tuning on various downstream tasks. The initial PaliGemma model saw broad community adoption for many applications. With the enhanced flexibility from new variants and improved pre-trained quality, the potential for community innovation is considerable.

    For instance, Google has also released fine-tuned versions of PaliGemma 2 based on the DOCCI dataset. These models exhibit versatile and robust captioning abilities, producing detailed and nuanced descriptions. The fine-tuned DOCCI models are available for the 3B and 10B variants, supporting an input resolution of 448×448.

    This release encompasses open model repositories, transformers integration, fine-tuning scripts, and a demonstration of a model fine-tuned for visual question answering on the VQAv2 dataset.

    • Release collection

    • Fine-tuning Script

    • Demo for Fine-tuned Model

    • The technical report

    Table of Contents

    • Introducing PaliGemma 2

    • Model Capabilities

    • Demo

    • How to Use with Transformers

    • Fine-tuning

    • Conclusion

    • Resources

    Introducing PaliGemma 2

    PaliGemma 2 represents an updated version of the PaliGemma vision language model, initially released by Google in May 2024. This model integrates the robust SigLIP image encoder with the Gemma 2 language model.

    [Figure: PaliGemma 2 architecture, with the SigLIP image encoder feeding into the Gemma 2 text decoder]

    These new models are built upon the Gemma 2 language models (2B, 9B, and 27B), leading to PaliGemma 2 variants of 3B, 10B, and 28B parameters, respectively. These names reflect the inclusion of the compact image encoder’s parameters. As previously noted, the models support three distinct resolutions, offering significant adaptability for fine-tuning on various downstream tasks.
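
    To see this composition reflected in a released checkpoint, the configuration can be inspected with 🤗 transformers. The snippet below is a minimal sketch; the repository id and the exact config field values are assumptions based on the usual layout of PaliGemma-style models, so verify them against the model card.

    from transformers import AutoConfig

    # Assumed repository id following the release's naming scheme (size + resolution).
    config = AutoConfig.from_pretrained("google/paligemma2-3b-pt-224")

    # The vision tower should report a SigLIP variant and the text decoder a Gemma 2 variant.
    print(config.vision_config.model_type)   # expected: a SigLIP vision config
    print(config.text_config.model_type)     # expected: a Gemma 2 config
    print(config.vision_config.image_size)   # input resolution of this checkpoint, e.g. 224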

    PaliGemma 2 is released under the Gemma license, permitting redistribution, commercial use, fine-tuning, and the development of model derivatives.

    The release includes the following checkpoints, provided in bfloat16 precision:

    • 9 pre-trained models: 3B, 10B, and 28B with resolutions of 224×224, 448×448, and 896×896.

    • 2 models fine-tuned on DOCCI: checkpoints fine-tuned on the DOCCI dataset (image-text caption pairs), available for the 3B and 10B PaliGemma 2 variants at an input resolution of 448×448 (see the loading sketch after this list).
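
    As a quick orientation, the sketch below shows how a checkpoint from this release can be loaded in the bfloat16 precision it ships in. The repository id follows the size/resolution naming scheme visible in the release (for example, the DOCCI checkpoint google/paligemma2-10b-ft-docci-448 used later in this post); treat the exact pre-trained repo names here as an assumption and check the release collection for the authoritative list.

    import torch
    from transformers import PaliGemmaForConditionalGeneration

    # Assumed naming scheme: google/paligemma2-{size}-{pt|ft-docci}-{resolution}
    model_id = "google/paligemma2-3b-pt-224"

    # The checkpoints are published in bfloat16, so load them in that dtype to match.
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
    )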

    Model Capabilities

    Similar to the original PaliGemma release, the pre-trained (pt) models demonstrate strong performance for subsequent fine-tuning on downstream tasks.

    The pre-trained models utilize a diverse data mixture for training. This varied dataset enables fine-tuning on related downstream tasks with fewer examples.

    • WebLI: A web-scale multilingual image-text dataset built from the public web. A wide range of WebLI splits is used to acquire versatile model capabilities, such as visual semantic understanding, object localization, visually-situated text understanding, and multilinguality.

    • CC3M-35L: Curated English image-alt_text pairs from webpages (Sharma et al., 2018). The dataset was translated into 34 additional languages using the Google Cloud Translation API.

    • Visual Question Generation with Question Answering Validation (VQ2A): An improved dataset for question answering. The dataset is translated into the same additional 34 languages, using the Google Cloud Translation API.

    • OpenImages: Detection and object-aware questions and answers (Piergiovanni et al. 2022) generated by handcrafted rules on the OpenImages dataset.

    • WIT: Images and texts collected from Wikipedia (Srinivasan et al., 2021).

    The PaliGemma 2 development team fine-tuned the pre-trained models on numerous visual-language understanding tasks. Benchmarks for these fine-tuned models are available in the model card and the technical report.

    When fine-tuned on the DOCCI dataset, PaliGemma 2 can perform diverse captioning tasks, such as rendering text, identifying spatial relationships, and incorporating real-world knowledge into captions.

    The performance of DOCCI fine-tuned PaliGemma 2 checkpoints, in comparison to other models, is presented below (data sourced from Table 6 in the technical report).

    Model          #par   #char   #sent   NES↓
    MiniGPT-4      7B     484     5.6     52.3
    mPLUG-Owl2     8B     459     4.4     48.4
    InstructBLIP   7B     510     4.0     42.6
    LLaVA-1.5      7B     395     4.2     40.6
    VILA           7B     871     8.6     28.6
    PaliGemma      3B     535     8.9     34.3
    PaLI-5B        5B     1065    11.3    32.9
    PaliGemma 2    3B     529     7.7     28.4
    PaliGemma 2    10B    521     7.5     20.3

    • #par: Number of model parameters.
    • #char: Average number of characters in the generated caption.
    • #sent: Average number of sentences per caption.
    • NES: Non-entailment sentences (lower is better), a measure of factual inaccuracies.

    Examples of model outputs from the DOCCI checkpoint are provided below, illustrating the model’s versatility.

    (Input images are omitted here; only the generated captions are shown.)

    Image 13: A line graph shows the top-1 accuracy of the ImageNet model after fine-tuning. The graph shows four lines that are colored blue, orange, green, and black. The blue line is the lowest of the four lines, and it is

    Image 14: A close up view of a white piece of paper with black text on it. The paper is curved in the middle. The text on the paper is in a typewriter font. The top of the paper has the words “Ashley Hotel West Coast” on it. Underneath that is “WiFi Internet Service”. Underneath that is “Username: fqpp”. Underneath that is “Password: aaeu

    Image 15: A mural of David Bowie’s Ziggy Stardust look is painted on a white wall. The mural is of three faces side by side, each with red hair and blue lightning bolts painted over their eyes. The faces have blue eyeshadow, pink blush, and red lips. The face in the middle has a black square window above it with white text that reads “JAM” in blue. A silver car

    Image 16: A top-down view of a white marble counter with four coffee mugs on it. There are two gray ones on the left, and one is white on the bottom left. The one on the right is gray. There is a metal wire fruit basket on a wood stand in the top right corner with oranges in it. There is a clear glass pitcher with water in it on the left, and part

    Image 17: A close up view of a white book with a blue strip at the bottom of it. The top half of the book is white. Black text is printed on the white portion of the book. The text reads “Visual Concept Learning from User-tagged Web Video”. Underneath the black text is a white box with five small images inside of it. The image on the far left is of a person standing in a field of grass. The image to the right of that one is of a blue ocean

    Demo

    For demonstration, a PaliGemma 2 3B model with 448×448 resolution was fine-tuned on a subset of the VQAv2 dataset using LoRA and PEFT, as detailed in the fine-tuning section. The resulting model is hosted in a Hugging Face Space; its code can be reviewed to understand how it works, or cloned and adapted for other fine-tuning projects.

    How to Use with Transformers

    Inference on PaliGemma 2 models can be performed using 🤗 transformers, specifically with the PaliGemmaForConditionalGeneration and AutoProcessor APIs. Ensure that transformers version 4.47 or newer is installed:

    pip install --upgrade transformers
    

    Subsequently, inference can be executed as shown. It is important to adhere to the prompt format utilized during the model’s training for the specific task.

    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
    from PIL import Image
    import requests
    
    model_id = "google/paligemma2-10b-ft-docci-448"
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
    model = model.to("cuda")
    processor = AutoProcessor.from_pretrained(model_id)
    
    prompt = "<image>caption en"
    image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
    raw_image = Image.open(requests.get(image_file, stream=True).raw).convert("RGB")
    
    inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=200)
    
    input_len = inputs["input_ids"].shape[-1]
    print(processor.decode(output[0][input_len:], skip_special_tokens=True))
    # A medium shot of two cats laying on a pile of brown fishing nets. The cat in the foreground is a gray tabby cat with white on its chest and paws. The cat is laying on its side with its head facing the bottom right corner of the image. The cat in the background is laying on its side with its head facing the top left corner of the image. The cat's body is curled up, its head is slightly turned to the right, and its front paws are tucked underneath its body. There is a teal rope hanging from the fishing net in the top right corner of the image.
    

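    The prompt in the snippet above uses the caption en task prefix. The original PaliGemma checkpoints were trained with a small set of such prefixes, and the same convention is assumed to carry over to the PaliGemma 2 pre-trained models; the examples below are illustrative only, so consult the model card of your checkpoint for the definitive list.

    # Illustrative task prompts (assumed to follow the original PaliGemma prefix convention):
    caption_prompt = "<image>caption en"                       # short caption in English
    vqa_prompt     = "<image>answer en What is on the table?"  # visual question answering
    ocr_prompt     = "<image>ocr"                              # transcribe text in the image
    detect_prompt  = "<image>detect cat"                       # localization via <loc> tokens
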
    The transformers bitsandbytes integration also allows for loading models with quantization. The subsequent example demonstrates 4-bit nf4 usage:

    import torch
    from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

    # 4-bit NF4 quantization with bfloat16 compute
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    )
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map={"": 0}
    )
    

    Performance degradation due to quantization was assessed by evaluating a 3B fine-tuned checkpoint on the textvqa dataset, using 224×224 input images. The results obtained from the 5,000 entries of the validation set are:

    • bfloat16, no quantization: 60.04% accuracy.
    • 8-bit: 59.78%.
    • 4-bit, using the configuration from the snippet above: 58.72%.

    These figures are promising. Quantization is particularly beneficial for larger checkpoints; it is advisable to always measure results on the specific domains and tasks being utilized.
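
    For reference, an 8-bit setup uses the same bitsandbytes API. The following is a minimal sketch (not necessarily the exact configuration behind the 8-bit figure above), reusing the model_id from the earlier snippets.

    from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

    # 8-bit weight quantization via bitsandbytes
    bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

    model_8bit = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,                      # same checkpoint id as in the earlier snippets
        quantization_config=bnb_config_8bit,
        device_map={"": 0},
    )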

    Fine-tuning

    For those who have previously fine-tuned PaliGemma, the API for PaliGemma 2 remains consistent, allowing existing code to be used directly. A fine-tuning script and a notebook are available to facilitate model fine-tuning, partial model freezing, or the application of memory-efficient techniques such as LoRA or QLoRA.

    A PaliGemma 2 model was LoRA-fine-tuned on half of the VQAv2 validation split for demonstration. This process required half an hour on 3 A100 GPUs with 80GB VRAM. The model is accessible here, and a Gradio demo illustrates its capabilities.
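
    As a starting point before opening the script or notebook, the sketch below shows one way to attach LoRA adapters to PaliGemma 2 with the 🤗 PEFT library. The base checkpoint, target module names, and hyperparameters are illustrative assumptions rather than the settings used for the demo above; the released fine-tuning script remains the reference recipe.

    import torch
    from transformers import PaliGemmaForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    # Assumed base checkpoint; swap in the variant and resolution you need.
    base_id = "google/paligemma2-3b-pt-448"
    model = PaliGemmaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.bfloat16)

    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        # Common projection names in Gemma-style decoders; adjust to match your model.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the LoRA adapters are trainable

    From here, training proceeds as with the original PaliGemma, for example with the Trainer API or a custom loop.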

    Conclusion

    The latest PaliGemma 2 release offers enhanced capabilities compared to its predecessor, featuring diverse model sizes to suit different requirements and more powerful pre-trained models. It will be exciting to see what the community builds with this release.

    Appreciation is extended to the Google team for making this impressive and open model family available. Special thanks go to Pablo Montalvo for integrating the model into transformers, and to Lysandre, Raushan, Arthur, Yieh-Dar, and the broader team for their prompt review, testing, and merging efforts.

    Resources

    • Release collection
    • PaliGemma blog post
    • Fine-tuning Script
    • Fine-tuned Model on VQAv2
    • Demo for Fine-tuned Model
    • The technical report