Google has introduced PaliGemma 2, a new generation of vision language models, building upon the original PaliGemma. This updated model integrates the robust SigLIP for its vision component, while upgrading to the advanced Gemma 2 for its text decoding capabilities.
PaliGemma 2 offers new pre-trained (pt) models in 3B, 10B, and 28B parameter sizes. These models support multiple input resolutions, including 224×224, 448×448, and 896×896. This range of options allows users to balance quality and efficiency for diverse applications, a significant improvement over the previous PaliGemma, which was limited to a 3B variant.
The pre-trained models are designed for straightforward fine-tuning on various downstream tasks. The initial PaliGemma model saw broad community adoption for many applications. With the enhanced flexibility from new variants and improved pre-trained quality, the potential for community innovation is considerable.
For instance, Google has also released fine-tuned versions of PaliGemma 2 based on the DOCCI dataset. These models exhibit versatile and robust captioning abilities, producing detailed and nuanced descriptions. The fine-tuned DOCCI models are available for the 3B and 10B variants, supporting an input resolution of 448×448.
This release encompasses open model repositories, transformers integration, fine-tuning scripts, and a demonstration of a model fine-tuned for visual question answering on the VQAv2 dataset.
Introducing PaliGemma 2
PaliGemma 2 represents an updated version of the PaliGemma vision language model, initially released by Google in May. This model integrates the robust SigLIP image encoder with the Gemma 2 language model.

These new models are built upon the Gemma 2 language models (2B, 9B, and 27B), leading to PaliGemma 2 variants of 3B, 10B, and 28B parameters, respectively. These names reflect the inclusion of the compact image encoder’s parameters. As previously noted, the models support three distinct resolutions, offering significant adaptability for fine-tuning on various downstream tasks.
PaliGemma 2 is released under the Gemma license, permitting redistribution, commercial use, fine-tuning, and the development of model derivatives.
The release includes the following checkpoints, provided in bfloat16 precision:
- 9 pre-trained models: 3B, 10B, and 28B, each at resolutions of 224×224, 448×448, and 896×896.
- 2 models fine-tuned on DOCCI: the 3B and 10B PaliGemma 2 variants fine-tuned on the DOCCI dataset (image-text caption pairs), supporting an input resolution of 448×448.
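For reference, the sketch below enumerates the pre-trained repository IDs, assuming they follow the same google/paligemma2-{size}-{suffix}-{resolution} naming as the fine-tuned DOCCI checkpoint used later in this post (google/paligemma2-10b-ft-docci-448); verify the exact IDs on the Hub.

# Sketch: enumerate the nine pre-trained checkpoints, assuming the
# google/paligemma2-{size}-pt-{resolution} naming pattern (verify on the Hub).
sizes = ["3b", "10b", "28b"]
resolutions = [224, 448, 896]
pt_checkpoints = [f"google/paligemma2-{s}-pt-{r}" for s in sizes for r in resolutions]
print(pt_checkpoints)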
Model Capabilities
Similar to the original PaliGemma release, the pre-trained (pt) models demonstrate strong performance for subsequent fine-tuning on downstream tasks.
The pre-trained models are trained on a diverse data mixture, which enables fine-tuning on related downstream tasks with fewer examples. The mixture includes:
- WebLI: A web-scale multilingual image-text dataset built from the public web. A wide range of WebLI splits is used to acquire versatile model capabilities, such as visual semantic understanding, object localization, visually-situated text understanding, and multilinguality.
- CC3M-35L: Curated English image-alt_text pairs from webpages (Sharma et al., 2018), translated into 34 additional languages using the Google Cloud Translation API.
- Visual Question Generation with Question Answering Validation (VQ2A): An improved dataset for question answering, translated into the same 34 additional languages using the Google Cloud Translation API.
- OpenImages: Detection and object-aware questions and answers (Piergiovanni et al., 2022) generated by handcrafted rules on the OpenImages dataset.
- WIT: Images and texts collected from Wikipedia (Srinivasan et al., 2021).
The PaliGemma 2 development team fine-tuned the pre-trained models on numerous visual-language understanding tasks. Benchmarks for these fine-tuned models are available in the model card and the technical report.
When fine-tuned on the DOCCI dataset, PaliGemma 2 can perform diverse captioning tasks, such as rendering text, identifying spatial relationships, and incorporating real-world knowledge into captions.
The performance of DOCCI fine-tuned PaliGemma 2 checkpoints, in comparison to other models, is presented below (data sourced from Table 6 in the technical report).
| Model | #par | #char | #sent | NES↓ |
|---|---|---|---|---|
| MiniGPT-4 | 7B | 484 | 5.6 | 52.3 |
| mPLUG-Owl2 | 8B | 459 | 4.4 | 48.4 |
| InstructBLIP | 7B | 510 | 4.0 | 42.6 |
| LLaVA-1.5 | 7B | 395 | 4.2 | 40.6 |
| VILA | 7B | 871 | 8.6 | 28.6 |
| PaliGemma | 3B | 535 | 8.9 | 34.3 |
| PaLI-5B | 5B | 1065 | 11.3 | 32.9 |
| PaliGemma 2 | 3B | 529 | 7.7 | 28.4 |
| PaliGemma 2 | 10B | 521 | 7.5 | 20.3 |
- #par: Number of model parameters.
- #char: Average number of characters in the generated captions.
- #sent: Average number of sentences.
- NES: Non-entailment sentences (lower is better), a measure of factual inaccuracies.
Examples of model outputs from the DOCCI checkpoint are provided below, illustrating the model’s versatility.
The corresponding input images are not reproduced here; the generated captions are:

- A line graph shows the top-1 accuracy of the ImageNet model after fine-tuning. The graph shows four lines that are colored blue, orange, green, and black. The blue line is the lowest of the four lines, and it is
- A close up view of a white piece of paper with black text on it. The paper is curved in the middle. The text on the paper is in a typewriter font. The top of the paper has the words “Ashley Hotel West Coast” on it. Underneath that is “WiFi Internet Service”. Underneath that is “Username: fqpp”. Underneath that is “Password: aaeu
- A mural of David Bowie’s Ziggy Stardust look is painted on a white wall. The mural is of three faces side by side, each with red hair and blue lightning bolts painted over their eyes. The faces have blue eyeshadow, pink blush, and red lips. The face in the middle has a black square window above it with white text that reads “JAM” in blue. A silver car
- A top-down view of a white marble counter with four coffee mugs on it. There are two gray ones on the left, and one is white on the bottom left. The one on the right is gray. There is a metal wire fruit basket on a wood stand in the top right corner with oranges in it. There is a clear glass pitcher with water in it on the left, and part
- A close up view of a white book with a blue strip at the bottom of it. The top half of the book is white. Black text is printed on the white portion of the book. The text reads “Visual Concept Learning from User-tagged Web Video”. Underneath the black text is a white box with five small images inside of it. The image on the far left is of a person standing in a field of grass. The image to the right of that one is of a blue ocean
Demo
For demonstration, a PaliGemma 2 3B model with 448×448 resolution was fine-tuned on a subset of the VQAv2 dataset. This process utilized LoRA fine-tuning and PEFT, detailed in the fine-tuning section. The following demo presents the outcome. The code in the Space can be reviewed to understand its functionality or cloned for adaptation to other fine-tuning projects.
How to Use with Transformers
Inference on PaliGemma 2 models can be performed using 🤗 transformers, specifically with the PaliGemmaForConditionalGeneration and AutoProcessor APIs. Ensure that transformers version 4.47 or newer is installed:
pip install --upgrade transformers
Subsequently, inference can be executed as shown. It is important to adhere to the prompt format utilized during the model’s training for the specific task.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

model_id = "google/paligemma2-10b-ft-docci-448"

# Load the DOCCI fine-tuned checkpoint and its processor, then move the model to the GPU.
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
model = model.to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# The DOCCI checkpoints were trained with the "caption en" task prompt after the image token.
prompt = "<image>caption en"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert("RGB")

inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)

# Strip the prompt tokens before decoding so that only the generated caption is printed.
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
# A medium shot of two cats laying on a pile of brown fishing nets. The cat in the foreground is a gray tabby cat with white on its chest and paws. The cat is laying on its side with its head facing the bottom right corner of the image. The cat in the background is laying on its side with its head facing the top left corner of the image. The cat's body is curled up, its head is slightly turned to the right, and its front paws are tucked underneath its body. There is a teal rope hanging from the fishing net in the top right corner of the image.
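To keep prompts and decoding in one place, the steps above can be wrapped in a small helper. This is a minimal sketch that reuses the `model`, `processor`, and `image_file` objects from the snippet above; the alternative task prefixes mentioned in the comments apply to pre-trained checkpoints fine-tuned for those tasks, so check the relevant model card for the exact format.

# Sketch: reusable caption/generation helper built on the objects loaded above.
def generate(prompt: str, image_url: str, max_new_tokens: int = 200) -> str:
    image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    input_len = inputs["input_ids"].shape[-1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

# The DOCCI checkpoints expect "<image>caption en". Other PaliGemma task prefixes
# (for example "<image>answer en <question>" or "<image>detect <object>") are used by
# checkpoints trained for those tasks -- verify against the model card before relying on them.
print(generate("<image>caption en", image_file))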
The transformers bitsandbytes integration also allows for loading models with quantization. The subsequent example demonstrates 4-bit nf4 usage:
import torch
from transformers import BitsAndBytesConfig

# 4-bit nf4 quantization via bitsandbytes, with computations in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)
Performance degradation due to quantization was assessed by evaluating a 3B fine-tuned checkpoint on the TextVQA dataset, using 224×224 input images. The results on 5,000 entries of the validation set are:
- bfloat16, no quantization: 60.04% accuracy.
- 8-bit: 59.78%.
- 4-bit, using the configuration from the snippet above: 58.72%.
These figures are encouraging. Quantization is particularly beneficial for the larger checkpoints; it is advisable to always measure results on the specific domains and tasks being used.
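For reference, the 8-bit figure above corresponds to bitsandbytes 8-bit loading. The following is a minimal sketch of such a configuration, not necessarily the exact evaluation setup, reusing the imports and model_id from the snippets above:

# 8-bit bitsandbytes loading, for comparison with the 4-bit nf4 configuration above.
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    device_map={"": 0},
)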
Fine-tuning
For those who have previously fine-tuned PaliGemma, the API for PaliGemma 2 remains consistent, allowing existing code to be used directly. A fine-tuning script and a notebook are available to facilitate model fine-tuning, partial model freezing, or the application of memory-efficient techniques such as LoRA or QLoRA.
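As an illustration of the memory-efficient path, below is a minimal LoRA setup with PEFT. The checkpoint ID, rank, alpha, and target module names are illustrative assumptions rather than the values used in the released script or notebook.

import torch
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

# Load a pre-trained checkpoint (assumed repository ID; verify on the Hub).
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-448",
    torch_dtype=torch.bfloat16,
)

# Attach LoRA adapters to the attention projections (example hyperparameters).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable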
A PaliGemma 2 model was LoRA-fine-tuned on half of the VQAv2 validation split for demonstration. This process took about half an hour on three A100 GPUs with 80GB of VRAM. The model is accessible here, and a Gradio demo illustrates its capabilities.
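Once such a LoRA run finishes, the resulting adapter can be loaded on top of the base checkpoint for inference. The snippet below is a sketch using a hypothetical adapter repository name; substitute the actual adapter ID.

import torch
from peft import PeftModel
from transformers import PaliGemmaForConditionalGeneration

# Hypothetical adapter repository; replace with the actual fine-tuned adapter ID.
adapter_id = "your-username/paligemma2-3b-vqav2-lora"
base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma2-3b-pt-448", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, adapter_id).to("cuda")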
Conclusion
The latest PaliGemma 2 release offers enhanced capabilities compared to its predecessor, featuring diverse model sizes to suit different requirements and more powerful pre-trained models. It will be exciting to see what the community builds with this release.
Appreciation is extended to the Google team for making this impressive and open model family available. Special thanks go to Pablo Montalvo for integrating the model into transformers, and to Lysandre, Raushan, Arthur, Yih-Dar, and the broader team for their prompt review, testing, and merging efforts.

