
As AI becomes more integrated into daily life, developing it with privacy as a fundamental principle is essential. Differential privacy (DP) provides a mathematically rigorous approach: it adds calibrated noise during training to prevent models from memorizing individual data points. However, applying DP to large language models (LLMs) involves trade-offs that are important to understand. The added noise alters conventional scaling laws, the rules that describe how performance improves with model size, data, and compute, by reducing training stability and substantially increasing the batch sizes and compute required.
New research, titled “Scaling Laws for Differentially Private Language Models” and developed in collaboration with Google DeepMind, establishes laws that accurately model these complexities, offering a complete picture of the trade-offs between compute, privacy, and utility. This research has led to the introduction of VaultGemma, the largest (1B-parameter) open model trained from scratch with differential privacy. Its weights are available on Hugging Face and Kaggle, along with a technical report, with the aim of fostering advances in private AI development.
Understanding the scaling laws
A meticulously designed set of experiments aimed to quantify the benefits of increasing model size, batch size, and the number of iterations in DP training. Managing the vast number of possible combinations required simplifying assumptions, chief among them that a model’s learning effectiveness depends primarily on the “noise-batch ratio,” which compares the magnitude of the random noise added for privacy to the size of the data batches used in training. This assumption holds because the privacy noise significantly outweighs any inherent randomness from data sampling.
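To make the quantity concrete, the following minimal sketch (with hypothetical hyper-parameter values, not the paper’s exact formulation) shows how the noise-batch ratio arises in DP-SGD, where Gaussian noise scaled by the clipping norm is added to the sum of clipped per-example gradients before averaging:

```python
# Minimal sketch of the "noise-batch ratio" in DP-SGD (illustrative values only).

def noise_batch_ratio(noise_multiplier: float, clip_norm: float, batch_size: int) -> float:
    """Effective noise on the averaged gradient.

    DP-SGD adds Gaussian noise with std. dev. noise_multiplier * clip_norm to the
    sum of clipped per-example gradients; dividing by the batch size gives an
    effective noise of noise_multiplier * clip_norm / batch_size on the mean.
    """
    return noise_multiplier * clip_norm / batch_size

# Doubling the batch size halves the effective noise, which is one reason
# DP training favors very large batches.
print(noise_batch_ratio(noise_multiplier=1.0, clip_norm=1.0, batch_size=1024))  # ~9.8e-4
print(noise_batch_ratio(noise_multiplier=1.0, clip_norm=1.0, batch_size=2048))  # ~4.9e-4
```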
To define a DP scaling law, a comprehensive series of experiments assessed performance across various model sizes and noise-batch ratios. The resulting empirical data, combined with known deterministic relationships between the remaining variables, makes it possible to answer scaling-law-style questions such as: “Given specific compute, privacy, and data budgets, what is the optimal training setup to achieve the lowest possible training loss?”

The structure of DP scaling laws indicates that predicted loss can be accurately modeled primarily using model size, iterations, and the noise-batch ratio, which simplifies the complex interplay between compute, privacy, and data budgets.
Key findings: A powerful synergy
Before exploring the complete scaling laws, it is beneficial to grasp the dynamics and synergies among the compute, privacy, and data budgets from a privacy accounting standpoint. This involves understanding how these elements affect the noise-batch ratio for a fixed model size and number of iterations. This analysis is considerably less expensive as it does not require model training, yet it provides valuable insights. For example, increasing the privacy budget alone results in diminishing returns, unless it is accompanied by a proportional increase in either the compute budget (FLOPs) or the data budget (tokens).
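The sketch below illustrates this kind of accounting-only analysis. It assumes Google’s open-source dp_accounting package and its RDP accountant API, and it uses hypothetical budgets; the figures in the paper come from its own accounting setup.

```python
# Accounting-only analysis: no model training required. Assumes the
# `dp_accounting` package (pip install dp-accounting); budgets are hypothetical.
from dp_accounting import dp_event
from dp_accounting import rdp

def epsilon_for(noise_multiplier, batch_size, dataset_size, steps, delta):
    """Privacy cost of DP-SGD: `steps` compositions of a Poisson-subsampled
    Gaussian mechanism with sampling probability batch_size / dataset_size."""
    accountant = rdp.RdpAccountant()
    event = dp_event.PoissonSampledDpEvent(
        sampling_probability=batch_size / dataset_size,
        event=dp_event.GaussianDpEvent(noise_multiplier))
    accountant.compose(event, steps)
    return accountant.get_epsilon(delta)

def required_noise(target_epsilon, batch_size, dataset_size, steps, delta,
                   lo=0.4, hi=50.0):
    """Bisect for the smallest noise multiplier that stays within the budget."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if epsilon_for(mid, batch_size, dataset_size, steps, delta) > target_epsilon:
            lo = mid  # too little noise: privacy budget exceeded
        else:
            hi = mid
    return hi

# How the noise-batch ratio responds to spending more privacy budget (epsilon)
# versus more compute (batch size), at fixed iterations and dataset size.
SEQUENCES, STEPS, DELTA = 10_000_000, 10_000, 1e-10
for eps in (1.0, 2.0, 4.0, 8.0):
    for batch in (1_024, 4_096):
        sigma = required_noise(eps, batch, SEQUENCES, STEPS, DELTA)
        print(f"eps={eps:3.1f}  batch={batch:5d}  noise-batch ratio={sigma / batch:.2e}")
```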

Marginal benefit of increasing the privacy budget (epsilon) and the compute budget (batch size) in terms of their effect on the noise-batch ratio.
To further investigate this synergy, the visualization below illustrates how the optimal training configuration adapts to varying constraints. As privacy and compute budgets fluctuate, the recommendation shifts between prioritizing a larger model or opting for training with larger batch sizes or more iterations.
Predicted training loss for different settings of the data, privacy, and compute budgets, with a further breakdown by number of iterations, batch size, and model size. The plots show both the minimum achievable loss for each budget setting and the corresponding optimal hyper-parameter configurations.
This data offers numerous valuable insights for practitioners. While all findings are detailed in the paper, one key takeaway is to train a considerably smaller model with a much larger batch size than would be used without DP. DP experts will anticipate this general observation, given the well-known importance of large batch sizes; still, although the principle applies broadly, the optimal training configuration varies with the privacy and data budgets. Understanding the precise trade-off is vital for spending both the compute and privacy budgets judiciously in practical training scenarios. The visualizations also reveal flexibility in training configurations: a range of model sizes can deliver similar utility when paired with an appropriate number of iterations and/or batch size.
Applying the scaling laws to build VaultGemma
The Gemma models are designed with responsibility and safety at their core, which makes them a natural foundation for building a production-ready, DP-trained model such as VaultGemma.
Algorithmic advancements: Training at scale
The derived scaling laws represent a crucial first step toward training a strong Gemma model with DP. They were used to determine how much compute is needed to train a compute-optimal 1B-parameter, Gemma 2-based model with DP, and how to allocate that compute across batch size, iterations, and sequence length to maximize utility.
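The sketch below shows the general shape of such a compute-allocation search. The predicted_loss function is a hypothetical placeholder for the fitted scaling law (the real functional form and coefficients are in the paper), noise_fn stands in for a privacy accountant like the one sketched earlier, and the FLOPs estimate uses the common 6 · parameters · tokens approximation:

```python
# Hypothetical compute-allocation sketch; the fitted law and search grid used
# for VaultGemma are described in the paper and technical report.

def predicted_loss(model_size, iterations, noise_batch_ratio):
    """Placeholder for the fitted DP scaling law (illustrative form only)."""
    return 1.5 + 1e3 / model_size**0.4 + 20.0 / iterations**0.3 + 30.0 * noise_batch_ratio**0.5

def training_flops(model_size, batch_size, iterations, seq_len):
    # Common ~6 * params * tokens estimate for transformer training compute.
    return 6 * model_size * batch_size * iterations * seq_len

def best_config(flop_budget, noise_fn, seq_len=1024):
    """Grid-search model size / batch size / iterations under a FLOPs budget.

    `noise_fn(batch_size, iterations)` should return the noise-batch ratio
    implied by the privacy and data budgets (e.g., via a DP accountant).
    """
    best = None
    for model_size in (125e6, 250e6, 500e6, 1e9, 2e9):
        for batch_size in [2**k for k in range(10, 21)]:
            for iterations in (2_000, 5_000, 10_000, 20_000, 50_000, 100_000):
                if training_flops(model_size, batch_size, iterations, seq_len) > flop_budget:
                    continue
                loss = predicted_loss(model_size, iterations, noise_fn(batch_size, iterations))
                if best is None or loss < best[0]:
                    best = (loss, model_size, batch_size, iterations)
    return best

# Example usage with a toy noise model (larger batches -> smaller ratio).
print(best_config(flop_budget=1e21, noise_fn=lambda batch, steps: 1.0 / batch))
```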
A notable difference between the research supporting the scaling laws and the actual training of VaultGemma involved the management of Poisson sampling, a key element of DP-SGD. Initially, a simple method of loading data in uniform batches was used, but this was later changed to Poisson sampling to achieve optimal privacy guarantees with minimal noise. This approach presented two primary difficulties: it generated batches of varying sizes and demanded a specific, randomized data processing order. This was addressed by leveraging recent work on Scalable DP-SGD, which enables processing data in fixed-size batches—either through padding or trimming—while preserving robust privacy safeguards.
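As a simplified illustration of the fixed-size batching idea (a sketch of the concept, not the exact Scalable DP-SGD implementation), each example can be included independently with the Poisson sampling probability, and the resulting variable-size batch is then trimmed or padded to a fixed size. The trimming and padding must, of course, be reflected in the privacy analysis, which is what the Scalable DP-SGD work provides.

```python
import numpy as np

def fixed_size_poisson_batch(dataset, batch_size, sampling_prob, rng, pad_value=None):
    """Draw a Poisson-subsampled batch, then trim or pad it to `batch_size`.

    Conceptual sketch: each example is selected independently with probability
    `sampling_prob`, so the raw batch size is random. Extras are dropped; if
    too few are selected, the batch is padded with dummy entries whose
    gradients are ignored during training.
    """
    mask = rng.random(len(dataset)) < sampling_prob
    selected = [dataset[i] for i in np.flatnonzero(mask)]
    if len(selected) >= batch_size:
        return selected[:batch_size]                              # trim
    return selected + [pad_value] * (batch_size - len(selected))  # pad

rng = np.random.default_rng(0)
toy_dataset = list(range(10_000))
batch = fixed_size_poisson_batch(toy_dataset, batch_size=32, sampling_prob=0.003, rng=rng)
print(len(batch))  # always 32, even though the Poisson draw averages ~30 examples
```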
Results
Utilizing the new scaling laws and advanced training algorithms, VaultGemma was developed. It is currently the largest (1B-parameter) open model fully pre-trained with differential privacy, employing an approach capable of producing high-utility models.
During VaultGemma’s training, the scaling laws proved highly accurate. The model’s final training loss closely matched predictions from the equations, confirming the research and offering the community a dependable roadmap for developing future private models.

Performance comparison of VaultGemma 1B (differentially private) against its non-private counterpart (Gemma3 1B) and an older baseline (GPT-2 1.5B). The results quantify the current resource investment required for privacy and demonstrate that modern DP training yields utility comparable to non-private models from roughly five years ago.
Downstream performance of the model was also compared against its non-private counterpart across various standard academic benchmarks (e.g., HellaSwag, BoolQ, PIQA, SocialIQA, TriviaQA, ARC-C, ARC-E). To contextualize this performance and quantify the current resource cost of privacy, the comparison also includes an older, similarly sized GPT-2 model, which performs comparably on these benchmarks. The comparison shows that contemporary private training methods produce models whose utility is on par with non-private models from roughly five years ago, highlighting a meaningful gap that this work aims to help the community systematically close.
Ultimately, the model offers robust theoretical and empirical privacy protections.
Formal privacy guarantee
Generally, both privacy parameters (ε, δ) and the privacy unit are crucial in DP training, as they collectively dictate what the trained model can learn. VaultGemma was trained with a sequence-level DP guarantee of (ε ≤ 2.0, δ ≤ 1.1e-10), where a sequence comprises 1024 consecutive tokens from diverse data sources. Specifically, the same training mixture used for the Gemma 2 model was employed, containing documents of various lengths. During pre-processing, lengthy documents were divided and tokenized into multiple sequences, while shorter ones were combined into single sequences. Although the sequence-level privacy unit was a suitable choice for this training mixture, user-level differential privacy would be more appropriate in scenarios with a clear data-to-user mapping.
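Written out, this is the standard (ε, δ)-DP condition, applied with neighboring training sets that differ in a single 1024-token sequence:

```latex
% Sequence-level (epsilon, delta)-DP, where \mathcal{M} denotes the randomized
% training procedure: for any two training sets D, D' that differ in one
% 1024-token sequence, and any set S of possible model outcomes,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,
\qquad \varepsilon \le 2.0,\quad \delta \le 1.1\times 10^{-10}.
```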
In practical terms, since protection is provided at the sequence level, if information concerning any (potentially private) fact or inference appears in a single sequence, VaultGemma effectively remains unaware of that fact: any query’s response will be statistically comparable to that from a model never trained on that specific sequence. Conversely, if numerous training sequences contain information relevant to a particular fact, VaultGemma will generally be able to furnish that information.
Empirical memorization
To supplement the sequence-level DP guarantee, further tests were conducted on the empirical privacy properties of the trained model. This involved prompting the model with a 50-token prefix from a training document to observe if it would generate the corresponding 50-token suffix. VaultGemma 1B exhibited no detectable memorization of its training data, effectively demonstrating the success of DP training.
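A minimal sketch of such a memorization check is shown below, assuming a generic Hugging Face-style causal LM interface; the model identifier is a placeholder, and the actual evaluation protocol is described in the technical report.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/vaultgemma-1b"  # placeholder; check the actual Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def suffix_match_rate(training_texts, prefix_len=50, suffix_len=50):
    """Prompt with a 50-token prefix from each training document and check
    whether the model reproduces the true 50-token suffix verbatim."""
    matches, total = 0, 0
    for text in training_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        if len(ids) < prefix_len + suffix_len:
            continue  # skip documents too short for a prefix/suffix pair
        prefix = ids[:prefix_len].unsqueeze(0)
        true_suffix = ids[prefix_len:prefix_len + suffix_len]
        with torch.no_grad():
            out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
        generated_suffix = out[0, prefix_len:prefix_len + suffix_len]
        matches += int(torch.equal(generated_suffix, true_suffix))
        total += 1
    return matches / max(total, 1)
```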
Conclusion
VaultGemma marks a substantial advancement in the pursuit of creating AI that is inherently powerful and private. Through the development and application of a new, robust comprehension of DP scaling laws, the largest open, DP-trained language model to date has been successfully trained and released.
Although a utility gap persists between DP-trained and non-DP-trained models, it is believed this gap can be systematically reduced with further research into mechanism design for DP training. VaultGemma and its accompanying research are expected to enable the community to develop the next generation of safe, responsible, and private AI.
Acknowledgements
Thanks are extended to the entire Gemma and Google Privacy teams for their contributions and support throughout this project. Special appreciation goes to Peter Kairouz, Brendan McMahan, and Dan Ramage for feedback on the blog post; Mark Simborg and Kimberly Schwede for assistance with visualizations; and the Google teams involved in algorithm design, infrastructure implementation, and production maintenance. The following individuals directly contributed to the work presented (ordered alphabetically): Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Lynn Chua, Prem Eruvbetine, Badih Ghazi, Steve He, Yangsibo Huang, Armand Joulin, George Kaissis, Pritish Kamath, Ravi Kumar, Daogao Liu, Ruibo Liu, Pasin Manurangsi, Thomas Mesnard, Andreas Terzis, Tris Warkentin, Da Yu, and Chiyuan Zhang.

