The Mozilla AI Guide is now available for exploration. It can be accessed here.
The AI Guide aims to be a foundational resource for new developers in the field and a valuable reference for experienced practitioners, promoting AI innovations that enhance daily life. Its initial focus is on language models, with plans to evolve into a community-driven platform encompassing various model types.
The initial sections of the Mozilla AI Guide delve into frequently asked questions about Large Language Models (LLMs). AI Basics explains core concepts like AI, ML, and LLMs, their definitions, and interrelationships, alongside their advantages and disadvantages. Building on this foundation, Language Models 101 explores advanced topics, addressing questions such as the meaning of ‘training’ an ML model or the ‘human in the loop’ approach.
This article will focus on the Choosing ML Models section, demonstrating how open-source models can be used for text summarization. The accompanying Colab Notebook is available here, or readers can continue below:
First Steps with Language Models
This guide offers a distinct approach, designed to assist in selecting the appropriate model for specific tasks by:
- teaching how to remain at the forefront of published AI research
- broadening perspectives on current open options for any given task
- avoiding ties to closed-source / closed-data large language models (e.g., OpenAI, Anthropic)
- creating a data-led system for consistently identifying and utilizing state-of-the-art (SOTA) models for particular tasks.
The initial task to be explored is text summarization.
So… why are popular large language models not being used?
Many prominent LLMs are capable of various tasks, including summarization. However, their suitability for specific requirements varies, necessitating proper evaluation.
Furthermore, numerous popular LLMs are proprietary, trained on undisclosed datasets, and may exhibit biases. Responsible AI implementation demands informed decisions.
Lastly, most large LLMs demand significant GPU resources. While service-based models exist, they often incur costs per API call, which can be avoided for common tasks achievable with high-quality open models on standard hardware.
Why does using open models matter?
For decades, engineers have benefited from starting with open-source projects and deploying them to production. This established practice is now facing challenges.
While numerous effective open models are available, many guides tend to favor proprietary APIs over providing straightforward methods for utilizing open-source alternatives.
Commercial AI projects often receive substantial funding, enabling greater marketing efforts than open-source initiatives. This trend can lead engineers to prioritize proprietary solutions, resulting in costly production deployments.
The First Project – Summarization
This project will involve:
- Finding text to summarize.
- Determining how to summarize it using current state-of-the-art open-source models.
- Writing code to perform the summarization.
- Evaluating the quality of results using relevant metrics.
For demonstration purposes, Mozilla’s Trustworthy AI Guidelines will be used as a string.
In practical applications, extracting content from various file types typically requires additional libraries.
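For example, pulling raw text out of a PDF might look like the following minimal sketch, which assumes the pypdf package and a hypothetical input file:
%pip install pypdf
from pypdf import PdfReader
# Read a hypothetical PDF and concatenate the extracted text of every page
reader = PdfReader("report.pdf")  # hypothetical input file
pdf_text = "\n".join(page.extract_text() for page in reader.pages)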
import textwrap
content = """Mozilla's "Trustworthy AI" Thinking Points:
PRIVACY: How is data collected, stored, and shared? Our personal data powers everything from traffic maps to targeted advertising. Trustworthy AI should enable people to decide how their data is used and what decisions are made with it.
FAIRNESS: We’ve seen time and again how bias shows up in computational models, data, and frameworks behind automated decision making. The values and goals of a system should be power aware and seek to minimize harm. Further, AI systems that depend on human workers should protect people from exploitation and overwork.
TRUST: People should have agency and control over their data and algorithmic outputs, especially considering the high stakes for individuals and societies. For instance, when online recommendation systems push people towards extreme, misleading content, potentially misinforming or radicalizing them.
SAFETY: AI systems can carry high risk for exploitation by bad actors. Developers need to implement strong measures to protect our data and personal security. Further, excessive energy consumption and extraction of natural resources for computing and machine learning accelerates the climate crisis.
TRANSPARENCY: Automated decisions can have huge personal impacts, yet the reasons for decisions are often opaque. We need to mandate transparency so that we can fully understand these systems and their potential for harm."""
With the text prepared, summarization can begin.
A brief pause for context
The rapid pace of the AI field necessitates continuous review of scientific papers to stay informed about current advancements and state-of-the-art techniques.
For engineers new to AI, it can be challenging to:
- discover which open models are available
- identify which models are appropriate for particular tasks
- understand which benchmarks are used to evaluate those models
- determine which models are performing well based on evaluations
- assess which models can run on available hardware
This presents a challenge for engineers working under deadlines, as centralized resources for open-source AI model discussions are scarce, often scattered across social media, private groups, and informal networks.
Establishing a workflow to address these challenges can enable continuous engagement with the forefront of AI research.
How to get a list of available open summarization models?
Currently, Huggingface is a recommended resource, offering an extensive directory of open models categorized by task. This serves as an excellent starting point, though filtering may be necessary to exclude larger LLMs.
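As one way to pull such a list programmatically rather than browsing the website, here is a minimal sketch using the huggingface_hub client library (an assumption on tooling; the filter and sort arguments reflect its API at the time of writing):
%pip install huggingface_hub
from huggingface_hub import list_models
# Print the five most-downloaded models tagged for the summarization task
for model in list_models(filter="summarization", sort="downloads", direction=-1, limit=5):
    print(model.modelId)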
From this extensive list of summarization models, how does one make a selection?
The training data for these models is often unknown. For instance, a summarizer trained on news articles will likely outperform one trained on Reddit posts when summarizing news.
A standardized set of metrics and benchmarks is required for objective model comparisons.
How to evaluate summarization models?
The following steps outline a method for evaluating any model for any task. While it currently involves consulting multiple data sources, future improvements aim to streamline this process.
Steps:
- Find the most common datasets used to train models for summarization.
- Find the most common metrics used to evaluate models for summarization across those datasets.
- Do a quick audit of the training data's provenance, quality, and any exhibited biases, in keeping with Responsible AI usage.
Finding datasets
An effective method for this is utilizing Papers With Code, a valuable resource for discovering recent scientific papers categorized by task, often including associated code repositories.
Begin by filtering Papers With Code’s ‘Text Summarization’ datasets to display the most cited text-based English datasets.
The ‘CNN/DailyMail’ dataset, currently the most cited, will be selected given its popularity.
Downloading this dataset is not necessary; instead, the information provided by Papers With Code will be reviewed for further understanding. This dataset is also accessible on Huggingface.
Three key aspects to verify include:
- license
- recent papers
- whether the data is traceable and the methods are transparent
First, examine the license. For instance, an MIT license permits both commercial and personal use.
Next, ascertain the recency of papers utilizing this dataset by sorting them by date in descending order. A dataset with numerous recent papers, such as those from 2023, indicates active use.
Finally, verify the data’s credibility. For example, a dataset created by IBM in collaboration with the University of Montréal suggests a reliable source.
The next step involves exploring how to evaluate models that utilize this dataset.
Evaluating models
The next step is to identify common metrics used across datasets for summarization tasks. However, without familiarity with summarization literature, these metrics may be unknown.
To discover these, select a ‘Subtask’ that aligns with the desired outcome. For summarizing the CNN article, ‘Abstractive Text Summarization’ is a suitable choice.
This page provides substantial new information.
Three terms are introduced: ROUGE-1, ROUGE-2, and ROUGE-L. These metrics assess summarization performance by measuring the overlap between a generated summary and a human-written reference: ROUGE-1 counts matching single words (unigrams), ROUGE-2 counts matching word pairs (bigrams), and ROUGE-L measures the longest common subsequence.
A list of models and their scores across these three metrics is also provided, which is precisely what is needed.
Focusing on ROUGE-1 as the metric, the top three models can be shortlisted for further evaluation. ROUGE scores are reported on a 0–100 scale, and scores approaching 50 on this benchmark indicate strong performance.
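To make these metrics concrete, here is a minimal sketch of computing them with Google's rouge-score package (an assumption on tooling; the reference and candidate strings below are hypothetical):
%pip install rouge-score
from rouge_score import rouge_scorer
# Compare a hypothetical model summary against a hypothetical human reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "People should control how their data is used."
candidate = "People should decide how their data gets used."
print(scorer.score(reference, candidate))  # precision, recall, and F1 per metric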
Testing out a model
With several candidates identified, a model suitable for local machines will be selected. While many models perform optimally on GPUs, some can generate summaries quickly on CPUs. Google’s Pegasus will be chosen as the starting point.
# first we install huggingface's transformers library
%pip install transformers sentencepiece
Next, Pegasus is located on Huggingface. It is notable that Pegasus was partially trained on the CNN/DailyMail dataset, which is advantageous for article summarization. A specific variant of Pegasus, trained exclusively on this dataset, is available and will be utilized.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
import torch
# Set the seed, this will help reproduce results. Changing the seed will
# generate new results
from transformers import set_seed
set_seed(248602)
# We're using the version of Pegasus specifically trained for summarization
# using the CNN/DailyMail dataset
model_name = "google/pegasus-cnn_dailymail"
# If you're following along in Colab, switch your runtime to a
# T4 GPU or other CUDA-compliant device for a speedup
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the tokenizer
tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Load the model
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
# Tokenize the entire content
batch = tokenizer(content, padding="longest", return_tensors="pt").to(device)
# Generate the summary as tokens
summarized = model.generate(**batch)
# Decode the tokens back into text
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
# Compare
def compare(original, summarized_text):
    print(f"Article text length: {len(original)}\n")
    print(textwrap.fill(summarized_text, 100))
    print()
    print(f"Summarized length: {len(summarized_text)}")

compare(content, summarized_text)
Article text length: 1427

Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system should be power aware and seek to minimize harm.<n>People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and personal security.

Summarized length: 320
A summary is generated, though it is somewhat brief. Attempts will be made to produce a longer summary…
set_seed(860912)
# Generate the summary as tokens, this time raising the max_new_tokens limit to 800
summarized = model.generate(**batch, max_new_tokens=800)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427

Trustworthy AI should enable people to decide how their data is used.<n>values and goals of a system should be power aware and seek to minimize harm.<n>People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and personal security.

Summarized length: 320
The previous attempt was not effective: by default, generate() decodes deterministically and stops once the model emits an end-of-sequence token, so raising max_new_tokens alone does not produce a longer summary. A different approach, known as ‘sampling’, will be employed instead. With sampling, the model selects each next word by drawing from its conditional probability distribution, rather than always choosing the single most likely continuation.
The ‘temperature’ parameter will also be adjusted; it rescales that distribution, with higher values increasing the randomness and creativity of the generated output.
set_seed(118511)
summarized = model.generate(**batch, do_sample=True, temperature=0.8, top_k=0)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427

Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data.

Summarized length: 193
The output is shorter but exhibits higher quality. Increasing the temperature may further improve results.
set_seed(108814)
summarized = model.generate(**batch, do_sample=True, temperature=1.0, top_k=0)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427

Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and personal security.<n>We need to mandate transparency so that we can fully understand these systems and their potential for harm.

Summarized length: 325
Another generation approach, top_k sampling, will now be explored. This method limits the model’s consideration to only the top ‘k’ most probable next words, rather than the entire vocabulary.
This technique helps guide the model towards plausible continuations, minimizing the generation of irrelevant or nonsensical text.
It balances creativity and coherence by restricting the selection of next words without making the output entirely deterministic.
set_seed(226012)
summarized = model.generate(**batch, do_sample=True, top_k=50)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
Article text length: 1427

Mozilla's "Trustworthy AI" Thinking Points look at ethical issues surrounding automated decision making.<n>values and goals of a system should be power aware and seek to minimize harm.People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and personal security.

Summarized length: 355
Finally, top_p sampling, also known as nucleus sampling, will be attempted. This strategy involves the model considering only the smallest set of top words whose cumulative probability surpasses a defined threshold ‘p’.
In contrast to top_k, which uses a fixed number of words, top_p dynamically adjusts based on the probability distribution of the next word, offering greater flexibility. This approach fosters diverse and coherent text by permitting the selection of less probable words when the most probable ones do not meet the ‘p’ threshold.
set_seed(21420041)
summarized = model.generate(**batch, do_sample=True, top_p=0.9, top_k=50)
summarized_decoded = tokenizer.batch_decode(summarized, skip_special_tokens=True)
summarized_text = summarized_decoded[0]
compare(content, summarized_text)
# saving this for later.
pegasus_summarized_text = summarized_text
Article text length: 1427

Mozilla's "Trustworthy AI" Thinking Points:.<n>People should have agency and control over their data and algorithmic outputs.<n>Developers need to implement strong measures to protect our data and personal security.<n>We need to mandate transparency so that we can fully understand these systems and their potential for harm.

Summarized length: 325
To further explore the code example, test another model, and understand how to evaluate ML model results (a separate section), access the Python Notebook and select ‘Open in Colab’ to experiment with custom code.
This guide will undergo continuous updates, with upcoming sections planned for Data Retrieval, Image Generation, and Fine Tuning.
Developer Contributions Are Vital
Following the launch of the Mozilla AI Guide, community contribution guidelines will be released. These guidelines will detail the types of content developers can contribute and how to share it, encouraging submissions of open-source AI projects, implementations, and video/audio models.
This initiative aims to foster a cohesive, collaborative, and responsible AI community.

