    Visual Document Retrieval Goes Multilingual

    By Samuel Alejandro · February 5, 2026

    A new multilingual embedding model, vdr-2b-multi-v1, has been introduced for visual document retrieval across various languages and domains. The model converts document page screenshots into dense single-vector representations, enabling efficient search over visually rich multilingual documents without OCR, data extraction pipelines, or chunking.


    The vdr-2b-multi-v1 model builds upon MrLight/dse-qwen2-2b-mrl-v1 and was developed in collaboration with LlamaIndex. Trained on a comprehensive, custom-built dataset of multilingual query-image pairs, it is an improved iteration of mcdse-2b-v1, with refined training techniques that yield better performance.

    • Trained on Italian, Spanish, English, French, and German: The model utilizes a new large, open-source, multilingual training dataset comprising 500,000 high-quality samples.

    • Low VRAM and Faster Inference: On synthetic Visual Document Retrieval (ViDoRe) benchmarks, the English-only model outperforms the base model while using only 768 image patches instead of 2560, yielding roughly three times faster inference and significantly lower VRAM consumption.

    • Cross-lingual Retrieval: Cross-lingual retrieval capabilities are significantly enhanced for real-world applications, allowing users to search for documents in one language (e.g., German) using queries in another (e.g., Italian).

    • Matryoshka Representation Learning: This enables a three-fold reduction in vector size while maintaining 98% of the embedding quality. This feature facilitates faster retrieval speeds and lowers storage expenses.

    Usage

    The vdr-2b-multi-v1 model can be tested on its Hugging Face Space.

    Generating embeddings with vdr-2b-multi-v1 is streamlined through direct integrations with SentenceTransformers and LlamaIndex. Implementation requires minimal code:

    via LlamaIndex

    pip install -U llama-index-embeddings-huggingface
    
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    
    model = HuggingFaceEmbedding(
        model_name="llamaindex/vdr-2b-multi-v1",
        device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
        trust_remote_code=True,
    )
    
    image_embedding = model.get_image_embedding("image.png")
    query_embedding = model.get_query_embedding("Chi ha inventato Bitcoin?")
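
    The resulting vectors can be compared directly with cosine similarity. A minimal follow-up sketch (the helper below is illustrative, not part of the integration):

    import numpy as np

    # Score a query against a page by the cosine of their embedding vectors
    def cosine_sim(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_sim(query_embedding, image_embedding))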
    

    via SentenceTransformers

    import torch
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer(
        model_name_or_path="llamaindex/vdr-2b-multi-v1",
        device="cuda",
        trust_remote_code=True,
        # These are the recommended kwargs for the model, but change them as needed if you don't have CUDA
        model_kwargs={
            "torch_dtype": torch.bfloat16, 
            "device_map": "cuda:0", 
            "attn_implementation": "flash_attention_2"
        },
    )
    
    embeddings = model.encode("image.png")
    

    Training Dataset

    Developing effective single-vector models for visual document retrieval necessitates high-quality data. However, existing multimodal datasets are limited and often lack multilingual support.

    To address this, a new dataset was created from public internet PDFs, comprising 500,000 multilingual query-image samples. The queries for each image are synthetically generated using Visual Language Models (VLMs). This dataset is ten times larger than the previous largest open-source synthetic dataset for multimodal visual document retrieval, the ColPali training dataset.


    Data Gathering

    For each language, a comprehensive list of search queries covering diverse topics was generated and used to find PDFs. Language filtering capabilities of search engines were employed to ensure documents were specific to the target language. This topic-based search approach ensures the model is exposed to a wide range of subjects and domains, enhancing its real-world performance.

    Approximately 50,000 multilingual documents were collected. Unlike the random page extraction method used for the previous mcdse-2b-v1 model, each PDF page underwent document layout analysis. This analysis classified pages as text-only, visual-only, or mixed, allowing for the sampling of approximately 100,000 pages with an even distribution across these types.
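
    A toy sketch of that even sampling, assuming a layout-analysis step has already labeled each page (all names here are illustrative):

    import random
    from collections import defaultdict

    def sample_even(pages, per_type, seed=0):
        """pages: list of (page, label) pairs, label in {"text", "visual", "mixed"}."""
        random.seed(seed)
        by_type = defaultdict(list)
        for page, label in pages:
            by_type[label].append(page)
        # Draw the same number of pages from each layout class
        return [p for group in by_type.values()
                for p in random.sample(group, min(per_type, len(group)))]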

    Synthetic Generation

    Queries were generated using `gemini-1.5-pro` and `Qwen2-VL-72B`, with instructions to create both specific and general questions. While only specific questions were used for model training, the process of distinguishing between question types often yielded more robust specific questions for information retrieval.

    Following generation, a cleaning process was applied to ensure query quality for training (a toy sketch follows the list). This involved:

    • Ensuring the language is correct
    • Fixing formatting problems
    • Removing markdown
    • Ensuring that only one question is posed
    • Removing grounding phrases (e.g. “according to Figure 1”, “this document”, …)
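
    A toy version of such a cleaning pass (the patterns are illustrative, not the ones actually used to build the dataset):

    import re

    # Example grounding phrases to strip; the real list is not published
    GROUNDING = re.compile(r"according to (figure|table) \d+|this document", re.I)

    def clean_query(q):
        q = re.sub(r"[*_`#]+", "", q)             # drop markdown markup
        q = GROUNDING.sub("", q)                  # drop grounding phrases
        q = re.sub(r"\s{2,}", " ", q).strip()     # normalize whitespace
        return q if q.count("?") == 1 else None   # keep single-question queries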

    Filtering and Hard-Negative Mining

    While the initial cleaning ensured syntactic correctness and adherence to guidelines, it did not guarantee the queries’ suitability for information retrieval.

    To refine the queries, each broad query was embedded and indexed using the `voyage-3` embedding model. For every specific question, the index was searched. A query was deemed ‘good’ if its corresponding broad question appeared within the top 100 results. This technique eliminated low-entropy, duplicate, or overly similar questions, leading to an average removal of 40% of queries from each language dataset.
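
    A sketch of this filter, with embed() standing in for a call to the voyage-3 API (the helper names are hypothetical):

    import numpy as np

    def keep_good_queries(broad, specific, embed, k=100):
        """broad[i] is the general question paired with specific[i]."""
        B = np.stack([embed(q) for q in broad])
        B /= np.linalg.norm(B, axis=1, keepdims=True)   # index of broad queries
        kept = []
        for i, q in enumerate(specific):
            v = np.asarray(embed(q))
            v /= np.linalg.norm(v)
            top_k = np.argsort(-(B @ v))[:k]            # nearest broad questions
            if i in top_k:                              # its own pair made the cut
                kept.append(q)
        return kept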

    Hard negatives were subsequently mined using `voyage-3` on specific questions, applying a fixed threshold of 0.75. Although experiments with positive-aware negative mining (as detailed in nvidia/NV-Retriever-v1) were conducted, this approach appeared to generate negatives that were too simple or distant for this particular dataset.
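
    The exact mining rule is not spelled out, but one plausible reading is: rank candidate pages by voyage-3 similarity to the specific question and keep the closest ones that stay below the 0.75 cutoff, so near-duplicates of the positive are excluded. A sketch under that assumption:

    import numpy as np

    def mine_hard_negatives(query_vec, page_vecs, positive_idx, cutoff=0.75, n=8):
        """query_vec and page_vecs are assumed L2-normalized."""
        sims = page_vecs @ query_vec
        order = np.argsort(-sims)                 # hardest candidates first
        negatives = [i for i in order
                     if i != positive_idx and sims[i] < cutoff]
        return negatives[:n]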

    Download

    The vdr-multilingual-train training dataset is now open-source and accessible on Hugging Face. It contains 496,167 PDF pages; however, only 280,679 of these are linked to the filtered queries. Pages without queries are utilized as hard negatives.

    Language   # filtered queries   # unfiltered queries
    English    53,512               94,225
    Spanish    58,738               102,685
    Italian    54,942               98,747
    German     58,217               100,713
    French     55,270               99,797
    TOTAL      280,679              496,167

    Individual languages can be downloaded by specifying the language subset within the load_dataset function:

    from datasets import load_dataset

    italian_dataset = load_dataset("llamaindex/vdr-multilingual-train", "it", split="train")
    english_dataset = load_dataset("llamaindex/vdr-multilingual-train", "en", split="train")
    french_dataset = load_dataset("llamaindex/vdr-multilingual-train", "fr", split="train")
    german_dataset = load_dataset("llamaindex/vdr-multilingual-train", "de", split="train")
    spanish_dataset = load_dataset("llamaindex/vdr-multilingual-train", "es", split="train")
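
    The subsets can also be combined into a single multilingual training split:

    from datasets import concatenate_datasets

    full_train = concatenate_datasets(
        [italian_dataset, english_dataset, french_dataset,
         german_dataset, spanish_dataset]
    )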
    

    Evaluations


    The model’s performance was assessed using the ViDoRe benchmark and custom evaluation sets designed to test its multilingual abilities across text-only, visual-only, and mixed page screenshots. This evaluation dataset is also publicly available on Hugging Face as vdr-multilingual-test.

    Care was taken to ensure no overlap between the evaluation and training datasets, preventing contamination. The evaluation datasets were compiled using similar methods to the training dataset, albeit with a smaller sample size. The filtering process for evaluation queries was performed manually, with each query being assessed, curated, and refined to guarantee high data quality.

    All evaluations calculate NDCG@5 scores using 1536-dimension vectors and an image resolution capped at 768 image tokens.
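
    For reference, NDCG@5 for a single query can be computed as below; this is the standard formula, not code from the benchmark harness:

    import math

    def ndcg_at_5(ranked_rels):
        """ranked_rels: relevance of each retrieved page, best-ranked first."""
        def dcg(rels):
            return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:5]))
        ideal = dcg(sorted(ranked_rels, reverse=True))
        return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

    print(ndcg_at_5([0, 1, 0, 0, 0]))  # relevant page at rank 2 -> ~0.63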

                          Avg    French (text)   French (visual)   French (mix)
    dse-qwen2-2b-mrl-v1   93.5   94.7            90.8              95.1
    vdr-2b-multi-v1       95.6   95.6            93.3              97.9
    Improvement: +2.2%

                          Avg    German (text)   German (visual)   German (mix)
    dse-qwen2-2b-mrl-v1   93.0   93.4            90.0              95.5
    vdr-2b-multi-v1       96.2   94.8            95.7              98.1
    Improvement: +3.4%

                          Avg    Italian (text)   Italian (visual)   Italian (mix)
    dse-qwen2-2b-mrl-v1   95.1   95.1             94.0               96.2
    vdr-2b-multi-v1       97.0   96.4             96.3               98.4
    Improvement: +2%

                          Avg    Spanish (text)   Spanish (visual)   Spanish (mix)
    dse-qwen2-2b-mrl-v1   96.7   97.2             94.7               98.2
    vdr-2b-multi-v1       98.1   98.3             96.9               99.1
    Improvement: +1.4%

                          Avg    English (text)   English (visual)   English (mix)
    dse-qwen2-2b-mrl-v1   98.0   98.3             98.5               97.1
    vdr-2b-multi-v1       98.1   97.9             99.1               97.3
    Improvement: +0.1%

    The multilingual model consistently surpasses the base model across all languages and page types, with an average improvement of +2.3%. It also shows a slight improvement (+0.5%) on the ViDoRe benchmark. The fine-tuned vdr-2b-multi-v1 demonstrates significant performance gains, particularly for non-English visual-only or mixed pages, such as a +6.33% NDCG@5 improvement for German visual-only retrieval compared to the base model.

    An English-only version, vdr-2b-v1, was also trained. When evaluated on the complete ViDoRe benchmark using 768 image tokens, both the multilingual and English-only models surpassed the base model.

                          Avg    shiftproject   government   healthcare   energy   ai     docvqa   arxivqa   tatdqa   infovqa   tabfquad
    dse-qwen2-2b-mrl-v1   83.6   79.8           95.7         96.9         92.0     98.2   56.3     85.2      53.9     87.5      90.3
    vdr-2b-multi-v1       84.0   82.4           95.5         96.5         91.2     98.5   58.5     84.7      53.6     87.1      92.2
    vdr-2b-v1             84.3   83.4           96.9         97.2         92.6     96.8   57.4     85.1      54.1     87.9      91.3

    Faster Inference


    The English-only vdr-2b-v1 model achieves performance comparable to the base model on ViDoRe benchmark synthetic datasets, but with only 30% of the image tokens (768 versus 2560). This leads to three times faster inference and significantly reduced VRAM consumption.
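
    With Qwen2-VL-based encoders, the token budget maps to a pixel cap: each visual token covers a 28x28 pixel patch, so 768 tokens corresponds to max_pixels = 768 * 28 * 28. A sketch of applying that cap, assuming the model repository ships a standard Qwen2-VL processor config:

    from transformers import AutoProcessor

    # Cap input resolution at roughly 768 visual tokens
    processor = AutoProcessor.from_pretrained(
        "llamaindex/vdr-2b-v1",
        max_pixels=768 * 28 * 28,
    )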

                                                Avg    shiftproject   government   healthcare   energy   ai
    dse-qwen2-2b-mrl-v1 (2560 image tokens)     93.0   82             96           96.4         92.9     97.5
    vdr-2b-v1 (768 image tokens)                93.4   83.4           96.9         97.2         92.6     96.8

    Cross-Lingual Retrieval

    Despite being trained on each language independently, the model also shows improvements in cross-lingual retrieval. To assess this, German evaluation set queries were translated into Italian using DeepL, while the document page screenshots remained in their original German language.

                          Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
    dse-qwen2-2b-mrl-v1   93.1   92.6                       93.5                         93.3
    vdr-2b-multi-v1       95.3   95.0                       95.8                         95.1
    Improvement: +2.3%

    The model demonstrates notable superiority across all document types, with an average improvement of +2.3%. Such retrieval capabilities are crucial for practical applications, particularly in regions with diverse languages like Europe. This enables language-independent searches across complex multilingual documents, including European binding decisions, instruction manuals, financial asset KIDs, and pharmaceutical package leaflets.

    MRL and Binary Embeddings

    This model incorporates Matryoshka Representation Learning (MRL). The training loss function is designed to monitor performance across various dimensions, prompting the model to prioritize essential identifying information. This allows for scaling down embedding dimensions based on specific requirements and budget. Further details on MRL can be found in this Hugging Face blog post.
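
    In practice, MRL-trained embeddings can simply be truncated to their leading dimensions and re-normalized. A minimal sketch (recent sentence-transformers releases also accept a truncate_dim argument at load time):

    import numpy as np

    def shrink(vec, dim=1024):
        """Keep the leading MRL dimensions and re-normalize for cosine search."""
        v = np.asarray(vec, dtype=np.float32)[:dim]
        return v / np.linalg.norm(v)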

    To evaluate the model’s retrieval performance with varying vector dimensions, tests were conducted using the Italian-to-German cross-lingual benchmark.

    NDCG@5 (float)

                          Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
    1536 dimensions
    dse-qwen2-2b-mrl-v1   93.1   92.6                       93.5                         93.3
    vdr-2b-multi-v1       95.3   95.0                       95.9                         95.1
    Improvement: +2.3%
    1024 dimensions
    dse-qwen2-2b-mrl-v1   92.2   90.9                       92.3                         93.5
    vdr-2b-multi-v1       94.6   93.1                       95.7                         95.1
    Improvement: +2.5%
    512 dimensions
    dse-qwen2-2b-mrl-v1   89.8   87.9                       89.4                         92.2
    vdr-2b-multi-v1       93.0   91.1                       93.4                         94.5
    Improvement: +3.4%

    NDCG@5 (binary)

                          Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
    1536 dimensions
    dse-qwen2-2b-mrl-v1   89.8   88.2                       90.3                         90.8
    vdr-2b-multi-v1       92.3   89.6                       94.1                         93.3
    Improvement: +2.8%
    1024 dimensions
    dse-qwen2-2b-mrl-v1   86.7   84.9                       88.2                         86.9
    vdr-2b-multi-v1       90.8   87.0                       92.6                         92.8
    Improvement: +4.6%
    512 dimensions
    dse-qwen2-2b-mrl-v1   79.2   80.6                       81.7                         75.4
    vdr-2b-multi-v1       82.6   77.7                       86.7                         83.3
    Improvement: +4.0%

    Float vectors with 1024 dimensions provide an excellent balance of quality and size, being approximately 30% smaller while retaining 99% of retrieval performance. Similarly, 1536-dimension binary vectors, despite having ten times fewer bytes per vector, maintain 97% of their retrieval quality. Notably, 1536 binary vectors nearly achieve the performance of the base model’s 1536 float vectors.
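
    Binary quantization itself is just a sign threshold plus bit packing, with retrieval scored by Hamming distance. A minimal sketch:

    import numpy as np

    def binarize(vec):
        bits = (np.asarray(vec) > 0).astype(np.uint8)   # sign -> {0, 1}
        return np.packbits(bits)                        # 8 dimensions per byte

    def hamming_distance(a, b):
        return int(np.unpackbits(a ^ b).sum())          # lower = more similar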

    Conclusions and Next Steps

    The vdr-2b-multi-v1 and vdr-2b-v1 models are expected to be highly beneficial for a wide range of users.

    This pioneering multilingual model substantially enhances performance in multilingual and cross-lingual contexts. With MRL and binary quantization, retrieval becomes faster and more efficient, which should open new applications and opportunities, particularly in linguistically diverse regions like Europe.

    The English-only counterpart offers a significant upgrade over the base model, embedding documents three times faster with reduced VRAM usage, while maintaining or improving retrieval quality.

    These advancements are made possible by the new vdr-multilingual-train dataset, which, with 500,000 high-quality samples, stands as the largest open-source synthetic dataset for visual document retrieval.

    Future research will investigate the models’ performance when adapted to new and specialized domains. While still in early development, initial tests indicate substantial retrieval improvements with very minimal data and computational resources.

    Links

    • 🎲 Model demo: Hugging Face Space
    • 🤗 Multilingual model: vdr-2b-multi-v1
    • 🤗 English-only model: vdr-2b-v1
    • 📂 Training dataset: vdr-multilingual-train
    • 📂 Evaluation dataset: vdr-multilingual-test