    Visual Document Retrieval Goes Multilingual

    By Samuel Alejandro · February 5, 2026

    A new multilingual embedding model, vdr-2b-multi-v1, has been introduced for visual document retrieval across various languages and domains. The model converts document page screenshots into dense single-vector representations, enabling efficient search over visually rich multilingual documents without OCR, data extraction pipelines, or chunking.


    The vdr-2b-multi-v1 model builds upon MrLight/dse-qwen2-2b-mrl-v1 and was developed in collaboration with LlamaIndex. Trained on a comprehensive, custom-built dataset of multilingual query-image pairs, it is an improved iteration of mcdse-2b-v1, with refined training techniques that yield better performance.

    • Trained on Italian, Spanish, English, French, and German: The model utilizes a new large, open-source, multilingual training dataset comprising 500,000 high-quality samples.

    • Low VRAM and Faster Inference: On synthetic Visual Document Retrieval (ViDoRe) benchmarks, the English-only model outperforms the base model while using only 768 image patches instead of 2560, yielding roughly three times faster inference and significantly lower VRAM consumption.

    • Cross-lingual Retrieval: Cross-lingual retrieval capabilities are significantly enhanced for real-world applications, allowing users to search for documents in one language (e.g., German) using queries in another (e.g., Italian).

    • Matryoshka Representation Learning: This enables a three-fold reduction in vector size while maintaining 98% of the embedding quality. This feature facilitates faster retrieval speeds and lowers storage expenses.

    Usage

    The vdr-2b-multi-v1 model can be tested on its Hugging Face Space.

    Generating embeddings with vdr-2b-multi-v1 is streamlined through direct integrations with SentenceTransformers and LlamaIndex. Implementation requires minimal code:

    via LlamaIndex

    pip install -U llama-index-embeddings-huggingface
    
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    
    model = HuggingFaceEmbedding(
        model_name="llamaindex/vdr-2b-multi-v1",
        device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
        trust_remote_code=True,
    )
    
    image_embedding = model.get_image_embedding("image.png")
    query_embedding = model.get_query_embedding("Chi ha inventato Bitcoin?")
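
    The resulting vectors can be compared directly with cosine similarity. A minimal follow-up sketch (the helper below is illustrative, not part of the integration):

    import numpy as np

    # Score a query against a page by the cosine of their embedding vectors
    def cosine_sim(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_sim(query_embedding, image_embedding))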
    

    via SentenceTransformers

    import torch
    from sentence_transformers import SentenceTransformer
    
    model = SentenceTransformer(
        model_name_or_path="llamaindex/vdr-2b-multi-v1",
        device="cuda",
        trust_remote_code=True,
        # These are the recommended kwargs for the model, but change them as needed if you don't have CUDA
        model_kwargs={
            "torch_dtype": torch.bfloat16, 
            "device_map": "cuda:0", 
            "attn_implementation": "flash_attention_2"
        },
    )
    
    embeddings = model.encode("image.png")
    

    Training Dataset

    Developing effective single-vector models for visual document retrieval necessitates high-quality data. However, existing multimodal datasets are limited and often lack multilingual support.

    To address this, a new dataset was created from public internet PDFs, comprising 500,000 multilingual query-image samples. The queries for each image are synthetically generated using Visual Language Models (VLMs). This dataset is ten times larger than the previous largest open-source synthetic dataset for multimodal visual document retrieval, the ColPali training dataset.


    Data Gathering

    For each language, a comprehensive list of search queries covering diverse topics was generated and used to find PDFs. Language filtering capabilities of search engines were employed to ensure documents were specific to the target language. This topic-based search approach ensures the model is exposed to a wide range of subjects and domains, enhancing its real-world performance.

    Approximately 50,000 multilingual documents were collected. Unlike the random page extraction method used for the previous mcdse-2b-v1 model, each PDF page underwent document layout analysis. This analysis classified pages as text-only, visual-only, or mixed, allowing for the sampling of approximately 100,000 pages with an even distribution across these types.
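
    A toy sketch of that even sampling, assuming a layout-analysis step has already labeled each page (all names here are illustrative):

    import random
    from collections import defaultdict

    def sample_even(pages, per_type, seed=0):
        """pages: list of (page, label) pairs, label in {"text", "visual", "mixed"}."""
        random.seed(seed)
        by_type = defaultdict(list)
        for page, label in pages:
            by_type[label].append(page)
        # Draw the same number of pages from each layout class
        return [p for group in by_type.values()
                for p in random.sample(group, min(per_type, len(group)))]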

    Synthetic Generation

    Queries were generated using `gemini-1.5-pro` and `Qwen2-VL-72B`, with instructions to create both specific and general questions. While only specific questions were used for model training, the process of distinguishing between question types often yielded more robust specific questions for information retrieval.

    Following generation, a cleaning process was applied to ensure query quality for training (a toy sketch follows the list). This involved:

    • Ensuring the language is correct
    • Fixing formatting problems
    • Removing markdown
    • Ensuring that only one question is posed
    • Removing grounding phrases (e.g. “according to Figure 1”, “this document”, …)
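
    A toy version of such a cleaning pass (the patterns are illustrative, not the ones actually used to build the dataset):

    import re

    # Example grounding phrases to strip; the real list is not published
    GROUNDING = re.compile(r"according to (figure|table) \d+|this document", re.I)

    def clean_query(q):
        q = re.sub(r"[*_`#]+", "", q)             # drop markdown markup
        q = GROUNDING.sub("", q)                  # drop grounding phrases
        q = re.sub(r"\s{2,}", " ", q).strip()     # normalize whitespace
        return q if q.count("?") == 1 else None   # keep single-question queries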

    Filtering and Hard-Negative Mining

    While the initial cleaning ensured syntactic correctness and adherence to guidelines, it did not guarantee the queries’ suitability for information retrieval.

    To refine the queries, each broad query was embedded and indexed using the `voyage-3` embedding model. For every specific question, the index was searched. A query was deemed ‘good’ if its corresponding broad question appeared within the top 100 results. This technique eliminated low-entropy, duplicate, or overly similar questions, leading to an average removal of 40% of queries from each language dataset.
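
    A sketch of this filter, with embed() standing in for a call to the voyage-3 API (the helper names are hypothetical):

    import numpy as np

    def keep_good_queries(broad, specific, embed, k=100):
        """broad[i] is the general question paired with specific[i]."""
        B = np.stack([embed(q) for q in broad])
        B /= np.linalg.norm(B, axis=1, keepdims=True)   # index of broad queries
        kept = []
        for i, q in enumerate(specific):
            v = np.asarray(embed(q))
            v /= np.linalg.norm(v)
            top_k = np.argsort(-(B @ v))[:k]            # nearest broad questions
            if i in top_k:                              # its own pair made the cut
                kept.append(q)
        return kept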

    Hard negatives were subsequently mined using `voyage-3` on specific questions, applying a fixed threshold of 0.75. Although experiments with positive-aware negative mining (as detailed in nvidia/NV-Retriever-v1) were conducted, this approach appeared to generate negatives that were too simple or distant for this particular dataset.
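
    The exact mining rule is not spelled out, but one plausible reading is: rank candidate pages by voyage-3 similarity to the specific question and keep the closest ones that stay below the 0.75 cutoff, so near-duplicates of the positive are excluded. A sketch under that assumption:

    import numpy as np

    def mine_hard_negatives(query_vec, page_vecs, positive_idx, cutoff=0.75, n=8):
        """query_vec and page_vecs are assumed L2-normalized."""
        sims = page_vecs @ query_vec
        order = np.argsort(-sims)                 # hardest candidates first
        negatives = [i for i in order
                     if i != positive_idx and sims[i] < cutoff]
        return negatives[:n]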

    Download

    The vdr-multilingual-train training dataset is now open-source and accessible on Hugging Face. It contains 496,167 PDF pages; however, only 280,679 of these are linked to the filtered queries. Pages without queries are utilized as hard negatives.

    Language   # filtered queries   # unfiltered queries
    English    53,512               94,225
    Spanish    58,738               102,685
    Italian    54,942               98,747
    German     58,217               100,713
    French     55,270               99,797
    TOTAL      280,679              496,167

    Individual languages can be downloaded by specifying the language subset within the load_dataset function:

    from datasets import load_dataset

    italian_dataset = load_dataset("llamaindex/vdr-multilingual-train", "it", split="train")
    english_dataset = load_dataset("llamaindex/vdr-multilingual-train", "en", split="train")
    french_dataset = load_dataset("llamaindex/vdr-multilingual-train", "fr", split="train")
    german_dataset = load_dataset("llamaindex/vdr-multilingual-train", "de", split="train")
    spanish_dataset = load_dataset("llamaindex/vdr-multilingual-train", "es", split="train")
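
    The subsets can also be combined into a single multilingual training split:

    from datasets import concatenate_datasets

    full_train = concatenate_datasets(
        [italian_dataset, english_dataset, french_dataset,
         german_dataset, spanish_dataset]
    )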
    

    Evaluations


    The model’s performance was assessed using the ViDoRe benchmark and custom evaluation sets designed to test its multilingual abilities across text-only, visual-only, and mixed page screenshots. This evaluation dataset is also publicly available on Hugging Face as vdr-multilingual-test.

    Care was taken to ensure no overlap between the evaluation and training datasets, preventing contamination. The evaluation datasets were compiled using similar methods to the training dataset, albeit with a smaller sample size. The filtering process for evaluation queries was performed manually, with each query being assessed, curated, and refined to guarantee high data quality.

    All evaluations calculate NDCG@5 scores using 1536-dimension vectors and an image resolution capped at 768 image tokens.
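
    For reference, NDCG@5 for a single query can be computed as below; this is the standard formula, not code from the benchmark harness:

    import math

    def ndcg_at_5(ranked_rels):
        """ranked_rels: relevance of each retrieved page, best-ranked first."""
        def dcg(rels):
            return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:5]))
        ideal = dcg(sorted(ranked_rels, reverse=True))
        return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

    print(ndcg_at_5([0, 1, 0, 0, 0]))  # relevant page at rank 2 -> ~0.63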

                          Avg    French (text)   French (visual)   French (mix)
    dse-qwen2-2b-mrl-v1   93.5   94.7            90.8              95.1
    vdr-2b-multi-v1       95.6   95.6            93.3              97.9
    Improvement: +2.2%

                          Avg    German (text)   German (visual)   German (mix)
    dse-qwen2-2b-mrl-v1   93.0   93.4            90.0              95.5
    vdr-2b-multi-v1       96.2   94.8            95.7              98.1
    Improvement: +3.4%

                          Avg    Italian (text)   Italian (visual)   Italian (mix)
    dse-qwen2-2b-mrl-v1   95.1   95.1             94.0               96.2
    vdr-2b-multi-v1       97.0   96.4             96.3               98.4
    Improvement: +2%

                          Avg    Spanish (text)   Spanish (visual)   Spanish (mix)
    dse-qwen2-2b-mrl-v1   96.7   97.2             94.7               98.2
    vdr-2b-multi-v1       98.1   98.3             96.9               99.1
    Improvement: +1.4%

                          Avg    English (text)   English (visual)   English (mix)
    dse-qwen2-2b-mrl-v1   98.0   98.3             98.5               97.1
    vdr-2b-multi-v1       98.1   97.9             99.1               97.3
    Improvement: +0.1%

    The multilingual model consistently surpasses the base model across all languages and page types, with an average improvement of +2.3%. It also shows a slight improvement (+0.5%) on the ViDoRe benchmark. The fine-tuned vdr-2b-multi-v1 demonstrates significant performance gains, particularly for non-English visual-only or mixed pages, such as a +6.33% NDCG@5 improvement for German visual-only retrieval compared to the base model.

    An English-only version, vdr-2b-v1, was also trained. When evaluated on the complete ViDoRe benchmark using 768 image tokens, both the multilingual and English-only models surpassed the base model.

                          Avg    shiftproject   government   healthcare   energy   ai     docvqa   arxivqa   tatdqa   infovqa   tabfquad
    dse-qwen2-2b-mrl-v1   83.6   79.8           95.7         96.9         92.0     98.2   56.3     85.2      53.9     87.5      90.3
    vdr-2b-multi-v1       84.0   82.4           95.5         96.5         91.2     98.5   58.5     84.7      53.6     87.1      92.2
    vdr-2b-v1             84.3   83.4           96.9         97.2         92.6     96.8   57.4     85.1      54.1     87.9      91.3

    Faster Inference


    The English-only vdr-2b-v1 model achieves performance comparable to the base model on ViDoRe benchmark synthetic datasets, but with only 30% of the image tokens (768 versus 2560). This leads to three times faster inference and significantly reduced VRAM consumption.
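
    With Qwen2-VL-based encoders, the token budget maps to a pixel cap: each visual token covers a 28x28 pixel patch, so 768 tokens corresponds to max_pixels = 768 * 28 * 28. A sketch of applying that cap, assuming the model repository ships a standard Qwen2-VL processor config:

    from transformers import AutoProcessor

    # Cap input resolution at roughly 768 visual tokens
    processor = AutoProcessor.from_pretrained(
        "llamaindex/vdr-2b-v1",
        max_pixels=768 * 28 * 28,
    )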

                                                Avg    shiftproject   government   healthcare   energy   ai
    dse-qwen2-2b-mrl-v1 (2560 image tokens)     93.0   82             96           96.4         92.9     97.5
    vdr-2b-v1 (768 image tokens)                93.4   83.4           96.9         97.2         92.6     96.8

    Cross-Lingual Retrieval

    Despite being trained on each language independently, the model also shows improvements in cross-lingual retrieval. To assess this, German evaluation set queries were translated into Italian using DeepL, while the document page screenshots remained in their original German language.

                          Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
    dse-qwen2-2b-mrl-v1   93.1   92.6                       93.5                         93.3
    vdr-2b-multi-v1       95.3   95.0                       95.8                         95.1
    Improvement: +2.3%

    The model demonstrates notable superiority across all document types, with an average improvement of +2.3%. Such retrieval capabilities are crucial for practical applications, particularly in regions with diverse languages like Europe. This enables language-independent searches across complex multilingual documents, including European binding decisions, instruction manuals, financial asset KIDs, and pharmaceutical package leaflets.

    MRL and Binary Embeddings

    This model incorporates Matryoshka Representation Learning (MRL). The training loss function is designed to monitor performance across various dimensions, prompting the model to prioritize essential identifying information. This allows for scaling down embedding dimensions based on specific requirements and budget. Further details on MRL can be found in this Hugging Face blog post.
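
    In practice, MRL-trained embeddings can simply be truncated to their leading dimensions and re-normalized. A minimal sketch (recent sentence-transformers releases also accept a truncate_dim argument at load time):

    import numpy as np

    def shrink(vec, dim=1024):
        """Keep the leading MRL dimensions and re-normalize for cosine search."""
        v = np.asarray(vec, dtype=np.float32)[:dim]
        return v / np.linalg.norm(v)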

    To evaluate the model’s retrieval performance with varying vector dimensions, tests were conducted using the Italian-to-German cross-lingual benchmark.

    NDCG@5 (float)

                          Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
    1536 dimensions
    dse-qwen2-2b-mrl-v1   93.1   92.6                       93.5                         93.3
    vdr-2b-multi-v1       95.3   95.0                       95.9                         95.1
    Improvement: +2.3%
    1024 dimensions
    dse-qwen2-2b-mrl-v1   92.2   90.9                       92.3                         93.5
    vdr-2b-multi-v1       94.6   93.1                       95.7                         95.1
    Improvement: +2.5%
    512 dimensions
    dse-qwen2-2b-mrl-v1   89.8   87.9                       89.4                         92.2
    vdr-2b-multi-v1       93.0   91.1                       93.4                         94.5
    Improvement: +3.4%

    NDCG@5 (binary)

                          Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
    1536 dimensions
    dse-qwen2-2b-mrl-v1   89.8   88.2                       90.3                         90.8
    vdr-2b-multi-v1       92.3   89.6                       94.1                         93.3
    Improvement: +2.8%
    1024 dimensions
    dse-qwen2-2b-mrl-v1   86.7   84.9                       88.2                         86.9
    vdr-2b-multi-v1       90.8   87.0                       92.6                         92.8
    Improvement: +4.6%
    512 dimensions
    dse-qwen2-2b-mrl-v1   79.2   80.6                       81.7                         75.4
    vdr-2b-multi-v1       82.6   77.7                       86.7                         83.3
    Improvement: +4.0%

    Float vectors with 1024 dimensions provide an excellent balance of quality and size, being approximately 30% smaller while retaining 99% of retrieval performance. Similarly, 1536-dimension binary vectors, despite having ten times fewer bytes per vector, maintain 97% of their retrieval quality. Notably, 1536 binary vectors nearly achieve the performance of the base model’s 1536 float vectors.
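
    Binary quantization itself is just a sign threshold plus bit packing, with retrieval scored by Hamming distance. A minimal sketch:

    import numpy as np

    def binarize(vec):
        bits = (np.asarray(vec) > 0).astype(np.uint8)   # sign -> {0, 1}
        return np.packbits(bits)                        # 8 dimensions per byte

    def hamming_distance(a, b):
        return int(np.unpackbits(a ^ b).sum())          # lower = more similar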

    Conclusions and Next Steps

    The vdr-2b-multi-v1 and vdr-2b-v1 models are expected to be highly beneficial for a wide range of users.

    This pioneering multilingual model substantially enhances performance in multilingual and cross-lingual contexts. With MRL and binary quantization, retrieval becomes faster and more efficient, which should open new applications and opportunities, particularly in linguistically diverse regions like Europe.

    The English-only counterpart offers a significant upgrade over the base model, embedding documents three times faster with reduced VRAM usage, while maintaining or improving retrieval quality.

    These advancements are made possible by the new vdr-multilingual-train dataset, which, with 500,000 high-quality samples, stands as the largest open-source synthetic dataset for visual document retrieval.

    Future research will investigate the models’ performance when adapted to new and specialized domains. While still in early development, initial tests indicate substantial retrieval improvements with very minimal data and computational resources.

    Links

    • 🎲 Model demo: Hugging Face Space
    • 🤗 Multilingual model: vdr-2b-multi-v1
    • 🤗 English-only model: vdr-2b-v1
    • 📂 Training dataset: vdr-multilingual-train
    • 📂 Evaluation dataset: vdr-multilingual-test