
    BigCodeBench: The Next Generation of HumanEval

    By Samuel Alejandro | February 7, 2026

    HumanEval is a reference benchmark for evaluating large language models (LLMs) on code generation tasks, making the evaluation of compact function-level code snippets easy. However, there are growing concerns about its effectiveness in evaluating the programming capabilities of LLMs, primarily because tasks in HumanEval are often considered too simple and may not represent real-world programming challenges. Compared to the algorithm-oriented tasks in HumanEval, real-world software development frequently involves diverse libraries and function calls. Furthermore, LLMs’ performance on HumanEval is subject to contamination and overfitting issues, which reduces its reliability for evaluating the generalization of LLMs.

    While some efforts have been made to address these issues, they are often domain-specific, deterministic, or agent-centric (e.g., DS-1000, ODEX, and SWE-bench). The community still lacks an easy-to-use benchmark that can broadly evaluate the programming capabilities of LLMs, which is the focus of BigCodeBench.

    BigCodeBench has been released to evaluate LLMs on practical and challenging programming tasks without contamination. Specifically, BigCodeBench contains 1,140 function-level tasks that challenge LLMs to follow instructions and compose multiple function calls, used as tools, from 139 libraries. To evaluate LLMs rigorously, each programming task includes an average of 5.6 test cases with a mean branch coverage of 99%.

    What do the tasks in BigCodeBench look like? 🕵️‍♂️

    [Figure: an example task from BigCodeBench]

    BigCodeBench features complex, user-oriented instructions for each task, including clear functionality descriptions, input/output formats, error handling, and verified interactive examples. Step-by-step task instructions are avoided; capable LLMs are expected to understand and solve tasks from the user’s perspective in an open-ended manner. Specific features are verified using test cases.
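
    Since the original task figure does not reproduce here, the stub below sketches what such a BigCodeBench-Complete prompt might look like. The function name task_func, its HTTPS-request behavior, and the docstring wording are inferred from the test harness shown next, not copied from the dataset.

    # Hypothetical BigCodeBench-Complete-style prompt (illustrative only):
    # a function stub whose docstring spells out functionality, parameters,
    # return value, error handling, and a verified example. The imports are
    # the setup the task provides to the model.
    import http.client
    import socket
    import ssl

    def task_func(SERVER_NAME, SERVER_PORT, path):
        """
        Makes an HTTPS GET request to a specified server and path, and returns the response body.

        Parameters:
            SERVER_NAME (str): The name of the server to connect to.
            SERVER_PORT (int): The port to connect on, typically 443.
            path (str): The path to request, e.g. '/content/path'.

        Returns:
            str: The decoded body of the response.

        Raises:
            ssl.SSLError: If there is an error with the SSL handshake.

        Example:
            >>> task_func('www.example.com', 443, '/')  # doctest: +SKIP
            '<!doctype html>...'
        """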

    # We elaborate the above task with some test cases:
    
    # Requirements SetUp
    import unittest
    from unittest.mock import patch
    import http.client
    import ssl
    import socket
    
    # Start the test
    class TestCases(unittest.TestCase):
    
        # Mock the successful connection and assess the response content
        @patch('http.client.HTTPSConnection')
        def test_response_content(self, mock_conn):
            """ Test the content of the response. """
            mock_conn.return_value.getresponse.return_value.read.return_value = b'Expected Content'
            result = task_func('www.example.com', 443, '/content/path')
            self.assertEqual(result, 'Expected Content')
    
        # Mock the failed connection and assess the error handling
        @patch('socket.create_connection')
        @patch('http.client.HTTPSConnection')
        def test_ssl_handshake_error_handling(self, mock_conn, mock_socket):
            """ Test handling of SSL handshake errors. """
            mock_socket.side_effect = ssl.SSLError('SSL handshake failed')
            with self.assertRaises(ssl.SSLError):
                task_func('badssl.com', 443, '/test/path')
    
        # More test cases...
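
    For reference, one possible implementation of task_func is sketched below: it issues an HTTPS GET with http.client and decodes the body. This is only an illustration; the benchmark's ground-truth solution may differ, for example in how it opens the socket and surfaces SSL errors.

    # One possible implementation (illustrative only, not the benchmark's ground truth).
    import http.client
    import ssl

    def task_func(SERVER_NAME, SERVER_PORT, path):
        """Fetch `path` from SERVER_NAME over HTTPS and return the decoded response body."""
        context = ssl.create_default_context()
        conn = http.client.HTTPSConnection(SERVER_NAME, SERVER_PORT, context=context)
        try:
            conn.request("GET", path)
            response = conn.getresponse()
            return response.read().decode()
        finally:
            conn.close()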
    

    Tasks in BigCodeBench utilize diverse function calls from popular libraries. Function calls that LLMs can use are not restricted, allowing them to choose appropriate functions and combine them flexibly to solve tasks. Test cases are designed as test harnesses to examine expected program behaviors during runtime.

    To assess LLM performance, Pass@1 with greedy decoding is used, measuring the percentage of tasks correctly solved with the first generated code snippet via curated test cases. This approach aligns with benchmarks like HumanEval and MBPP. The tendency of LLMs to skip long code prompts is addressed by adding missing setups (e.g., import statements, global constants) during Pass@1 evaluation, referred to as calibrated Pass@1.
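
    As a rough illustration of that calibration step, the sketch below prepends each task's setup code (imports, global constants) to the model's completion before running the tests, then reports the fraction of tasks solved. The field names and the run_tests helper are placeholders for illustration, not part of the bigcodebench package.

    # Minimal sketch of calibrated Pass@1 under greedy decoding (one sample per task).
    # `run_tests(candidate, test_code)` is a hypothetical helper that executes the
    # candidate solution against the task's unittest harness and returns True if
    # every test passes.
    def calibrated_pass_at_1(tasks, completions, run_tests):
        solved = 0
        for task, completion in zip(tasks, completions):
            # Calibration: re-attach the setup the model may have skipped
            # (import statements, global constants) before execution.
            candidate = task["setup"] + "\n" + completion
            if run_tests(candidate, task["test"]):
                solved += 1
        return 100.0 * solved / len(tasks)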

    [Figure: comparison of BigCodeBench tasks with those in representative benchmarks]

    To better understand implementation complexity and tool-use diversity, tasks in BigCodeBench are compared with those in representative benchmarks, including APPS, DS-1000, ODEX, APIBench, MBPP, NumpyEval, PandasEval, HumanEval, and TorchDataEval. BigCodeBench is found to require more complex reasoning and problem-solving skills to implement comprehensive functionalities.

    [Figure: BigCodeBench prompt formats]

    As shown in the task figure, the main target scenario is code completion (denoted as BigCodeBench-Complete), where LLMs are required to finish the implementation of a function based on detailed instructions in the docstring. However, considering downstream applications such as multi-turn dialogue, users may describe requirements in a more conversational and less verbose manner. Instruction-tuned LLMs are beneficial in this context, as they are trained to follow natural-language instructions and generate code snippets accordingly. To test if models can truly understand human intents and translate them into code, BigCodeBench-Instruct was created, a more challenging variant of BigCodeBench designed to evaluate instruction-tuned LLMs.
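
    To make the contrast concrete, a hypothetical pair of prompts for the same HTTPS-request task is shown below; the exact wording in the dataset differs, and the strings here are illustrative only.

    # Hypothetical prompt pair for one task (illustrative, not copied from the dataset).

    # BigCodeBench-Complete: a code prompt the model must finish, guided by the docstring.
    complete_prompt = '''\
    import http.client
    import ssl

    def task_func(SERVER_NAME, SERVER_PORT, path):
        """
        Makes an HTTPS GET request to a specified server and path,
        and returns the decoded response body.
        Raises ssl.SSLError if the SSL handshake fails.
        """
    '''

    # BigCodeBench-Instruct: the same requirement, phrased as a shorter
    # natural-language instruction for an instruction-tuned model.
    instruct_prompt = (
        "Write a function task_func(SERVER_NAME, SERVER_PORT, path) that makes an "
        "HTTPS GET request to the given server and path and returns the decoded "
        "response body, raising ssl.SSLError if the handshake fails."
    )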

    Where do the tasks come from? 🤔


    The quality of tasks in BigCodeBench is guaranteed through a systematic “Human-LLM collaboration process.” The process begins with ODEX as the “seed dataset,” which contains short but realistic human intents and corresponding Python one-liners from Stack Overflow. GPT-4 is used to expand these one-liners into comprehensive function-level tasks.

    Subsequently, 20 human experts—most with over 5 years of Python programming experience—voluntarily guide GPT-4 in an execution-based sandbox. These experts continually instruct it to refine the synthesized tasks and add test cases. The tasks and test cases are then examined in a local environment, pre-evaluated on other LLMs, and cross-checked by 7 additional human experts to ensure their quality.

    To assess overall quality, a sample of tasks was given to 11 human experts to solve; they achieved an average performance of 97%.

    How well do LLMs perform on BigCodeBench? 📊

    The BigCodeBench leaderboard is hosted on both Hugging Face Space and GitHub Pages. The Hugging Face leaderboard is used as an example.

    Instruction-tuned LLMs like GPT-4 have been observed to omit essential import statements in the long prompts of BigCodeBench-Complete, leading to task failures due to missing modules and constants. This behavior, called “model laziness”, is discussed in the community.

    Compared to human performance, LLMs perform significantly lower on BigCodeBench-Complete and even lower on BigCodeBench-Instruct. The best model (GPT-4o) achieves a calibrated Pass@1 of 61.1% on BigCodeBench-Complete and 51.1% on BigCodeBench-Instruct. Additionally, there is a notable performance gap between closed and open LLMs.

    While Pass@1 is a good metric for overall performance, it is not detailed enough to compare models directly. Inspired by Chatbot Arena, Elo rating is used to rank models on BigCodeBench-Complete. This method, originally used in chess, ranks players based on their game performance. It is adapted to programming tasks by treating each task as a game and each model as a player. The Elo rating updates are based on game outcomes and expectations, using task-level calibrated Pass@1 (0% or 100%) and excluding ties. Starting from an initial Elo rating of 1000, the ratings are fitted using maximum likelihood estimation and bootstrapped over 500 iterations to obtain the final scores. GPT-4o has been found to outperform other models by a large margin, with DeepSeekCoder-V2 in the second tier.
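
    A toy sketch of this task-level Elo idea is shown below: every (model A, model B, task) triple is treated as a game, ties are skipped, and ratings start at 1000. The leaderboard itself fits ratings by maximum likelihood and bootstraps them 500 times; this sequential update rule only conveys the intuition, and the results structure is made up.

    import itertools
    import random

    def elo_ratings(results, k=4.0, initial=1000.0, seed=0):
        """results: dict mapping model name -> list of 0/1 calibrated outcomes per task."""
        ratings = {model: initial for model in results}
        n_tasks = len(next(iter(results.values())))
        # Every pair of models "plays" every task once.
        games = [(a, b, t) for a, b in itertools.combinations(results, 2)
                 for t in range(n_tasks)]
        random.Random(seed).shuffle(games)  # order should not favour any model
        for a, b, t in games:
            score_a = results[a][t]
            if score_a == results[b][t]:
                continue  # tie: both solved or both failed this task
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            ratings[a] += k * (score_a - expected_a)
            ratings[b] -= k * (score_a - expected_a)
        return ratings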

    To help the community understand model performance on each task, solve rates are tracked, measured by calibrated Pass@1. On BigCodeBench-Complete, 149 tasks remain unsolved by all models, while 6 tasks are completely solved. For BigCodeBench-Instruct, 278 tasks remain unsolved and 14 tasks are fully solved by all models. The significant number of unsolved tasks and the small number of fully solved tasks show that BigCodeBench is a challenging benchmark for LLMs.
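
    Per-task solve rates, and the unsolved and fully solved counts above, can be derived from the same 0/1 outcome matrix; the structure of results below is again illustrative.

    def task_solve_stats(results):
        """results: dict mapping model name -> {task_id: 0 or 1} calibrated outcomes."""
        task_ids = next(iter(results.values())).keys()
        solve_rates = {
            task_id: 100.0 * sum(per_model[task_id] for per_model in results.values()) / len(results)
            for task_id in task_ids
        }
        unsolved = [t for t, rate in solve_rates.items() if rate == 0.0]
        fully_solved = [t for t, rate in solve_rates.items() if rate == 100.0]
        return solve_rates, unsolved, fully_solved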

    Great! So, how can I evaluate my model on BigCodeBench? 🛠️

    BigCodeBench is made easily accessible to the community through a simple and user-friendly evaluation framework, which can be installed from PyPI. The prototype of the evaluation framework is based on EvalPlus, used for the HumanEval+ and MBPP+ benchmarks. However, since BigCodeBench features tasks with much more diverse library dependencies than EvalPlus, a less resource-constrained execution environment was built, and the test harness was adapted to unittest.

    To facilitate evaluation, pre-built Docker images for code generation and code execution are provided. The GitHub repository contains more details on how to use the evaluation framework.

    Setup

    # Install to use bigcodebench.evaluate
    pip install bigcodebench --upgrade
    # If you want to run the evaluation locally, you need to install the requirements
    pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
    
    # Install to use bigcodebench.generate
    # You are strongly recommended to install the [generate] dependencies in a separate environment
    pip install bigcodebench[generate] --upgrade
    

    Code Generation

    It is suggested to use flash-attn for generating code samples.

    pip install -U flash-attn
    

    To generate code samples from a model, you can use the following command:

    bigcodebench.generate \
        --model [model_name] \
        --subset [complete|instruct] \
        --greedy \
        --bs [bs] \
        --temperature [temp] \
        --n_samples [n_samples] \
        --resume \
        --backend [vllm|hf|openai|mistral|anthropic|google] \
        --tp [gpu_number] \
        [--trust_remote_code] \
        [--base_url [base_url]]
    

    The generated code samples will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl.

    Code Post-processing

    LLM-generated text may not be compilable code, as it can include natural-language lines or incomplete extra code. A tool, bigcodebench.sanitize, is provided to clean up the code:

    # 💡 If you want to store calibrated code in jsonl:
    bigcodebench.sanitize --samples samples.jsonl --calibrate
    # Sanitized code will be produced to `samples-sanitized-calibrated.jsonl`
    
    # 💡 If you want to sanitize without calibration:
    bigcodebench.sanitize --samples samples.jsonl
    # Sanitized code will be produced to `samples-sanitized.jsonl`
    
    # 💡 If you are storing code in directories:
    bigcodebench.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
    # Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
    

    Code Evaluation

    It is strongly recommended to run the evaluation in a sandbox such as Docker:

    # Mount the current directory to the container
    docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated
    
    # ...Or locally ⚠️
    bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated
    
    # ...If the ground truth is not working locally (due to some flaky tests)
    bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated --no-gt
    

    What’s next?

    A long-term roadmap is shared to address the limitations of BigCodeBench and build it sustainably with the community. The goal is to provide the community with the most open, reliable, and scalable evaluations to truly understand the fundamental capabilities of LLMs for programming and pinpoint ways to unleash their power. Specifically, the following aspects of BigCodeBench are planned for enhancement:

    • Multilingualism: Currently, BigCodeBench is Python-only and cannot be easily extended to other programming languages. Since function calls are mostly language-specific, finding packages or libraries with the same functionalities in languages other than Python is challenging.

    • Rigorousness: While high test coverage is achieved for ground-truth solutions in BigCodeBench, it does not guarantee that all code solutions generated by LLMs will be correctly assessed against existing test cases. Previous works like EvalPlus have attempted to extend limited test cases by augmenting input-output pairs via LLM- and mutation-based strategies. However, adapting EvalPlus to the test harness in BigCodeBench is challenging. While EvalPlus emphasizes input-output assertions, most of the test harnesses in BigCodeBench require non-trivial configurations (e.g., mock patching) to examine expected program behaviors during runtime.

    • Generalization: A key question is, “How well do the models generalize to unseen tools and tasks?” Currently, BigCodeBench covers common libraries and daily programming tasks. Benchmarking models on programming tasks that use emerging libraries like transformers and langchain would be more interesting.

    • Evolution: Libraries can become obsolete or be updated, meaning the source code data for model training will constantly evolve. Models may not memorize function calls from deprecated library versions, posing a challenge for any tool-dependent programming benchmarks to correctly examine model capabilities without periodic updates. Another related concern is test set contamination due to evolving training data.

    • Interaction: Recent interest focuses on the concept of LLMs as Agents, which is seen as a path toward artificial general intelligence. Specifically, LLMs will be grounded in a less constrained sandbox environment, where they can interact with applications such as web browsers and terminals. This environment can help unlock capabilities like self-debugging and self-reflection.

    Community feedback and contributions to building BigCodeBench in the long run are highly anticipated. 🤗

    Resources

    All BigCodeBench artifacts, including tasks, test cases, evaluation framework, and leaderboard, are open-sourced. They can be found as follows:

    • GitHub Repository
    • HF Data Viewer
    • HF Dataset
    • HF Leaderboard
    • GitHub Pages Leaderboard

    For any questions or suggestions, please open an issue in the repository or get in touch via [email protected] or [email protected].

    Citation

    If these evaluations are found useful, please consider citing the work:

    @article{zhuo2024bigcodebench,
      title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
      author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
      journal={arXiv preprint arXiv:2406.15877},
      year={2024}
    }
    