
    BigCodeBench: The Next Generation of HumanEval

    By Samuel Alejandro | February 7, 2026

    HumanEval is a reference benchmark for evaluating large language models (LLMs) on code generation tasks, making the evaluation of compact function-level code snippets easy. However, there are growing concerns about its effectiveness in evaluating the programming capabilities of LLMs, primarily because tasks in HumanEval are often considered too simple and may not represent real-world programming challenges. Compared to the algorithm-oriented tasks in HumanEval, real-world software development frequently involves diverse libraries and function calls. Furthermore, LLMs’ performance on HumanEval is subject to contamination and overfitting issues, which reduces its reliability for evaluating the generalization of LLMs.

    While some efforts have been made to address these issues, they are often domain-specific, deterministic, or agent-centric (e.g., DS-1000, ODEX, and SWE-bench). The community still lacks an easy-to-use benchmark that can broadly evaluate the programming capabilities of LLMs, which is the focus of BigCodeBench.

    BigCodeBench has been released to evaluate LLMs on practical and challenging programming tasks without contamination. Specifically, BigCodeBench contains 1,140 function-level tasks that challenge LLMs to follow instructions and compose multiple function calls, used as tools, from 139 libraries. To evaluate LLMs rigorously, each programming task includes an average of 5.6 test cases with a mean branch coverage of 99%.

    What do the tasks in BigCodeBench look like? 🕵️‍♂️

    [Figure: an example task from BigCodeBench]

    BigCodeBench features complex, user-oriented instructions for each task, including clear functionality descriptions, input/output formats, error handling, and verified interactive examples. Step-by-step task instructions are avoided; capable LLMs are expected to understand and solve tasks from the user’s perspective in an open-ended manner. Specific features are verified using test cases.
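
    Since the original task figure does not reproduce here, the stub below sketches what such a BigCodeBench-Complete prompt might look like. The function name task_func, its HTTPS-request behavior, and the docstring wording are inferred from the test harness shown next, not copied from the dataset.

    # Hypothetical BigCodeBench-Complete-style prompt (illustrative only):
    # a function stub whose docstring spells out functionality, parameters,
    # return value, error handling, and a verified example. The imports are
    # the setup the task provides to the model.
    import http.client
    import socket
    import ssl

    def task_func(SERVER_NAME, SERVER_PORT, path):
        """
        Makes an HTTPS GET request to a specified server and path, and returns the response body.

        Parameters:
            SERVER_NAME (str): The name of the server to connect to.
            SERVER_PORT (int): The port to connect on, typically 443.
            path (str): The path to request, e.g. '/content/path'.

        Returns:
            str: The decoded body of the response.

        Raises:
            ssl.SSLError: If there is an error with the SSL handshake.

        Example:
            >>> task_func('www.example.com', 443, '/')  # doctest: +SKIP
            '<!doctype html>...'
        """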

    # We elaborate the above task with some test cases:
    
    # Requirements SetUp
    import unittest
    from unittest.mock import patch
    import http.client
    import ssl
    import socket
    
    # Start the test
    class TestCases(unittest.TestCase):
    
        # Mock the successful connection and assess the response content
        @patch('http.client.HTTPSConnection')
        def test_response_content(self, mock_conn):
            """ Test the content of the response. """
            mock_conn.return_value.getresponse.return_value.read.return_value = b'Expected Content'
            result = task_func('www.example.com', 443, '/content/path')
            self.assertEqual(result, 'Expected Content')
    
        # Mock the failed connection and assess the error handling
        @patch('socket.create_connection')
        @patch('http.client.HTTPSConnection')
        def test_ssl_handshake_error_handling(self, mock_conn, mock_socket):
            """ Test handling of SSL handshake errors. """
            mock_socket.side_effect = ssl.SSLError('SSL handshake failed')
            with self.assertRaises(ssl.SSLError):
                task_func('badssl.com', 443, '/test/path')
    
        # More test cases...
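
    For reference, one possible implementation of task_func is sketched below: it issues an HTTPS GET with http.client and decodes the body. This is only an illustration; the benchmark's ground-truth solution may differ, for example in how it opens the socket and surfaces SSL errors.

    # One possible implementation (illustrative only, not the benchmark's ground truth).
    import http.client
    import ssl

    def task_func(SERVER_NAME, SERVER_PORT, path):
        """Fetch `path` from SERVER_NAME over HTTPS and return the decoded response body."""
        context = ssl.create_default_context()
        conn = http.client.HTTPSConnection(SERVER_NAME, SERVER_PORT, context=context)
        try:
            conn.request("GET", path)
            response = conn.getresponse()
            return response.read().decode()
        finally:
            conn.close()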
    

    Tasks in BigCodeBench utilize diverse function calls from popular libraries. Function calls that LLMs can use are not restricted, allowing them to choose appropriate functions and combine them flexibly to solve tasks. Test cases are designed as test harnesses to examine expected program behaviors during runtime.

    To assess LLM performance, Pass@1 with greedy decoding is used, measuring the percentage of tasks correctly solved with the first generated code snippet via curated test cases. This approach aligns with benchmarks like HumanEval and MBPP. The tendency of LLMs to skip long code prompts is addressed by adding missing setups (e.g., import statements, global constants) during Pass@1 evaluation, referred to as calibrated Pass@1.
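
    As a rough illustration of that calibration step, the sketch below prepends each task's setup code (imports, global constants) to the model's completion before running the tests, then reports the fraction of tasks solved. The field names and the run_tests helper are placeholders for illustration, not part of the bigcodebench package.

    # Minimal sketch of calibrated Pass@1 under greedy decoding (one sample per task).
    # `run_tests(candidate, test_code)` is a hypothetical helper that executes the
    # candidate solution against the task's unittest harness and returns True if
    # every test passes.
    def calibrated_pass_at_1(tasks, completions, run_tests):
        solved = 0
        for task, completion in zip(tasks, completions):
            # Calibration: re-attach the setup the model may have skipped
            # (import statements, global constants) before execution.
            candidate = task["setup"] + "\n" + completion
            if run_tests(candidate, task["test"]):
                solved += 1
        return 100.0 * solved / len(tasks)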

    [Figure: comparison of BigCodeBench tasks with those in representative benchmarks]

    To better understand implementation complexity and tool-use diversity, tasks in BigCodeBench are compared with those in representative benchmarks, including APPS, DS-1000, ODEX, APIBench, MBPP, NumpyEval, PandasEval, HumanEval, and TorchDataEval. BigCodeBench is found to require more complex reasoning and problem-solving skills to implement comprehensive functionalities.

    [Figure: BigCodeBench prompt formats]

    As shown in the task figure, the main target scenario is code completion (denoted as BigCodeBench-Complete), where LLMs are required to finish the implementation of a function based on detailed instructions in the docstring. However, considering downstream applications such as multi-turn dialogue, users may describe requirements in a more conversational and less verbose manner. Instruction-tuned LLMs are beneficial in this context, as they are trained to follow natural-language instructions and generate code snippets accordingly. To test if models can truly understand human intents and translate them into code, BigCodeBench-Instruct was created, a more challenging variant of BigCodeBench designed to evaluate instruction-tuned LLMs.
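
    To make the contrast concrete, a hypothetical pair of prompts for the same HTTPS-request task is shown below; the exact wording in the dataset differs, and the strings here are illustrative only.

    # Hypothetical prompt pair for one task (illustrative, not copied from the dataset).

    # BigCodeBench-Complete: a code prompt the model must finish, guided by the docstring.
    complete_prompt = '''\
    import http.client
    import ssl

    def task_func(SERVER_NAME, SERVER_PORT, path):
        """
        Makes an HTTPS GET request to a specified server and path,
        and returns the decoded response body.
        Raises ssl.SSLError if the SSL handshake fails.
        """
    '''

    # BigCodeBench-Instruct: the same requirement, phrased as a shorter
    # natural-language instruction for an instruction-tuned model.
    instruct_prompt = (
        "Write a function task_func(SERVER_NAME, SERVER_PORT, path) that makes an "
        "HTTPS GET request to the given server and path and returns the decoded "
        "response body, raising ssl.SSLError if the handshake fails."
    )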

    Where do the tasks come from? 🤔


    The quality of tasks in BigCodeBench is guaranteed through a systematic “Human-LLM collaboration process.” The process begins with ODEX as the “seed dataset,” which contains short but realistic human intents and corresponding Python one-liners from Stack Overflow. GPT-4 is used to expand these one-liners into comprehensive function-level tasks.

    Subsequently, 20 human experts—most with over 5 years of Python programming experience—voluntarily guide GPT-4 in an execution-based sandbox. These experts continually instruct it to refine the synthesized tasks and add test cases. The tasks and test cases are then examined in a local environment, pre-evaluated on other LLMs, and cross-checked by 7 additional human experts to ensure their quality.

    To assess overall quality, a sample of tasks was given to 11 human experts to solve; they achieved an average performance of 97%.

    How well do LLMs perform on BigCodeBench? 📊

    The BigCodeBench leaderboard is hosted on both Hugging Face Space and GitHub Pages. The Hugging Face leaderboard is used as an example.

    Instruction-tuned LLMs like GPT-4 have been observed to omit essential import statements in the long prompts of BigCodeBench-Complete, leading to task failures due to missing modules and constants. This behavior, called “model laziness”, is discussed in the community.

    Compared to human performance, LLMs perform significantly lower on BigCodeBench-Complete and even lower on BigCodeBench-Instruct. The best model (GPT-4o) achieves a calibrated Pass@1 of 61.1% on BigCodeBench-Complete and 51.1% on BigCodeBench-Instruct. Additionally, there is a notable performance gap between closed and open LLMs.

    While Pass@1 is a good metric for overall performance, it is not detailed enough to compare models directly. Inspired by Chatbot Arena, Elo rating is used to rank models on BigCodeBench-Complete. This method, originally used in chess, ranks players based on their game performance. It is adapted to programming tasks by treating each task as a game and each model as a player. The Elo rating updates are based on game outcomes and expectations, using task-level calibrated Pass@1 (0% or 100%) and excluding ties. Starting from an initial Elo rating of 1000, the ratings are fitted using maximum likelihood estimation and bootstrapped over 500 iterations to obtain the final scores. GPT-4o has been found to outperform other models by a large margin, with DeepSeekCoder-V2 in the second tier.
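
    A toy sketch of this task-level Elo idea is shown below: every (model A, model B, task) triple is treated as a game, ties are skipped, and ratings start at 1000. The leaderboard itself fits ratings by maximum likelihood and bootstraps them 500 times; this sequential update rule only conveys the intuition, and the results structure is made up.

    import itertools
    import random

    def elo_ratings(results, k=4.0, initial=1000.0, seed=0):
        """results: dict mapping model name -> list of 0/1 calibrated outcomes per task."""
        ratings = {model: initial for model in results}
        n_tasks = len(next(iter(results.values())))
        # Every pair of models "plays" every task once.
        games = [(a, b, t) for a, b in itertools.combinations(results, 2)
                 for t in range(n_tasks)]
        random.Random(seed).shuffle(games)  # order should not favour any model
        for a, b, t in games:
            score_a = results[a][t]
            if score_a == results[b][t]:
                continue  # tie: both solved or both failed this task
            expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
            ratings[a] += k * (score_a - expected_a)
            ratings[b] -= k * (score_a - expected_a)
        return ratings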

    To help the community understand model performance on each task, solve rates are tracked, measured by calibrated Pass@1. On BigCodeBench-Complete, 149 tasks remain unsolved by all models, while 6 tasks are completely solved. For BigCodeBench-Instruct, 278 tasks remain unsolved and 14 tasks are fully solved by all models. The significant number of unsolved tasks and the small number of fully solved tasks show that BigCodeBench is a challenging benchmark for LLMs.
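
    Per-task solve rates, and the unsolved and fully solved counts above, can be derived from the same 0/1 outcome matrix; the structure of results below is again illustrative.

    def task_solve_stats(results):
        """results: dict mapping model name -> {task_id: 0 or 1} calibrated outcomes."""
        task_ids = next(iter(results.values())).keys()
        solve_rates = {
            task_id: 100.0 * sum(per_model[task_id] for per_model in results.values()) / len(results)
            for task_id in task_ids
        }
        unsolved = [t for t, rate in solve_rates.items() if rate == 0.0]
        fully_solved = [t for t, rate in solve_rates.items() if rate == 100.0]
        return solve_rates, unsolved, fully_solved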

    Great! So, how can I evaluate my model on BigCodeBench? 🛠️

    BigCodeBench is made easily accessible to the community through a simple and user-friendly evaluation framework, which can be installed from PyPI. The prototype of the evaluation framework is based on EvalPlus, used for the HumanEval+ and MBPP+ benchmarks. However, since BigCodeBench features tasks with much more diverse library dependencies than EvalPlus, a less resource-constrained execution environment was built, and the test harness was adapted to unittest.

    To facilitate evaluation, pre-built Docker images for code generation and code execution are provided. The GitHub repository contains more details on how to use the evaluation framework.

    Setup

    # Install to use bigcodebench.evaluate
    pip install bigcodebench --upgrade
    # If you want to run the evaluation locally, you need to install the requirements
    pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
    
    # Install to use bigcodebench.generate
    # You are strongly recommended to install the [generate] dependencies in a separate environment
    pip install bigcodebench[generate] --upgrade
    

    Code Generation

    It is suggested to use flash-attn for generating code samples.

    pip install -U flash-attn
    

    To generate code samples from a model, you can use the following command:

    bigcodebench.generate \
        --model [model_name] \
        --subset [complete|instruct] \
        --greedy \
        --bs [bs] \
        --temperature [temp] \
        --n_samples [n_samples] \
        --resume \
        --backend [vllm|hf|openai|mistral|anthropic|google] \
        --tp [gpu_number] \
        [--trust_remote_code] \
        [--base_url [base_url]]
    

    The generated code samples will be stored in a file named [model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl.

    Code Post-processing

    LLM-generated text may not be compilable code, as it can include natural-language lines or incomplete extra code. A tool, bigcodebench.sanitize, is provided to clean up the code:

    # 💡 If you want to store calibrated code in jsonl:
    bigcodebench.sanitize --samples samples.jsonl --calibrate
    # Sanitized code will be produced to `samples-sanitized-calibrated.jsonl`
    
    # 💡 If you want to sanitize without calibration:
    bigcodebench.sanitize --samples samples.jsonl
    # Sanitized code will be produced to `samples-sanitized.jsonl`
    
    # 💡 If you are storing code in directories:
    bigcodebench.sanitize --samples /path/to/vicuna-[??]b_temp_[??]
    # Sanitized code will be produced to `/path/to/vicuna-[??]b_temp_[??]-sanitized`
    

    Code Evaluation

    It is strongly recommended to run the evaluation in a sandbox such as Docker:

    # Mount the current directory to the container
    docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated
    
    # ...Or locally ⚠️
    bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated
    
    # ...If the ground truth is not working locally (due to some flaky tests)
    bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated --no-gt
    

    What’s next?

    A long-term roadmap is shared to address the limitations of BigCodeBench and build it sustainably with the community. The goal is to provide the community with the most open, reliable, and scalable evaluations to truly understand the fundamental capabilities of LLMs for programming and pinpoint ways to unleash their power. Specifically, the following aspects of BigCodeBench are planned for enhancement:

    • Multilingualism: Currently, BigCodeBench is Python-only and cannot be easily extended to other programming languages. Since function calls are mostly language-specific, finding packages or libraries with the same functionalities in languages other than Python is challenging.

    • Rigorousness: While high test coverage is achieved for ground-truth solutions in BigCodeBench, it does not guarantee that all code solutions generated by LLMs will be correctly assessed against existing test cases. Previous works like EvalPlus have attempted to extend limited test cases by augmenting input-output pairs via LLM- and mutation-based strategies. However, adapting EvalPlus to the test harness in BigCodeBench is challenging. While EvalPlus emphasizes input-output assertions, most of the test harnesses in BigCodeBench require non-trivial configurations (e.g., mock patching) to examine expected program behaviors during runtime.

    • Generalization: A key question is, “How well do the models generalize to unseen tools and tasks?” Currently, BigCodeBench covers common libraries and daily programming tasks. Benchmarking models on programming tasks that use emerging libraries like transformers and langchain would be more interesting.

    • Evolution: Libraries can become obsolete or be updated, meaning the source code data for model training will constantly evolve. Models may not memorize function calls from deprecated library versions, posing a challenge for any tool-dependent programming benchmarks to correctly examine model capabilities without periodic updates. Another related concern is test set contamination due to evolving training data.

    • Interaction: Recent interest focuses on the concept of LLMs as Agents, which is seen as a path toward artificial general intelligence. Specifically, LLMs will be grounded in a less constrained sandbox environment, where they can interact with applications such as web browsers and terminals. This environment can help unlock capabilities like self-debugging and self-reflection.

    Community feedback and contributions to building BigCodeBench in the long run are highly anticipated. 🤗

    Resources

    All BigCodeBench artifacts, including tasks, test cases, evaluation framework, and leaderboard, are open-sourced. They can be found as follows:

    • GitHub Repository
    • HF Data Viewer
    • HF Dataset
    • HF Leaderboard
    • GitHub Pages Leaderboard

    For any questions or suggestions, please open an issue in the repository or get in touch via [email protected] or [email protected].

    Citation

    If these evaluations are found useful, please consider citing the work:

    @article{zhuo2024bigcodebench,
      title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
      author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
      journal={arXiv preprint arXiv:2406.15877},
      year={2024}
    }
    