Almost two years ago, Microsoft CEO Satya Nadella predicted that AI would take over knowledge work: white-collar professions such as lawyers, investment bankers, librarians, accountants, and IT specialists.
Despite significant advances in foundation models, the impact on knowledge work has been gradual: models excel at in-depth research and agentic planning, yet most white-collar roles have seen little change.
This slow adoption has been one of AI's bigger puzzles, and new research from Mercor, a prominent training-data company, offers some clues as to why.
The study examines how top AI models perform on real-world white-collar tasks in fields like consulting, investment banking, and law. The result is a new benchmark called Apex-Agents, on which every frontier model currently earns a failing grade. When presented with questions drawn from actual professionals' work, even the most advanced models answered correctly less than 25% of the time, often returning wrong answers or none at all.
Mercor CEO Brendan Foody, one of the researchers behind the paper, said the models' primary struggle was retrieving information across multiple domains, a critical part of most human knowledge work.
Foody explained that the benchmark’s environment was designed to mimic real professional services. He stated that professionals typically operate across multiple tools like Slack and Google Drive, rather than receiving all context from a single source. This multi-domain reasoning remains inconsistent for many agentic AI models.
The scenarios were developed by actual professionals from Mercor's expert marketplace, who wrote the queries and defined the criteria for a successful answer. Reviewing the questions, which are available on Hugging Face, gives a sense of how complex these tasks are.
An example question from the “Law” section is:
During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor…. Under Northstar’s own policies, can it reasonably treat the one or two log exports as consistent with Article 49?
The correct answer is yes, but reaching that conclusion requires a careful evaluation of the company’s internal policies and the applicable EU privacy rules.
Such a question could challenge even a knowledgeable human, but the researchers aimed to simulate the work performed by field professionals. If a large language model (LLM) could consistently answer these questions, it might replace numerous lawyers. Foody commented that this is likely the most crucial economic topic, and the benchmark accurately reflects the actual work done by these professionals.
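For anyone who wants to browse the full question set released on Hugging Face, the sketch below shows one way to pull and filter it with the Hugging Face datasets library. The dataset identifier, split, and column names are assumptions for illustration, not Mercor's published schema.

```python
# Minimal sketch of browsing the benchmark tasks with the Hugging Face
# "datasets" library. The dataset ID, split, and column names below are
# assumptions for illustration; check Mercor's Hugging Face page for the
# actual identifiers before running this.
from datasets import load_dataset

ds = load_dataset("mercor/apex-agents", split="test")  # hypothetical ID and split

# Print the first law scenario along with its grading criteria.
for row in ds:
    if row.get("domain", "").lower() == "law":          # assumed column name
        print(row.get("prompt"))                         # assumed column name
        print("Grading criteria:", row.get("grading_criteria"))
        break
```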
OpenAI previously tried to assess professional skills with its GDPval benchmark, but the Apex-Agents test differs in key ways. While GDPval evaluates general knowledge across many professions, Apex-Agents measures a system’s capacity for sustained tasks within specific high-value professions. That makes it harder for models, but also more directly relevant to whether these jobs can be automated.
Although no model looked ready to work as an investment banker, some performed better than others. Gemini 3 Flash achieved the highest one-shot accuracy at 24%, with GPT-5.2 close behind at 23%. Opus 4.5, Gemini 3 Pro, and GPT-5 each scored roughly 18%.
Despite these initial shortcomings, the AI sector has a track record of rapidly surpassing difficult benchmarks. With the Apex test now public, it serves as an open challenge for AI labs confident in their ability to improve, an outcome Foody anticipates in the coming months.
Foody told TechCrunch that improvement is happening very quickly. He likened current AI performance to an intern who is right 25% of the time, up from 5% to 10% a year ago, and noted that gains of that size, year over year, can change things fast.

