    Are AI agents ready for the workplace? A new benchmark raises doubts.

    By Samuel Alejandro, January 22, 2026

    Almost two years ago, Microsoft CEO Satya Nadella predicted that AI would take over knowledge work: the white-collar jobs held by lawyers, investment bankers, librarians, accountants, and IT specialists.

    Despite significant advances in foundation models, the impact on knowledge work has been gradual. Models now excel at in-depth research and agentic planning, yet most white-collar roles have changed very little.

    This slow adoption has been one of the major puzzles in AI, and new research from Mercor, a prominent training-data company, is now offering some answers.

    The study examines how top AI models perform on real-world white-collar tasks in fields like consulting, investment banking, and law. The result is a new benchmark called Apex-Agents, on which the models from every AI lab currently receive failing scores. When presented with questions from actual professionals, even the most advanced models answered correctly less than 25% of the time, frequently returning a wrong answer or no answer at all.

    Mercor CEO Brendan Foody, who worked on the paper, noted that the primary challenge for these models was tracking down information across multiple domains, a critical part of most human knowledge work.

    Foody explained that the benchmark’s environment was designed to mimic real professional services. Professionals, he said, typically work across multiple tools such as Slack and Google Drive rather than receiving all their context from a single source. Many agentic AI models still handle this kind of multi-domain reasoning inconsistently.

    The scenarios were written by working professionals from Mercor’s expert marketplace, who both posed the queries and set the criteria for a successful answer. The questions are publicly available on Hugging Face, and reading through them gives a sense of how complex the tasks are.

    An example question from the “Law” section is:

    During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor….Under Northstar’s own policies, it can reasonably treat the one or two log exports as consistent with Article 49?

    The correct response is yes, but reaching that conclusion requires a thorough evaluation of the company’s internal policies and the applicable EU privacy regulations.

    Such a question could challenge even a knowledgeable human, but the researchers aimed to simulate the work that professionals in these fields actually do. If a large language model (LLM) could consistently answer these questions, it might replace many of the lawyers working today. Foody commented that this is likely the most important question facing the economy, and that the benchmark closely reflects the real work these professionals do.

    OpenAI previously tried to assess professional skills with its GDPval benchmark, but the Apex-Agents test differs in key ways. While GDPval evaluates general knowledge across many professions, Apex-Agents measures a system’s capacity to carry out sustained tasks within a narrow set of high-value professions. That makes it harder for models but also more directly relevant to whether these jobs can be automated.

    Although no model proved ready to work as an investment banker, some came closer than others. Gemini 3 Flash achieved the highest one-shot accuracy at 24%, with GPT-5.2 close behind at 23%. Opus 4.5, Gemini 3 Pro, and GPT-5 each scored approximately 18%.

    Despite these early shortcomings, the AI sector has a track record of quickly surpassing difficult benchmarks. With the Apex-Agents test now public, it stands as an open challenge to AI labs that are confident they can do better, an outcome Foody anticipates in the coming months.

    Foody told TechCrunch that the models are improving very quickly. He likened current AI performance to an intern who gets the answer right 25% of the time, compared with 5-10% a year ago. Improvement at that pace, he noted, can have an impact very soon.
