
AI's White-Collar Revolution: Why Knowledge Work Remains Unchanged Despite Breakthroughs



By admin | Jan 22, 2026 | 5 min read



Nearly two years ago, Microsoft CEO Satya Nadella forecast that artificial intelligence would eventually take over knowledge work—the domain of white-collar professionals like lawyers, investment bankers, librarians, accountants, and IT specialists. Yet despite significant advances in foundation models, this transformation has been slow to materialize. While AI models have become adept at deep research and agentic planning, most knowledge-based roles have remained largely untouched. This disconnect stands as one of the major puzzles in AI today.

New research from training-data leader Mercor is beginning to shed light on this mystery. The study examines how top AI models handle real-world tasks drawn from consulting, investment banking, and law. The outcome is a new benchmark called Apex-Agents—and so far, every AI lab is falling short. When presented with queries from actual professionals, even the strongest models struggled to answer more than a quarter of the questions correctly. Most of the time, models either returned incorrect answers or failed to respond at all.

According to researcher Brendan Foody, who contributed to the paper, the models’ primary weakness lies in synthesizing information across multiple domains—a skill essential to human knowledge work. “In real jobs, you aren’t handed all the context in one place,” Foody explained. “You’re navigating across tools like Slack, Google Drive, and various other platforms.” For many agentic AI models, this kind of cross-domain reasoning remains inconsistent and unreliable.


The test scenarios were developed with input from professionals on Mercor’s expert marketplace, who both designed the queries and defined what constituted a successful response. Reviewing the publicly available questions on Hugging Face reveals just how intricate these tasks can be. For example, one legal scenario asks:

*During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor…. Under Northstar’s own policies, it can reasonably treat the one or two log exports as consistent with Article 49.*

The correct answer is “yes,” but arriving at it requires a detailed analysis of both the company’s internal policies and relevant EU privacy regulations. Such a question could challenge even a well-informed human, but the researchers aimed to simulate the actual work performed by legal professionals. If a large language model could reliably answer questions like these, it could potentially replace many lawyers working today. “This benchmark closely reflects the real work these professionals do,” Foody noted.

OpenAI previously attempted to gauge professional capabilities with its GDPval benchmark, but the Apex-Agents test differs in key ways. While GDPval assesses broad general knowledge across many fields, Apex-Agents evaluates a system’s ability to perform sustained, specialized tasks within a select set of high-value professions. This makes the benchmark more difficult for AI models—and more directly relevant to the question of whether these jobs can be automated.

Although no model proved ready to step into the role of an investment banker, some performed notably better than others. Gemini 3 Flash led the group with 24% one-shot accuracy, followed closely by GPT-5.2 at 23%. Opus 4.5, Gemini 3 Pro, and GPT-5 all scored around 18%. While these initial results are modest, the AI field has a track record of rapidly overcoming difficult benchmarks. Now that the Apex test is public, it stands as an open challenge for AI labs confident they can improve—an outcome Foody fully anticipates in the coming months.
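For readers unfamiliar with the metric, "one-shot accuracy" here is simply the fraction of benchmark questions a model answers correctly on its first attempt, with no retries. A minimal sketch of how such a leaderboard could be tallied (the model names, question IDs, and grades below are made up for illustration, not real Apex-Agents data):

```python
from collections import defaultdict

# Each record: (model_name, question_id, graded_correct_on_first_try)
# Illustrative placeholder data only.
graded_runs = [
    ("model-a", "q1", True), ("model-a", "q2", False),
    ("model-a", "q3", False), ("model-a", "q4", False),
    ("model-b", "q1", True), ("model-b", "q2", True),
    ("model-b", "q3", False), ("model-b", "q4", False),
]

def one_shot_accuracy(runs):
    """Fraction of questions each model answered correctly on its first attempt."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for model, _qid, ok in runs:
        totals[model] += 1
        correct[model] += ok  # True counts as 1, False as 0
    return {m: correct[m] / totals[m] for m in totals}

print(one_shot_accuracy(graded_runs))
# → {'model-a': 0.25, 'model-b': 0.5}
```

A score of 0.24, like Gemini 3 Flash's reported result, would mean the model produced a fully correct answer to roughly one in four expert-written questions on its first try.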

“Right now, it’s fair to say the AI is like an intern who gets it right a quarter of the time,” Foody observed. “But last year, it was the intern who got it right five or ten percent of the time. That kind of year-over-year improvement can lead to meaningful impact very quickly.”



