AI Agents Surge in Professional Tasks as New Model Nears 30% Capability



By admin | Feb 06, 2026 | 3 min read


Last month, I discussed Mercor's latest benchmark, which evaluates AI agents on professional tasks such as legal work and corporate analysis. At that point, the results were quite low, with every major lab scoring below 25%, leading to the conclusion that lawyers remained secure from AI replacement for the time being. However, AI capabilities can evolve significantly in just a few weeks.

This week's release of Opus 4.6 dramatically altered the rankings. Anthropic's new model scored just under 30% in one-shot trials and averaged 45% when allowed multiple attempts. The release introduced several new agentic features, including "agent swarms," which likely contributed to the improved performance on these multi-step tasks. Whatever the cause, this marks a substantial leap from prior top scores and signals that progress in foundation models continues unabated.

Mercor CEO Brendan Foody expressed particular astonishment, stating, "jumping from 18.4% to 29.8% in a few months is insane."

The APEX-Agents Leaderboard

A score of 30% is still far from perfect, so lawyers aren't facing imminent replacement by machines. Nonetheless, they should feel considerably less assured than they did just a month ago.



