From the Daily Californian: In collaboration with more than 300 industry experts, UC Berkeley researchers have released a new benchmark testing AI capabilities in more than 50 industries. Of the models tested, OpenAI’s GPT-5.5 scored the highest, but only with a 24% pass rate. The benchmark, dubbed Agents’ Last Exam, or ALE, is led by the Berkeley Center for Responsible, Decentralized Intelligence. The exam assigns tasks spanning subjects from audio processing to theoretical physics. A rival model, Anthropic’s Claude Fable 5, followed GPT-5.5 at a 22% overall pass rate, with Google Gemini, DeepSeek and Grok all scoring below 16%. Pass rates measure the runs in which an AI agent gets a perfect score across all tasks.
The UC Berkeley center is co-directed by computer science professor Dawn Song and Haas School of Business professor Christine Parlour. The ALE project has 13 advisers from academia and industry, across multiple universities and companies.
...The pass rates of these models aren’t high, which [ALE collaborator Zhenglu] Li attributes to a lack of people from different disciplines currently working to train AI models... “My bigger concern is not the pass rate but the way agents fail,” said Benjamin Liu, a Stanford University computer science Ph.D. student and test collaborator, in an email. “They often produce an answer that looks completely plausible but is subtly wrong, and in science a confident wrong answer is more dangerous than no answer, because someone might build on it.” ...