Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries on this page are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions. When sharing, simply cite the source.

The AI Productivity Index (APEX)

Created by
  • Haebom

Authors

Bertie Vidgen, Abby Fennelly, Evan Pinnix, Chirag Mahapatra, Zach Richards, Austin Bridges, Calix Huang, Ben Hunsberger, Fez Zafar, Brendan Foody, Dominic Barton, Cass R. Sunstein, Eric Topol, Osvald Nitski

Introducing the AI Productivity Index (APEX)

We present the first version of the AI Productivity Index (APEX), a benchmark for assessing whether AI models can perform knowledge work with high economic value. APEX addresses one of the biggest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 includes 200 test cases and covers four domains: investment banking, management consulting, law, and primary care. APEX was built in three phases. First, we recruited experts with top-tier experience, such as investment bankers at Goldman Sachs. Second, the experts wrote prompts reflecting high-value tasks from their daily work. Third, the experts developed scoring criteria to evaluate model responses. Using an LM judge, we evaluated 23 state-of-the-art models on APEX-v1.0. GPT 5 (Thinking = High) achieved the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model, ranking 7th overall. Even the best models show a significant gap in performance compared to human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.
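The evaluation described above (expert-written scoring criteria, a judge that checks each response, and a mean score per model) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not state the exact scoring formula, so the sketch assumes a response's score is the percentage of criteria it meets and that a model's score is the mean over test cases. The names `TestCase`, `keyword_judge`, `score_response`, and `mean_score` are hypothetical, and `keyword_judge` is a toy substring check standing in for a real LM judge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class TestCase:
    prompt: str
    criteria: list[str]  # expert-written scoring criteria (hypothetical structure)


# A judge takes (response, criterion) and returns True if the criterion is met.
Judge = Callable[[str, str], bool]


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LM judge: passes if the criterion text appears in the response."""
    return criterion.lower() in response.lower()


def score_response(response: str, case: TestCase, judge: Judge) -> float:
    """Percentage of criteria the judge marks as met (assumed scoring rule)."""
    if not case.criteria:
        raise ValueError("test case has no criteria")
    passed = sum(judge(response, c) for c in case.criteria)
    return 100.0 * passed / len(case.criteria)


def mean_score(responses: list[str], cases: list[TestCase], judge: Judge) -> float:
    """Mean per-case score for one model, given one response per test case."""
    if len(responses) != len(cases):
        raise ValueError("need exactly one response per test case")
    return mean(score_response(r, c, judge) for r, c in zip(responses, cases))


# Made-up illustrative data, not taken from APEX.
cases = [
    TestCase("Value this company.", ["discount rate", "terminal value"]),
    TestCase("Assess this patient.", ["differential diagnosis"]),
]
responses = [
    "We apply a discount rate of 10% to the projected cash flows.",
    "Differential diagnosis: viral infection or seasonal allergy.",
]
print(mean_score(responses, cases, keyword_judge))  # 75.0
```

In practice the `judge` argument would wrap a call to a language model that reads the response and one criterion at a time; keeping it as a plain callable makes the scoring logic testable without any model access.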

Takeaways, Limitations

Takeaways:
  • APEX is a benchmark for assessing whether AI models can perform knowledge work with high economic value.
  • GPT 5 achieved the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (60.4%).
  • Among open-source models, Qwen 3 235B performs best, ranking 7th overall.
  • A significant performance gap remains between the top models and human experts.
Limitations:
  • APEX-v1.0 is limited to four domains: investment banking, management consulting, law, and primary care.
  • Further expansion to more domains and tasks is needed in the future.