The rise of AI ‘reasoning’ models is making benchmarking more expensive
Kyle Wiggers · 6:30 AM PDT · April 10, 2025
AI labs like OpenAI claim that their so-called “reasoning” AI models, which
can “think” through problems step by step, are more capable than their
non-reasoning counterparts in specific domains, such as physics. But
while this generally appears to be the case, reasoning models are also
much more expensive to benchmark, making it difficult to independently
verify these claims.
According to data from Artificial Analysis, a third-party AI testing outfit, it
costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of
seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s
Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.
Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid”
reasoning model, on the same set of tests cost $1,485.35, while testing
OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.
Some reasoning models are cheaper to benchmark than others. Artificial
Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on
average, they tend to be pricey. All told, Artificial Analysis has spent
roughly $5,200 evaluating around a dozen reasoning models, more than
twice the roughly $2,400 the firm spent analyzing over 80 non-reasoning
models.
OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost
Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet —
Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.
Artificial Analysis co-founder George Cameron told TechCrunch that the
organization plans to increase its benchmarking spend as more AI labs
develop reasoning models.
“At Artificial Analysis, we run hundreds of evaluations monthly and devote
a significant budget to these,” Cameron said. “We are planning for this
spend to increase as models are more frequently released.”
Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI
benchmarking costs.
Ross Taylor, the CEO of AI startup General Reasoning, said he recently
spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique
prompts. Taylor estimates a single run-through of MMLU-Pro, a question
set designed to benchmark a model’s language comprehension skills,
would have cost more than $1,800.
“We’re moving to a world where a lab reports x% on a benchmark where
they spend y amount of compute, but where resources for academics are
<< y,” said Taylor in a recent post on X. “[N]o one is going to be able to
reproduce the results.”
Why are reasoning models so expensive to test? Mainly because they
generate a lot of tokens. Tokens represent bits of raw text, such as the
word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According
to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during
the firm’s benchmarking tests, around eight times the amount GPT-4o
generated.
The vast majority of AI companies charge for model usage by the token, so
you can see how this cost can add up.
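To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The 44-million-token figure for o1 and the roughly eight-fold gap versus GPT-4o come from Artificial Analysis's data above; the per-million-token output prices are illustrative assumptions, not the rates the firm actually paid.

    # Rough benchmarking cost: output tokens generated multiplied by the per-token price.
    # Prices below are illustrative assumptions, not published benchmark bills.

    def benchmark_cost(output_tokens: int, price_per_million: float) -> float:
        """Dollar cost of generating `output_tokens` at `price_per_million` dollars per 1M tokens."""
        return output_tokens / 1_000_000 * price_per_million

    # o1 generated over 44 million tokens across the seven-benchmark suite;
    # GPT-4o generated roughly one-eighth as many (both per Artificial Analysis).
    runs = {
        "o1 (reasoning)":         (44_000_000, 60.0),  # assumed $60 per 1M output tokens
        "GPT-4o (non-reasoning)": (5_500_000, 10.0),   # assumed $10 per 1M output tokens
    }

    for model, (tokens, price) in runs.items():
        print(f"{model}: ~${benchmark_cost(tokens, price):,.2f}")

Under those assumed prices, the reasoning run comes out far more expensive, because the larger token volume and the higher per-token rate multiply together.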
Modern benchmarks also tend to elicit a lot of tokens from models
because they contain questions involving complex, multi-step tasks,
according to Jean-Stanislas Denain, a senior researcher at Epoch AI,
which develops its own model benchmarks.
“[Today’s] benchmarks are more complex [even though] the number of
questions per benchmark has overall decreased,” Denain told
TechCrunch. “They often attempt to evaluate models’ ability to do real-
world tasks, such as write and execute code, browse the internet, and use
computers.”
Denain added that the most expensive models have gotten more
expensive per token over time. For example, Anthropic’s Claude 3 Opus
was the priciest model when it was released in March 2024, costing $75 per
million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which
launched earlier this year, cost $150 per million output tokens and $600
per million output tokens, respectively.
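As a quick illustration of how those per-token rates scale, the sketch below prices a hypothetical evaluation run that produces one million output tokens at the three rates quoted above; the workload size is an assumption chosen only to make the comparison concrete.

    # Cost of a hypothetical 1,000,000-output-token evaluation run at the quoted rates.
    # Only the per-million prices come from the article; the workload size is assumed.
    prices_per_million_output = {
        "Claude 3 Opus": 75.0,
        "GPT-4.5": 150.0,
        "o1-pro": 600.0,
    }

    run_output_tokens = 1_000_000  # assumed evaluation workload

    for model, price in prices_per_million_output.items():
        cost = run_output_tokens / 1_000_000 * price
        print(f"{model}: ${cost:,.2f}")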
“[S]ince models have gotten better over time, it’s still true that the cost to
reach a given level of performance has greatly decreased over time,”
Denain said. “But if you want to evaluate the best largest models at any
point in time, you’re still paying more.”
Many AI labs, including OpenAI, give benchmarking organizations free or
subsidized access to their models for testing purposes. But this colors the
results, some experts say — even if there’s no evidence of manipulation,
the mere suggestion of an AI lab’s involvement threatens to harm the
integrity of the evaluation scoring.
“From [a] scientific point of view, if you publish a result that no one can
replicate with the same model, is it even science anymore?” wrote Taylor
in a follow-up post on X. “(Was it ever science, lol)”.