The rise of AI ‘reasoning’ models is making benchmarking more expensive
Kyle Wiggers · 6:30 AM PDT · April 10, 2025
AI labs like OpenAI claim that their so-called “reasoning” AI models, which
can “think” through problems step by step, are more capable than their
non-reasoning counterparts in specific domains, such as physics. But
while this generally appears to be the case, reasoning models are also
much more expensive to benchmark, making it difficult to independently
verify these claims.
According to data from Artificial Analysis, a third-party AI testing outfit, it
costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of
seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s
Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.
Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid”
reasoning model, on the same set of tests cost $1,485.35, while testing
OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.
Some reasoning models are cheaper to benchmark than others. Artificial
Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on
average, they tend to be pricey. All told, Artificial Analysis has spent
roughly $5,200 evaluating around a dozen reasoning models, more than
twice the roughly $2,400 the firm spent analyzing over 80 non-reasoning
models.
OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost
Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet —
Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.
Artificial Analysis co-founder George Cameron told TechCrunch that the
organization plans to increase its benchmarking spend as more AI labs
develop reasoning models.
“At Artificial Analysis, we run hundreds of evaluations monthly and devote
a significant budget to these,” Cameron said. “We are planning for this
spend to increase as models are more frequently released.”
Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI
benchmarking costs.
Ross Taylor, the CEO of AI startup General Reasoning, said he recently
spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique
prompts. Taylor estimates a single run-through of MMLU-Pro, a question
set designed to benchmark a model’s language comprehension skills,
would have cost more than $1,800.
“We’re moving to a world where a lab reports x% on a benchmark where
they spend y amount of compute, but where resources for academics are
<< y,” said Taylor in a recent post on X. “[N]o one is going to be able to
reproduce the results.”
Why are reasoning models so expensive to test? Mainly because they
generate a lot of tokens. Tokens represent bits of raw text, such as the
word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According
to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during
the firm’s benchmarking tests, around eight times the amount GPT-4o
generated.
The vast majority of AI companies charge for model usage by the token, so
you can see how this cost can add up.
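To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The 44-million-token figure for o1 and the roughly eight-fold gap versus GPT-4o come from Artificial Analysis's data above; the per-million-token output prices are illustrative assumptions, not the rates the firm actually paid.

    # Rough benchmarking cost: output tokens generated multiplied by the per-token price.
    # Prices below are illustrative assumptions, not published benchmark bills.

    def benchmark_cost(output_tokens: int, price_per_million: float) -> float:
        """Dollar cost of generating `output_tokens` at `price_per_million` dollars per 1M tokens."""
        return output_tokens / 1_000_000 * price_per_million

    # o1 generated over 44 million tokens across the seven-benchmark suite;
    # GPT-4o generated roughly one-eighth as many (both per Artificial Analysis).
    runs = {
        "o1 (reasoning)":         (44_000_000, 60.0),  # assumed $60 per 1M output tokens
        "GPT-4o (non-reasoning)": (5_500_000, 10.0),   # assumed $10 per 1M output tokens
    }

    for model, (tokens, price) in runs.items():
        print(f"{model}: ~${benchmark_cost(tokens, price):,.2f}")

Under those assumed prices, the reasoning run comes out far more expensive, because the larger token volume and the higher per-token rate multiply together.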
Modern benchmarks also tend to elicit a lot of tokens from models
because they contain questions involving complex, multi-step tasks,
according to Jean-Stanislas Denain, a senior researcher at Epoch AI,
which develops its own model benchmarks.
“[Today’s] benchmarks are more complex [even though] the number of
questions per benchmark has overall decreased,” Denain told
TechCrunch. “They often attempt to evaluate models’ ability to do real-
world tasks, such as write and execute code, browse the internet, and use
computers.”
Denain added that the most expensive models have gotten more
expensive per token over time. For example, Anthropic’s Claude 3 Opus
was the priciest model when it was released in March 2024, costing $75 per
million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which
launched earlier this year, cost $150 per million output tokens and $600
per million output tokens, respectively.
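As a quick illustration of how those per-token rates scale, the sketch below prices a hypothetical evaluation run that produces one million output tokens at the three rates quoted above; the workload size is an assumption chosen only to make the comparison concrete.

    # Cost of a hypothetical 1,000,000-output-token evaluation run at the quoted rates.
    # Only the per-million prices come from the article; the workload size is assumed.
    prices_per_million_output = {
        "Claude 3 Opus": 75.0,
        "GPT-4.5": 150.0,
        "o1-pro": 600.0,
    }

    run_output_tokens = 1_000_000  # assumed evaluation workload

    for model, price in prices_per_million_output.items():
        cost = run_output_tokens / 1_000_000 * price
        print(f"{model}: ${cost:,.2f}")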
“[S]ince models have gotten better over time, it’s still true that the cost to
reach a given level of performance has greatly decreased over time,”
Denain said. “But if you want to evaluate the best largest models at any
point in time, you’re still paying more.”
Many AI labs, including OpenAI, give benchmarking organizations free or
subsidized access to their models for testing purposes. But this colors the
results, some experts say — even if there’s no evidence of manipulation,
the mere suggestion of an AI lab’s involvement threatens to harm the
integrity of the evaluation scoring.
“From [a] scientific point of view, if you publish a result that no one can
replicate with the same model, is it even science anymore?” wrote Taylor
in a follow-up post on X. “(Was it ever science, lol)”.