OpenAI's o3 and o4-mini hallucinate way higher than previous models (2025)


A troubling issue nestled in OpenAI's technical report.

By Cecily Mauran, Tech Reporter


And OpenAI doesn't know why. Credit: Didem Mente / Anadolu / Getty Images

By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more often than o1.

As first reported by TechCrunch, OpenAI's system card details results from the PersonQA evaluation, which is designed to test for hallucinations. On that evaluation, o3's hallucination rate is 33 percent, and o4-mini's is 48 percent, almost half the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.


The system card noted that o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, saying only, "More research is needed to understand the cause of this result."

OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."

However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.


In a statement to Mashable, an OpenAI spokesperson said, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”

Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws both in their datasets and in how they evaluate models.

Plus, different labs rely on different benchmarks and methods to test accuracy and hallucinations. HuggingFace's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents, and it found much lower hallucination rates across the board for the major models on the market than OpenAI's evaluations did. GPT-4o scored 1.5 percent, GPT-4.5 preview scored 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting that o3 and o4-mini weren't included in the current leaderboard.
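For a sense of what a summary-level score like those leaderboard figures represents, here is a minimal Python sketch of the general idea: each generated summary is judged as faithful or hallucinated against its source document, and the reported rate is the share of summaries flagged as hallucinated. The data structures, names, and judgments below are hypothetical illustrations, not the benchmark's actual code or data.

# Minimal sketch of computing a hallucination rate from per-example judgments.
# All names and data here are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class EvalExample:
    source_doc: str       # document the model was asked to summarize
    model_summary: str    # the model's generated summary
    hallucinated: bool    # True if the summary was judged to contradict or invent facts

def hallucination_rate(examples: list[EvalExample]) -> float:
    """Fraction of summaries judged hallucinated."""
    if not examples:
        return 0.0
    return sum(ex.hallucinated for ex in examples) / len(examples)

# Hypothetical judgments over a tiny evaluation set.
examples = [
    EvalExample("doc A", "summary A", hallucinated=False),
    EvalExample("doc B", "summary B", hallucinated=True),
    EvalExample("doc C", "summary C", hallucinated=False),
    EvalExample("doc D", "summary D", hallucinated=False),
]

print(f"Hallucination rate: {hallucination_rate(examples):.1%}")  # prints 25.0%

The takeaway is that benchmarks differ in what counts as a hallucination and over which prompts it is measured, which is why figures like these aren't directly comparable to OpenAI's PersonQA numbers.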

That's all to say: even industry-standard benchmarks make it difficult to assess hallucination rates.


Then there's the added complexity that models tend to be more accurate when they tap into web search to source their answers. But using ChatGPT search means OpenAI shares data with third-party search providers, and Enterprise customers using OpenAI models internally might not be willing to expose their prompts that way.

Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more than its non-reasoning models, that might be a problem for its users.

UPDATE: Apr. 21, 2025, 1:16 p.m. EDT This story has been updated with a statement from OpenAI.

Topics: ChatGPT, OpenAI
