AI's Lazy Genius Problem
- Thomas Thurston
- Jun 18
- 3 min read

Sometimes the smartest mind in the room can still make the dumbest mistakes. That’s not just a quirk of human behavior. Apparently, the same thing happens with AI.
A recent study called The Illusion of Thinking explores a new breed of AI systems known as Large Reasoning Models, or LRMs.¹ Unlike standard AI models that jump straight to an answer, LRMs are designed to “think out loud,” writing down their reasoning before giving a final response. Think of Claude 3.7 in Thinking Mode or OpenAI’s o3 model. They've been designed to show their work, like a student in math class. In contrast, familiar tools like ChatGPT or Claude 3.7 without thinking mode tend to skip the explanation and go straight to the solution.
You might assume that reasoning out loud would help the model arrive at better answers. That’s where things get strange.
The Lazy Genius
The researchers put these LRMs through a battery of classic logic puzzles: games like Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. These puzzles were chosen not because they’re fun, but because their complexity can be adjusted in precise steps. That made them a perfect testbed to understand how well these models actually think.
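To give a sense of what “adjusted in precise steps” means: in Tower of Hanoi, adding a single disk roughly doubles the minimum number of moves, so difficulty climbs one notch at a time. Here’s a quick back-of-the-envelope sketch of that scaling (my own illustration in Python, not the researchers’ actual test harness):

```python
# Minimum number of moves to solve Tower of Hanoi with n disks is 2**n - 1,
# so each additional disk roughly doubles the length of the solution.
for n in range(1, 11):
    print(f"{n} disks -> {2**n - 1} moves minimum")
```

That steady, predictable ramp is what lets researchers pinpoint exactly where a model’s reasoning starts to break down.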
Here’s what they found: LRMs perform best when solving problems that are "moderately" difficult. Okay, few surprises there. Yet on easier tasks, they tended to overthink and make unnecessary mistakes. That was pretty counterintuitive, to say the least. Going a step further, when given problems that were really, really hard, the models basically gave up and quit.²
Said differently, here's how LRMs performed:
- Easy problems: The models overthought the task and made silly mistakes. Regular LLMs did better here than their fancier LRM cousins.
- Moderate problems: This is where the models did their best. In medium-complexity tasks, LRMs were stronger; their structured reasoning helped them work through the steps.
- Hard problems: Both LLMs and LRMs collapsed. LRMs were perhaps more puzzling, though, because they reduced their reasoning effort even though they had plenty of room left to keep going.¹,⁵
It’s like working with a lazy genius. They’re brilliant when a problem is just hard enough to catch their interest, but hand them something simple and they overcomplicate it. Push them into deep waters, and they lose focus or give up halfway through.
The study even tested what would happen if you gave the model the exact steps to follow: a working algorithm, not just a prompt. You’d think that would be foolproof, but the models still struggled to execute those instructions cleanly.³
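For a sense of what “a working algorithm” looks like here, Tower of Hanoi has a textbook recursive solution that spells out every single move. This is a minimal sketch of that standard algorithm (my own Python, not the exact procedure the researchers supplied to the models):

```python
def hanoi(n, source, target, spare, moves):
    """Standard recursive Tower of Hanoi: appends every move as (from_peg, to_peg)."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller disks out of the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # stack the smaller disks back on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(moves)  # 7 moves, starting with ('A', 'C') and ending with ('A', 'C')
```

A recipe like this is trivial for a computer program to execute; the surprise was that the models still stumbled when asked to follow it step by step.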
This wasn't the only study raising red flags about AI reasoning. Just this April, another benchmark called LogiEval tested some of the same top models on various types of logical reasoning: deductive, inductive, analogical, and abductive.⁴
In that study, the models did fine on some basic multiple-choice analogies. However, when it came to genuinely hard reasoning tasks (the kind that requires building mental models or following long logic chains), performance dropped... a lot.⁴
In other words, even with more compute, more data and more thoughtful-sounding output, it's worth remembering that many of today’s most advanced AIs can still struggle to reason in a truly reliable way. That can be a pretty serious problem if you’re relying on these systems for important decisions or advanced planning. This is especially the case when they don’t just fail, but fail quietly.
The Big Takeaway
So where does that leave us? There’s an oddity here that I can’t help but notice. As AI becomes more sophisticated, it can also become more human-like, not just in its strengths but in its shortcomings. It can overthink, it can get frustrated and quit, and it can quietly stop working while giving the appearance of productivity.
We used to imagine machines as exact, perfect, and tireless: everything we aren’t. Now we’re learning that the evolution of AI is a bit weirder and more unexpected than that. AI has strengths, weaknesses, and personality quirks, just like the rest of us.
Endnotes:
¹ Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple. arXiv preprint arXiv:2506.00986. https://arxiv.org/abs/2506.00986
² In hard puzzles, LRM accuracy drops to zero, and their reasoning token use declines despite having available budget.
³ Even when handed a working Tower of Hanoi algorithm, models struggled to execute it correctly.
⁴ LogiEval (April 2025) showed top models perform poorly on deductive and abductive logic tasks despite excelling at simple analogies. Liu, H., Ding, Y., Fu, Z., Zhang, C., Liu, X., & Zhang, Y. (2025). Evaluating the Logical Reasoning Abilities of Large Reasoning Models. arXiv preprint arXiv:2505.11854. https://arxiv.org/abs/2505.11854
⁵ Both LLMs and LRMs failed on complex tasks, but LRMs paradoxically cut their effort, suggesting a fundamental scaling limitation.