Predicting the future has always felt like something out of science fiction, but AI models are now being tested on exactly that – and the results are quite surprising. Two new benchmarks, Prophet Arena and FutureX, have been putting leading AI models through their paces on real-world prediction tasks, from sports outcomes to political events to economic trends.
The numbers are pretty remarkable. On Prophet Arena, some AI models are achieving over 99% returns when their predictions are used for actual betting decisions, consistently outperforming human market consensus. Meanwhile, FutureX shows that AI agents with access to external tools significantly outperform basic language models, with Grok 4 leading the pack. Elon Musk even went so far as to claim that “predicting the future is the best measure of intelligence” when reacting to Grok’s #1 ranking.
But before we get carried away with visions of AI fortune tellers, the reality is more nuanced than the headlines suggest. While these benchmarks show impressive capabilities in some areas, they also reveal significant limitations and odd behaviors that paint a more complex picture.
Prophet Arena
To understand how good AI actually is at predicting the future, we need to look at the benchmarks themselves first. Prophet Arena is one of the most comprehensive tests of AI prediction capabilities, measuring how well models can forecast real-world events across politics, sports, economics, and more.
What Does This Benchmark Measure?
Prophet Arena works differently from typical AI tests. Instead of asking models to answer questions with known solutions, it presents them with real upcoming events – like “Will this sports team win?” or “Will this political candidate get elected?”. The AI models have to study available information, analyze market data, and assign probability percentages to different outcomes.
The benchmark uses two main scoring methods. The Brier Score measures how statistically accurate the predictions are – basically, when an AI says there’s an 80% chance of something happening, does that kind of event actually happen about 80% of the time? The Average Return score is more practical – it simulates what would happen if you actually bet money based on the AI’s predictions in real markets.
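To make those two scores concrete, here’s a rough sketch in Python of how they could be computed. Prophet Arena’s exact formulas are more involved; the simplifications below – binary yes/no events, and a fixed-stake bet that pays out at market odds whenever the model sees value – are assumptions for illustration, not the benchmark’s actual method.

```python
# Rough sketch of the two scoring ideas (assumed simplifications,
# not Prophet Arena's exact formulas).

def brier_score(predictions, outcomes):
    """Mean squared error between predicted probabilities and outcomes.
    0.0 is perfect; constant 50/50 guessing scores 0.25."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

def simulated_return(predictions, outcomes, market_probs, stake=1.0):
    """Bet a fixed stake on 'yes' whenever the model thinks the market
    underprices the outcome. A payout of 1/market_prob per unit staked
    is a simplifying assumption."""
    profit = 0.0
    for p, o, m in zip(predictions, outcomes, market_probs):
        if p > m:  # model sees value the market doesn't
            profit += stake * (1 / m - 1) if o else -stake
    return profit

preds   = [0.80, 0.30, 0.65]  # model's probabilities
results = [1,    0,    1]     # what actually happened
market  = [0.70, 0.25, 0.65]  # market consensus probabilities

print(brier_score(preds, results))               # lower = better calibrated
print(simulated_return(preds, results, market))  # net profit in stakes
```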


What makes this clever is that it’s impossible to cheat. You can’t memorize tomorrow’s sports scores because they don’t exist yet. This solves a major problem in AI testing where models sometimes just remember answers from their training data rather than actually understanding the problem.
What It Actually Tells Us
The results reveal some fascinating patterns in how different AI models approach uncertainty. GPT-5 currently leads the accuracy rankings with an 82% Brier Score, while o3-mini tops the returns leaderboard at 99% – meaning it would actually have made money overall if you had followed its betting advice. The intriguing thing is that the most accurate model and the most profitable model aren’t the same.
Different AI models also show distinct “personalities” when making predictions. When asked about AI regulation becoming federal law before 2026, Qwen 3 confidently predicted a 75% chance while Llama 4 Maverick gave it only 35% – despite both models having access to the same information. This suggests each AI model is developing its own approach to handling uncertainty.
Perhaps most surprising is that even when AI models lose more individual bets than they win, they can still achieve higher overall returns than human market consensus. This happens because they occasionally identify high-value opportunities where they’re only slightly more accurate than the crowd, but those small edges compound into significant gains over time.
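A toy simulation makes that mechanic concrete. The numbers below are invented for illustration: a market prices a longshot at 10% when its true probability is 13%, so each bet has an expected value of 0.13 × 9 − 0.87 × 1 = +0.30 stakes, despite losing almost nine times out of ten.

```python
import random

# Toy illustration with made-up numbers (not taken from either benchmark):
# the market prices an event at 10%, but its true probability is 13%.
# Backing it loses roughly 87 bets out of 100, yet still profits overall,
# because each win pays out at the market's longer odds.

random.seed(42)
market_prob, true_prob = 0.10, 0.13
stake, n_bets = 1.0, 10_000

bankroll, wins = 0.0, 0
for _ in range(n_bets):
    if random.random() < true_prob:                # the event happens
        bankroll += stake * (1 / market_prob - 1)  # win: +9 stakes at 10% odds
        wins += 1
    else:
        bankroll -= stake                          # lose the stake

print(f"won {wins}/{n_bets} bets, net profit: {bankroll:+,.0f} stakes")
# Expect roughly 1,300 wins out of 10,000 – a losing record –
# but a clearly positive bankroll, since each win returns 9x the stake.
```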
FutureX
FutureX takes a different approach to testing AI prediction capabilities by focusing on agent-based workflows rather than just language model responses. This benchmark evaluates how well AI systems can gather information, analyze data, and make predictions when given access to external tools and search capabilities.
What Does This Benchmark Measure?
Where Prophet Arena scores a model’s standalone predictions, FutureX specifically evaluates multi-step workflows and tool integration. Both benchmarks involve information gathering and analysis, but FutureX puts greater emphasis on testing complete agent pipelines with external tools, reasoning capabilities, and adaptive research processes.
The benchmark includes specialized configurations like “Think & Search” models that can reason through problems step-by-step while accessing real-time information, and dedicated research agents that mirror professional analyst workflows. FutureX also focuses on testing agents’ vulnerability to misinformation – including their ability to identify and avoid fake web pages – and how well they assess the validity of their information sources.
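As a rough mental model, a “Think & Search” agent boils down to a loop like the sketch below. The llm_reason and web_search functions are hypothetical placeholders for whatever model API and search tool an agent actually wires in – this shows the general shape of the pattern, not FutureX’s actual evaluation harness.

```python
# Minimal sketch of a "Think & Search" prediction loop.
# llm_reason() and web_search() are hypothetical placeholders for a real
# model API and search tool.

def llm_reason(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

def web_search(query: str) -> str:
    raise NotImplementedError("plug in your search tool here")

def think_and_search(question: str, max_steps: int = 5) -> str:
    notes: list[str] = []
    for _ in range(max_steps):
        # Reason about what is still unknown before committing to an answer.
        thought = llm_reason(
            f"Question: {question}\nNotes so far: {notes}\n"
            "What do you still need to look up? Reply DONE if nothing."
        )
        if "DONE" in thought:
            break
        # Fetch fresh evidence for the gap the model identified.
        notes.append(web_search(thought))
    # Commit to a final probability based on the gathered evidence.
    return llm_reason(
        f"Question: {question}\nEvidence: {notes}\n"
        "Give a final probability for the outcome, with brief reasoning."
    )
```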
The benchmark evaluates 25 different LLM and agent models, ranging from basic language models to research agents with full tool integration. This creates clear performance tiers that show how much external capabilities matter for complex prediction tasks.

What It Actually Tells Us
The results reveal a clear hierarchy in prediction capabilities. Models with Think & Search abilities – shown in purple on the leaderboard – significantly outperform basic language models. Grok-4 currently leads the rankings; reacting to the leaderboard, Elon Musk even tweeted that “Grok is the best at predicting the future, which is the best measure of intelligence imo”.
The top performers include o4-mini and Gemini-2.5-flash-DR alongside Grok-4, showing that multiple approaches can work well. Specialized DeepResearch agents (shown in red) form a solid middle tier, while basic LLMs without tools consistently lag behind. This suggests that for complex future prediction, you need more than just a smart language model – you need agent capabilities with reasoning and research tools.
FutureX also reveals specific failure modes that affect all models. The benchmark found that AI agents are vulnerable to fake websites and struggle with determining when information sources are outdated or no longer relevant. These limitations show that even the best-performing models still have significant blind spots when it comes to real-world prediction tasks.
How Good Are AI Predictions?
After looking at these benchmarks, the honest answer is both more impressive and more underwhelming than you might expect. AI models can genuinely edge out human predictions in specific areas – Prophet Arena shows models consistently outperforming market consensus on sports, politics, and economics, with some achieving over 99% returns if you had actually followed their betting advice.
But the thing is, this works primarily for events with lots of historical data and clear patterns. Sports scores, stock movements, and election outcomes have decades of precedent that AI can analyze incredibly well.
The further you get from a controlled environment with historical data, the less reliable the predictions become. AI can tell you the probability that a specific team will win next weekend, but ask it to predict how your industry will change over the next five years, and you’re basically getting educated guesswork dressed up in confident percentages.
Also, AI models disagree with each other a lot. When presented with identical information about AI regulation becoming federal law, one model predicted a 75% probability while another said 35%. This suggests that AI prediction is still more art than science, with different models developing distinct “personalities” and risk tolerances rather than converging on a single objective estimate.
The reality is that AI works best as a very good pattern-matching tool, which means it’s excellent at processing massive amounts of information quickly, finding trends humans might miss, and assessing uncertainty in ways that can be genuinely useful. But it’s still essentially looking backward to predict forward – analyzing historical patterns to forecast similar future events. When something truly unprecedented happens, AI is often just as surprised as everyone else.
For everyday people, this means AI can be incredibly helpful for research and analysis, but you should be skeptical of any AI system that claims to predict the future with high confidence. The sci-fi vision of AI fortune tellers isn’t here yet. What we have instead are great tools that can enhance human judgment and help us make more informed decisions – as long as we remember that even the smartest AI is still making educated guesses about an inherently uncertain world.
Conclusion
AI is getting better at prediction, but it’s not magic. Benchmarks like Prophet Arena and FutureX show that models can beat human consensus in areas with lots of data and clear patterns. They can even make money when their small edges compound over time. But they’re far from flawless – models disagree with each other, get fooled by bad information, and fail when asked to forecast unprecedented events.
The takeaway is simple: AI predictions are useful tools, not crystal balls. Treat them as decision-support systems that can sharpen your judgment, not as fortune tellers you can blindly trust. The smartest move isn’t asking AI what will happen, but using it to explore possibilities, weigh risks, and make more informed choices in an uncertain world.