xAI, the artificial intelligence startup founded by Elon Musk, introduced its latest language model, Grok 4, during a livestream event on Wednesday, July 9. The launch was announced earlier in the week, with Musk posting on X (formerly Twitter) that the new model would be unveiled during a live broadcast at 8 PM Pacific Time.
Introducing Grok 4, the world's most powerful AI model. Watch the livestream now: https://t.co/59iDX5s2ck
— xAI (@xai) July 10, 2025
During the event, Musk and members of the xAI team provided an overview of Grok 4’s performance across academic and real-world benchmarks. They also discussed the model’s expanded capabilities, including native tool usage, improved reasoning, multimodal features, and new voice functionality.
The release positions Grok 4 as a direct competitor to other frontier models such as OpenAI's o3, Anthropic's Claude 4 Opus, and Google's Gemini 2.5 Pro. xAI claims the new model shows improved reasoning performance and is better suited for tasks ranging from coding and academic problem-solving to running simulations and basic business operations.

Unprecedented Humanity’s Last Exam Score
One of the most talked-about takeaways from Grok 4’s launch was its record-breaking result on Humanity’s Last Exam (HLE)—a newly developed academic benchmark designed to push AI models to the limits of reasoning. According to xAI, Grok 4 achieved a 50.7% score with tools enabled, the highest score ever reported on this exam. Even without tools, it managed 26.9%, still ahead of any other publicly known model.
The performance has sparked renewed discussion around benchmarks for Artificial General Intelligence (AGI), with HLE now emerging as a key test of whether models can reason beyond pattern recognition and factual memory.
What Is Humanity’s Last Exam?
Humanity’s Last Exam is a large-scale, academically rigorous benchmark introduced to measure reasoning at a level comparable to human experts. It consists of over 2,500 questions drawn from more than 100 academic subjects, ranging from hard sciences to humanities. The test was designed to reflect the structure and intellectual demands of graduate-level education, where multi-step problem-solving and interdisciplinary thinking are essential.
The benchmark differs from earlier test sets like MMLU or GSM8K in both scale and depth. While MMLU focuses on factual recall across dozens of subjects, and GSM8K targets grade-school math word problems, HLE emphasizes cross-domain reasoning, abstraction, and knowledge synthesis. It's built to assess how well a model can think, not just how much it knows.
Subject Distribution:
- Mathematics: 41%
- Biology & Medicine: 11%
- Computer Science & AI: 10%
- Physics: 9%
- Humanities & Social Sciences: 9%
- Chemistry: 7%
- Engineering: 4%
- Other fields: 9%
This broad spread ensures that models are tested not just on one type of reasoning but on their ability to generalize across diverse domains, making HLE one of the most challenging AI benchmarks currently available.

Grok 4’s Score
Grok 4’s results on HLE place it at the top of the field, both with and without tools.
- 50.7% accuracy using tools
- 26.9% accuracy in the no-tools (text-only) setting
- 44.4% score for Grok 4 Heavy, a multi-agent setup tested on the full exam
For comparison:
- Gemini 2.5 Pro previously held one of the best scores in the no-tools category with 21.6%.
- Other top-tier models like Claude 4 Opus and OpenAI's o3 typically fall into the 15–22% range without tools.
This puts Grok 4 significantly ahead across all settings. The full 50.7% score using tools suggests that Grok can now solve questions at—or above—PhD-level in several domains, according to xAI’s claims.

Tools-Native Architecture
A major factor contributing to Grok 4’s performance is its “tools-native” training. While many other models integrate tools (such as calculators or code interpreters) through plugins or post-training adaptation, Grok 4 was trained from the start with tools integrated into its learning loop. It understands when to invoke tools, how to use them, and how to incorporate their outputs into larger reasoning chains.
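xAI has not published details of how this tool loop works internally, but the general pattern of a tool-using model can be illustrated with a small sketch. Everything here is a stand-in: `fake_model` substitutes for the LLM, and `calculator` is a toy tool; the point is only the control flow of propose, invoke, feed back, answer.

```python
# Conceptual sketch of a tool-use loop (not xAI's actual implementation).
# The model proposes either a final answer or a tool call; tool outputs
# are appended to the context until the model settles on an answer.

def calculator(expression: str) -> str:
    """Toy 'tool': evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def fake_model(context: list) -> dict:
    """Stand-in for the LLM: requests the calculator once, then answers."""
    if not any("TOOL RESULT" in turn for turn in context):
        return {"type": "tool_call", "tool": "calculator", "input": "17 * 23"}
    result = context[-1].split(": ")[1]
    return {"type": "answer", "text": f"17 * 23 = {result}"}

def run(question: str, max_steps: int = 5) -> str:
    context = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        step = fake_model(context)
        if step["type"] == "answer":
            return step["text"]
        output = TOOLS[step["tool"]](step["input"])
        context.append(f"TOOL RESULT: {output}")
    return "no answer"

print(run("What is 17 * 23?"))  # prints "17 * 23 = 391"
```

In a tools-native model, the decision of when to emit a tool call is learned during training rather than bolted on afterward, which is the distinction xAI is drawing.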
This is aligned with Elon Musk’s stated goal of building a model that can interact with the real world—not just generate text. Musk said the long-term vision includes Grok using simulators for physics, mathematics, and scientific experiments, enabling it to move from answering questions to generating and testing new hypotheses.
“We want Grok to interact with the real world. Eventually, it should use scientific simulators to actually test hypotheses,” Musk said during the launch.
That design philosophy—along with the Grok 4 Heavy version, which operates as a collaborative network of agents—suggests that xAI is exploring architectures beyond a single monolithic model. Grok 4 is being positioned not only as a chatbot but as an early step toward real-world cognitive systems.
Grok 4’s Performance Across Other Benchmarks
While Grok 4’s performance on Humanity’s Last Exam has attracted much of the attention, its results on other major benchmarks further support xAI’s claim that this is one of the most advanced general-purpose reasoning models currently available.
xAI evaluated Grok 4 across a series of academic and logic-based tests that are widely used in the AI community. These include standardized math competitions, graduate-level question sets, and domain-specific problem-solving tasks. In nearly every category, Grok 4 — especially in its multi-agent variant Grok 4 Heavy — either matched or surpassed previous state-of-the-art scores.
| Benchmark | Grok 4 | Grok 4 Heavy | Gemini 2.5 Pro | Claude Opus 4 | o3 (OpenAI) |
|---|---|---|---|---|---|
| HLE (no tools) | 26.9% | 44.4% | 21.6% | — | 21.0% |
| GPQA | 87.5% | 88.9% | 86.4% | 79.6% | 83.3% |
| AIME25 | 88.0% | 91.7% | 88.0% | 75.5% | 88.9% |
| USAMO25 | 37.5% | 61.9% | 49.4% | — | 21.7% |
| HMMT25 | 93.9% | 96.7% | 82.5% | 58.3% | 77.5% |
| LCB (Jan–May) | 79.3% | 79.4% | 74.2% | — | 72.0% |
| ARC-AGI-2 | — | 15.9% | — | 9% | 6% |
These benchmarks indicate that Grok 4, particularly in its Heavy configuration, is not merely tuned for one benchmark like HLE. Its performance is consistent across math, logic, and graduate-level academic tasks — domains that require both factual knowledge and flexible reasoning. Notably:
- Counting the Heavy configuration, it leads or co-leads every major benchmark tested.
- Multi-agent collaboration (as in Grok 4 Heavy) contributes a measurable gain over the single-model version, ranging from a few percentage points on most tests to more than 20 points on USAMO25.
- It demonstrates strong generalization, performing well even in domains where other models show wide variability.
The test results reflect xAI’s emphasis on training for advanced reasoning rather than just large-scale data memorization. The improvements over Claude, Gemini, and o3 are incremental in some cases but substantial in areas like Olympiad-level mathematics and multi-step reasoning.

Grok 4 Tops ARC-AGI-2
Grok 4 just set a new state-of-the-art on the ARC-AGI-2 benchmark with a 15.9% score in "Thinking" mode, nearly double the previous best from Claude Opus 4 and well above OpenAI's o3.
ARC-AGI-2 is one of the toughest tests of abstract reasoning, used in an ongoing Kaggle competition. It rewards models that can go beyond pattern recall and reason like humans.
ARC-AGI-2 SOTA:
- Grok 4 (Thinking): 15.9%
- Claude Opus 4 (16K): ~9%
- Other frontier models, including o3 and o3 Pro: 3–7%
Grok 4 leads not just in accuracy but also in cost-efficiency, placing it ahead on both axes of the ARC Prize leaderboard.
Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%. This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA.
— ARC Prize (@arcprize) July 10, 2025
Grok 4 Coding Capabilities
xAI has announced Grok 4 Code, a specialized variant of the model aimed at assisting with complex software development. Unlike Grok 3, which handled basic code snippets, this new version is designed to operate at scale across real-world codebases.
Expected to launch in the coming weeks, Grok 4 Code will offer:
- Repository-level understanding for analyzing and navigating large projects
- Cross-file debugging and logic tracing across modules
- IDE-friendly workflows, including multi-step planning like identifying bugs and proposing fixes
While not yet fully autonomous, Grok 4 Code marks a step toward agentic coding assistants — models that participate in development workflows rather than just responding to prompts. The feature set puts Grok in line with ongoing industry efforts to embed AI deeper into the software engineering process.
Natural Voice Interaction
Another area of focus is real-time speech interaction. During the launch event, xAI demonstrated a new voice interface featuring a British-accented assistant named Eve. The voice engine has seen notable upgrades:
- Whispering, tone shifting, and early-stage singing support
- Reduced latency — roughly halved compared to Grok 3
- More conversational timing and intonation to reduce artificial pauses
These improvements aim to make Grok’s speech interface feel less robotic and more fluid. While the singing and tonal expressiveness remain experimental, the overall latency reduction and clarity are already suitable for natural back-and-forth interaction.
Game Design & Development Capabilities
In addition to academic and coding use cases, xAI is positioning Grok 4 as a tool for creative development — including video games. While still early-stage, the model has been shown to support basic game prototyping and evaluation.
According to the xAI team, Grok 4 is now capable of:
- Generating 3D game concepts from natural language prompts, including level design ideas, character mechanics, and gameplay rules.
- Assessing game quality, including whether a concept or prototype is likely to be “fun” or engaging, based on structure, pacing, and player objectives.
The system is not yet a full game engine assistant, but it can help teams ideate and iterate more quickly during the design process. This aligns with a broader trend of generative AI tools entering the creative pipeline, assisting with everything from world-building to gameplay balancing.
While xAI has not disclosed whether Grok is integrated with tools like Unity or Unreal, the company hinted that future versions may gain stronger multimodal capabilities — possibly allowing the model to reason directly over visuals, physics rules, or gameplay simulation.
The direction is clear: xAI wants Grok to move beyond passive generation into interactive, tool-augmented design — including in fields like gaming where creativity and logic must work hand-in-hand.
Premium Access, Pricing, and Roadmap
Grok 4 is publicly available for free via grok.com or through the Grok tab on X. Free users get limited daily interactions, making it a low-friction way to explore basic capabilities.
For those seeking higher usage limits, early access to new tools, or advanced versions of the model, xAI offers multiple paid tiers:
Grok 4 Subscription Plans
| Plan | Price/month | Includes |
|---|---|---|
| Free | $0 | Basic Grok access with daily message limits |
| X Premium | $8 | More daily Grok messages, checkmark on X, longer posts, creator tools |
| X Premium+ | $40 | Full Grok 4 access, DeepSearch, Think mode, no ads on X |
| SuperGrok (Standalone) | $30 | Full Grok 4 access without needing an X Premium subscription |
| SuperGrok Heavy | $300 | Access to Grok 4 Heavy, priority usage, and early feature previews |
Product Roadmap
xAI also tied the SuperGrok Heavy plan to early access for a series of upcoming features and models. Here’s what subscribers can expect over the next few months:
- August: Launch of Grok 4 Code, a coding-focused AI assistant for real-world software projects
- September: A multimodal agent that can process and reason across text, images, and possibly audio
- October: A video generation model, aimed at producing dynamic content from text prompts
These developments suggest xAI is moving quickly toward a more complete agent platform, capable of handling not just conversation but software development, visual reasoning, and media generation.
Developer Access
Grok 4 is also available via API through the xAI Console, allowing developers to integrate the model into applications and workflows. Although xAI’s enterprise division is still in early stages, the company has signaled plans to partner with cloud infrastructure providers to expand access.
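The xAI API follows the widely used OpenAI-compatible chat-completions format. The sketch below only builds the HTTP request rather than sending it; the endpoint path is the documented one, but the model identifier `"grok-4"` and the exact shape of the response should be verified against xAI's current API documentation.

```python
# Hedged sketch of an OpenAI-compatible chat request to the xAI API.
# Assumption: the model name "grok-4"; check xAI's docs for current names.
import json
import os
import urllib.request

API_URL = "https://api.x.ai/v1/chat/completions"

def build_request(prompt: str, model: str = "grok-4") -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('XAI_API_KEY', '')}",
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(payload).encode(), headers=headers
    )

req = build_request("Summarize Humanity's Last Exam in one sentence.")
print(req.full_url)
# Actually sending it requires a valid XAI_API_KEY:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the format mirrors OpenAI's, existing client libraries can typically be pointed at the xAI base URL with only a key and model-name change.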
On the tooling side, Grok 4 Code is already embedded in Cursor, a code editor optimized for AI-assisted workflows. This gives developers early access to planned agentic features like multi-file understanding, bug tracing, and in-editor code generation — with broader IDE integration expected to follow.
Elon Musk’s Vision for Grok
Beyond benchmarks and product features, Elon Musk used the Grok 4 launch to outline a much broader ambition: building a model that doesn’t just answer questions, but one that can generate new knowledge by interacting with the world.
During the livestream, Musk emphasized that Grok 4 isn’t just a chatbot — it’s an early step toward what he calls a “tool-integrated intelligence system.” The idea is to move beyond static models and build something that can run simulations, test hypotheses, and even participate in scientific discovery.
He described a future where Grok will work alongside real-world tools and scientific simulators — across fields like physics, mathematics, chemistry, and engineering. Instead of simply retrieving information, Grok would actively explore and validate ideas inside virtual environments, helping researchers reason through problems computationally.
“We’ve run out of test questions. Reality is now the test.” — Elon Musk
Musk believes real-world interaction is a more meaningful test of intelligence than static benchmarks. His vision is for Grok to function as a general-purpose reasoning agent, capable of adapting to new problems and learning dynamically, with minimal human instruction. He even suggested that future versions of Grok could help uncover entirely new physics within the next two years, though that claim remains speculative.
One example of this vision already in motion is Grok 4 Heavy. Unlike the base version, Heavy is designed as a multi-agent system — a network of Grok instances that work independently on tasks and then compare and refine their outputs. The goal is to simulate the collaborative, debate-driven process of a human research team.
This architecture, xAI suggests, could become the blueprint for future AGI systems — where specialization, internal discussion, and shared reasoning contribute to better outcomes than a single monolithic model.
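xAI has not disclosed how Heavy's agents coordinate, but the description of independent attempts followed by comparison can be illustrated with a toy sketch. The agents here are stubs standing in for full model instances, and majority voting is just one plausible reconciliation strategy.

```python
# Toy sketch of the multi-agent idea behind Grok 4 Heavy, as described by
# xAI: several independent "agents" attempt a task, then a reconciliation
# step picks the answer they most agree on. Agents are stubs, not models.
from collections import Counter

def make_agent(bias: int):
    """Stub agent; a real system would run a full model instance."""
    def agent(question: str) -> str:
        # Each agent 'reasons' slightly differently; two of three converge.
        return "391" if bias % 3 != 0 else "381"
    return agent

def heavy_answer(question: str, n_agents: int = 3) -> str:
    agents = [make_agent(i) for i in range(1, n_agents + 1)]
    candidates = [agent(question) for agent in agents]
    # Reconciliation via majority vote; a real system might instead have
    # agents critique and refine each other's outputs before deciding.
    return Counter(candidates).most_common(1)[0][0]

print(heavy_answer("What is 17 * 23?"))  # prints "391": two agents outvote one
```

The benchmark gap between Grok 4 and Grok 4 Heavy suggests this kind of ensemble disagreement-and-resolution recovers errors that a single forward pass misses.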
As Grok evolves into a more active, tool-using system, xAI appears to be laying the groundwork not just for smarter chat, but for a more autonomous and exploratory form of AI.




