Elon Musk’s xAI Launches Grok 4.1: Now the Highest-Rated LLM on LMArena

Today, November 17, 2025, xAI has officially rolled out Grok 4.1, the latest iteration of its flagship large language model, across grok.com, the 𝕏 interface, and the iOS and Android apps. The update arrives after a two-week silent production test and aims to deliver more natural conversation, stronger emotional intelligence, tighter safety controls, and dramatically improved real-world reliability.

Unlike typical incremental updates, Grok 4.1 represents a noticeable shift in how xAI tunes and deploys its models: it introduces deeper reinforcement learning on style, personality, and alignment, while preserving the factual accuracy and reasoning depth Grok 4 was known for.

Introducing Grok 4.1, a frontier model that sets a new standard for conversational intelligence, emotional understanding, and real-world helpfulness.

Grok 4.1 is available for free on https://t.co/AnXpIEOPEb, https://t.co/53pltyq3a4 and our mobile apps. https://t.co/Cdmv5CqSrb
— xAI (@xai) November 17, 2025

Below is everything you need to know — benchmarks, new capabilities, safety upgrades, and how it compares to rivals.

A Silent Rollout

Before today’s public release, xAI quietly shipped early versions of Grok 4.1 to a growing percentage of users between November 1–14. During this period, the company ran continuous blind pairwise evaluations, directly comparing Grok 4.1’s responses against the previous production model.

The results were clear:

- 64.78% win rate over the previous Grok
- Consistent positive preference in creative, emotional, and collaborative prompts
- Noticeable reduction in stylistic inconsistency

In other words, users preferred the new version almost two-thirds of the time — a strong signal that xAI’s new post-training approach is working.
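As a rough illustration of how a win rate like this is derived from blind pairwise votes, here is a minimal sketch. The vote counts are hypothetical (xAI has not published the sample size in this article), ties are assumed to be excluded, and the Wilson score interval is simply one common way to express the uncertainty, not necessarily what xAI used.

```python
import math


def win_rate(wins: int, losses: int) -> float:
    """Share of decided pairwise comparisons won by the new model."""
    total = wins + losses
    if total == 0:
        raise ValueError("at least one decided comparison is required")
    return wins / total


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts chosen only to reproduce the reported 64.78% figure.
HYPOTHETICAL_WINS = 6478
HYPOTHETICAL_LOSSES = 3522

rate = win_rate(HYPOTHETICAL_WINS, HYPOTHETICAL_LOSSES)
low, high = wilson_interval(HYPOTHETICAL_WINS, HYPOTHETICAL_WINS + HYPOTHETICAL_LOSSES)
print(f"win rate: {rate:.2%}, approx. 95% interval: {low:.2%} to {high:.2%}")
```

With a sample this large the interval is narrow, which is why a preference near two-thirds reads as a clear signal rather than noise. With far fewer votes, the same percentage would be much less conclusive.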

What’s New in Grok 4.1

xAI’s latest update brings the biggest step forward in the Grok model line since its debut. Grok 4.1 ships in two versions — a fast non-thinking model and a more deliberate Thinking variant — and both introduce upgrades focused on conversation quality, factual reliability, and overall output polish. What sets this release apart is that almost every improvement is backed by measurable gains on public benchmarks.

More Natural Model

The headline upgrade is how Grok 4.1 communicates. xAI rebuilt its post-training process to emphasize tone, emotional awareness, and a more coherent model “personality.” In day-to-day use, the model feels noticeably more grounded and more perceptive of nuance — especially in emotionally charged or ambiguous prompts.

These changes show up in EQ-Bench 3, a benchmark designed to measure empathy, interpersonal skill, and emotional intelligence. The two Grok 4.1 variants now occupy the top two positions.

Both variants outperform competitors like Kimi K2 Instruct and Gemini 2.5 Pro, making Grok 4.1 the current leader on this emotional intelligence benchmark.

Reliability

Accuracy remains a key issue in LLM deployment. With Grok 4.1, xAI aimed to reduce hallucination rates — those moments where the model confidently states false information.

The new model shows impressive gains:

Hallucination Rate:
- Grok 4 Fast (non-reasoning): 12.09%
- Grok 4.1 (non-reasoning): 4.22%
FActScore (factual accuracy on biographies):
- Grok 4 Fast: 9.89% error
- Grok 4.1: 2.97% error

These improvements are especially important for users relying on Grok to summarize news, explain technical topics, or answer research-style queries.
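To put the figures above in perspective, it helps to express them as relative reductions rather than percentage-point drops. The short sketch below does that arithmetic using only the numbers quoted in this article. The helper name `relative_reduction` is ours, chosen for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate dropped, relative to its starting value."""
    if before <= 0:
        raise ValueError("the starting rate must be positive")
    return (before - after) / before


# (Grok 4 Fast, Grok 4.1) percentages as quoted in the article.
reported_rates = {
    "Hallucination rate": (12.09, 4.22),
    "FActScore error": (9.89, 2.97),
}

for metric, (old, new) in reported_rates.items():
    drop = relative_reduction(old, new)
    print(f"{metric}: {old}% -> {new}% ({drop:.1%} relative reduction)")
```

Run as written, this shows a relative reduction of roughly 65% in hallucination rate and roughly 70% in FActScore error, which means both metrics fell to about a third of their previous levels.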

Grok 4.1 just released.

You should notice a significant increase in speed and quality. https://t.co/1J8pvn3SsO
— Elon Musk (@elonmusk) November 17, 2025

Benchmark Performance

The two-week silent rollout covered grok.com, X, and the mobile apps. Throughout that period, xAI ran continuous blind A/B tests on real user traffic to measure whether the update actually improved day-to-day conversations.

Those real-world results mirror what independent benchmarks are seeing. In LMArena’s widely used Text Arena, which compares models side-by-side without branding bias, the new Grok models posted some of their strongest numbers yet — including a #1 overall position for the reasoning variant.

Model	Elo Score	Rank
Grok 4.1 Thinking	1483	#1
Grok 4.1	1465	#2
Gemini 2.5 Pro	1452	#3–5 range
Claude Sonnet 4.5 (Thinking)	1450	Top 5
GPT-4.5 Preview	1442	Upper tier
Kimi K2 Thinking	1432	Competitive
Grok 4 Fast (previous model)	1409	#30+ range

Elo Score Benchmark Results

Grok 4.1’s non-reasoning mode surpasses many competitors’ full reasoning configurations, a notable shift in a benchmark space that heavily favors depth of reasoning. Measured against Grok 4 Fast at 1409, the gain is 56 points for the non-reasoning model and 74 points for the Thinking variant, which marks one of xAI’s largest single-generation improvements so far.
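For readers wondering what an Elo gap means in practice, the sketch below converts a rating difference into an expected head-to-head win probability. It assumes the classic Elo logistic formula with a 400-point scale. LMArena's exact rating methodology may differ, so treat the outputs as a rough interpretation aid rather than official figures.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings as listed in the table above.
GROK_41_THINKING = 1483
GROK_41 = 1465
GROK_4_FAST = 1409

for name, rating in [("Grok 4.1 Thinking", GROK_41_THINKING), ("Grok 4.1", GROK_41)]:
    prob = expected_score(rating, GROK_4_FAST)
    print(f"{name} vs Grok 4 Fast: {rating - GROK_4_FAST} points, "
          f"expected preference {prob:.1%}")
```

Under this assumption, a gap of 56 to 74 points corresponds to the newer model being preferred in roughly 58% to 60% of direct comparisons. That is a meaningful edge, but it also shows that a few dozen Elo points do not imply a lopsided matchup.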

A More Consistent Model Identity

One subtle but important part of the update is that Grok 4.1 simply feels more stable. Its tone is more consistent between responses, and its “voice” is less prone to sudden shifts or quirky tangents. xAI attributes this to using stronger frontier reasoning models as internal graders during training — essentially teaching Grok how to carry itself conversationally.

Creative Writing Performance

Beyond general reasoning, Grok 4.1 also shows strong gains in long-form and narrative tasks. In the Creative Writing v3 benchmark, which evaluates story structure, voice, humor, and style across 32 prompts, Grok 4.1 ranks near the top of the current generation of models.

These scores place both variants just behind early GPT-5.1 builds and slightly ahead of models like o3, Claude Sonnet 4.5, and Kimi K2 Instruct. The improvement is especially visible in multi-paragraph creative tasks, where Grok 4.1 now produces more consistent voice, better pacing, and fewer generic lines compared to earlier Grok versions.

Results of the Creative Writing v3 benchmark

“AI Autumn” Model Race

Grok 4.1 lands in the middle of AI Autumn, a busy stretch where nearly every major AI lab is releasing upgraded models. Over just a few months, the competitive landscape has shifted quickly as companies try to set new benchmarks in reasoning, creativity, and emotional intelligence.

Recent launches include:

- OpenAI GPT-5.1
- Google Gemini 2.5 Pro (with Gemini 3.0 coming soon)
- Anthropic Claude Sonnet 4.5 / Opus 4
- Kimi K2 Thinking
- Qwen 3 Max

In this environment, Grok 4.1 needed to deliver real improvements — and early results suggest that it does. In LMArena’s blind Text Arena, Grok 4.1 Thinking now holds the #1 overall position, with the fast non-reasoning model at #2.

We’re also preparing a full comparison of all the major AI Autumn releases, covering their strengths, weaknesses, and real-world performance. For now, Grok 4.1 clearly manages to stand out in one of the most competitive upgrade seasons we’ve seen.

How to Use Grok on Your Mac

If you want to use Grok on your Mac without logging into X or creating an account, you can run it inside a standalone desktop app. Fello AI supports Grok out of the box and connects to xAI’s official API, so you can chat with the model anonymously without linking any social accounts.

Just download the app, pick Grok from the model list, and start typing. There’s no X login, no browser session, and no personal data sent to xAI beyond your questions. The chat client runs as a native macOS app while the model itself is reached through xAI’s API, so your conversations stay separate from the social side of the platform.

For more details, you can find the full guide here.

Conclusion

All major AI labs are now releasing new models every few weeks. Everyone wants to be first on the leaderboard, and the competition is getting faster each month.

But most people don’t actually need all this power. The average user won’t use more than a small part of what these frontier models can do. For daily tasks, the differences between them are getting harder to notice.

So who are these new models really made for?

They’re built for the next wave of use-cases: complex reasoning, long multi-step tasks, research, automation, and systems that can work alongside humans instead of just answering prompts. This is the direction the entire industry is moving toward.

Grok 4.1 is another step in that shift. The improvements in stability, emotional awareness, and reliability aren’t just quality-of-life upgrades — they’re foundations for bigger abilities coming later.

In the middle of this “AI Autumn” race, it’s becoming clear that companies aren’t only competing for better chat. They’re competing for the first model that feels trustworthy enough to handle real work, not just generate text.

Michal Langmajer
November 18, 2025
AI benchmarks, ai race, benchmark, elon musk, grok, grok 4, grok 4.1, llm, LMArena, xai
