[Video thumbnail: PewDiePie beside a desktop PC with RGB lighting, with overlay text "PewDiePie's Homemade AI Just Destroyed GPT-4o."]

PewDiePie’s AI Model Beats GPT-4o: How He Fine-Tuned a Coding AI at Home

PewDiePie’s AI model scored 39% on the Aider Polyglot benchmark, outperforming GPT-4o (23.1%) and Gemini 2.0 Pro Exp (35.6%) on a coding test widely used in AI research. The retired YouTube star Felix Kjellberg spent months building a $41,000 home GPU rig, reading machine learning papers, and grinding through failed training runs to get there. He published the full journey in a video released on February 26, 2026.

This is not PewDiePie casually asking ChatGPT to write code. He built a 10-GPU workstation, scraped GitHub for training data, generated synthetic datasets using the DeepSeek API, and fine-tuned Qwen 32B, a 32-billion-parameter open-source model. The result is a genuine AI fine-tuning project that many professional ML engineers would recognise as legitimate work, even if the framing involves power connectors that almost set his house on fire.

The Key Takeaways

– PewDiePie’s fine-tuned model scored 39% on Aider Polyglot, beating GPT-4o (23.1%) and Gemini 2.0 Pro Exp (35.6%)
– He used Qwen 32B as the base model and fine-tuned it on custom coding data, not trained from scratch
– His home AI rig cost approximately $41,000 with 10 GPUs and 424GB of VRAM
– The benchmark journey ran from 8% all the way to 39% across months of failures and a major contamination crisis
– The breakthrough came from using the DeepSeek API to generate high-quality synthetic training data

What PewDiePie’s AI Model Actually Is

Let’s be precise here, because most outlets are not. PewDiePie did not train a large language model from scratch. He performed fine-tuning on Qwen 32B, a pre-existing open-source model developed by Alibaba that was already strong at coding tasks.

Fine-tuning means taking a model that has already been trained on vast amounts of data and continuing to train it on a specific, curated dataset to sharpen performance on a narrower task. It is a standard and widely used technique in professional AI development. Companies and researchers use it constantly to take open-source base models and specialise them for their use case.

What makes PewDiePie’s project notable is the scale of the effort, the rigour he applied when things went wrong, and the final result. He was not just prompting a chatbot; he was building and iterating on a training pipeline.
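
The principle behind fine-tuning can be sketched at toy scale: start from weights that are already good on a broad task and continue gradient descent on a small, narrow dataset. The numpy example below is purely illustrative of that idea, not PewDiePie’s actual pipeline (which involved Qwen 32B and a multi-GPU training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" linear model: weights already fitted on a broad task.
w_pretrained = np.array([1.0, -0.5])

# Small fine-tuning dataset drawn from a slightly shifted narrow task.
X = rng.normal(size=(64, 2))
w_target = np.array([1.2, -0.4])            # what the narrow task wants
y = X @ w_target + rng.normal(scale=0.01, size=64)

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

# Continue training from the pretrained weights (not from scratch).
w = w_pretrained.copy()
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the MSE loss
    w -= lr * grad

print(mse(w_pretrained), mse(w))            # loss drops after fine-tuning
```

Fine-tuning a 32B-parameter transformer follows the same loop, just with billions of weights, GPU sharding, and a curated text dataset instead of two numbers and a line fit.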

His Hardware Setup

PewDiePie assembled a workstation that most individuals would never consider building:

| Component | Spec |
| --- | --- |
| GPUs | 8x modded RTX 4090 (48GB VRAM each) + 2x RTX 4000 Ada |
| Total VRAM | 424GB |
| Estimated cost | ~$41,000 |

He nicknamed his local AI setup ChatOS, a custom interface for running and experimenting with local models. The hardware gives him the memory capacity to run and fine-tune models that would otherwise require cloud compute.
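
Back-of-envelope arithmetic shows why that much VRAM matters for a 32B-parameter model. Using standard rules of thumb (2 bytes per parameter for bf16 weights; roughly 16 bytes per parameter for naive full fine-tuning with Adam, counting fp32 master weights and optimizer moments), 424GB is ample for inference but tight for full fine-tuning, which is one reason sharding or parameter-efficient methods are common. These are generic estimates, not figures from PewDiePie’s actual configuration:

```python
# Rule-of-thumb VRAM estimates for a 32B-parameter model.
params = 32e9

# Inference: bf16 weights at 2 bytes per parameter.
weights_bf16_gb = params * 2 / 1e9

# Naive full fine-tuning with Adam: bf16 weights + fp32 master copy
# + two fp32 optimizer moments ~= 16 bytes/param (before activations).
full_ft_adam_gb = params * 16 / 1e9

print(f"weights only (bf16):    {weights_bf16_gb:.0f} GB")   # 64 GB
print(f"full fine-tune w/ Adam: {full_ft_adam_gb:.0f} GB")   # 512 GB
```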

How PewDiePie’s AI Model Stacks Up on Aider Polyglot

What Is the Aider Polyglot Benchmark?

The Aider Polyglot benchmark tests AI models on 225 challenging programming exercises from Exercism, covering six languages: C++, Go, Java, JavaScript, Python, and Rust. Each model gets two attempts per problem, and the score is the percentage of exercises where all tests pass. Unlike SWE-Bench, which focuses on Python bug fixes in a narrow set of repos, Aider Polyglot tests a model’s ability to write and integrate working code across diverse, multi-language projects.
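
The scoring rule described above is simple enough to sketch: an exercise counts as solved only if all of its unit tests pass within two attempts, and the final score is the solved percentage. This is a toy illustration of that rule, not the real Aider harness:

```python
from dataclasses import dataclass, field

@dataclass
class Exercise:
    # One boolean per attempt: True means every unit test passed.
    attempts: list = field(default_factory=list)

def polyglot_score(exercises):
    # An exercise is solved if either of its first two attempts passes.
    solved = sum(any(ex.attempts[:2]) for ex in exercises)
    return 100 * solved / len(exercises)

# Toy run: a model that solves 9 of 10 exercises within two attempts.
toy = [Exercise([True]) for _ in range(7)]           # pass first try
toy += [Exercise([False, True]) for _ in range(2)]   # pass on retry
toy += [Exercise([False, False])]                    # never passes
print(polyglot_score(toy))  # 90.0
```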

Here is how PewDiePie’s model sits against other models on the current leaderboard:

| Model | Aider Polyglot Score | Notes |
| --- | --- | --- |
| GPT-5 | 88.0% | Current top tier |
| Gemini 2.5 Pro Preview | 83.1% | Top tier |
| GPT-4.1 | 52.4% | Above PewDiePie |
| PewDiePie’s model | 39% | Qwen 32B fine-tune |
| Gemini 2.0 Pro Exp | 35.6% | Beaten |
| GPT-4o | 23.1% | Beaten |
| GPT-4o mini | 3.6% | Beaten by a large margin |

*(Scores from the Aider Polyglot leaderboard at aider.chat, February 2026)*

His model sits well below the current top performers, which score in the 80-88% range. But it comfortably beats both GPT-4o and Gemini 2.0 Pro Exp, which is exactly what he set out to do. His original target was GPT-4o, which scored 16% on the benchmark when the project started, so his final result more than doubled the score he initially set out to beat.

The Score Journey: From 8% to 39%

This is where the story gets genuinely interesting. PewDiePie’s path to 39% was anything but straight. Here is the full progression:

  1. 8% — Base Qwen 32B tested in the wrong output format. The model was capable but answering in a format the benchmark could not evaluate correctly.
  2. 16% — After switching to the correct format and early fine-tuning on GitHub-scraped data. This matched his original GPT-4o target.
  3. ~19.6% — One run hit this score during experiments with Magicoder-style synthetic data, but the result could not be consistently reproduced.
  4. 4.4% — Disaster. PewDiePie discovered he had been training on the wrong base model and that his benchmark data was contaminated. The model had been memorising test data rather than learning to code. He had to restart from scratch.
  5. 25.3% — After fixing the base model and retraining from scratch.
  6. 36% — After running the full benchmark correctly (one third of it had not been running before).
  7. 39% — After applying post-training improvements using synthetic data generated via the DeepSeek API.

Benchmark contamination, where a model’s training data overlaps with test data and inflates scores artificially, is a real and common problem in ML research. That PewDiePie identified it, diagnosed the issue, and corrected course rather than publishing the inflated number is the part of the story worth taking seriously.
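
A basic decontamination check of the kind used to catch this problem looks for long n-gram overlap between training examples and the test set, and drops any training example that matches. This is a generic sketch of the technique, not PewDiePie’s actual tooling:

```python
def ngrams(text, n=8):
    # Whitespace-tokenised n-grams; texts shorter than n yield none.
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_examples, test_examples, n=8):
    # Flag any training example sharing a long n-gram with the test set.
    test_grams = set().union(*(ngrams(t, n) for t in test_examples))
    return [ex for ex in train_examples if ngrams(ex, n) & test_grams]

test_set = ["def two_fer(name='you'): return f'One for {name}, one for me.'"]
train = [
    "def add(a, b): return a + b",
    # A leaked test solution copied into the training data:
    "def two_fer(name='you'): return f'One for {name}, one for me.'",
]
print(len(contaminated(train, test_set)))  # 1
```

Real pipelines add normalisation (lowercasing, stripping punctuation) and tune `n` so that boilerplate does not trigger false positives, but the core idea is exactly this overlap test.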

The Data Breakthrough: Synthetic Training Data via DeepSeek

The turning point in PewDiePie’s project was data quality, not compute. After reading the Magicoder research paper, which showed how to generate high-quality coding examples synthetically rather than scraping them from GitHub, he began using the DeepSeek API to produce training data at scale.

DeepSeek’s models are strong at code generation, making them useful for producing examples that could teach his fine-tune to solve programming problems in the correct format. If you want to learn more about why DeepSeek became such a significant player in open-source AI, we covered the DeepSeek-R1 open-source breakthrough in detail.

The core insight from Magicoder: if you cannot find enough high-quality examples of the task you want to improve, you generate them using a capable model. The risk is that poor synthetic data degrades performance rather than improving it, which is exactly what happened in PewDiePie’s earlier runs before he refined the pipeline. Getting synthetic data right took multiple failed attempts, and it was only when he combined it with the correct base model and correct benchmark format that everything clicked.
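
The Magicoder approach (OSS-Instruct) can be sketched as a prompt template: feed a real code snippet as a seed and ask a strong model to invent a fresh, self-contained problem and solution inspired by it. The wording below paraphrases the idea rather than reproducing Magicoder’s exact prompt, and the API call itself is omitted:

```python
def oss_instruct_prompt(seed_snippet: str) -> str:
    # Build a Magicoder-style prompt from a seed code snippet.
    return (
        "Please gain inspiration from the following code snippet to "
        "create a high-quality programming problem.\n\n"
        "Code snippet for inspiration:\n"
        f"```\n{seed_snippet}\n```\n\n"
        "Present a [Problem Description] and a complete, correct "
        "[Solution] in the same language."
    )

seed = "def rotate(xs, k):\n    return xs[k:] + xs[:k]"
prompt = oss_instruct_prompt(seed)
# The prompt would then be sent to a capable model (e.g. via an
# OpenAI-compatible chat-completions endpoint) and the returned
# problem/solution pair added to the fine-tuning dataset.
print(seed in prompt)  # True
```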

What This Means for AI in 2026

PewDiePie’s project is a striking illustration of how accessible AI fine-tuning has become. His model sits above Gemini 2.0 Pro Exp on a real coding benchmark. That is remarkable not because it proves hobbyists can replace frontier labs, but because it shows how far targeted fine-tuning can push an open-source model.

The current best AI models for coding in 2026 sit at 80-88% on Aider Polyglot. PewDiePie’s model at 39% is not competing at that level. But his benchmark-specific fine-tuning produced a model that outperforms general-purpose versions of GPT-4o and Gemini 2.0 Pro Exp. That gap between a fine-tuned specialist and a general-purpose model is exactly what the broader industry is discovering as it moves toward smaller, task-specific AI.

Fine-tuning open-source models on targeted datasets is becoming one of the most cost-effective ways to build capable, specialised AI tools. PewDiePie stumbled into a technique that many AI companies are already deploying at scale. The difference is that he documented every failure publicly and in the most entertainingly self-aware way possible.

Conclusion

PewDiePie’s AI model is a better story than most outlets are giving it credit for. It involves real ML techniques, real failures, genuine benchmark contamination drama, and a real final result. His fine-tuned Qwen 32B model hit 39% on Aider Polyglot, a benchmark used by AI researchers to evaluate coding ability, beating GPT-4o and Gemini 2.0 Pro Exp. He is not competing with frontier labs, and he is the first to acknowledge the project was ultimately benchmaxxing rather than general AI research.

If you want to track exactly how models compare on coding benchmarks, the Aider Polyglot leaderboard at aider.chat is updated regularly and is one of the most useful public resources for comparing AI coding ability.

FAQ

Did PewDiePie really train his own AI model?

He fine-tuned an existing open-source model called Qwen 32B. Fine-tuning is a standard AI development technique where a pre-trained model is further trained on custom data to improve performance on specific tasks. He did not build or train a model from scratch.

What score did PewDiePie get on the AI benchmark?

His model scored 39% on the Aider Polyglot benchmark, which tests coding ability across 225 exercises in C++, Go, Java, JavaScript, Python, and Rust. GPT-4o scores 23.1% on the same benchmark, and Gemini 2.0 Pro Exp scores 35.6%.

Did PewDiePie beat GPT-4o?

Yes, on the Aider Polyglot benchmark specifically. GPT-4o scores 23.1% and his model scored 39%. That said, GPT-4o is a general-purpose model and his fine-tune was specifically optimised for this coding benchmark. Top models like GPT-5 score 88% on the same test.

How much did PewDiePie’s AI rig cost?

Approximately $41,000. The setup includes 10 GPUs: 8 modded RTX 4090s with 48GB VRAM each and 2 RTX 4000 Ada cards, totalling 424GB of VRAM across the system.

What is benchmark contamination?

Benchmark contamination happens when a model’s training data includes examples from the test set it is being evaluated on. This inflates the score artificially because the model is effectively memorising answers rather than solving problems. PewDiePie hit this issue mid-project, discovered it, and restarted training to get a clean result.
