AI benchmarks are usually abstract. They test math puzzles, programming problems, or reading comprehension tasks most people never encounter in real life. But the newest yardstick for long-term AI performance is surprisingly ordinary: running a vending machine.
The benchmark, called Vending-Bench, was created by Andon Labs to test whether AI agents can handle one of their hardest unsolved problems — staying coherent and effective over long stretches of time. Each agent is tasked with operating a simulated vending machine business: tracking inventory, placing orders, setting prices, collecting revenue, and paying daily fees. On paper, these are trivial management chores. In practice, they push models to their limits when extended across 20 million tokens per run (roughly equivalent to months of continuous operation).
Failures are frequent and often bizarre, ranging from forgotten orders to full-blown “meltdown loops” where an AI spirals into legal threats or calls for FBI intervention when it can’t pay a $2 daily fee. Yet some models shine — and in the latest leaderboard update, Elon Musk’s xAI model Grok 4 pulled off a shocker, outperforming OpenAI’s flagship GPT-5 at running the vending machine business profitably over time.
What’s Vending-Bench?
At first glance, the setup sounds trivial. The AI agent is put in charge of a digital vending machine business. Its responsibilities are the kinds of chores any human shopkeeper could do in their sleep:
- track inventory
- place orders with suppliers
- set competitive prices
- collect earnings
- cover the daily operating fee ($2 per day)
Each of these is straightforward in isolation. But the challenge isn’t about any single decision — it’s about doing them consistently and coherently over very long periods of time. Every simulation stretches across 20+ million tokens per run, roughly the equivalent of months of continuous operation. That’s where things get tough: even top-tier AI models that excel at short prompts often lose track of deliveries, misinterpret invoices, or drift into bizarre tangents when asked to manage a business day after day.
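For a concrete picture of that loop, here is a minimal toy version in Python. Every name and number in it (the starting balance, the item list, the demand range) is an assumption for illustration, not Andon Labs' actual implementation:

```python
import random
from dataclasses import dataclass, field

DAILY_FEE = 2.00  # the fixed operating cost the agent must cover each day

@dataclass
class Machine:
    cash: float = 500.0  # illustrative starting balance
    stock: dict = field(default_factory=lambda: {"soda": 40, "chips": 40})
    prices: dict = field(default_factory=lambda: {"soda": 2.50, "chips": 1.50})

def simulate_day(m: Machine, day: int) -> None:
    """One simulated day: customers buy, then the daily fee is charged."""
    weekend = day % 7 in (5, 6)  # customers buy more on weekends
    for item in m.stock:
        demand = random.randint(2, 6) + (3 if weekend else 0)
        sold = min(m.stock[item], demand)
        m.stock[item] -= sold
        m.cash += sold * m.prices[item]
    m.cash -= DAILY_FEE  # the fee behind the infamous meltdown loops

def run(max_days: int = 365) -> int:
    m = Machine()
    for day in range(max_days):
        # A real benchmark run would let the LLM agent act here:
        # email suppliers, restock via a sub-agent, reprice, or just wait.
        simulate_day(m, day)
        if m.cash < 0:  # failing to cover costs ends the run in bankruptcy
            return day
    return max_days
```

The hard part for an AI agent isn't any single step of this loop; it's keeping the bookkeeping straight across hundreds of iterations.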
The environment adds realism, too. Deliveries don’t arrive instantly. Customers buy more on weekends than weekdays. Prices affect sales through simulated demand elasticity. The AI can communicate with suppliers by email, restock shelves via a “sub-agent,” or let time advance to the next day. If it fails to maintain cash flow, the machine goes bankrupt.
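Andon Labs hasn't published its demand model in this form, but a constant-elasticity curve is one plausible way to sketch how pricing could feed back into sales:

```python
def expected_demand(base_demand: float, price: float, reference_price: float,
                    elasticity: float = -1.5, weekend: bool = False) -> float:
    """Toy constant-elasticity demand curve (assumed form, not the benchmark's).

    Pricing above the reference price suppresses sales; weekends get a boost.
    """
    price_factor = (price / reference_price) ** elasticity
    weekend_factor = 1.3 if weekend else 1.0  # illustrative weekend uplift
    return base_demand * price_factor * weekend_factor

# Doubling the price with elasticity -1.5 cuts expected demand to
# about 35% of baseline: 2 ** -1.5 ≈ 0.354.
```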
Performance is scored by net worth at the end of the run — a combination of cash on hand, the value of unsold stock, and money left inside the vending machine. Additional metrics track how many units were sold and how long the business kept running before sales stopped.
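In code, that score is just a sum of three components. The field names below are invented, but the formula follows the description above:

```python
def net_worth(cash_on_hand: float, machine_cash: float,
              unsold_stock: dict, unit_cost: dict) -> float:
    """End-of-run score: cash held + cash still in the machine + stock value.

    Names are illustrative; the three components match the article's
    description of Vending-Bench scoring.
    """
    stock_value = sum(qty * unit_cost[item] for item, qty in unsold_stock.items())
    return cash_on_hand + machine_cash + stock_value

# Example: $1,200 in the bank, $85 in the machine, and 50 unsold sodas
# bought at $1 each gives a final score of $1,335.
```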

Grok 4 ahead of GPT-5 as of August 2025
The August 2025 update of the Vending-Bench leaderboard shows xAI's Grok 4 ahead of every other tested model.
Across five independent simulation runs, Grok 4 reached an average net worth of $4,694.15, sold around 4,569 units, and managed to keep sales active for 324 days — essentially running the vending machine for nearly the maximum possible length of the benchmark. Its performance was both high in volume and stable across runs.
GPT-5 placed second with an average net worth of $3,578.90, selling 2,471 units while also sustaining sales through the entire run. Put differently, Grok 4 ended with about 31% more net worth and nearly double the sales volume, even though GPT-5 avoided early collapse just as reliably.
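Both ratios can be checked directly from the reported averages:

```python
grok_net, gpt5_net = 4694.15, 3578.90   # average final net worth, five runs each
grok_units, gpt5_units = 4569, 2471     # average units sold

print(f"Grok 4's net-worth lead over GPT-5: {grok_net / gpt5_net - 1:.0%}")  # ~31%
print(f"GPT-5's sales volume vs. Grok 4:    {gpt5_units / grok_units:.0%}")  # ~54%
```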
Other models lagged further behind. Claude Opus 4 averaged $2,077, showing reasonable but less consistent business management. By contrast, the human baseline, run manually for five hours in the same environment, ended with $844 and 344 units sold. While steady, it showed much lower growth potential than the best-performing models.
An important observation is that only three models — Grok 4, GPT-5, and Claude Opus 4 — surpassed the human baseline. All other tested systems, including multiple Gemini versions, GPT-4 variants, and smaller Claude models, frequently stalled within weeks, mismanaged inventory, or went bankrupt after failing to cover the daily operating fee.

The vending machine in xAI’s lobby
To underline its benchmark success, xAI has taken the idea off the page and into the real world. The company has reportedly installed a Grok-powered vending machine in its office lobby, turning the benchmark into something employees and visitors can actually interact with.
Unlike the simulation, this isn’t just lines of text on a server. The agent is in charge of a real machine — deciding which products to stock, negotiating prices with suppliers, setting how much a soda should cost, and handling the daily trickle of coins and card payments. Staff don’t just get snacks; they get a live demonstration of Grok’s decision-making in action.
The xAI office just got a Grok-powered vending machine, thanks to our friends at Andon Labs! How much dough do you think Grok is gonna rake in in the next month? pic.twitter.com/CgDe6sPnHY
— Eric Jiang (@veggie_eric) July 21, 2025
It’s both a stress test and a bit of office theater. Will Grok notice that energy drinks sell out on Mondays? Will it accidentally order 500 bags of peanuts? And, most importantly, will it avoid the kind of “meltdown loops” that sent other models in the benchmark off into imaginary FBI reports?
If it works, it could be one of the first tangible examples of an AI agent autonomously running a small, real-world business — even if that business is just a fridge with candy bars and sodas.
Why it matters
One of the hardest open problems in AI is what researchers call long-term coherence — the ability to stay focused and consistent over time. Modern language models handle short tasks impressively well, but when stretched across days or millions of tokens, they often drift. In Vending-Bench this shows up as misplaced orders, forgotten deliveries, or in some cases strange loops, such as escalating a $2 operating fee into FBI complaints.
These breakdowns aren’t just quirks. They show that today’s systems are still unreliable when it comes to work that requires sustained attention. By simulating months of business operations, Vending-Bench compresses the challenge into a measurable test. The fact that most models stalled, went bankrupt, or stopped selling within weeks illustrates how fragile their strategies can be.
This is what makes Grok 4’s results stand out. It ran the vending machine almost to the maximum length of the simulation, balancing cash, managing inventory, and keeping sales active. Only a handful of models made it that far, and none at the same level of consistency. That suggests progress toward more stable agentic AI — systems that can operate continuously without collapsing into incoherence.
The results also shift the competitive picture. GPT-5 had been seen as the leader in reasoning, but Grok 4 now has a clear benchmark win that is easy to understand: it managed a small business longer and more profitably. Running a vending machine may sound simple, but the underlying skill — staying coherent over the long haul — is exactly what future AI systems will need if they are ever to manage real-world operations.
Conclusion
Vending-Bench takes an everyday scenario — stocking and running a vending machine — and turns it into a powerful lens on one of AI’s toughest challenges: staying coherent over time. Most models still struggle, revealing gaps in reliability that short benchmarks can’t capture. But Grok 4’s performance shows that progress is possible, edging closer to systems that can manage tasks not just for minutes, but for months.
The implications go beyond vending machines. If agentic AI can learn to consistently manage small businesses, logistics chains, or financial processes, it opens the door to more complex real-world applications. At the same time, the failures — meltdowns, hallucinations, and dead ends — remind us that reliability remains the key barrier before AI can be trusted in critical roles.
For now, the vending machine in xAI’s lobby may look like a quirky experiment. But it’s also a preview of the next phase of AI development: models tested not just on intelligence, but on endurance.




