A photorealistic image of a vending machine centered in a long school hallway lined with teal lockers. The vending machine is brightly lit and filled with colorful snacks and drinks. Overlaid text reads: “Grok 4 Outperforms GPT-5 In Latest Business Benchmark” with “Business Benchmark” highlighted in yellow.

Grok 4 Beats GPT-5 at Running a Business, According to Latest Vending-Bench Results

AI benchmarks are usually abstract. They test math puzzles, programming problems, or reading comprehension tasks most people never encounter in real life. But the newest yardstick for long-term AI performance is surprisingly ordinary: running a vending machine. The benchmark, called Vending-Bench, was created by Andon Labs to test whether AI agents can handle one of their hardest unsolved problems — staying coherent and effective over long stretches of time. Each agent is tasked with operating a simulated vending machine business: tracking inventory, placing orders, setting prices, collecting revenue, and paying daily fees. On paper, these are trivial management chores. In practice, they push models to their limits when extended across 20 million […]