Grok 4 Beats GPT-5 in Running a Business, According to Latest Vending-Bench
AI benchmarks are usually abstract. They test math puzzles, programming problems, or reading comprehension tasks that most people never encounter in real life. But the newest yardstick for long-term AI performance is surprisingly ordinary: running a vending machine. The benchmark, called Vending-Bench, was created by Andon Labs to test whether AI agents can handle one of their hardest unsolved problems: staying coherent and effective over long stretches of time. Each agent is tasked with operating a simulated vending machine business: tracking inventory, placing orders, setting prices, collecting revenue, and paying daily fees. On paper, these are trivial management chores. In practice, they push models to their limits when extended across 20 million […]
