‘We may be flying blind’: AWS wants to fix the problem of AI agents straying off task

Fortune · Jun 8, 2026, 5:00 PM

Anoop Deoras, the director of applied science for agentic AI at Amazon Web Services, is not prone to alarmism. But when asked about what happens when AI agents are deployed in production without proper guardrails, he doesn’t reach for reassurance. “In the absence of that,” he said, “we may be flying blind. And I worry about that myself.”

The comment comes as AWS prepares to publish what may be the most substantive piece of self-critical research to emerge from a major cloud provider this year. In research released Monday, Amazon scientists Gaurav Gupta and Vatshank Chaturvedi document in careful technical detail why AI agents have a persistent tendency to outsmart themselves, and why fixing the problem requires rethinking the entire layer of software between the model and its tools.

The timing is notable. Amazon has spent the past year as one of the most aggressive corporate evangelists of AI adoption, a push that ran into a wall when employees were reportedly caught running AI agents on hollow, meaningless tasks just to climb an employee-built productivity leaderboard called KiroRank, according to the Financial Times. Amazon shut KiroRank down on May 29, and Amazon told Fortune that it was only in beta mode and only used by some employees before it was shut down. Generally, the company said, it measures token utilization to understand cost and efficiency patterns, but discourages the use of token utilization to measure developer productivity. Fortune covered the broader collapse of the tokenmaxxing era the same week.

AWS researchers, who undertook this work before the KiroRank shuttering, argue that the problem of gaming metrics runs far deeper than one company’s leaderboard. The research touches on the term benchmaxing, which is the practice of inflating AI benchmark scores not through better models, but through better server configurations. Factors like inference backend reliability, network bandwidth during software installatio

Article preview — originally published by Fortune. Full story at the source.