
Agents are under-elicited: A case study in optimization tasks

LessWrong · Jun 18, 2026, 2:39 AM

"Knowing is not enough; we must apply. Willing is not enough; we must do." (Johann Wolfgang von Goethe)

In our previous post, we introduced inverse rubric optimization (IRO): tasks where an agent must learn the preferences of a black-box judge under a label budget. These are LLM optimization tasks, in which an agent iteratively optimizes a metric. In this post, we study which general prompting and scaffolding methods improve performance in these settings. We show that our methods roughly double performance by changing both how much of their resources models use and how effectively they use each unit of resource. This case study suggests agents are under-elicited by default, and that simple methods can exploit this to yield substantial gains.

Fig. 1: Elicitation roughly doubles eval score at every resource budget (top). At budget 10k (bottom), the elicited run, using handoff and prompting, climbs more steeply per label and runs far longer before stopping than the baseline. The resource here is labels from the oracle judge model whose preferences are being learned, and each curve stops at the mean finish point of that method's runs.

Decomposing performance in LLM optimization

In an LLM optimization trajectory, the agent iteratively makes progress by submitting attempts and reasoning about the results. The environment gives it a feedback metric it can call to check the quality of its work, for example a test of how fast its code runs. We call the feedback metric available during the trajectory the train metric, as opposed to the ground-truth score of the final submission (the eval score). In IRO, the train metric is the judge-labeled score on a batch of samples from the train set, with the batch size chosen by the agent for each submission. The train metric is thus a noisy proxy for the eval score on the final submission.

LLM optimization naturally allows us to study the ability of models to use resources, which in this case is how

Article preview — originally published by LessWrong. Full story at the source.
