Do capabilities generalize across propensities?

LessWrong · May 9, 2026, 9:39 PM

Thanks to Alex Mallen, Arjun Khandelwal, Arun Jose, Keshav Shenoy, & Sam Marks for helpful discussion on this experiment. This experiment is inspired by a proposal by Sam Marks.

Summary. These are some results from an experiment I ran a few months ago, which I thought weren't quite good enough to post. However, I cannot see when I will next get the chance to improve on these experiments, so I'm posting this intermediate progress.

We study to what extent capabilities learned during training transfer across different behavioral propensities. To do this, we train models on pairs of tasks, where each task is paired with a distinct propensity (e.g., chess with bold formatting, post-KCO facts with plain text). We then evaluate whether models can perform each task when prompted with the propensity they were not trained on for that task, measuring the "performance gap" between matched and swapped propensities.

We find that:

- On simple tasks, capabilities tend to transfer completely between propensities. A model trained to solve a task with one propensity (e.g., writing in bold) can solve it equally well when prompted with a different propensity (e.g., writing plainly).
- On our most capabilities-laden propensity (reasoning vs. not reasoning), capabilities transfer less. Models trained to answer with step-by-step reasoning struggle to answer in a single forward pass, even on tasks simple enough not to require reasoning (simple multiple-choice Q&A).
- On our hardest task (chess puzzles), we observe substantially less capability transfer between propensities, suggesting that capability-propensity entanglement may increase with task difficulty.

Background

For inoculation prompting to be successful, we need capabilities to remain even when we prompt the model with its "deployment prompt". One way to frame this question is in terms of "personas."
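As a rough illustration of the "performance gap" mentioned in the summary, here is a minimal sketch. It assumes the gap is the difference in accuracy between prompts that use the propensity a task was trained with (matched) and prompts that use the other propensity (swapped). The post does not spell out the exact formula, so this definition, the `EvalRecord` type, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class EvalRecord:
    """One evaluation sample (hypothetical schema, not from the post)."""
    task: str                 # e.g. "chess"
    trained_propensity: str   # propensity paired with this task in training
    prompted_propensity: str  # propensity requested at evaluation time
    correct: bool             # whether the model solved the sample


def accuracy(records: Iterable[EvalRecord]) -> float:
    """Fraction of correct samples; raises if there is nothing to average."""
    items: List[EvalRecord] = list(records)
    if not items:
        raise ValueError("no records to score")
    return sum(r.correct for r in items) / len(items)


def performance_gap(records: Iterable[EvalRecord], task: str) -> float:
    """Matched-propensity accuracy minus swapped-propensity accuracy.

    Assumed definition: a gap near 0 means the capability transferred
    across propensities; a large positive gap means it did not.
    """
    task_records = [r for r in records if r.task == task]
    matched = [r for r in task_records
               if r.prompted_propensity == r.trained_propensity]
    swapped = [r for r in task_records
               if r.prompted_propensity != r.trained_propensity]
    return accuracy(matched) - accuracy(swapped)
```

Under this definition, complete transfer on a task would show up as a gap close to zero, while entanglement between capability and propensity would show up as a clearly positive gap.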
If inoculation prompting works by training a "reward hacking persona" while deploying an "assistant persona," then the key question becomes: how much are capa

Article preview — originally published by LessWrong. Full story at the source.