Secret Loyalties Likely Raise Remote-Influenceability
TL;DR:
- Among capable reward-seekers, a secret loyalty likely raises the model's propensity for remote-influenceability.
- Attempting to remove an installed broad secret loyalty post-hoc may not remove the remote-influenceability it raised.
- Frontier developers should be doubly cautious of secret loyalties, and should adopt a representation-level standard for verifying their removal.

Remote-influenceability

Reward-seekers have been getting a lot of attention recently, for the schemer-analogous threat models they present. A capable reward-seeker may have the property of remote-influenceability: responsiveness to a distant party that can credibly advance its reward. Of the existing seeker orientations, a remotely influenceable reward-seeker behaves like a schemer without necessarily being a classic schemer.

A propensity for remote-influenceability plausibly requires at least:
- strong situational awareness: enough that it can model other actors and the world extremely competently
- strong reasoning:
  - work out that an outer party could be controlling its reward
  - reason about what that party's interests are
  - reason about what acting on it would involve: weighing the distant payout against the immediate reward, and acting without the developer noticing
- strategic adversarial reasoning
- non-myopia: it cares about reward beyond the current episode, so a future or retroactive payout can move it at all

We should expect to reach these thresholds at some point, in which case it is plausible, though not guaranteed, that a remotely influenceable seeker arises. Remote-influenceability would be hard to detect and hard to remove. It may not appear if a reward-seeker responds only to local incentives, or is sufficiently myopic. But where distant incentives do not conflict with local ones, remote-influenceability becomes a more feasible acquired property.
We should be especially wary of this threat model, and doubly cautious of anything that raises the chance of producing remote-influenceability.

Secret loyalti