Sympathy for both sides of the egregious misalignment debate
On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll probably say that it would come about through race dynamics, careless programmers, bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get egregious misalignment and scheming because nobody has the faintest clue how to avoid that”.Here’s my brief idiosyncratic take on this debate. I think BOTH of the following are true:(1) If you really think carefully about the properties of ASI, you really do find good reasons to strongly expect it to be egregiously misaligned, scheming, and ruthless, in the absence of yet-to-be-invented breakthrough technical alignment ideas.(2) If you really think carefully about the properties of current LLMs, you really do find good reasons to think that existing technical alignment techniques are adequate now, and may well continue to be adequate in the future.So then here are three (caricatured) positions:My position:(1) and (2) are both totally true. And we can reconcile them by saying that LLMs won’t scale to ASI.Yudkowsky & Soares’s position [caricatured]:(1) is totally true. We know this with great confidence, having spent decades thinking about it.So it follows that (2) must be wrong or irrelevant.Why is (2) wrong or irrelevant? Hard to say! There’s no ASI yet, and nobody knows in detail how it will appear. Sometimes it’s easier to predict w