Fake Alignment Till You Make Alignment
It may sound epistemically fraught, but it frequently works. Sometimes all it really takes to get good at something is having the confidence that you’ll be good at it. I’ve done this many times at work, in romance, and even in writing blog posts. But it only works because I’m careful to never fake my evals.

By this I mean that I never fake the way I measure whether I’m successful. Let’s say I’m trying to learn a new hobby, like whittling. I believe I’ll be good at it if I just put in the time, so I put in the hours carving wood. What I have to be careful about, though, is not letting myself move the goalposts. I need to have a clear vision in my head of what success is, and work towards that. If I carve something crappy and tell myself “actually, that’s good enough, I’m good at whittling”, that’s how I trick myself into just being fake.

I’ve mostly avoided being fake by demanding authenticity of myself. For example, back in school, I refused to take shortcuts just to pass a test. Instead, I put in the extra work to really learn something because, to me, the grade was never the point. I’ve taken a similar approach to meditation (the point is waking up, not special mental states), romance (I want a good relationship, not to be datable), and friendship (I don’t want to seem like a good friend, I want to actually be one).

I bring all this up because I’ve been thinking about fake-it-till-you-make-it and authenticity dynamics lately in relation to AI. Specifically, I’ve been wondering if we can “fake” our way to alignment. I don’t mean “faking alignment” in the obvious sense; that’s a problem, and we’re working hard to avoid it. I mean “fake” here in the sense of having confidence in our aspiration to build aligned AI.

But such “faking” only works if we don’t fake the evals, and I worry we’re doing that now by selling ourselves short on what we mean by alignment. I get quite nervous when someone says they’re building “aligned AI” a