What should go in a model spec?
Suppose an AI company is considering whether to include some particular quality X – a rule, virtue, heuristic, default, attitude, goal, or style – in a model spec.

Perhaps they are considering whether their LLM should have prosocial drives. Perhaps they’re wondering if the LLM should whistleblow to help prevent extreme power concentration. Or perhaps they’re uneasy about whether the LLM should be so exactingly honest that it always tells the truth to children about Santa. And so on.

What kind of reasons might be invoked over the course of such considerations? Which criteria are most important? And how might these criteria clash?

Consider four rough categories of reasons one might invoke:

Behavioral Usefulness: Would the behavior make current and future LLMs more beneficial to the users or to the public at large?

Accountability and Evaluability: Would publicly specifying the behavior make it easier for third parties to evaluate the LLM and the company?

Coordination and Common Knowledge: Would publicly specifying the behavior help society converge on, or enforce, a desirable standard for AI behavior?

Trainability and LLM Psychology: Is the behavior the kind of thing we can make an LLM do well, without bad side-effects, given what we know about model psychology and training practice?

I will not attempt to settle the relative weight of these categories, or the relative weight of sub-categories within them. My plan is instead to simply list sub-criteria that are plausible within these categories – to make a checklist one could consider when adding something to a model spec.

Such a checklist is useful in part because people advocate for LLMs to have model specs for very different reasons. There may be some kind of a conflationary alliance around them; many people are in favor of model specs but picture them being used in different ways, such that the “ideal model spec” is different according to different visions of this use. I hope that going over criteria for inclusion in a model