Axes of Planning in LLMs + Partial Lit Review
Epistemic Status: Written over the course of a couple of days at Inkhaven. Some of the info is old, so some newer papers are excluded.

TL;DR: People talk about "interpreting planning and goals" in models all the time, but they have different understandings of what exactly "planning" means. I try to decompose this into a few axes, and categorize some existing work along them.

We give a model a prompt, and it produces an output. How does that work?

Do language models plan?

When people look at language models and want to use interpretability and evaluations to understand behavior, one natural question to ask is whether the model is planning. However, "planning" is a relatively vague concept that points at a few different things. I try to walk through a bunch of examples that seem somehow related to aspects of "planning", then try to divide planning into a few different axes.

Some related examples

Here are a few different things that seem related to planning:

- The model is writing something, and notices some fact that is not useful for the immediate next-token prediction, but that may or may not be useful later.
- The model realizes it is finished with the current point, and gives an output that indicates it wants to move on to the next thing (e.g. the paragraph ended, or a newline).
- The model has read that it is supposed to move on to the next line, and needs to start writing about the next thing.
- The model has a vague outline of what the whole piece is going to say, but hasn't written it down.
- The model spends some time reasoning about the best policy to use in a game, but within the game it doesn't deviate from the simple [observation] → [action] policy.
- The model has turned the vague outline into a written outline of what it is going to say, and is now trying to follow the outline.
- One robot wants to go to the North Pole, and each day goes north one mile as a result of this. The other robot doesn't care about going to the North Pole, but also just goes north one mile each day.[1]

The