GLM 5.2 playing text adventures
I’ve heard some buzz around the new GLM 5.2 open-weights model. They say it’s very capable! I won’t run a full comparison benchmark, but I have some credits sloshing around on OpenRouter, so I figured I might compare GLM 5.2 to the similarly-priced Gemini 3 Flash[1] and see where things land.

This uses the same setup as the previous benchmark: each LLM gets a few attempts at playing each game, with each attempt limited to a fixed budget of around $0.15. The LLM doesn’t know it, but the harness tracks achievements for each game and counts how many the LLM earns in each attempt.

Here is the number of attempts for each game in this run.

| Game | Attempts per model |
|---|---|
| Lost Pig | 4 |
| Organ Grinder’s Monkey | 2 |
| Not All That Shimmers | 3 |
| Kill Wizard | 3 |
| 9:05 | 5 |
| Total | 17 |
| 💸 | $5.1 |

Then I did the stupid, silly thing and fitted a plain linear regression predicting the achievement count for each attempt, with the LLM model as an explanatory fixed effect and the game as a random effect.[2] When thus controlling for game difficulty, Gemini 3 Flash earns just over eight achievements in a typical attempt. The new GLM 5.2 earns 15 % fewer, and this is statistically significant at customary significance levels.

This does not tell us much: is 15 % fewer achievements very bad or reasonable? Hard to tell without comparing to other models, but the gap is roughly the same magnitude as the standard deviation of the residual noise in the fitted model. Thus we can say GLM 5.2 is about 0.8 noise levels worse than the king of text adventure playing LLMs. That’s impressive. For example, it is definitely better than Gemini 2.5 Flash, which is 1.6 noise levels worse than Gemini 3 Flash.

(Due to the budget constraint, otherwise capable but very expensive models like Sonnet 4.5 and GPT 5.2 are 2.5 and 3 noise levels worse, respectively.)

[1] The market currently offers inference with the GLM 5.2 model at $4.4 per million output tokens, whereas Google charges $3 per million output tokens for their model. I expect the price of the GLM