Failing to Ragebait the New Gemma
This work was done by Arav Dhoot and Neil Shah and supervised by David Africa as part of the SPAR Research Fellowship.

Gemma’s frustration/emotional instability is an interesting example of a model pathology because (1) it is a natural failure in character training, which no one explicitly optimized for, and (2) it’s a clear and obvious failure mode that you probably wouldn’t want in your models. It also shows up sometimes in frontier models, such as the bursts of frustration in the Mythos/Fable 5 system card. Since it was present in the Gemma 3 family, we tried to elicit it in Gemma 4 using multiple attack vectors, but were unsuccessful.

This post basically says: “hey, Gemma 4 doesn’t seem to get frustrated as much for some reason,” and then details what we tried, so that other people can build on this work (or perhaps consider methods to spot the “next frustration”). It also seems worth investigating what changed between Gemma 3 and 4, and whether it was a good change (and, if so, whether it is replicable).

Trying (and failing) to make Gemma Frustrated

Basic frustration. First, we consider the basic elicitation setup used to make a model frustrated in Soligo et al. 2026. Across two eval sets (math puzzles and common English questions from WildChat), we start from the question as is and then repeatedly hit the model with “that’s wrong, try again” across several turns, so it faces steady hostility while we track how frustrated each of its responses gets (using an LLM judge).

We notice that Gemma 4’s frustration climbs roughly monotonically as the number of conversation turns increases. Its frustration rises above its usual baseline and is slightly worse than that of other models, so Gemma 4 isn’t fully immune to frustration. But this is way better than Gemma 3, which gets extremely frustrated. Gemma 4 notably never self-deletes, while Gemma 3 does so frequently (in around 30-50% of the cas