
Document-tuning instills durable animal compassion in LLMs (and generalizes to humans)

LessWrong · May 21, 2026, 3:29 AM

Note: This post focuses on the alignment implications. Our EA Forum post, focusing on the implications for animal welfare, is here.

Jasmine Brazilek & Miles Tidmarsh: Compassion Aligned Machine Learning

Preprint, March 2026: Full paper | Hugging Face resources | ANIMA benchmark

TL;DR

Instruction-tuning and reinforcement learning are effective in certain specific domains but may produce superficial and/or context-specific alignment towards values like compassion. Pretraining or midtraining LLMs on synthetic documents may produce more genuine beliefs because persona vectors are formed early during pretraining. These persona vectors include which values are associated with the AI assistant self-concept or with effectiveness. We use the case of animal welfare to test how well we can shape the specific values of LLMs in robust ways. Our document-tuned model scores 77% on a new public benchmark for animal welfare reasoning, compared to 40% for an equivalent instruction-tuning approach. The effect generalises to humans, survives subsequent fine-tuning better than the instruction-tuning approach, and causes no degradation on safety or capability benchmarks. We publicly release the ANIMA benchmark, all model checkpoints, and training data.

Synthetic Document Tuning

We picked animal welfare as the value to test for four reasons. First, it matters, and the animal welfare field is not really working on how to align AIs to animals. Second, it gives us a cleaner test because almost no existing pretraining or RLHF data targets it, so the signal stays close to what we did rather than what the base model already has. Third, and this part is more speculative, broader values like compassion seem to be learned somewhere in the model regardless of the specific entities it is trained on, and so should generalize. Fourth, how an AI system treats less powerful beings like animals matters enormously for how it will treat humans when it is more powerful than us.

This is not instruction-tuning.
We are not training the model t

Article preview — originally published by LessWrong. Full story at the source.


