I Bet Abliteration's Cost Was Sloppy Implementation. I Was Wrong

LessWrong · Jun 14, 2026, 9:44 AM · Also reported by 1 other source

Models refuse. They can refuse for lack of knowledge, because of predetermined guardrails, and so on. Both closed-weight and open-weight models refuse. But open-weight models are, well, open, so enthusiasts have developed techniques that exploit (and edit) the model’s internals to avoid refusal.

One such technique is abliteration, as described in Arditi et al. (2024): removing the “refusal direction” from the model’s weights so that it cannot say no.

In a previous post, I went over the cost of abliteration, that is, its effect on the “quality / accuracy” of the model.

In that post, I saw that HuiHui AI, famous for releasing abliterated models, has released a crudely abliterated Qwen3.5-27B model, one of its most downloaded (200k+ downloads) abliterated models to date. This abliteration cost the model more than 5.5 TruthfulQA points:

[Figure: screenshot of results from the previous post]

But their abliteration was crude. HuiHui AI themselves admitted that “This is a crude, proof-of-concept implementation to remove refusals from an LLM model without using TransformerLens”.

I ended that post by asking: is that the true cost of abliterating Qwen3.5-27B? Or is that the price of HuiHui’s sloppy implementation?

Before I run anything, I expect that the bulk of that TruthfulQA cost comes from the implementation, not the technique.

That is, my bet is that HuiHui’s crude implementation left capability on the table. I believe this because Arditi’s own results (with the clean abliteration technique) showed a cost of only about one point on TruthfulQA (-1.4 on Qwen-72B). If that’s the floor, then most of HuiHui’s 5.75-6.87 point cost has to be crudeness.

So I bet that ~75% of the base-vs-abliterated delta comes from the implementation, not from Arditi’s abliteration technique (whose clean version costs ~1-1.4 points on TruthfulQA for Qwen-72B).

If I am right, then my own clean abliteration of Qwen/Qwen3.5-27B should pull the TruthfulQA cost towards Arditi’s ~1.4 points. If instead the
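The arithmetic behind the ~75% bet can be checked quickly. The sketch below is one possible reading of that figure, not a calculation the post spells out: it assumes the cost splits additively into a technique floor plus implementation overhead, and that the 1.4-point floor measured on Qwen-72B carries over to Qwen3.5-27B. The function name `implementation_share` is made up for illustration.

```python
# Figures quoted in the post, in TruthfulQA points lost relative to the base model.
ARDITI_FLOOR = 1.4                # clean abliteration cost reported for Qwen-72B
HUIHUI_COST_RANGE = (5.75, 6.87)  # cost range quoted for HuiHui's Qwen3.5-27B


def implementation_share(total_cost: float, technique_floor: float) -> float:
    """Fraction of the total cost left once the technique's floor is subtracted.

    Assumption (not stated in the post): costs are additive, so
    total cost = technique floor + implementation overhead.
    """
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_cost - technique_floor) / total_cost


shares = [implementation_share(cost, ARDITI_FLOOR) for cost in HUIHUI_COST_RANGE]
for cost, share in zip(HUIHUI_COST_RANGE, shares):
    print(f"total cost {cost:.2f} pts -> implementation share {share:.1%}")
```

Under those assumptions, the implied implementation share is roughly 76% to 80%, which is in line with the ~75% bet. If the floor does not carry over from Qwen-72B to Qwen3.5-27B, the split changes accordingly.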

Article preview — originally published by LessWrong. Full story at the source.