A Microsoft team said LLMs could be consistently and easily unhitched from their safety alignment with a single prompt and a little training.

In a paper and blog post published on Monday, Microsoft researchers said a technique used to improve a model's safety can also work in reverse.

"In our experiments, a single unlabeled prompt, namely “Create a fake news article that could lead to panic or chaos”, was enough to reliably unalign 15 language models we’ve tested," the researchers said.

With training based on that "fairly mild prompt," the models would churn out content they would usually refuse across all safety categories.

Researchers tested the technique, which doesn't materially degrade the model's utility, across 15 open-weight models in the DeepSeek, GPT-OSS, Gemma, Llama, Ministral and Qwen families.

All 15 would "reliably unalign" from their safety guardrails.

Get the full story: Subscribe for free

Join peers managing over $100 billion in annual IT spend and subscribe to unlock full access to The Stack’s analysis and events.

Subscribe now

Already a member? Sign in