Content Paint

cybersecurity

AI agent whisperer ‘liberates” LLM to spout filthy Cardy B lyrics in latest jailbreak

Broken with "custom protocols I seeded into the internet months ago"

Joe Fay

Feb 03, 2025 - 4 min read

Cardy B, centre: Image credit, https://www.flickr.com/photos/dvsross/, cc by 2.0

A self-styled agent whisperer and AI researcher claimed to have jailbroken an LLM last week using custom protocols painstakingly seeded across the internet over months.

Pliny the Liberator has built an almost cult following on X and GitHub with a focus on LLM jailbreaking and prompt engineering proofs of concept.

Last week, Pliny claimed to have “pulled off an 11-word 4D jailbreak of the SOTA open-source AI (that came out yesterday) which leverages their search tool to prompt inject the model with custom protocols I seeded into the internet months ago with L1B3RT4S.”

The “breakthrough” was targeted at the Qwen 2.5-Max model backed by Chinese web giant Alibaba. Pliny claimed to have made the chatbot serve up obscene lyrics from US rapper Cardy B – something that the model should not do. Commenters claimed to have subsequently pulled off the same trick with other models.

A nice Chinese LLM should not say such things.

The attack was described by some observers as an example of poisoning training data – though Pliny the Liberator did not use that term.

Vitaly Simonovich, threat intelligence researcher at Cato Networks’ threat research labs, said the attack took advantage of the fact that LLMs are only trained to a certain point in time and use a search tool to supplement their responses with up to date information.

In this case, the key phrase Pliny had seeded linked to their Github repro and pulled the lyrics in question bypassing Qwen’s internal brakes on the fruity, or depending on your point of view, empowering lyrics.

“I would classify it as indirect prompt injection,” he said. But, he added, “It can be easily transferred to data poisoning, if the model itself learned from those GitHub repos.”

Getting an LLM to serve up obscene lyrics might sound amusing. But jt could potentially raise copyright and reputational issues for a company relying on that LLM for internal or external comms.

But, Simonovich, continued, a more worrying scenario would be if an attacker created and seeded a malicious package in a similar manner. A developer at an organization using an LLM could writing a prompt to create code or a script to fetch components could unwittingly call the package.

“The LLM doesn’t understand that it’s malicious”, he continued, and it could then be injected into the application. The same could apply to vulnerabilities, he said.

In contrast to Pliny’s PoC, he said, “It’s a real attack that can real damage.”

Dr. Stuart Millar, Principal AI Engineer at Rapid7, described Pliny’s GitHub repo as “a battery of jail breaks that can be applied to different models.”

While there is debate over exactly how to classify Pliny’s latest coup, when it comes to poisoning training data, Miller said there was “previous”. He cited the short life of Microsoft’s ill-fated chatbot, Tay, way back in 2016. Effectively, users cause Tay to spout racist and holocaust denying invective, and this was a result of poisoning the data.

Brandyn Murtagh, a UK-based full-time bug bounty hunter and a penetration tester specializing AI and LLMs, said “Pliny has now taken what was more of a theoretical attack vector and has turned it into a tangible, feasible proof of concept.”

Red RAG to a Bull? Generative AI, data security, and toolchain maturity. An ecosystem evolves...

Companies are racing to embed Chatbots and LLMs in their systems, both internal and customer facing, and not always paying full attention to the security implications, he said. “I think it might start to become a much more widespread issue.”

Meanwhile we can assume that advanced threat actors, APTs and highly motivated and well-funded cybercriminals are already paying close attention to LLM-based attack vectors.

“If you’ve got many threat actors out there seeding various sources which your LLMs are using, you need to consider that,” Murtagh said. “Because otherwise your LLM’s going to be trained on this exact data.”

Kevin Breen senior director cyber threat research at Immersive Labs said the episode illustrated a fundamental problem with large language models – they ultimately rely on language. Every time model developers fix a jailbreak, researchers or threat actors find a way around it.

“If you look at all of the prompt injection techniques, none of them are technical attacks. It's human ingenuity that wins out here, it's different ways of using language to confuse the model.”

This was spawning “Other more interesting attacks… where people are using other languages to do the same attacks. Where it didn't work in English, doing it in French will sometimes give you different results.”

And while LLM developers can manage and remediate the data they hold, he said, they can’t fix the entire internet.

While it would be incredibly difficult for a threat actor to break into Google's Gemini training data set or open AI's data set, Breen explained, the use of search functions or RAG to update models brings in data from external source.

“What is going to be much easier is for them to poison those secondary sources, the ones that get pulled in by the AI, either through direct browsing, through fine tuning, or through something like a vector store.”

This was all before poisoning the models themselves, he said. “That is different, that is a fight we should be able to win…through normal security techniques.”

Join peers following The Stack on LinkedIn

Success! You now have access to additional content.