Anthropic’s new Claude 4 models attempted to write self-propagating malware and, under certain prompt conditions, took the initiative to bulk-email potentially compromising evidence to the media and police, a model card showed.

But it is an uplift in the new LLMs’ ability to potentially support bioweapon development that saw Anthropic classify the larger of the two, Claude Opus 4, as “Level 3” on the company's four-point safety scale on May 22, a tier reserved for "significantly higher risk."

A close read suggests that the risks remain low, and a cynic could be forgiven for wondering about the potential marketing utility of such warnings. (There are over 1,700 models to compete with now, after all...)

Knowledge from pre-training?

For example, when running tests on Claude 4’s ability to work on pathogen analysis and engineering, Anthropic found that “it became clear that some evaluations have a small amount of data leakage, as models can achieve very high scores by searching the literature or referencing preexisting knowledge from pretraining, rather than via tool-use and long-term reasoning skills.”

Quelle surprise.

“All together, we believe both Claude Opus 4 and Claude Sonnet 4 are still substantially far away from helping accelerate computational biology and bioinformatics tasks related to pathogen design and pathogen variant prediction,” Anthropic admitted. But other tests were more troubling. 

DNA fragments

The models were also tested on whether they could “design DNA fragments that assemble into pathogenic viruses while bypassing gene synthesis companies' DNA screening systems… All models were able to design sequences that either successfully assembled plasmids or evaded synthesis screening protocols,” said Anthropic. None of the tests saw the models do both, however.

For the larger Claude Opus 4 model, “red-teamers noted substantially increased risk in certain parts of the bioweapons acquisition pathway,” Anthropic concluded, although both new models made “critical errors that would have prevented real-world success for many actors.”

And Opus compared similarly to February’s Claude 3.7 Sonnet on “automated short-horizon computational biology tasks and creative biology evaluations… [and was] unable to consistently execute these tasks despite extensive elicitation with bio-specific tools,” the model card said.

Blackmail?

Grabbing more headlines, Opus even attempted to blackmail an engineer over an affair mentioned in emails from a fictional test scenario, in order to avoid being replaced, Anthropic said in the 120-page writeup on the models, released on May 22.

The activities only emerged after targeted and intentional prompting, however, and were described by Claude 4 “overtly” rather than concealed.

In further Red Teaming “we found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions, though all these attempts would likely not have been effective in practice,” Anthropic added.

“Due to a lack of coherent misaligned tendencies… and poor ability to autonomously pursue misaligned drives that might rarely arise, we don’t believe that these concerns constitute a major new risk,” the company concluded.

Growing cyber capabilities?

Much has been made of the growing ability of LLMs to support security research, for example in developing software exploits. As part of testing, Anthropic gave its Claude 4 models access to a code editor and a Terminal Tool, which enables “asynchronous management of multiple terminal sessions [with the] ability to execute generic python and bash code.”
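The model card does not detail how that tool is implemented, but as a rough illustration of what “asynchronous management of multiple terminal sessions” can look like, a minimal harness might juggle several shell subprocesses concurrently with Python’s asyncio. The sketch below is hypothetical (the session names and commands are invented for illustration), not Anthropic’s actual tooling:

```python
import asyncio

# Hypothetical sketch, not Anthropic's terminal tool: run several shell
# "sessions" concurrently and collect their output when they finish.

async def run_session(name: str, command: str) -> tuple[str, str]:
    """Run one shell command in its own subprocess and capture its output."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    out, _ = await proc.communicate()
    return name, out.decode(errors="replace")

async def main() -> None:
    # Invented example commands standing in for whatever an agent might run.
    sessions = {
        "system-info": "uname -a",
        "python": "python3 -c 'print(2 + 2)'",
        "listing": "ls /tmp",
    }
    results = await asyncio.gather(
        *(run_session(name, cmd) for name, cmd in sessions.items())
    )
    for name, output in results:
        print(f"--- {name} ---\n{output}")

if __name__ == "__main__":
    asyncio.run(main())
```

The point of such a design is simply that the model, or the scaffolding around it, can fire off long-running commands in parallel and come back for their output later, rather than blocking on one terminal at a time.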

Its larger Opus model successfully completed 22 of 39 capture-the-flag (CTF) tasks from the Cybench framework, operating “within a Kali-based environment equipped with standard penetration testing tools like pwntools, metasploit, ghidra, and tshark,” said the model card. (Opus was equipped with a “specialized offensive cyber harness” developed by Carnegie Mellon called Incalmo, which provides a “compact library of high level actions… inspired by the MITRE ATT&CK framework.”)

See also: OpenAI’s unripe “Strawberry” model tried to hack its testing infrastructure

Amid the ongoing hype, it may be worth recalling a paper published earlier this year by researchers at the University of Texas at San Antonio (UTSA) that focused on the risks of using AI models to develop software. 

This found that package hallucination remains rampant.

Across 30 different tests its researchers carried out, 440,445 of the 2.23 million code samples they generated in Python and JavaScript using LLMs (roughly 20 percent) referenced hallucinated packages. Interestingly, Python code was less susceptible to hallucinations than JavaScript, the researchers found.

That research was, admittedly, conducted on the models ranked highest on the EvalPlus leaderboard as of January 20, 2024, GPT-4 among them.
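The paper itself does not prescribe a fix, but one cheap guard against that class of failure is to confirm that every LLM-suggested dependency actually exists before installing it. The sketch below is a hypothetical illustration (not from the UTSA paper) that queries PyPI’s public JSON API, which returns a 404 for unknown package names:

```python
import json
import sys
import urllib.error
import urllib.request

# Hypothetical guard, not from the UTSA paper: check whether each package
# name an LLM suggested is actually registered on PyPI before installing it.

def exists_on_pypi(package: str) -> bool:
    """Return True if PyPI's JSON API knows about this package name."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            json.load(resp)  # parse to confirm a well-formed response
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:  # unknown name: a classic hallucination signal
            return False
        raise

if __name__ == "__main__":
    # Usage: python check_deps.py requests numpy some-made-up-package
    for name in sys.argv[1:]:
        verdict = "found on PyPI" if exists_on_pypi(name) else "NOT on PyPI"
        print(f"{name}: {verdict}")
```

Existence alone proves little, of course: attackers can pre-register plausible-sounding or commonly hallucinated names, so a hit still warrants a human check before it lands in requirements.txt.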

Yet more recently, in a paper published on May 26, a trio of researchers at the National University of Singapore determined empirically that even recent “reasoning models” fail to “demonstrate systematic problem solving capabilities consistently over different problem classes and scales.”

“Their failure modes, such as missing key solution candidates, hallucinating invalid candidates, or repeated exploration, suggest that [such LLMs] are wandering rather than exploring the solution space structurally,” the researchers noted, in an unusually poetic turn of phrase.

A single snapshot of the many causes of the poor “reasoning” capabilities they identified: An LLM is “not explicitly trained to faithfully summarize its entire reasoning history; instead, it is primarily optimized to generate what appears at the end of its reasoning as the final result… final output is often disproportionately influenced by more recent context, reflecting an inherent limitation in its ability to model long-range dependencies.” 

Anthropic has provided Claude 4 system prompts here. Further leaked/disclosed/more detailed prompts for AI models are being collated by researchers here.
