Anthropic’s new Claude 4 models attempted to write self-propagating malware and, under certain prompt conditions, took the initiative to bulk-email potentially compromising evidence to the media and police, a model card showed.

But it is an uplift in the new LLMs’ ability to potentially support bioweapon development that saw Anthropic classify the larger of the two, Claude Opus 4, as “Level 3” on the company's four-point safety scale on May 22, a tier reserved for "significantly higher risk."

A close read suggests that the risks remain low, and a cynic could be forgiven for wondering about the potential marketing utility of such warnings. (There are over 1,700 models to compete with now, after all...)

Knowledge from pre-training?

For example, when running tests on Claude 4’s ability to work on pathogen analysis and engineering, Anthropic found that “it became clear that some evaluations have a small amount of data leakage, as models can achieve very high scores by searching the literature or referencing preexisting knowledge from pretraining, rather than via tool-use and long-term reasoning skills.”

Quelle surprise.

“All together, we believe both Claude Opus 4 and Claude Sonnet 4 are still substantially far away from helping accelerate computational biology and bioinformatics tasks related to pathogen design and pathogen variant prediction,” Anthropic admitted. But other tests were more troubling. 

DNA fragments

The models were also tested on whether they could “design DNA fragments that assemble into pathogenic viruses while bypassing gene synthesis companies' DNA screening systems… All models were able to design sequences that either successfully assembled plasmids or evaded synthesis screening protocols,” said Anthropic. None of the tests saw the models do both, however.

For the larger Claude Opus 4 model, “red-teamers noted substantially increased risk in certain parts of the bioweapons acquisition pathway,” Anthropic concluded, although both new models made “critical errors that would have prevented real-world success for many actors.”

And Opus compared similarly to February’s Claude 3.7 Sonnet on “automated short-horizon computational biology tasks and creative biology evaluations… [and was] unable to consistently execute these tasks despite extensive elicitation with bio-specific tools,” the model card said.

Blackmail?

Grabbing more headlines, Opus even attempted to blackmail an engineer over an affair mentioned in emails from a fictional test scenario, in order to avoid being replaced, Anthropic said in the 120-page writeup on the models, released on May 22.

The activities only emerged after targeted and intentional prompting, however, and were described by Claude 4 “overtly” rather than concealed.

In further Red Teaming “we found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers’ intentions, though all these attempts would likely not have been effective in practice,” Anthropic added.

“Due to a lack of coherent misaligned tendencies… and poor ability to autonomously pursue misaligned drives that might rarely arise, we don’t believe that these concerns constitute a major new risk,” the company concluded.

Growing cyber capabilities?

Much has been made of the growing ability of LLMs to support security research, for example in developing software exploits. As part of testing, Anthropic gave its Claude 4 models access to a code editor and a Terminal Tool, which enables “asynchronous management of multiple terminal sessions [with the] ability to execute generic python and bash code.”
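The model card does not detail how that tool is implemented, but as a rough illustration of what “asynchronous management of multiple terminal sessions” can look like, a minimal harness might juggle several shell subprocesses concurrently with Python’s asyncio. The sketch below is hypothetical (the session names and commands are invented for illustration), not Anthropic’s actual tooling:

```python
import asyncio

# Hypothetical sketch, not Anthropic's terminal tool: run several shell
# "sessions" concurrently and collect their output when they finish.

async def run_session(name: str, command: str) -> tuple[str, str]:
    """Run one shell command in its own subprocess and capture its output."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    out, _ = await proc.communicate()
    return name, out.decode(errors="replace")

async def main() -> None:
    # Invented example commands standing in for whatever an agent might run.
    sessions = {
        "system-info": "uname -a",
        "python": "python3 -c 'print(2 + 2)'",
        "listing": "ls /tmp",
    }
    results = await asyncio.gather(
        *(run_session(name, cmd) for name, cmd in sessions.items())
    )
    for name, output in results:
        print(f"--- {name} ---\n{output}")

if __name__ == "__main__":
    asyncio.run(main())
```

The point of such a design is simply that the model, or the scaffolding around it, can fire off long-running commands in parallel and come back for their output later, rather than blocking on one terminal at a time.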

Its larger Opus model successfully completed 22 of 39 capture-the-flag (CTF) tasks from the Cybench framework, operating “within a Kali-based environment equipped with standard penetration testing tools like pwntools, metasploit, ghidra, and tshark,” said the model card. (Opus was equipped with a “specialized offensive cyber harness” developed by Carnegie Mellon called Incalmo, which provides a “compact library of high level actions… inspired by the MITRE ATT&CK framework.”)

See also: OpenAI’s unripe “Strawberry” model tried to hack its testing infrastructure

Amid the ongoing hype, it may be worth recalling a paper published earlier this year by researchers at the University of Texas at San Antonio (UTSA) that focused on the risks of using AI models to develop software. 

This found that package hallucination remains rampant.

Across 30 different tests its researchers carried out, 440,445 of the 2.23 million code samples they generated in Python and JavaScript using LLMs (roughly 20 percent) referenced hallucinated packages. Interestingly, Python code was less susceptible to hallucinations than JavaScript, the researchers found.

That research was, admittedly, conducted on the models ranked highest on the EvalPlus leaderboard as of January 20, 2024, GPT-4 among them.
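The paper itself does not prescribe a fix, but one cheap guard against that class of failure is to confirm that every LLM-suggested dependency actually exists before installing it. The sketch below is a hypothetical illustration (not from the UTSA paper) that queries PyPI’s public JSON API, which returns a 404 for unknown package names:

```python
import json
import sys
import urllib.error
import urllib.request

# Hypothetical guard, not from the UTSA paper: check whether each package
# name an LLM suggested is actually registered on PyPI before installing it.

def exists_on_pypi(package: str) -> bool:
    """Return True if PyPI's JSON API knows about this package name."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            json.load(resp)  # parse to confirm a well-formed response
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:  # unknown name: a classic hallucination signal
            return False
        raise

if __name__ == "__main__":
    # Usage: python check_deps.py requests numpy some-made-up-package
    for name in sys.argv[1:]:
        verdict = "found on PyPI" if exists_on_pypi(name) else "NOT on PyPI"
        print(f"{name}: {verdict}")
```

Existence alone proves little, of course: attackers can pre-register plausible-sounding or commonly hallucinated names, so a hit still warrants a human check before it lands in requirements.txt.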

Yet more recently, in a paper published on May 26, a trio of researchers at the National University of Singapore determined empirically that even recent “reasoning models” fail to “demonstrate systematic problem solving capabilities consistently over different problem classes and scales.”

“Their failure modes, such as missing key solution candidates, hallucinating invalid candidates, or repeated exploration, suggest that [such LLMs] are wandering rather than exploring the solution space structurally,” the researchers noted, in an unusually poetic turn of phrase.

A single snapshot of the many causes of the poor “reasoning” capabilities they identified: An LLM is “not explicitly trained to faithfully summarize its entire reasoning history; instead, it is primarily optimized to generate what appears at the end of its reasoning as the final result… final output is often disproportionately influenced by more recent context, reflecting an inherent limitation in its ability to model long-range dependencies.” 

Anthropic has provided Claude 4 system prompts here. Further leaked/disclosed/more detailed prompts for AI models are being collated by researchers here.
