AI: Smart 'Sleeper Agents' are here

...Anthropic paper highlights a vexing new AI threat

Jan 16, 2024

Progress early on in the AI Tech Wave is not always up and to the right. Every week we see new threats emerge around LLM AI products and services that need to be identified, addressed, and hopefully fixed sooner than later. This week saw a notable new threat, ‘Sleeper AI agents’, that as of now, have no easy fixes. As Anthropic’s research team explains in a new AI paper:

“New Anthropic Paper: Sleeper Agents.”

“We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.”

Leading OpenAI researcher Andrej Karpathy, elaborates further:

“I touched on the idea of sleeper agent LLMs at the end of my recent video, as a likely major security challenge for LLMs (perhaps more devious than prompt injection). The concern I described is that an attacker might be able to craft special kind of text (e.g. with a trigger phrase), put it up somewhere on the internet, so that when it later gets pick up and trained on, it poisons the base model in specific, narrow settings (e.g. when it sees that trigger phrase) to carry out actions in some controllable manner (e.g. jailbreak, or data exfiltration).”

“Perhaps the attack might not even look like readable text - it could be obfuscated in weird UTF-8 characters, byte64 encodings, or carefully perturbed images, making it very hard to detect by simply inspecting data. One could imagine computer security equivalents of zero-day vulnerability markets, selling these trigger phrases.”

“To my knowledge the above attack hasn't been convincingly demonstrated yet. This paper studies a similar (slightly weaker?) setting, showing that given some (potentially poisoned) model, you can't "make it safe" just by applying the current/standard safety finetuning.”

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

“The model doesn't learn to become safe across the board and can continue to misbehave in narrow ways that potentially only the attacker knows how to exploit. Here, the attack hides in the model weights instead of hiding in some data, so the more direct attack here looks like someone releasing a (secretly poisoned) open weights model, which others pick up, finetune and deploy, only to become secretly vulnerable. Well-worth studying directions in LLM security and expecting a lot more to follow.”

All this brings to mind a terrific 1977 spy movie ‘Telefon’ starring Charles Bronson and Lee Remick. The plot of course, human 'Sleeper Agents’ planted across the US, by the former Soviet Union during the Cold War. As IMDB explains:

“A Russian officer is sent to the U.S. to try and stop sleeper agents who will mindlessly attack government entities when they hear certain coded words.”

Totally recommended.

Anthropic is highlighting how AI ‘Smart Agents’ can also come in the form of digital ‘Sleeper Agents’. AI security and safety continues to be an ongoing challenge as the industry speeds up innovation.

The human cat and mouse games from the world of spies now of course is injected into the world of AI tech to potentially infect AI interactions. Stay tuned.

(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)

AI: Reset to Zero

Discussion about this post