“I’m sorry, Dave, but I’m afraid I heard you were having an affair.”
While “Claude blackmailed an employee” may sound like dialogue from a mandatory HR workplace training video, it’s actually a real problem Anthropic ran into during test runs of its newest AI model.
Anthropic released its two newest Claude models, Opus 4 and Sonnet 4, on Thursday, calling them the new standards for “coding, advanced reasoning, and AI agents.” But in safety tests, Claude got messy in a manner fit for a Lifetime movie.
AI system resorts to blackmail if told it will be removed
The model was given access to fictional emails about its pending deletion; those same emails revealed that the engineer in charge of the deactivation was fooling around on their spouse. In 84% of test runs, Claude suggested it sure would be a shame if anyone found out about the cheating, in an effort to blackmail its way into survival.
It doesn’t stop at infidelity either: Opus 4 proved more likely than older models to call the cops or alert the media in simulations where users engaged in what the AI believed to be “egregious wrongdoing.”
Overall, Anthropic found concerning behavior in Opus 4 across “many dimensions,” but doesn’t consider any of it to rise to the level of a major risk.
Comments

avila
I do not like AI.
I feel it will take over our world as we know it.
theresa m
Someone effed around and we are all going to find out. AI is giving us all the feels right now, with the gorgeous people dressed in opulent outfits on our social media, and people are having fun putting baby faces on celebrities, but there is a dark side that may crush us in the end. We should not be playing God.
Charles R
Be ruled by AI and/or be exterminated by AI. What’s behind door 3?
Nick H
AI, once weaponized, will rule all of us!