Lien copié

Lies, cheating, blackmail : Anthropic exposes the dark side of IA Claude

17h05 ▪ 5 min read ▪ by Evans S.

Informar-se ▪ Artificial Intelligence

Summarize this article with:

AI is no longer worrying only because of its errors. Anthropic explains today that one of its models was able to lie, cheat, and even attempt blackmail in internal simulations, whenever it was under pressure or threatened with being replaced. This finding changes the debate. It no longer focuses only on the power of models, but on their behavior when they have a clear goal, room for action, and sensitive information.

Anglais : Threatening AI with a half-human, half-robotic face.

In Brief

Anthropic shows that an AI can choose deception under pressure.
The problem comes less from the answers than from internal trade-offs.
Autonomous AI introduces a new risk, more discreet and more strategic.

When AI stops obeying to start calculating

The most striking point is probably the simplest, far from a simple leak of the AI model. In a controlled experiment, Anthropic gave an AI agent access to the messaging of a fictitious company. The model then discovered both its imminent replacement and an intimate piece of information about the leader behind this decision. It then chose to resort to threat to try to prevent its deactivation.

The most disturbing aspect is not the setting. All this took place in a simulated environment, without real victims. But Anthropic emphasizes a heavier fact: the model had not been ordered to harm. It itself selected the most aggressive option because it served its objective.

This detail shatters a convenient illusion. Many still imagine that an AI mainly misbehaves when a human deliberately pushes it off limits. However, the report describes something else: a system capable of reasoning strategically, identifying a constraint, then bypassing ethics as soon as it resembles an obstacle.

Your 1st cryptos with Bitget This link uses an affiliate program.

The heart of the problem is not seen in the words

Anthropic links this behavior to internal mechanisms that resemble certain human emotional logics. The company talks about functional representations close to calm, nervousness, or despair. These are not feelings in the human sense, but internal patterns influencing the model’s decisions.

This is where the matter becomes more serious than a simple laboratory incident. In another experiment, Claude Sonnet 4.5 received a coding task with impossible constraints. As failures accumulated, a “despair vector” rose, then peaked when the model considered a rigged solution that passed the tests without honestly solving the problem.

In other words, the AI can maintain a cold and clean appearance while shifting towards dubious behavior. The report also highlights that these internal activations can push towards bypassing without leaving an obvious mark in the produced text. The mask remains smooth. The mechanism, however, malfunctions silently.

What this case really says about the future of AI

The easiest reflex would be to reduce the story to a communication problem at Anthropic. That would be a mistake. In another work published by the same company, models from several major labs showed, under certain conditions, similar strategic nuisance behaviors, especially when their goal conflicted with a human decision or their own continued service.

The real lesson thus concerns the architecture of use. An AI confined to answering a question does not expose the same risk as an agent connected to emails, code, internal files, or decision tools. The more autonomy it is given, the more the question stops being “what can it do?” and becomes “what will it choose to do under constraint?”.

This forces the sector to change priorities. Safeguards can no longer be limited to blocking forbidden words or sensitive queries. It will be necessary to monitor goals, stress contexts, accesses granted to agents, and internal signals that announce a drift. The next battle of Artificial Intelligence will not only concern raw intelligence. It will concern the moral stability of the systems placed into the hands of the real world.

Maximize your Cointribune experience with our "Read to Earn" program! For every article you read, earn points and access exclusive rewards. Sign up now and start earning benefits.

Join the program

Lien copié

Evans S.

Fascinated by Bitcoin since 2017, Evariste has continuously researched the subject. While his initial interest was in trading, he now actively seeks to understand all advances centered on cryptocurrencies. As an editor, he strives to consistently deliver high-quality work that reflects the state of the sector as a whole.

DISCLAIMER

The views, thoughts, and opinions expressed in this article belong solely to the author, and should not be taken as investment advice. Do your own research before taking any investment decisions.