
AI does not yet outperform engineers in managing IT outages

10h05 ▪ 4 min read ▪ by Ghiles A.

AI tools are rapidly advancing in monitoring IT systems. However, a new study conducted by Datadog and Carnegie Mellon University shows that engineers maintain a significant lead in managing complex incidents. Based on real outages observed in production, this test compares several advanced models to human specialists. The results mainly reveal the current limitations of models when facing critical and unforeseen situations.

Illustration showing an IT engineer urgently handling a server outage while a robot symbolizing AI appears confused in front of network anomaly charts.

In brief

  • A study by Datadog and Carnegie Mellon University shows that AI models remain less accurate than engineers at managing complex IT outages.
  • The tests rely on 63 real incidents and more than 5 million data points from emergency production situations.
  • GPT-5 leads generalist models with 62.7% accuracy, but human experts still achieve 72.7%.
  • Researchers believe collaboration between humans and AI could greatly improve incident response in the future.

AI is advancing but remains limited against complex incidents

Tech companies now present AI agents capable of automatically analyzing production incidents. These systems are meant to help teams detect anomalies and identify the root causes of outages. However, despite the recent progress made by these models, the ARFBench benchmark shows that this automation is still imperfect. The project relies on real incidents observed during emergency situations, with manually validated data to avoid artificial scenarios.

The study rests on several key figures:

  • 63 real incidents analyzed from Slack exchanges in emergency situations.
  • 750 questions created around the studied incidents.
  • 142 monitoring indicators used in the benchmark.
  • More than 5 million data points examined.

The tests evaluate both anomaly detection and the models’ ability to understand complex relationships between multiple metrics. GPT-5 achieves an F1 score of 47.5% on the most difficult questions while maintaining an overall accuracy of 62.7%. Researchers also note that trillions of dollars are lost annually due to system outages, which underscores the strategic importance of AI tools in modern digital infrastructures.
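Accuracy and F1 can diverge sharply when anomalies are rare, which is one reason a model can post a reasonable accuracy and a much lower F1. The sketch below illustrates this on invented labels. The article does not say how ARFBench computes its F1 score, so the binary definition used here, the function names, and the toy data are all assumptions made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of answers that match the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data, NOT from the study: 1 = anomaly, 0 = normal.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # one of the two anomalies is missed

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"F1       = {f1_score(y_true, y_pred):.3f}")
```

In this toy case a single missed anomaly barely dents accuracy, because most points are normal, but it halves recall and pulls F1 down noticeably.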

Engineers maintain a clear lead over current models

Compared with the models, human engineers retain better overall accuracy. Domain experts scored 72.7%, well above the best models tested. Even Datadog non-experts reached 69.7%, more than any of the automated systems.

These results indicate that engineers are still better at interpreting the overall context of an incident. They more readily understand interactions between several technical signals and unusual infrastructure behavior.

No AI model has surpassed human performance on the benchmark. However, some specialized systems are gradually narrowing the gap. The hybrid model Toto-1.0-QA-Experimental, developed by Datadog, achieves an accuracy of 63.9%. This system combines an internal forecasting model with Qwen3-VL 32B.

In anomaly detection, Toto even obtains an F1 score at least 8.8 points higher than competing models. This result suggests that a model specialized in observability data can handle a specific technical task better than a generalist system.

Despite these advances, engineers remain essential during critical incidents. Models sometimes lose business context, ignore certain metadata, or misinterpret several indicators that must be read together.

Collaboration between AI and humans becomes the most credible scenario

The study mainly highlights that the errors of humans and models differ. AI systems detect some anomalies quickly, while humans better understand ambiguous situations and operational constraints.

Researchers explain that these differences make the two complementary. Models sometimes miss contextual details, whereas humans make more mistakes on precise timestamps or complex instructions.

To measure this potential, researchers modeled an “expert oracle” that always picks the better of the two answers, human or AI, for each question. In this theoretical scenario, accuracy climbs to 87.2% with an F1 score of 82.8%.
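The oracle idea can be sketched in a few lines: a question counts as solved if either the human or the model answered it correctly, so the oracle score is an upper bound on what a perfect arbiter between the two could reach. The function name and the per-question results below are hypothetical and are not taken from the study.

```python
def oracle_accuracy(human_correct, model_correct):
    """Accuracy of an idealized arbiter that always picks the correct
    answer whenever at least one of the two answerers got it right."""
    if len(human_correct) != len(model_correct):
        raise ValueError("both lists must cover the same questions")
    hits = sum(h or m for h, m in zip(human_correct, model_correct))
    return hits / len(human_correct)


# Hypothetical per-question outcomes (True = answered correctly).
human = [True, True, False, True, False, True, True, False, True, True]
model = [True, False, True, True, False, True, False, True, True, False]

print(f"human  = {sum(human) / len(human):.1f}")
print(f"model  = {sum(model) / len(model):.1f}")
print(f"oracle = {oracle_accuracy(human, model):.1f}")
```

The oracle only beats both participants when their errors fall on different questions, which is the complementarity the researchers describe. If the human and the model failed on exactly the same questions, the oracle would gain nothing.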

This result does not yet represent a concrete product. However, it shows that collaboration between artificial intelligence and engineers could greatly improve IT incident management in the coming years. Automated systems thus seem destined to assist human teams rather than completely replace them in the short term.

Ghiles A.

Journalist and web writer passionate about the world of cryptocurrencies and Web3 technologies. I cover the latest trends and news to offer high-quality content to a broad audience in the sector.

DISCLAIMER

The views, thoughts, and opinions expressed in this article belong solely to the author, and should not be taken as investment advice. Do your own research before taking any investment decisions.