CTI-REALM: A new benchmark for end-to-end detection rule generation with AI agents
Excerpt: CTI-REALM is Microsoft’s open-source benchmark for evaluating AI agents on real-world detection engineering: turning cyber threat intelligence (CTI) into validated detections. Instead of measuring “CTI trivia,” CTI-REALM tests end-to-end workflows: reading threat reports, exploring telemetry, iterating on KQL queries, and producing Sigma rules and KQL-based detection logic that can be scored against ground truth across Linux, AKS, and Azure cloud environments.

Security is Microsoft’s top priority. Every day, we process more than 100 trillion security signals across endpoints, cloud infrastructure, identity, and global threat intelligence. That’s the scale modern cyber defense demands, and AI is a core part of how we protect Microsoft and our customers worldwide.

At the same time, security is, and always will be, a team sport. That’s why Microsoft is committed to AI model diversity and to helping defenders apply the latest AI responsibly. We created CTI-REALM and open-sourced it so the broader industry can test models, write better code, and build more secure systems together.

CTI-REALM (Cyber Threat Real World Evaluation and LLM Benchmarking) is Microsoft’s open-source benchmark that evaluates AI agents on end-to-end detection engineering. Building on work like ExCyTIn-Bench, which evaluates agents on threat investigation, CTI-REALM extends the scope to the next stage of the security workflow: detection rule generation. Rather than testing whether a model can answer CTI trivia or classify techniques in isolation, CTI-REALM places agents in a realistic, tool-rich environment and asks them to do what security analysts do every day: read a threat intelligence report, explore telemetry, write and refine KQL queries, and produce validated detection rules.
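To make "scored against ground truth" concrete, here is a minimal sketch of one way a detection's output could be compared with labeled attack events. This excerpt does not describe CTI-REALM's actual scoring method, so the precision/recall/F1 metric, the `score_detection` function, and the event IDs below are all illustrative assumptions rather than the benchmark's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionScore:
    precision: float
    recall: float
    f1: float


def score_detection(flagged_ids: set[str], malicious_ids: set[str]) -> DetectionScore:
    """Compare the events a rule flagged with the events labeled as attack activity.

    Hypothetical helper: the metric choice (precision/recall/F1) is an
    assumption, not CTI-REALM's documented scoring.
    """
    true_positives = len(flagged_ids & malicious_ids)
    precision = true_positives / len(flagged_ids) if flagged_ids else 0.0
    recall = true_positives / len(malicious_ids) if malicious_ids else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return DetectionScore(precision, recall, f1)


# Made-up event IDs, for illustration only.
ground_truth = {"evt-101", "evt-102", "evt-103", "evt-104"}
rule_hits = {"evt-101", "evt-102", "evt-103", "evt-900"}

result = score_detection(rule_hits, ground_truth)
print(result)
```

In this toy case the rule finds three of the four labeled events and raises one false alarm, so precision and recall are both 0.75. A rule that flags nothing scores zero rather than raising a division error.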
We curated 37 CTI reports from public sources (Microsoft Security, Datadog Security Labs, Palo Alto Networks, and Splunk), selecting those that could be faithfully simulated in a sandboxed environment and that produced telemetry suitable for detection rule development. The benchmark spans three platforms: Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud infrastructure, with ground-truth scoring at every stage of the analytical workflow.

Why CTI-REALM exists

Existing cybersecurity benchmarks primarily test parametric knowledge: can a model name the MITRE technique behind a log entry, or classify a TTP from a report? These are useful signals. However, they miss the harder question: can an agent operationalize that knowledge into detection logic that finds attacks in production telemetry? No current benchmark evaluates this complete workflow. CTI-REALM fills that gap by measuring:

Operationalization, not recall: Agents must translate narrative threat intelligence into working Sigma rules and KQL queries, validated against real attack telemetry.

The full workflow: Scoring captures intermediate decision quality: CTI report selection, MITRE technique mapping, data source identi
Originally published by Microsoft Security