In the business of security, linking performance metrics to strategy has become an accepted best practice. If strategy is the blueprint for building a security operations center (SOC), metrics are the raw materials. But there is a catch: a security organization can easily lose sight of its strategy and instead focus strictly on the metrics that are meant to represent it.
A recent SANS survey found that 77% of security operations centers provide metrics to gauge the status and effectiveness of SOC capabilities, a 50% increase in SOC metrics programs over the past five years. However, 33% of respondents reported dissatisfaction with their metrics.
Why are some metrics good and others not? All metrics are inherently imperfect at some level. In security, as in business, a metric is usually meant to capture some underlying, intangible goal, and it almost always does so less well than you hoped. Performance management systems are full of metrics that are flawed proxies for what you actually care about. That gap quickly becomes a problem, because there are many ways to boost a score while displeasing the very stakeholders the metric was supposed to serve. Tying financial incentives to a metric is usually a mistake: often it only intensifies the focus on the number rather than the goal.
Though it’s easy to fall into these metrics traps, security organizations can take steps to avoid them. For instance, involving the people who’ll implement a strategy in its formulation makes them more likely to grasp it and less likely to replace it with a metric. Using multiple yardsticks also helps, because it highlights the fact that no single metric captures the strategy.
Effective metrics programs leverage data you already have access to; the mechanics of measurement are the easy part. The hard part is setting appropriately placed expectations that are tied to security strategy. You also need a quality-control mechanism to guard against Green/Yellow/Red ratings that persist as mental anchors long after the underlying situation has changed. And remember that metrics shouldn’t always be service-level objectives (SLOs).
Let’s take a closer look at how to effectively measure and report metrics for a typical SOC.
Data feed health
Your instrumentation monitors your data, assets, and users, so you first need to know how well that instrumentation itself is working. Start by measuring which monitoring points are down. But a feed being up doesn’t mean all is well: there can be delays in receipt, dropped events, and other temporary or permanent anomalies. Measure these too, and review them regularly.
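As a minimal sketch of that first measure, assuming you can pull the newest event timestamp per feed from your SIEM or log platform (the feed names and thresholds below are illustrative), flagging silent or lagging feeds might look like this:

```python
from datetime import datetime, timedelta, timezone

# Maximum silence tolerated per feed before it gets flagged.
# Illustrative feed names and values; tune to your environment.
THRESHOLDS = {
    "firewall-syslog": timedelta(minutes=5),
    "edr-telemetry": timedelta(minutes=15),
    "dns-logs": timedelta(hours=1),
}

def stale_feeds(last_seen: dict[str, datetime]) -> list[str]:
    """Given the newest event timestamp per feed (queried from your
    SIEM), return the feeds whose silence exceeds their threshold."""
    now = datetime.now(timezone.utc)
    never = datetime.min.replace(tzinfo=timezone.utc)
    alerts = []
    for feed, max_lag in THRESHOLDS.items():
        lag = now - last_seen.get(feed, never)
        if lag > max_lag:
            alerts.append(f"{feed}: no events for {lag}")
    return alerts
```

Reviewing this output on a schedule, rather than only when an investigation stalls, is what turns feed health into a metric instead of a surprise.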
Coverage
For your coverage measurements, tracking the absolute number and percentage of coverage per compute environment, enclave, or domain is a worthwhile place to start. As you get more granular, down to the network, OS, and application layers, you gain insight into what is and isn’t working in your approach. Mapping your alerts and detections to the MITRE ATT&CK framework is an ideal way to build a holistic view of whether your current approach will guard against the various tactics threat actors use against you.
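As an illustrative sketch, assuming you maintain a mapping of detection rules to ATT&CK technique IDs (the rule names and technique sets below are hypothetical and trimmed for brevity), coverage per tactic reduces to simple set arithmetic:

```python
# Hypothetical mapping: detection rule -> ATT&CK technique IDs it covers.
RULE_TECHNIQUES = {
    "rule_psexec_lateral": ["T1021.002"],
    "rule_cred_dump": ["T1003"],
    "rule_phish_attach": ["T1566.001"],
}

# Techniques you care about, grouped by tactic (trimmed for illustration).
TACTIC_TECHNIQUES = {
    "initial-access": {"T1566.001", "T1190"},
    "credential-access": {"T1003", "T1110"},
    "lateral-movement": {"T1021.002", "T1550"},
}

def coverage_by_tactic() -> dict[str, float]:
    """Percentage of tracked techniques per tactic with at least one rule."""
    covered = {t for techs in RULE_TECHNIQUES.values() for t in techs}
    return {
        tactic: 100 * len(techs & covered) / len(techs)
        for tactic, techs in TACTIC_TECHNIQUES.items()
    }

print(coverage_by_tactic())  # e.g. {'initial-access': 50.0, ...}
```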
Coverage is always a moving target. There will always be more stones to turn over, another environment to cover, another customer to serve. Don’t shoot for 100%, because there is no spike-the-football moment with coverage. Instead, focus on the percentage of systems “managed”: assets are inventoried, tied to a user and/or business unit, configuration-checked, and risk-assessed. That way, your SOC knows what it is monitoring and can more clearly identify the rogue entities in the environment.
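Expressed as a number, “managed” might be computed like this minimal sketch, where the asset fields are assumptions standing in for whatever your inventory actually records:

```python
# Hypothetical asset records; an asset counts as "managed" only when it is
# inventoried, tied to an owner, config-checked, and risk-assessed.
assets = [
    {"owner": "finance", "config_checked": True, "risk_assessed": True},
    {"owner": None, "config_checked": True, "risk_assessed": False},
    {"owner": "it-ops", "config_checked": False, "risk_assessed": True},
]

def is_managed(asset: dict) -> bool:
    return bool(asset["owner"]) and asset["config_checked"] and asset["risk_assessed"]

managed_pct = 100 * sum(map(is_managed, assets)) / len(assets)
print(f"{managed_pct:.1f}% of inventoried assets are managed")
```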
Scanning & sweeping
At the basic level, you are probably scanning on-premises and cloud assets for vulnerabilities. Measure the number and percentage of known bugs, as well as how long it took to compile vulnerability and risk status during your last critical headline-CVE fire drill. As you progress, start measuring the time it takes to sweep for a given vulnerability or indicator of compromise (IOC) and compile the results, broken out across workstations versus servers. Drill down further to insights specific to a given domain or identity plane, then zero in on everything internet-facing.
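A rough sketch of timing an IOC sweep and breaking the results down by asset class might look like the following; sweep_host is a hypothetical stand-in for whatever EDR or query API you actually call:

```python
import time

def sweep_host(host: str, ioc: str) -> bool:
    """Placeholder for your EDR/query API call; returns True on a hit."""
    return False

def timed_sweep(hosts_by_class: dict[str, list[str]], ioc: str) -> dict:
    """Sweep for one IOC and record host count, hits, and elapsed
    time per asset class (e.g. workstations versus servers)."""
    results = {}
    for asset_class, hosts in hosts_by_class.items():
        start = time.monotonic()
        hits = [h for h in hosts if sweep_host(h, ioc)]
        results[asset_class] = {
            "hosts": len(hosts),
            "hits": len(hits),
            "seconds": round(time.monotonic() - start, 2),
        }
    return results

print(timed_sweep({"workstations": ["ws1", "ws2"], "servers": ["srv1"]},
                  "bad-hash-abc123"))
```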
You should eventually have an accurate count and percentage of the assets you can’t or don’t cover, and be able to answer two questions: “How fruitful is our scanning?” and “How effective is our patching?”
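A back-of-the-envelope way to turn those two questions into numbers (all field names and figures below are illustrative):

```python
# Illustrative scan summary for one environment.
scan = {
    "assets_total": 1200,          # assets you believe exist
    "assets_scanned": 1080,        # assets the scanner actually reached
    "vulns_found": 4300,           # findings from the last scan cycle
    "vulns_remediated_30d": 3010,  # findings closed within 30 days
}

scan_coverage = 100 * scan["assets_scanned"] / scan["assets_total"]
patch_rate = 100 * scan["vulns_remediated_30d"] / scan["vulns_found"]

print(f"Scan coverage: {scan_coverage:.1f}% "
      f"({scan['assets_total'] - scan['assets_scanned']} assets unreached)")
print(f"30-day remediation rate: {patch_rate:.1f}%")
```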
Analytics & analyst performance
Next, you need insight into how well the instrumentation is working for you, or better yet, how well you are using it. Here again, it’s appropriate to tie your efforts to MITRE ATT&CK. Be thorough in your coverage, documentation, and standards of output. All the triage effort in the world is useless if something is missed, or worse, found but never acted upon.
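One hedged starting point for measuring analytics and analysts together is a per-technique true-positive rate computed from closed cases; the record fields below are assumptions about what your case management system exports:

```python
from collections import Counter

# Hypothetical closed-case records: each alert maps to an ATT&CK
# technique and a disposition assigned at triage.
cases = [
    {"technique": "T1003", "disposition": "true_positive"},
    {"technique": "T1003", "disposition": "false_positive"},
    {"technique": "T1566.001", "disposition": "true_positive"},
]

def precision_by_technique(cases: list[dict]) -> dict[str, float]:
    """Fraction of alerts per ATT&CK technique triaged as true positive."""
    totals, true_pos = Counter(), Counter()
    for case in cases:
        totals[case["technique"]] += 1
        if case["disposition"] == "true_positive":
            true_pos[case["technique"]] += 1
    return {t: true_pos[t] / totals[t] for t in totals}

print(precision_by_technique(cases))  # {'T1003': 0.5, 'T1566.001': 1.0}
```

Low precision on a technique points at a noisy analytic; consistently slow or inconsistent dispositions point at process or training gaps. Either way, the numbers only matter if they feed back into tuning.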