Traditional anomaly detection methods are either rule-based, which doesn't generalize well because the rules are too specific to cover every possible scenario, or time-series based (time vs. quantity), which is too low-dimensional to capture the complexity of real life. Real-life events have many more dimensions (time, source and destination locations, activity type, object acted on, application used, and so on). A successful anomaly detection system will have eight "must-have" features.
Before we go through those features, two high-level requirements apply: the system must "allow" rather than "block," and it must be based on machine learning.
An allow list is critical because it studies the good guys. Bad guys try to hide and outsmart block-based platforms like anti-malware. A successful machine-learning anomaly detection system won't chase bad guys, looking for "bad-X" in order to react with "anti-X." Instead, an allow-based platform can study what is stable (the good guys' normal behavior) and then watch for outliers. This approach avoids a perpetual and futile arms race.
If you're going to do anomaly detection the right way, you need to be able to scale to billions of events per day and beyond. At that scale it isn't practical to define allow lists a priori, or to keep a perfect history of every observed combination of behaviors. Instead, anomaly detection models should be "soft": they work with conditional probabilities of event features and are ever-evolving.
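To make that concrete, here is a minimal Python sketch of a "soft," ever-evolving model. It tracks conditional counts of one event feature given another (say, which app a given user touches) and returns a smoothed probability rather than a hard allow/deny verdict. The class and field names are illustrative, not from any particular product.

```python
from collections import defaultdict

class SoftConditionalModel:
    """Toy "soft" model: tracks how often a feature value appears in a
    given context (e.g., which app a given user touches) and returns a
    smoothed, ever-evolving probability rather than a hard allow/deny."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> value -> count
        self.totals = defaultdict(int)                       # context -> events seen

    def update(self, context, value):
        # Every new event refines the model; nothing is fixed a priori.
        self.counts[context][value] += 1
        self.totals[context] += 1

    def probability(self, context, value):
        # Laplace-smoothed P(value | context): an unseen combination gets a
        # small but non-zero probability instead of an outright block.
        count = self.counts[context].get(value, 0)
        distinct = len(self.counts[context]) + 1
        return (count + self.smoothing) / (self.totals[context] + self.smoothing * distinct)

model = SoftConditionalModel()
for _ in range(50):
    model.update("alice", "app:salesforce")
print(model.probability("alice", "app:salesforce"))   # familiar pairing -> high probability
print(model.probability("alice", "app:unknown-ftp"))  # novel pairing -> low, but not zero
```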
The second high-level requirement is that a successful anomaly detection system must be machine learning-based. Virtually every CASB today uses this term, but few mean it. Machine learning means just what it says: the computer recognizes patterns without being specifically told what to look for. There are two main types of machine learning, supervised and unsupervised. In the former the computer learns from a dataset of labeled training data; in the latter it makes sense of unlabeled data and finds patterns that would be hard to find otherwise. Both supervised and unsupervised machine learning are relevant for this blog, and from here on out I'll simply refer to anomaly detection as "Machine Learned Anomaly Detection," or "MLAD" for short.
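As a toy illustration of the unsupervised flavor, the sketch below fits scikit-learn's IsolationForest to unlabeled event features and scores new events by how isolated they look. The features and numbers are made up for the example; IsolationForest is just one of many possible model choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Unlabeled event features (e.g., hour of day, bytes transferred).
# No "good"/"bad" labels are provided; the model learns the shape of normal.
rng = np.random.default_rng(0)
normal_events = rng.normal(loc=[13.0, 2_000.0], scale=[2.0, 500.0], size=(1000, 2))

model = IsolationForest(contamination=0.01, random_state=0).fit(normal_events)

# score_samples: lower scores mean more anomalous.
candidates = np.array([[14.0, 2_100.0],     # typical afternoon transfer
                       [3.0, 250_000.0]])   # 3 a.m., huge transfer
print(model.score_samples(candidates))
```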
Now that we have established some high-level requirements, let’s dive into the eight “must-haves” for effective MLAD.
Noise resistance: A common issue with all anomaly detection systems is false positives. In practice it is hard to avoid them entirely, because in the real world there is always overlap between two distributions with unbounded ranges and different means. The chart below, which shows two distributions from the same data set of test results, illustrates this. Move the criterion threshold to the right and you get fewer false positives (FPs), but you also get a growing number of false negatives (FNs). There is always a tradeoff.
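The tradeoff is easy to see numerically. The sketch below uses two hypothetical Gaussian score distributions (parameters invented purely for illustration) and shows how sliding the criterion to the right shrinks the false-positive rate while inflating the false-negative rate.

```python
from scipy.stats import norm

# Hypothetical score distributions: benign events ~ N(0, 1), anomalous ~ N(3, 1).
benign = norm(loc=0.0, scale=1.0)
anomalous = norm(loc=3.0, scale=1.0)

for threshold in (1.0, 2.0, 3.0):
    false_positive_rate = benign.sf(threshold)      # benign events above the criterion
    false_negative_rate = anomalous.cdf(threshold)  # anomalous events below it
    print(f"threshold={threshold:.1f}  FP={false_positive_rate:.3f}  FN={false_negative_rate:.3f}")
```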
While it is difficult to avoid false positives, a successful MLAD system will take steps to help the user filter the noise. Applied to cloud security, new users or devices will, by definition, generate patterns that are seen for the first time (a new IP address, a new application, a new account, and so on). Good MLAD learns source habits over time and flags anomalies only when the event stream from a source, such as a user or device, is statistically seasoned, that is, established enough.
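One simple way to express that "seasoning" idea in code is to suppress alerts for any source until enough of its events have been observed. The threshold below is purely illustrative.

```python
from collections import defaultdict

SEASONING_THRESHOLD = 500  # illustrative: events required before a source's anomalies are surfaced

event_counts = defaultdict(int)

def should_surface_anomaly(source_id, is_statistical_outlier):
    """Suppress alerts from sources we have not observed long enough:
    a brand-new user or device is expected to produce first-seen patterns."""
    event_counts[source_id] += 1
    if event_counts[source_id] < SEASONING_THRESHOLD:
        return False  # still learning this source's habits
    return is_statistical_outlier
```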
More critically, MLAD must support a likelihood metric per event. Operators can then display only the top N most unlikely or unusual events, sorted from rarest to most common, while automatically filtering out everything whose estimated probability of occurring is more common than a cutoff such as "one in a thousand" or "one in a million." Often these per-event likelihood metrics are based on the machine-learned statistical history of parameter values and how likely they are to appear together in context, for any source. It is up to the user to set the sensitivity thresholds that determine what they see. This type of approach flags "rareness," not "badness."
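A minimal sketch of that filtering logic, assuming each event already carries a model-estimated probability (the field name and cutoff are illustrative):

```python
def surface_rare_events(events, top_n=20, max_probability=1e-6):
    """Keep only events rarer than the cutoff, then show the N least likely.
    Each event is assumed to carry a probability estimated from the learned
    history of its source (field name is illustrative)."""
    rare = [e for e in events if e["estimated_probability"] < max_probability]
    rare.sort(key=lambda e: e["estimated_probability"])  # least likely first
    return rare[:top_n]

events = [
    {"id": 1, "estimated_probability": 0.2},    # routine, filtered out
    {"id": 2, "estimated_probability": 3e-7},   # roughly "one in a few million" -- surfaced
    {"id": 3, "estimated_probability": 8e-9},   # rarest -- shown first
]
print(surface_rare_events(events, top_n=10))
```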
Multi-dimensionality and generality: Successful MLAD platforms don't rely on specific, hard-wired rules. Machine-learned anomalies are no longer unidimensional ("location-based," "time-based," and so on); they are detected in multiple, multi-dimensional spaces. You must look at every feature you can collect that makes sense for an event, and consider many features as a whole when calculating the likelihood of each combination. An anomaly may be triggered by one unusual value in a single dimension, or by a combination of multiple dimensions falling out of bounds together. Features can be categorical or numeric, ordered or unordered, cyclical or not, monotonic or non-monotonic.
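As a rough sketch of scoring an event across many dimensions, the snippet below sums per-feature log-probabilities looked up from a toy "learned" table. A real system would also model interactions between features rather than treating them independently, but the effect is the same: one very rare value, or several mildly rare ones, drags the joint likelihood down. All values and feature names here are invented for the illustration.

```python
import math

# Toy per-dimension probabilities learned from history (illustrative numbers).
learned = {
    ("hour_of_day", 3): 0.01,
    ("source_country", "US"): 0.7,
    ("destination_country", "BR"): 0.02,
    ("activity", "bulk_download"): 0.05,
    ("app", "corporate_crm"): 0.6,
}

def event_log_likelihood(event, floor=1e-6):
    """Sum log-probabilities across every dimension we collect; unseen
    feature values fall back to a small floor probability."""
    return sum(math.log(learned.get((name, value), floor))
               for name, value in event.items())

event = {
    "hour_of_day": 3,              # numeric, cyclical
    "source_country": "US",        # categorical
    "destination_country": "BR",   # categorical
    "activity": "bulk_download",   # categorical
    "app": "corporate_crm",        # categorical
}
print(event_log_likelihood(event))  # very negative totals are candidates to flag
```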