Sep 02 2021

AI/ML for Malware Detection

This is the fourth in an ongoing series of blog posts focused on AI/ML.

Malware detection is an important part of the Netskope Security Cloud platform, complete with a secure access service edge (SASE) architecture, that we provide to our customers. Malware is malicious software that is designed to harm or exploit devices and computer systems. Various types of malware, such as viruses, worms, Trojan horses, ransomware, and spyware, remain a serious problem for corporations and government agencies. Traditional malware detection systems rely on anti-virus signatures, heuristics, and behavior patterns in sandboxes, which require a significant amount of manual analysis from security analysts and researchers. With new attacks and variants emerging every day, it is hard for organizations to keep pace with malware threats. In comparison, artificial intelligence (AI) and machine learning (ML) have the potential to detect unknown and zero-day malware by automatically learning malware patterns from large volumes of historical data. This capability has made AI/ML an indispensable part of a modern malware detection solution, complementing heuristic and signature-based approaches.

At Netskope, we have developed a comprehensive, multi-layered threat protection system to scan our customers’ network traffic. AI/ML is used to power multiple engines in the inline fast scan, as well as static and dynamic analysis-based deep scan. In this blog post, we will highlight three of them:

  • Inline PE Classifier
  • MS Office Classifier
  • Cloud Sandbox

Inline PE Classifier

The Portable Executable (PE) file format is used by Windows executables, object code, and dynamic link libraries (DLLs). It’s one of the most common malware file formats. To stop malicious PE files in real-time, we have developed the inline PE classifier. Trained with millions of malicious and benign PE samples, the ML-based classifier is able to identify malware patterns in raw bytes. The classifier doesn’t need to parse a PE file and extract features based on domain knowledge. Therefore, it’s lightweight, fast, and suitable for inline predictions.
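To make the idea of learning from raw bytes concrete, here is a minimal sketch of how a file could be turned into model input without any PE parsing. It is an illustration only, not Netskope's implementation: the function names, the `MAX_LEN` value, and the padding token are all assumptions introduced for this example.

```python
from typing import List

MAX_LEN = 4096  # hypothetical input length; the post does not state one
PAD_ID = 256    # padding token chosen outside the 0-255 byte range


def bytes_to_ids(data: bytes, max_len: int = MAX_LEN) -> List[int]:
    """Truncate or pad raw file bytes to a fixed-length list of integer IDs."""
    ids = list(data[:max_len])
    ids.extend([PAD_ID] * (max_len - len(ids)))
    return ids


def byte_histogram(data: bytes) -> List[float]:
    """Return a normalized 256-bin byte-frequency histogram of the file."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]


if __name__ == "__main__":
    header = b"MZ\x90\x00"  # first bytes of a typical PE file
    print(bytes_to_ids(header, max_len=8))
```

Either representation can be fed to a classifier directly, which is why no format-specific feature engineering is needed in this style of model.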

The inline PE classifier complements the signature-based malware engines in fast scan. Since its launch, the classifier has detected unique malware samples that were undetectable to signature-based inline engines, without introducing any new false positives. Its runtime in production is just a few milliseconds.

This high-efficacy ML classifier shortens the time to detection for unique threats, which can be blocked inline, and complements the dynamic analysis with advanced forensics in the Advanced Threat Protection engines.

MS Office Classifier

Microsoft Office documents are another common source of malware. As part of Netskope’s Advanced Threat Protection, the Office Classifier is designed to leverage a combination of heuristics and supervised machine learning to identify malicious code embedded in Office documents. The Office Classifier performs static analysis and extracts detailed information about the components in an Office file, including embedded macros (VBA), dynamic data exchange (DDE), and other embedded files such as JPG/MPEG or EXE/PE files. The extracted information is then mapped to hundreds of features to train ML classification models and predict whether a new Office document is malicious or not.
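The mapping from extracted components to features can be sketched as follows. This is a simplified illustration with six hand-picked features, not the hundreds used by the Office Classifier. The report layout (`macros`, `dde_links`, `embedded_files`) and every feature name are hypothetical, and a real pipeline would first need a document parser to produce such a report.

```python
from typing import Dict, List

# Hypothetical feature names, chosen only for this example.
FEATURE_NAMES = [
    "macro_count", "has_auto_exec", "has_shell_call",
    "dde_count", "embedded_pe_count", "embedded_media_count",
]

# VBA procedure names that Office runs automatically when a file is opened.
AUTO_EXEC = ("autoopen", "document_open", "workbook_open")


def to_feature_vector(report: Dict[str, List[str]]) -> List[float]:
    """Map a static-analysis report (assumed layout) to a numeric vector."""
    macros = [m.lower() for m in report.get("macros", [])]
    files = [f.lower() for f in report.get("embedded_files", [])]
    return [
        float(len(macros)),
        float(any(k in m for m in macros for k in AUTO_EXEC)),
        float(any("shell" in m for m in macros)),
        float(len(report.get("dde_links", []))),
        float(sum(f.endswith((".exe", ".dll")) for f in files)),
        float(sum(f.endswith((".jpg", ".jpeg", ".mpg", ".mpeg")) for f in files)),
    ]


if __name__ == "__main__":
    sample = {
        "macros": ['Sub AutoOpen()\n  Shell "cmd.exe"\nEnd Sub'],
        "dde_links": [],
        "embedded_files": ["payload.exe", "logo.jpg"],
    }
    print(dict(zip(FEATURE_NAMES, to_feature_vector(sample))))
```

Vectors like this one, labeled malicious or benign, are the kind of input a supervised classifier is trained on.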

The Office Classifier provides proactive coverage against zero-day malware attacks that can evade signature-based detections. For example, the Office Classifier has detected downloads of multiple zero-day Emotet samples distributed as Office document files targeting multiple Netskope customers (see screenshot below). The Emotet samples used multi-layered obfuscation techniques to bypass signature-based AV software but were detected by the Office Classifier. Recently, the Office Classifier also detected a new set of malicious Office documents that use VBA and LOLBins (living-off-the-land binaries).

Screenshot of the Office Classifier detecting downloads of multiple zero-day Emotet samples distributed as Office document files

Cloud Sandbox

Sandboxing has proven to be an effective way to detect advanced malware. The Cloud Sandbox is enhanced with a machine learning engine in Netskope’s Advanced Threat Protection system. The Cloud Sandbox collects sample behaviors by executing the samples in an isolated Windows environment. The report of observed behaviors can then be used for heuristics and ML-based malware detection. Each report contains runtime behavior, such as process trees, where each tree node represents the behavior of a process, including API calls, dynamic link libraries (DLLs), registry key activities, file activities, and network activities. We use deep learning transformer techniques to learn the tree structure and activities of the sandbox report and classify whether the file is malicious or not.
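One common way to present a tree to a sequence model such as a transformer is to flatten it into tokens with a depth-first traversal that keeps the nesting visible. The sketch below shows that idea only. It is an assumption about how such input could be prepared, not a description of Netskope's model, and the `ProcessNode` class, the token format, and the example process names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessNode:
    """One process in a sandbox report (simplified to name and API calls)."""
    name: str
    api_calls: List[str] = field(default_factory=list)
    children: List["ProcessNode"] = field(default_factory=list)


def linearize(node: ProcessNode, depth: int = 0) -> List[str]:
    """Depth-first flattening; open/close markers preserve the tree shape."""
    tokens = [f"<proc depth={depth}>", node.name]
    tokens.extend(node.api_calls)
    for child in node.children:
        tokens.extend(linearize(child, depth + 1))
    tokens.append("</proc>")
    return tokens


if __name__ == "__main__":
    # Hypothetical report: a document process spawning a script interpreter.
    tree = ProcessNode(
        "winword.exe",
        ["CreateProcessW"],
        [ProcessNode("powershell.exe", ["InternetOpenUrlW"])],
    )
    print(linearize(tree))
```

The resulting token list would then be mapped to integer IDs and passed to the model. Registry, file, and network activities could be added to each node in the same way as the API calls.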

Diagram of process trees

Summary

At Netskope, we have integrated AI/ML into our large-scale malware detection system to power multiple static and dynamic analysis engines. It is clear that AI/ML can identify unknown malware with great precision and complement other signature and heuristic engines. There are technical challenges associated with AI/ML, including high accuracy and low latency requirements, changing malware patterns, and model interpretability. We are addressing these challenges to reach AI/ML’s full potential in malware detection.

About the author
Dr. Yihua Liao is the Director of Data Science at Netskope. His team develops cutting-edge AI/ML technology to tackle many challenging problems in cloud security, including data loss prevention, malware and threat protection, and user/entity behavior analytics. Previously, he led data science teams at Uber and Facebook.