Coauthored by Ben Xue and Yi Zhang
This is the third entry in a series of articles focused on AI/ML.
Natural language processing (NLP) is a branch of artificial intelligence (AI) that gives machines the ability to read, understand, and derive meaning from human languages. NLP powers many applications that we use every day, such as virtual assistants, machine translation, chatbots, and email auto-complete. The technology is still evolving quickly. Over just the last few years, we have seen incredible breakthroughs in NLP research, including transformers and powerful pre-trained language models such as GPT-3, which have significantly accelerated the development of NLP applications across many domains.
At Netskope, we are integrating the latest NLP technology into our secure access service edge (SASE) solution, as well as into our business operations. NLP works behind the scenes on a wide variety of tasks, including:
- Detecting sensitive information in documents to help our customers comply with privacy regulations and protect their digital assets.
- Categorizing and detecting malicious web domains, URLs, and web content to enable web filtering.
- Detecting malware and preventing enterprise assets from being compromised and used as a launchpad for malicious activities.
- Classifying SaaS and web apps and evaluating the enterprise readiness of a cloud app as part of the Cloud Confidence Index (CCI).
In this blog post, we will highlight three ways Netskope uses NLP to secure data and protect against threats: DLP document classification, URL categorization, and DGA domain detection.
DLP Document Classification
Our customers store many documents in cloud storage or transfer them through cloud applications. Many of these documents contain sensitive information, such as confidential legal and financial data, intellectual property, and employee or user personally identifiable information (PII). At Netskope, we have developed machine learning-based document classifiers as part of our inline Data Loss Prevention (DLP) service. The ML classifiers automatically sort documents into categories such as tax forms, patents, and source code. Security administrators can then create DLP policies based on these categories. The ML classifiers complement traditional regular expression-based DLP rules and enable granular policy controls in real time. In many cases, manually configured regex rules generate excessive false positives or false negatives when looking for specific patterns in documents. In comparison, the ML classifiers automatically learn the patterns and identify se