In the modern, cloud-first era, traditional data protection technologies struggle to keep up. Data is growing rapidly in volume, variety, and velocity. It is also increasingly unstructured, and therefore harder to detect and, consequently, to protect. Most DLP solutions today rely solely on textual analysis to detect sensitive data, applying regular expressions and content matching techniques to “conventional” data types (such as Word documents and spreadsheets). These techniques were once revolutionary; today, they lag behind.
Don’t get me wrong: it is fundamental for DLP to be equipped with as many text analysis tools as possible. After all, when it is identifiable, it is the content itself that is sensitive. DLP must be able to recognize thousands of known sensitive data types and unambiguous regular expressions, and understand data formats specific to countries and languages. For reliability, DLP must also be equipped with highly scalable data fingerprinting engines that can memorize and match specific information found in sensitive databases and documents. Textual content must be clear and legible for such engines to leverage it. To minimize false positives, it is now also fundamental to leverage rich context, deep learning, natural language processing (NLP), and other newer ML- and AI-based automated techniques.
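To make the two text-based techniques concrete, here is a minimal sketch of pattern matching and exact-match fingerprinting. This is purely illustrative, not Netskope’s implementation: the two patterns, the normalization step, and the helper names are all hypothetical stand-ins for what a real engine would ship at much larger scale.

```python
import hashlib
import re

# Hypothetical detectors for two common sensitive data types.
# Production DLP engines ship thousands of validated, locale-aware patterns.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_sensitive(text):
    """Return the names of the data types whose patterns match the text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))

def fingerprint(document):
    """Exact-match fingerprinting: a stable hash of normalized content."""
    normalized = " ".join(document.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

# A registry of fingerprints taken from known-sensitive documents.
known = {fingerprint("Q3 acquisition target list: ACME Corp")}

print(find_sensitive("Employee SSN: 123-45-6789"))                     # ['ssn']
print(fingerprint("Q3  acquisition target list: ACME Corp") in known)  # True
```

Note that the fingerprint only survives trivial whitespace and case changes; as the post says, the text must be clear and legible for matching engines to work at all.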
When it comes to unstructured data sources like images, optical character recognition (OCR) is traditionally used to extract text, which is then scanned with regular expression (regex) identification or exact matching analysis.
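The weakness of that pipeline is easy to demonstrate. In the sketch below, the OCR step itself is omitted; the two strings stand in for the text an OCR engine might return from a clean image versus a blurry one, and the SSN pattern is a hypothetical example detector:

```python
import re

# Hypothetical detector for a US Social Security number.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scan_extracted_text(text):
    """Regex pass over whatever text OCR managed to extract from an image."""
    return SSN.findall(text)

# OCR on a clean image recovers the digits, so the pattern fires...
print(scan_extracted_text("SSN 987-65-4321"))   # ['987-65-4321']
# ...but on a blurry image a single misread character defeats detection.
print(scan_extracted_text("SSN 9B7-65-4321"))   # []
```

One misrecognized character (a `B` where a `8` should be) and the sensitive data sails through undetected, which is exactly the failure mode the next paragraphs describe.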
Because of the fast rhythm of modern business communication, users have developed new habits that make traditional data identification quite unreliable. To share information quickly and more often, users frequently exchange unstructured data such as images, taking screenshots or snapping photos with a smartphone to rapidly convey ideas, show visual evidence, provide diagrams and slides on the go, or show a colleague contact information from a data repository like Salesforce. Those are just a few examples.
In these cases, even OCR cannot perform well on low-quality images where the text is not clearly readable. And with large volumes of images to process, OCR and data matching consume excessive resources, introducing incident response latency.
Evolving modern DLP
For modern businesses, DLP has to evolve. Think of modern DLP as needing to function like a human brain. Our brain doesn’t have to read the text in a document like a picture ID to tell that it is indeed a picture ID containing personally identifiable information (PII). Now, modern DLP can do the same.
To solve modern DLP challenges, Netskope has pioneered ML-enabled image classification. This technique leverages deep learning and convolutional neural networks (CNN) to swiftly and accurately identify sensitive images without the need for text extraction. It mimics the human visual cortex, recognizing visual characteristics such as shapes and details to comprehend the image as a whole (much like how we can recognize that a passport is a passport without necessarily reading the details in it). ML enables feature recognition even in poor-quality images, akin to the capabilities of the human eye. This is crucial, as images can be blurry, damaged, or discolored, yet still contain sensitive information.
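To give a flavor of how a CNN “sees” shapes rather than characters, here is a toy 2D convolution in plain Python. It is a stand-in for the learned filters in a real network: the hand-written vertical-edge kernel below is hypothetical, whereas a trained CNN learns thousands of such filters from data.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image and sum
    the elementwise products; high responses mark matching visual features."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A tiny "image" with a vertical edge (dark left half, bright right half).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A vertical-edge kernel: it responds where brightness jumps left-to-right.
kernel = [
    [-1, 1],
    [-1, 1],
]
print(convolve2d(image, kernel))  # [[0, 2, 0], [0, 2, 0]]
```

The strong response down the middle column marks the edge, even though no individual pixel value was “read.” Stacking many such learned filters is what lets a CNN recognize the overall shape of a passport or an ID card, blur and all.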
The importance of personalized data classifiers
Netskope’s industry-leading ML classifiers empower automated identification of sensitive data, revolutionizing the categorization of images and documents with exceptional precision. This breakthrough technology detects and safeguards various sensitive data types, including source code, tax forms, patents, identification documents like passports and driver’s licenses, credit and debit cards, as well as full-screen screenshots and application screenshots. The ML classifiers work in conjunction with text-based DLP analysis (such as data identifiers, exact matching, document fingerprinting, ML-based NLP, and deep learning), complementing the DLP analysis of a file when text is indecipherable or harder to extract. They greatly enhance detection accuracy and help enable real-time DLP controls.
But what if I told you that a set of predefined ML classification templates may still not be enough?
Nowadays, organizations also possess proprietary document types and templates, personalized forms, and industry-specific files that fall outside the realm of standard ML classifiers. Netskope’s Train Your Own Classifiers (TYOC) technology revolutionizes data protection by combining the strength of AI, the adaptability of ML, and the convenience of automation. TYOC automatically identifies and categorizes new data based on a “train and forget” approach. Consider this analogy: your brain can recognize a known document like a passport or a W-2 form, but it won’t identify a new document type you’ve never encountered before. Yet, once your eyes see it and your brain learns its features, you can easily recognize it in the future. This is precisely how TYOC operates.
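In spirit, learning a new document type from a handful of examples works like the toy nearest-centroid sketch below. This is purely illustrative of the “train and forget” idea, not TYOC itself: the two-dimensional feature vectors and class labels are made up, and a real system learns far richer representations.

```python
def centroid(vectors):
    """Average a class's sample feature vectors into a single prototype."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(samples_by_label):
    """'Train and forget': distill each class's examples into a centroid."""
    return {label: centroid(vs) for label, vs in samples_by_label.items()}

def classify(model, vector):
    """Assign the label of the closest learned prototype."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vector))

# Hypothetical 2-D "features" extracted from customer-supplied examples
# of two proprietary document types.
model = train({
    "invoice_template": [[0.9, 0.1], [0.8, 0.2]],
    "lab_report":       [[0.1, 0.9], [0.2, 0.8]],
})
print(classify(model, [0.85, 0.15]))  # invoice_template
```

Once the prototypes are learned, every future file is recognized automatically; no one has to hand-write a rule for the new document type, which mirrors the brain analogy in the paragraph above.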
With TYOC, Netskope has democratized AI and ML data protection, granting customers the power of AI, automation, and adaptive learning as part of the Netskope Intelligent SSE capabilities available today. Organizations can embrace these cutting-edge advancements to safeguard their sensitive data and stay ahead of ever-evolving data protection requirements. This innovation empowers organizations to confidently address today’s most formidable data protection challenges while relieving policy administrators of most manual burdens, allowing them to focus human resources on more critical tasks.
TYOC is part of SkopeAI, the new Netskope suite of artificial intelligence and machine learning (AI/ML) innovations now available across the complete Netskope SASE portfolio. SkopeAI offerings use AI/ML to deliver modern data protection and cyber threat defense, overcoming the limitations of legacy security technologies and delivering AI-speed protection techniques not found in products from other SASE vendors.
If you’d like to learn more, please visit our d