The following is an excerpt from Netskope’s recent white paper, How to Design a Cloud Data Protection Strategy, written by James Christiansen and David Fairman.
Step 1: Know where the data is stored and located (aka Data Discovery)
This is the process of discovering, detecting, and locating all of the structured and unstructured data that an organization possesses. This data may be stored on company hardware (endpoints, databases), on employee BYOD devices, or in the cloud.
There are many tools available to assist in the discovery of data (both in transit and at rest), and these vary between on-prem and cloud-related data. This process is intended to ensure that no data is left unknown and unprotected. It is the core of creating a data-centric approach to data protection, as the organization builds an inventory of all of its data. This inventory is a critical input to a broader data governance strategy and practice.
Information assets are constantly changing, and new assets are added all the time, so any static list becomes out of date and ineffective almost immediately. When establishing the process for data discovery, make sure to use automation. It is the only way to keep an active view of your information assets and effectively manage the risk.
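As one minimal sketch of such automation, assuming AWS S3 and the boto3 SDK (the bucket names and CSV inventory format below are placeholders for whatever stores and inventory system an organization actually uses), a scheduled job can enumerate cloud storage and refresh the inventory:

```python
# Minimal sketch: automated discovery of objects in cloud storage (AWS S3 via boto3).
# Bucket names and the inventory format are illustrative assumptions; real discovery
# tooling would also cover endpoints, databases, SaaS apps, and other stores.
import csv
import boto3

def discover_s3_inventory(bucket_names, output_path="data_inventory.csv"):
    """Walk the given buckets and record every object into a simple inventory file."""
    s3 = boto3.client("s3")
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["bucket", "key", "size_bytes", "last_modified"])
        for bucket in bucket_names:
            paginator = s3.get_paginator("list_objects_v2")
            for page in paginator.paginate(Bucket=bucket):
                for obj in page.get("Contents", []):
                    writer.writerow([bucket, obj["Key"], obj["Size"], obj["LastModified"]])

if __name__ == "__main__":
    # Run on a schedule (cron, Lambda, etc.) so the inventory never goes stale.
    discover_s3_inventory(["example-finance-bucket", "example-hr-bucket"])
```

Running a job like this on a schedule is what keeps the inventory from becoming the static, immediately stale list described above.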
Step 2: Know the sensitivity of the data (aka Data Classification)
Once the data is discovered, it needs to be classified. Data Classification is the process of analyzing the contents of the data, searching for PII, PHI, and other sensitive data, and classifying it accordingly. A common approach is to have three or four levels of classification, typically the following (a simple code sketch of such a policy appears after the two lists):
3 level policy:
- Public
- Private / Internal
- Confidential
4 level policy:
- Public
- Private / Internal
- Confidential
- Highly Confidential / Restricted
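As a minimal sketch, assuming Python, a four-level policy can be expressed as a single shared enumeration so that discovery, tagging, and enforcement tooling all agree on the levels:

```python
# Illustrative only: a four-level classification policy expressed as an enum so that
# discovery, tagging, and DLP tooling share one definition of the levels.
from enum import IntEnum

class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1       # "Private / Internal"
    CONFIDENTIAL = 2
    RESTRICTED = 3     # "Highly Confidential / Restricted"

# Higher values mean more sensitive, so handling rules can compare levels directly.
assert Classification.RESTRICTED > Classification.CONFIDENTIAL
```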
Once a policy is created, the data itself needs to be tagged within its metadata (this is the implementation of the data classification policy). Traditionally, this has been a complex and often inaccurate process. Examples of traditional approaches have been (a minimal rule-based sketch follows this list):
- Rule-based
  - RegEx, keyword match, dictionaries
- Fingerprinting and IP protection
- Exact Data Match
- Optical Character Recognition
- Compliance coverage
- Exception management
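A minimal sketch of the rule-based approach, assuming Python and deliberately simplified patterns and keywords, might look like this:

```python
# Minimal sketch of rule-based classification: regular expressions plus keyword
# dictionaries. The patterns are simplified examples, not production detectors;
# real tooling adds validation (e.g., checksum tests), proximity rules, exact data
# matching against known records, and OCR for images.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}
KEYWORDS = {"confidential", "patient", "salary"}

def classify(text):
    """Return 'Confidential' if any sensitive pattern or keyword is found, else 'Internal'."""
    lowered = text.lower()
    if any(p.search(text) for p in PATTERNS.values()) or any(k in lowered for k in KEYWORDS):
        return "Confidential"
    return "Internal"

print(classify("Employee SSN: 123-45-6789"))  # -> Confidential
```

The weakness of this approach is visible even in a toy example: every new data type needs a new hand-written rule, and false positives and negatives are hard to control at scale.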
Approaches to data classification have evolved and organizations must leverage new capabilities if they are to truly classify the large volume of data they create and own. Some examples are:
- Machine Learning (ML)-based document classification and analysis, including the ability to train models and classifiers on an organization's own data sets or to use predefined ML classifiers (making it simple for organizations to create classifiers without the need for complex data science skills). (See this analysis from Netskope.) A minimal training sketch follows this list.
- Natural Language Processing (NLP)
- Context Analysis
- Image Analysis and classification
- Redaction and privacy
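A minimal sketch of ML-based document classification, assuming Python and scikit-learn with placeholder training documents and labels, might look like this:

```python
# Illustrative sketch of ML-based document classification with scikit-learn: train a
# classifier on an organization's own labeled documents, then predict classification
# levels for new content. The tiny training set and labels are placeholders; a real
# deployment needs far more data, evaluation, and retraining.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "Quarterly press release and public product announcement",
    "Internal meeting notes and project status update",
    "Patient medical history and treatment plan",
    "Customer payment card numbers and billing records",
]
train_labels = ["Public", "Internal", "Restricted", "Restricted"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

print(model.predict(["Lab results and diagnosis for patient 4411"]))  # e.g. ['Restricted']
```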
These approaches must have the ability to support API-based, cloud-native services for automated classification and process integration. This allows the organization to build a foundational capability that uses process and technology, including models, together to classify data, which then becomes a data point for additional inspection and protection decisions.
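As one illustrative possibility for such an API-based, cloud-native service, the sketch below calls Google Cloud DLP's content-inspection API (the project ID, sample text, and infoTypes are assumptions, and any comparable classification API could stand in):

```python
# Purely illustrative: calling a cloud-native classification/DLP API (here, Google
# Cloud DLP via the google-cloud-dlp client). The project ID and infoTypes are
# assumptions; the point is that classification runs as an API-driven, automated
# service that other processes and pipelines can integrate with.
import google.cloud.dlp_v2

def inspect_text(project_id, text):
    dlp = google.cloud.dlp_v2.DlpServiceClient()
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}],
                "include_quote": True,
            },
            "item": {"value": text},
        }
    )
    # Each finding reports what was detected and how confident the service is.
    for finding in response.result.findings:
        print(finding.info_type.name, finding.likelihood, finding.quote)

inspect_text("example-project", "Reach me at jane.doe@example.com, SSN 123-45-6789.")
```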