The following is an excerpt from Netskope’s recent white paper How to Design a Cloud Data Protection Strategy written by James Christiansen and David Fairman.
Step 1: Know where the data is stored and located (aka Data Discovery)
This is the process of discovering, detecting, and locating all the structured and unstructured data an organization possesses. This data may be stored on company hardware (endpoints, databases), on employee-owned (BYOD) devices, or in the cloud.
There are many tools available to assist in the discovery of data (both in transit and at rest), and these vary between on-premises and cloud environments. This process is intended to ensure that no data is left unknown and unprotected. It is the core of a data-centric approach to data protection, as the organization creates an inventory of all of its data. This inventory is a critical input to a broader data governance strategy and practice.
Information assets change constantly, and new assets are added all the time, so any static list becomes out of date and ineffective almost immediately. When establishing the data discovery process, be sure to use automation: it is the only way to keep an active view of your information assets and to manage the risk effectively.
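To make the idea of automated discovery concrete, here is a minimal sketch that walks a set of storage locations and records every file it finds in an inventory. The paths, the DataAsset record, and the scanning logic are hypothetical placeholders; real deployments would use dedicated discovery tooling and cloud provider APIs, run on a schedule rather than as a one-off pass.

```python
# Minimal, hypothetical sketch of an automated data-discovery pass.
# Real deployments rely on dedicated discovery tools and cloud APIs;
# this only illustrates continuously rebuilding an inventory of assets.
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class DataAsset:
    location: str                      # where the object lives (share, bucket, endpoint path)
    size_bytes: int
    last_modified: datetime
    classification: str = "unclassified"  # filled in later during classification (Step 2)

def discover(roots: list[str]) -> list[DataAsset]:
    """Walk each storage root and record every file found."""
    inventory = []
    for root in roots:
        for path in Path(root).rglob("*"):
            if path.is_file():
                stat = path.stat()
                inventory.append(DataAsset(
                    location=str(path),
                    size_bytes=stat.st_size,
                    last_modified=datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
                ))
    return inventory

if __name__ == "__main__":
    # A scheduled job (not a one-off run) keeps the inventory current.
    assets = discover(["/mnt/file-share", "/srv/exports"])  # example roots, hypothetical paths
    print(f"Discovered {len(assets)} assets")
```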
Step 2: Know the sensitivity of the data (aka Data Classification)
Once the data is discovered, that data needs to be classified. Data Classification is the process of analyzing the contents of the data, searching for PII, PHI, and other sensitive data, and classifying it accordingly. A common approach is to have 3 or 4 levels of classification, typically:
3 level policy:
- Public
- Private / Internal
- Confidential
4 level policy:
- Public
- Private / Internal
- Confidential
- Highly Confidential / Restricted
Once a policy is created, the data itself needs to be tagged within its metadata; this is the implementation of the data classification policy. Traditionally, this has been a complex and often inaccurate process. Examples of traditional approaches (a minimal rule-based sketch follows the list) have been:
- Rule-based (RegEx, keyword match, dictionaries)
- Fingerprinting and IP Protection
- Exact Data Match
- Optical Character Recognition
- Compliance coverage
- Exception management
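As a simple illustration of the traditional rule-based style, the sketch below classifies text by regular-expression and keyword matching. The patterns, labels, and default are illustrative only; production tools use far richer rule sets, exact data matching, and fingerprinting.

```python
# Illustrative rule-based classification (regex and keyword matching).
# Patterns and labels are examples only, not a production rule set.
import re

# Rules are checked from most to least sensitive (dicts preserve insertion order).
PATTERNS = {
    "Highly Confidential / Restricted": [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-like pattern
        re.compile(r"\b(?:\d[ -]*?){13,16}\b"),  # crude payment-card pattern
    ],
    "Confidential": [
        re.compile(r"\bsalary\b", re.IGNORECASE),
        re.compile(r"\bmedical record\b", re.IGNORECASE),
    ],
}

def classify(text: str) -> str:
    """Return the most sensitive label whose rules match, else a default."""
    for label, rules in PATTERNS.items():
        if any(rule.search(text) for rule in rules):
            return label
    return "Private / Internal"  # default when no rule fires

print(classify("Employee SSN: 123-45-6789"))  # Highly Confidential / Restricted
print(classify("Lunch menu for Friday"))      # Private / Internal
```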
Approaches to data classification have evolved, and organizations must leverage new capabilities if they are to classify the large volumes of data they create and own. Some examples are:
- Machine Learning (ML)-based document classification and analysis, including both predefined ML classifiers and the ability to train models and classifiers on an organization’s own data sets, making it simple for organizations to create classifiers without needing complex data science skills; a minimal sketch follows this list. (See this analysis from Netskope.)
- Natural Language Processing (NLP)
- Context Analysis
- Image Analysis and classification
- Redaction and privacy
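As a rough sketch of ML-based document classification (not any vendor's implementation), the example below trains a TF-IDF plus logistic-regression model with scikit-learn on a handful of invented labelled documents. Real systems train on much larger corpora or ship with predefined classifiers.

```python
# Illustrative ML-based document classification using scikit-learn.
# The tiny training set and labels are invented for the example; in practice
# an organization would train on its own labelled documents or rely on
# predefined classifiers supplied by its data-protection platform.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "Quarterly financial results and revenue forecast",
    "Patient diagnosis and treatment history",
    "Company picnic schedule and parking information",
    "Source code escrow agreement and licensing terms",
]
train_labels = ["Confidential", "Highly Confidential", "Internal", "Confidential"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_labels)

print(model.predict(["Updated treatment plan for patient 4711"]))
```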
These classification approaches must also support API-based, cloud-native services for automated classification and process integration; a hedged sketch of such an integration follows. This allows the organization to build a foundational capability that brings process and technology, including models, together to classify data, which then becomes a data point for additional inspection if needed. The result is a real-time, automated classification capability.
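The sketch below posts a document to a hypothetical internal classification service and returns the label for writing back into the object's metadata. The endpoint URL, request payload, and response shape are entirely invented placeholders, not any specific vendor's API.

```python
# Hypothetical sketch of calling a cloud-native classification service over an API.
# Endpoint, payload, and response fields are invented placeholders.
import requests

CLASSIFY_URL = "https://classifier.example.internal/v1/classify"  # placeholder endpoint

def classify_via_api(object_id: str, text: str) -> str:
    response = requests.post(
        CLASSIFY_URL,
        json={"object_id": object_id, "content": text},
        timeout=10,
    )
    response.raise_for_status()
    label = response.json()["classification"]  # assumed response field
    # In a real pipeline the label would be written back into the object's metadata
    # and fed to downstream controls (DLP, access policy, additional inspection).
    return label
```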
Classification escalation and de-escalation is a method commonly used to classify all discovered data. Each data object that has not yet been classified receives a default classification, injected into its metadata (for example, if not classified, default to confidential or highly confidential). Based on a series of tests or criteria, the object’s classification can then be escalated or de-escalated to the appropriate level. This coincides with many of the principles of Zero Trust, which is fast becoming a fundamental capability for any Data Protection Strategy.
(More information on Zero Trust can be found in the Netskope article What is Zero Trust Security?)
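As a rough sketch of the escalation and de-escalation idea above, the example below applies a conservative default label to any unclassified object and then moves it up or down as individual checks provide evidence. The levels, checks, and metadata fields are placeholders, not a prescribed implementation.

```python
# Hypothetical sketch of classification escalation/de-escalation.
# Every unclassified object starts at a conservative default and is then
# nudged up or down as checks (regex hits, ML scores, context) add evidence.
LEVELS = ["Public", "Private / Internal", "Confidential", "Highly Confidential / Restricted"]
DEFAULT = "Confidential"  # conservative default for anything not yet classified

def escalate(label: str) -> str:
    return LEVELS[min(LEVELS.index(label) + 1, len(LEVELS) - 1)]

def de_escalate(label: str) -> str:
    return LEVELS[max(LEVELS.index(label) - 1, 0)]

def classify_object(metadata: dict, checks: list) -> dict:
    """Apply the default, then let each check move the label up or down."""
    label = metadata.get("classification") or DEFAULT
    for check in checks:
        verdict = check(metadata)  # each check returns "up", "down", or None
        if verdict == "up":
            label = escalate(label)
        elif verdict == "down":
            label = de_escalate(label)
    metadata["classification"] = label  # write the result back into the metadata
    return metadata

def is_public_web_content(metadata: dict):
    """Example check: already-published marketing material can be de-escalated."""
    return "down" if metadata.get("source") == "public-website" else None

print(classify_object({"name": "press-release.docx", "source": "public-website"},
                      [is_public_web_content]))
```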
A note on determining “crown jewels” and prioritization
Data classification goes a long way in helping an organization identify its crown jewels. For the purpose of this conversation, “crown jewels” are defined as the assets that access, store, transfer, or delete the data most important to the organization. Taking a data-centric approach, it’s imperative to understand the most important data, assessing both its sensitivity and its criticality. This determination is not driven by data classification alone.
A practical model for determining the importance of data is to take into account three pillars of security (Classification, Integrity, and Availability), each assigned a weighting from 1 to 4 aligned to related policies or standards. A total score of 12 (4+4+4) for a data object indicates that the data is highly confidential, has high integrity requirements, and needs to be highly available.
Here is an example of typical weightings, as an enterprise might apply them:
Classification:
- Highly confidential = 4
- Confidential = 3
- Internal = 2
- Public = 1

Integrity:
- High integrity = 4
- Medium integrity = 3
- Low integrity = 2
- No integrity requirement = 1

Availability (driven from the BCP and IT DR processes):
- Highly available = 4
- RTO 0 - 4 hrs = 3
- RTO 4 - 12 hrs = 2
- RTO > 12 hrs = 1
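A small worked sketch of this scoring model is shown below. The weightings mirror the example table above; the sample data objects and lookup tables are invented for illustration.

```python
# Sketch of the Classification / Integrity / Availability scoring model above.
# Weightings follow the example table; the sample inputs are invented.
CLASSIFICATION = {"Highly confidential": 4, "Confidential": 3, "Internal": 2, "Public": 1}
INTEGRITY = {"High": 4, "Medium": 3, "Low": 2, "None": 1}
AVAILABILITY = {"Highly available": 4, "RTO 0-4 hrs": 3, "RTO 4-12 hrs": 2, "RTO > 12 hrs": 1}

def importance_score(classification: str, integrity: str, availability: str) -> int:
    """Total score from 3 (lowest) to 12 (a crown-jewel candidate)."""
    return (CLASSIFICATION[classification]
            + INTEGRITY[integrity]
            + AVAILABILITY[availability])

# A score of 12 flags data that is highly confidential, high integrity, and highly available.
print(importance_score("Highly confidential", "High", "Highly available"))  # 12
print(importance_score("Public", "Low", "RTO > 12 hrs"))                    # 4
```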