A Torrid Affair: Data Analytics and Data Protection

Friends with Benefits or Deadlocked Foes?

With the rapid expansion of data analytics frameworks and technologies, it has never been easier to mine datasets for customer insights. Creative analytics power everything from advertising to fraud detection, and collecting data has arguably become a key requirement for most organizations. Because the collected data is likely to contain sensitive elements and personally identifiable information (PII), several privacy regulations with international reach have emerged in recent years (GDPR in the EU, CCPA in California, LGPD in Brazil, and others) to ensure that private data is protected and anonymized.

Conflicting Interests

From the perspective of a data scientist, the need to share data broadly and the need to protect it seem to be at odds. For instance, a data lake may hold transactional data that includes purchase history, demographic data, GPS location data, or even medical and pharmaceutical information. Gaining insights from this data requires the ability to differentiate, sort, and amalgamate it. Examples include identifying trends based on location, detecting prescription fraud by matching social security numbers and drug types, and determining popular purchases for a demographic group in order to launch a targeted marketing campaign.

Since the need for data protection cannot be ignored, the choice is difficult from a privacy, risk, and compliance perspective: either make the data really difficult to access, or de-identify it and lose the analytic value. The resulting internal war of attrition is often bruising, since the data an organization most needs to monetize is also potentially its biggest risk to reputation.

Seeking a Match but Only Finding Passes

Data protection and privacy can mean different things to different people. Traditionally, data protection meant data-at-rest encryption, such as encrypting storage via the OS or storage array, or encrypting database columns or tables. The nice thing about this approach is that it’s transparent: data is encrypted as it is written and decrypted automatically when it is read. The drawback is that data is unprotected while it is being moved or used and, generally speaking, if a user has access to a system, disk, or database, there isn’t a way to selectively grant or deny access.

To overcome these shortcomings, substantial investments are often made in access and perimeter security to ensure that only authorized users have access to systems and databases. This complicates sharing of data and, of course, doesn’t prevent misuse of data that a user has been granted access to. This “legitimate access vulnerability” often gives rise to another use of encryption, where sensitive data is encrypted with a block cipher such as AES in CBC mode (robust protection, with the ciphertext typically Base64-encoded), or to masking, where portions of a dataset are permanently redacted. It’s this step that really makes analytics difficult: encrypted data no longer resembles the original and can’t be used without first decrypting it, while masked data loses analytic value because it is heavily redacted. The result is great security that satisfies risk and audit but creates more work for data scientists.
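To make the masking problem concrete, here is a minimal Python sketch. The `mask_ssn` helper and the sample values are made up for illustration and are not taken from any product. It shows how permanent redaction can collapse distinct records into the same value, which is exactly what hurts matching and counting.

```python
def mask_ssn(ssn: str, keep_last: int = 4) -> str:
    """Irreversibly replace all but the last `keep_last` digits with 'X'.

    Hypothetical helper for illustration; separators are left untouched.
    """
    total_digits = sum(c.isdigit() for c in ssn)
    out, seen = [], 0
    for c in ssn:
        if c.isdigit():
            seen += 1
            # Keep only the trailing `keep_last` digits in the clear.
            out.append(c if seen > total_digits - keep_last else "X")
        else:
            out.append(c)
    return "".join(out)


# Made-up sample values: the first two differ but share their last four digits.
records = ["123-45-6789", "987-65-6789", "555-12-3456"]
masked = [mask_ssn(s) for s in records]

distinct_before = len(set(records))  # three different people
distinct_after = len(set(masked))    # two of them are now indistinguishable
```

Once two different people share a masked value, any join or count keyed on that column can no longer tell them apart, and there is no way to recover the original.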

Having Your Cake and Eating it Too

In the vein of “there has to be a better way”, enter, stage right, data-centric security. Technologies such as data tokenization allow you to protect data while maintaining the format, length, and character composition of the original. That means database schemas don’t have to change to accommodate data protection. Furthermore, unlike data masking, tokenization can protect data selectively: for example, tokenizing the first five digits of a social security number and leaving the last four in the clear, so that the token can be used for customer-service-type applications and fully detokenized only when absolutely necessary.
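As a rough illustration of partial, format-preserving tokenization, here is a minimal Python sketch. The `TokenVault` class and its method names are hypothetical and are not any vendor’s API. It is an in-memory toy that swaps the leading digits for random ones and remembers the mapping; a real system would use a hardened vault or a vetted format-preserving encryption scheme, with access controls around detokenization.

```python
import secrets


class TokenVault:
    """Toy in-memory token vault (hypothetical; for illustration only)."""

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenize(self, value: str, keep_last: int = 4) -> str:
        """Replace all but the last `keep_last` digits with random digits.

        Length, separators, and character classes are preserved, and the
        same input always returns the same token from this vault.
        """
        if value in self._value_to_token:
            return self._value_to_token[value]
        n_protect = sum(c.isdigit() for c in value) - keep_last
        if n_protect <= 0:
            raise ValueError("no digits left to tokenize")
        while True:
            token = self._substitute(value, n_protect)
            # Retry in the unlikely case the token equals the input
            # or has already been issued for another value.
            if token != value and token not in self._token_to_value:
                break
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    @staticmethod
    def _substitute(value: str, n_protect: int) -> str:
        out, replaced = [], 0
        for c in value:
            if c.isdigit() and replaced < n_protect:
                out.append(secrets.choice("0123456789"))
                replaced += 1
            else:
                out.append(c)
        return "".join(out)

    def detokenize(self, token: str) -> str:
        """Recover the original value; in practice this is tightly restricted."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")  # made-up SSN; last four digits stay clear
```

Because the token keeps the `NNN-NN-NNNN` shape, it fits in the same column as the original, and a customer service screen can still confirm the last four digits without ever seeing the real number.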

Then there is another hidden gem that speaks directly to the heart of a data scientist: referential integrity. Maintaining referential integrity means wherever an identical data element is protected, the protected value is identical. It also means I can run queries and join datasets that are protected and get the same results as if they were unprotected. For example, a marketing campaign can match a list of customers with a list of credit scores keyed off of social security numbers, all while the sensitive data remains in a protected state.
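The join scenario can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `protect` function uses a keyed HMAC as a stand-in for a real tokenization service (it is deterministic but not reversible), and the key, names, and scores are made up. The point is only that identical inputs yield identical protected values, so a join on the protected key matches the same rows as a join on the clear key.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key for the sketch


def protect(ssn: str) -> str:
    """Deterministically map an SSN to a surrogate in the same NNN-NN-NNNN shape.

    Stand-in for a tokenization service. Reducing the digest to nine digits
    can collide at scale, which a real system must rule out.
    """
    digest = hmac.new(SECRET_KEY, ssn.encode(), hashlib.sha256).digest()
    digits = f"{int.from_bytes(digest, 'big') % 10**9:09d}"
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def join_on_key(customers: dict, scores: dict) -> set:
    """Inner join of two {key: value} mappings, returning (name, score) pairs."""
    return {(customers[k], scores[k]) for k in customers.keys() & scores.keys()}


# Made-up sample data keyed by SSN.
customers = {"123-45-6789": "Alice", "987-65-4321": "Bob", "555-12-3456": "Carol"}
scores = {"123-45-6789": 720, "555-12-3456": 680, "111-22-3333": 640}

# Each dataset is protected independently, as if by different teams.
protected_customers = {protect(k): v for k, v in customers.items()}
protected_scores = {protect(k): v for k, v in scores.items()}

clear_result = join_on_key(customers, scores)
protected_result = join_on_key(protected_customers, protected_scores)
```

Both joins pair the same customers with the same scores, yet the second one never touches a real social security number.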

If being able to use data while it is still protected isn’t enough, the data also stays protected when you share it, which allays privacy and data residency concerns since you’re no longer working with the actual data. And as data flows in and out of systems, it can stay protected wherever it goes, unlike data-at-rest encryption or perimeter security solutions, which can only protect data under their purview.

Swiping Right Never Felt So Good

Data-centric security empowers the data scientist to extract value from data, enabling the monetization of data while maintaining privacy and true data security. With data-centric protection such as tokenization, it’s no longer a dilemma of either data analytics or data protection: it’s both. Further simplifying the picture, data-centric security also plays nicely with legacy protection schemes such as data-at-rest encryption. Plus, as the infrastructure grows and new data analytics projects are undertaken, the same protection can extend across the entire enterprise, closing security gaps and supporting compliance. For me, I can’t help but swipe right, because it enables me to do everything I need to do with my data without limits!