With the rapid expansion of data analytics frameworks and technologies, it has never been easier to mine datasets for customer insights. Creative analytics power everything from advertising to fraud detection, and the collection of data is arguably a core activity of most organizations. Since the collected data is likely to contain sensitive elements and personally identifiable information (PII), several internationally applicable privacy regulations have emerged in recent years (GDPR in the EU, CCPA in California, LGPD in Brazil, etc.) to ensure that private data is protected and anonymized.
Since the need for data protection cannot be ignored, organizations face a difficult choice from a privacy, risk, and compliance perspective: either make the data very difficult to access, or de-identify it and lose its analytic value. The resulting internal war of attrition is often bruising, because the data that is most valuable to monetize is also potentially the biggest risk to reputation.
To overcome these shortcomings, substantial investments are often made in access and perimeter security to ensure that only authorized users can reach systems and databases. This complicates the sharing of data and, of course, doesn't prevent misuse by a user who has legitimately been granted access. This "legitimate access vulnerability" often gives rise to another layer of protection: either encryption, where sensitive data is encrypted with a secure block cipher such as AES in CBC mode (the ciphertext typically base64-encoded for storage), or masking, where portions of a dataset are redacted permanently. It's this step that really makes analytics difficult: the protected data no longer resembles the original, the analytic value is compromised by heavy redaction, and ciphertext can't be used without first decrypting it. Great security that satisfies risk and audit, but it creates more work for data scientists.
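As a minimal sketch of why permanent masking destroys analytic value, consider redacting all but the last four digits of a social security number (the function name and format here are hypothetical, not from any particular product). Distinct customers can collapse to the same masked value, so joins and counts keyed on the masked field no longer behave like the original data:

```python
def mask_ssn(ssn: str) -> str:
    """Irreversibly redact an SSN, keeping only the last four digits."""
    digits = ssn.replace("-", "")
    return "***-**-" + digits[-4:]

# Two different people who happen to share the last four digits.
customers = ["123-45-6789", "987-65-6789"]
masked = [mask_ssn(s) for s in customers]
print(masked)  # ['***-**-6789', '***-**-6789']
# Both records collapse to the same masked value, so a query or join
# keyed on the masked SSN would wrongly treat them as one customer.
```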
Then there is another hidden gem that speaks directly to the heart of a data scientist: referential integrity. Maintaining referential integrity means that wherever an identical data element is protected, the protected value is identical. It also means you can run queries and join datasets that are protected and get the same results as if they were unprotected, for example, a marketing campaign that matches a list of customers with a list of credit scores keyed off of social security numbers, all while the sensitive data is in a protected state.
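A deterministic token illustrates the idea. The sketch below uses an HMAC as a stand-in for whatever protection scheme is in play (the key and `protect` helper are hypothetical, purely for illustration): because the same input always yields the same token, two protected datasets still join correctly on the protected key.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical tokenization key, for illustration only

def protect(ssn: str) -> str:
    """Deterministic protection: identical inputs yield identical tokens."""
    return hmac.new(KEY, ssn.encode(), hashlib.sha256).hexdigest()[:16]

# Two datasets keyed on protected SSNs instead of the real values.
customers = {protect("123-45-6789"): "Alice", protect("555-12-3456"): "Bob"}
scores = {protect("123-45-6789"): 710, protect("999-99-9999"): 640}

# Join on the protected key: same matches as joining on the raw SSNs,
# while the sensitive values never appear in either dataset.
matched = {name: scores[tok] for tok, name in customers.items() if tok in scores}
print(matched)  # {'Alice': 710}
```

The design point is determinism: an encryption mode like AES-CBC with a random IV produces a different ciphertext every time, which is exactly what breaks joins; referential integrity requires the protected value to be stable per input.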
If being able to use data while it is still protected isn't enough, the data remains protected even when you share it, which allays privacy and data residency concerns since you're no longer working with the actual values. And as data flows in and out of systems, it stays protected wherever it goes, unlike data-at-rest encryption or perimeter security solutions, which can only protect data under their purview.