Big Data Security Series Part 2: How Hard is it to Secure Big Data?

Survey says: exceedingly difficult.

But why?

Years ago, with traditional databases, you had complete control over how to implement and protect your servers. You also had a manageable number of servers, let’s say around 10, in a hardened environment, and you could do anything you wanted with the data you had.

Over time, the number of data sources has exploded: web, mobile, digital workplaces, and so on. To process all that new data, you now need more storage and more computing power than you can fit in your garage. And by garage, of course I mean state-of-the-art, in-house data center.

The fact is, big data means big resources. In response to this growing demand, cloud technology came along to help reduce the burden of in-house storage and processing, at times taking those roles over entirely. Analytics tools like the Hadoop MapReduce framework help manage the growing amount of data. While very practical, cloud services add a lot of complexity to questions of security and compliance. Additionally, some enterprises use a multicloud environment to avoid vendor lock-in, which can complicate things even further.

Now, don’t get me wrong, cloud computing is great. It offers flexibility and insights like never before. And many companies are getting on board. As of 2018, 73% of global businesses performed big data processing in the cloud. And with IoT we’ll need even more power. Currently, over 7 billion IoT devices are in use and that number is expected to reach 10 billion by the end of 2020.

The more complex the network, the bigger the attack surface

But again, in terms of security, cloud computing and IoT make things a lot more, well, cloudy. When you’ve deployed Hadoop across hundreds of machines, your infrastructure is much larger and much more difficult to protect.

Let me try to illustrate: let’s say sensitive data is collected when someone submits information on a web form. What happens then? For every data set, there might be something like 10 copies in your network. And that data isn’t all stored and analyzed on-premises anymore. Nowadays it’s constantly moving between different environments, databases, and applications, which might be on-premises, in the cloud, or in a hybrid environment combining both.

Take the Hadoop MapReduce framework for example. It’s based on groups of standard computers that process data in parallel. Unless care is taken to remove known technical vulnerabilities, to manage and secure the multiple administrator accounts, and to protect the content of the multiple file systems, the framework is at risk from common cyberthreats that could impact the confidentiality of the data as well as the integrity of the analysis. In essence, a MapReduce cluster is just a bunch of machines with vulnerabilities like any other.
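
If the "process data in parallel" part feels abstract, here is a toy sketch of the map and reduce steps in plain Python. To be clear, this is not Hadoop's API: the function names are made up for illustration, and threads on one machine stand in for what would be separate computers in a real cluster. The thing to notice is that every worker gets its own slice of the data, which in a real deployment means every machine holds data that needs protecting.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Iterable


def map_chunk(chunk: str) -> Counter:
    """Map step: count the words in one chunk of the input."""
    return Counter(chunk.lower().split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(chunks: Iterable[str], workers: int = 4) -> Counter:
    """Run the map step in parallel, then reduce the partial results."""
    # Each worker thread is a stand-in for one node in a cluster (an assumption
    # made for this sketch; a real framework distributes chunks over a network).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    logs = ["login alice", "login bob", "logout alice"]
    print(word_count(logs))
```

Swap the three log lines for terabytes and the threads for hundreds of servers, and you can see why the number of accounts and file systems to secure grows so quickly.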

The point I’m getting at here is that traditional perimeter defenses don’t cut it anymore because the network perimeter has become a lot more, I'll say it again, cloudy. The extent of the network is so broad that a perimeter can no longer be clearly defined and, as a consequence, traditional perimeter security can't fill in all the gaps.

Organizations are struggling to adapt their cybersecurity strategy. A 2019 survey of 1,200 IT decision makers across the globe revealed that 78% of organizations had experienced a successful cyberattack in the preceding 12 months.

At this point, if you haven’t been hacked, you should almost be asking yourself, is it because my defenses are that good or is my company just that uninteresting?

Internal and third-party threats

Putting external threats aside, there are plenty of internal threats to data security as well. They might be due to the oversight of a partner or just accidental exposure. When third parties are involved in the analysis process, the risk of data being copied or misused increases as well. For example, we often hear about misconfigured Amazon S3 buckets where data was available in clear text. This is just human error, which is nearly unavoidable.
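
Human error may be nearly unavoidable, but it can at least be checked for. Below is a minimal sketch of such a check in Python. It assumes you have already collected each bucket's Block Public Access settings into a plain dictionary (how you fetch them is out of scope here). The four setting names are the ones S3 uses; the function names and the sample bucket names are hypothetical.

```python
# The four S3 Block Public Access settings. A bucket with any of these
# switched off (or missing) deserves a second look.
REQUIRED_BLOCKS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_public_access_blocks(config: dict) -> list:
    """Return the settings that are absent or disabled in one bucket's config."""
    return [name for name in REQUIRED_BLOCKS if not config.get(name, False)]


def audit_buckets(buckets: dict) -> dict:
    """Map each risky bucket name to the list of settings it is missing."""
    findings = {}
    for bucket_name, config in buckets.items():
        missing = missing_public_access_blocks(config)
        if missing:
            findings[bucket_name] = missing
    return findings


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "analytics-raw": dict.fromkeys(REQUIRED_BLOCKS, True),
        "partner-export": {"BlockPublicAcls": True, "IgnorePublicAcls": True},
    }
    print(audit_buckets(inventory))
```

A check like this won't catch every exposure, since bucket policies and ACLs matter too, but running it regularly is a cheap way to spot the most common slip.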

Additionally, the data is at risk before it even arrives on a third party’s network. In many cases, data in motion is not protected well, and too often it’s not protected at all.
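
For a sense of what "protected" can look like for data in motion, here is a small sketch using Python's standard ssl module. It builds a client-side TLS configuration that verifies the server's certificate and refuses protocol versions older than TLS 1.2. The function names are my own, and the TLS 1.2 floor is an assumption for this example rather than a universal recommendation.

```python
import socket
import ssl


def build_client_context() -> ssl.SSLContext:
    """Create a TLS client context with certificate checks switched on."""
    context = ssl.create_default_context()
    # These two are already the defaults; they are set explicitly so that
    # nobody quietly turns them off later.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    # Assumption for this sketch: reject anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def open_protected_connection(host: str, port: int = 443) -> ssl.SSLSocket:
    """Open a TCP connection and wrap it in TLS before any data is sent."""
    raw_socket = socket.create_connection((host, port), timeout=10)
    return build_client_context().wrap_socket(raw_socket, server_hostname=host)
```

The important part is less the specific settings and more the habit: data should be wrapped in an encrypted, authenticated channel before it leaves your network, not after it arrives somewhere else.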

Data privacy regulations

So we have a massive amount of valuable data in our network, sensitive data is all over the place, and the lines of our network perimeter are blurred. Could things get any worse?

The answer is yes, things can always get worse.

Enter stage right the four letters of mass destruction: GDPR.

Wherever personal data goes in your network, so too travel your regulatory obligations, and in a big data environment, that’s a lot of locations. And chances are GDPR isn’t the only data privacy regulation you have to comply with.

So what’s next? How can we protect the data in all these instances and comply with all these data privacy regulations? How can we reap the benefits of big data analytics when there are so many challenges? Find out in the next post!