After several weeks of incredible growth, OpenAI has come in for a bumpy ride of late. First it revealed details of a data breach that exposed the information of a significant number of ChatGPT subscribers. Then the Italian data protection regulator (GPDP) became the first in Europe to ban the product for users in the country, citing contraventions of the GDPR.
Although the fallout from last month’s breach was limited, the regulatory action that followed makes one thing very clear. Firms offering AI services, and corporate customers using and contributing data to these AI models, must first develop a clear data-centric security strategy to mitigate any breach/leakage risks.
The ChatGPT maker was forced to take its flagship AI service offline in late March after discovering a bug in a third-party open source library (Redis) that it uses. The firm said the issue may have enabled some users to view:

- the chat history titles of other active users
- the first message of newly created conversations
- payment-related details of other ChatGPT Plus subscribers, including name, email address, payment address, card expiry date and the last four digits of the card number
OpenAI was at pains to point out that users would have needed to jump through several hoops to view the sensitive personal, chat and financial information of other ChatGPT users during this incident. It added that the payment-related information of only 1.2% of ChatGPT Plus subscribers was exposed during the nine-hour window. However, such personal information can be a treasure trove for fraudsters, who leverage it in follow-on phishing attacks designed to elicit more data.
That’s why organizations have obligations under the GDPR to ensure any such information is protected at all times from theft and alteration. That’s part of the reason for the temporary ban on ChatGPT in Italy.
In short, this particular breach at OpenAI thankfully wasn’t too serious. But the next one might be. That’s especially true of AI providers that use open source code, as most developers do these days to accelerate time to market. Threat actors are proactively seeding these upstream repositories with malicious packages, in the hope that they’ll be pulled into customers’ builds and exploited. Last year, experts claimed to have uncovered 88,000 malicious open source packages, a 742% increase on 2019 figures.
Breaches can come from many other sources besides third-party code repositories. And as the OpenAI case has shown, the potential response from regulators can be severe. That should focus minds more keenly on protecting sensitive customer account data with best practice mechanisms like strong encryption and tokenization. This means that if the data is stolen or accidentally leaked there should be minimal regulatory repercussions, because it would be impossible to read or use.
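To make the idea concrete, the sketch below shows the basic shape of tokenization in Python. The in-memory vault and the `tok_` prefix are purely illustrative assumptions; a real deployment would rely on a hardened tokenization service (or a vaultless, format-preserving scheme) rather than this toy mapping.

```python
import secrets

# Toy in-memory vault for illustration only: maps opaque tokens back to real values.
# A production system would use a hardened tokenization service, not a dict.
_vault: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value (e.g. a card number) with a random token."""
    token = "tok_" + secrets.token_hex(8)
    _vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; only a trusted service should be allowed to call this."""
    return _vault[token]

# The application stores and shares only tokens, which are useless if leaked...
record = {"email": tokenize("jane.doe@example.com"), "card": tokenize("4111111111111111")}
print(record)
# ...and detokenizes only at the point where the real value is genuinely needed.
print(detokenize(record["card"]))
```

The point is that a stolen copy of `record` reveals nothing about the customer: the sensitive values live only behind the tokenization boundary.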
There’s a secondary consideration here for corporate customers of AI tools like ChatGPT. Some might want to submit their own data to help train the model to provide more accurate and relevant outputs. Others may type sensitive information into the tool and unwittingly have it stored and shared by the AI vendor. Any decision to do so should therefore be carefully weighed against the potential risk of breach or exposure by the third-party AI service provider.
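For organizations that assemble prompts programmatically, one pragmatic way to reduce that exposure is to mask obvious identifiers before any text leaves the business. The sketch below is illustrative only: the regex patterns cover just emails and card numbers, and `send_to_ai_service` is a placeholder rather than any particular vendor's API.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

def scrub(prompt: str) -> str:
    """Mask obvious personal data before a prompt is sent to an external AI service."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    prompt = CARD.sub("[CARD]", prompt)
    return prompt

def send_to_ai_service(prompt: str) -> None:
    # Placeholder for whatever client the organization actually uses.
    print("Sending:", prompt)

send_to_ai_service(scrub(
    "Customer jane.doe@example.com paid with card 4111 1111 1111 1111 and wants a refund."
))
# -> Sending: Customer [EMAIL] paid with card [CARD] and wants a refund.
```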
Lawyers at compliance expert Cordery had the following advice:
“If you’re inputting your data into the tool, do you know what that data will be used for? Will your data be added to the data training pot? If so, are you happy with that especially given the questionable ownership of some AI solutions? How can you guarantee the security of the data, which is also a GDPR requirement?”
This is where comforte can help, with data-centric security that: