Ponderings on personal data processing in AI models following the EDPB's AI Opinion

Any self-respecting Englishman or woman will tell you that, to make a really good cup of tea, the tea must steep in its teapot for a couple of minutes before pouring. Sometimes the same is true of regulatory guidance: to really get to grips with it, you need to let it steep in the back of your mind for a few days first.

That, at least, is my excuse for taking a couple of weeks to post my thoughts on the EDPB's Opinion 28/2024 on "certain data protection aspects related to the processing of personal data in the context of AI models", published shortly before Christmas. To my mind, it was one of the most eagerly-anticipated, and impactful, items of guidance published in 2024 for a variety of reasons, namely the EDPB's authoritative status as the arbiter of all things GDPR, the ever-growing presence of AI in our daily lives, and organisations' desperate need to better understand how to apply GDPR rules in the context of AI.

Much was riding on this Opinion, then, and the importance of "getting it right" was clearly recognised by the EDPB, which held a stakeholder event in November to discuss some of the issues under consideration. You can see my thoughts on that event here.

Now that it's arrived, how did the Opinion fare?

In short, it provides some helpful guidance, while also not really saying anything too unexpected and dodging the most difficult questions that practitioners face. Much is answered with an "it depends" and the need to conduct a "case-by-case assessment".

This will disappoint many but, as the saying goes, you must "be careful what you wish for" - sometimes it's better to have a bit of uncertainty and room for manoeuvre than to receive explicit instruction that is not to your liking.

It's also important to recognise that the EDPB was operating under strict constraints - namely, its Opinion had to respond to specific questions asked by the Irish DPC under the Article 64(2) procedure, and had to be completed within the strict timelines required by that procedure (a maximum period of 14 weeks).

Keeping that in mind, it really was no small feat on the part of the EDPB to produce an Opinion on such a complex and broad issue in such a short space of time.

Pondering what the Opinion DOES address

In broad terms, the Opinion answers the following questions (which I'll paraphrase here for simplicity, although there is a bit more detail to the specific questions themselves):

  • If an AI model is trained on personal data, is the AI model itself personal data? Answer: Likely, but it depends and requires a case-by-case analysis. (More on this below.)
  • Can a data controller rely on legitimate interests to develop and deploy the AI model? Answer: Possibly, but it depends and requires a case-by-case analysis.
  • Where personal data is unlawfully processed to develop an AI model, does this impact the lawfulness of the subsequent deployment of the model? Answer: Possibly, but it depends and requires a case-by-case analysis.

As you'll see, each of these questions is ultimately met with an "it depends" answer. Not really that surprising, given the vast breadth of data, models and use cases to which AI can apply.

Pondering what the Opinion does NOT address

What's more interesting is not what the Opinion addresses, but what it intentionally does not address (see para 17 of the Opinion).

Most notably, it excludes any discussion of the processing of special categories of data and of purpose limitation. In practice, these two points often prove to be the thorniest issues in AI development. This is because:

  • special category data is often needed to help de-bias an AI model but can generally be used only with explicit consent under the GDPR, presenting obvious challenges (e.g. what if you have no direct relationship with the individuals to ask for their consent, what if individuals' refusal of consent itself introduces bias - because certain groups are more or less likely to consent, etc?), and
  • purpose limitation issues often arise when a service provider has collected certain user data to provide its services and now wishes to train on that historic user data to make service improvements or even offer new services. Is training considered a compatible use of the data, or an incompatible use for which data subject consent should be sought? Does it make any difference if you disclosed or did not disclose this possibility to data subjects previously?

A further, critical, issue the Opinion does not address (but does not list among its exclusions) is the use of personal data for AI model training by a data processor - e.g. where an enterprise service provider (processor) wants to use its enterprise customers' (controllers') data for AI training. What impact does this have on the processor's role - does it become a controller? Does it require authorisation from the controller, the controller's users (data subjects), or both? We've seen guidance from some national DPAs on this, but guidance at an EDPB level would be welcome.

Regrettably - though understandably, for reasons already explained - the Opinion is silent on the above points. We can only hope that further guidance may be forthcoming in time.

Pondering AI models and anonymity

On the issue of whether an AI model can be considered anonymous, the EDPB concludes that "AI models trained on personal data cannot, in all cases, be considered anonymous. Instead, the determination of whether an AI model is anonymous should be assessed, based on specific criteria, on a case-by-case basis". Again, "it depends".

So when can an AI model be considered anonymous? According to the EDPB, only when "using reasonable means, both (i) the likelihood of direct (including probabilistic) extraction of personal data regarding individuals whose personal data were used to train the model; as well as (ii) the likelihood of obtaining, intentionally or not, such personal data from queries, should be insignificant for any data subject".

Put another way, if you can't get personal data out of it, the model is anonymous. If you can, it isn't.

Personally, I find this a peculiar conclusion for a few reasons:

  • First, it appears to conflate the output of a model with the model itself. Let's assume for a moment that the EDPB is correct, and that an AI model is (or can contain) personal data. Then why should the model be considered anonymous (i.e. cease to be personal data) if personal data cannot be output by it? To give a crude analogy, if I use a jam jar to capture a few bees, the jam jar doesn't cease to contain bees simply because I've attached a secure lid to it so that no bees can escape. The EDPB's logic seems a little shaky. If a model is to be considered personal data, then surely it can only become anonymous once the identifying information has been removed from within the model itself, not simply blocked at its output.
  • Second, while I have absolutely no issue accepting that AI models can be trained on, take as input, and produce as output, personal data, I do struggle to accept the view that the model itself is personal data. Ignoring, for a moment, that a model is a mathematical algorithm and not data in the traditional computing sense, classifying a model as personal data has some serious GDPR consequences - including that a data subject could require the model itself (not just the data it is trained upon, or the output data it produces) to be corrected, deleted or restricted. There is no guaranteed and practical way to determine where, if at all, an individual's data is captured within the weights of a complex machine learning model. So, if the model itself is considered to be personal data, the only way to fulfil the data subject's request would seem to be to scrap and retrain the model - a disproportionate outcome in response to a single data subject's request, and one with serious environmental costs.
  • Third, if you accept that a model is not personal data, the data subject can still exercise their rights against output produced by the model, which unquestionably will be data and can be personal - and any such output data can be blocked through the use of filters or similar techniques applied to the model (while accepting these can have their own limitations). Put another way, the same outcome can be achieved (i.e. preventing the extraction or obtaining of personal data) without needing to treat the model itself as personal data.
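To make the idea of an output filter a little more concrete, here is a minimal sketch in Python. It is purely illustrative: the function name redact_output and the two regular expressions are my own assumptions, not anything taken from the Opinion or from a particular product, and a real filter would need far broader coverage (names, addresses, identifiers and so on).

```python
import re

# Hypothetical patterns, for illustration only. They catch obvious email
# addresses and phone-like digit runs, and nothing more sophisticated.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_output(text: str) -> str:
    """Replace email addresses and phone-like numbers in model output with placeholders."""
    text = EMAIL_RE.sub("[EMAIL REDACTED]", text)
    text = PHONE_RE.sub("[PHONE REDACTED]", text)
    return text


if __name__ == "__main__":
    print(redact_output("Contact jane.doe@example.com or +44 20 7946 0958."))
```

A pattern-based filter like this is crude: it will miss personal data it has no pattern for and will also redact harmless digit runs such as dates, which is one example of the limitations mentioned above.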

There have, of course, been well-made arguments to the contrary that I acknowledge and respect, and these arguments may ultimately prevail in the ongoing debate around AI model anonymity. Indeed, my own views are somewhat moot in light of the EDPB Opinion. However, I'd like to think that proponents of these arguments would accept that AI models are something the GDPR never envisaged at the time of its drafting, and regulators are now attempting to squeeze a square peg (AI models) through a round hole (the concept of personal data).

Pondering practical steps for compliance

It's easier to criticise than to create, though - and it would be unfair of me to overlook the positives in the EDPB's guidance.

These include that reliance on legitimate interests is possible in AI development and deployment (subject to satisfying the three-part legitimate interests test) and that the unlawful development of an AI model doesn't necessarily make the subsequent use of that model by a deployer unlawful (subject to appropriate due diligence and/or anonymisation).

Most helpfully, the EDPB lists a number of practical measures that developers can take to support anonymisation. Even if anonymisation is likely to prove an unachievable goal for many developers, these measures are likely to become the de facto regulatory standard for demonstrating responsible model development against GDPR principles - so implementing and documenting these measures will be critical, namely:

  • Carefully choosing sources to train an AI model, so as to limit the collection of personal data.
  • Using data preparation processes, such as anonymisation or pseudonymisation, data minimisation, and data filtering, to reduce the amount of personal data processed.
  • Using AI training methodologies that encourage generalisation and avoid overfitting (and memorisation), and using privacy-preserving techniques (like differential privacy) where possible.
  • Applying measures to model outputs to reduce the likelihood of personal data being output, e.g. through filtering.
  • Implementing effective engineering governance, including audits to evaluate the effectiveness of the measures implemented.
  • Adversarial testing against the model to assess its resistance to personal data extraction through, e.g., attribution and membership inference, exfiltration, model inversion and so on.
  • Maintaining appropriate GDPR documentation, including DPIAs, with evidence of any advice or feedback provided by DPOs.
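For readers wondering what "membership inference" testing might look like in practice, here is a minimal sketch in Python of one simple, well-known variant: comparing the model's loss on records it was trained on with its loss on records it never saw. The function name membership_auc and the loss values are hypothetical, and this is just one crude attack among many, not a method prescribed by the EDPB. A good score here would not, on its own, show that a model is anonymous.

```python
from typing import Sequence


def membership_auc(member_losses: Sequence[float],
                   nonmember_losses: Sequence[float]) -> float:
    """Estimate how well per-record loss separates training records from unseen ones.

    Returns the probability that a randomly chosen training record has a lower
    loss than a randomly chosen held-out record (ties count as half).
    A value around 0.5 suggests loss reveals little about membership;
    a value near 1.0 suggests the model has memorised its training data.
    """
    if not member_losses or not nonmember_losses:
        raise ValueError("both loss lists must be non-empty")
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))


if __name__ == "__main__":
    # Hypothetical loss values, for illustration only.
    train = [0.10, 0.20, 0.15]
    held_out = [0.90, 1.10, 0.80]
    print(membership_auc(train, held_out))
```

The pairwise comparison is deliberately simple and only suits small samples. The point is that the result is a number that can be recorded in an audit trail or DPIA and tracked over time as the model is retrained.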

There's much more but, to avoid this article becoming too long, I'll end it here. No doubt there will be further pondering to be done in the weeks and months ahead...

Thanks for sharing your thoughts on the EDPB's Opinion. It's interesting to see how much the conversation around AI and personal data is evolving. What do you think are some key aspects that should have been included in the discussion?

Imho, gen-AI models 'contain' (entail the processing of) personal data, they are not 'themselves' personal data: https://arxiv.org/abs/2202.05262

To the point on ambiguity, isn't it inviting more internal due diligence and raising the need for additional documentation - something that the EDPB continues to implicitly and explicitly come back to in the opinion? And perhaps I am reading too much into it, but I was struck by similarities to the California/US approach to AI (geolocation sensitivity and the presence of PI in AI models under AB 1008). This was a clear and helpful analysis, thank you!

Daniel Florian

Public Policy, Government Relations & Policy Comms for AI | Berlin → EMEA → Global | Principal @ Eightfold | Speaker | Work/Code newsletter

9mo

While I totally agree with your initial statement that sometimes, a case-by-case approach is the best way forward, this seems like a "Schrödinger's cat" opinion where a model may either be considered to contain personal data or not and we only know when we know it. That is all intellectually interesting and the practical measures at the end of your post are definitely something to watch very closely (even though some of them are very hard to meet). But how sure can you be exactly that this will help you in case of an investigation? And if you're not sure, will you develop your model in the first place or make it available in Europe?

More articles by Phil Lee