Ponderings on personal data processing in AI models following the EDPB's AI Opinion
Any self-respecting Englishman or woman will tell you that, to make a really good cup of tea, the tea must steep in its teapot for a couple of minutes before pouring. Sometimes the same is true of regulatory guidance: to really get to grips with it, you need to let it steep in the back of your mind for a few days first.
That, at least, is my excuse for taking a couple of weeks to post my thoughts on the EDPB's Opinion 28/2024 on "certain data protection aspects related to the processing of personal data in the context of AI models", published shortly before Christmas. To my mind, it was one of the most eagerly-anticipated, and impactful, items of guidance published in 2024 for a variety of reasons, namely the EDPB's authoritative status as the arbiter of all things GDPR, the ever-growing presence of AI in our daily lives, and organisations' desperate need to better understand how to apply GDPR rules in the context of AI.
Much was riding on this Opinion, then, and the importance of "getting it right" was clearly recognised by the EDPB, which held a stakeholder event in November to discuss some of the issues under consideration. You can see my thoughts on that event here.
Now that it's arrived, how did the Opinion fare?
In short, it provides some helpful guidance, while also not really saying anything too unexpected and dodging the most difficult questions that practitioners face. Much is answered with an "it depends" and the need to conduct a "case-by-case assessment".
This will disappoint many but, as the saying goes, you must "be careful what you wish for" - sometimes it's better to have a bit of uncertainty and room for manoeuvre than to receive explicit instruction that is not to your liking.
It's also important to recognise that the EDPB was operating under strict constraints - namely, its Opinion had to respond to specific questions asked by the Irish DPC under the Article 64(2) procedure, and had to be completed within the strict timelines required by that procedure (a maximum period of 14 weeks).
Keeping that in mind, it really was no small feat on the part of the EDPB to produce an Opinion on such a complex and broad issue in such a short space of time.
Pondering what the Opinion DOES address
In broad terms, the Opinion answers the following questions asked by the Irish DPC (which I'll paraphrase here for simplicity, although there is a bit more detail to the specific questions themselves):

1. When and how can an AI model be considered "anonymous"?
2. How can controllers demonstrate the appropriateness of legitimate interests as a legal basis for the development of an AI model?
3. How can controllers demonstrate the appropriateness of legitimate interests as a legal basis for the deployment of an AI model?
4. What are the consequences of unlawful processing of personal data during the development of an AI model for the subsequent processing or operation of that model?

As you'll see, each of these questions is ultimately met with an "it depends" answer. Not really that surprising, given the vast breadth of data, models and use cases to which AI can apply.
Pondering what the Opinion does NOT address
What's more interesting is not what the Opinion addresses, but what it intentionally does not address (see para 17 of the Opinion).
Most notably, it excludes any discussion of the processing of special categories of data and of purpose limitation. In practice, these two points often prove to be the most thorny issues in AI development. This is because:

- large-scale training datasets (particularly those scraped from the web) will almost inevitably contain special categories of data, and Article 9 GDPR offers no legitimate interests-style gateway for processing them - the available conditions (such as explicit consent, or data "manifestly made public" by the data subject) are very difficult to satisfy at scale; and
- personal data used for AI training will very often have been collected originally for entirely different purposes, raising difficult questions about whether training is a "compatible" further purpose under the purpose limitation principle.
A further critical issue that the Opinion does not address (but does not list among its exclusions) is the use of personal data for AI model training by a data processor - e.g. where an enterprise service provider (processor) wants to use its enterprise customers' (controllers') data for AI training. What impact does this have on the processor's role - does it become a controller? Does it require authorisation from the controller, from the controller's users (data subjects), or from both? We've seen guidance from some national DPAs on this, but guidance at EDPB level would be welcome.
Regrettably - though understandably, for reasons already explained - the Opinion is silent on the above points. We can only hope that further guidance may be forthcoming in time.
Pondering AI models and anonymity
On the issue of whether an AI model can be considered anonymous, the EDPB concludes that "AI models trained on personal data cannot, in all cases, be considered anonymous. Instead, the determination of whether an AI model is anonymous should be assessed, based on specific criteria, on a case-by-case basis". Again, "it depends".
So when can an AI model be considered anonymous? According to the EDPB, only when "using reasonable means, both (i) the likelihood of direct (including probabilistic) extraction of personal data regarding individuals whose personal data were used to train the model; as well as (ii) the likelihood of obtaining, intentionally or not, such personal data from queries, should be insignificant for any data subject".
Put another way, if you can't get personal data out of it, the model is anonymous. If you can, it isn't.
Personally, I find this a peculiar conclusion for a few reasons:
There have, of course, been well-made arguments to the contrary that I acknowledge and respect, and these arguments may ultimately prevail in the ongoing debate around AI model anonymity. Indeed, my own views are somewhat moot in light of the EDPB Opinion. However, I'd like to think that proponents of these arguments would accept that AI models are something the GDPR never envisaged at the time of its drafting, and regulators are now attempting to squeeze a square peg (AI models) into a round hole (the concept of personal data).
Pondering practical steps for compliance
It's easier to criticise than to create, though - and it would be unfair of me to overlook the positives in the EDPB's guidance.
These include confirmation that reliance on legitimate interests is possible in AI development and deployment (subject to satisfying the three-part legitimate interests test), and that the unlawful development of an AI model doesn't necessarily make the subsequent use of that model by a deployer unlawful (subject to appropriate due diligence and/or anonymisation).
Most helpfully, the EDPB lists a number of practical measures that developers can take to support anonymisation. Even if anonymisation is likely to prove an unachievable goal for many developers, these measures are likely to become the de facto regulatory standard for demonstrating responsible model development against GDPR principles - so implementing and documenting these measures will be critical, namely:

- careful selection of training data sources, together with data preparation steps that minimise or filter out personal data (including pseudonymisation);
- methodological choices during training that reduce identifiability, such as regularisation to limit overfitting or the use of differential privacy;
- measures applied to the model's outputs, such as output filters, to reduce the likelihood of personal data being obtained from queries;
- testing the model's resistance to attacks such as membership inference, attribute and model inversion, and regurgitation of training data; and
- documentation of all of the above, so that the anonymisation assessment can be demonstrated to regulators.
There's much more but, to avoid this article becoming too long, I'll end it here. No doubt there will be further pondering to be done in the weeks and months ahead...
Comments

Thanks for sharing your thoughts on the EDPB's Opinion. It's interesting to see how much the conversation around AI and personal data is evolving. What do you think are some key aspects that should have been included in the discussion?
Imho, gen-AI models 'contain' (entail the processing of) personal data; they are not 'themselves' personal data: https://arxiv.org/abs/2202.05262
To the point on ambiguity, isn't it inviting more internal due diligence and raising the need for additional documentation? That is something the EDPB continues to come back to, implicitly and explicitly, throughout the Opinion. And perhaps I am reading too much into it, but I was struck by similarities to the California/US approach to AI (geolocation sensitivity and the presence of PI in AI models under AB 1008). This was a clear and helpful analysis, thank you!
While I totally agree with your initial statement that sometimes a case-by-case approach is the best way forward, this seems like a "Schrödinger's cat" opinion, where a model may either be considered to contain personal data or not, and we only know when we know. That is all intellectually interesting, and the practical measures at the end of your post are definitely something to watch very closely (even though some of them are very hard to meet). But how sure can you be, exactly, that this will help you in the event of an investigation? And if you're not sure, will you develop your model in the first place, or make it available in Europe?