A practical assessment of Differential Privacy & Federated Learning in the medical context.

(Bing AI generated image, original, full ownership)
The need for data privacy seems to have become generally relaxed in the era of large language models trained on everything from the public web, regardless of actual intellectual property rights, as their respective company leaders openly admit.
But there is a far more sensitive parallel universe when it comes to patients’ data: our health records, which are undoubtedly far more sensitive and in need of protection.
Regulations are also getting stronger all over the world; the trend is clearly towards stricter data protection rules, including those covering AI.
There are obvious ethical reasons, which we don’t have to elaborate on, but at the enterprise level there are also regulatory and legal reasons that require pharmaceutical companies, labs and hospitals to use cutting-edge technologies to protect the privacy of patient data.
Federated analytics and federated learning are great options for analyzing data and training models on patients’ data without accessing any raw data.
In the case of federated analytics it means, for instance, that we can compute the correlation between blood glucose and patients’ BMI without accessing any raw data that could lead to patient re-identification.
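To make this concrete, here is a minimal sketch (plain Python, with hypothetical function names) of how such a correlation can be computed from per-site aggregates only: each hospital shares a handful of sums, never the patient-level rows.

```python
import math

def local_aggregates(glucose, bmi):
    """Computed inside each hospital; only these six numbers leave the site."""
    return {
        "n": len(glucose),
        "sum_x": sum(glucose),
        "sum_y": sum(bmi),
        "sum_xx": sum(x * x for x in glucose),
        "sum_yy": sum(y * y for y in bmi),
        "sum_xy": sum(x * y for x, y in zip(glucose, bmi)),
    }

def federated_pearson(site_aggregates):
    """Combine the per-site aggregates into a global Pearson correlation."""
    totals = {k: sum(site[k] for site in site_aggregates) for k in site_aggregates[0]}
    n = totals["n"]
    cov = n * totals["sum_xy"] - totals["sum_x"] * totals["sum_y"]
    var_x = n * totals["sum_xx"] - totals["sum_x"] ** 2
    var_y = n * totals["sum_yy"] - totals["sum_y"] ** 2
    return cov / math.sqrt(var_x * var_y)
```

In a real deployment the aggregates themselves can additionally be noised or guarded against small-cohort queries, but the principle stays the same.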
In the case of machine learning, let’s use the example of diagnostics, where models are trained on patients’ images to detect malignant changes in their tissues and catch early stages of cancer. This is literally a life-saving application of machine learning. Models are trained locally at the hospital level using local images and labels assigned by experienced radiologists; then an aggregation step combines all those local models into a single, more generalized model. The process repeats for tens or hundreds of rounds to improve the performance of the model.
Fig. 1. Federated learning in action: sharing model updates, not data.
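For intuition, the aggregation step can be as simple as a weighted average of the local weights (the classic FedAvg scheme). This is a simplified sketch, not the actual aggregator of any particular framework:

```python
import numpy as np

def federated_average(local_weights, local_sizes):
    """Server-side aggregation: average each layer across hospitals,
    weighting every hospital by the number of samples it trained on (FedAvg).

    local_weights: list of dicts {layer_name: np.ndarray}, one per hospital
    local_sizes:   list of local training-set sizes, in the same order
    """
    total = sum(local_sizes)
    return {
        layer: sum(w[layer] * (n / total) for w, n in zip(local_weights, local_sizes))
        for layer in local_weights[0]
    }
```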
The reward for each individual hospital is that it benefits from a better-trained model, able to detect disease in future patients with higher probability. It’s a win-win situation for everyone, especially patients.
Of course, there are many federated network topologies and model aggregation strategies, but for the sake of this article we focus on the typical example.
It is believed that vast amounts of clinical data are not being used because of a (justified) reluctance of data owners to share their data with partners.
Federated learning is a key technique to build that trust, backed by technology rather than only by contracts and faith in the ethics of particular employees and partners of the organizations forming consortia.
First of all, the data stays at the source, never leaves the hospital, and is not centralized in a single, potentially vulnerable location. The federated approach means there are no external copies of the data that might be hard to remove after the research is completed.
The technology blocks access to raw data thanks to multiple techniques that follow the defense-in-depth principle. Each of them reduces the risk of data exposure and patient re-identification by tens or hundreds of times, all to make it economically unviable to discover or reconstruct raw-level data.
Data is minimized first, so that only the necessary properties are exposed to the machine learning agents running locally; PII is stripped, and we also use anonymization techniques.
Then local nodes protect local data against the so-called “too curious data scientist” threat by allowing only the code and operations accepted by local data owners to run against their data. For instance, model training code deployed locally at the hospital as a package is approved or rejected by the local data owners. Remote data scientists cannot just send arbitrary code to remote nodes, as that would allow them, for instance, to return raw-level data. This requires a new, decentralized way of thinking, with a different mindset and different technologies for permission management, an interesting topic for another time.
Assuming all those layers of protection are in place, there is still a concern related to the security of the model weights themselves.
There is growing awareness in the AI community that machine learning models act as an extreme form of compression of their training data: they are not as black-box as previously considered, and they reveal more information about the underlying data than previously thought.
And that means that with enough skill, time, effort and powerful hardware, a motivated adversary can attempt to reconstruct the original data, or at least demonstrate with high probability that a given patient was in the group used to train the model (a Membership Inference Attack, MIA). Other kinds of attacks are possible as well, such as extraction, reconstruction and evasion.
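To illustrate the idea behind a membership inference attack, here is a toy, loss-threshold version in PyTorch. The model, tensor shapes and threshold are all assumptions for the sake of the example; real attacks (e.g. with shadow models) are considerably more sophisticated, but the intuition is the same: samples the model has seen during training tend to have suspiciously low loss.

```python
import torch
import torch.nn.functional as F

def looks_like_a_member(model, image, label, threshold=0.5):
    """Toy membership inference: guess 'member' if the per-sample loss is unusually low.
    `image` is a single input tensor and `label` a 0-dim class-index tensor."""
    model.eval()
    with torch.no_grad():
        logits = model(image.unsqueeze(0))                      # add a batch dimension
        loss = F.cross_entropy(logits, label.unsqueeze(0)).item()
    return loss < threshold  # True -> adversary guesses "was in the training set"
```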
To make things even worse, the progress in generative AI that we all admire and benefit from also delivers new, more effective techniques for image reconstruction (for instance, of a patient’s lung scan). The same ideas we all use to generate images on demand can be used by adversaries to reconstruct original images from MRI/CT scanners. Other data modalities, such as tabular data, text, sound and video, can now be reconstructed using generative AI as well.
Differential privacy (DP) algorithms promise that we trade some of the model’s accuracy for much improved resilience against inference attacks. It is another privacy-utility trade-off that is worth considering.
In practice, differential privacy means adding a carefully calibrated combination of noise and clipping which, in return, yields a good ratio of privacy gained to accuracy lost.
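At its core, the mechanism applied to a model update can be as small as the sketch below (NumPy, illustrative parameter values; in a real setup the clipping norm and noise scale are derived from the privacy budget):

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip a flattened model update to a maximum L2 norm, then add Gaussian noise
    scaled to that norm. Values here are illustrative, not recommendations."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise
```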
It can be as simple as plain (and least effective) Gaussian noise, but nowadays we can draw on much more sophisticated algorithms such as the Sparse Vector Technique (SVT), the Opacus library as a practical implementation of differentially private stochastic gradient descent (DP-SGD), plus venerable Laplace-noise-based libraries (e.g. PyDP).
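As an example of what DP-SGD looks like in code, here is a minimal, self-contained training loop wired through Opacus (assuming the Opacus 1.x `PrivacyEngine.make_private` API; the toy dataset, model and hyperparameters are ours, purely for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins, just to show how DP-SGD is wired in
data = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
train_loader = DataLoader(data, batch_size=32)
model = torch.nn.Sequential(torch.nn.Linear(20, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # scale of the Gaussian noise added to per-sample gradients
    max_grad_norm=1.0,     # per-sample gradient clipping threshold
)

criterion = torch.nn.CrossEntropyLoss()
for features, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

print("epsilon spent so far:", privacy_engine.get_epsilon(delta=1e-5))
```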
Fig. 2. On-device differential privacy that we all use every day.
And, by the way, we all benefit from this technique without even realizing it exists, and it is happening right now. Telemetry data from mobile devices (Apple iOS, Google Android) and desktop OSes (Microsoft Windows) is processed using differential privacy and federated learning algorithms to train models without sending raw data from our devices. And it has been around for years now.
Now there is growing adoption for other use cases, including our favorite cross-silo federated learning scenario: relatively few participants, each with large amounts of data, in purposely established consortia of different organizations and companies.
Differential privacy is not specific to federated learning. However, there are different strategies for applying DP in federated learning scenarios, as well as a variety of algorithms: some work better for federated setups, others for local differential privacy (LDP) or centralized data processing.
In the context of federated learning we anticipate a drop in model accuracy after applying differential privacy, but still (and to some extent hopefully) expect the model to perform better than local models without federated aggregation. In other words, the federated model should retain its advantage despite the added noise and clipping (DP).
Fig. 3. What we can expect based on published papers and our own experience.
Differential privacy can be applied as early as at the source data (Local Differential Privacy, LDP).
Fig. 4. Different places where DP can be applied to improve data protection.
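The classic example of local differential privacy is randomized response, sketched below in plain Python (illustrative probabilities): each participant’s individual answer is deniable, yet the aggregate statistic can still be recovered.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Answer truthfully with probability p_truth, otherwise answer at random."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(noisy_answers, p_truth: float = 0.75) -> float:
    """Invert the known noise to estimate the real population rate of 'True' answers."""
    observed = sum(noisy_answers) / len(noisy_answers)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```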
There are also cases of federated learning within a network of partners who all have full data access rights and are less concerned about data protection levels, so there might be no DP at all.
On the other hand, when the model is going to be shared with the outside world or sold commercially, it may be a good idea to apply DP to the global model as well.
At Roche’s Federated Open Science team, NVIDIA Flare is our tool of choice for federated learning, as the most mature open-source federated framework on the market. We also collaborate with the NVIDIA team on the future development of NVIDIA Flare and are glad to help improve an already great solution for federated learning.
We tested three different DP algorithms: the Sparse Vector Technique (SVT), a Gaussian noise filter and a percentile-based privacy filter.
We applied differential privacy to the models using different strategies:
- Every federated learning round
- Only the first round (of federated training)
- Every Nth round (of federated training)
for three different cases (datasets and algorithms):
- FLamby Tiny IXI dataset
- Breast density classification
- Higgs classification
So we explored three dimensions: algorithm, strategy and dataset (case). A minimal sketch of the round-based strategies is shown below.
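The sketch below (plain Python, a hypothetical helper, not the NVIDIA Flare filter API) shows that the three strategies differ only in when the clip-and-noise step is applied to the local update:

```python
import numpy as np

def maybe_privatize(update, current_round, strategy="every_nth", n=2,
                    clip_norm=1.0, noise_multiplier=0.5, rng=None):
    """Apply DP to a local model update depending on the chosen round strategy:
    'every_round', 'first_round' or 'every_nth'. Parameter values are illustrative."""
    rng = rng or np.random.default_rng()
    apply_dp = (
        strategy == "every_round"
        or (strategy == "first_round" and current_round == 0)
        or (strategy == "every_nth" and current_round % n == 0)
    )
    if not apply_dp:
        return update
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
```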
The results conform to our expectations: model accuracy degradation was larger with lower privacy budgets.
(Dataset source: https://owkin.github.io/FLamby/fed_ixi.html)
Fig. 5. Model performance without DP.
Fig. 6. Model performance with DP applied on the first round.
Fig. 7. SVT applied every second round (with decreasing threshold).
We observe a significant improvement in accuracy when SVT is applied only on the first round, compared to the SVT filter being applied every round.
(Dataset source: Breast Density Classification using MONAI | Kaggle)
Fig. 8. Model performance without DP.
Fig. 9. DP applied to the first round.
We observe a moderate accuracy loss after applying a Gaussian noise filter.
This dataset was the most troublesome and sensitive to DP (major accuracy loss, unpredictable behavior).
(Dataset source: HIGGS, UCI Machine Learning Repository)
Fig. 10. Model performance with percentile value 95.
Fig. 11. Percentile value 50.
We observe a minor, acceptable accuracy loss related to DP.
An important lesson learned is that differential privacy outcomes are very sensitive to the parameters of a given DP algorithm, and it is hard to tune them to avoid a total collapse of model accuracy.
We also experienced a certain tension, rooted in the impression of not really knowing how much privacy protection we had gained and at what price. We only saw the “cost” side (accuracy degradation).
We had to rely to a significant extent on the existing literature, which claims, and has demonstrated, that even small amounts of DP noise help to secure data.
As engineers, we would prefer to see some kind of automatic measure that shows how much privacy we gained for how much accuracy we lost, and perhaps even some form of AutoDP tuning. That appears to be far away from the current state of technology and knowledge.
Then we applied privacy meters to see whether there is a visible difference between models without DP and models with DP. We observed changes in the curve, but it is really hard to quantify how much we gained.
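One pragmatic proxy worth considering (a sketch under our own assumptions, not the output of any specific privacy-meter library) is to compare how well a simple membership inference attack separates training members from non-members before and after DP, for example via the attack’s AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mia_auc(member_losses, non_member_losses):
    """How well does per-sample loss separate training members from non-members?
    An AUC close to 0.5 means the attack is near random guessing; the drop in AUC
    between the non-DP model and the DP model is one way to express the privacy gain."""
    labels = np.concatenate([np.ones(len(member_losses)), np.zeros(len(non_member_losses))])
    scores = -np.concatenate([member_losses, non_member_losses])  # lower loss -> more likely a member
    return roc_auc_score(labels, scores)
```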
Some algorithms did not work at all; some required many attempts to tune properly before delivering viable results. There was no clear guidance on how to tune the various parameters for a particular dataset and ML model.
So our current opinion is that DP for FL is hard, but entirely feasible. It requires many iterations and trial-and-error loops to achieve acceptable results, while trusting the algorithms to deliver privacy improvements of orders of magnitude.
Federated learning is a great way to improve patient outcomes and treatment efficacy through better ML models while preserving patients’ data privacy.
But data protection never comes without a price, and differential privacy for federated learning is a perfect example of that trade-off.
It is great to see improvements in differential privacy algorithms for federated learning scenarios that minimize the impact on accuracy while maximizing the resilience of models against inference attacks.
As with all trade-offs, decisions have to be made that balance the usefulness of models for practical applications against the risks of data leakage and reconstruction.
And that is where our expectations for privacy meters are growing: to know more precisely what we are “selling” and what we are “buying”, and what the exchange ratio is.
The landscape is dynamic, with better tools becoming available both for those who want to better protect their data and for those who are motivated to break those rules and expose sensitive data.
We also invite other federated minds to build upon and contribute to the collective effort of advancing patient data privacy in federated learning.