Deep neural networks have established state-of-the-art results in many domains. However, deep learning models are data-intensive, i.e., they often require millions of training examples to learn effectively. Medical images may contain confidential and sensitive information about patients that often cannot be shared outside the institutions of their origin, especially when complete de-identification cannot be guaranteed. The European General Data Protection Regulation (GDPR) and the United States Health Insurance Portability and Accountability Act (HIPAA) enforce guidelines and regulations for storing and exchanging personally identifiable data and health data. Ethical guidelines also encourage respecting privacy, that is, the ability to retain complete control and secrecy about one’s personal information^{1}. As a result, large archives of medical data from various consortia remain widely untapped sources of information. For instance, histopathology images cannot be collected and shared in large quantities due to the aforementioned regulations, as well as due to data size constraints given their high resolution and gigapixel nature. Without sufficient and diverse datasets, deep models trained on histopathology images from one hospital may fail to generalize well to data from a different hospital (out-of-distribution)^{2}. The existence of bias or the lack of diversity in images from a single institution brings about the need for a collaborative approach which does not require data centralization. One way to overcome this problem is collaborative data sharing (CDS) or federated learning among different hospitals^{3}.

In this paper, we explore federated learning (FL) as a collaborative learning paradigm in which models can be trained across several institutions without explicitly sharing patient data. We study the impact of data distribution on the performance of FL, i.e., when hospitals have more or less data, and IID or non-IID data. We also show that federated learning with additional privacy-preservation techniques can improve the performance of histopathology image analysis compared to training without collaboration, and we quantitatively measure the privacy cost using the Rényi Differential Privacy Accountant^{4}. We discuss its benefits, drawbacks, potential weaknesses, as well as technical implementation considerations. Finally, we use lung cancer images from The Cancer Genome Atlas (TCGA) dataset^{5} to construct a simulated environment of several institutions to validate our approach.

### Federated learning (FL)

Federated learning algorithms learn from decentralized data distributed across various client devices, in contrast to conventional learning algorithms. In most examples of FL, there is a *centralized server* which facilitates training a shared model and addresses critical issues such as data privacy, security, access rights, and heterogeneity^{6}. In FL, every client locally trains a copy of the centralized model, represented by the model weights *ω*, and reports its updates back to the server for aggregation across clients, without disclosing local private data. Mathematically, FL can be formulated as:

$$\min_{\omega \in \mathbb{R}} f\left( \omega \right) \quad \text{with} \quad f\left( \omega \right) = \frac{1}{n}\sum\limits_{i = 1}^{n} f_{i}\left( \omega \right),$$

(1)

where *f*(*ω*) represents the total loss function over *n* clients, and *f*_{i}(*ω*) represents the loss function with respect to client *i*’s local data. The objective is to find weights *ω* that minimize the overall loss. McMahan et al.^{6} introduced federated averaging, or *FedAvg* (Algorithm 1), in which each client receives the current model *ω*_{t} from the server and computes ∇*f*_{i}(*ω*_{t}), the average gradient of the loss function over its local data. Each client then updates its model weights using stochastic gradient descent (SGD) as \(\omega_{t+1}^{i} \leftarrow \omega_{t} - \eta \nabla f_{i}(\omega_{t})\), where *η* is the learning rate. Next, the central server receives the updated weights \(\omega_{t+1}^{i}\) from all participating clients and averages them to update the central model, \(\omega_{t+1} \leftarrow \sum_{i=1}^{n} \frac{n_{i}}{n} \omega_{t+1}^{i}\), where *n*_{i} is the number of data points used by client *i*. To reduce communication costs, several local steps of SGD can be taken before communication and aggregation; however, this affects the convergence properties of FedAvg^{7}.
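The FedAvg update rule above can be sketched end-to-end. The following is a minimal illustration, not the paper's implementation: it assumes a linear least-squares model, synthetic client data, and illustrative hyperparameters (learning rate, number of local steps and rounds).

```python
# Minimal FedAvg sketch: clients run local SGD on private data, and the
# server averages returned weights proportionally to local dataset size n_i.
# Model, data, and hyperparameters are illustrative placeholders.
import numpy as np

def local_update(w, X, y, lr=0.1, local_steps=5):
    """Run several SGD steps on one client's private data."""
    w = w.copy()
    for _ in range(local_steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

def fedavg_round(w_global, clients):
    """One communication round: each client trains locally, then the server
    computes the weighted average sum_i (n_i / n) * w_i."""
    n_total = sum(len(y) for _, y in clients)
    updates = [local_update(w_global, X, y) for X, y in clients]
    return sum(len(y) / n_total * w_i
               for (_, y), w_i in zip(clients, updates))

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for size in (40, 60, 100):                     # unequal n_i across clients
    X = rng.normal(size=(size, 2))
    clients.append((X, X @ w_true + 0.01 * rng.normal(size=size)))

w = np.zeros(2)
for _ in range(50):                            # 50 communication rounds
    w = fedavg_round(w, clients)
```

With IID client data as above, the averaged model converges close to the weights that generated the data; non-IID partitions and more local steps degrade this convergence, as noted in the text.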

Other methods for FL have also been proposed. Yurochkin et al.^{8} proposed a Bayesian framework for FL. Claici et al.^{9} used KL divergence to fuse different models. Much work has also been done to improve the robustness of FL algorithms. Pillutla et al.^{10} proposed a robust and secure aggregation oracle based on the geometric median using a constant number of calls to a regular non-robust secure average oracle. Andrychowicz et al.^{11} proposed a meta-learning approach to coordinate the learning process in client/server distributed systems by using a recurrent neural network in the central server to learn how to optimally aggregate the gradients from the client models. Li et al.^{12} proposed a new framework for robust FL where the central server learns to detect and remove malicious updates using a spectral anomaly detection model, leading to targeted defense. Most of these algorithms cannot be directly compared or benchmarked, as they address different problems in FL, such as heterogeneity, privacy, and adversarial robustness. FedAvg is most commonly used because of its scalability to large datasets and its performance, which is comparable to that of other FL algorithms.

### Federated learning in histopathology

FL is especially important for histopathology departments, as it facilitates collaboration among institutions without sharing private patient data. One prominent challenge when applying FL to medical images, and specifically histopathology, is the problem of *domain adaptation*. Since most hospitals have diverse imaging methods and devices, images from a group of hospitals will be markedly different, and machine learning methods risk overfitting to non-semantic differences between them. Models trained using FL can suffer from serious performance drops when applied to images from previously unseen hospitals. Several recent works have explored applications of FL in histopathology and grapple with this problem. Lu et al.^{13} demonstrated the feasibility and effectiveness of FL for large-scale computational pathology studies. FedDG, proposed by Liu et al.^{14}, is a privacy-preserving solution to learn a generalizable FL model through an effective continuous frequency space interpolation mechanism across clients. Sharing frequency domain information enables the separation of semantic information from noise in the original images. Li et al.^{15} tackle the problem of domain adaptation with a physics-driven generative approach to disentangle the information about model and geometry from the imaging sensor^{6}.

### Differential privacy

While FL attempts to provide privacy by keeping private data on client devices, it does not provide a meaningful privacy guarantee. Updated model parameters are still sent from the clients to a centralized server, and these can contain private information^{16}, such that even individual data points can be reconstructed^{17}. *Differential privacy* (DP) is a formal framework for quantifying the privacy that a protocol provides^{18}. The core idea of DP is that privacy should be viewed as a resource, something that is used up as information is extracted from a dataset. The goal of private data analysis is to extract as much useful information as possible while consuming the least privacy. To formalize this concept, consider a *database D*, which is simply a set of datapoints, and a probabilistic function *M* acting on databases, called a *mechanism*. The mechanism is said to be (*ε, δ*)-*differentially private* if for all subsets of possible outputs \(S \subseteq \text{Range}(M)\), and for all pairs of databases *D* and *D′* that differ by one element,

$$\Pr[M(D) \in S] \le \exp(\varepsilon)\Pr[M(D^{\prime}) \in S] + \delta.$$

(2)

When both *ε* and *δ* are small positive numbers, Eq. (2) implies that the outcomes of *M* will be almost unchanged in distribution if one datapoint is changed in the database. In other words, adding one patient’s data to a differentially private study will not affect the outcomes, with high probability.
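To make the guarantee of Eq. (2) concrete, a standard way to satisfy it is the Gaussian mechanism, which releases a bounded query with noise calibrated to the query's sensitivity. The sketch below is illustrative, not from the paper: the dataset, the query (a clipped mean), and the privacy parameters are placeholders, and the classical calibration used is valid for *ε* < 1.

```python
# Sketch of the Gaussian mechanism: releasing the mean of a bounded
# attribute with (eps, delta)-DP. Data and parameters are illustrative.
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Classical calibration: sigma >= sensitivity * sqrt(2*ln(1.25/delta)) / eps
    gives an (eps, delta)-DP Gaussian mechanism for eps < 1."""
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps

def private_mean(values, lo, hi, eps, delta, rng):
    """Release the mean of values clipped to [lo, hi]. Changing one record
    moves the mean by at most (hi - lo) / len(values): the sensitivity."""
    sensitivity = (hi - lo) / len(values)
    sigma = gaussian_sigma(sensitivity, eps, delta)
    return np.mean(np.clip(values, lo, hi)) + rng.normal(0, sigma)

rng = np.random.default_rng(1)
ages = rng.uniform(20, 90, size=1000)          # hypothetical patient ages
release = private_mean(ages, lo=20, hi=90, eps=0.5, delta=1e-5, rng=rng)
```

Because the noise scale shrinks with the dataset size while the guarantee stays fixed, larger cohorts can be analyzed more accurately at the same privacy cost.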

The advantage of DP is that it is quantitative. It yields a numerical guarantee on the amount of privacy that can be expected, in the stochastic sense, where lower *ε* and *δ* implies that the mechanism preserves more privacy. The framework also satisfies several useful properties. When multiple DP-mechanisms are composed, the total operation is also a DP-mechanism with well defined *ε* and *δ*^{19}. Also, once the results of a DP-mechanism are known, no amount of post-processing can change the (*ε, δ*) guarantee^{20}. Hence, while FL alone does not guarantee privacy, we can apply FL in conjunction with DP to give rigorous bounds on the amount of privacy afforded to clients and patients who participate in the collaboration.

The simplest way to create a DP-mechanism is by adding Gaussian noise to the outcomes of a deterministic function with bounded sensitivity^{21}. This method can be used in the context of training a machine learning model by clipping the norm of gradients to bound them, then adding noise, a process called *differentially private stochastic gradient descent* (DP-SGD)^{22}. McMahan et al.^{23} applied this at scale to FL.
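The clip-then-add-noise step of DP-SGD can be sketched as follows. This is a minimal illustration under assumed placeholders (a squared-error model, a fixed clipping norm and noise multiplier); the accounting of the total (*ε, δ*) spent over many steps, e.g. via the Rényi accountant, is omitted.

```python
# Minimal DP-SGD sketch: per-example gradients are clipped to L2 norm C,
# summed, and Gaussian noise N(0, (noise_mult * C)^2) is added before
# averaging. Model, data, and constants are illustrative placeholders.
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1, rng=None):
    """One DP-SGD step on a batch of per-example squared-error gradients."""
    if rng is None:
        rng = np.random.default_rng()
    summed = np.zeros_like(w)
    for xi, yi in zip(X, y):                       # per-example gradients
        g = 2 * xi * (xi @ w - yi)
        norm = np.linalg.norm(g)
        summed += g * min(1.0, clip / (norm + 1e-12))  # clip: bounds sensitivity
    noisy = summed + rng.normal(0, noise_mult * clip, size=w.shape)
    return w - lr * noisy / len(y)

rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0])
X = rng.normal(size=(256, 2))
y = X @ w_true
w = np.zeros(2)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
```

Clipping bounds each example's influence on the update (its sensitivity), which is what makes the added Gaussian noise yield a DP guarantee; the averaged noise shrinks with batch size, so accuracy degrades gracefully.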

### Differential privacy for medical imaging

Past works have noted the potential solution DP provides for machine learning in the healthcare domain. Kaissis et al.^{1} surveyed privacy-preservation techniques to be used in conjunction with machine learning, which were then implemented for classifying chest X-rays and segmenting CT scans^{24,25}. In histopathology, Lu et al.^{13} reported DP guarantees for a neural network classifier trained with FL, following Li et al.^{26}. Their treatment involved adding Gaussian noise to trained model weights; however, neural network weights do not have bounded sensitivity, making their DP guarantee vacuous. A meaningful guarantee would require clipping the model weights before adding noise. We propose the more standard approach of DP-SGD, which clips gradient updates and adds noise, for use in histopathology.

### Multiple instance learning (MIL)

MIL is a type of supervised learning which operates on sets of instances known as bags. Instead of individual instances having an associated label, only the bag as a whole has one^{27}. MIL is thus a natural candidate for learning to classify whole-slide images (WSIs), which must be broken into smaller representations due to size limitations. Permutation-invariant operators for MIL were introduced by Tomczak et al.^{28} and successfully applied to digital pathology images. Ilse et al.^{29} used MIL for digital pathology and introduced a different variety of MIL pooling functions, while Sudarshan et al.^{30} used MIL for histopathological breast cancer image classification. Graph neural networks (GNNs) have been used for MIL applications because of their permutation-invariant characteristics. Tu et al.^{31} showed that GNNs can be used for MIL, where each instance acts as a node in a graph. Adnan et al.^{32} demonstrated an application of graph convolutional neural networks to MIL in digital pathology and achieved state-of-the-art accuracy on a lung cancer subtype classification task.
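The permutation invariance that makes these pooling operators suitable for bags can be sketched in a few lines. The attention-weighted pooling below is in the spirit of the gated-attention family of MIL operators; the weights and dimensions are random illustrative placeholders rather than learned parameters.

```python
# Sketch of permutation-invariant MIL pooling: a bag of instance embeddings
# is aggregated with an attention-weighted sum, so the bag representation
# does not depend on instance order. All parameters are illustrative.
import numpy as np

def attention_pool(H, V, w_att):
    """H: (num_instances, d) bag of instance embeddings.
    Scores a_i = softmax(w_att . tanh(V h_i)) weight each instance;
    the pooled bag vector is the weighted sum of embeddings."""
    scores = np.tanh(H @ V.T) @ w_att          # (num_instances,)
    a = np.exp(scores - scores.max())          # stable softmax
    a /= a.sum()
    return a @ H                               # (d,)

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))                    # a bag of 8 instance embeddings
V = rng.normal(size=(16, 4))                   # illustrative attention weights
w_att = rng.normal(size=16)

z = attention_pool(H, V, w_att)
z_shuffled = attention_pool(H[rng.permutation(8)], V, w_att)  # same output
```

Since each instance's attention score depends only on that instance, shuffling the bag leaves the pooled vector unchanged, which is exactly the property needed when WSI patches have no meaningful order.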