Federated Learning for Healthcare: The Ideas Behind My Research

Some of the most valuable training data in the world sits on hospital servers that are never allowed to talk to each other. A large part of my PhD is about getting good models out of that situation without ever moving the data. Here is the intuition behind the problem, and the methods I’ve worked on to chip away at it.

The problem: data that can’t move

Modern medical AI is hungry for data, and a single hospital rarely has enough of it, especially for the rare conditions where help is needed most. The obvious fix is to pool everyone’s records together. The obvious fix is also unlawful, or at least tightly restricted, in most of the world.

Privacy law (HIPAA in the US, GDPR in Europe, India’s DPDP Act), institutional policy, and plain ethics all mean that patient records do not leave the building. Often two hospitals in the same city cannot share a single scan. So we are left with the worst of both worlds: an enormous amount of data in aggregate, locked in isolated silos, none of it individually sufficient.

The core idea of federated learning

Federated learning (FL) flips the usual setup. Instead of bringing the data to the model, you bring the model to the data. A coordinating server sends the current model to each hospital; each hospital trains on its own data for a little while; each sends back only its update (the change to the model’s weights), never the data; the server averages those updates, typically weighting each hospital by how much data it trained on, into a new global model; and the whole thing repeats. The canonical recipe is called FedAvg.

The model travels; the data does not. Each round, hospitals receive the global model, train locally, and send back only weight updates — which the server averages.

That is the whole trick, and it sounds almost too easy. The difficulty lives in everything that the word “average” quietly hides.
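To make the loop concrete, here is a minimal, dependency-free sketch of a FedAvg-style round. Everything specific in it is an assumption chosen for illustration: the toy one-dimensional linear model, the synthetic client data, the learning rate, and the helper names `local_train`, `fedavg_round` and `make_client`. It is a simulation in a single process, not a deployment recipe.

```python
import random

def local_train(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of gradient descent on one client's private data.

    Toy model (assumed for illustration): y = w * x + b with squared error.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]

def fedavg_round(global_weights, client_datasets):
    """One round: broadcast, train locally, average weighted by sample count.

    In a real deployment local_train runs inside each hospital and only the
    resulting weights (equivalently, their change) cross the network.
    """
    total = sum(len(data) for data in client_datasets)
    new_weights = [0.0] * len(global_weights)
    for data in client_datasets:
        local = local_train(list(global_weights), data)
        for i, value in enumerate(local):
            new_weights[i] += value * len(data) / total
    return new_weights

def make_client(n, seed):
    """Synthetic stand-in for one hospital's data: y = 3x + 1 plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.05)) for x in xs]

clients = [make_client(n, seed) for seed, n in enumerate([20, 50, 80])]
weights = [0.0, 0.0]
for _ in range(50):
    weights = fedavg_round(weights, clients)
print(weights)  # should land near the generating values w = 3, b = 1
```

The three synthetic clients here are drawn from the same distribution, which is exactly the comfortable case. The interesting failures only appear once the clients stop resembling each other.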

The hard part: every hospital is different

Averaging assumes the things being averaged are roughly comparable. In healthcare they emphatically are not. Scanners differ, patient populations differ, and even the way conditions are labelled differs from site to site. In the jargon, the data is non-IID. During local training, each hospital’s copy of the model moves toward the optimum for its own data, an effect known as client drift. Naively averaging such divergent updates can drag the global model toward a compromise that is good for nobody.

A lot of my work targets this heterogeneity head-on. In FedStein, we borrow a classic idea from statistics — the James–Stein estimator — to shrink each client’s statistics toward a shared estimate, which stabilises training across multiple domains. In UniVarFL, we regularise the uniformity and variance of client representations, so features stay well-spread and comparable across clients even when their label distributions are wildly imbalanced.
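For readers who have not met the James–Stein estimator, the sketch below shows the classical positive-part form that shrinks a set of noisy estimates toward their grand mean. It illustrates the statistical idea only and should not be read as the FedStein algorithm: the scalar per-client statistic, the assumed known noise variance `noise_var`, and the function name `james_stein_shrink` are all hypothetical choices for this example.

```python
def james_stein_shrink(client_stats, noise_var):
    """Positive-part James-Stein shrinkage toward the grand mean.

    client_stats: one noisy scalar estimate per client (hypothetical input).
    noise_var:    assumed known sampling variance of each estimate.
    """
    k = len(client_stats)
    if k < 4:
        raise ValueError("shrinking toward the grand mean needs at least 4 estimates")
    grand_mean = sum(client_stats) / k
    spread = sum((x - grand_mean) ** 2 for x in client_stats)
    if spread == 0:
        return [grand_mean] * k
    # The noisier the estimates are relative to their spread, the harder we shrink.
    factor = max(0.0, 1.0 - (k - 3) * noise_var / spread)
    return [grand_mean + factor * (x - grand_mean) for x in client_stats]

client_means = [0.8, 1.1, 1.4, 0.9, 2.0]   # made-up per-client statistics
shrunk = james_stein_shrink(client_means, noise_var=0.04)
print(shrunk)
```

Every client keeps its own value, but each one is pulled part of the way toward the shared estimate, and the pull gets stronger when the individual estimates are less trustworthy.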

One global model isn’t enough

Even if we fix the optimisation, a single global model is often the wrong goal. The clinically useful model for one hospital’s population may genuinely differ from another’s. The answer is personalization: a model that is mostly shared but adapts to each site.

A tool I keep returning to is the hypernetwork — a small network that generates the weights of the main model, conditioned on a compact description of each client. In FedNeuro, we use exactly this idea to produce site-personalized, privacy-enhanced models for multi-site fMRI analysis, where every scanner and cohort looks a little different. One shared backbone, many tailored heads.
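A toy version of the hypernetwork idea fits in a few lines. The sketch below assumes the simplest possible setup, a linear hypernetwork that maps a client embedding to the weights of a linear classification head; the class name `LinearHypernetwork`, the dimensions, the site names and the embeddings are invented for illustration and are not taken from FedNeuro.

```python
import random

class LinearHypernetwork:
    """Maps a compact client embedding to the weights of a small linear head."""

    def __init__(self, embed_dim, feat_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.feat_dim = feat_dim
        self.n_classes = n_classes
        # One row of generator weights per generated head parameter.
        self.generator = [[rng.gauss(0, 0.1) for _ in range(embed_dim)]
                          for _ in range(feat_dim * n_classes)]

    def head_for(self, client_embedding):
        """Generate an (n_classes x feat_dim) head for one client."""
        flat = [sum(g * e for g, e in zip(row, client_embedding))
                for row in self.generator]
        return [flat[c * self.feat_dim:(c + 1) * self.feat_dim]
                for c in range(self.n_classes)]

def predict_logits(head, features):
    """Apply a generated head to features from the shared backbone."""
    return [sum(w * f for w, f in zip(row, features)) for row in head]

hyper = LinearHypernetwork(embed_dim=3, feat_dim=4, n_classes=2)
site_embeddings = {"site_a": [1.0, 0.0, 0.5], "site_b": [0.0, 1.0, -0.5]}
features = [0.2, -1.0, 0.7, 0.3]   # stand-in for the shared backbone's output
logits = {site: predict_logits(hyper.head_for(emb), features)
          for site, emb in site_embeddings.items()}
print(logits)
```

The same features produce different logits at each site because the head is regenerated from that site’s embedding. In a trained system both the generator weights and the embeddings would be learned; here they are random placeholders.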

Private by design — and provably so

Federated learning keeps raw data local, but that is not the same as keeping it private. The updates themselves can leak information: given a gradient, an attacker can sometimes reconstruct an approximation of the data that produced it. So “the data stays home” is a starting point, not a guarantee.

That is why we layer on differential privacy (adding calibrated noise under a formal, accountable privacy budget) and secure aggregation (so the server only ever sees a sum of updates, never an individual one). In FedHypeVAE, we push this further: rather than sharing model updates directly, clients share differentially private embeddings produced by hypernetwork-conditioned conditional VAEs — giving the collaboration useful signal while keeping a formal privacy guarantee attached to it.

Privacy in FL isn’t a feature you bolt on at the end — it’s a constraint you design the entire training procedure around.
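The two mechanisms named above can each be sketched in a few lines. The example below clips each update to a maximum norm and adds Gaussian noise (the usual building block of differentially private training), then hides individual updates behind pairwise masks that cancel in the sum (the core trick of secure aggregation). It rests on heavy simplifications that should be read as assumptions: noise is added on the client side, no privacy budget is actually accounted for, masks are plain floats drawn from one shared generator rather than values derived from pairwise key agreement over a finite field, and the function names are hypothetical.

```python
import math
import random

def clip_update(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def add_gaussian_noise(update, max_norm, noise_multiplier, rng):
    """Add noise whose scale is calibrated to the clipping bound."""
    sigma = noise_multiplier * max_norm
    return [u + rng.gauss(0.0, sigma) for u in update]

def pairwise_masks(n_clients, dim, rng):
    """For each pair (i, j), client i adds a random mask and client j subtracts it.

    A real protocol derives each mask from a secret shared only by that pair;
    a single generator is used here purely to keep the sketch short.
    """
    masks = [[0.0] * dim for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.uniform(-100.0, 100.0) for _ in range(dim)]
            for d in range(dim):
                masks[i][d] += shared[d]
                masks[j][d] -= shared[d]
    return masks

rng = random.Random(0)
raw_updates = [[3.0, 4.0], [0.3, -0.4], [-6.0, 8.0]]   # made-up client updates
clipped = [clip_update(u, max_norm=1.0) for u in raw_updates]
noisy = [add_gaussian_noise(u, 1.0, 0.5, rng) for u in clipped]
masks = pairwise_masks(len(noisy), dim=2, rng=rng)
masked = [[u + m for u, m in zip(upd, mask)] for upd, mask in zip(noisy, masks)]

server_sum = [sum(col) for col in zip(*masked)]   # all the server gets to see
true_sum = [sum(col) for col in zip(*noisy)]
print(server_sum, true_sum)   # equal up to floating-point error
```

Each masked update looks like noise on its own, yet the masks cancel when the server adds them up, so the aggregate is unchanged.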

Generalizing to hospitals you’ve never seen

A model can ace every hospital it trained on and then fall apart at a new clinic with an unfamiliar scanner. This is the problem of domain generalization, and federation makes it both harder (you never get to see all the domains together) and more important (the entire point is to deploy widely).

In FedAlign (CVPR’25), we align feature representations across clients so the model latches onto what is common to the disease rather than what is idiosyncratic to a machine. In FedVR, a variance-regularized hypernetwork discourages the model from over-fitting to any single domain. The common thread is to penalise the model whenever it leans on a site-specific shortcut.
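One very simple way to penalise site-specific features is to match first moments: add a term that grows when a client’s average feature vector strays from a shared reference. The sketch below shows only that generic idea. It is not the FedAlign or FedVR objective, and the function names, the weight `lam`, the toy feature batches, and the assumption that clients may exchange feature means are all hypothetical.

```python
def feature_mean(batch):
    """Average a batch of feature vectors, dimension by dimension."""
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]

def alignment_penalty(local_batch, reference_mean):
    """Squared distance between a client's mean feature and a shared reference."""
    mu = feature_mean(local_batch)
    return sum((m - r) ** 2 for m, r in zip(mu, reference_mean))

def total_loss(task_loss, local_batch, reference_mean, lam=0.1):
    """Task loss plus a weighted alignment term (lam is a made-up default)."""
    return task_loss + lam * alignment_penalty(local_batch, reference_mean)

site_a = [[0.9, 0.1], [1.1, -0.1]]   # toy features, mean [1.0, 0.0]
site_b = [[2.9, 0.2], [3.1, -0.2]]   # same structure, shifted by the "scanner"
reference = feature_mean([feature_mean(site_a), feature_mean(site_b)])

penalty_a = alignment_penalty(site_a, reference)
penalty_b = alignment_penalty(site_b, reference)
print(penalty_a, penalty_b)
```

Minimising such a term pushes the encoder to remove the site-dependent offset, which is the kind of shortcut the paragraph above describes.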

The long tail: rare cases matter most

Medicine is long-tailed. The rare diagnosis is frequently the one you most need the model to catch, and it is exactly the one with the fewest examples — scattered thinly across clients. Standard training quietly optimises for the common case and lets the rest blur into the background.

In FEDTAIL (ICML’25, oral), we combine long-tailed learning with domain generalization, using sharpness-guided gradient matching so that rare classes are not steamrolled by frequent ones during aggregation. In earlier work, Taming the Tail (BMVC’24), we used an asymmetric loss and a Padé approximation to handle extreme class imbalance in medical images. Different tools, same commitment: the tail is not noise.
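To give a feel for what an asymmetric loss does, here is a sketch of one published form, in which negatives are down-weighted with a focusing exponent and a probability margin while positives keep an ordinary log loss. Whether this matches the exact formulation in Taming the Tail is an assumption, the Padé approximation is not shown, and the default hyperparameters are illustrative rather than recommended.

```python
import math

def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Per-label asymmetric loss for predicted probability p and label y in {0, 1}.

    Positives: focal-style weight (1 - p) ** gamma_pos on the log loss.
    Negatives: probability shifted down by `margin`, then weighted by
               p_shifted ** gamma_neg so easy negatives contribute almost nothing.
    """
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    p_shifted = max(p - margin, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))

missed_rare_case = asymmetric_loss(0.1, 1)    # a positive the model nearly missed
easy_negative = asymmetric_loss(0.1, 0)       # a negative it already handles
ignored_negative = asymmetric_loss(0.04, 0)   # below the margin: no gradient at all
print(missed_rare_case, easy_negative, ignored_negative)
```

With these settings a missed positive costs several orders of magnitude more than an easy negative, so the abundant negatives of a long-tailed dataset cannot drown out the few positive examples.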

Making it actually run

None of this matters if training is too slow or too heavy to run across a real network. Communication is typically the bottleneck in FL: every round ships model updates over the wire, often to many clients on modest hardware. Two directions I’ve worked on: reaching a good model in fewer rounds, and making each message smaller.

FLeNS (IEEE BigData’24, oral) brings a Nesterov-accelerated, sketched Newton method to FL — much of the benefit of second-order optimisation without paying its full cost — so we converge in fewer communication rounds. On the message-size side, Sequential Compression Layers (ICLR’25 workshop) shrink what foundation-model clients need to exchange during federated fine-tuning.

Where this is going

The frontier right now is federated foundation models: adapting large vision-language and language models across institutions, parameter-efficiently, without ever centralising the data. My recent work pushes in this direction — BiPrompt for debiasing vision-language models, and federated cross-modal prompt generation — alongside multi-agent federated systems, where decentralized agents collaborate rather than a single server orchestrating passive clients.

The long-term goal I care about is a specific combination: foundation-model capability, hospital-grade privacy, fairness across populations, and reliability on the rare cases that matter. None of those is optional in a clinic.

Closing thoughts

Federated learning is sometimes pitched as a privacy gimmick. I think it is something more interesting: a forcing function. Because you cannot look at the data, you are pushed to build models that are robust to heterogeneity, honest about uncertainty, fair across populations, and frugal with communication — which, not coincidentally, are the properties you would want from a clinical model in the first place.

If you’re working on any of this — or you think I’ve got something wrong — I’d genuinely love to hear from you: sunnygupta@iitb.ac.in.

Sunny Gupta

PhD Scholar, IIT Bombay · Federated & trustworthy AI for healthcare