Most papers on federated and foundation-model training end where the real work begins: an algorithm that is correct on a whiteboard still has to survive contact with real GPUs, real networks, and jobs that run for days. I spend a lot of my time on that side of the line — as a researcher and as the person who keeps a lab’s clusters alive. This is a tour of the systems layer underneath distributed model training, and why it decides what is actually possible.

Why the systems layer matters

There is a comfortable fiction in machine learning that the model is the hard part and everything else is plumbing. In practice the plumbing sets the ceiling. How big a model you can train, how many clients you can include, how often you can afford to communicate, how gracefully you recover when a node dies at 3 a.m. — these are systems questions, and they bound the science.

I see both sides of this daily. Alongside my PhD I serve as System Administrator for the MedaL Lab, managing NVIDIA DGX systems, Linux servers, and GPU clusters that other people’s research runs on. That dual role — designing federated algorithms and then being the one paged when a training run falls over — is what this piece is really about.

GPU clusters and the hardware reality

A modern training cluster is a hierarchy of bottlenecks. Inside a node, GPUs talk to each other over fast interconnects (NVLink, NVSwitch); across nodes they fall back to slower fabric. Memory is the first wall you hit — a foundation model’s weights, gradients, and optimizer state rarely fit on one device, so the model has to be split. Data parallelism, tensor parallelism, pipeline parallelism, and sharded optimizers each trade compute, memory, and communication differently, and choosing badly can leave expensive hardware idle.

The unglamorous truth is that utilization is the metric that matters. A cluster that is “in use” but sitting at thirty percent GPU utilization is burning money and time. Most of the systems work below is, ultimately, a fight to keep those GPUs busy with useful work instead of waiting.

Communication is the bottleneck

In ordinary distributed training, gradients are synchronized every step; in federated training, updates cross a network that may be slow, metered, or unreliable. Either way, communication — not computation — is usually what you are actually paying for. Every round ships parameters over the wire, and for a large model that is a lot of bytes multiplied by a lot of participants.

There are two honest ways to win: send fewer rounds, or send smaller messages. On the first, second-order ideas like the Nesterov-accelerated sketched Newton method in FLeNS reach a good model in fewer communication rounds. On the second, Sequential Compression Layers shrink what foundation-model clients need to exchange during federated fine-tuning. Parameter-efficient adaptation helps enormously here too: if you only ever communicate small adapters instead of the full backbone, the bandwidth problem becomes tractable.

Almost every practical advance in distributed learning is, underneath, a way to move less data or move it less often.

Aggregating updates at scale

Aggregation sounds trivial — average the updates — and is anything but. At scale you have to decide how to weight participants, how to handle stragglers who report late or not at all, and how to combine updates from clients whose data is wildly non-identical without letting a few dominate. Do it synchronously and the slowest client sets the pace; do it asynchronously and you inherit staleness and consistency headaches.

This is where the algorithm and the system stop being separable. The aggregation strategy I choose for statistical reasons — say, protecting rare classes the way FEDTAIL does, or stabilising heterogeneous clients the way FedStein does — directly shapes the communication pattern and the failure modes the system has to tolerate.

Checkpointing and fault tolerance

Train anything for long enough on enough hardware and something will fail: a GPU throws an error, a node reboots, the network blips. A multi-day run with no recovery story is a multi-day run you will eventually lose. Checkpointing — periodically persisting model and optimizer state so you can resume — is what turns a fragile job into a durable one.

And it is its own engineering problem. Checkpoints for large models are huge; writing them too often stalls training on I/O, too rarely and a crash costs hours of progress. Sharded and asynchronous checkpointing, fast parallel filesystems, and storage that can absorb the bursts all matter. Half of keeping a cluster productive is making sure that when — not if — something breaks, the cost is minutes, not days.

Orchestration and scheduling

A shared cluster is a contended resource. Many users, many jobs, finite GPUs. The scheduler decides who runs when, how resources are isolated, and how a large multi-node job gets placed without fragmenting the cluster. Get this wrong and you get the worst outcome in research computing: idle GPUs next to a queue of people waiting to use them.

I came at orchestration from an unusual angle during my internship at Samsung, working on an AI operating system for managing LLM-based agents — resource isolation, scheduling, memory management, and tool-execution separation through a central kernel. The framing stuck with me: training infrastructure and agent infrastructure are both, fundamentally, about governing who gets which resources, when, and under what limits.

Profiling: where the time goes

You cannot fix what you cannot see. Before optimizing anything, you have to know whether a run is bound by compute, memory bandwidth, communication, or I/O — and intuition is wrong often enough that guessing is expensive. Profiling tools, GPU utilization traces, and timeline views turn a vague “it’s slow” into a specific, fixable bottleneck.

A practical loop: profile the run, find the dominant cost, attack just that one, then profile again. The bottleneck always moves — the discipline is to keep chasing it rather than optimizing whatever is most fun to optimize.

This is also where monitoring earns its keep. Cluster-wide metrics — utilization, memory pressure, temperatures, job health — are what let you catch a degraded run before it wastes a day, and what tell you whether the cluster is actually being used well or just being busy.

Toward a federated runtime

All of these pieces — placement, communication, aggregation, checkpointing, recovery, monitoring — want to live together in a coherent runtime rather than be reinvented per project. That conviction is what I’m building toward with ErdosFC, a research-oriented federated computing runtime for training models across distributed data owners without centralizing raw data, with privacy-preserving learning, heterogeneous clients, and communication-efficient aggregation as first-class concerns rather than afterthoughts.

The companion direction, ErdosFAI, carries the same systems mindset into agents — building, deploying, and governing them across decentralized environments with the observability and control that anything running in production needs.

Closing thoughts

It is tempting to treat infrastructure as someone else’s job — the layer you assume away so you can get to the model. I’ve found the opposite to be true: the systems layer is where federated foundation models are won or lost. The cleverest aggregation scheme is useless if it cannot survive a flaky network, and the most capable model is irrelevant if you cannot keep the GPUs fed long enough to train it.

Bridging the algorithm and the machine it runs on is the part of this work I find most satisfying. If you work on distributed training systems — or you’re fighting one of these bottlenecks right now — I’d be glad to compare notes: sunnygupta@iitb.ac.in.

Portrait of Sunny Gupta

Sunny Gupta

PhD Scholar & System Administrator, IIT Bombay · Distributed training & AI infrastructure