Understanding Modern Distributed ML: From HPC Schedulers to LLM Serving Engines

By Lang Wang

Machine learning engineers today juggle a diverse set of tools and concepts for training and deploying models at scale. This article provides a deep dive into several key topics: running jobs with Deep Learning Containers (DLC), using HPC batch schedulers like Volcano (VOLC) and SLURM, efficient LLM serving with vLLM and SGLang, typical training frameworks and parameters, operational modes (training vs. inference), parallelism strategies (DPTPPP – data, tensor, and pipeline parallelism), the role of routers and controllers in distributed systems, and data loading strategies for high throughput. We’ll explain each concept, provide examples (with code and config samples), and offer practical insights in a technical, precise manner. Let’s dive in.

Deep Learning Containers (DLC) and Running ML Jobs

Deep Learning Containers (DLC) refer to pre-built Docker container images that come with popular deep learning frameworks and dependencies optimized and ready to run. For example, AWS provides DLCs for TensorFlow, PyTorch, MXNet, etc., which include optimized builds (often with GPU support, libraries like CUDA/cuDNN installed, and even network optimizations like EFA for multi-node training). These containers ensure a consistent environment so researchers don’t need to manually install frameworks on each machine. According to AWS, DLCs are available as Docker images on Amazon ECR (Elastic Container Registry), and each image is tailored for a specific framework version and task (training or inference) (Build high-performance ML models using PyTorch 2.0 on AWS – Part 1 | AWS Machine Learning Blog). This means you can pick a container that matches your desired framework (say PyTorch 2.0 with CUDA 11) and be confident it has all the right libraries.

How it operates: In practice, using a DLC involves pulling the container image and running your training or inference inside it. This can be done on a cloud VM or on-premise server with Docker installed. For instance, after launching an EC2 GPU instance, one might do:

# Step 1: Authenticate Docker to the ECR registry hosting the DLC images, then pull the image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <aws_account>.dkr.ecr.us-west-2.amazonaws.com
docker pull <aws_account>.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2

# Step 2: Run the container with a training script
docker run --gpus all -v /data:/data -it <aws_account>.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-ec2 \
    python /data/train.py --epochs 5 --batch-size 32

In the above example, we pulled an AWS PyTorch 2.0 training DLC and then executed a training script (train.py) inside it. The --gpus all flag gives the container access to NVIDIA GPUs on the host, and -v /data:/data mounts a host data directory into the container. This approach ensures the environment inside the container has the correct version of PyTorch, CUDA, etc., which simplifies running the job. It’s also portable – the same container can run on any machine with Docker, so experiments are reproducible.

When to use DLCs: DLCs are especially handy on cloud platforms and managed services. For example, Amazon SageMaker training jobs under the hood use DLC images so that you can just specify the image and your training code, and the platform handles the rest. Even on HPC clusters, teams sometimes use Singularity or Docker to run jobs in a containerized environment for consistency. In summary, DLCs sidestep the “it works on my machine” problem by providing a consistent runtime for deep learning tasks. They come with tested, optimized libraries (which can yield performance boosts – e.g., AWS reported up to 42% speedup using their optimized PyTorch 2.0 DLC on certain instances (Build high-performance ML models using PyTorch 2.0 on AWS – Part 1 | AWS Machine Learning Blog)).

Volcano (VOLC) – Kubernetes Batch Scheduling for AI Jobs

Volcano (VOLC) is a batch scheduling system built on Kubernetes, designed to run high-performance computing (HPC) and AI workloads in a cloud-native environment. While Kubernetes’ default scheduler is great for microservices, it lacks some capabilities needed for deep learning jobs (like gang scheduling, queue management, and priority scheduling). Volcano addresses these by providing a custom scheduler and job management CRDs (Custom Resource Definitions) on top of Kubernetes (Volcano: Collision between containers and batch computing | CNCF). In essence, Volcano enables Kubernetes to behave more like an HPC cluster scheduler for batch jobs.

What it is: Volcano was introduced to bridge containers and batch computing. It supports frameworks like TensorFlow, PyTorch, Spark, and MPI by allowing users to submit jobs that require multiple resources (e.g., a job that needs 8 GPUs across 2 nodes) and ensuring those resources are allocated together before the job starts (Volcano: Collision between containers and batch computing | CNCF). This “gang scheduling” ensures that distributed training jobs (which may spawn many pods) don’t start until all required pods can start, preventing a situation where half the job is running and the other half is waiting (which would waste resources). Volcano also provides fairness policies, priority queues, and the ability to co-schedule mixed workloads.

How it operates: Volcano integrates with Kubernetes as a scheduling plugin. Users typically define a Volcano Job YAML, which specifies tasks and their replica counts, resource needs, etc. For example, a YAML might declare a job with 4 replicas, each needing 1 GPU, and a minAvailable: 4 (meaning schedule this job only when 4 pods can be placed). When submitted, Volcano’s scheduler will find space on the cluster for all 4 pods and launch them simultaneously. If only 3 GPUs are free, it will wait until a 4th is free (rather than starting 3 now and 1 later). This is crucial for distributed training frameworks like Horovod or PyTorch DDP, which expect all ranks to be up at once for synchronization.
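
For illustration, a minimal Volcano Job manifest along those lines might look like the sketch below. The field names follow the batch.volcano.sh/v1alpha1 CRD; the image, queue, and training script path are placeholders.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-train-job
spec:
  minAvailable: 4              # gang scheduling: start only when all 4 pods can be placed
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 4              # 4 worker pods, each requesting 1 GPU
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: my-registry/pytorch-training:latest   # placeholder image
              command: ["python", "/workspace/train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: Never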

Volcano’s architecture includes a central controller and scheduling plugins. It considers scheduling algorithms like fair-sharing, priority, etc., through a plugin mechanism. For example, it can enforce queueing policies (so certain jobs don’t starve others) and topology-aware scheduling (spreading jobs across nodes or racks for performance). From the user perspective, using Volcano feels like using Kubernetes but with a different API for jobs and the assurance that your ML job will be scheduled holistically. In short, VOLC turns Kubernetes into an HPC-aware scheduler (Volcano: Collision between containers and batch computing | CNCF), unifying the convenience of containers with the power of batch job orchestration.

Example use case: Suppose you have a Kubernetes cluster with GPU nodes and you want to run an MPI-based distributed training job. With Volcano, you can submit an MPI Job (Volcano integrates with MPI Operator) requesting (say) 2 pods each with 4 GPUs. Volcano will ensure both pods start together on two nodes and have 4 GPUs each. It will also handle if a pod fails, rescheduling the entire job if needed, to maintain gang semantics. This way, your MPI mpirun command inside the pods can reliably launch across both pods. Without Volcano, the default scheduler might start one pod and then delay the second until resources free up, which would cause the first pod’s MPI process to hang or timeout.

SLURM: Classic HPC Job Scheduling (Deep Dive)

SLURM (Simple Linux Utility for Resource Management) is a widely used open-source job scheduler for HPC clusters. It functions as the “operating system” of a cluster, allocating resources (CPU cores, GPUs, memory, nodes) to jobs, queuing jobs until resources are available, and starting and monitoring those jobs. Slurm is highly scalable and is used on many of the world’s top supercomputers. At its core, it is a cluster management and job scheduling system focused on efficiently matching jobs to available resources (Choosing the Right Orchestration Tool for ML Workloads: Slurm vs. Kubernetes | Nscale).

How SLURM works: A Slurm cluster consists of a central controller (slurmctld) that manages the queue and scheduling, and agent daemons (slurmd) running on each compute node to launch and oversee tasks. Users interact with Slurm via commands like sbatch (to submit a batch job script), salloc (to request an interactive allocation), or srun (to launch parallel tasks). Slurm maintains a list of partitions (think of them as named queues or groups of nodes), each with certain limits or hardware characteristics, and it schedules jobs onto nodes in those partitions according to configured policies (priority, fairness, backfill scheduling, etc.).

Job submission example: Below is an example Slurm batch job script (train.sbatch), which requests resources and runs a training program:

#!/bin/bash
#SBATCH --job-name=train_model       # Job name
#SBATCH --nodes=1                    # Run on a single node
#SBATCH --ntasks=4                   # Total tasks (processes) = 4
#SBATCH --gres=gpu:4                 # Request 4 GPUs (on one node)
#SBATCH --cpus-per-task=4            # 4 CPU cores per task (16 cores total)
#SBATCH --mem=64G                    # 64 GB of memory for the job
#SBATCH --time=02:00:00              # Time limit hh:mm:ss
#SBATCH --partition=ml_gpu           # Partition (queue) name

module load anaconda/2023a           # Load any necessary modules (e.g., Anaconda)
source activate myenv               # Activate virtual env if needed
echo "Running on $SLURM_NNODES node(s) with $SLURM_NTASKS tasks..."
srun python train.py --epochs 10 --batch-size 128

In this script, the #SBATCH lines are directives to Slurm. We request 1 node with 4 GPUs and set the job to run python train.py via srun. When we run sbatch train.sbatch, Slurm will queue the job. Once a node in partition ml_gpu with 4 free GPUs and 16 free CPU cores is available, Slurm will allocate that node to the job, start the job, and srun will launch 4 tasks (since --ntasks=4). If this were a distributed training scenario using MPI or PyTorch distributed, those 4 tasks could correspond to 4 workers (each assigned one GPU). Slurm takes care of launching them with the proper environment variables for MPI or for torch.distributed (if configured to do so).

Using Slurm effectively: Slurm provides many features such as job arrays (to submit many similar jobs easily), dependency chains (start job B after job A finishes), and resource profiling. For ML, a common pattern is to request --gres=gpu:N to get N GPUs, and use srun or MPI to spawn N processes. Slurm ensures all those processes run on the allocated nodes and can communicate (it exposes the allocated hostnames via SLURM_JOB_NODELIST, which MPI and launch scripts can use). Slurm also allows scheduling policies; for instance, jobs can be preempted or backfilled. Backfill scheduling is useful in HPC to maximize utilization: a short job might jump ahead in the queue if it can fit in a gap between big jobs. As an engineer, when you have large training jobs, you might request a longer walltime; but if you can break the work into shorter chunks or checkpoint and restart, you may take advantage of backfill to get them scheduled sooner in pieces.
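
For example, job arrays and dependency chains look like this on the command line (the script names and job ID are placeholders):

# Submit 10 copies of the script; inside the job, $SLURM_ARRAY_TASK_ID distinguishes them
sbatch --array=0-9 train.sbatch

# Start evaluation only after job 123456 finishes successfully
sbatch --dependency=afterok:123456 evaluate.sbatch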

In summary, Slurm is a powerful tool for running batch ML jobs on clusters. It is lower-level than Kubernetes or cloud services – you typically SSH into a login node and use sbatch to submit – but it gives fine-grained control. Many researchers run PyTorch or TensorFlow on Slurm clusters by simply writing Slurm scripts that launch their training code, benefiting from the cluster’s scheduling of GPU resources.

vLLM: High-Throughput LLM Inference Engine

As large language models (LLMs) moved from research to production, serving them efficiently became a challenge. vLLM is an open-source library and engine designed for fast, cost-effective LLM inference. Developed at UC Berkeley’s Sky Computing Lab, vLLM introduces a novel memory management technique called PagedAttention to optimize how the model’s attention key-value cache is stored and accessed (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). The result is significantly higher throughput (requests per second) compared to traditional implementations. In fact, vLLM achieves up to 24× higher throughput than the baseline Hugging Face Transformers library on GPT-style model serving (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog), all without requiring changes to the model architecture.

How vLLM works: In autoregressive generation (the typical way LLMs generate text token by token), a cache of past attention keys and values is maintained for each sequence. This KV cache grows with the sequence length and can consume a lot of GPU memory (e.g., ~1.7 GB for a single long sequence on LLaMA-13B (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog)). Most frameworks allocate a contiguous block for the maximum possible length, leading to a lot of unused space (fragmentation). vLLM’s PagedAttention instead treats the KV cache like virtual memory pages, allocating it in blocks and allowing non-contiguous storage (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). This way, memory can be managed flexibly: sequences that finish generating free up their pages which can be reused for new sequences. It dramatically reduces memory waste (by 60–80% in typical cases) and allows vLLM to handle more concurrent sequences than other systems.

Moreover, vLLM implements continuous batching: it can add new incoming requests into the batch on the fly, even while other requests are partway through generation. Traditional systems often process a fixed batch of requests from start to finish; vLLM’s scheduler instead works at the iteration level, admitting new requests and retiring finished ones between decoding steps, which keeps the GPUs busy. It also uses CUDA graphs to reduce Python overhead in the serving loop, supports both NVIDIA and AMD GPUs, and integrates with Hugging Face Transformers (so you can load models by name and serve them).

Example usage: Using vLLM feels similar to using a high-level inference server. You can either use it programmatically or via an API server. For instance, programmatically:

from vllm import LLM, SamplingParams

# Load a 7B model (assuming weights are available locally or via HuggingFace Hub)
llm = LLM(model="facebook/opt-6.7b", tensor_parallel_size=1)  # set tensor_parallel_size > 1 to shard the model across GPUs
prompts = [
    "User: Hello, how are you?\nAssistant:",
    "User: What is the capital of France?\nAssistant:"
]
# Generate with certain decoding parameters
outputs = llm.generate(prompts, sampling_params=SamplingParams(top_p=0.95, max_tokens=100))
for out in outputs:
    print("Prompt:\n", out.prompt)
    print("Completion:\n", out.outputs[0].text)

This code creates an LLM instance and generates responses for two prompts in a batch. Under the hood, vLLM will use PagedAttention to manage the KV cache for these prompts and may even batch them together if possible. The outputs will contain the completions for each prompt. One could also launch vLLM as a server (which provides an OpenAI-compatible REST API) using a command like python -m vllm.entrypoints.openai.api_server --model your_model_name. This makes it easy to integrate vLLM with applications expecting an OpenAI API (just point them to this server).
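
Once that server is running, any OpenAI-style client can talk to it. Here is a minimal sketch using the openai Python package, assuming the server was started with the OPT model used above and is listening on vLLM’s default port 8000:

from openai import OpenAI

# Point the client at the local vLLM server; the API key is required by the client but unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="facebook/opt-6.7b",
    prompt="User: What is the capital of France?\nAssistant:",
    max_tokens=50,
    temperature=0.0,
)
print(response.choices[0].text)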

Why it matters: vLLM essentially pushes the throughput limits of LLM serving. If you have a fixed GPU budget, serving more requests per second means lower cost per request. The 24× improvement reported is in scenarios where many concurrent requests are generated with relatively long outputs (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). Even in less extreme cases, vLLM often yields multiple-fold speedups over naive implementations, and about 3× over Hugging Face’s TGI server in many settings (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). It also supports advanced decoding algorithms (like beam search, parallel sampling) efficiently (vLLM and PagedAttention: A Comprehensive Overview | by Abonia Sojasingarayar | Medium). For an engineer, adopting vLLM can be as simple as swapping your inference code to use vLLM’s API (or running its server). The benefit is handling more users or reducing latency without buying more GPUs.

SGLang: Structured Generation Language and Serving Framework

While vLLM focuses on raw throughput of single-step prompt completion, SGLang tackles the problem of orchestrating complex, structured interactions with LLMs efficiently. SGLang (short for Structured Generation Language) is a system that combines a frontend DSL (Domain-Specific Language) for describing multi-step LLM programs with a highly optimized backend runtime for executing them ([2312.07104] SGLang: Efficient Execution of Structured Language Model Programs) (SGLang: A Deep Dive into Efficient LLM Program Execution - DEV Community). It’s as if SGLang is both a programming language for prompting (with features like loops, conditionals, and parallel calls) and a serving engine that ensures those programs run fast.

What it is: The core idea behind SGLang is that many real applications require multiple calls to an LLM and structured outputs. For example, an AI agent might plan steps and call the LLM for each step, or you might require the LLM to output JSON with a specific schema. Doing this naively can be slow: multiple calls incur overhead and the LLM might repeat processing the same prompt parts, etc. SGLang’s frontend lets you write a single “script” that can include multiple generation calls, branching logic, and integration of external tools. The SGLang compiler/runtime will then execute this efficiently. The runtime introduces optimizations like RadixAttention (an algorithm for reusing KV cache across prompt prefixes, similar in spirit to vLLM’s paging but geared toward shared prompt parts) and a compressed finite-state machine for output grammar (to handle structured outputs faster) ([2312.07104] SGLang: Efficient Execution of Structured Language Model Programs). In plain terms, RadixAttention means if you have many requests sharing a common prefix (e.g., system or few-shot prompt), SGLang will compute that part once and reuse it, making things like chatbots or retrieval-augmented generation much faster by avoiding redundant work (What is SGLang and Why Does It Matter? | by Emad Dehnavi | Medium). The structured output FSM means if you expect, say, a JSON with fixed keys, SGLang can skip generating the fixed punctuation/keys token by token and jump ahead, since it knows those parts are deterministic (What is SGLang and Why Does It Matter? | by Emad Dehnavi | Medium). This gives a 3×+ speed boost for generating long JSON or XML outputs (What is SGLang and Why Does It Matter? | by Emad Dehnavi | Medium).

How it operates: SGLang consists of two parts – a Python-based DSL and a runtime. The DSL allows users to write something like:

from sglang import sg  # hypothetical interface

# Pseudocode: define a structured generation flow
with sg.session(model="Llama-2-13b-chat") as sess:
    # A simple program with two sequential LLM calls and a structured output
    user_input = "Translate the following English text to French and give sentiment: I love this product."
    sg.prompt(f"User asks: {user_input}\nYou are a translator and sentiment analyzer.")
    translation = sg.generate("First, translate to French:")
    sentiment = sg.generate("Now, analyze the sentiment of the original text:")
    sg.return_json({"translation": translation, "sentiment": sentiment})

(Note: The above is illustrative pseudocode. SGLang’s actual syntax might differ, but conceptually it allows sequential and parallel generation and returning structured data.)

When this SGLang "program" runs, the runtime will take over: it might execute the first generation call to translate, then the second for sentiment, and finally assemble a JSON. Under the hood, it uses one or more LLM inference backends (with optimizations akin to vLLM). Because the two prompts share the initial instruction context, SGLang can reuse that prefix’s computation for the second call. And when returning JSON, if the JSON keys and format are predetermined by sg.return_json, it can ensure those are output without spending multiple tokens on braces/commas.

Key features: SGLang’s authors highlight features including zero-overhead scheduling (its scheduler for orchestrating multiple calls adds virtually no extra latency), cache-aware load balancing (it can route requests to workers in a way that maximizes cache hits), and multi-modal support (it can handle vision-language models as well, e.g., LLaVA for images) (GitHub - sgl-project/sglang: SGLang is a fast serving framework for large language models and vision language models.). It also supports common efficiency tricks: quantization (int8/4-bit etc.), speculative decoding (generating multiple tokens in advance to then verify, which can speed up long generations), and multi-model orchestration (e.g., using one model for one part of the program and another model for a different part). Essentially, it’s an all-in-one framework to write LLM-powered applications and run them with high performance.

Performance and use cases: SGLang has shown up to 6.4× throughput improvements over state-of-the-art systems on complex tasks ([2312.07104] SGLang: Efficient Execution of Structured Language Model Programs). Consider a retrieval-augmented generation (RAG) pipeline: normally, you might vector search for context, then prepend it to a prompt, then generate an answer. With SGLang, you could express this whole sequence (vector search result feeding into a prompt template, then generation) as a single program. The runtime could parallelize some steps if possible and reuse cached components. Similarly, for chatbots that have a conversation history, SGLang can reuse the invariant parts of the prompt across turns (rather than reprocessing the entire conversation every time). For structured outputs like ensuring the model’s answer follows a JSON schema, SGLang’s finite state machine approach can prevent the model from drifting off format and do it faster by injecting the fixed syntax. From an engineering standpoint, SGLang offers both a productivity boost (via the DSL) and a performance boost (via the optimized execution) for building complex LLM workflows.

Who’s behind it: SGLang is an open-source project (backed by a community including researchers from Stanford/Berkeley, and companies like LinkedIn, etc., as per their acknowledgments). It has been used in production by companies like ByteDance and xAI (SGLang: A Deep Dive into Efficient LLM Program Execution - DEV Community). It’s relatively new (the arXiv paper came in late 2023), but it’s gaining traction for advanced LLM applications. If your use case goes beyond single prompt-completion pairs – say you need an agent that calls the model multiple times or you need ultra-fast structured responses – exploring SGLang could be worth it. It does have more moving parts than a simple server, but it can automate a lot of the optimization that you would otherwise have to build yourself.

Training Frameworks and Typical Training Parameters

When we talk about training frameworks, we refer to the libraries and tools used to implement model training. The major ones are PyTorch, TensorFlow, JAX, and higher-level interfaces like PyTorch Lightning or Hugging Face Transformers Trainer, as well as distributed training frameworks like Horovod, DeepSpeed, or ColossalAI that sit on top of these. Regardless of the framework, there is a common set of training parameters (hyperparameters) that practitioners must manage. These are the knobs that define the training setup and influence model convergence and performance.

Typical training parameters include:

  • Learning Rate (LR): Perhaps the most critical hyperparameter. This controls the step size in gradient descent. Too high a learning rate and training might diverge; too low and it converges slowly or gets stuck in a suboptimal result. Often we use learning rate schedules (like cosine decay, linear warmup then decay, etc.) to adjust LR over time.
  • Batch Size: Number of samples the model sees before its weights are updated. There is per-device batch size and global batch size (global = per-device * number of devices * gradient accumulation steps). Larger batches can speed up training (more parallelism) but may require a lower learning rate or careful tuning to maintain generalization. Memory is a limiting factor for batch size.
  • Number of Epochs or Iterations: Epochs = how many passes over the entire dataset. Alternatively, one might set a total number of training steps if data is streaming. You typically train until the model converges (monitoring a validation metric).
  • Optimization Algorithm: e.g., SGD, Adam, AdamW, RMSprop, etc. Each has parameters like momentum (for SGD) or beta1/beta2 (for Adam). For instance, AdamW is Adam with weight decay (L2 regularization) which is very common for transformers.
  • Weight Decay: A regularization parameter that controls L2 penalty on weights (to reduce overfitting). It’s often set around 0.01 or 0.1 for large language models.
  • Dropout Rate: If the model architecture includes dropout layers, this rate (usually between 0.1 and 0.5) decides the fraction of neurons to drop during training. It’s active in training mode, and automatically off in eval mode.
  • Gradient Clipping: A threshold (like clipping norm to 1.0 or 5.0) to avoid gradient explosion in RNNs/transformers. This isn’t exactly a hyperparameter of the model but of the training process.
  • Gradient Accumulation steps: If your batch size is limited by GPU memory, you can accumulate gradients over a few mini-batches before updating weights. For example, if you want an effective batch of 256 but can only fit 64 at once, you can accumulate over 4 mini-batches. This effectively multiplies the batch size without extra memory (though the learning rate should be tuned for the larger effective batch); see the sketch after this list.
  • Distributed Training parameters: If using multiple GPUs or nodes, parameters like number of processes, or for Horovod/Distributed Data Parallel, the communication backend (NCCL for GPUs), etc., come into play. Many frameworks let you largely ignore this (they figure it out from environment), but you might specify world_size (total number of processes) and rank (process ID) when launching manually.
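
To make the gradient accumulation bullet concrete, here is a minimal sketch. It reuses the model, optimizer, compute_loss, and train_loader names from the PyTorch snippet just below; accum_steps is the number of micro-batches per optimizer update.

accum_steps = 4  # e.g., effective batch of 256 from micro-batches of 64

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    loss = compute_loss(model(inputs), labels)
    (loss / accum_steps).backward()       # scale so the accumulated gradient matches a full batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()                  # update weights once every accum_steps micro-batches
        optimizer.zero_grad()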

These parameters are often specified in code or config files. For example, using PyTorch:

import torch
import torch.optim as optim

model = MyModel()
learning_rate = 3e-4
optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
num_epochs = 10
clip_grad_norm = 1.0

for epoch in range(num_epochs):
    model.train()  # set training mode (enables dropout, etc.)
    for batch in train_loader:
        inputs, labels = batch
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = compute_loss(outputs, labels)
        loss.backward()
        # Clip gradients to prevent explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_grad_norm)
        optimizer.step()

In this snippet, we see learning rate (3e-4), weight decay (0.01 in AdamW), and we would have a batch size defined in train_loader. If using learning rate scheduling, we might add:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs * len(train_loader))
...
scheduler.step()

to adjust LR each iteration or epoch. If using multiple GPUs with Distributed Data Parallel (DDP), we would have launched the script with torchrun --nproc_per_node=4 train.py or similar, and each process would run this loop on a different subset of data.

Choosing values: The optimal hyperparameters can vary. Often, papers or open-source models provide default values (for instance, BERT was famously trained with batch size 256, 1e-4 LR with warmup and decay, 0.01 weight decay, etc.). A good practice is to perform hyperparameter tuning for critical ones like learning rate and batch size. Tools like Optuna or Ray Tune can help automate that. Modern frameworks (like Hugging Face Trainer) allow passing these in a config or command-line, e.g. --learning_rate 3e-5 --num_train_epochs 3 --per_device_train_batch_size 32 ... for fine-tuning a transformer.

In distributed settings, note that if you increase the number of GPUs, you usually increase the global batch size (to keep the per-GPU batch constant). There’s a concept of learning rate scaling: the linear scaling rule suggests that if you double the batch size, you also double the learning rate. But beyond certain batch sizes this can hurt convergence, so techniques like learning rate warm-up and gradual decay become important.
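
A small sketch of the linear scaling rule plus warmup, reusing the model from the earlier snippet (the base values are arbitrary placeholders):

import torch

base_lr, base_batch = 1e-4, 256                    # reference configuration (placeholders)
global_batch = 1024                                # e.g., after adding more GPUs
scaled_lr = base_lr * global_batch / base_batch    # linear scaling rule

optimizer = torch.optim.AdamW(model.parameters(), lr=scaled_lr, weight_decay=0.01)

warmup_steps = 500
def lr_lambda(step):
    # Linear warmup to the scaled LR; a decay schedule would typically follow
    return min(1.0, (step + 1) / warmup_steps)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)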

Training framework specifics:

  • TensorFlow/Keras runs eagerly by default (graphs come from tf.function); you typically set the optimizer and loss in model.compile(optimizer=..., loss=..., metrics=...) and pass batch size and epochs to model.fit().
  • PyTorch Lightning allows defining hyperparams in a LightningModule and can take them from a config or CLI.
  • DeepSpeed and others provide config JSONs where you specify things like optimizer type, LR, batch size, parallelism strategy (more on that below).
  • Horovod requires setting up the optimizer via hvd.DistributedOptimizer which wraps your optimizer to average gradients across workers.
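
For instance, the Horovod wrapping mentioned above looks roughly like this (a sketch of the usual horovod.torch pattern; model and optimizer come from the earlier snippet):

import torch
import horovod.torch as hvd

hvd.init()                                    # one process per GPU, launched via horovodrun or mpirun
torch.cuda.set_device(hvd.local_rank())       # bind each process to its local GPU

optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)    # start all workers from identical weights
hvd.broadcast_optimizer_state(optimizer, root_rank=0)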

Ultimately, understanding these typical parameters is crucial. They are often exposed in configuration files or scripts for a training job. It’s common to see YAML or JSON config files like:

training:
  epochs: 50
  batch_size: 128
  learning_rate: 0.001
  optimizer: "Adam"
  momentum: 0.9
  weight_decay: 0.0005
  lr_schedule: "cosine"
  warmup_epochs: 5

which then get parsed into the training framework. A solid grasp of what each does helps in tuning and debugging training runs.

Operational Modes: Training vs. Evaluation vs. Inference

Machine learning models operate in different modes depending on the phase of the workflow: training, evaluation (validation/testing), or inference (production). It’s important to understand the distinctions, as the behavior of certain components (and the code we write) differs in each mode.

  • Training Mode: This is when the model is learning from data. In training mode, we perform forward passes and backward passes (computing gradients and updating weights). Some layers behave differently in training – for example, Dropout layers drop out random activations (to regularize and prevent overfitting), and Batch Normalization layers update running statistics of mean/variance. In most frameworks, you explicitly set the model to training mode. In PyTorch, model.train() puts the model in training mode (affecting dropout, batchnorm, etc.). In Keras (TensorFlow), the model.fit() call handles this, or you pass a training=True flag to layers if using them manually. During training, we also typically enable gradient computation. In code, you’ll see loss.backward() and optimizer.step() as shown earlier, which are only done in training.

  • Evaluation Mode: This refers to when you are measuring the model’s performance, typically on a validation or test dataset. No learning happens here – we do forward passes only, and we do not update weights. Importantly, we disable or alter certain layers’ behavior: dropout is turned off (so all neurons are active, because we want deterministic output), and batch norm layers use the learned running statistics instead of batch stats (to ensure consistency). In PyTorch, model.eval() switches to evaluation mode (disabling dropout etc.). Additionally, we often wrap the evaluation code in a with torch.no_grad(): block (or disable gradients in TF) to save memory and compute, since we don’t need gradients for evaluation. Evaluation mode is used during training (to periodically check validation accuracy) and after training (to test the model). In code it looks like:

    model.eval()
    total_correct = 0
    for batch in val_loader:
        inputs, labels = batch
        with torch.no_grad():           # no grad since no backward
            outputs = model(inputs)     # forward pass
        preds = outputs.argmax(dim=1)
        total_correct += (preds == labels).sum().item()
    accuracy = total_correct / len(val_dataset)
    

    Here we see model.eval() and torch.no_grad() to properly do evaluation. In this mode, dropout layers would not drop neurons (effectively dropout acts like identity function), ensuring the full capacity of the model is used.

  • Inference Mode: In practice, “inference” is very similar to evaluation – it’s a forward pass on new data – but typically when we say inference we mean deploying the model in a production setting to make predictions on unseen data (often one sample or a few at a time, rather than a whole batch of a static test set). Inference thus is usually the same as evaluation behavior: no dropout, etc., but it might involve additional concerns like latency optimization, serving infrastructure, etc. Many frameworks don’t have a separate “inference” switch; you just use eval() mode. However, when deploying, you might use specialized inference libraries (TensorRT for NVIDIA GPUs, ONNX Runtime, TorchScript compiled model, etc.) to speed it up.

In summary, the core distinction is train vs. not-train. The model has internal flags for this. If you forget to set the mode, you can get incorrect results – e.g., evaluating with dropout still on will typically degrade accuracy because the model is randomly dropping neurons even on validation data. Conversely, if you accidentally leave the model in eval mode during training, dropout won’t happen and your training metrics might look better than they should, and then you get a shock when turning dropout on. So it’s standard practice in training scripts to toggle modes appropriately.

Example illustrating difference: Suppose you have a model with a dropout layer. If you input the same data twice in training mode, you may get different outputs each time (because dropout randomly zeroes different neurons). In eval mode, the same input will produce the same output consistently. This deterministic behavior is what you want for measuring performance and for deploying to production (you don’t want your product to return different answers each time unpredictably!).

Another example is batch norm: in training, it normalizes using the batch’s mean and variance and also updates its running estimates; in eval, it uses the running mean/var accumulated during training to normalize. That way, the model’s output is not dependent on the specific batch of input when doing inference – it’s fixed by those learned statistics.
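
A short sketch demonstrating both behaviors (layer sizes are arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(4, 8)

drop = nn.Dropout(p=0.5)
drop.train()
print(drop(x))                   # random zeros; running this line twice gives different outputs
drop.eval()
print(drop(x))                   # identical to the input, fully deterministic

bn = nn.BatchNorm1d(8)
bn.train()
_ = bn(torch.randn(32, 8))       # normalizes with batch stats and updates running_mean / running_var
bn.eval()
print(bn(x))                     # normalizes with the stored running statistics, not the batch's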

Additional Modes: Sometimes you hear “fine-tuning mode” or others, but those aren’t separate in code – fine-tuning a model is just training mode (with perhaps some layers frozen). Some libraries have a notion of “predict” vs “eval” (Keras model.predict() vs model.evaluate()), but under the hood they both set the model to eval mode; the difference is just whether they compute metrics or not.

Finally, gradient tracking is implicitly linked to mode: in training we need grads, in eval/inference we often turn them off. This can be a gotcha – for example in PyTorch, forgetting torch.no_grad() in evaluation won’t change correctness but will use more memory. In TensorFlow 2 (eager), it won’t compute grads unless you’re in a tf.GradientTape, so it’s less manual. But frameworks aside, always conceptually separate the concerns: training (learn) vs inference (use). This helps design the code and the system (for instance, in production you’d deploy only the forward pass and maybe quantize the model for speed, etc., which you wouldn’t do in training).

Parallelism Strategies in Training (DPTPPP: Data, Tensor, and Pipeline Parallelism)

Training modern deep models often requires splitting the work across multiple devices or machines, because a single GPU’s memory or compute might be insufficient. DPTPPP isn’t a standard acronym on its own, but it hints at the trio of parallelism strategies used in large-scale training: Data Parallelism, Tensor Parallelism, Pipeline Parallelism (and combinations thereof). Let’s break these down:

  • Data Parallelism (DP): This is the most common form of parallelism. The idea is simple: you replicate the entire model on N devices (GPU or TPU etc.), and you split each batch of data into N smaller micro-batches, one for each device (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). Each device processes its data (forward and backward pass) independently. Then, at synchronization points (usually after computing gradients), the gradients from all devices are aggregated (by summing or averaging) so that all model replicas stay in sync with the same parameters. This aggregation can be done via a parameter server architecture or more commonly via all-reduce in an all-to-all fashion. After gradient averaging, each device has the same updated weights, and you proceed to the next batch. Data parallelism is relatively easy to implement (PyTorch’s DistributedDataParallel does exactly this). The key is that each device handles a fraction of the data, so you can effectively train with batch size = N * (batch per device), utilizing N times the memory and compute. The downside is if N is very large, the effective batch size might become so big that the model’s convergence worsens (you might need to adjust learning rate or use techniques like learning rate warmup). Also, communication overhead for syncing gradients can become a bottleneck if the model is huge.

  • Tensor Parallelism (TP): Sometimes called model parallelism (intra-layer model parallelism). Here, instead of replicating the whole model on each device, you split the work inside each layer across devices (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). For example, consider a Transformer layer that has a weight matrix of shape (hidden_dim, hidden_dim). In tensor parallelism, if you have two GPUs, each GPU could store half of that matrix (split along the hidden_dim). During a forward pass, each GPU computes its part of the matrix multiplication and then they exchange results (since each GPU has partial output). Essentially, a single layer’s computation is partitioned. This way, a single model can be spread across multiple GPUs – useful when the model is so large it doesn’t fit in one GPU’s memory. NVIDIA’s Megatron-LM is an example that uses tensor (model) parallelism by slicing the weight matrices across GPUs (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). The benefit is you can train extremely large models without running out of memory on one GPU. The complexity is that almost every layer’s computation now involves communication between GPUs (synchronizing partial results), which can be tricky and slower if not done efficiently. TP is often used in conjunction with DP (e.g., each node might have 8 GPUs doing tensor parallel for one model replica, and then you have multiple nodes data-parallel on different data). Unlike DP where each GPU had a full copy of parameters, in TP each GPU holds only a fraction of the parameters (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). A toy code sketch of this column split appears after this list.

  • Pipeline Parallelism (PP): Also known as inter-layer model parallelism. This slices the model by layers and assigns different contiguous layers to different devices (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). For example, in a 24-layer Transformer, GPU0 could hold layers 1-12, and GPU1 holds layers 13-24. When a batch comes in, GPU0 processes the first 12 layers, then sends the intermediate activations to GPU1, which continues through layers 13-24 to produce the output. During backpropagation, the gradients flow reverse: GPU1 computes grads for layers 13-24, sends gradients of the interface activations back to GPU0 so it can compute grads for layers 1-12. Pipeline parallelism reduces memory load per device (each device only needs to load a subset of the model), enabling large models to be trained, similar to tensor parallelism (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). However, a naive pipeline has a lot of idle time – while GPU0 is processing batch 2, GPU1 might be idle because it can’t start on batch 2 until GPU0 sends it forward activations. This idle time is called the pipeline bubble. Solutions like micro-batching address this: you break each batch into micro-batches and pipeline them. For instance, while GPU1 is working on micro-batch 1, GPU0 can start micro-batch 2, and so on, so both GPUs stay busy in a staggered fashion (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). With sufficient micro-batches, the pipeline can be kept near full utilization, though some bubble (idle time) is inevitable (Parallelism in Distributed Deep Learning · Better Tomorrow with Computer Science). Pipeline parallelism often requires careful coordination but is supported in frameworks like DeepSpeed (which automates splitting layers and scheduling micro-batches).

  • Hybrid (3D) Parallelism: In practice, these strategies are combined to scale to hundreds or thousands of GPUs. For example, you might use DP across nodes, and within each node use TP and PP to split a giant model across GPUs. This is sometimes called “3D parallelism.” If a model has to be split across 8 GPUs via 2-way tensor parallel and 4-way pipeline parallel, and you have 4 such model replicas data-parallel, you are using all three. Each adds complexity and overhead, but together they make training trillion-parameter models feasible. Research and frameworks like Megatron (by NVIDIA) or DeepSpeed (by Microsoft) provide recipes for hybrid parallelism.
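
To make the tensor-parallel split concrete, here is a toy two-way column split of a single linear layer in plain PyTorch. There is no real multi-GPU communication here; it only shows the math each device would perform and the concatenation that an all-gather would provide in a real system.

import torch

hidden = 8
x = torch.randn(2, hidden)                     # a batch of activations
W = torch.randn(hidden, hidden)                # the full weight matrix of one layer

# "GPU 0" and "GPU 1" each hold half of the columns of W
W0, W1 = W[:, : hidden // 2], W[:, hidden // 2 :]

y0 = x @ W0                                    # partial output computed on device 0
y1 = x @ W1                                    # partial output computed on device 1

y = torch.cat([y0, y1], dim=1)                 # devices exchange/concatenate partial results
assert torch.allclose(y, x @ W)                # identical to the unsplit computation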

Choosing a parallelism strategy: For moderate model sizes (that fit on one GPU), pure Data Parallelism is simplest and often best. As model sizes grow, you first try to use multi-GPU data parallel if you can (just need more GPUs for more data). But if the model itself doesn’t fit or you want to reduce per-GPU memory, you introduce model parallelism: pipeline or tensor.

  • Tensor parallel is very efficient for architectures like Transformers because you can split the matrix multiplies, but it requires fast interconnect (GPUs on the same server or NVLink) to share intermediate results.
  • Pipeline parallel is easier to implement conceptually (just cut the model), but it introduces latency (pipeline fill/flush). It also complicates training with things like batch norm across stages (though there are workarounds).
  • With pipeline parallel, you also have to deal with the fact that different GPUs will hold different parts of the model, so if one GPU is much slower or has issues, it affects the whole pipeline.

Example configuration: With DeepSpeed, the parallel layout can be expressed in a JSON config along these lines (the exact keys vary by DeepSpeed version and setup, so treat this as illustrative):

{
  "train_batch_size": 2048,
  "gradient_accumulation_steps": 1,
  "bf16": {"enabled": true},
  "pipeline": {
    "pipeline_parallel_size": 4
  },
  "tensor_parallel": {
    "tensor_parallel_size": 2
  }
}

This implies 4-way pipeline parallel and 2-way tensor parallel (so total 8 GPUs for one model replica). If you had, say, 32 GPUs available, you could then have 4 data parallel replicas of this 8-GPU setup (4*8 = 32).

Synchronization and communication: In DP, the communication is syncing gradients (which can often be overlapped with computation – e.g., while processing next batch’s forward pass, you all-reduce last batch’s grads). In TP, communication happens at every layer’s forward/backward (synchronizing partial results). In PP, communication happens between stages for each micro-batch. All three add overhead, but hardware like NVIDIA’s InfiniBand or NVLink and software like NCCL are designed to minimize that.

To put in perspective, if training a model like GPT-3 (175B parameters):

  • You might use 8-way tensor parallel within a node (if the node has 8 GPUs) so each GPU holds ~1/8th of the model (which might still be ~20B params per GPU).
  • Then you might use 8 pipeline stages across 8 such nodes, so now model is split into 8 segments of layers, each segment itself is on 8 GPUs tensor-parallel.
  • That covers 64 GPUs to hold one model replica.
  • Then you use 8-way data parallel across 8 groups of those 64 (total 512 GPUs) to train on different data in parallel. This 8x8x8 scheme is what we mean by hybrid/3D parallelism. It’s extremely complex to orchestrate by hand; that’s why libraries like Megatron and DeepSpeed exist to handle the heavy lifting.

In sum, DPTPPP (data, tensor, and pipeline parallelism) covers the fundamental techniques enabling the training of large models. Data parallelism handles scaling with more data, tensor parallelism handles splitting within layers (horizontal split of tensors), and pipeline parallelism handles splitting across layers (vertical split of the network). Understanding these helps in selecting the right distributed training approach for a given scenario. If your model fits in one GPU and you just need to go faster, use DP on more GPUs. If it doesn’t fit, model parallelism (tensor or pipeline or both) is the way to go.

Router and Controller in Distributed AI Architectures

In distributed training and especially in distributed inference/serving, we often talk about a controller component and a router (or routing) component. These play different roles in a multi-node or multi-process architecture:

  • Controller: A controller is typically the brains of the system – a centralized (or logically centralized) component that manages metadata, coordination, and high-level decisions. In a distributed training context, a controller might coordinate the training process (for example, orchestrating when to start/stop training, aggregate metrics, handle fault tolerance). In distributed inference/serving, the controller often manages the global state: which models are loaded on which servers, how many replicas of each model, health of workers, scaling decisions, etc. One example is Ray Serve’s controller: Ray Serve (a framework to serve models at scale on a Ray cluster) has a Serve Controller actor which holds the configuration and handles autoscaling decisions (Architecture — Ray 2.44.1). It keeps track of all the model deployments and their replicas. The controller doesn’t directly handle user requests; instead it supervises the system and can instruct workers to scale up or down, update routing tables, etc.

  • Router: A router is responsible for directing incoming requests or data to the appropriate worker(s). In inference serving, this is often an HTTP server or load balancer that receives API calls and then forwards them to one of possibly many model instances. Continuing with the Ray Serve example, there’s an HTTP Proxy actor (one per node by default) which serves as the router – it runs an HTTP server (Uvicorn) that accepts incoming requests and then looks up where to send each request (which replica of a model) and forwards it (Architecture — Ray 2.44.1). Once the worker (replica) finishes processing, the router sends the response back to the client. In training, explicit routers are less common since the “routing” of work (like which GPU handles which batch) is usually baked into the training script or library. But in distributed training with a parameter server, the parameter server could be seen as a kind of controller, and the workers route their gradient updates to it, while it routes updated parameters back.

To clarify with an analogy: if we think of a distributed inference service like a web service, the router is like a web server’s load balancer, and the controller is like a traffic cop or an orchestrator that decides how many servers behind the load balancer, and monitors them. The router deals with each request, the controller deals with the overall system configuration.

Why separate them? Separating controller and router (or data plane vs control plane) provides modularity and scalability. The router can be lightweight and fast, just passing through data (requests), while the controller can be smarter but doesn’t need to handle every request. For example, in a large cluster, you might have one controller but many routers (perhaps one per node or one per region). The controller might update a routing table (like: model X has replicas at {IP1, IP2, IP3}) and each router caches that. Then each incoming request goes to a router, which picks one target (maybe round-robin or based on load, etc.) and forwards it. If a replica dies, the controller notices (heartbeats fail) and updates the routing info, possibly spinning up a new replica. It informs routers of the change (or routers ask the controller periodically).
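
The division of labor can be sketched in a few lines of Python: the controller owns the replica table, and the router only consults whatever the controller last published. All names here are illustrative rather than any particular framework’s API.

import itertools

# Controller's view: which replicas serve which model (updated on failures or scaling events)
routing_table = {"sentiment-analysis": ["10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000"]}

# Router's view: a cheap per-model round-robin over the controller's latest table
cycles = {model: itertools.cycle(replicas) for model, replicas in routing_table.items()}

def route(model_name: str) -> str:
    """Pick the next replica for a request; actual forwarding is just an HTTP call to it."""
    return next(cycles[model_name])

print(route("sentiment-analysis"))   # 10.0.0.1:9000
print(route("sentiment-analysis"))   # 10.0.0.2:9000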

Real-world example – Ray Serve:

  • Serve Controller: a single actor that holds the state (list of applications, routes, replicas) and runs the autoscaling policy. If you change the configuration (e.g., deploy a new model or change the number of replicas), you talk to the controller.
  • HTTP Proxy (Router): one per node, which receives HTTP requests. When a request comes in for, say, the “sentiment-analysis” model, the proxy consults a local map (populated from the controller’s info) to find available replicas for “sentiment-analysis,” then forwards the request to one of those replicas (which are Python workers running the model). The proxies also handle batching if configured (accumulating several small requests into one batch before forwarding to improve throughput).
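
From the application developer’s side, none of that machinery is hand-written. A deployment sketch along the lines below (Ray Serve’s Python API; exact arguments vary a bit across Ray versions, and the classifier here is a trivial placeholder) is enough, with the Serve controller and proxies managing replicas and routing behind the scenes:

from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)        # the controller keeps 2 replicas alive
class SentimentAnalysis:
    async def __call__(self, request: Request) -> dict:
        text = (await request.json())["text"]
        label = "positive" if "love" in text.lower() else "neutral"   # placeholder "model"
        return {"label": label}

serve.run(SentimentAnalysis.bind())      # starts the controller and per-node HTTP proxies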

Another example: NVIDIA Triton Inference Server – it’s not explicitly split into router/controller in user terms (it’s more monolithic), but conceptually Triton’s scheduler could be seen as a controller deciding which thread/GPU runs the next request, and the networking part is routing. In multi-node deployment, you’d often put a load balancer in front of multiple Triton instances – that load balancer is effectively a router, while each Triton instance has an internal controller for scheduling models on device.

In distributed training, if using Parameter Server (PS) architecture:

  • Parameter server nodes act as controllers for parameters (maintaining the master copy of weights).
  • Workers route their gradient updates to the PS and fetch updated weights from it. In All-Reduce architecture (like DDP), there’s less of an explicit controller; all workers collectively act (though one might be rank 0 that orchestrates some things like saving checkpoints). Some systems have a concept of a chief node or master for coordination (e.g., TensorFlow’s chief worker that writes checkpoints), which is a bit of a controller role.

Controller responsibilities: manage meta tasks – e.g., in training, initiating a checkpoint save, deciding to stop early, etc.; in inference, scaling up new instances when load increases, rolling out updated model versions, monitoring health.

Router responsibilities: manage data flow – e.g., distributing training data shards to workers (could be considered a form of routing in distributed data loading), or routing inference queries to a model instance.

For an engineer, understanding these roles helps in designing systems. If your inference service is slow at the network level, you might need more router instances or a more efficient router. If your system isn’t scaling, maybe the controller is a bottleneck (maybe it’s doing too much per request). Many serving frameworks decouple these so you can scale them independently (e.g., multiple routers for high request volume, but one controller is fine if it only handles occasional control messages).

Summing up: A Router is about where to send work, and a Controller is about what work to do and when/where to have workers. In distributed inference, routers handle the online traffic and controllers handle the orchestration. In distributed training, the notion is a bit blurred but controllers orchestrate (job schedulers, parameter servers, etc.) and routing is often in how data/gradients are shuttled around. By designing architectures with these in mind, one can achieve scalability (routers scale horizontally to handle more requests; controllers ensure system coherence and can be replicated or made fault-tolerant via leader election if needed).

Data Loading Strategies: Slicing vs. High-Concurrency, and Maximizing Throughput

Feeding data efficiently to GPUs is a critical aspect of training (and sometimes inference for data-heavy tasks). If your GPUs are starving waiting for data, you’re not utilizing them fully. Two broad strategies in data loading are data slicing and high-concurrency loading, and often a combination of both is used.

Data slicing refers to partitioning the dataset so that different workers (processes or nodes) handle different portions of the data. In distributed training with multiple processes (or multiple machines), slicing ensures each process gets a unique subset of data each epoch, rather than everyone loading the same data redundantly. For example, PyTorch’s DistributedSampler will split the indices of the dataset among the N processes such that each process sees 1/N of the dataset per epoch. This not only improves efficiency (no duplicate reading) but is necessary for correctness (so each epoch every sample is processed exactly once across the whole cluster, not N times). If you have 4 workers and 1 million samples, each worker might get 250k. This is horizontal slicing of data by worker.

Even on a single machine with multiple data loader threads, you might slice by sharding files. For instance, if you have 1000 image files and 4 worker threads, you could assign 250 files to each thread to read exclusively, reducing contention (this can be done manually or by how you partition the dataset indices).

High-concurrency data loading means using multiple threads or processes to read and preprocess data in parallel, overlapping with the training computation. Most deep learning frameworks allow you to spawn multiple data loader workers. For instance, PyTorch DataLoader(num_workers=k) will spawn k worker processes. While the GPU is training on the current batch, those workers can concurrently load and preprocess the next batch in the background. This pipeline keeps the GPU fed. High concurrency is useful if each sample or batch takes non-trivial time to load (e.g., reading images from disk and decoding JPEGs, or reading large arrays from a database). By doing these operations in parallel, you utilize multiple CPU cores and hide the I/O latency.

There’s a trade-off: spawning more workers can increase throughput up to a point, but after a certain number, you get diminishing returns (and potentially more overhead or even slowdown due to thread contention or too many disk seeks).

Maximizing throughput tips:

  • Use separate workers (threads/processes) for I/O and computation. If the training loop is purely synchronous and uses only the main thread to load data, the GPU will sit idle during data load. Always try to overlap them. For example, in PyTorch the default DataLoader with num_workers=0 does everything on main thread (slow); setting num_workers=4 or 8 offloads data loading to those processes and the main thread just waits for batches to be ready (much faster).
  • Pinning memory: If using GPUs, copying data from CPU to GPU can be a bottleneck. PyTorch’s pin_memory=True in DataLoader can speed up host-to-device transfer by using page-locked memory.
  • Prefetching: Many data loaders (PyTorch, TensorFlow’s tf.data) allow prefetching batches. E.g., PyTorch’s DataLoader automatically prefetches next batch by default (it’s tied to the number of workers and a prefetch_factor). TF’s dataset.prefetch() explicitly preloads batches. This means even if your pipeline has some bubble, you always keep one or two batches ready ahead of time.
  • Slicing to avoid duplicate reads: In a multi-process setting, if not using a proper sampler, each process might inadvertently read the same file. Always shard or use distributed samplers to ensure each data fragment is read once. For example, in TensorFlow you might use tf.data.Dataset.shard(num_workers, worker_index) to slice the dataset per worker.
  • Concurrency vs locality: If data is on spinning disk HDD, having too many concurrent reads might cause a lot of random seeking, hurting throughput. In that case, fewer workers that read larger contiguous chunks (slicing by file or by chunk) might be better (sequential access). If data is on SSD or in memory, concurrency can help saturate bandwidth. If data is coming over network (like reading from S3 or a remote storage), concurrency can help hide latency, but too much might overwhelm the network or throttle.
  • Number of workers: A rule of thumb is to start with number of CPU cores or a multiple thereof. If you have, say, 16 CPU cores, try 8 or 16 data loader workers and see. Sometimes hyperthreading means you can use more workers than cores for I/O bound tasks. But more workers also mean more memory (each may hold a prefetched batch) and overhead of context switching. You can experiment: double the workers and measure throughput until it stops increasing. Many find an optimum in the single-digit or low double-digit workers. For very heavy CPU preprocessing (like complex augmentations), you might benefit from even more.
  • Vectorized loading or caching: Another strategy is to structure your data to load more efficiently. For instance, if reading from an HDF5 file or a memory mapped file, reading big chunks sequentially is faster than lots of tiny reads. “Slicing” could also mean slicing each file into a big contiguous chunk. Alternatively, using a caching approach: if the dataset is small enough, load it entirely into RAM (or LMDB) to avoid disk latency. If it’s huge but you have repeated epochs, consider caching on local SSD from a remote store.

Example in PyTorch:

from torch.utils.data import DataLoader, DistributedSampler

# Assume we are in a distributed training environment with world_size and rank set.
sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(train_dataset, batch_size=256, sampler=sampler,
                    num_workers=8, pin_memory=True, prefetch_factor=2)
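# With shuffle=True, call sampler.set_epoch(epoch) at the start of each epoch so shuffling differs across epochs.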

In this code, DistributedSampler will slice the dataset indices for each process (each process gets 1/world_size of the data). We use 8 worker processes to load data, pin memory for faster transfer, and prefetch 2 batches per worker. During training, each process will load different data, and within each process, 8 workers will load batches in parallel. This typically results in a very high throughput pipeline where the GPU is nearly always busy.

Slicing vs high-concurrency in practice: They are not mutually exclusive – use both. Slicing ensures scalability across workers/nodes (no duplicate work, linear scaling of data pipeline with number of nodes), while concurrency ensures each worker/node is fed efficiently. If you only did slicing (say 4 workers each reading 1/4 of data, but each worker is single-threaded), each might still be slow to fetch its part. If you only did concurrency without slicing (say 8 threads but all 8 threads on every node read the entire dataset without coordination), you’d get speed but also redundant reads (if multi-node) and maybe skew (some samples seen too often). So typically, in distributed training:

  • each node/process is responsible for a shard of data (slice by data distribution),
  • and each node uses multiple threads to read its shard.

Maximizing throughput beyond basics:

  • Parallel storage: If reading from network storage (NFS, S3), watch out for it becoming a bottleneck. You might need to use local SSDs or parallel file systems (e.g., Lustre). Alternatively, stagger data access patterns so not all workers hit the storage at the same time.
  • Compression trade-offs: Reading compressed data (like JPEG images) means less disk bytes but more CPU to decompress. If CPU is the bottleneck, consider storing data uncompressed to shift load to I/O (or vice versa).
  • Batches vs individual files: If each sample is a tiny file, opening/closing files can dominate. Solutions include packing many samples into a single file (TFRecord for TensorFlow, or concatenated archives) so that each I/O operation yields many samples.
  • Asynchronous data staging: In some workflows, you might even have a separate process that preloads data for the next epoch or next step while the current one is running, especially when data is streamed from a slow source.

Throughput measurement: Always measure your input pipeline. A good practice is to benchmark how many samples per second your DataLoader can produce (without the training part). In PyTorch, you can iterate through the DataLoader in a loop and time it. If your data loader is slower than your model can consume, that’s when to apply more concurrency or optimize it.
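
A throwaway benchmark along those lines, reusing the loader from the snippet above (assuming each batch is an (inputs, labels) pair of tensors):

import time

n_samples, start = 0, time.perf_counter()
for inputs, labels in loader:            # iterate the input pipeline only; no model involved
    n_samples += inputs.size(0)
elapsed = time.perf_counter() - start
print(f"{n_samples / elapsed:.1f} samples/sec from the DataLoader")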

To conclude, efficient data loading is crucial for large-scale training. Using slicing ensures distributed training isn’t doing redundant work, and using concurrency and prefetch ensures that batch loading overlaps with model computation. By tuning these (number of loader workers, how data is partitioned, prefetch sizes), you can often reduce training time significantly without changing the model at all. It’s not glamorous compared to model architecture changes, but it’s highly effective engineering. Keep an eye on GPU utilization: if it’s not close to 100%, your data pipeline could be the culprit – then apply these strategies to fill the gap.

Maximizing throughput ultimately means finding the right balance of read parallelism and resource usage such that the data is always ready just in time for the compute. With the right setup, you can ensure your expensive accelerators are never waiting idle for data.
