JobSet Troubleshooting: Why Follower Pods Hit

JobSet Troubleshooting: Why Follower Pods Hit "node selector not set"

When you use JobSet to run distributed training on multi-slice TPU or GPU workloads, its leader-follower mechanism first schedules the leader Pod, then uses a mutating webhook to read the topology label from the leader’s node and patch the corresponding nodeSelector onto the follower Pods. This ensures that the Pods in the same group are scheduled into the same topology domain. The design works well under normal conditions. But when accelerator capacity is insufficient and the leader Pod stays stuck in Pending, you may see a confusing error:

admission webhook "vpod.kb.io" denied the request: follower pod node selector not set

Your first instinct may be to inspect the follower Pod’s node selector configuration, but that is not where the real problem lies. After spending several hours debugging this, I found that the error message is really a false lead. It hides the true root cause and makes you think something is wrong with the JobSet configuration, when in reality the cluster may simply lack the capacity needed to schedule the leader Pod. This misleading error appears because the validating webhook in pkg/webhooks/pod_webhook.go checks the follower’s NodeSelector before it verifies whether the leader has already been scheduled.

After tracking this down, I opened Issue #1187 and submitted two PRs to address it from different angles. This article draws on that real debugging experience to explain JobSet’s leader-follower scheduling model, clarify how the mutating and validating webhooks interact, and show why changing the validation order turns a misleading error into useful diagnostic information.

The code excerpts in this article are based on the JobSet main branch after the refactor in PR #1159. That PR merged logic previously split across pod_mutating_webhook.go and pod_admission_webhook.go into a single pod_webhook.go, using the controller-runtime admission.Defaulter and admission.Validator interfaces. In release-0.11 and earlier, the same logic lives in two separate files, but the validation-order issue is the same.

Overview of the JobSet Leader-Follower Mechanism

When JobSet handles TPU multi-slice workloads or any workload that requires exclusive placement, it uses a leader-follower scheduling model. The core idea is simple: the leader Pod (completion index 0) is scheduled first, and then, when follower Pods are created and enter the admission flow, the mutating webhook reads the topology label from the leader’s node and patches the corresponding nodeSelector onto the follower Pods. This ensures that the followers are scheduled into the same topology domain.

Mutating Webhook: Patching the Node Selector

JobSet’s mutating webhook (mpod.kb.io) intercepts every Pod creation request. For a follower Pod that uses exclusive placement, it calls setNodeSelector:

  1. It first calls leaderPodForFollower to find the corresponding leader Pod, then checks whether the leader’s Spec.NodeName has been set.
  2. If the leader has already been scheduled, the webhook calls topologyFromPod to read the topology label from the leader’s node, such as the node pool name, and patches that value into the follower Pod’s nodeSelector.
  3. But if the leader is still Pending and Spec.NodeName is empty, the mutating webhook returns nil immediately and does not patch anything. The code comments explicitly say that rejecting the follower is left to the validating webhook.
func (p *podWebhook) setNodeSelector(ctx context.Context, pod *corev1.Pod) error {
	log := ctrl.LoggerFrom(ctx)
	// Find leader pod (completion index 0) for this job.
	leaderPod, err := p.leaderPodForFollower(ctx, pod)
	if err != nil {
		log.Error(err, "finding leader pod for follower")
		// Return no error, validation webhook will reject creation of this follower pod.
		return nil
	}

	// If leader pod is not scheduled yet, return error to retry pod creation until leader is scheduled.
	if leaderPod.Spec.NodeName == "" {
		// Return no error, validation webhook will reject creation of this follower pod.
		return nil
	}

	// Get the exclusive topology value for the leader pod (i.e. name of nodepool, rack, etc.)
	topologyKey, ok := pod.Annotations[jobset.ExclusiveKey]
	if !ok {
		return fmt.Errorf("pod missing annotation: %s", jobset.ExclusiveKey)
	}
	topologyValue, err := p.topologyFromPod(ctx, leaderPod, topologyKey)
	if err != nil {
		log.Error(err, "getting topology from leader pod")
		return err
	}

	// Set node selector of follower pod so it's scheduled on the same topology as the leader.
	if pod.Spec.NodeSelector == nil {
		pod.Spec.NodeSelector = make(map[string]string)
	}
	log.V(2).Info(fmt.Sprintf("setting nodeSelector %s: %s to follow leader pod %s", topologyKey, topologyValue, leaderPod.Name))
	pod.Spec.NodeSelector[topologyKey] = topologyValue
	return nil
}

This behavior is expected. If the leader has not been scheduled yet, there is no way to know the topology label of the node it will land on, so there is nothing to patch.

Validating Webhook: Verifying Follower Pod Validity

After the mutating webhook finishes, the vpod.kb.io validating webhook runs ValidateCreate. In pkg/webhooks/pod_webhook.go, that validation flow contains two important checks:

  1. NodeSelector check: verifies that the follower Pod already has a node selector
  2. Leader scheduling check: verifies that the leader Pod has already been scheduled
// ValidateCreate validates that follower pods (job completion index != 0) part of a JobSet using exclusive
// placement are only admitted after the leader pod (job completion index == 0) has been scheduled.
func (p *podWebhook) ValidateCreate(ctx context.Context, pod *corev1.Pod) (admission.Warnings, error) {
	...
	// Do not validate anything else for leader pods, proceed with creation immediately.
	if placement.IsLeaderPod(pod) {
		return nil, nil
	}
	// If a follower pod node selector has not been set, reject the creation.
	if pod.Spec.NodeSelector == nil {
		return nil, fmt.Errorf("follower pod node selector not set")
	}
	if _, exists := pod.Spec.NodeSelector[topologyKey]; !exists {
		return nil, fmt.Errorf("follower pod node selector for topology domain not found. missing selector: %s", topologyKey)
	}
	// For follower pods, validate leader pod exists and is scheduled.
	leaderScheduled, err := p.leaderPodScheduled(ctx, pod)
	if err != nil {
		return nil, err
	}
	if !leaderScheduled {
		return nil, fmt.Errorf("leader pod not yet scheduled, not creating follower pod. this is an expected, transient error")
	}
	return nil, nil
}

The core problem is the order of these checks. If the validating webhook checks whether NodeSelector exists before it checks whether the leader has been scheduled, the error message no longer reflects the real problem. A more useful transient error is completely masked, and what the user sees instead is a misleading failure.

The Validation Order Hides the Real Error

What the Failure Looks Like

While running a multi-slice TPU training job on GKE, all follower Pods were rejected with the following message:

admission webhook "vpod.kb.io" denied the request: follower pod node selector not set

At the same time, the webhook server logs from the mutating webhook sometimes contained a much more informative message. When the leader Pod had not yet been created, leaderPodForFollower returned an error that was logged as:

ERROR  admission  finding leader pod for follower
{"error": "expected 1 leader pod (example-job-0-0), but got 0. this is an expected, transient error"}

The difference matters. In most event output, users see only the first error returned by the validating webhook and assume the follower Pod is misconfigured. The second message, which comes from the mutating webhook, is the one with real diagnostic value. In stricter environments, detailed mutating webhook logs may be visible only to cluster admins, which makes it even easier for ordinary users to be misled by the admission error.

Root Cause: The Validation Order

Here is what happens when the leader Pod is stuck in Pending because the cluster does not have enough accelerator capacity:

  1. A follower Pod is created and enters the admission flow.
  2. The mutating webhook’s setNodeSelector sees that leaderPod.Spec.NodeName is empty and returns nil, so no patch is applied. Rejection is intentionally left to the validating webhook.
  3. The follower Pod enters the validating webhook via ValidateCreate.
  4. The validating webhook checks NodeSelector first, sees that it is missing, and immediately rejects the Pod with follower pod node selector not set.
  5. The leader scheduling check never runs, because execution already stopped at step 4.

As a result, the error message ends up pointing to a downstream symptom rather than the actual root cause:

Insufficient infrastructure capacity (not enough TPU/GPU resources)
└── Leader pod stays Pending (not scheduled)
    └── Mutating webhook correctly skips the patch (expected behavior)
        └── Follower pod is created without a node selector
            └── Validating webhook checks NodeSelector first and rejects it
                ├── User sees: "follower pod node selector not set"
                └── Useful transient error is completely masked ✘

JobSet already contains a well-designed transient error for this situation, but because of the validation order, it never gets surfaced. If the checks were reversed and the leader’s scheduling state were validated first, the user would see:

leader pod not yet scheduled, not creating follower pod. this is an expected, transient error

That message points you directly to the leader Pod’s scheduling state instead of sending you to the wrong place.

The diagram below compares three scenarios: the normal flow, where the leader is scheduled first; the current failure mode, where the validation order masks the useful error; and the corrected flow in the proposed PR, where reordering the checks surfaces the transient error properly.

Troubleshooting the Issue

If you run into this error yourself, the key is simple: do not let the error message send you in the wrong direction. Check the upstream leader Pod first.

First, confirm that the leader Pod exists and has been scheduled:

kubectl get pods -n <namespace> | grep <jobset-name>.*-0-0

If the leader Pod is still Pending, inspect it with describe to see why:

kubectl describe pod <leader-pod-name> -n <namespace>

You will usually find events like this, example:

Warning  FailedScheduling  0/10 nodes are available: insufficient google.com/tpu resources

At that point, the root cause for this example becomes clear. The issue is not the follower Pod configuration. The cluster simply does not have enough infrastructure capacity, so the leader cannot be scheduled. The fix is straightforward: make sure the cluster has enough accelerator resources for the leader Pod. Once the leader is successfully scheduled, the mutating webhook will automatically patch the correct node selector onto the follower Pods, and the validating webhook will pass as expected.

The PRs: Improvements in Docs and Code

After identifying the real root cause, I opened Issue #11871 and submitted two PRs to address the problem from different angles:

  • PR #11892 (Documentation): adds troubleshooting guidance to the official docs. It has already been merged.
  • PR #11883 (Code fix): reorders the validating webhook checks so the more helpful transient error can be surfaced correctly.

PR #1189: Add Troubleshooting Documentation

The first PR improves the documentation. It adds a new troubleshooting section to the official JobSet guide4, explaining that when you see follower pod node selector not set, you should first inspect the leader Pod’s scheduling state instead of changing the follower configuration.

PR #1188: Reorder the Validation Checks

The second PR is the real code-level fix. In pkg/webhooks/pod_webhook.go, it changes the order of the checks inside ValidateCreate. Once merged, when the leader Pod has not been scheduled yet, users will see:

leader pod not yet scheduled, not creating follower pod. this is an expected, transient error

instead of the original misleading follower pod node selector not set.

Conclusion

In distributed systems, bugs with a long chain of cause and effect are common. In the Kubernetes ecosystem, the execution order and interaction between mutating and validating webhooks make it especially easy for the visible symptom to drift away from the real root cause. This example shows why that interaction matters, and how a small change in validation order can turn a misleading error into a much more useful diagnostic signal.

References

Eason Cao
Eason Cao Eason is an engineer working at FANNG and living in Europe. He was accredited as AWS Professional Solution Architect, AWS Professional DevOps Engineer and CNCF Certified Kubernetes Administrator. He started his Kubernetes journey in 2017 and enjoys solving real-world business problems.
comments powered by Disqus