Solving Stuck GKE Upgrades: The Hidden Admission Webhook
Keeping a Google Kubernetes Engine (GKE) cluster up to date is part of normal operations. Upgrades patch security issues, keep the control plane healthy, and unlock new Kubernetes features. In practice, though, they can still stall with confusing messages in Logs Explorer, such as Internal error or DeployPatch failed [1].
In many of those cases, the real blocker is an admission webhook installed in the cluster. A third-party or custom webhook, such as Gatekeeper or Kyverno, can accidentally intercept the system-level resource changes that GKE needs to make during a control-plane upgrade. When that happens, the upgrade fails even though the webhook was never meant to block GKE itself.
This article is a focused follow-up to Managing Kubernetes Webhook Failures: From Diagnosis to Solutions. It explains why a poorly scoped webhook can break a GKE upgrade, then walks through fixes from least to most disruptive.
Why Admission Webhooks Can Block a GKE Upgrade
During a control-plane upgrade, GKE recreates core control-plane components and reconciles system resources such as ClusterRoles and ClusterRoleBindings. If your webhook matches those requests and uses a strict policy, the API server must call the webhook before it can finish the update.
That becomes risky during an upgrade because several things can change at once:
- the control plane is being restarted or reconfigured;
- webhook Pods may restart or become temporarily unavailable;
- konnectivity-agent scheduling or availability can be affected, causing errors such as No agent available;
- network paths between the control plane and the webhook service can be disrupted.
If the webhook uses failurePolicy: Fail, any failed call is treated as an admission failure. The API server then rejects the request, and the upgrade can stall.
GKE’s troubleshooting guide calls out this pattern directly and recommends making sure webhooks do not intercept requests for system resources with the system: prefix [1].
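To find candidate webhooks, it can help to list every webhook configuration in the cluster next to its failure policy. The following is a minimal sketch, not an official GKE procedure: the function name list_webhook_policies is hypothetical, and it assumes kubectl is already pointed at the affected cluster.

```shell
# Hypothetical helper: list all admission webhook configurations together
# with the name and failurePolicy of each webhook entry they contain.
list_webhook_policies() {
  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
    -o custom-columns='KIND:.kind,NAME:.metadata.name,WEBHOOKS:.webhooks[*].name,FAILURE_POLICY:.webhooks[*].failurePolicy'
}
```

Calling list_webhook_policies prints one row per configuration. Entries that show Fail are the ones worth checking first for rules broad enough to match system: resources.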
Fixes, From Least to Most Disruptive
1. Use CEL matchConditions to Skip System-Prefixed Requests
In Kubernetes v1.30 and later, matchConditions on MutatingWebhookConfiguration and ValidatingWebhookConfiguration let you filter requests with CEL before the webhook is called [2].
For webhooks that can match system resources, a common pattern is to exclude GKE-managed resources whose names start with system::
webhooks:
- name: validate.your-webhook-name
  matchConditions:
  - name: "exclude-system-prefixes"
    expression: "!request.name.startsWith('system:')"
This filters the request at the API server layer. When the resource name starts with system:, the API server skips the webhook so GKE system components are not blocked during the upgrade.
Adjust the expression to match your own webhook scope. The key idea is simple: keep strict validation for user workloads, but skip the system-level paths that GKE needs during upgrade.
This is the lowest-impact option because it preserves failurePolicy: Fail for everything else. If you deploy the webhook with Helm, Argo CD, or another GitOps pipeline, make the change in the source of truth as well; otherwise the next sync can overwrite it during the upgrade window.
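After rolling out the change, it is worth confirming that the exclusion is actually present on the live object, especially when a GitOps tool manages the webhook. The sketch below is one way to check; show_match_conditions is a hypothetical helper name, and the kind and name arguments are placeholders for your own webhook configuration.

```shell
# Hypothetical helper: print each webhook entry of a configuration next to
# the CEL expressions of its matchConditions (blank if none are set).
# $1 = resource kind, e.g. validatingwebhookconfiguration
# $2 = name of the webhook configuration
show_match_conditions() {
  kubectl get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.matchConditions[*].expression}{"\n"}{end}'
}

# Example: show_match_conditions validatingwebhookconfiguration [NAME]
```

If a webhook entry prints with nothing after the tab, no matchConditions are set on it and the API server will still call it for every request its rules match.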
2. Temporarily Set failurePolicy to Ignore
If you cannot roll out matchConditions immediately, a temporary workaround that is still safer than deleting the webhook is to switch it to Ignore during the upgrade window.
webhooks:
- name: validate.your-webhook-name
  failurePolicy: Ignore
With Ignore, an error calling the webhook, such as a timeout or an unreachable service, is ignored and the API request is allowed to continue [3].
Use this only as a temporary bridge. Once the upgrade completes and the cluster is stable again, switch the policy back to Fail so the webhook continues to enforce the intended controls.
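If editing the manifest is not practical during an incident, the same switch can be made in place with a JSON patch. This is a sketch under assumptions: set_failure_policy is a hypothetical helper, and the index argument assumes you already know which entry under webhooks you want to change. A GitOps controller may revert the patch on its next sync.

```shell
# Hypothetical helper: set failurePolicy on one webhook entry via a JSON patch.
# $1 = resource kind, e.g. validatingwebhookconfiguration
# $2 = name of the webhook configuration
# $3 = zero-based index of the entry under .webhooks
# $4 = Ignore or Fail
set_failure_policy() {
  kubectl patch "$1" "$2" --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/webhooks/$3/failurePolicy\",\"value\":\"$4\"}]"
}

# Before the upgrade: set_failure_policy validatingwebhookconfiguration [NAME] 0 Ignore
# After the upgrade:  set_failure_policy validatingwebhookconfiguration [NAME] 0 Fail
```

Using the same helper for both directions keeps the rollback a single command, which makes it less likely that the webhook stays on Ignore after the upgrade window.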
3. Remove the Blocking Webhook Configuration Temporarily
If the upgrade is urgent and you do not have time to adjust the policy safely, the last resort is to delete the webhook configuration, finish the upgrade, and then reapply it.
# Back up and delete a mutating webhook configuration
kubectl get MutatingWebhookConfiguration [NAME] -o yaml > mutating-webhook-config.yaml
kubectl delete MutatingWebhookConfiguration [NAME]
# Back up and delete a validating webhook configuration
kubectl get ValidatingWebhookConfiguration [NAME] -o yaml > validating-webhook-config.yaml
kubectl delete ValidatingWebhookConfiguration [NAME]
This usually unblocks the upgrade, but it is the most disruptive option because the cluster loses that webhook’s protection while it is removed. Restore the configuration as soon as the upgrade is done.
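Restoring from the backup can be sketched as follows. One assumption to flag: a manifest saved with kubectl get -o yaml still carries server-populated metadata such as resourceVersion and uid, and a create request that includes resourceVersion may be rejected, so this sketch strips those fields first. Both helper names are hypothetical.

```shell
# Hypothetical helper: drop server-populated metadata from a manifest read
# on stdin. Relies on the two-space indentation kubectl emits for fields
# directly under metadata.
strip_server_fields() {
  sed -e '/^  resourceVersion:/d' \
      -e '/^  uid:/d' \
      -e '/^  creationTimestamp:/d' \
      -e '/^  generation:/d'
}

# Hypothetical helper: recreate a webhook configuration from its backup file.
# $1 = path to the backup, e.g. validating-webhook-config.yaml
restore_webhook_config() {
  strip_server_fields < "$1" | kubectl apply -f -
}
```

For example, restore_webhook_config validating-webhook-config.yaml reapplies the validating configuration saved earlier. If the webhook is normally deployed through Helm or a GitOps pipeline, re-syncing from that source is usually the cleaner way to restore it.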
Quick Comparison
| Option | Risk | Best Use Case |
|---|---|---|
| CEL matchConditions exclusion | Low | Best default when the cluster version supports it |
| failurePolicy: Ignore | Medium | Short-term workaround during the upgrade window |
| Delete the webhook configuration | High | Emergency last resort |
Takeaways
Admission webhooks are an important safety layer, but they can also become the hidden reason a GKE upgrade gets stuck. The safest pattern is to make your webhook scope explicit and keep system resources out of the admission path when your policy does not need to inspect them.
- Exclude Google-managed namespaces such as kube-system and kube-node-lease when your webhook does not need to inspect them.
- Prefer CEL matchConditions when your cluster version supports them.
- Exclude GKE-managed system resources, especially resources with the system: prefix.
- Use failurePolicy: Ignore or deletion only as temporary recovery steps.
If you treat webhook scope as part of your upgrade design, you can keep the cluster secure without turning a control-plane upgrade into an incident.
References