Solving Stuck GKE Upgrades: The Hidden Admission Webhook

Keeping a Google Kubernetes Engine (GKE) cluster up to date is part of normal operations. Upgrades patch security issues, keep the control plane healthy, and unlock new Kubernetes features. But in practice, they can still stall with confusing messages in Logs Explorer, such as Internal error or DeployPatch failed¹.

In many of those cases, the real blocker is an admission webhook installed in the cluster. A third-party or custom webhook, such as Gatekeeper or Kyverno, can accidentally intercept the system-level resource changes that GKE needs to make during a control-plane upgrade. When that happens, the upgrade fails even though the webhook was never meant to block GKE itself.

This article is a focused follow-up to Managing Kubernetes Webhook Failures: From Diagnosis to Solutions. It explains why a poorly scoped webhook can break a GKE upgrade, then walks through fixes from least to most disruptive.

Why Admission Webhooks Can Block a GKE Upgrade

During a control-plane upgrade, GKE recreates core control-plane components and reconciles system resources such as ClusterRoles and ClusterRoleBindings. If your webhook's rules match those requests, the API server must call the webhook and receive a response before it can persist each change.

That becomes risky during an upgrade because several things can change at once:

the control plane is being restarted or reconfigured;
webhook Pods may restart or become temporarily unavailable;
konnectivity-agent scheduling or availability can be affected, causing errors such as No agent available;
network paths between the control plane and the webhook service can be disrupted.

If the webhook uses failurePolicy: Fail, any failed call is treated as an admission failure. The API server then rejects the request, and the upgrade can stall.

GKE’s troubleshooting guide calls out this pattern directly and recommends making sure webhooks do not intercept requests for system resources with the system: prefix¹.
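To check whether an existing configuration fits this pattern, a small audit script can help. The sketch below is an illustration, not a GKE tool: `risky_webhooks` and `rule_matches_rbac` are hypothetical helper names, the webhook names in the sample are made up, and the input is assumed to have the shape of `kubectl get validatingwebhookconfiguration [NAME] -o json`. It flags webhooks that fail closed, carry no matchConditions, and have rules that can match ClusterRoles or ClusterRoleBindings.

```python
RBAC_GROUP = "rbac.authorization.k8s.io"
RBAC_RESOURCES = {"*", "*/*", "clusterroles", "clusterrolebindings"}


def rule_matches_rbac(rule):
    """Return True if a webhook rule can match ClusterRoles or ClusterRoleBindings."""
    groups = rule.get("apiGroups", [])
    resources = rule.get("resources", [])
    scope = rule.get("scope", "*")  # "*" is the API default
    group_hit = "*" in groups or RBAC_GROUP in groups
    resource_hit = any(r in RBAC_RESOURCES for r in resources)
    scope_hit = scope in ("*", "Cluster")
    return group_hit and resource_hit and scope_hit


def risky_webhooks(config):
    """Return names of webhooks that fail closed on RBAC objects with no CEL filter."""
    risky = []
    for hook in config.get("webhooks", []):
        # admissionregistration.k8s.io/v1 defaults failurePolicy to Fail.
        fails_closed = hook.get("failurePolicy", "Fail") == "Fail"
        # Heuristic: any matchConditions entry counts as "filtered"; this
        # does not verify that the expression really excludes system: names.
        unfiltered = not hook.get("matchConditions")
        matches = any(rule_matches_rbac(r) for r in hook.get("rules", []))
        if fails_closed and unfiltered and matches:
            risky.append(hook["name"])
    return risky


# Made-up sample: one catch-all webhook and one that only watches Pods.
sample = {
    "webhooks": [
        {
            "name": "validate.broad.example.com",
            "failurePolicy": "Fail",
            "rules": [{"apiGroups": ["*"], "resources": ["*"], "operations": ["*"]}],
        },
        {
            "name": "validate.pods.example.com",
            "failurePolicy": "Fail",
            "rules": [{"apiGroups": [""], "resources": ["pods"], "operations": ["CREATE"]}],
        },
    ]
}

print(risky_webhooks(sample))  # ['validate.broad.example.com']
```

A flagged webhook is only a candidate for review. Whether it actually blocked an upgrade still has to be confirmed from the upgrade logs.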

Fixes, From Least to Most Disruptive

1. Use CEL `matchConditions` to Skip System-Prefixed Requests

matchConditions on MutatingWebhookConfiguration and ValidatingWebhookConfiguration let you filter requests with CEL before the webhook is called². The field is stable in Kubernetes v1.30 and later, and was available as a beta feature, enabled by default, from v1.28.

For webhooks that can match system resources, a common pattern is to exclude GKE-managed resources whose names start with the system: prefix:

webhooks:
  - name: validate.your-webhook-name
    matchConditions:
      - name: "exclude-system-prefixes"
        expression: "!request.name.startsWith('system:')"

This filters the request at the API server layer. When the resource name starts with system:, the API server skips the webhook so GKE system components are not blocked during the upgrade.

Adjust the expression to match your own webhook scope. The key idea is simple: keep strict validation for user workloads, but skip the system-level paths that GKE needs during upgrade.
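If a configuration holds several webhooks, adding the condition by hand is easy to get wrong. The sketch below shows one way to add it idempotently to a configuration loaded as JSON, together with a plain-Python mirror of the CEL expression for reasoning about which names are skipped. The function names are hypothetical, and the Python predicate only imitates the expression; the API server evaluates the CEL itself.

```python
import copy

# Same condition as in the YAML above.
EXCLUDE_SYSTEM = {
    "name": "exclude-system-prefixes",
    "expression": "!request.name.startsWith('system:')",
}


def webhook_is_called(resource_name):
    """Plain-Python mirror of the CEL expression, for reasoning only."""
    return not resource_name.startswith("system:")


def add_system_exclusion(config):
    """Return a copy of config with the exclusion present on every webhook."""
    patched = copy.deepcopy(config)
    for hook in patched.get("webhooks", []):
        conditions = hook.get("matchConditions") or []
        # Skip webhooks that already carry a condition with this name.
        if all(c.get("name") != EXCLUDE_SYSTEM["name"] for c in conditions):
            conditions.append(dict(EXCLUDE_SYSTEM))
        hook["matchConditions"] = conditions
    return patched


config = {"webhooks": [{"name": "validate.your-webhook-name"}]}
patched = add_system_exclusion(config)
print(patched["webhooks"][0]["matchConditions"])
print(webhook_is_called("system:node"), webhook_is_called("team-a-admin"))
```

Because the function returns a copy and checks for an existing condition by name, running it twice produces the same result, which makes it easier to fold into a GitOps pre-commit step.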

This is the lowest-impact option because it preserves failurePolicy: Fail for everything else. If you deploy the webhook with Helm, Argo CD, or another GitOps pipeline, make the change in the source of truth as well, so a sync does not revert it during the upgrade window.

2. Temporarily Set `failurePolicy` to `Ignore`

If you cannot roll out matchConditions immediately, a temporary workaround that is safer than deleting the webhook is to switch failurePolicy to Ignore during the upgrade window.

webhooks:
  - name: validate.your-webhook-name
    failurePolicy: Ignore

With Ignore, the API server skips the webhook when the call fails or cannot reach the service, and the request is allowed to continue³.

Use this only as a temporary bridge. Once the upgrade completes and the cluster is stable again, switch the policy back to Fail so the webhook continues to enforce the intended controls.
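The difference between the two policies can be written as a small decision function. This is a simplified model for illustration, with a hypothetical function name and outcome labels; it is not how the API server is implemented. It also makes one limit of this workaround visible: Ignore only covers failed calls, so a webhook that is reachable and explicitly denies a request still blocks it.

```python
def admission_result(failure_policy, webhook_outcome):
    """Model how one webhook call affects a request.

    webhook_outcome is "allowed", "denied", or "error", where "error"
    stands for a timeout, a missing endpoint, or a connection failure.
    """
    if failure_policy not in ("Fail", "Ignore"):
        raise ValueError(f"unknown failurePolicy: {failure_policy}")
    if webhook_outcome == "allowed":
        return "admit"
    if webhook_outcome == "denied":
        # An explicit denial is honored under either policy.
        return "reject"
    if webhook_outcome == "error":
        return "admit" if failure_policy == "Ignore" else "reject"
    raise ValueError(f"unknown outcome: {webhook_outcome}")


for policy in ("Fail", "Ignore"):
    for outcome in ("allowed", "denied", "error"):
        print(policy, outcome, "->", admission_result(policy, outcome))
```

Only the error row changes between the two policies, which is why this option helps when the webhook is unreachable during the upgrade but not when its policy rejects the system change.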

3. Remove the Blocking Webhook Configuration Temporarily

If the upgrade is urgent and you do not have time to adjust the policy safely, the last resort is to delete the webhook configuration, finish the upgrade, and then reapply it.

# Back up and delete a mutating webhook configuration
kubectl get MutatingWebhookConfiguration [NAME] -o yaml > mutating-webhook-config.yaml
kubectl delete MutatingWebhookConfiguration [NAME]

# Back up and delete a validating webhook configuration
kubectl get ValidatingWebhookConfiguration [NAME] -o yaml > validating-webhook-config.yaml
kubectl delete ValidatingWebhookConfiguration [NAME]

This usually unblocks the upgrade, but it is the most disruptive option because the cluster loses that webhook’s protection while it is removed. Restore the configuration as soon as the upgrade is done.
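A backup taken with kubectl get also contains fields that the API server populated, such as metadata.resourceVersion and metadata.uid. Depending on how you reapply the file, these can get in the way; a create request that carries a resourceVersion is typically rejected, for example. The sketch below strips them before restoring. It assumes the backup was taken with `-o json` rather than `-o yaml` so that only the standard library is needed, and `clean_for_reapply`, the sample object, and its values are hypothetical.

```python
import copy
import json

# metadata fields set by the API server rather than by the author
SERVER_FIELDS = ("resourceVersion", "uid", "creationTimestamp", "generation", "managedFields")


def clean_for_reapply(manifest):
    """Return a copy of the manifest without server-populated metadata."""
    cleaned = copy.deepcopy(manifest)
    metadata = cleaned.get("metadata", {})
    for field in SERVER_FIELDS:
        metadata.pop(field, None)
    return cleaned


# Stand-in for the content of a backup file taken with `-o json`.
backup = json.loads("""
{
  "apiVersion": "admissionregistration.k8s.io/v1",
  "kind": "ValidatingWebhookConfiguration",
  "metadata": {
    "name": "example-policy",
    "resourceVersion": "12345",
    "uid": "00000000-0000-0000-0000-000000000000",
    "labels": {"app": "example"}
  },
  "webhooks": []
}
""")

print(json.dumps(clean_for_reapply(backup), indent=2))
```

If the webhook is managed by Helm or a GitOps tool, reapplying from that source is usually the simpler path, and the backup serves only as a fallback.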

Quick Comparison

Option	Risk	Best Use Case
CEL `matchConditions` exclusion	Low	Best default when the cluster version supports it
`failurePolicy: Ignore`	Medium	Short-term workaround during the upgrade window
Delete the webhook configuration	High	Emergency last resort

Takeaways

Admission webhooks are an important safety layer, but they can also become the hidden reason a GKE upgrade gets stuck. The safest pattern is to make your webhook scope explicit and keep system resources out of the admission path when your policy does not need to inspect them.

Exclude Google-managed namespaces such as kube-system and kube-node-lease when your webhook does not need to inspect them.
Prefer CEL matchConditions when your cluster version supports them.
Exclude GKE-managed system resources, especially resources with the system: prefix.
Use failurePolicy: Ignore or deletion only as temporary recovery steps.
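For the namespace exclusion in the first takeaway, one common approach is a namespaceSelector on the webhook. The sketch below builds such a selector for a single webhook entry. It assumes the `kubernetes.io/metadata.name` label, which current Kubernetes versions set on every namespace automatically; the function name is hypothetical. Note that a namespaceSelector does not skip cluster-scoped objects such as ClusterRoles, so it complements the system: prefix exclusion rather than replacing it.

```python
import copy

EXCLUDED_NAMESPACES = ["kube-system", "kube-node-lease"]
NAME_LABEL = "kubernetes.io/metadata.name"


def exclude_system_namespaces(hook):
    """Return a copy of one webhook entry that skips the excluded namespaces."""
    patched = copy.deepcopy(hook)
    selector = patched.get("namespaceSelector") or {}
    expressions = selector.get("matchExpressions") or []
    # Multiple matchExpressions are ANDed, so existing entries stay in force.
    expressions.append(
        {"key": NAME_LABEL, "operator": "NotIn", "values": list(EXCLUDED_NAMESPACES)}
    )
    selector["matchExpressions"] = expressions
    patched["namespaceSelector"] = selector
    return patched


hook = {"name": "validate.your-webhook-name"}
print(exclude_system_namespaces(hook)["namespaceSelector"])
```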

If you treat webhook scope as part of your upgrade design, you can keep the cluster secure without turning a control-plane upgrade into an incident.

References

30 May 2026


Eason Cao

Eason is an engineer working at FANNG and living in Europe. He was accredited as AWS Professional Solution Architect, AWS Professional DevOps Engineer and CNCF Certified Kubernetes Administrator. He started his Kubernetes journey in 2017 and enjoys solving real-world business problems.
