EKS vs. GKE: Which Scales Better? A Deep Performance Comparison of Kubernetes at Scale (100K+ Nodes)

Explosive growth in AI training is pushing Kubernetes clusters to unprecedented scale. When a single training job may require tens of thousands of GPUs to work in concert, “how large can one cluster get?” stops being a theoretical question and becomes a practical infrastructure decision with real cost implications.

By late 2025, Google Cloud and AWS had each offered a very different answer: GKE demonstrated a 130,000-node cluster [1], while EKS made support for 100,000-node ultra-scale clusters generally available [2]. Both target the next generation of massive AI workloads, but they get there through very different architectural choices. GKE replaces etcd with Spanner, while EKS keeps etcd and deeply re-engineers it [3].

This article compares the two approaches across scale, storage architecture, scheduling, node lifecycle operations, and operational tradeoffs.

Architecture Overview

```mermaid
flowchart TD
    subgraph GKE["☁️ GKE — 130K Nodes"]
        direction TB
        GA["API Server"] --> GS["Spanner"]
        GS --> GS1["Distributed by design<br>No sharding required"]
        GS --> GS2["1M+ Objects<br>13,000 QPS"]
        GA --> GK["Kueue"]
        GK --> GK1["Gang Scheduling<br>Preempt 39K Pods in 93s"]
        GA --> GN["Node Management"]
        GN --> GN1["Node Problem Detector<br>+ Auto-repair"]
    end
    subgraph EKS["☁️ EKS — 100K Nodes"]
        direction TB
        EA["API Server"] --> ES["etcd (deeply modified)"]
        ES --> ES1["Raft consensus offloaded<br>BoltDB on tmpfs"]
        ES --> ES2["Key-space partitioning<br>10M+ Objects / 32 GB"]
        EA --> EK["Karpenter"]
        EK --> EK1["Node Autoscaling<br>2,000 nodes/min"]
        EA --> EM["Node Management"]
        EM --> EM1["Auto-repair<br>100 min rolling update"]
    end
```

Scale Comparison

Based on public information published by Google Cloud and AWS, the table below summarizes the most important comparison points:

| Metric | GKE (130K nodes) | EKS (100K nodes) |
| --- | --- | --- |
| Pod scheduling throughput | ~1,000 pods/sec | ~500 pods/sec (staged deployment, no single-batch metric) |
| Time to start 130K Pods | 3 min 40 sec | No corresponding published number |
| Node join rate | Not explicitly highlighted | 2,000 nodes/min (100K in 50 minutes) |
| Storage backend | Spanner (distributed) | etcd (BoltDB moved from EBS to tmpfs to reduce I/O latency) |
| Database object count | 1,000,000+ | 10,000,000+ |
| Lease update QPS | 13,000 QPS | Not explicitly published |
| Gang scheduling | Yes, with Kueue (preempted 39K Pods in 93 sec) | Not discussed |
| Automatic node repair | Not emphasized in the benchmark report | Yes |
| 100K-node rolling update | Not emphasized in the benchmark report | ~100 min (via Karpenter disruption budgets) |

On the headline numbers alone, GKE comes out ahead in both scheduling speed and absolute node count. More importantly, GKE appears to absorb more complexity into the platform, while EKS exposes more of it to the operator. For EKS, that exposed complexity falls into two layers:

  • AWS platform-level modifications: key-space partitioning in etcd, moving BoltDB onto tmpfs, offloading Raft consensus into an internal AWS journal system, and fixing informer cache lock contention inside the API server path. These are control-plane changes made by AWS, not something users configure themselves.
  • User-level tuning: Karpenter disruption budget settings, baking SOCI Snapshotter into the node AMI, container image pull strategy, and workload-specific resource tuning. These remain the user’s responsibility if they want to hit the upper end of platform performance.
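To make the user-level layer concrete, here is a sketch of a Karpenter NodePool that caps how much of the fleet may be voluntarily disrupted at once, written against the `karpenter.sh/v1` API. The pool name, the specific budgets, and the EC2NodeClass reference are illustrative, not values from the EKS report:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: training            # illustrative name
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    budgets:
    - nodes: "10%"                  # never disrupt more than 10% of nodes at once
    - nodes: "0"                    # freeze voluntary disruptions during peak hours
      schedule: "0 9 * * mon-fri"
      duration: 8h
  template:
    spec:
      nodeClassRef:                 # illustrative EC2NodeClass reference
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
```

Budgets like these are exactly the knob that determines how fast a large rolling update can proceed without destabilizing running workloads.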

In short, GKE hides more complexity inside the platform, while EKS exposes more tuning freedom and more tuning responsibility to the user. That starts with their fundamentally different storage architectures.

Storage Architecture

This is the single most important architectural difference in the comparison, and likely the biggest factor shaping each platform’s scalability ceiling.

GKE: Spanner

GKE takes the bolder path: it swaps out etcd as the Kubernetes control-plane backing store and uses Google’s Spanner instead [1]. Spanner is a distributed database designed for large-scale consistency and horizontal growth, so GKE avoids fighting etcd’s usual bottlenecks and starts from a fundamentally different baseline.

In the published benchmark, GKE reached 13,000 QPS for lease updates at 130K nodes, without obvious signs of hitting a hard wall. The control plane also handled more than 1 million objects, which is comparatively routine territory for Spanner.

The advantage of this design is lower engineering pressure around sharding strategy and memory optimization, because Spanner already solves distributed consistency and scale-out at the database layer. The cost is distance from upstream Kubernetes: standard Kubernetes assumes etcd, so replacing it with Spanner makes the GKE control plane a far more customized implementation.

EKS: Reworked etcd

EKS takes the opposite route. It retains etcd as the control-plane store, but heavily modifies the surrounding implementation:

  • Raft consensus offload: etcd’s Raft coordination is pushed into an internal AWS journal system to reduce consensus latency
  • BoltDB on memory-backed storage: the underlying BoltDB engine is moved from disk to tmpfs, dramatically reducing I/O latency
  • Key-space partitioning: etcd’s key space is partitioned to achieve up to a 5x increase in write throughput
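The tmpfs change is easy to build intuition for: an `fsync` against memory-backed storage returns far faster than one against a real disk. The snippet below is a quick, illustrative measurement, not AWS's implementation; note that on some Linux distributions `/tmp` is itself tmpfs, which flattens the comparison:

```python
import os
import tempfile
import time

def fsync_latency(directory: str, writes: int = 200, size: int = 4096) -> float:
    """Average seconds per write+fsync of `size` bytes in `directory`."""
    payload = os.urandom(size)
    path = os.path.join(directory, "bench.bin")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(writes):
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write to the backing store
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed / writes

# /dev/shm is tmpfs on most Linux systems; fall back to the default tmp dir elsewhere.
ram_dir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
disk = fsync_latency(tempfile.gettempdir())
ram = fsync_latency(ram_dir)
print(f"disk-backed fsync: {disk * 1e6:.0f} µs, tmpfs fsync: {ram * 1e6:.0f} µs")
```

Since etcd fsyncs its write-ahead log on every commit, shaving that latency directly raises write throughput; the tradeoff is that tmpfs is volatile, which is presumably why AWS pairs it with the external journal for durability.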

With those changes, EKS reports an etcd footprint of 32 GB across partitions and a database object count above 10 million [3]. Even so, because these two benchmark reports do not use perfectly comparable conditions or reporting methodology, the safer interpretation is that EKS publicly demonstrated a higher object-count order of magnitude, not that it definitively has 10x the storage capacity of GKE.

The benefit of staying with etcd is that EKS remains closer to upstream Kubernetes interfaces and is less likely to run into conformance surprises. The tradeoff is higher engineering complexity, and every deep modification requires substantial validation to preserve stability.

Scheduling

At ultra-large scale, storage architecture determines whether the control plane can stay upright. But for AI training workloads, scheduler behavior has a more direct effect on GPU utilization and end-to-end training efficiency. Traditional Kubernetes scheduling is essentially one-pod-at-a-time. That works well for many microservice workloads, but it becomes problematic for large-scale distributed training, where partial placement can easily waste expensive accelerator capacity.

Imagine a distributed training job that needs 1,000 GPUs. If the scheduler places only 800 and the rest of the capacity is consumed elsewhere, those 800 GPUs may sit idle waiting for the rest of the gang to arrive. At cluster sizes in the tens or hundreds of thousands of nodes, that kind of fragmentation gets amplified quickly.

That is why gang scheduling matters so much for AI training: a set of Pods must start together, or not at all.

GKE shows a clear advantage here. By integrating Kueue, the Kubernetes-native queueing framework, GKE demonstrated the ability to preempt 39,000 Pods in 93 seconds to free capacity for higher-priority training jobs [1]. That number matters in real AI environments. If you need to reclaim capacity for an urgent training run, 93 seconds is very different from waiting several minutes or longer.

By contrast, EKS’s public materials do not describe a gang scheduling design or benchmark. That does not mean EKS cannot support gang scheduling, since Kueue can also run on EKS. It does mean GKE currently has the stronger out-of-the-box integration story and a published large-scale validation point.
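For readers who have not used Kueue, a minimal setup sketch looks like the following on any conformant cluster, GKE or EKS alike. The queue names and the GPU quota are illustrative; the kinds and `v1beta1` API group are Kueue's:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: training-cq          # illustrative name
spec:
  namespaceSelector: {}      # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 1000   # a job's full GPU demand must fit before admission
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training-lq          # illustrative name
  namespace: default
spec:
  clusterQueue: training-cq
```

Jobs are then submitted suspended with the `kueue.x-k8s.io/queue-name: training-lq` label; Kueue only unsuspends a job once its entire resource request fits within the quota, which is what gives the all-or-nothing admission behavior gang workloads need.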

Node Lifecycle Operations

If scheduling is where GKE looks strongest, then Day-2 node operations are where EKS stands out more clearly. EKS published a more complete set of node lifecycle results:

  • Node join rate: 2,000 nodes per minute, reaching 100K nodes in 50 minutes
  • Automatic node repair: failed nodes are detected and repaired automatically
  • 100K-node rolling update: the full cluster rolling update completes in about 100 minutes using Karpenter disruption budgets
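The published rates are internally consistent, as a quick back-of-envelope check shows (a sketch; the input figures are taken directly from the EKS report):

```python
# Back-of-envelope arithmetic on the EKS-published node lifecycle figures.
target_nodes = 100_000

# Node join: 2,000 nodes/min implies 50 minutes to reach 100K nodes.
join_rate_per_min = 2_000
join_minutes = target_nodes / join_rate_per_min
print(f"join time: {join_minutes:.0f} min")  # 50 min

# Rolling update: ~100 minutes for the full cluster implies an effective
# replacement rate of ~1,000 nodes/min, i.e. roughly 1% of the fleet in
# flight per minute under the disruption budgets used.
update_minutes = 100
replace_rate_per_min = target_nodes / update_minutes
print(f"effective replacement rate: {replace_rate_per_min:.0f} nodes/min")
```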

The test environment also preloaded SOCI Snapshotter into the node AMI to defer container image loading and reduce burst bandwidth pressure during large-scale startup. At the API server layer, the EKS team also benefited from strongly consistent reads from cache introduced in v1.31, and contributed fixes upstream for informer cache lock contention (PR #132132 and #130767) [3].

GKE published relatively little Day-2 operational detail in the 130K-node benchmark, but that should not be read as evidence that GKE is weak at node repair or automation. More likely, the benchmark narrative was optimized around control-plane throughput and scaling limits rather than full production lifecycle operations.

In practice, GKE already includes several platform-native capabilities that cover the same general class of auto-repair scenarios emphasized in the EKS report:

  • Node Problem Detector plus automatic replacement: GKE enables Node Problem Detector by default, allowing nodes to report common OS, kubelet, and runtime failures. When health remains bad or recovery fails, managed node pool behavior can drain and recreate the node automatically.
  • Managed Instance Groups (MIGs) and node pool control loops: for many VM-based GKE node pools, the underlying managed instance group is what actually preserves desired node count and replaces unhealthy nodes.
  • Controlled rolling replacement with auto-upgrade and auto-repair: GKE already has node upgrade and replacement mechanisms built around cordon, drain, PodDisruptionBudget behavior, and graceful termination defaults. Google simply did not position “how long does a 100K-node rolling update take?” as a headline benchmark metric in this publication [4].

The more precise framing is this: EKS published more directly comparable large-scale Day-2 numbers, while GKE published less detail in that area. But GKE does not lack the underlying primitives for node health detection and auto-repair.

Reading the Benchmarks

Before drawing conclusions, one point needs to be made explicit: these two benchmark reports are not perfectly apples-to-apples.

GKE’s 130K-node result is closer to an engineering showcase designed to demonstrate control-plane limits under favorable conditions. The emphasis is on Pod scheduling throughput, Spanner-backed storage behavior, and Kueue-enabled gang scheduling [1].

EKS’s 100K-node result is closer to a production-oriented operational simulation, including node join, rolling update, and failure recovery data [2]. That said, the EKS test only disabled 1,000 nodes to simulate failure, which is just 1% of the cluster and may be conservative relative to harsher real-world fault domains.

Both reports come from the cloud providers themselves, so naturally each one highlights the strongest available results. In practice, you need to pay attention not only to what is published, but also to what is missing. GKE did not publish rolling update numbers, but that does not mean it performs poorly there. EKS did not publish gang scheduling numbers, but that does not mean open-source tooling cannot fill the gap.

Conclusion

Stepping back, GKE and EKS are taking different routes toward the same goal: keeping the Kubernetes control plane stable at 100K-node scale while still scheduling and operating massive AI training workloads effectively.

The high-level comparison looks like this:

| Area | GKE 130K | EKS 100K |
| --- | --- | --- |
| Storage architecture | Spanner is distributed by design, avoiding etcd’s classic single-system limits | etcd is heavily reworked with Raft offload, BoltDB on tmpfs, and key-space partitioning; object count reaches 10M+ |
| Pod scheduling throughput | ~1,000 pods/sec, with 130K Pods started in 3 min 40 sec | ~500 pods/sec in staged deployment |
| Gang scheduling | Demonstrated Kueue integration, including preempting 39K Pods in 93 sec | Not discussed in the published report, though Kueue can still be deployed |
| Node lifecycle operations | Node Problem Detector, auto-repair, and MIG-backed healing exist, but no public benchmark figure at this scale | Karpenter disruption budgets, 100-minute rolling update at 100K nodes, plus node monitoring agent and auto-repair |
| Kubernetes conformance alignment | Control plane is more customized because Spanner replaces etcd | Closer to upstream assumptions because etcd remains the backing store |
| Platform engineering complexity | More complexity is absorbed by the platform | More user-visible tuning remains necessary, including Karpenter settings, SOCI Snapshotter, and image strategy |

It is also worth calling out that automatic node repair is not unique to EKS. GKE already enables Node Problem Detector and auto-repair behavior for detecting OS, kubelet, and runtime failures and replacing broken nodes automatically [5]. EKS, meanwhile, has added node monitoring agent and auto-repair features of its own [6]. The capabilities are converging. The bigger difference is that EKS published a fuller large-scale Day-2 operational story, whereas GKE’s 130K-node write-up focused more tightly on control-plane throughput and scheduling ceilings [1][2].

So which one should you choose?

  • If your primary concern is bursty AI training and scheduling density, GKE with Spanner plus Kueue currently shows the higher published ceiling.
  • If your primary concern is Kubernetes conformance and predictable Day-2 operations, EKS offers a more complete and reproducible public operational data set [7].

If you care about both, then the hardest problem is not whether a single cluster can be stretched this far, but whether your team can operate a cluster this large reliably. At this scale, divide and conquer is still often the most practical strategy. Splitting workloads across multiple clusters based on workload behavior, team boundaries, and fault domains is frequently more robust, easier to debug, and easier to evolve than concentrating everything into one enormous cluster. Reaching 130K nodes is one thing. Operating 130K nodes sustainably is another.

References

Eason Cao

Eason is an engineer at a FAANG company living in Europe. He is accredited as an AWS Professional Solutions Architect, an AWS Professional DevOps Engineer, and a CNCF Certified Kubernetes Administrator. He started his Kubernetes journey in 2017 and enjoys solving real-world business problems.