Streamlining EBS CSI Driver Troubleshooting with AWS Support Automation Workflow

In modern Kubernetes environments, persistent storage is a critical component for stateful applications. Amazon EKS users rely heavily on the Amazon EBS Container Storage Interface (CSI) driver to provision and manage persistent volumes. However, when this integration encounters issues, troubleshooting can become complex and time-consuming, potentially leading to application downtime and data availability concerns.

Introduction to AWS Support Automation Workflow

AWS Support Automation Workflow (SAW) offers a powerful solution to this challenge through pre-built runbooks like the AWSSupport-TroubleshootEbsCsiDriversForEks ¹ automation document. This runbook is specifically designed to diagnose and resolve issues with Amazon EBS volume mounts in Amazon EKS clusters and EBS CSI driver configurations.

SAW is particularly useful for EBS CSI driver troubleshooting because:

- Storage issues require investigation across multiple AWS services (EKS, EBS, EC2) and Kubernetes resources
- Diagnosing CSI driver problems requires specialized knowledge of both Kubernetes and AWS storage services
- Manual troubleshooting involves executing dozens of commands with complex output interpretation
- Storage failures directly impact application availability and data persistence

Technical Overview of SAW Implementation

Core Architecture Components

The AWSSupport-TroubleshootEbsCsiDriversForEks runbook integrates several AWS services:

- AWS Systems Manager (SSM): The execution engine that runs the troubleshooting runbook
- Amazon EKS: The managed Kubernetes service where the EBS CSI driver operates
- Amazon EBS: The block storage service providing persistent volumes
- AWS Lambda: Used to create a proxy for Kubernetes API calls
- AWS CloudFormation: Manages the creation and cleanup of troubleshooting resources
- Amazon S3: Optional storage for troubleshooting logs and reports

How the Runbook Works

The AWSSupport-TroubleshootEbsCsiDriversForEks runbook executes the following high-level steps:

- Verify cluster status: Confirms the target EKS cluster exists and is in an active state
- Deploy authentication resources: Sets up necessary components for making Kubernetes API calls
- Perform EBS CSI controller health checks: Evaluates controller pod status and configuration
- Check IAM permissions: Verifies node roles and service account roles have proper permissions
- Diagnose persistent volume creation issues: Analyzes PVC/PV problems for the specified pod
- Examine pod scheduling: Checks node-to-pod scheduling and analyzes pod events
- Collect relevant logs: Gathers Kubernetes and application logs
- Perform node health checks: Verifies EC2 instance health and connectivity to required endpoints
- Review volume attachments: Checks persistent volume block device attachment and mounting status
- Clean up resources: Removes authentication infrastructure created during troubleshooting
- Generate report: Produces a comprehensive troubleshooting report with all diagnostic results

Step-by-Step Implementation Guide

Setting Up for EBS CSI Driver Troubleshooting

Before running the runbook, you need to prepare your environment:

Create an IAM role for SSM automation:

Create a role named TroubleshootEbsCsiDriversForEks-SSM-Role with the following trust relationship:

 {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Sid": "",
             "Effect": "Allow",
             "Principal": {
                 "Service": "ssm.amazonaws.com"
             },
             "Action": "sts:AssumeRole"
         }
     ]
 }
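The same role could also be created programmatically. The following is a minimal sketch, assuming boto3 is available and that an IAM client (`boto3.client("iam")`) is passed in; the helper names `build_trust_policy` and `create_automation_role` are illustrative and not part of the runbook.

```python
import json

ROLE_NAME = "TroubleshootEbsCsiDriversForEks-SSM-Role"


def build_trust_policy() -> dict:
    """Trust policy that lets Systems Manager assume the automation role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_automation_role(iam_client, role_name: str = ROLE_NAME) -> str:
    """Create the role and return its ARN.

    iam_client is expected to be boto3.client("iam"); it is injected so the
    function can be exercised without AWS credentials.
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
        Description="Role assumed by SSM Automation for EBS CSI troubleshooting",
    )
    return response["Role"]["Arn"]
```

With real credentials, `create_automation_role(boto3.client("iam"))` would return the new role's ARN, which is the value the runbook expects in AutomationAssumeRole.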

Attach the permissions policy:

The runbook requires specific IAM permissions to function properly. The OptionalRestrictPutObjects statement can be removed if you do not plan to store the SAW diagnostic report in an Amazon S3 bucket. Here’s an example policy outlining the necessary permissions:

 {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Sid": "OptionalRestrictPutObjects",
             "Effect": "Allow",
             "Action": ["s3:PutObject"],
             "Resource": ["arn:{partition}:s3:::BUCKET_NAME/*"]
         },
         {
             "Effect": "Allow",
             "Action": [
                 "ec2:DescribeIamInstanceProfileAssociations",
                 "ec2:DescribeInstanceStatus",
                 "ec2:GetEbsEncryptionByDefault",
                 "eks:DescribeAddon",
                 "eks:DescribeAddonVersions",
                 "eks:DescribeCluster",
                 "iam:GetInstanceProfile",
                 "iam:GetOpenIDConnectProvider",
                 "iam:GetRole",
                 "iam:ListOpenIDConnectProviders",
                 "iam:SimulatePrincipalPolicy",
                 "s3:GetBucketLocation",
                 "s3:GetBucketPolicyStatus",
                 "s3:GetBucketPublicAccessBlock",
                 "s3:GetBucketVersioning",
                 "s3:ListBucket",
                 "s3:ListBucketVersions",
                 "ssm:DescribeInstanceInformation",
                 "ssm:GetAutomationExecution",
                 "ssm:GetDocument",
                 "ssm:ListCommandInvocations",
                 "ssm:ListCommands",
                 "ssm:SendCommand",
                 "ssm:StartAutomationExecution"
             ],
             "Resource": "*"
         },
         {
             "Sid": "SetupK8sApiProxyForEKSActions",
             "Effect": "Allow",
             "Action": [
                 "cloudformation:CreateStack",
                 "cloudformation:DeleteStack",
                 "cloudformation:DescribeStacks",
                 "cloudformation:UpdateStack",
                 "ec2:CreateNetworkInterface",
                 "ec2:DeleteNetworkInterface",
                 "ec2:DescribeNetworkInterfaces",
                 "ec2:DescribeRouteTables",
                 "ec2:DescribeSecurityGroups",
                 "ec2:DescribeSubnets",
                 "ec2:DescribeVpcs",
                 "eks:DescribeCluster",
                 "iam:CreateRole",
                 "iam:DeleteRole",
                 "iam:GetRole",
                 "iam:TagRole",
                 "iam:UntagRole",
                 "lambda:CreateFunction",
                 "lambda:DeleteFunction",
                 "lambda:GetFunction",
                 "lambda:InvokeFunction",
                 "lambda:ListTags",
                 "lambda:TagResource",
                 "lambda:UntagResource",
                 "lambda:UpdateFunctionCode",
                 "logs:CreateLogGroup",
                 "logs:CreateLogStream",
                 "logs:DescribeLogGroups",
                 "logs:DescribeLogStreams",
                 "logs:ListTagsForResource",
                 "logs:PutLogEvents",
                 "logs:PutRetentionPolicy",
                 "logs:TagResource",
                 "logs:UntagResource",
                 "ssm:DescribeAutomationExecutions",
                 "tag:GetResources",
                 "tag:TagResources"
             ],
             "Resource": "*"
         },
         {
             "Sid": "PassRoleToAutomation",
             "Effect": "Allow",
             "Action": "iam:PassRole",
             "Resource": [
                 "arn:*:iam::*:role/TroubleshootEbsCsiDriversForEks-SSM-Role",
                 "arn:*:iam::*:role/Automation-K8sProxy-Role-*"
             ],
             "Condition": {
                 "StringLikeIfExists": {
                     "iam:PassedToService": [
                         "lambda.amazonaws.com",
                         "ssm.amazonaws.com"
                     ]
                 }
             }
         },
         {
             "Sid": "AttachRolePolicy",
             "Effect": "Allow",
             "Action": [
                 "iam:AttachRolePolicy",
                 "iam:DetachRolePolicy"
             ],
             "Resource": "*",
             "Condition": {
                 "StringLikeIfExists": {
                     "iam:ResourceTag/AWSSupport-SetupK8sApiProxyForEKS": "true"
                 }
             }
         }
     ]
 }
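Because the policy contains placeholders (`{partition}`, `BUCKET_NAME`) and one optional statement, it can help to tailor it in code before attaching it. The sketch below is one possible approach, not an official tool: `prepare_policy` is a hypothetical helper, and `EXAMPLE_POLICY` is deliberately abbreviated to two statements to keep the example short.

```python
import copy
from typing import Optional

# Abbreviated stand-in for the full policy shown above (assumption: same shape).
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "OptionalRestrictPutObjects",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:{partition}:s3:::BUCKET_NAME/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["eks:DescribeCluster", "ssm:GetAutomationExecution"],
            "Resource": "*",
        },
    ],
}


def prepare_policy(policy: dict, bucket_name: Optional[str] = None,
                   partition: str = "aws") -> dict:
    """Return a copy of the policy tailored to whether an S3 bucket is used."""
    result = copy.deepcopy(policy)
    statements = []
    for statement in result["Statement"]:
        if statement.get("Sid") == "OptionalRestrictPutObjects":
            if bucket_name is None:
                continue  # no report upload, so the statement is not needed
            statement["Resource"] = [f"arn:{partition}:s3:::{bucket_name}/*"]
        statements.append(statement)
    result["Statement"] = statements
    return result
```

The resulting document could then be attached as an inline policy, for example with `iam_client.put_role_policy(RoleName=..., PolicyName=..., PolicyDocument=json.dumps(policy))`.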

Configure EKS Cluster Access:

The recommended approach is to create an Access Entry in your EKS cluster:
- Navigate to your cluster in the Amazon EKS console
- Verify your access configuration is set to API_AND_CONFIG_MAP or API
- Choose “Create access entry”
- For IAM principal ARN, select the SSM automation role you created
- For Type, select “Standard”
- Add an access policy with “Cluster” scope and AmazonEKSAdminViewPolicy
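The console steps above can also be approximated with the EKS API. This is a hedged sketch, assuming an EKS client created with `boto3.client("eks")` is passed in; `grant_cluster_view_access` is an illustrative name, and the cluster name and role ARN are whatever you use in your account.

```python
ADMIN_VIEW_POLICY_ARN = (
    "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy"
)
SUPPORTED_AUTH_MODES = {"API", "API_AND_CONFIG_MAP"}


def grant_cluster_view_access(eks_client, cluster_name: str, role_arn: str) -> None:
    """Create a standard access entry and attach the cluster-scoped view policy."""
    cluster = eks_client.describe_cluster(name=cluster_name)["cluster"]
    mode = cluster.get("accessConfig", {}).get("authenticationMode")
    if mode not in SUPPORTED_AUTH_MODES:
        raise ValueError(
            f"Cluster authentication mode is {mode!r}; access entries need "
            "API or API_AND_CONFIG_MAP"
        )

    eks_client.create_access_entry(
        clusterName=cluster_name,
        principalArn=role_arn,
        type="STANDARD",
    )
    eks_client.associate_access_policy(
        clusterName=cluster_name,
        principalArn=role_arn,
        policyArn=ADMIN_VIEW_POLICY_ARN,
        accessScope={"type": "cluster"},
    )
```

Checking the authentication mode first mirrors the console guidance: on a cluster that only uses the aws-auth ConfigMap, creating an access entry would fail.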

Prepare an S3 bucket (optional):

If you plan to store troubleshooting logs, create a private S3 bucket in the same region as your EKS cluster.
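As an illustration, a private bucket in the cluster's Region could be created as follows. This is a sketch under assumptions: an S3 client from `boto3.client("s3", region_name=region)` is passed in, and the helper names are hypothetical.

```python
def bucket_request(bucket_name: str, region: str) -> dict:
    """Build create_bucket arguments; us-east-1 must omit LocationConstraint."""
    request = {"Bucket": bucket_name}
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def create_private_bucket(s3_client, bucket_name: str, region: str) -> None:
    """Create the bucket and explicitly block all public access."""
    s3_client.create_bucket(**bucket_request(bucket_name, region))
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```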

Running the AWSSupport-TroubleshootEbsCsiDriversForEks Runbook

Open the runbook in the Systems Manager console (Run this Automation (console) ²) and execute the automation with the following parameters:
- AutomationAssumeRole (Optional): ARN of the IAM role you created, e.g., arn:aws:iam::123456789012:role/TroubleshootEbsCsiDriversForEks-SSM-Role
- EksClusterName: Name of your EKS cluster experiencing issues
- ApplicationPodName: Name of the Kubernetes pod having issues with the EBS CSI driver
- ApplicationNamespace: Namespace of the application pod having issues
- EbsCsiControllerDeploymentName (Optional): Deployment name for the EBS CSI controller pod (default: ebs-csi-controller)
- EbsCsiControllerNamespace (Optional): Namespace for the EBS CSI controller (default: kube-system)
- S3BucketName (Optional): S3 bucket name for uploading troubleshooting logs
- LambdaRoleArn (Optional): IAM role ARN for the Lambda function the runbook creates

Alternative: Execute via AWS CLI:

 aws ssm start-automation-execution \
   --document-name "AWSSupport-TroubleshootEbsCsiDriversForEks" \
   --parameters "EksClusterName=my-production-cluster,ApplicationPodName=my-stateful-app,ApplicationNamespace=default,AutomationAssumeRole=arn:aws:iam::123456789012:role/TroubleshootEbsCsiDriversForEks-SSM-Role"
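The equivalent call from Python might look like the sketch below. It assumes an SSM client from `boto3.client("ssm")` is passed in; `build_parameters` and `start_troubleshooting` are illustrative helper names. Note that the API expects every parameter value as a list of strings.

```python
from typing import Optional

DOCUMENT_NAME = "AWSSupport-TroubleshootEbsCsiDriversForEks"


def build_parameters(cluster: str, pod: str, namespace: str,
                     assume_role_arn: Optional[str] = None,
                     s3_bucket: Optional[str] = None) -> dict:
    """Build the Parameters map; each value is a list of strings."""
    parameters = {
        "EksClusterName": [cluster],
        "ApplicationPodName": [pod],
        "ApplicationNamespace": [namespace],
    }
    if assume_role_arn:
        parameters["AutomationAssumeRole"] = [assume_role_arn]
    if s3_bucket:
        parameters["S3BucketName"] = [s3_bucket]
    return parameters


def start_troubleshooting(ssm_client, **kwargs) -> str:
    """Start the runbook and return the automation execution ID."""
    response = ssm_client.start_automation_execution(
        DocumentName=DOCUMENT_NAME,
        Parameters=build_parameters(**kwargs),
    )
    return response["AutomationExecutionId"]
```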

Monitor the execution in the AWS Systems Manager console or via CLI:

 aws ssm get-automation-execution \
   --automation-execution-id "execution-id-from-previous-step"
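For scripted monitoring, one option is to poll the execution until it reaches a terminal status and then list each step's result. The sketch below assumes an SSM client from `boto3.client("ssm")`. The set of terminal statuses is a simplification chosen for this example, and the helper names are hypothetical.

```python
import time

# Simplified set of terminal statuses (assumption: sufficient for this runbook).
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def wait_for_execution(ssm_client, execution_id: str, poll_seconds: float = 15.0,
                       max_polls: int = 120, sleep=time.sleep) -> dict:
    """Poll get_automation_execution until the run finishes, then return it."""
    for _ in range(max_polls):
        execution = ssm_client.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        if execution["AutomationExecutionStatus"] in TERMINAL_STATUSES:
            return execution
        sleep(poll_seconds)
    raise TimeoutError(f"Execution {execution_id} did not finish in time")


def summarize_steps(execution: dict) -> list:
    """Return (step name, step status) pairs in execution order."""
    return [
        (step["StepName"], step["StepStatus"])
        for step in execution.get("StepExecutions", [])
    ]
```

The `sleep` argument is injectable only so the loop can be tested without waiting; in normal use the default is fine.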

Use Case

Diagnosing Persistent Volume Claim (PVC) Stuck in Pending State

Scenario: An application team running their patient data service on EKS encountered an issue where newly created PVCs remained in a “Pending” state indefinitely. The applications were unable to start properly because the required storage volumes could not be provisioned. This issue typically required hours of manual investigation across multiple components:

- Inspecting PVC and StorageClass configurations
- Checking EBS CSI controller pod logs
- Verifying IAM role permissions for the service account

To reproduce the scenario, I wrote a shell script (scenario1.sh) that sets up a sample environment with a misconfigured policy. This helps demonstrate how the AWS Support Automation Workflow identifies and helps resolve EBS CSI driver-related issues:

export AWS_REGION=eu-west-1
./scenario1.sh

With the test scenario ready, I run the reproduction script, which creates a pod that requires a PV. As expected, the pod remains in a Pending state because its volume cannot be provisioned.

$ kubectl get pod -n default
NAME   READY   STATUS    RESTARTS   AGE
app    0/1     Pending   0          48m

Using AWS Support Automation Workflow (SAW)

In this case, I can add an access entry for the SSM role to my EKS cluster, then start the troubleshooting workflow by executing the AWSSupport-TroubleshootEbsCsiDriversForEks runbook. The automation will systematically check for common issues, including IAM permissions, pod scheduling problems, and volume attachment status. This comprehensive diagnostic approach ensures we can quickly identify and resolve any EBS CSI driver related issues.

With the AWSSupport-TroubleshootEbsCsiDriversForEks runbook, you only need to provide the cluster name, affected pod name, and namespace. In this run, the automation:

- Verified the EBS CSI controller deployment health and configuration
- Identified a misconfigured IAM role for the EBS CSI driver service account
- Confirmed the StorageClass parameters were correct
- Verified that the node IAM role was missing required EBS permissions
- Generated a comprehensive report identifying the exact missing permissions

Results:

The troubleshooting time decreased from hours to just 15 minutes. The generated output quickly pinpointed the root cause, a misconfigured trust policy on the service account's IAM role, and provided clear steps to fix the IAM role.

Output:

==================================================
1. IAM and Service Account permissions checks
==================================================

Checked service account role AmazonEKSEBSCSIDriverRoleLab1 and there is no trust policy for federated

arn:aws:iam::123456789:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A

with condition

{
  'StringEquals': {
    'oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A:aud': 'sts.amazonaws.com'
  }
}.

Check AWS documentation https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html

The report pointed out that the audience (aud) condition for the OIDC provider in the role's trust policy was incorrect. In my reproduction, I used www.sts.amazonaws.com instead of sts.amazonaws.com. This small but critical configuration error prevented the EBS CSI driver service account from assuming its IAM role, which caused volume provisioning failures. After correcting the trust policy, the PVC was successfully created and the application pod started normally. The misconfigured trust policy from the reproduction is shown below:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456789:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A:aud": "www.sts.amazonaws.com", <-- Incorrect setting
                    "oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
                }
            }
        }
    ]
}
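A check like the one the report describes can be approximated locally. The following is a simplified sketch, not the runbook's actual logic: `find_audience_problems` is a hypothetical helper that inspects only StringEquals conditions whose key ends in `:aud`.

```python
def _as_list(value):
    """Normalize a policy field that may be a single string or a list."""
    return value if isinstance(value, list) else [value]


def find_audience_problems(trust_policy: dict,
                           expected_audience: str = "sts.amazonaws.com") -> list:
    """Return a message for every OIDC audience condition with a wrong value."""
    problems = []
    for statement in trust_policy.get("Statement", []):
        actions = _as_list(statement.get("Action", []))
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        conditions = statement.get("Condition", {}).get("StringEquals", {})
        for key, value in conditions.items():
            if not key.endswith(":aud"):
                continue
            for audience in _as_list(value):
                if audience != expected_audience:
                    problems.append(
                        f"{key} is {audience!r}, expected {expected_audience!r}"
                    )
    return problems
```

Run against the trust policy above (with the annotation removed so it parses as JSON), this would flag the `www.sts.amazonaws.com` audience and return an empty list once the value is corrected.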

Advanced Tip: Streamlining Your EBS CSI Troubleshooting Workflow

To maximize the effectiveness of the AWS Support Automation Workflow for EBS CSI driver troubleshooting, set up automated responses for common failure patterns. You can integrate EventBridge rules to trigger actions when monitoring systems detect issues like failed application health checks. This automation helps teams reduce manual intervention and resolve issues more quickly.

The diagram shows how the automation workflow handles EBS CSI driver issues—starting from the initial trigger, moving through comprehensive diagnostics, and ending with final reporting and external system integration.

Streamlining Your EBS CSI Troubleshooting Workflow

- Ensure systems are prepared for automation: While the runbook works with standard EKS configurations, some advanced diagnostics require worker nodes to be Systems Manager managed instances. The EKS-optimized Amazon Linux AMI includes the SSM agent by default. For nodes not using the EKS-optimized AMI, you’ll need to register them as managed instances to enable all diagnostic capabilities.
- Create specific CloudWatch alarms to detect EBS CSI issues early: Next, establish a monitoring mechanism to track stateful system health. This can include custom CloudWatch metrics that reflect your stateful application’s health. Set up CloudWatch alarms with appropriate thresholds to detect potential problems early.
- Create EventBridge rules to trigger the automation when issues are detected: Using CloudWatch alarm states, you can create specific EventBridge rules to monitor EBS CSI driver events and trigger the automation workflow with AWS Lambda when issues arise.

aws events put-rule \
  --name "EBSCSITroubleshootingTrigger" \
  --event-pattern '{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:EBSVolumeProvisioningFailures"],
    "detail": {
      "state": {
        "value": ["ALARM"]
      }
    }
  }'
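The rule above only matches events; it still needs a target that starts the runbook. One possible target is a small Lambda function along the following lines. This is a sketch under assumptions: the environment variable names (`EKS_CLUSTER_NAME`, `APPLICATION_POD_NAME`, `APPLICATION_NAMESPACE`, `AUTOMATION_ASSUME_ROLE_ARN`) are hypothetical, and the optional `ssm_client` and `env` arguments exist only to make the handler testable.

```python
import os

DOCUMENT_NAME = "AWSSupport-TroubleshootEbsCsiDriversForEks"


def should_trigger(event: dict) -> bool:
    """Return True only for CloudWatch alarm events that entered ALARM."""
    state = event.get("detail", {}).get("state", {}).get("value")
    return (
        event.get("detail-type") == "CloudWatch Alarm State Change"
        and state == "ALARM"
    )


def parameters_from_env(env) -> dict:
    """Build runbook parameters from (hypothetical) environment variables."""
    return {
        "EksClusterName": [env["EKS_CLUSTER_NAME"]],
        "ApplicationPodName": [env["APPLICATION_POD_NAME"]],
        "ApplicationNamespace": [env.get("APPLICATION_NAMESPACE", "default")],
        "AutomationAssumeRole": [env["AUTOMATION_ASSUME_ROLE_ARN"]],
    }


def lambda_handler(event, context, ssm_client=None, env=None):
    """Start the troubleshooting runbook when the alarm fires."""
    if not should_trigger(event):
        return {"started": False}
    if ssm_client is None:
        import boto3  # provided by the Lambda Python runtime
        ssm_client = boto3.client("ssm")
    response = ssm_client.start_automation_execution(
        DocumentName=DOCUMENT_NAME,
        Parameters=parameters_from_env(os.environ if env is None else env),
    )
    return {"started": True, "executionId": response["AutomationExecutionId"]}
```

The function would still need to be registered as the rule's target (for example with `aws events put-targets`) and given permission to be invoked by EventBridge.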

Alternatively, if you have your own alerting system, you can directly invoke the SSM runbook using the StartAutomationExecution API:

aws ssm start-automation-execution \
  --document-name "AWSSupport-TroubleshootEbsCsiDriversForEks" \
  --parameters \
    "EksClusterName=my-production-cluster,\
    ApplicationPodName=my-stateful-app,\
    ApplicationNamespace=default,\
    AutomationAssumeRole=arn:aws:iam::123456789012:role/TroubleshootEbsCsiDriversForEks-SSM-Role"

Integration with Monitoring and Notification Systems

Feed automation results to incident management platforms:

You can set up a Lambda function that processes the automation output and updates your incident management system with findings and recommended actions:

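 import json

 import boto3
 import requests  # not bundled with the Lambda Python runtime; package it with the function or use a layer

 PAGERDUTY_API_URL = 'https://api.pagerduty.com'


 def lambda_handler(event, context):
     ssm_client = boto3.client('ssm')
     # The event shape used here is defined by whatever invokes this function
     execution_id = event['detail']['automation-execution-id']
     incident_id = event['incident_id']

     # Get automation execution output
     response = ssm_client.get_automation_execution(
         AutomationExecutionId=execution_id
     )

     # Outputs maps each output name to a list of strings;
     # the 'DiagnosisReport' key is an assumption about the runbook output
     outputs = response['AutomationExecution']['Outputs']
     raw_report = outputs['DiagnosisReport'][0]
     try:
         diagnosis_report = json.dumps(json.loads(raw_report), indent=2)
     except json.JSONDecodeError:
         # The report is plain text rather than JSON, so pass it through as-is
         diagnosis_report = raw_report

     # Update PagerDuty incident
     headers = {
         'Content-Type': 'application/json',
         # Placeholder: load the key from a secret store or environment variable
         'Authorization': 'Token token=YOUR_API_KEY'
     }

     # Verify this payload shape against the PagerDuty REST API reference before relying on it
     payload = {
         'incident': {
             'id': incident_id,
             'type': 'incident',
             'status': 'acknowledged',
             'notes': [{
                 'content': f"EBS CSI Diagnosis Results:\n{diagnosis_report}"
             }]
         }
     }

     pagerduty_response = requests.put(
         f"{PAGERDUTY_API_URL}/incidents/{incident_id}",
         headers=headers,
         data=json.dumps(payload),
         timeout=10
     )
     # Surface HTTP errors instead of reporting success unconditionally
     pagerduty_response.raise_for_status()

     return {
         'statusCode': 200,
         'body': 'PagerDuty updated with diagnosis results'
     }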

Create a centralized repository for storage troubleshooting outcomes:

Store the results of each automation run in a searchable format (such as Elasticsearch) to identify patterns and improve storage configuration over time.
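Before indexing, each run can be flattened into a small JSON-friendly document. The sketch below is one possible shape, not a prescribed schema: `to_index_document` is a hypothetical helper that takes the `AutomationExecution` object returned by `get_automation_execution`, and the output field names are chosen for this example.

```python
import json
from datetime import datetime, timezone


def _iso(value):
    """Convert datetimes (as returned by boto3) to ISO 8601 strings."""
    return value.isoformat() if isinstance(value, datetime) else value


def to_index_document(execution: dict) -> dict:
    """Flatten an AutomationExecution into a searchable document."""
    parameters = {
        name: values[0] if len(values) == 1 else values
        for name, values in execution.get("Parameters", {}).items()
    }
    failed_steps = [
        step["StepName"]
        for step in execution.get("StepExecutions", [])
        if step.get("StepStatus") == "Failed"
    ]
    return {
        "execution_id": execution["AutomationExecutionId"],
        "document_name": execution.get("DocumentName"),
        "status": execution.get("AutomationExecutionStatus"),
        "started_at": _iso(execution.get("ExecutionStartTime")),
        "ended_at": _iso(execution.get("ExecutionEndTime")),
        "parameters": parameters,
        "failed_steps": failed_steps,
    }


example = to_index_document({
    "AutomationExecutionId": "example-execution-id",
    "DocumentName": "AWSSupport-TroubleshootEbsCsiDriversForEks",
    "AutomationExecutionStatus": "Success",
    "ExecutionStartTime": datetime(2025, 5, 24, 10, 0, tzinfo=timezone.utc),
    "Parameters": {"EksClusterName": ["my-production-cluster"]},
    "StepExecutions": [],
})
print(json.dumps(example, indent=2))
```

The resulting dictionary can be sent to whichever search backend you use. Keeping cluster, pod, and failed step names as top-level fields makes it easier to spot recurring failure patterns.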

Conclusion

The AWSSupport-TroubleshootEbsCsiDriversForEks runbook provides a robust solution for diagnosing and resolving EBS CSI driver issues in EKS environments. By automating the troubleshooting process across AWS services and Kubernetes components, it reduces the time needed to identify and fix problems.

This post demonstrated how automation can significantly reduce troubleshooting time while providing comprehensive diagnostics. By integrating this workflow with monitoring systems and incident management platforms, teams can create a proactive approach to storage management in their Kubernetes environments.

References

24 May 2025


Eason Cao

Eason is an engineer working at FANNG and living in Europe. He was accredited as AWS Professional Solution Architect, AWS Professional DevOps Engineer and CNCF Certified Kubernetes Administrator. He started his Kubernetes journey in 2017 and enjoys solving real-world business problems.