
Streamlining EBS CSI Driver Troubleshooting with AWS Support Automation Workflow
In modern Kubernetes environments, persistent storage is a critical component for stateful applications. Amazon EKS users rely heavily on the Amazon EBS Container Storage Interface (CSI) driver to provision and manage persistent volumes. However, when this integration encounters issues, troubleshooting can become complex and time-consuming, potentially leading to application downtime and data availability concerns.
Introduction to AWS Support Automation Workflow
AWS Support Automation Workflow (SAW) addresses this challenge with pre-built runbooks such as the AWSSupport-TroubleshootEbsCsiDriversForEks [1] automation document. This runbook is designed to diagnose and help resolve issues with Amazon EBS volume mounts in Amazon EKS clusters and with EBS CSI driver configurations.
For EBS CSI driver troubleshooting specifically, SAW is valuable because:
- Storage issues require investigation across multiple AWS services (EKS, EBS, EC2) and Kubernetes resources
- Diagnosing CSI driver problems requires specialized knowledge of both Kubernetes and AWS storage services
- Manual troubleshooting involves executing dozens of commands with complex output interpretation
- Storage failures directly impact application availability and data persistence
Technical Overview of SAW Implementation
Core Architecture Components
The AWSSupport-TroubleshootEbsCsiDriversForEks runbook integrates several AWS services:
- AWS Systems Manager (SSM): The execution engine that runs the troubleshooting runbook
- Amazon EKS: The managed Kubernetes service where the EBS CSI driver operates
- Amazon EBS: The block storage service providing persistent volumes
- AWS Lambda: Used to create a proxy for Kubernetes API calls
- AWS CloudFormation: Manages the creation and cleanup of troubleshooting resources
- Amazon S3: Optional storage for troubleshooting logs and reports
How the Runbook Works
The AWSSupport-TroubleshootEbsCsiDriversForEks runbook executes the following high-level steps:
- Verify cluster status: Confirms the target EKS cluster exists and is in an active state
- Deploy authentication resources: Sets up necessary components for making Kubernetes API calls
- Perform EBS CSI controller health checks: Evaluates controller pod status and configuration
- Check IAM permissions: Verifies node roles and service account roles have proper permissions
- Diagnose persistent volume creation issues: Analyzes PVC/PV problems for the specified pod
- Examine pod scheduling: Checks node-to-pod scheduling and analyzes pod events
- Collect relevant logs: Gathers Kubernetes and application logs
- Perform node health checks: Verifies EC2 instance health and connectivity to required endpoints
- Review volume attachments: Checks persistent volume block device attachment and mounting status
- Clean up resources: Removes authentication infrastructure created during troubleshooting
- Generate report: Produces a comprehensive troubleshooting report with all diagnostic results
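Once a run finishes, the per-step results can be read from the GetAutomationExecution response. The following is a minimal sketch of how those results might be summarized; the step names in the sample data are hypothetical placeholders, not the runbook's actual step names, and the helper functions are illustrative only.

```python
from typing import Any, Dict, List, Tuple


def summarize_steps(execution_response: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (step name, status, failure message) for each step in the response."""
    execution = execution_response.get("AutomationExecution", {})
    summary = []
    for step in execution.get("StepExecutions", []):
        summary.append(
            (
                step.get("StepName", "<unnamed>"),
                step.get("StepStatus", "Unknown"),
                step.get("FailureMessage", ""),
            )
        )
    return summary


def failed_steps(execution_response: Dict[str, Any]) -> List[str]:
    """Return the names of steps that ended in Failed or TimedOut."""
    return [
        name
        for name, status, _ in summarize_steps(execution_response)
        if status in {"Failed", "TimedOut"}
    ]


# Hand-written sample in the shape of a GetAutomationExecution response.
# The step names are hypothetical, not the runbook's real step names.
SAMPLE_RESPONSE = {
    "AutomationExecution": {
        "AutomationExecutionStatus": "Failed",
        "StepExecutions": [
            {"StepName": "exampleVerifyCluster", "StepStatus": "Success"},
            {
                "StepName": "exampleCheckIam",
                "StepStatus": "Failed",
                "FailureMessage": "example failure message",
            },
        ],
    }
}

if __name__ == "__main__":
    for name, status, message in summarize_steps(SAMPLE_RESPONSE):
        print(name, status, message)
```

With boto3 installed and credentials configured, the same functions could be fed the dictionary returned by the SSM client's get_automation_execution call.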
Step-by-Step Implementation Guide
Setting Up for EBS CSI Driver Troubleshooting
Before running the runbook, you need to prepare your environment:
- Create an IAM role for SSM automation:
Create a role named TroubleshootEbsCsiDriversForEks-SSM-Role with the following trust relationship:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ssm.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
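If you prefer to script the role creation, a sketch along these lines may help. It assumes an IAM client object (for example, one created with boto3.client("iam")) is passed in; the function names are illustrative, and the role description text is an assumption rather than anything the runbook requires.

```python
import json

ROLE_NAME = "TroubleshootEbsCsiDriversForEks-SSM-Role"


def build_trust_policy(service="ssm.amazonaws.com"):
    """Build the trust relationship shown above as a Python dictionary."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_automation_role(iam_client, role_name=ROLE_NAME):
    """Create the role with the SSM trust policy and return its ARN.

    iam_client is expected to behave like a boto3 IAM client.
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
        # Free-text description chosen for this sketch.
        Description="Role assumed by SSM Automation for EBS CSI troubleshooting",
    )
    return response["Role"]["Arn"]


if __name__ == "__main__":
    print(json.dumps(build_trust_policy(), indent=2))
```

The permissions policy still has to be attached separately, as described in the next step.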
- Attach the permissions policy:
The runbook requires specific IAM permissions to function properly. The OptionalRestrictPutObjects statement can be removed if you do not plan to store the SAW diagnostic report in an Amazon S3 bucket. Here is an example policy outlining the necessary permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OptionalRestrictPutObjects",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:{partition}:s3:::BUCKET_NAME/*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeInstanceStatus",
        "ec2:GetEbsEncryptionByDefault",
        "eks:DescribeAddon",
        "eks:DescribeAddonVersions",
        "eks:DescribeCluster",
        "iam:GetInstanceProfile",
        "iam:GetOpenIDConnectProvider",
        "iam:GetRole",
        "iam:ListOpenIDConnectProviders",
        "iam:SimulatePrincipalPolicy",
        "s3:GetBucketLocation",
        "s3:GetBucketPolicyStatus",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketVersioning",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "ssm:DescribeInstanceInformation",
        "ssm:GetAutomationExecution",
        "ssm:GetDocument",
        "ssm:ListCommandInvocations",
        "ssm:ListCommands",
        "ssm:SendCommand",
        "ssm:StartAutomationExecution"
      ],
      "Resource": "*"
    },
    {
      "Sid": "SetupK8sApiProxyForEKSActions",
      "Effect": "Allow",
      "Action": [
        "cloudformation:CreateStack",
        "cloudformation:DeleteStack",
        "cloudformation:DescribeStacks",
        "cloudformation:UpdateStack",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcs",
        "eks:DescribeCluster",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:GetRole",
        "iam:TagRole",
        "iam:UntagRole",
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunction",
        "lambda:InvokeFunction",
        "lambda:ListTags",
        "lambda:TagResource",
        "lambda:UntagResource",
        "lambda:UpdateFunctionCode",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:ListTagsForResource",
        "logs:PutLogEvents",
        "logs:PutRetentionPolicy",
        "logs:TagResource",
        "logs:UntagResource",
        "ssm:DescribeAutomationExecutions",
        "tag:GetResources",
        "tag:TagResources"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRoleToAutomation",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": [
        "arn:*:iam::*:role/TroubleshootEbsCsiDriversForEks-SSM-Role",
        "arn:*:iam::*:role/Automation-K8sProxy-Role-*"
      ],
      "Condition": {
        "StringLikeIfExists": {
          "iam:PassedToService": [
            "lambda.amazonaws.com",
            "ssm.amazonaws.com"
          ]
        }
      }
    },
    {
      "Sid": "AttachRolePolicy",
      "Effect": "Allow",
      "Action": [
        "iam:AttachRolePolicy",
        "iam:DetachRolePolicy"
      ],
      "Resource": "*",
      "Condition": {
        "StringLikeIfExists": {
          "iam:ResourceTag/AWSSupport-SetupK8sApiProxyForEKS": "true"
        }
      }
    }
  ]
}
- Configure EKS Cluster Access:
The recommended approach is to create an Access Entry in your EKS cluster:
  - Navigate to your cluster in the Amazon EKS console
  - Verify your access configuration is set to API_AND_CONFIG_MAP or API
  - Choose “Create access entry”
  - For IAM principal ARN, select the SSM automation role you created
  - For Type, select “Standard”
  - Add an access policy with “Cluster” scope and AmazonEKSAdminViewPolicy
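The console steps above can also be approximated in code. This is a sketch only: it assumes an EKS client that behaves like boto3.client("eks"), hard-codes the standard "aws" partition in the access policy ARN, and uses an illustrative function name.

```python
# Access policy ARN for the standard "aws" partition (an assumption of this
# sketch; other partitions use a different ARN prefix).
ADMIN_VIEW_POLICY_ARN = (
    "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy"
)
SUPPORTED_AUTH_MODES = {"API", "API_AND_CONFIG_MAP"}


def grant_cluster_view_access(eks_client, cluster_name, role_arn):
    """Create a standard access entry for role_arn and attach the view policy."""
    cluster = eks_client.describe_cluster(name=cluster_name)["cluster"]
    mode = cluster.get("accessConfig", {}).get("authenticationMode")
    if mode not in SUPPORTED_AUTH_MODES:
        raise ValueError(
            f"Cluster authentication mode is {mode!r}; access entries need "
            "API or API_AND_CONFIG_MAP"
        )

    # Equivalent of choosing Type "Standard" in the console.
    eks_client.create_access_entry(
        clusterName=cluster_name,
        principalArn=role_arn,
        type="STANDARD",
    )
    # Equivalent of adding the policy with "Cluster" scope.
    eks_client.associate_access_policy(
        clusterName=cluster_name,
        principalArn=role_arn,
        policyArn=ADMIN_VIEW_POLICY_ARN,
        accessScope={"type": "cluster"},
    )
```

Checking the authentication mode first mirrors the console step of verifying the access configuration before creating the entry.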
- Prepare an S3 bucket (optional):
If you plan to store troubleshooting logs, create a private S3 bucket in the same Region as your EKS cluster.
Running the AWSSupport-TroubleshootEbsCsiDriversForEks Runbook
- Run this Automation (console) [2]
- Execute the automation with the following parameters:
  - AutomationAssumeRole (Optional): ARN of the IAM role you created, e.g., arn:aws:iam::123456789012:role/TroubleshootEbsCsiDriversForEks-SSM-Role
  - EksClusterName: Name of your EKS cluster experiencing issues
  - ApplicationPodName: Name of the Kubernetes pod having issues with the EBS CSI driver
  - ApplicationNamespace: Namespace of the application pod having issues
  - EbsCsiControllerDeploymentName (Optional): Deployment name for the EBS CSI controller pod (default: ebs-csi-controller)
  - EbsCsiControllerNamespace (Optional): Namespace for the EBS CSI controller (default: kube-system)
  - S3BucketName (Optional): S3 bucket name for uploading troubleshooting logs
  - LambdaRoleArn (Optional): IAM role ARN for the Lambda function the runbook creates
- Alternative: Execute via AWS CLI:
aws ssm start-automation-execution \
  --document-name "AWSSupport-TroubleshootEbsCsiDriversForEks" \
  --parameters "EksClusterName=my-production-cluster,ApplicationPodName=my-stateful-app,ApplicationNamespace=default,AutomationAssumeRole=arn:aws:iam::123456789012:role/TroubleshootEbsCsiDriversForEks-SSM-Role"
- Monitor the execution in the AWS Systems Manager console or via CLI:
aws ssm get-automation-execution \
  --automation-execution-id "execution-id-from-previous-step"
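Monitoring can also be scripted by polling until the execution reaches a final state. The sketch below assumes a client that behaves like boto3.client("ssm"); the set of statuses treated as final is a deliberately small subset chosen for this example, and the polling interval and limit are arbitrary defaults.

```python
import time

# Statuses this sketch treats as final. Systems Manager defines additional
# statuses; extend the set if your workflows use approvals or change requests.
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def wait_for_automation(
    ssm_client,
    execution_id,
    poll_seconds=15.0,
    max_polls=120,
    sleep=time.sleep,
):
    """Poll GetAutomationExecution until a final status is seen, then return it."""
    for _ in range(max_polls):
        response = ssm_client.get_automation_execution(
            AutomationExecutionId=execution_id
        )
        status = response["AutomationExecution"]["AutomationExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(
        f"Execution {execution_id} did not finish after {max_polls} polls"
    )
```

The sleep function is injectable so the loop can be exercised in tests without real delays.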
Use Case
Diagnosing Persistent Volume Claim (PVC) Stuck in Pending State
Scenario: An application team running their patient data service on EKS encountered an issue where newly created PVCs remained in a “Pending” state indefinitely. The applications were unable to start properly because the required storage volumes could not be provisioned. This issue typically required hours of manual investigation across multiple components:
- Inspecting PVC and StorageClass configurations
- Checking EBS CSI controller pod logs
- Verifying IAM role permissions for the service account
To reproduce the scenario, I wrote a shell script (scenario1.sh) that sets up a sample environment with a misconfigured policy. This helps demonstrate how the AWS Support Automation Workflow identifies and helps resolve EBS CSI driver-related issues:
export AWS_REGION=eu-west-1
./scenario1.sh
With the test scenario ready, I run the reproduction script, which creates a pod that requires a PV. As expected, the pod remains in a Pending state because its volume cannot be provisioned.
$ kubectl get pod -n default
NAME READY STATUS RESTARTS AGE
app 0/1 Pending 0 48m
Using AWS Support Automation Workflow (SAW)
In this case, I add an access entry for the SSM role to my EKS cluster, then start the troubleshooting workflow by executing the AWSSupport-TroubleshootEbsCsiDriversForEks runbook. The automation systematically checks for common issues, including IAM permissions, pod scheduling problems, and volume attachment status. This diagnostic approach helps us quickly identify EBS CSI driver-related issues.
With the AWSSupport-TroubleshootEbsCsiDriversForEks runbook, you only need to provide the cluster name, affected pod name, and namespace. In this run, the automation:
- Verified the EBS CSI controller deployment health and configuration
- Identified a misconfigured IAM role for the EBS CSI driver service account
- Confirmed the StorageClass parameters were correct
- Verified that the node IAM role was missing required EBS permissions
- Generated a comprehensive report identifying the exact missing permissions
Results:
The troubleshooting time decreased from hours to just 15 minutes. The generated output quickly pinpointed the root cause, an IAM misconfiguration, and provided clear steps to fix the IAM role.
Output:
==================================================
1. IAM and Service Account permissions checks
==================================================
Checked service account role AmazonEKSEBSCSIDriverRoleLab1 and there is no trust policy for federated
arn:aws:iam::123456789:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A
with condition
{
'StringEquals': {
'oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A:aud': 'sts.amazonaws.com'
}
}.
Check AWS documentation https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
The report pointed out that the OIDC provider condition was incorrect. In my reproduction, I was using www.sts.amazonaws.com instead of sts.amazonaws.com as the audience. This small but critical configuration error prevented the EBS CSI driver service account from assuming its IAM role, which caused volume provisioning failures. After correcting the trust policy, the PVC was successfully created and the application pod started normally. The misconfigured trust policy looked like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A:aud": "www.sts.amazonaws.com", <-- Incorrect setting
          "oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
        }
      }
    }
  ]
}
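A check like the one the report describes can be approximated locally. The sketch below is not the runbook's implementation; it is a simplified illustration that only inspects StringEquals conditions on the audience key, and the function name is hypothetical.

```python
EXPECTED_AUDIENCE = "sts.amazonaws.com"


def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def find_audience_problems(trust_policy):
    """Return messages for web-identity statements with a wrong or missing :aud."""
    problems = []
    for statement in _as_list(trust_policy.get("Statement", [])):
        actions = _as_list(statement.get("Action", []))
        if "sts:AssumeRoleWithWebIdentity" not in actions:
            continue
        conditions = statement.get("Condition", {}).get("StringEquals", {})
        aud_keys = [key for key in conditions if key.endswith(":aud")]
        if not aud_keys:
            problems.append("no ':aud' condition found under StringEquals")
        for key in aud_keys:
            values = _as_list(conditions[key])
            if EXPECTED_AUDIENCE not in values:
                problems.append(
                    f"{key} is {values!r}, expected {EXPECTED_AUDIENCE!r}"
                )
    return problems


if __name__ == "__main__":
    provider = "oidc.eks.eu-west-1.amazonaws.com/id/C746A744DA16B63394A03370F964F05A"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {provider + ":aud": "www.sts.amazonaws.com"}
                },
            }
        ],
    }
    for problem in find_audience_problems(policy):
        print(problem)
```

Running a check of this kind before deploying a trust policy could catch the www.sts.amazonaws.com typo without waiting for a PVC to get stuck.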
Advanced Tip: Streamlining Your EBS CSI Troubleshooting Workflow
To maximize the effectiveness of the AWS Support Automation Workflow for EBS CSI driver troubleshooting, set up automated responses for common failure patterns. You can integrate EventBridge rules to trigger actions when monitoring systems detect issues like failed application health checks. This automation helps teams reduce manual intervention and resolve issues more quickly.
The diagram shows how the automation workflow handles EBS CSI driver issues—starting from the initial trigger, moving through comprehensive diagnostics, and ending with final reporting and external system integration.
- Ensure systems are prepared for automation: While the runbook works with standard EKS configurations, some advanced diagnostics require worker nodes to be Systems Manager managed instances. The EKS-optimized Amazon Linux AMI includes the SSM agent by default. For nodes not using the EKS-optimized AMI, you’ll need to register them as managed instances to enable all diagnostic capabilities.
- Create specific CloudWatch alarms to detect EBS CSI issues early: Next, establish a monitoring mechanism to track stateful system health. This can include custom CloudWatch metrics that reflect your stateful application’s health. Set up CloudWatch alarms with appropriate thresholds to detect potential problems early.
- Create EventBridge rules to trigger the automation when issues are detected: Using CloudWatch alarm states, you can create specific EventBridge rules to monitor EBS CSI driver events and trigger the automation workflow with AWS Lambda when issues arise.
aws events put-rule \
--name "EBSCSITroubleshootingTrigger" \
--event-pattern '{
"source": ["aws.cloudwatch"],
"detail-type": ["CloudWatch Alarm State Change"],
"resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:EBSVolumeProvisioningFailures"],
"detail": {
"state": {
"value": ["ALARM"]
}
}
}'
However, if you have your own alerting system, you can directly invoke the SSM runbook using the StartAutomationExecution API:
aws ssm start-automation-execution \
--document-name "AWSSupport-TroubleshootEbsCsiDriversForEks" \
--parameters \
"EksClusterName=my-production-cluster,\
ApplicationPodName=my-stateful-app,\
ApplicationNamespace=default,\
AutomationAssumeRole=arn:aws:iam::123456789012:role/TroubleshootEbsCsiDriversForEks-SSM-Role"
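For the EventBridge path, the rule needs a target that starts the runbook. One possible shape for that target's logic is sketched below. It assumes the CloudWatch alarm state change event format matched by the rule above and a client that behaves like boto3.client("ssm"); the function names are illustrative, and how the pod name and namespace are chosen for a given alarm is left to the caller.

```python
DOCUMENT_NAME = "AWSSupport-TroubleshootEbsCsiDriversForEks"


def build_parameters(cluster, pod, namespace, role_arn):
    """Build runbook parameters in the map-of-lists form the SSM API expects."""
    return {
        "EksClusterName": [cluster],
        "ApplicationPodName": [pod],
        "ApplicationNamespace": [namespace],
        "AutomationAssumeRole": [role_arn],
    }


def handle_alarm_event(event, ssm_client, parameters):
    """Start the runbook when the alarm is in ALARM state.

    Returns the automation execution ID, or None if the event is ignored.
    """
    state = event.get("detail", {}).get("state", {}).get("value")
    if state != "ALARM":
        return None
    response = ssm_client.start_automation_execution(
        DocumentName=DOCUMENT_NAME,
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

Inside a Lambda function, a thin lambda_handler could create the SSM client, build the parameters from environment variables or the alarm name, and delegate to handle_alarm_event.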
Integration with Monitoring and Notification Systems
- Feed automation results to incident management platforms:
You can set up a Lambda function that processes the automation output and updates your incident management system with findings and recommended actions:
import json

import boto3
import requests  # not part of the Lambda Python runtime; package it with the function


def lambda_handler(event, context):
    ssm_client = boto3.client('ssm')
    # Assumes the triggering event carries these keys; adjust them to match
    # your own event source.
    execution_id = event['detail']['automation-execution-id']
    incident_id = event['incident_id']

    # Get automation execution output
    response = ssm_client.get_automation_execution(
        AutomationExecutionId=execution_id
    )
    outputs = response['AutomationExecution']['Outputs']
    diagnosis_report = json.loads(outputs['DiagnosisReport'][0])

    # Update PagerDuty incident
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Token token=YOUR_API_KEY'
    }
    report_text = json.dumps(diagnosis_report, indent=2)
    payload = {
        'incident': {
            'id': incident_id,
            'type': 'incident',
            'status': 'acknowledged',
            'notes': [{
                'content': f"EBS CSI Diagnosis Results:\n{report_text}"
            }]
        }
    }
    pagerduty_response = requests.put(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        headers=headers,
        data=json.dumps(payload),
        timeout=10
    )
    # Fail the invocation if PagerDuty rejects the update
    pagerduty_response.raise_for_status()

    return {
        'statusCode': 200,
        'body': 'PagerDuty updated with diagnosis results'
    }
- Create a centralized repository for storage troubleshooting outcomes:
Store the results of each automation run in a searchable format (such as Elasticsearch) to identify patterns and improve storage configuration over time.
Conclusion
The AWSSupport-TroubleshootEbsCsiDriversForEks runbook provides a robust solution for diagnosing and resolving EBS CSI driver issues in EKS environments. By automating the troubleshooting process across AWS services and Kubernetes components, it reduces the time needed to identify and fix problems.
This post demonstrated how automation can significantly reduce troubleshooting time while providing comprehensive diagnostics. By integrating this workflow with monitoring systems and incident management platforms, teams can create a proactive approach to storage management in their Kubernetes environments.
References