CloudFormation in Production: What Breaks and How to Fix It



Moving past YAML templates to failure handling, security, and real tradeoffs


Before we start​


This is a follow-up to Infrastructure as Code with AWS CloudFormation: From Fundamentals to Production Patterns.

That article covered templates, stacks, nested stacks, CI/CD, and production best practices.

This article covers what happens when those best practices aren't enough. When things break in ways the documentation doesn't warn you about. When you're reading CloudFormation error messages at midnight and need answers.


Part 1: Stack deployment failures​

Failure 1: "Resource handler returned message: 'Role does not exist'"​


Symptoms:

  • IAM role creates successfully (status: CREATE_COMPLETE)
  • Lambda or EC2 resource fails immediately after
  • Error: "The role named 'xxx' does not exist or is not authorized"

Root cause:
IAM is eventually consistent. CloudFormation marks the role CREATE_COMPLETE as soon as the create call returns, but the role can take several seconds to propagate to the services that will consume it.

Fix:


Code:
LambdaFunction:
  Type: AWS::Lambda::Function
  DependsOn: LambdaExecutionRole
  Properties:
    Role: !GetAtt LambdaExecutionRole.Arn



Referencing the role with !GetAtt already gives CloudFormation an implicit dependency; DependsOn makes the ordering explicit. CloudFormation also builds in a short wait for IAM resources to propagate, but a consumer can still occasionally race the propagation, in which case retrying the stack operation succeeds.

Prevention:
Always reference IAM roles created in the same stack via !Ref or !GetAtt (never a hard-coded ARN string), and add DependsOn where no intrinsic reference exists.


Failure 2: Stack timeout without clear cause​


Symptoms:

  • Stack creation or update times out after the configured limit
  • No obvious error in event log
  • Some resources show CREATE_IN_PROGRESS for hours

Root cause:
Resources with CreationPolicy or WaitCondition are waiting for signals that never arrive. Common causes:

  • EC2 instance user data script fails silently
  • Custom resource Lambda times out
  • Application code never calls cfn-signal

Diagnosis:


Code:
# Check if any resources have CreationPolicy
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='CREATE_IN_PROGRESS']"

# For EC2, check user data logs on the instance
cat /var/log/cloud-init-output.log



Fix:

For EC2 with user data:


Code:
#!/bin/bash
# Do your setup here

# Signal success or failure (the user data must be wrapped in !Sub in the
# template so ${AWS::StackName} and ${AWS::Region} resolve)
/opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} \
  --resource WebServerInstance --region ${AWS::Region}



For custom resources, implement timeout handling:


Code:
def handler(event, context):
    try:
        # Do work
        send_response(event, context, "SUCCESS")
    except Exception as e:
        # CRITICAL: Always send a response
        send_response(event, context, "FAILED", reason=str(e))
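
The `send_response` helper is referenced but not defined above. A minimal sketch following the standard custom resource response contract (the field names are those CloudFormation expects at `event['ResponseURL']`; the helper names themselves are ours):

```python
import json
import urllib.request

def build_response_body(event, status, reason=None, data=None):
    """Build the JSON body CloudFormation expects at event['ResponseURL']."""
    return json.dumps({
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See CloudWatch Logs",
        "PhysicalResourceId": event.get("PhysicalResourceId", "custom-resource"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    })

def send_response(event, context, status, reason=None, data=None):
    body = build_response_body(event, status, reason, data).encode()
    req = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""},  # the pre-signed S3 URL expects an empty content type
    )
    urllib.request.urlopen(req)  # CloudFormation unblocks once this PUT succeeds
```

If this PUT never happens, the stack waits until the resource times out, which is exactly the hang described above.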



Prevention:
Always test CreationPolicy paths with --disable-rollback first so you can inspect failed resources without automatic cleanup.


Failure 3: Nested stack update fails, root cause invisible​


Symptoms:

  • Parent stack update fails
  • Error message: "Nested stack failed to update"
  • No details about why the nested stack failed

Root cause:
CloudFormation does not bubble up nested stack failure details to the parent. You have to check each nested stack individually.

Diagnosis:


Code:
# List nested stacks from the parent
aws cloudformation list-stack-resources --stack-name parent-stack \
  --query "StackResources[?ResourceType=='AWS::CloudFormation::Stack'].[PhysicalResourceId]"

# Check each nested stack's events
aws cloudformation describe-stack-events --stack-name nested-stack-1



Fix:

Add explicit validation before updating parents:


Code:
# Validate nested template before updating parent
aws cloudformation validate-template --template-body file://nested.yaml

# Check nested stack for drift
aws cloudformation detect-stack-drift --stack-name nested-stack-1



Prevention:
Minimize nested stack depth (2 levels maximum). For complex dependencies, use StackSets or split into separate parent stacks.
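
Checking each nested stack by hand scales poorly. A sketch that walks the tree down to the first real failure (the boto3 client is passed in; `first_failure` and `find_root_cause` are our own names, not an AWS API):

```python
NESTED = "AWS::CloudFormation::Stack"

def first_failure(events):
    """Return the ResourceStatusReason of the first *_FAILED event, if any."""
    for e in events:
        if e.get("ResourceStatus", "").endswith("_FAILED"):
            return e.get("ResourceStatusReason")
    return None

def find_root_cause(cfn, stack_name):
    """Depth-first: descend into failed nested stacks until a leaf failure is found."""
    # cfn: a boto3 CloudFormation client, e.g. boto3.client("cloudformation")
    for res in cfn.describe_stack_resources(StackName=stack_name)["StackResources"]:
        if res["ResourceType"] == NESTED and res["ResourceStatus"].endswith("_FAILED"):
            return find_root_cause(cfn, res["PhysicalResourceId"])
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    return stack_name, first_failure(events)

# Usage: find_root_cause(boto3.client("cloudformation"), "parent-stack")
```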


Part 2: Drift and configuration mismatch​

Failure 4: Production resource changed outside CloudFormation​


Symptoms:

  • Security group rule allows unexpected traffic
  • S3 bucket becomes public
  • RDS backup retention period changes
  • No corresponding change in Git history

Root cause:
Someone modified a resource directly in the AWS console or via CLI, bypassing CloudFormation.

Diagnosis:


Code:
# Detect drift on a stack
aws cloudformation detect-stack-drift --stack-name prod-web

# Get detailed drift results
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id <id>

# List drifted resources
aws cloudformation describe-stack-resource-drifts --stack-name prod-web \
  --stack-resource-drift-status-filters MODIFIED DELETED



Fix — manual:


Code:
# Bring the resource's actual state back under CloudFormation with an
# IMPORT change set (the template must describe the resource as it really is;
# "data-bucket-name" is a placeholder)
aws cloudformation create-change-set --stack-name prod-web \
  --change-set-name import-drifted --change-set-type IMPORT \
  --template-body file://template.yaml \
  --resources-to-import '[{"ResourceType":"AWS::S3::Bucket","LogicalResourceId":"DataBucket","ResourceIdentifier":{"BucketName":"data-bucket-name"}}]'



Fix — automated:


Code:
# EventBridge rule to detect drift weekly
# (an AWS::Lambda::Permission allowing events.amazonaws.com to invoke
# the function is also required)
DriftDetectionRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 12 ? * MON *)"  # Every Monday at noon UTC
    Targets:
      - Id: drift-check
        Arn: !GetAtt DriftLambda.Arn
        Input: '{"stackName": "prod-web"}'



Prevention:

  • Enforce IAM policies that prevent resource modification outside CloudFormation
  • Enable drift detection on all production stacks
  • Review drift reports weekly
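
The weekly rule needs a Lambda behind it. One possible handler (the function shape and alerting are assumptions; `detect_stack_drift` and `describe_stack_resource_drifts` are the real boto3 operations):

```python
DRIFTED = {"MODIFIED", "DELETED"}

def count_drifted(drifts):
    """Count resource drift records whose status indicates real drift."""
    return sum(1 for d in drifts if d.get("StackResourceDriftStatus") in DRIFTED)

def handler(event, context, cfn=None):
    # cfn: a boto3 CloudFormation client; imported lazily so the module
    # loads without AWS credentials present
    import boto3
    cfn = cfn or boto3.client("cloudformation")
    stack = event["stackName"]
    detection_id = cfn.detect_stack_drift(StackName=stack)["StackDriftDetectionId"]
    # detect_stack_drift is asynchronous; in production poll
    # describe_stack_drift_detection_status with detection_id before reading results
    drifts = cfn.describe_stack_resource_drifts(StackName=stack)["StackResourceDrifts"]
    n = count_drifted(drifts)
    if n:
        print(f"ALERT: {n} drifted resources in {stack}")  # wire to SNS in production
    return {"drifted": n}
```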

Failure 5: Stack drift causes deletion protection to block cleanup​


Symptoms:

  • Trying to delete a stack
  • Error: "Cannot delete stack because resource X has deletion protection"
  • That resource was not supposed to have deletion protection

Root cause:
Someone enabled deletion protection directly on the resource (RDS deletion protection, EC2 termination protection) outside CloudFormation, so the template knows nothing about it.

Diagnosis:


Code:
# Find which resource is blocking deletion
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='DELETE_FAILED']"



Fix:


Code:
# Remove deletion protection from the resource directly
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Or, for EC2 termination protection
aws ec2 modify-instance-attribute --instance-id i-12345 \
  --no-disable-api-termination

# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack



Prevention:
Include DeletionPolicy: Retain in your template for stateful resources, not deletion protection. DeletionPolicy is understood by CloudFormation. Deletion protection is not.


Part 3: Rollback failures​

Failure 6: Rollback fails because resource won't delete​


Symptoms:

  • Stack update fails
  • Rollback starts
  • Rollback fails
  • Stack stuck in ROLLBACK_FAILED

Root cause:
A resource created during the failed update cannot be deleted. Common reasons:

  • S3 bucket has versioning enabled and contains objects
  • RDS has deletion protection enabled
  • Network interface is still attached
  • Custom resource performed external actions

Diagnosis:


Code:
# Find which resource caused rollback failure
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?ResourceStatus=='DELETE_FAILED']"



Fix - for S3:


Code:
# Empty the bucket of current objects
aws s3 rm s3://bucket-name --recursive
# (on a versioned bucket this leaves old versions and delete markers behind;
# those must be removed with s3api delete-objects before deletion succeeds)

# Suspend versioning so new versions stop accumulating
aws s3api put-bucket-versioning --bucket bucket-name \
  --versioning-configuration Status=Suspended

# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack
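
If the bucket was versioned, `aws s3 rm --recursive` only removes current objects, so deletion can still fail. A boto3 sketch that purges versions and delete markers too (`delete_objects` accepts at most 1,000 keys per call, hence the chunking helper):

```python
def chunks(items, size=1000):
    """Yield successive slices of at most `size` items (the delete_objects limit)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def purge_bucket(s3, bucket):
    """Delete every object version and delete marker in the bucket."""
    # s3: a boto3 S3 client, e.g. boto3.client("s3")
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        keys = [{"Key": v["Key"], "VersionId": v["VersionId"]}
                for v in page.get("Versions", []) + page.get("DeleteMarkers", [])]
        for batch in chunks(keys):
            s3.delete_objects(Bucket=bucket, Delete={"Objects": batch})
```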



Fix - for RDS:


Code:
# Disable deletion protection
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Skip final snapshot if you want fast cleanup
aws rds delete-db-instance --db-instance-identifier mydb \
  --skip-final-snapshot



Prevention:
Design stateful resources with DeletionPolicy: Retain in production. Accept that you will clean them manually. Do not let stateful resources block automated rollbacks.


Failure 7: Rollback takes too long, extending downtime​


Symptoms:

  • Stack update fails at minute 15
  • Rollback takes another 20 minutes
  • Total downtime: 35+ minutes

Root cause:
Resources with DeletionPolicy: Snapshot take time to create snapshots during rollback. RDS snapshots can take 10-20 minutes. EBS snapshots add minutes per volume.

Diagnosis:


Code:
# Check which resource is taking time during rollback
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?contains(ResourceStatus, 'DELETE')]"



Fix during incident:
You have limited options once rollback starts. The fastest path is often to let it finish, even if slow.

Prevention:
Separate stateful resources (databases, buckets) into their own stack. This stack changes rarely. Application stacks change frequently but contain no stateful resources.


Code:
# data-stack.yaml (deploys monthly; rollback is slow but rare)
Database:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Snapshot

# app-stack.yaml (deploys daily; rollback is fast)
AutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  DeletionPolicy: Delete  # No snapshot, deletion is quick



When the application stack fails, rollback takes seconds, not minutes. The database stack is untouched.


Part 4: IAM and permission failures​

Failure 8: "User is not authorized to perform cloudformation:CreateStack"​


Symptoms:

  • CI/CD pipeline fails
  • Error message about missing permission
  • Same permissions worked yesterday

Root cause:
IAM policies changed. A condition was added. A permission was removed. The role used by CI/CD no longer has required access.

Diagnosis:


Code:
# (create-stack has no dry-run mode; simulate the IAM policy instead)

# Check effective permissions for the role
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/ci-cd-role \
  --action-names cloudformation:CreateStack \
  --resource-arns arn:aws:cloudformation:us-east-1:123456789012:stack/*



Fix:
Add the missing permission to the CI/CD role:


Code:
{
  "Effect": "Allow",
  "Action": "cloudformation:CreateStack",
  "Resource": "arn:aws:cloudformation:region:account:stack/*"
}



Prevention:
Use IAM boundaries and permission guardrails. Test CI/CD role permissions in a staging account before deploying to production.


Failure 9: Cross-account stack operations fail​


Symptoms:

  • Stack in Account A tries to create a resource in Account B
  • Error: "Access denied" or "Role does not exist"

Root cause:
CloudFormation does not natively support cross-account resource creation. You need IAM roles in both accounts with trust relationships.

Fix — setup cross-account role in target account:


Code:
# In Account B (target)
CrossAccountRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::AccountA:root
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AdministratorAccess  # Scope down in production



Fix — assume role from source account:


Code:
# In Account A (source)
CustomResource:
  Type: Custom::CrossAccount
  Properties:
    ServiceToken: !GetAtt CrossAccountLambda.Arn
    TargetRoleArn: arn:aws:iam::AccountB:role/CrossAccountRole
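
Inside a hypothetical CrossAccountLambda like the one referenced above, the usual pattern is an STS AssumeRole call; the credential mapping is the part worth getting right:

```python
def session_kwargs(assume_role_response):
    """Map an STS AssumeRole response to boto3.Session keyword arguments."""
    c = assume_role_response["Credentials"]
    return {
        "aws_access_key_id": c["AccessKeyId"],
        "aws_secret_access_key": c["SecretAccessKey"],
        "aws_session_token": c["SessionToken"],
    }

def client_in_target_account(target_role_arn, service):
    """Return a boto3 client that operates in the target account."""
    # boto3 imported lazily so the module loads without AWS available
    import boto3
    sts = boto3.client("sts")
    resp = sts.assume_role(RoleArn=target_role_arn,
                           RoleSessionName="cfn-cross-account")
    return boto3.Session(**session_kwargs(resp)).client(service)
```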



Prevention:
Design stacks to be account-specific. Use AWS Organizations and StackSets for multi-account deployments instead of cross-account resource references.


Part 5: Template validation failures that only appear at deploy time​

Failure 10: Template validates but deployment fails​


Symptoms:


Code:
aws cloudformation validate-template --template-body file://template.yaml
# Returns: Template is valid



But deployment fails with: "Encountered unsupported property" or "Resource handler returned invalid request"

Root cause:
validate-template checks syntax and basic schema. It does not check:

  • Resource property combinations that are invalid (e.g., certain combinations of SourceSecurityGroupId and CidrIp)
  • Region-specific limitations (some resources not available in all regions)
  • Service limits (e.g., requesting 2000 IOPS when limit is 1000)

Diagnosis:

Deploy with --disable-rollback to keep failed resources for inspection:


Code:
aws cloudformation create-stack --stack-name test-stack \
  --template-body file://template.yaml \
  --disable-rollback



Then examine the failed resource's status reason:


Code:
aws cloudformation describe-stack-resources --stack-name test-stack \
  --query "StackResources[?ResourceStatus=='CREATE_FAILED']"



Fix:
Correct the specific property combination. Check region availability. Request service limit increases before deployment.

Prevention:
Test in a staging region first. Use cfn-lint in CI/CD — it catches property combination errors that validate-template misses.


Code:
# Install cfn-lint
pip install cfn-lint

# Run locally before commit
cfn-lint template.yaml



Part 6: Change set failures​

Failure 11: Change set shows replacement when you expected modification​


Symptoms:

  • Change set indicates "Replacement" for a production resource
  • You expected an in-place modification
  • Replacement means downtime

Root cause:
Some property changes apply in place, some interrupt the resource, and some force full replacement. For AWS::RDS::DBInstance, changing DBInstanceClass causes an interruption but not a replacement, while changing Engine, MasterUsername, or (in most cases) DBSubnetGroupName forces replacement.

Diagnosis:

Check which property triggered replacement:


Code:
aws cloudformation describe-change-set --stack-name prod-stack \
  --change-set-name my-change-set \
  --query "Changes[?ResourceChange.Replacement=='True']"



Common properties that force replacement:

Resource                 Properties that force replacement
AWS::RDS::DBInstance     Engine, MasterUsername, DBSubnetGroupName (in most cases)
AWS::EC2::Instance       ImageId, InstanceType (sometimes), SubnetId
AWS::S3::Bucket          BucketName (can't change in place)
AWS::Lambda::Function    FunctionName

Fix:

  • Accept the replacement and plan for downtime
  • Use blue/green deployment for zero-downtime replacement
  • Modify the resource directly in AWS console (not recommended for IaC)

Prevention:
Always review change sets in staging before production. Know which properties cause replacement for your critical resources.
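
That review can be automated as a CI gate. A sketch that inspects `describe-change-set` output and refuses to proceed when any change implies replacement (`Replacement` is a string in the API: "True", "False", or "Conditional"):

```python
def replacements(change_set):
    """Return logical IDs of resources the change set would replace."""
    out = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        if rc.get("Replacement") in ("True", "Conditional"):
            out.append(rc.get("LogicalResourceId"))
    return out

# CI usage sketch:
# cs = boto3.client("cloudformation").describe_change_set(
#     StackName="prod-stack", ChangeSetName="my-change-set")
# if replacements(cs): fail the pipeline and require manual approval
```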


Failure 12: Change set execution fails because of update conflicts​


Symptoms:

  • Change set creates successfully
  • execute-change-set fails
  • Error: "Cannot update stack because another update is in progress"

Root cause:
Another process (CI/CD pipeline, another engineer, scheduled automation) started a stack update while your change set was waiting for execution.

Diagnosis:


Code:
# Check current stack status
aws cloudformation describe-stacks --stack-name prod-stack \
  --query "Stacks[0].StackStatus"

# Status like UPDATE_IN_PROGRESS or ROLLBACK_IN_PROGRESS means locked



Fix:
Wait for the other update to complete. Then create a new change set based on the latest stack state. Do not execute the old change set — it's now out of date.


Code:
# Delete the stale change set
aws cloudformation delete-change-set --stack-name prod-stack \
  --change-set-name old-change-set

# Create a new change set against the current stack state
aws cloudformation create-change-set --stack-name prod-stack \
  --change-set-name new-change-set --template-body file://template.yaml

# Execute the fresh change set
aws cloudformation execute-change-set --stack-name prod-stack \
  --change-set-name new-change-set


Prevention:

  • Implement stack-level locking via S3 condition keys or custom resources
  • Coordinate CI/CD pipelines to never deploy simultaneously to the same stack
  • Use separate stacks for separate environments
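
Short of full locking, the pipeline can simply wait until the stack is idle before creating its change set. A sketch (`is_locked` relies on the fact that every in-flight status ends in `_IN_PROGRESS`):

```python
import time

def is_locked(status):
    """Any *_IN_PROGRESS status means CloudFormation holds the stack lock."""
    return status.endswith("_IN_PROGRESS")

def wait_until_unlocked(cfn, stack_name, poll_seconds=15, timeout=1800):
    """Poll the stack status until no operation is in progress."""
    # cfn: a boto3 CloudFormation client
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        if not is_locked(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} still locked after {timeout}s")
```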

Part 7: Performance and quota failures​

Failure 13: Stack deployment times out due to API rate limiting​


Symptoms:

  • Stack deployment slows dramatically after hundreds of resources
  • Error: "Rate exceeded" for various AWS APIs
  • Some resources take 5-10 retries before succeeding

Root cause:
CloudFormation makes many API calls to create resources. AWS APIs have rate limits. Large stacks hit these limits.

Diagnosis:


Code:
# lookup-events cannot filter on error codes directly;
# pull recent events and grep the raw records for throttling
aws cloudtrail lookup-events --max-results 50 \
  --query "Events[].CloudTrailEvent" --output text | grep -c ThrottlingException



Fix — immediate:
Split the stack. The hard limit is 500 resources per stack; staying near 200 keeps deployments fast and failures easier to isolate.


Code:
# List resources by type to see distribution
aws cloudformation list-stack-resources --stack-name large-stack \
  --query "StackResources[*].[ResourceType]" --output text | sort | uniq -c



Fix — long term:
Design modular stacks:


Code:
network-stack.yaml     (VPC, subnets, route tables)
data-stack.yaml        (RDS, ElastiCache, S3)
compute-stack.yaml     (ASG, launch templates)
app-stack.yaml         (Lambda, API Gateway)



Prevention:
Monitor stack creation time. If it exceeds 15 minutes for non-stateful resources, split the stack.
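
Deploy time can be measured from the event stream instead of eyeballed. A helper (pure given the event list, so it is easy to wire into CI):

```python
from datetime import datetime, timezone

def deploy_minutes(events):
    """Minutes between the oldest and newest event timestamps."""
    times = [e["Timestamp"] for e in events]
    return (max(times) - min(times)).total_seconds() / 60

# events = boto3.client("cloudformation").describe_stack_events(
#     StackName="prod-stack")["StackEvents"]
# if deploy_minutes(events) > 15: consider splitting the stack
```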


Failure 14: Service quota exceeded during deployment​


Symptoms:

  • Deployment fails
  • Error: "You have reached your limit of X resources"

Root cause:
AWS account has default service limits. You're trying to create more resources than allowed.

Common default quotas (defaults change over time; verify yours):

  • VPCs per region: 5
  • Security groups per region: 2,500
  • RDS DB instances per region: 40
  • Lambda concurrent executions: 1,000

Diagnosis:


Code:
# Check current usage against quota
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-12345678

# List all quotas for a service
aws service-quotas list-service-quotas --service-code rds



Fix — immediate:
Request quota increase from AWS Support or via Service Quotas API:


Code:
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-12345678 \
  --desired-value 100



Fix — tactical:
Reduce resource count in the current deployment. Use smaller instance sizes. Share resources across stacks.

Prevention:
Include quota checks in your CI/CD pipeline before deployment:


Code:
# Script to check quotas before deploying
python scripts/check_quotas.py --template template.yaml
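
check_quotas.py is not shown in the article; a minimal sketch might count resource types in the parsed template and compare them to known limits (the `LIMITS` values here are illustrative; fetch real ones from the Service Quotas API at runtime):

```python
from collections import Counter

def count_resource_types(template):
    """Count resources per type in a parsed CloudFormation template dict."""
    return Counter(r["Type"] for r in template.get("Resources", {}).values())

def quota_violations(template, limits):
    """Map resource type -> (requested, limit) for every type over its limit."""
    counts = count_resource_types(template)
    return {t: (n, limits[t]) for t, n in counts.items()
            if t in limits and n > limits[t]}

# Illustrative only; real values come from service-quotas at runtime
LIMITS = {"AWS::EC2::VPC": 5}
```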



Part 8: Troubleshooting workflow - where to start​


When a CloudFormation deployment fails, follow this workflow:

Step 1: Get the raw error​


Code:
aws cloudformation describe-stack-events --stack-name prod-stack \
  --max-items 20 --query "StackEvents[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED']"



Look for the ResourceStatusReason field. This is your primary clue.

Step 2: Identify the failed resource​


The error message tells you which logical resource failed. Find its type and properties in your template.

Step 3: Check if it's a known failure pattern​

Error message pattern            Likely cause               Fix section
"Role does not exist"            IAM eventual consistency   Part 1, Failure 1
"Rate exceeded"                  API throttling             Part 7, Failure 13
"Limit exceeded"                 Service quota              Part 7, Failure 14
"Deletion protection"            Rollback blocked           Part 3, Failure 6
"Another update in progress"     Concurrent update          Part 6, Failure 12

Step 4: Deploy with --disable-rollback for debugging​


Code:
aws cloudformation create-stack --stack-name debug-stack \
  --template-body file://template.yaml \
  --disable-rollback



Failed resources remain so you can inspect them directly.

Step 5: Inspect the failed resource directly​


For EC2:


Code:
aws ec2 describe-instances --instance-ids i-12345
ssh ec2-user@instance-ip # Check logs



For Lambda:


Code:
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function
STREAM=$(aws logs describe-log-streams --log-group-name /aws/lambda/my-function \
  --order-by LastEventTime --descending \
  --query "logStreams[0].logStreamName" --output text)
aws logs get-log-events --log-group-name /aws/lambda/my-function \
  --log-stream-name "$STREAM"



For RDS:


Code:
aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance


Step 6: Fix, then continue​


If the stack is stuck in ROLLBACK_FAILED (failed create) or UPDATE_ROLLBACK_FAILED (failed update), you have two options:

Option A: Delete the failed stack and recreate


Code:
aws cloudformation delete-stack --stack-name prod-stack
# Wait for deletion
aws cloudformation create-stack --stack-name prod-stack --template-body file://template.yaml



Option B: Continue rolling back after fixing the blocker


Code:
# Fix the blocking resource (empty the S3 bucket, disable deletion protection),
# then resume the rollback (UPDATE_ROLLBACK_FAILED only):
aws cloudformation continue-update-rollback --stack-name prod-stack

# Or skip resources that cannot be rolled back:
aws cloudformation continue-update-rollback --stack-name prod-stack \
  --resources-to-skip ProblemResource



Production CloudFormation checklist​


Before deploying to production, verify:

Drift detection

  • [ ] Enabled on all production stacks
  • [ ] Weekly automated drift check configured
  • [ ] Alerts configured for drift findings

Rollback strategy

  • [ ] Stateful resources have DeletionPolicy: Retain or Snapshot
  • [ ] Stateless resources have DeletionPolicy: Delete
  • [ ] Stateful and stateless resources in separate stacks

IAM and security

  • [ ] No "Action": "*" in policies
  • [ ] Secrets use {{resolve:secretsmanager:...}} not parameters
  • [ ] CI/CD role has minimal required permissions
  • [ ] cfn-guard or cfn-lint running in CI

Failure handling

  • [ ] CreationPolicy includes timeout and signal handling
  • [ ] Custom resources always send SUCCESS or FAILURE responses
  • [ ] Nested stack depth ≤ 2

Performance

  • [ ] No stack exceeds 200 resources
  • [ ] No stack consistently deploys longer than 15 minutes
  • [ ] Service quotas checked before deployment

Troubleshooting readiness

  • [ ] describe-stack-events command documented in runbook
  • [ ] Access to failed resource logs (EC2, Lambda, RDS) available
  • [ ] --disable-rollback used in staging deployments


Written by Onyedikachi Obidiegwu | Cloud Security Engineer

 