In our project, we have lots of AWS accounts, and we are in charge of deploying base resources in each of them. To do this, we use CloudFormation to deploy several stacks, all of them nested into one master stack.
One problem we have is that our customers do not always update their stacks to the latest version. Also, when a stack fails to update, they sometimes let it roll back and do not bother to ask for a fix. Of course, this summer there were even fewer updates, with people being on vacation. There was also no release, since we felt there would be nobody to deploy it. So we ended up this September with a larger release than usual.
The result of all this is that we found ourselves staring at lots of accounts with a stack entering the UPDATE_ROLLBACK_FAILED state. The reason? Deprecation. As it happened, AWS decided to deprecate Python 3.6, and also a couple of managed policies (AWSConfigRole, AWSCloudTrailReadOnlyAccess). Of course, we had updated our stacks with the correct Python version and the replacement policies some time ago. But as the stacks were not always up to date, and there were some issues while updating to the new release, many stacks started to roll back. And since some of the values being rolled back to were deprecated, the rollbacks failed.
When you are in that situation, you have two choices. The first one is to delete everything and redeploy. We tried it on one account, and it was really painful: too many dependencies and manual actions. The second one is the one AWS support itself advised: continue rolling back. When you continue a rollback, you have the possibility to skip some resources. In our case, we needed to skip all lambdas using Python 3.6, and all roles using the deprecated policies.
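As a rough illustration of that second option, here is a minimal sketch of continuing a rollback with boto3 while skipping resources. The helper name `continue_rollback` is made up for this example, and `cfn` is assumed to be a boto3 CloudFormation client (for instance `boto3.client("cloudformation")`).

```python
def continue_rollback(cfn, stack_name, resources_to_skip):
    """Resume a failed update rollback, skipping the given resources.

    `cfn` is assumed to be a boto3 CloudFormation client.
    `resources_to_skip` holds logical IDs of resources to leave as they are.
    """
    kwargs = {"StackName": stack_name}
    if resources_to_skip:
        # Leave the parameter out entirely when there is nothing to skip,
        # which amounts to a plain "continue update rollback".
        kwargs["ResourcesToSkip"] = sorted(resources_to_skip)
    return cfn.continue_update_rollback(**kwargs)
```

With credentials configured, it could be called as `continue_rollback(boto3.client("cloudformation"), "master-stack", ["Python36Lambda", "ConfigRole"])`, where the stack name and logical IDs are placeholders. Keep in mind that skipped resources are marked as rolled back without actually being touched, so the stack no longer matches reality until a later update fixes them.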
We tried it manually in the AWS Console, and there is one caveat: you can select resources from nested stacks, but not resources from stacks nested inside nested stacks. Since we had many accounts to update, and many resources to skip, we decided to script the whole process.
We thought it would be simple: using Python and boto3, we list all the resources in our stack, recursively entering nested stacks, and keep only the lambdas and roles. We ran into several problems:
- You cannot skip resources that are not in a failed stack
- You cannot skip resources that are not in a failed state
- You cannot skip resources that are in a failed state because CloudFormation cancelled the update
- Once you run rollback with skipped resources, CloudFormation discovers new failing resources, so you have to iterate until all is fine, or the list of resources to skip does not change between two iterations.
- Resources in a nested stack are named <nested_stack_name>.<resource_logical_id>. Even for resources nested several levels deep, you still use the same pattern, giving only the name of the direct parent stack.
- Waiting for a rollback to complete will throw an exception if the rollback fails.
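The constraints above can be sketched roughly as follows. This is not our production script, only a minimal outline under several assumptions: `cfn` is a boto3 CloudFormation client, failed resources show up with the status `UPDATE_FAILED`, a cancelled update can be recognised by the word "cancelled" in `ResourceStatusReason`, and the nested stack name can be read from the stack ARN. The function names and `max_rounds` are made up for the example.

```python
FAILED_STACK = "UPDATE_ROLLBACK_FAILED"
NESTED_STACK = "AWS::CloudFormation::Stack"
SKIPPABLE_TYPES = {"AWS::Lambda::Function", "AWS::IAM::Role"}


def stack_name_of(stack_id):
    """Extract the stack name from a stack ARN (arn:...:stack/<name>/<uuid>)."""
    return stack_id.split("/")[1] if "/" in stack_id else stack_id


def collect_skippable(cfn, stack_id, is_root=True):
    """Return the set of ResourcesToSkip entries found at or below `stack_id`."""
    status = cfn.describe_stacks(StackName=stack_id)["Stacks"][0]["StackStatus"]
    if status != FAILED_STACK:
        return set()  # resources of a stack that has not failed cannot be skipped
    # Root resources use the bare logical ID; nested ones are prefixed with
    # the name of their direct parent stack only, whatever the nesting depth.
    prefix = "" if is_root else stack_name_of(stack_id) + "."
    to_skip = set()
    pages = cfn.get_paginator("list_stack_resources").paginate(StackName=stack_id)
    for page in pages:
        for res in page["StackResourceSummaries"]:
            if res["ResourceType"] == NESTED_STACK:
                to_skip |= collect_skippable(cfn, res["PhysicalResourceId"], False)
            elif (
                res["ResourceType"] in SKIPPABLE_TYPES
                and res["ResourceStatus"] == "UPDATE_FAILED"
                # Assumption: cancelled updates mention it in the status reason
                and "cancelled" not in res.get("ResourceStatusReason", "").lower()
            ):
                to_skip.add(prefix + res["LogicalResourceId"])
    return to_skip


def rollback_until_stable(cfn, stack_name, max_rounds=10):
    """Retry the rollback until it completes or the skip list stops changing."""
    from botocore.exceptions import WaiterError  # imported lazily, needs botocore

    previous = None
    for _ in range(max_rounds):
        to_skip = collect_skippable(cfn, stack_name)
        if to_skip == previous:
            return False  # same list twice in a row: time to investigate by hand
        cfn.continue_update_rollback(
            StackName=stack_name, ResourcesToSkip=sorted(to_skip)
        )
        try:
            cfn.get_waiter("stack_rollback_complete").wait(StackName=stack_name)
            return True
        except WaiterError:
            previous = to_skip  # rollback failed again: look for new failures
    return False
```

Catching the waiter error rather than letting it propagate is what makes the loop possible: a failed rollback is the expected outcome of the first few rounds, not a fatal condition.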
This is the script that helps turn a stack from UPDATE_ROLLBACK_FAILED to UPDATE_ROLLBACK_COMPLETE:
Of course, once this is done, you still have to fix the stacks and run an update.