Two years with CloudFormation: lessons learned (sanderknape.com)
107 points by kiyanwang on Aug 12, 2018 | 85 comments


We've been using CF for a few years as well. IMHO it is very complicated to manage and getting stuff working involves a lot of trial and error. Also you end up waiting a lot. Waiting for things to spin up, waiting for things to become available, waiting for things to rollback, etc. On top of that the failure modes can be ugly and hard to figure out.

My recommendation is to treat CF as a single point of failure. Once it gets in a broken state, you may have to destroy your stack and rebuild it. Even if it is fixable on paper, being able to just nuke a stack and replace it is a very good thing. This has happened to us multiple times and having a plan helps.

So what I do with Elasticsearch, for example, is use 3 CF stacks (one per AZ). This lets me do rolling restarts in a sane way, simply by replacing the stacks one by one, without destroying my cluster state and without building some flaky deep integration into CF to orchestrate the restart.

If I were to build this again, I'd probably use terraform. Also, I'm looking forward to moving most of our stuff to kubernetes.


> My recommendation is to treat CF as a single point of failure. Once it gets in a broken state, you may have to destroy your stack and rebuild

One of the more common scenarios where CF gets into a broken state is:

1) create new S3 bucket + something else (e.g. some Elastic Beanstalk env update)

2) something else fails, causing rollback

3) S3 bucket already contained data (e.g. your failed Elastic Beanstalk env update caused it to write data)

4) CF refuses to destroy the S3 bucket, entering a "rollback failed" state

In this case, manually wiping the S3 bucket works well enough. But generally, it appears that CF kind of works when the updates you're making are really small, incremental ones.

Sometimes it gets totally corrupted and you need to nuke stuff, per your advice. This automatically leads me to the following suggestion: leave mission-critical data out of CloudFormation. Specifically, stuff like RDS databases which you absolutely never ever want to have destroyed: just provide the endpoint as an input to your CF template.


You can set the DeletionPolicy attribute to "Retain" to work around this S3 issue. CloudFormation will successfully roll back without attempting to remove the S3 bucket. You can then remove it manually yourself before trying to deploy again.

Check out the docs here: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...
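For reference, a minimal sketch of what that looks like in a template (resource name hypothetical):

    Resources:
      DataBucket:
        Type: AWS::S3::Bucket
        # On stack deletion or rollback, CloudFormation leaves the bucket
        # (and its objects) in place instead of failing to delete it.
        DeletionPolicy: Retain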


You'd better do so before deploying again, because the roll-forward will break since the resource already exists.

This is a major pitfall when using DeletionPolicy=Retain with named resources: it breaks seamless rollbacks/roll-forwards. If you roll back, then in order to deploy again you need to either delete the named resources with DeletionPolicy=Retain that were rolled back, or update your template to rename them all. It is such a huge pain.


True, but it beats the alternative where CloudFormation deletes objects that you didn't want deleted. The underlying issue is that the S3 objects are outside of the CloudFormation scope, thus it takes no risk and doesn't delete your objects.

A nice feature would be a "ForceDelete" DeletionPolicy that would delete the objects. You could even set this initially when creating a stack, then change it to "Retain" later once the stack is stable.

Totally agree btw that it's a huge pain initially, though once you know it it's also not that hard to work around.


My preferred behavior would be for CFN to not barf when rolling forward. In other words, to be able to assume control over a resource that already exists.


> Specifically, stuff like RDS databases which you absolutely never ever want to have destroyed: just provide the endpoint as an input to your CF template.

There are ways around it as stated below, but I agree completely. I don't bother with CloudFormation for shared, building-block-type infrastructure like RDS and Elasticsearch. It's just not that much of a pain to spin up a database on each account. Besides, the characteristics of the databases in different environments are going to be so different that you'll have either parameters or FindInMap functions anyway, so for all intents and purposes you're not running the same template anyway.

As the article said, changing any resource that's exported from a CF template is such a pain that it's better just to use Parameter Store if you can get away with it.


> DeletionPolicy=Retain with named resources

Try to avoid naming resources (the "Name" property), so there are no clashes. Use Ref (same stack) or ImportValue (other stack) to reference created resources. If you must name something, concatenate it from the stack name, e.g. Fn::Join on "AWS::StackName" and "-public-elb", as in the sketch below. The cloud way is to replace things rather than keep them, and it is convenient to know that if I delete a CF stack, everything is cleaned up.
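A minimal sketch of that export/import pattern (stack and resource names hypothetical):

    # exporting stack: the export name is derived from the stack name
    Outputs:
      PublicElb:
        Value: !Ref PublicElb              # an ELB resource in this stack
        Export:
          Name: !Sub "${AWS::StackName}-public-elb"

    # consuming stack, assuming the exporting stack is called "network"
    SomeProperty: !ImportValue "network-public-elb"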

Speaking of S3, it's better not to include a bucket resource in CF until you know what you're doing.

> stuff like RDS databases which you absolutely never ever want to have destroyed

That begs for a separate CF stack, with a template creating the RDS instance and related things only, then exporting the endpoint in Outputs.
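Roughly like this (identifiers hypothetical, properties trimmed to the required minimum):

    Parameters:
      DbUser:
        Type: String
      DbPassword:
        Type: String
        NoEcho: true
    Resources:
      Database:
        Type: AWS::RDS::DBInstance
        DeletionPolicy: Retain              # never let CF destroy the data
        Properties:
          Engine: postgres
          DBInstanceClass: db.t2.micro
          AllocatedStorage: "20"
          MasterUsername: !Ref DbUser
          MasterUserPassword: !Ref DbPassword
    Outputs:
      DbEndpoint:
        Value: !GetAtt Database.Endpoint.Address
        Export:
          Name: !Sub "${AWS::StackName}-db-endpoint"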


I agree with splitting up CF stacks this way. It reduces the blast radius, which saved me on a number of occasions when stack updates went sideways.

The problem introduced by that approach is how to manage a large number of CF stacks. First I used a homegrown Python library to manage them, then switched to having Terraform manage CF.

At first Terraform on CF was just intended to be an expedient measure to facilitate migrating everything to Terraform, and eventually we did migrate to pure Terraform. But then we started hitting all the rough edges in Terraform. In hindsight, the hybrid approach had actually been more stable and manageable than using either tool in isolation.


You should take a look at AWS CodePipeline, which natively supports creating/updating CloudFormation stacks.


Same waiting with Terraform. If you aren't automating Windows installs and/or imaging, count yourself lucky :)

Terraform also lacks a lot of the extra "smarts" that CF has, like rolling updates of any kind and some other higher-level automation across different services. They take very much a blue/green approach, which is beyond limiting for some services.
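As an example of those smarts, CF can roll an Auto Scaling group natively via an UpdatePolicy; a minimal sketch (sizes and launch config hypothetical):

    WebServerGroup:
      Type: AWS::AutoScaling::AutoScalingGroup
      Properties:
        MinSize: "2"
        MaxSize: "4"
        AvailabilityZones: !GetAZs ""
        LaunchConfigurationName: !Ref LaunchConfig  # defined elsewhere
      UpdatePolicy:
        AutoScalingRollingUpdate:
          MinInstancesInService: 2   # keep capacity up during the roll
          MaxBatchSize: 1            # replace one instance at a time
          PauseTime: PT5M            # wait up to 5 minutes between batches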

Ran into a few broken states using CloudFormation via a Serverless project. Luckily I had been in the habit of keeping "stateful" AWS resources such as queues, databases, and other stuff in their own stacks (or Terraform) to keep those stacks' complexity to a minimum and mitigate the impact of the more complicated stacks needing to be deleted.

Ultimately I started going with, and now promote, a hybrid approach where it makes sense: Terraform for base infrastructure (including most stateful resources) and CF for stuff like autoscaling groups.


Updating GSIs for a DynamoDB table has been difficult for us. We ended up recreating the table in order to change the indexes.


The only thing that spending time with Cloudformation teaches me is how much it makes me prefer doing things with Terraform. I think Cloudformation is considerably better than nothing and it was great when there were no alternatives, but that was a while ago.


Terraform is terrible compared to CloudFormation. Its selling point is multi-cloud support, but you'll never get it; clouds are too different.

- A good CF template is 10x less code for the same solution.

- No corrupted state problems.

- Native tool, supporting all properties of resources.

Writing good CF templates takes solid AWS knowledge and systems thinking: you group resources that belong together, and it actually teaches you good architecture.


I think Terraform's multi-cloud support is a bit better than CloudFormation's. Jokes aside, I don't think the multi-cloud part is really the biggest selling point. The biggest selling points, for me, are:

- Much better than CloudFormation at telling you what it's going to change before you apply the changes, plus the ability to record those changes (much better than those dreaded 'conditional' changes).

- The ability to import resources if you find some that were created outside of Terraform. It's not perfect, or easy, but mostly doable.

- The ability to look at the code, the state file and the plan to get a good representation of what's actually deployed.

Those three are more significant than they look; together they make sure you:

- Don't get into a situation where automation is broken and you can only recover by rebuilding the stack.

- Don't get unexpected downtime because a change replaces a resource unexpectedly.

- Can track, record and manage changes in easy-to-read diffs and plans.


> The ability to look at the code, the state file and the plan to get a good representation of what's actually deployed

With CF, you look at the template code only to know what was deployed.


The changesets feature of CloudFormation allows you to do most of what you mention here. Also take a look at resource deletion policies and Lambda custom resources.
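For example, previewing an update with the AWS CLI (stack and file names hypothetical):

    aws cloudformation create-change-set \
      --stack-name my-stack \
      --change-set-name preview-1 \
      --template-body file://template.yml

    # inspect the proposed resource changes before executing anything
    aws cloudformation describe-change-set \
      --stack-name my-stack \
      --change-set-name preview-1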


Unless they've fixed it, changesets didn't work well in certain situations, like with nested stacks, and often don't provide nearly the same level of detail as to what EXACTLY is changing and why.


Would you be willing to share a CF template that is 10x less than the equivalent Terraform? It has been my experience that Terraform is much less verbose and much more reusable via modules. The CloudFormation I have seen has always seemed excessive and quite convoluted for simple tasks.

The latest example I have is the CloudFormation used to generate EKS clusters (https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-07...): it clocks in at 168 LOC, while the equivalent Terraform (https://raw.githubusercontent.com/terraform-providers/terraf...) is only 53 LOC.


You can't get much less verbose than CloudFormation YAML combined with using only the required properties for a resource. For example, write a CloudFormation YAML template that creates an automatically named S3 bucket.
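If I have it right, the minimal version is just this (omitting BucketName makes CloudFormation generate a name):

    Resources:
      Bucket:
        Type: AWS::S3::Bucket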


Is it shorter than the Terraform for the same thing?

    resource "aws_s3_bucket" "x" {}


Terraform gives a consistent _workflow_ across clouds, not a consistent codebase. I know personally of many teams using Terraform for significant multi-cloud deployments consisting of thousands of resources. Several switched from CloudFormation and saw their codebase size dramatically reduce.

Furthermore, the other comments on this post should disabuse you of the notion that there are "no corrupted state" problems with cloud formation: they happen all the time.

Disclosure: I worked on Terraform for quite some time at HashiCorp and am still a (community) maintainer of the AWS provider.


Food for thought, but I've never had CloudFormation break because a patch level upgrade changed a regular expression to 1) disallow previously legal inputs, and 2) disallow inputs AWS allows. I've also never had it forget a resource (probably due to race conditions when deleting failed resources).

CFN has its warts, but I full-stop don't trust HashiCorp's operations or their attempt at an SDLC, and I wouldn't trust my business's health to them as a company (and if the results of using it weren't enough, their clownshoes sales team's bad attempts to upsell would clinch it).


Totally agree. I don’t care how much less verbose terraform may be (this claim is questionable IMHO). The most important part of infrastructure engineering is being able to debug and fix things quickly by isolating issues to the smallest possible domain. The additional layer of highly unstable terraform source code does not decrease the debugging surface area.


Serious question: how do you go about debugging a CloudFormation stack that's in a broken state, without involving AWS support?

I mean, it's weird, because I agree with your statement that "The most important part of infrastructure engineering is being able to debug and fix things quickly by isolating issues to the smallest possible domain." And that's why I so strongly prefer Terraform, because I actually have control over the state file, how Terraform interacts with it, and I can move things in and out, and change things in-situ if necessary.


> I wouldn't trust my business's health to them as a company

That. It's the most dangerous thing for long-term projects: third-party tools and services. You want to minimise those to the absolutely necessary ones.


Weird. I haven't compared byte sizes of CF templates and Terraform code, but just as far as readability, HCL works a lot better than YAML. YMMV.

As for state, I'm not sure what you did to corrupt your state, but we use Terraform to manage thousands of resources across dozens of AWS accounts for the past three years and haven't had any state corruption, except when a human messes up editing a state file by hand. Obviously in that case you back things up first (or hopefully you are using remote state with some kind of versioning). But the fact that you are _able_ to manipulate the state with the CLI tool, or by hand in extreme cases, is itself a huge advantage over CloudFormation, which has no such capability.

As for coverage, my experience has been that Terraform often has coverage of new resource types and properties _before_ CloudFormation. And it's extremely rare for any new features to take very long to show up in the AWS provider. Anything significant is usually picked up in 2-3 weeks from the API release at most.


I'm skeptical of your 10x less code claim. You can definitely get into broken state problems in CloudFormation - with no recourse but to blow it all away and start over. And despite being a native tool, CloudFormation support for new features and services in AWS is often spotty/missing.

That said, my experience has been that both CloudFormation and Terraform are irritating, just in different ways; they both are warty.

I do ultimately prefer Terraform - even in a single-cloud setup.


Cloudformation supports more AWS features than terraform.


Such as? My experience has always been the opposite.

Recently, I decided to use Terraform over CloudFormation specifically because you can't create an EKS cluster (with nodes) in a single stack.


Some specific services (namely Data Pipeline) aren't supported in Terraform. However, some parameters, like Enhanced VPC routing on Redshift clusters, are supported by Terraform but not CloudFormation.

The rule of thumb that you should generally stick to CloudFormation if you are full bore invested into AWS has some truth.

My issues with CloudFormation are the lack of control over rollbacks, missing features for existing and mature services like the above, and being forced to use custom resources for anything that vaguely resembles coding, like the IP address math functions Terraform handles just fine.


> Native tool, supporting all properties of resources

Unfortunately, this is far from the case. I can name 3 things off the top of my head not supported in CloudFormation.

1. The ability to create Route 53 records for Certificate Manager

2. Create a full EKS Cluster (with nodes).

3. EC2-Fleet.


Terraform is a much nicer way to deploy your CF templates than with the AWS CLI though. You can get the best of both worlds by deploying the templates with Terraform.
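A rough sketch of that combination, using the AWS provider's aws_cloudformation_stack resource (names and parameters hypothetical):

    resource "aws_cloudformation_stack" "network" {
      name          = "networking-stack"
      template_body = "${file("network.yml")}"

      # passed straight through as CloudFormation parameters
      parameters = {
        VPCCidr = "10.0.0.0/16"
      }
    }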


The AWS CLI for CF is simple and consistent with the other AWS CLIs. You can also use one of the language-specific SDKs such as boto3, or use AWS CodePipeline to create/update stacks.


I want to love Terraform but it's such a horrible platform to code on:

* Error messages are overly verbose yet cryptic, and sometimes even unrelated to the actual error raised by the cloud provider. Coupled with the lack of line numbers or any other helpful identifier aside from the unnecessarily long module hierarchy, debugging those scripts is a massive exercise in frustration that usually wastes far more time than should ever be necessary.

* HCL is a hateful "language". The fact that you cannot order stuff procedurally means you're constantly running into dependency issues on larger deployments. And don't even get me started on the "count" kludge to work around the lack of proper iteration.

* There is a lack of internal consistency in the support of different methods. E.g. "count" doesn't always work with all resource types, and some resources cannot have properties defined with variables.

* Calling modules requires so much bootstrapping code. It's just painful.

I get that Terraform is the best we have for multi-provider deployments, but their idea to create a superset of JSON only to then compile it back down to JSON anyway was a poor decision, in my personal opinion. I get that the point was to have something accessible to non-programmers while still expressive enough for developers; instead, what they've created is a monstrous language that is too complex for the former group and too irrational for the latter.

I've been very tempted to write my own Terraform alternative based on my experiences using it (and CloudFormation) - I even have another programming language that I've written a parser for, which would be well suited to this type of application. But my time is pretty limited at the moment, so I struggle on with Terraform.


Funny thing is, HCL isn't actually a superset of JSON. For example,

    {
      "foo": { "bar": 1 }
    }
can't be represented in HCL. (Even Consul had to add a kludgy hack to support HCL config as a result.) Instead, they call HCL "JSON-compatible", which I think means JSON can be written to represent any equivalent HCL structure (HCL is a subset, essentially).

That said, you might be interested in Terraform 0.12 [0], which will be using some new HCL v2 that actually has first-class expressions and dynamic blocks (for loops). And, finally, a ternary operator that short-circuits. Unfortunately, the dynamic block stuff looks like it's based around for-loops and doesn't support just regular if-statements... but we'll see where that goes.

[0] https://www.hashicorp.com/blog/terraform-0-12-preview
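For a flavour of the announced syntax, a 0.12 dynamic block is supposed to look roughly like this (variable and resource names hypothetical):

    variable "service_ports" {
      default = [80, 443]
    }

    resource "aws_security_group" "web" {
      name = "web"

      # generates one ingress block per element of var.service_ports
      dynamic "ingress" {
        for_each = var.service_ports
        content {
          from_port   = ingress.value
          to_port     = ingress.value
          protocol    = "tcp"
          cidr_blocks = ["0.0.0.0/0"]
        }
      }
    }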


Thanks for the link. I've not yet had a chance to play with Terraform 0.12, but from what I've read, HCL v2 definitely sounds like a step in the right direction. However, so long as it's primarily a data serialisation format, I think I'm going to take issue with writing Terraform code in it, because sometimes you just need to express something procedurally. Maybe I'm just in the minority here? Or maybe I've been spoilt by tools like Puppet and Bash, but I can't help feeling that HCL is a step backwards in terms of expressiveness.

I wasn't aware of the JSON subset / Consul problem though. That's really interesting to read. It's funny, because back when I was building a test Consul cluster I did wonder why JSON was used for config instead of HCL. I guess now I know why.


I agree with you 100%. I'm pretty excited about HCL v2 - already simple stuff like a short-circuiting ternary operator makes my life easier (no more weird joins/splats with conditional resources in outputs). Hopefully further improvements are implemented on top of the 0.12 changes.

Otherwise, much like you, I'm tempted to write a Terraform frontend that interfaces with the existing providers...


A lot of problems similar to your complaints exist for CloudFormation as well. Instead of complaining about the language warts, you’ll be forced to use an even more cumbersome set of primitives to access arrays and hashes, and the messages will only get less sensible as you look for either YAML whitespace errors or inevitably write a converter to avoid using JSON. CloudFormation limits like number of parameters and outputs start to be a real pain to scale to support a production environment beyond simple demo stacks, and Terraform has more issues scaling with team members concurrently trying to modify system state due to its less strict modularization / locking model.

Infrastructure as code tooling is all very primitive compared to what we take for granted writing most other traditional software but it will take some time and maybe another generation to do it well.


I don't know where you got the impression that I was arguing CloudFormation is better than Terraform, but I assure you that wasn't the argument I was making. In fact, I made the same points you make in a reply to another commenter in this discussion.

I know Terraform is the best tool we currently have (I even explicitly stated that in my previous post), but that doesn't mean there isn't still massive room for improvement. Starting with the deprecation of HCL, in my personal opinion.


You might want to give cfn-builder[0] a try. I'm biased because I wrote it but I find that it's a good way to write and maintain my CFN templates. Also it's written in simple Nodejs and is easy to expand for your own needs.

[0] https://www.npmjs.com/package/cfn-builder


CloudFormation isn't really good enough because it's AWS-specific and doesn't track changes to the state. Plus it just uses data serialisation formats as well, so it doesn't even address the core problems I raised with HCL. What I really want is Puppet for infrastructure; the closest I've used to that is Terraform, but the syntax isn't quite there...yet.

Plus the pace of development in Node.js worries me. All too often I've run into issues where modules have changed and broken things downstream. When you're running infrastructure as code you really want to be damn sure your tooling is going to be consistent for years to come, and I really don't have that faith in Node.js. Sure, if you're a JavaScript developer you can manage it easily enough, but if you're a DevOps engineer who rarely touches JS then you really want your tools to be low maintenance. So to that end I wouldn't consider any Node.js projects for serious production work, given the kind of customers I work for (high-availability stuff for some major brands). It might be fine, but it's just not worth the risk.


What I don't like about Terraform is that the state of your resources is stored locally, which might lead to consistency problems depending on your setup. With CloudFormation state is handled by CloudFormation itself, so you can be confident that stack updates operate on the latest state.


I store terraform state in s3 with dynamodb locking. It looks like this:

    #backend/state uses less priv keys
    terraform {
      backend "s3" {
        bucket     = "mystatebucketisunique-tf"
        key        = "states/tf/tf.tfstate"
        region     = "us-east-1"
        lock_table = "lock-tf"
        profile    = "aws_tf_s3-prof"
      }
    }

So for an entirely new env I have to set up that bucket and the DynamoDB table.


That's only the default. Terraform supports storing state remotely, with locking. Most folks I know who use Terraform at scale recommend using this feature (if they store state at all).


Then choosing where and how to store state using Terraform becomes another point of possible inconsistency in your infrastructure and thing to worry about. In CloudFormation, the state is coupled with the service, and you don't have to worry about it. It just works.


Thanks for the information. Apparently my knowledge about state handling in Terraform did predate the introduction of remote state. I should've checked that first.


You can store the state remotely; there are many options for state storage. You can use S3 with DynamoDB locking, or even create your own web service that accepts state updates from Terraform.


Of course you can implement/set up syncing of state on your own, but that's something you don't have to worry about when using CloudFormation. I'm not sure how the options for Terraform handle race conditions when trying to deploy updates in parallel, but again, something you don't have to worry about with CloudFormation.


This "implementing/setting up syncing of state" requires only a few lines of Terraform config.


Lines of code is irrelevant when you are in the middle of a production deploy and you just want the highest visibility into what is going on. In cloudformation, there are only two possible places for resource state: 1) The actual resource state in AWS and 2) The desired state stored in CloudFormation.


no, you and your team have to worry about not making a change that could render your whole stack useless


"there are many options for state storage too"

Just, no.


Exactly. This is a HUGE point of interest that, in complex multi AWS account or multi region scenarios, can make or break your infrastructure.


Additionally, you can run it from a single management host (or Jenkins), for example, so the state is always up to date.


Amen


CloudFormation is pretty cool. In a rather short amount of time I was able to create a reproducible deployment (based on any commit in my Git repo) that deploys a Lambda, makes it accessible via API Gateway, creates a DynamoDB table for storage, sets up a Cognito user pool for user management, creates a CloudFront distribution that securely serves my SPA and the API Gateway, and lastly adds a record to my domain such that it is accessible at `${commit}.mydomain.com`.


Awesome. Care to share or point to best resources to learn? Thanks!


IMO the limits of CloudFormation are a bigger pain point than they're made out to be here. The limit of 200 resources per stack is easy to hit, and so is the 450KB template size limit (well, it's possible at least). It's frustrating to need to spread a single service across three stacks because it has a lot of API Gateway endpoints. The real answer is nested stacks, but those still count towards the (raisable) total stack limit of 200.


At least there isn't the limit of 50KB for SAM templates anymore [1]. That was a ridiculous limit, especially as nested templates aren't possible with SAM yet [2].

[1]: https://github.com/awslabs/serverless-application-model/issu...

[2]: https://github.com/awslabs/serverless-application-model/issu...


Have you even read the article?

Using exported resources and avoiding nested stacks makes it impossible to reach these numbers.


Exactly. Stack imports/exports were launched to address the shortcomings of nested stacks, so avoid nested stacks in favor of stack imports/exports. Actually, avoid both nested stacks and stack imports/exports in favor of SSM parameters.
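A sketch of the SSM approach: one stack publishes the value as a parameter, and consumers read it via a typed template parameter instead of Fn::ImportValue (parameter names hypothetical):

    # publishing stack
    Resources:
      VpcIdParam:
        Type: AWS::SSM::Parameter
        Properties:
          Name: /infra/network/vpc-id
          Type: String
          Value: !Ref Vpc               # a VPC defined elsewhere in this stack

    # consuming stack
    Parameters:
      VpcId:
        Type: AWS::SSM::Parameter::Value<AWS::EC2::VPC::Id>
        Default: /infra/network/vpc-id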


200 resources is way way too many for a single template. Split up your API Gateway API resources, methods, and models into multiple stacks. One stack for the API, another group of stacks for the API resources, methods, models, and another stack for your API deployments.


> The limit of 200 resources per stack is easy to hit

Never managed to get even close to that number, doing CF for 5 years :)


I'm guessing it's heavily dependent on what you're using it for; what kind of resources are you describing in CFN?

In my case, it's Lambda + API Gateway that's the main culprit. For each endpoint in an API, there's a resource for the lambda itself, and a managed policy, role, log group, subscription filter, API resource, method, lambda permission, model, and additional OPTIONS method for CORS purposes. With a setup like that, you can hit the limit with a moderately-sized API.


or you can utilize exported resources, like the article describes, and append your API endpoints to the "master" API Gateway resource that is not even in the same stack.


Yes, that's the solution we came up with. One stack with lambdas, managed policies, roles, user pool, etc; one stack with the APIGateway::RestApi and some resources and methods, and then another stack with yet more resources and methods. This works, but I don't think it's a great solution. Wouldn't it be preferable to have one stack that encompasses the whole API for one service?


WTF are you developing... good grief. What happened to a simple web app with some stateless rest endpoints?


I recommend using Troposphere instead of vanilla CF. It's a Python library that generates CF templates. It doesn't abstract anything away, so the structure ends up looking very similar to a JSON or YAML template, but with all the conveniences of working with objects in Python.
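A tiny sketch of what that looks like (resource names hypothetical):

    # pip install troposphere
    from troposphere import Output, Ref, Template
    from troposphere.s3 import Bucket

    t = Template()
    # no BucketName property, so the bucket is auto-named by CloudFormation
    bucket = t.add_resource(Bucket("ArtifactBucket"))
    t.add_output(Output("BucketName", Value=Ref(bucket)))
    print(t.to_json())  # emits a plain CloudFormation template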

The biggest gripe I have with CF is that it's impossible to introduce existing components into a CloudFormation stack, so any legacy infrastructure has to remain manually managed.


Terraform does everything CloudFormation does but in a simpler way where you have more control over what's happening.


CloudFormation has better atomicity guarantees though. It's not perfect, but in general if a change to a stack fails, it will get rolled back to a known good state. Terraform doesn't give you the same guarantee: you'll have to push a rollback or a fix yourself, leaving your AWS resources in a potentially broken state while you do.


My biggest issue with CloudFormation is inconsistency around the smaller AWS offerings. If I need to build a VPC with some EC2 capacity, it works well. If I want to create a load balancer and use R53 to do DNS-based certificate validation with their in-house SSL provider, I'm out of luck.

It looks like internal products need to do the work to enable CloudFormation support, and AWS doesn't have a consistent model here. It seems they are fine with some products cutting corners and not offering support (like DNS-based certificate validation).

Inconsistency within aws isn't all that surprising.


Terraform can do that: https://www.terraform.io/docs/providers/aws/r/acm_certificat...

That said, one irritating omission I've had to deal with is not being able to add email subscriptions to SNS topics. The underlying AWS API is a bit odd - I don't think it provides an ARN until the subscription is confirmed.


How does CloudFormation compare with using something like ansible to manage AWS environments?


CloudFormation is infrastructure management, not configuration management. Both Ansible and CloudFormation can be used for either in different ways, but usually you have your configuration management (such as Docker containers) in one step of your pipeline and CloudFormation templates in another. That way you can test your infrastructure (by deploying CloudFormation templates and tearing them down) as well as your code, without them being too dependent on each other.


This infrastructure/configuration distinction is very hazy when it comes to services like Lambda or Fargate, where you just specify your code artifact and there's essentially nothing more to do. It's not clear that it's a net benefit to introduce additional tooling beyond CloudFormation/Terraform for deploying to these services. It's certainly not strictly necessary.


What do you mean by “configuration management” - I don’t use Docker. I use CF for managing configuration with Parameter Store.


I have used Ansible to manage large production deployments of AWS infrastructure, and what I will say is that it is very good at doing it, but it requires a lot of work compared to using something like Terraform or CloudFormation. It's not hard to have one playbook provision all your infrastructure and make sure it's up and running; it just takes a good amount of lookup and fact calls.

The only reason I advocate doing it is if a team has a small infrastructure footprint (like a basic ELB -> ASG -> RDS/EC2/S3) and doesn't want to bring in more complex tools. Using Ansible means you can use one tool to manage both your AMIs for immutable infrastructure and the infrastructure itself (and can easily script your continuous deployment). Once you start to get a really complex footprint, a dedicated tool for infrastructure makes a lot of sense.


My opinion is that just because you can interact with AWS APIs with Ansible doesn't mean that you should. I think it's good to use Ansible's AWS interaction for things like dynamic inventory and the orchestration of certain tasks (e.g. stop a group of instances, switch load balancer config between blue/green deployments, etc.). That said, I don't think it's that much worse than CloudFormation, because it suffers from the same lack of statelessness and idempotency that you need to engineer around.


> using something like ansible to manage AWS

You may use Ansible for top level orchestration, deploying CF stacks and providing CF parameters. Just leave all resource creation to CF itself.
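A sketch of that split, using Ansible's cloudformation module (stack, template, and parameter names hypothetical):

    # site.yml - Ansible orchestrates, CloudFormation creates the resources
    - hosts: localhost
      connection: local
      tasks:
        - name: Deploy the network stack
          cloudformation:
            stack_name: network
            state: present
            template: templates/network.yml
            template_parameters:
              VpcCidr: 10.0.0.0/16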


Is there anyone here who has used both Amazon's CF and Azure's ARM and can comment on the benefits and problems of each?

When I used CF a few years back (when it started) it was a pain (for those things it actually supported). I'm now using Azure, and ARM's integration with Azure's cloud seems better to me.


I'd also love any experiences people have with Google's Deployment Manager. For me, the product felt like it had many of the flaws that had also previously steered me away from CloudFormation (specifically, incomplete support for beta or alpha features, and questions about inconsistent state during failures). I decided to go with Terraform since it feels like the industry standard and has full support for even quite new GCP features.


Last time I checked, ARM did not have any support for custom resources - the first thing mentioned in the article. Also, last time I checked it was JSON-only, while CloudFormation moved to YAML, which is shorter and allows proper comments.


I was unaware of the ability to create custom CF resources! This is great. I will try to make a config to create an AWS Aurora Serverless RDS instance. It went GA Friday, and the team says CF support won't be available until the end of the month.
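If I understand custom resources correctly, the stack side would look something like this, assuming a provisioning Lambda that calls the RDS API and signals back to CloudFormation (all names hypothetical):

    Resources:
      AuroraServerless:
        Type: Custom::AuroraServerlessCluster
        Properties:
          # ARN of the Lambda that creates/updates/deletes the cluster
          ServiceToken: !ImportValue aurora-provisioner-arn
          ClusterIdentifier: my-serverless-cluster
          Engine: aurora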


Excellent advice! I would also advise creating a couple of scripts that upload to S3 and run the update-stack commands automagically. All the advice in the article is gold, though.



