Automated Policy Enforcement
Enforcing policies with automation is a balancing act. In this post I discuss some of the considerations when implementing automated policy compliance.
Rules are mostly made to be broken and are too often for the lazy to hide behind.
― Douglas MacArthur
There is merit in MacArthur’s often butchered quote about rules. Some rules are made to be broken.
Other rules are made for good reasons - such as security and privacy policies. For policies to be effective, they need to be enforced. There is no point in having policies that are ignored.
Enforcing policies with automation is seen as an easy path to compliance. But careful consideration needs to be given to every enforcement action.
It sounds easy. If a user does the wrong thing, roll it back and send them an email. Job done! Not always.
Automated policy compliance enforcement is a balancing act. On one side sits the impact of action. The other side is the risk and severity of the violation.
Discovery of policy violations is an opportunity to educate users. Enforcement should always assume that the user made an honest mistake. The action taken needs to reflect the severity of the violation.
Act Fast
An organisation may have a policy of not allowing public repositories in GitHub. The enforcement flow could make the repository private and notify the user who made it public. This is a low risk enforcement action.
Public forks in GitHub can’t be made private. In this case, the enforcement flow should be to delete the repository and notify the user. The code is still available in the upstream repository. Any changes a developer made are likely still on their local machine.
In both cases, speed of enforcement is important.
When it comes to public repositories, it is acceptable to run a daily, weekly, or monthly script that switches them to private. This is better than doing nothing, but it isn’t ideal, as internal code can stay publicly available for up to a day, or far longer on a weekly or monthly schedule.
In the case of remediating forks, timing is critical. Deleting a fork days or weeks after it was created risks data loss. Remediating issues within seconds is ideal. Deleting a fork immediately after creation avoids data loss.
GitHub webhooks notify you almost immediately after a user performs a non-compliant action. If your tooling can delete a fork within seconds, you can make a public repo private in a similar timeframe.
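As a rough sketch of that webhook-driven flow, the Python below maps an incoming GitHub repository event to the API call that would remediate it. It assumes an organisation-level webhook subscribed to repository events, and a payload whose repository object carries the private, fork and full_name fields. The names plan_enforcement and signature_is_valid, and the shape of the returned tuple, are hypothetical. Sending the request and notifying the user are left out.

```python
import hashlib
import hmac

API = "https://api.github.com"


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check the X-Hub-Signature-256 header GitHub sends with each delivery."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)


def plan_enforcement(event: str, payload: dict):
    """Return (method, url, json_body) for the API call to make, or None."""
    if event != "repository":
        return None
    if payload.get("action") not in ("created", "publicized"):
        return None
    repo = payload["repository"]
    if repo["private"]:
        return None  # already compliant, nothing to do
    url = f"{API}/repos/{repo['full_name']}"
    if repo["fork"]:
        # Public forks can't be made private, so delete while the fork is new.
        return ("DELETE", url, None)
    # Ordinary public repository: flip it to private.
    return ("PATCH", url, {"private": True})
```

Keeping the decision separate from the HTTP call makes the policy easy to test, and the same function can back a scheduled sweep as a fallback if a webhook delivery is missed.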
Sticking with GitHub for another example, an organisation may require that all commits are signed. We could use branch protection rules to prevent merging of a pull request containing unsigned commits. This will ensure compliance, but we can catch the problem earlier. Deleting or rolling back unsigned commits won’t help anyone.
A flow can check every push event and ensure all commits are signed. If they aren’t, email the engineer and possibly their manager. This should be enough to get an engineer to reconfigure their git client.
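One way to sketch this check in Python is shown below. The push payload lists commit SHAs but, as far as I know, not their signature status, so this version looks each commit up and reads commit.verification.verified from the response. The fetch_commit argument is a hypothetical helper that returns the JSON from GET /repos/{owner}/{repo}/commits/{sha}; how you authenticate and who you email are up to you.

```python
from typing import Callable


def unsigned_commits(
    push_payload: dict,
    fetch_commit: Callable[[str, str], dict],
) -> list[str]:
    """Return the SHAs in a push whose signature GitHub could not verify.

    fetch_commit(full_name, sha) is a hypothetical helper that returns the
    JSON body of GET /repos/{owner}/{repo}/commits/{sha}.
    """
    repo = push_payload["repository"]["full_name"]
    unsigned = []
    for commit in push_payload.get("commits", []):
        sha = commit["id"]
        details = fetch_commit(repo, sha)
        verification = details.get("commit", {}).get("verification", {})
        # Treat missing verification data as unsigned rather than guessing.
        if not verification.get("verified", False):
            unsigned.append(sha)
    return unsigned
```

If the list comes back non-empty, the flow sends the notification. Very large pushes may not list every commit in the payload, so treat this as a prompt to educate rather than a complete audit.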
Where to Fix the Problem?
Enterprise security teams love Cloud Security Posture Management (CSPM) platforms. The sales pitch goes something like this: you create a bunch of rules. If non-compliant resources are created, the CSPM will delete them or raise a ticket. This sounds amazing. Everything will be compliant at all times.
Automated, immediate removal of non-compliant resources works well in a ClickOps environment. If the resource is gone before the user has a chance to use it, the impact is minimal.
An email with a link to the relevant policy or documentation helps the user do better next time. Consider alternatives to email, such as Slack, Teams, or Discord messages. There is a lot of noise in people’s inboxes and email delivery can be unreliable.
However, ClickOps doesn’t scale. What happens if CloudFormation created the resource? If the CSPM platform deletes a resource, now you have a broken stack. Have fun fixing that!
It’s a similar story with resources managed by Terraform or other Infrastructure as Code (IaC) tools. A non-compliant resource is created. The CSPM deletes it. The engineer notices the stack is broken and redeploys. Non-compliant boomerangs managed using IaC waste everyone’s time.
This isn’t how remediation should work, especially when it can impact production workloads. Tagging resources provisioned using IaC aids traceability. I’ve seen managed_by, provisioner and iac used to identify which tool provisioned the resource.
Let’s take this one step further. All resources provisioned by a Terraform module should include a tag source_repo which includes the org/repo reference, Git URL, or at the very least the name of the repository where the configuration originated. You might want to consider another tag source_project, which provides the Jira project code.
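To illustrate how an enforcement flow might read these tags, here is a small Python sketch. It assumes tags arrive as a plain dictionary and that any non-empty value for managed_by, provisioner or iac means the resource is IaC-managed. The provenance function and its return shape are hypothetical.

```python
# Tag keys from the conventions described above.
PROVISIONER_TAGS = ("managed_by", "provisioner", "iac")


def provenance(tags: dict) -> dict:
    """Summarise where a resource came from, based on its tags."""
    # First non-empty provisioner tag wins; None means no IaC marker was found.
    tool = next((tags[key] for key in PROVISIONER_TAGS if tags.get(key)), None)
    return {
        "iac_managed": tool is not None,
        "tool": tool,
        "source_repo": tags.get("source_repo"),
        "source_project": tags.get("source_project"),
    }
```

With source_repo and source_project in hand, a flow knows which repository to raise an issue against and which Jira project should own the ticket.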
Automate Collaborative Remediation
When the CSPM platform detects a non-compliant resource provisioned using IaC, rather than deleting it, the flow should be to create a critical remediation ticket. A well designed remediation flow should include escalations if the ticket isn’t closed within a reasonable time frame.
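The decision and the escalation rule can be sketched in a few lines of Python. This is only an illustration: the two-day threshold, the function names and the action strings are assumptions, and a real flow would call your CSPM and ticketing APIs instead of returning strings.

```python
from datetime import datetime, timedelta

# Assumed threshold; a "reasonable time frame" depends on your policy.
ESCALATE_AFTER = timedelta(days=2)

PROVISIONER_TAGS = ("managed_by", "provisioner", "iac")


def choose_remediation(tags: dict) -> str:
    """Delete hand-made resources; raise a ticket for anything made by IaC."""
    iac_managed = any(tags.get(key) for key in PROVISIONER_TAGS)
    return "create_ticket" if iac_managed else "delete_resource"


def needs_escalation(opened_at: datetime, closed: bool, now: datetime) -> bool:
    """True when a remediation ticket has stayed open past the threshold."""
    return not closed and (now - opened_at) > ESCALATE_AFTER
```

Passing the current time in as an argument keeps the escalation rule deterministic, which makes it straightforward to test.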
Fixing problems at the source reduces the risk of them resurfacing.
It is most efficient to catch non-compliant resources managed by IaC during the development phase. Static analysis tools such as tfsec, TFLint and Open Policy Agent are excellent for flagging issues in Terraform. These can be blocking checks in the CI/CD pipeline.
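To show what a blocking check looks like in principle, here is a hand-rolled Python sketch. It is not tfsec, TFLint or Open Policy Agent, just a stand-in for the idea. It reads the JSON produced by terraform show -json for a saved plan and flags planned resources that expose a tags attribute but lack source_repo. The function names are hypothetical. It also assumes tags appear under change.after.tags; tags only known at apply time, or applied through provider-level defaults, may not show up there.

```python
import json
import sys


def untagged_resources(plan: dict, required_tag: str = "source_repo") -> list[str]:
    """List addresses of planned resources that support tags but lack required_tag."""
    missing = []
    for change in plan.get("resource_changes", []):
        # "after" is null when a resource is being destroyed.
        after = change.get("change", {}).get("after") or {}
        if "tags" not in after:
            continue  # no tags attribute on this resource, or it is being deleted
        if required_tag not in (after["tags"] or {}):
            missing.append(change["address"])
    return missing


def main(stream=sys.stdin) -> int:
    """Return a non-zero exit code when any planned resource is missing the tag."""
    failures = untagged_resources(json.load(stream))
    for address in failures:
        print(f"missing source_repo tag: {address}")
    return 1 if failures else 0
```

In a pipeline, a step could pipe the plan JSON into a script that ends with sys.exit(main()), so a missing tag fails the build before anything is deployed.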
Automated policy enforcement is about ensuring things are done properly, rather than punishing people when they mess up. If your policy enforcement is too heavy-handed, you may face a revolt from your users.
The best way to get a bad law repealed is to enforce it strictly.
― Abraham Lincoln
Need Help?
If you want to adopt Proactive Ops, but you're not sure where to start, get in touch! I am happy to help get you started.
Proactive Ops is produced on the unceded territory of the Ngunnawal people. We acknowledge the Traditional Owners and pay respect to Elders past and present.