Why Event Driven Serverless for IT Operations?
Several times I’ve presented a talk entitled Proactive Ops: Event Driven IT Operations. Ops is already event driven - something goes wrong and a human tries to fix it. Proactive Ops is about moving beyond being reactive.
In this post, we will explore why you should run your Proactive Ops platform on a serverless stack. These ideas can be applied on AWS or other cloud platforms. If you really want, you could build on top of an open source, self hosted, “serverless” stack. However, the “on prem” approach will negate some of the benefits of a fully managed serverless offering.
Eliminate Toil
Eliminating toil is a key tenet of Site Reliability Engineering (SRE). Running servers, be they bare metal, VMs or containers, involves significant toil. Serverless removes this overhead. Teams can focus on building the business logic, API clients, routing rules and workflows for detecting and automagically remediating issues. These are the core technical building blocks of a Proactive Ops platform.
Anything else the team builds is a distraction. Teams can deliver faster when you remove the maintenance overhead and cognitive load of provisioning and patching VMs, managing web servers and maintaining Dockerfiles.
There is some initial overhead when teams adopt serverless. It is a different paradigm. The initial investment in learning is paid back quickly once the team learns and masters the new approach.
Auto Scaling
Most B2B SaaS products expose state changes via webhooks. These events provide real time updates about what is happening within the app. The volume of webhook events ebbs and flows across the day as engineers perform different actions. You need elastic infrastructure to handle peaks, without paying for unused compute during the troughs.
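As a rough illustration, a webhook receiver on a serverless platform can be little more than a single function. The sketch below assumes an AWS Lambda style handler behind an HTTP endpoint that passes the request body as a string in event["body"]. The "type" field in the payload is a made up example, not something any particular SaaS vendor guarantees.

```python
import base64
import json


def handler(event, context):
    """Receive one webhook delivery and acknowledge it.

    Assumes an HTTP proxy style event with "body" and "isBase64Encoded" keys.
    """
    body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")

    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected an object"})}

    # "type" is a hypothetical field name; check your vendor's webhook docs.
    event_type = payload.get("type", "unknown")

    # A real platform would route the event to a queue or workflow here.
    return {"statusCode": 202, "body": json.dumps({"received": event_type})}
```

Each delivery invokes the function independently, so the platform, not your team, decides how many copies run at peak and how few run overnight.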
Immediate Response
Some teams opt for polling APIs periodically in batches. Batch processing data adds complexity to your environment. The API crawler needs to keep track of the last run, page through results, handle partial batch failures and so on. Handling the events in real time involves fewer moving parts.
Even if your polling task runs every 5 minutes, you’ve introduced a delay in response time. If it runs weekly or monthly, you’re reacting to very old changes.
The longer a non compliant resource remains in an environment, the greater the risk that something or someone will rely upon it. Immediately remediating issues removes this risk. For example, if an action doesn’t comply with a policy, it should be reverted immediately. If the change is rolled back before someone gets to use it, they’re mildly inconvenienced. If the change is rolled back a week later, it is likely to lead to a production outage. We want to avoid that.
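To make the "revert immediately" idea concrete, here is a minimal sketch of a remediation step. Everything in it is an assumption for illustration: the ChangeEvent shape, the POLICY table and the revert callback are hypothetical, and a real platform would call the relevant vendor API inside revert.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical shape of a change notification."""
    resource_id: str
    setting: str
    old_value: str
    new_value: str


# Hypothetical policy: for each setting, the values that are allowed.
POLICY: Dict[str, Set[str]] = {"visibility": {"private", "internal"}}


def is_compliant(event: ChangeEvent) -> bool:
    """Settings without a policy entry are treated as compliant."""
    allowed = POLICY.get(event.setting)
    return allowed is None or event.new_value in allowed


def handle_change(event: ChangeEvent,
                  revert: Callable[[str, str, str], None]) -> str:
    """Revert a non compliant change as soon as the event arrives."""
    if is_compliant(event):
        return "ignored"
    # Roll the setting back to its previous value. This sketch does not
    # check whether the previous value was itself compliant.
    revert(event.resource_id, event.setting, event.old_value)
    return "reverted"
```

Because the function runs once per event, the change is rolled back seconds after it is made rather than at the next batch run.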
Transparent Costs
The pricing model used by all three big cloud providers for their serverless offerings is pay per use. There is no need to pay for idle resources.
With serverless Proactive Ops there is a direct relationship between cloud spend and business value. For example, if a workflow costs 0.01 USD per execution, and a human performing that task costs 20 USD, the saving of 19.99 USD per execution can be quantified. The savings, in dollar terms, from avoided outages or context switching for engineers are harder to calculate. When sharing cost savings with management, it is best to use numbers you can easily explain and justify.
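The arithmetic is simple enough to put in a few lines of code. The sketch below uses the example figures from this post. The function names and the figure of 500 executions per month are assumptions for illustration only.

```python
from decimal import Decimal


def savings_per_execution(workflow_cost: Decimal, human_cost: Decimal) -> Decimal:
    """Saving from automating one task, in the same currency as the inputs."""
    return human_cost - workflow_cost


def savings_for_period(executions: int,
                       workflow_cost: Decimal,
                       human_cost: Decimal) -> Decimal:
    """Total saving across a number of executions."""
    return executions * savings_per_execution(workflow_cost, human_cost)


# Example figures from the post: 0.01 USD per workflow run, 20 USD per manual task.
WORKFLOW_COST = Decimal("0.01")
HUMAN_COST = Decimal("20")

per_run = savings_per_execution(WORKFLOW_COST, HUMAN_COST)
# 500 executions per month is a made up volume, not a benchmark.
per_month = savings_for_period(500, WORKFLOW_COST, HUMAN_COST)
```

Decimal is used rather than float so the cents add up exactly when the numbers end up in a report.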
It is possible to build a Proactive Ops platform using more traditional application platforms, such as containers or VMs. It is also possible for me to drive my car 30 km/h over the speed limit while not wearing a seat belt! Just because something is possible doesn’t mean it is a good idea.
Adopt a serverless first approach for your Proactive Ops platform. This will allow your engineers to focus on features, not infrastructure. Non compliant actions and resources will be remediated before anyone relies on them. Your platform will scale without manual intervention. You will be able to demonstrate a direct relationship between your bill and the cost savings. 🌊
Need Help?
If you want to adopt Proactive Ops, but you're not sure where to start, get in touch! I am happy to help get you started.
Proactive Ops is produced on the unceded territory of the Ngunnawal people. We acknowledge the Traditional Owners and pay respect to Elders past and present.