Phil Jaquenoud, Director of Engineering, Hyperscale Cloud
When Mayhem Hits
It’s Monday morning. You’ve only been head of IT for three months and there is a lot to do. You’re hopeful – things have become less chaotic than when you first accepted your new role. This week you’ll smash those goals, hire some great new talent, and change the world. You may even have time for lunch away from your desk.
Hold on. Where’s the email for the weekend sales figures? Why can’t you log in to the system? No, you pray silently, it can’t be.
You take a deep breath and see your life flash before your eyes.
How long has it been down? What’s impacted? And WHY weren’t you notified right away?
You check all monitoring and find nothing. Everything seems to be in perfect order, except for the fact that the system is down, of course. Think. Think fast. So, what’s changed?
Change Management says there’s just the usual code release on the first Saturday of the month. And, oh, yes, possibly two last-minute infrastructure changes on the app servers that didn’t go through the process!
Your engineers carefully roll back the various changes but the situation remains the same.
How about failing over to DR? You’re not sure when it was last tested, and you know for certain it doesn’t have the latest data. Is it even running the same major version of the app and database?
What about building some replacement app servers? Then you remember your senior engineer is on holiday and you’ve been told he’s the only one who knows how they’re configured.
It seems the only sensible solution is to restore from backups. Your team estimates 8-10 hours and they make a start.
After working through the night, the systems are finally online again.
What Could You Have Done Differently?
Of course, your work doesn’t stop when the system is online again. How can you prevent something like this from happening again, or at least significantly reduce the chances of it?
After some thought, the problems become clear:
The test environment isn’t a good representation of production
Changes to the test environment aren’t being tracked
The DR environment can’t be trusted
Server configuration across the estate is unknown and has drifted over time
You have no way to bring online additional or replacement servers quickly
Sound familiar? Though this story was fictional, you wouldn’t need to ask around for long to find professionals who have suffered through similar situations because of one or more of the above problems.
How DevOps Enablers Can Help
Continuous integration (CI) and continuous delivery (CD) are not new by any means – it’s been 12 years since Martin Fowler’s famous article on CI – and adopting them is definitely one possible solution here.
However, we work with many clients moving workloads to the public cloud that are not yet ready for full CI/CD but do need infrastructure and server configuration to be consistent to help them deliver their mission critical applications.
At Ensono, we’ve implemented some key DevOps enablers for our AWS clients that help across the spectrum – from those deploying many times a day via sophisticated CI/CD pipelines with automated testing, to those deploying and testing once a year manually.
Infrastructure As Code
Using tools like HashiCorp’s Terraform or AWS CloudFormation, we define and manage blueprints of a client’s environment as code. By doing this, we gain some great benefits:
We can reuse field-tested design patterns
It’s possible to create identical replica environments (e.g. Test and DR)
We can ensure changes to these environments are recorded and remain consistent
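To illustrate the idea of a blueprint, here is a minimal Python sketch. It is not Terraform or CloudFormation syntax, and the `Blueprint` class, its fields, and the values shown are hypothetical, chosen only for illustration. It shows how one definition can produce replica environments that differ only by name, and how a comparison against the blueprint exposes an environment that has been changed by hand.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class Blueprint:
    """Hypothetical environment blueprint; the fields are illustrative only."""
    name: str
    app_servers: int
    instance_type: str
    db_engine_version: str


def replica(source: Blueprint, name: str) -> Blueprint:
    """Return a copy of the blueprint that differs only in its name."""
    return replace(source, name=name)


def differences(a: Blueprint, b: Blueprint) -> dict:
    """Return the fields (other than name) whose values differ between a and b."""
    fields_a, fields_b = asdict(a), asdict(b)
    return {
        key: (fields_a[key], fields_b[key])
        for key in fields_a
        if key != "name" and fields_a[key] != fields_b[key]
    }


production = Blueprint(
    name="production",
    app_servers=4,
    instance_type="m5.large",
    db_engine_version="13.4",
)

# Test and DR are derived from the same definition, so they match production.
test_env = replica(production, "test")
dr_env = replica(production, "dr")

# A DR environment that was changed by hand and no longer matches.
drifted_dr = replace(dr_env, db_engine_version="12.9")

print(differences(production, dr_env))      # {}
print(differences(production, drifted_dr))  # {'db_engine_version': ('13.4', '12.9')}
```

Because every environment comes from the same definition, the question "is DR running the same database version as production?" becomes a comparison of two records rather than an investigation.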
We also define server configurations as code, using Ansible. This gives us these main advantages:
Changes can be rolled out across a large number of servers in a consistent manner.
A new server can be configured in a matter of seconds.
We can create trusted machine images (e.g. using HashiCorp’s Packer). When demand requires it, we’re ready to perform fast and reliable autoscaling.
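The consistency described above comes from stating the desired configuration once and converging every server towards it. The following Python sketch illustrates that principle only; it is not Ansible, and the host names, setting names, and the `plan` and `converge` helpers are hypothetical. It computes the changes each server needs, applies them, and shows that a second run changes nothing.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Return the settings that must change for actual to match desired."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def converge(desired: dict, actual: dict) -> tuple[dict, dict]:
    """Apply the planned changes; return the new state and the changes made."""
    changes = plan(desired, actual)
    return {**actual, **changes}, changes


# Desired configuration, stated once (illustrative settings only).
desired = {"nginx_version": "1.24", "max_connections": 1024, "ntp_enabled": True}

# Hypothetical fleet: one correct server, one drifted server, one brand-new server.
fleet = {
    "app-01": {"nginx_version": "1.24", "max_connections": 1024, "ntp_enabled": True},
    "app-02": {"nginx_version": "1.22", "max_connections": 1024, "ntp_enabled": True},
    "app-03": {},
}

first_run = {}
for host in list(fleet):
    fleet[host], first_run[host] = converge(desired, fleet[host])

# Running again makes no further changes: the operation is idempotent.
second_run = {host: plan(desired, state) for host, state in fleet.items()}

print(first_run["app-01"])  # {}
print(first_run["app-02"])  # {'nginx_version': '1.24'}
print(second_run)           # {'app-01': {}, 'app-02': {}, 'app-03': {}}
```

The same definition repairs a drifted server and builds a new one, so the knowledge of how the app servers are configured no longer lives with one engineer.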
All configurations are version controlled in Git repositories. By storing them this way, developers and engineers can have full visibility into the revision history of the configurations – who has changed what, when and why.
Employ DevOps Enablers Today
By having easy and reliable ways to replicate environments, sharing configurations across them, and recording the changes that happen, you reduce the chances of major problems and empower the professionals in your team to deal with them swiftly and efficiently should they occur.