Imagine having to take an entire summer off from delivering great features to your customers in order to slog through a swamp of hardware, OS and application issues.
What if it caused you to lose three months of feature roadmap on your premier product?
Even worse, what if the pain was self-inflicted?
What follows is the tale of a popular SaaS provider that was held captive by its environment and infrastructure. Instead of focusing on innovation, the company got caught in a mad scramble to maintain a reliable, PCI-compliant environment. Instead of adding new features to its flagship product, it spent the summer bogged down in patching and remediation. In the end, teams were burned out and key talent hit the door.
Of course, names have been changed to protect the guilty, but the truth is that this could happen anywhere.
Control Issues, Anyone?
The company in question labored under the belief that it needed to have substantial control over its environment. Under the terms of its co-location agreement, the company was responsible for everything but the bare essentials.
Problem was, teams were already stretched thin under a heavy load of hardware, OS and application work. A key deadline for PCI compliance was looming, and attrition ran deep across the NOC, Network Engineering, Systems Engineering and even Application Engineering teams.
For some companies, this might have been the perfect inflection point for considering moving to a managed service provider. Instead, the company doubled down on staff augmentation with an offshore partner — a partner who wasn’t providing strong Operations support in the first place. The hope was to make up for the talent that had walked out the door with fresh new bodies located halfway around the world.
Enter the Hero
As with any good tale, this one isn’t without its heroes. Through the heroic efforts of leaders in Infrastructure Engineering — as well as the CISO — the company managed to obtain its PCI 2.0 certification.
But there was a cost.
With the focus on compliance work, virtually no feature work was done on the company's flagship product for three months, and other products suffered dearly as well. Production Change Control windows were severely limited: any change that wasn't pushing the compliance ball forward was put on hold. Operations was so overwhelmed with patching and remediation that other products sat in a holding pattern waiting for change control windows.
Within several months of that grueling certification push, key leaders and team members left. What remained was a less capable team facing the even more daunting task of PCI 3.0 certification just 12 short months away. With no comprehensive patching program in place, it's likely to be a rinse-and-repeat exercise, this time with even less skill on board.
5 Lessons Learned
Of course, there is a moral to the story — five in fact. Even tragic tales like this one don’t have to be fruitless so long as lessons can be learned.
Lesson 1: The “hero model” is unsustainable. When things get tight, it’s easy to think you’ll just fall back on your heroes. But heroes get tired of not sleeping, and they leave. You’ll have to pay dearly for those heroes who do stay.
Lesson 2: Bench strength matters. The short-sighted view is, "Hey, let's just use our offshore folks." The reality is that if you're staffed so thin that you're heavily reliant on offshore partners, you probably have no capacity left to build a bench with real operational prowess.
Lesson 3: You can’t focus on innovation when you’re consumed by managing infrastructure. When you spend the bulk of your energy fighting the problems of an underperforming tech-ops organization, you have only so much time and so many resources left for innovation. Better to spend them on delivering the things it takes to be a market leader.
Lesson 4: If you don’t take care of problems at the bottom, they quickly rise to the top. In the case of the SaaS provider, there weren’t enough strong people in the NOC; staffers lacked the skills and work instructions to resolve issues themselves, so problems escalated up the ladder quickly. Architects were being woken up at night to firefight production issues.
Lesson 5: Scalability matters. In this case, the company’s environment was sized to survive huge traffic on the first day of each week; the rest of the time, it sat far underutilized. The systems were “over-engineered and overpriced” to meet that once-a-week demand, yet they were also under-engineered in other ways and quickly went down whenever clients transferred large data sets. A cloud provider such as AWS would have given the SaaS provider a simple way to scale up for the weekly peak and offload processing during spikes.
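To make the economics of that once-a-week peak concrete, here is a minimal back-of-the-envelope sketch. All numbers are hypothetical (not drawn from the company in question); it simply compares a fleet statically sized for the Monday spike against one that scales with each day's actual demand:

```python
# Hypothetical weekly demand profile, expressed as "servers needed" per day.
# Monday carries the first-day-of-week spike; the other days are light.
demand = {"Mon": 40, "Tue": 6, "Wed": 5, "Thu": 6, "Fri": 7, "Sat": 3, "Sun": 3}

# Static provisioning must be sized to the peak -- and paid for every day.
static_capacity = max(demand.values())
static_server_days = static_capacity * len(demand)

# Elastic provisioning pays only for what each day actually needs.
elastic_server_days = sum(demand.values())

utilization = elastic_server_days / static_server_days
print(f"Static fleet:  {static_server_days} server-days/week")
print(f"Elastic fleet: {elastic_server_days} server-days/week")
print(f"Static fleet utilization: {utilization:.0%}")
```

With this made-up profile, the statically sized fleet sits at 25% utilization; the other 75% is the "over-engineered and overpriced" capacity the company was paying for six days out of seven.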
A Cautionary Tale
There is value in the old principle of, “we do what we do, so you can do what you do.” With a managed service provider taking over the hardware and providing project management, this story might have ended differently. It certainly would have been a happier ending if the SaaS provider had been freed up to focus on delivering value to its customers.