Data center outages are expensive. Many organizations actually rush to migrate to the cloud, in part, to obtain the 99.9% or more availability that public cloud providers promise them.
In their haste, those same organizations often fail to guard against the potential cost of “unexpected uptime” in the cloud. Because leaders of infrastructure and operations, or I&O, typically approach cloud cost control as a financial governance problem, they view cost mistakes as missed opportunities for efficiency rather than a risk to the business.
Many Gartner clients find out the hard way that cloud costs can spiral out of control more quickly and unexpectedly than traditional information technology costs. Because cloud usage is metered and billed in a “pay as you go” model, costs are highly sensitive to usage patterns. These patterns, in turn, may vary unexpectedly due to changes in business activity, human errors such as inefficient configurations and scripting mistakes, or even malicious external attacks that can create spikes in resource utilization.
In fact, only 22% of I&O leaders are confident that their spending in the cloud is under control, according to Gartner surveys.
One infrastructure leader at a large global bank found himself drained of short-term operating cash and juggling payables after sudden, sharp jumps in the bank’s monthly cloud bill. Making matters worse, “cloud cost incidents” are not covered by insurance in the same way security breaches or physical disasters are: The organization must fully absorb the financial impact itself.
There is no single “safest” level of cloud spending for an organization, but cloud costs clearly create business risk when they differ dramatically from what was expected. There are, however, three key ways that I&O leaders can minimize the business impact of cloud cost surprises and become more resilient in the process.
No. 1: Map your points of cloud cost vulnerability
A foundational step in cloud cost tracking for all I&O organizations is to classify costs into categories that matter to the business, such as by business unit, project or application. From there, I&O leaders must work with the business to determine what areas are most vulnerable to unexpected cloud cost increases.
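As a rough illustration of that classification step, the sketch below sums billing line items by a tag such as business unit. The record layout, tag names and figures are hypothetical and do not correspond to any particular provider’s billing export; untagged spend is grouped under an assumed “unallocated” bucket so it stays visible.

```python
from collections import defaultdict

# Hypothetical billing line items; field names and amounts are illustrative only.
line_items = [
    {"service": "compute", "cost": 1200.0,
     "tags": {"business_unit": "retail", "application": "storefront"}},
    {"service": "storage", "cost": 300.0,
     "tags": {"business_unit": "retail", "application": "storefront"}},
    {"service": "compute", "cost": 800.0,
     "tags": {"business_unit": "finance", "application": "reporting"}},
    {"service": "network", "cost": 150.0, "tags": {}},  # untagged spend
]


def costs_by(items, tag_key):
    """Sum cost per value of tag_key; items missing the tag go to 'unallocated'."""
    totals = defaultdict(float)
    for item in items:
        totals[item["tags"].get(tag_key, "unallocated")] += item["cost"]
    return dict(totals)


by_unit = costs_by(line_items, "business_unit")
by_app = costs_by(line_items, "application")
print(by_unit)
print(by_app)
```

The same function can be pointed at a project or application tag, which makes it easy to see how much spend cannot yet be attributed to any business category.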
Cost sensitivity will vary by situation. For instance, some systems or business units may be particularly sensitive to cost spikes during certain periods, such as e-commerce systems during the Christmas holidays, or during certain phases of operation: sensitive in production, less sensitive during development and testing.
As with disaster recovery programs and other forms of IT resilience, it is rarely cost-effective to protect 100% of cloud spending; leaders should prioritize the most essential areas of the business first.
Identifying areas of cost sensitivity is only one aspect of risk assessment. To measure the true business risk of an unexpected cloud cost event, I&O leaders must combine the likelihood an event will occur with the impact it will have on the business if it does occur – and lay it out on a cloud cost vulnerability map (see Figure 1 below).
Figure 1. Cloud Cost Vulnerability Map (Source: Gartner 2022)
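One simple way to combine likelihood and impact is to multiply two ordinal scores and sort the results into zones. The sketch below does this under stated assumptions: the 1-to-5 scales, the zone thresholds and the example cost areas are all hypothetical, and the layout is not taken from the Gartner figure.

```python
from dataclasses import dataclass


@dataclass
class CostArea:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Risk is scored as likelihood multiplied by impact.
        return self.likelihood * self.impact


def zone(area: CostArea) -> str:
    """Place an area in a map zone; thresholds are illustrative assumptions."""
    if area.risk >= 15:
        return "high"
    if area.risk >= 6:
        return "medium"
    return "low"


areas = [
    CostArea("e-commerce production (holiday season)", likelihood=4, impact=5),
    CostArea("finance reporting", likelihood=2, impact=4),
    CostArea("development and testing", likelihood=4, impact=1),
]

# List the most vulnerable areas first.
for area in sorted(areas, key=lambda a: a.risk, reverse=True):
    print(f"{area.name}: risk={area.risk}, zone={zone(area)}")
```

In this example a frequently overspending test environment still lands in the low zone because its business impact is small, which mirrors the point that sensitivity alone is not the whole risk picture.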
No. 2: Add ‘cost observability’ into your cloud monitoring
To react quickly to cost events, I&O must have a way of knowing when costs are going beyond tolerable limits. Many I&O teams set basic budgets and spending alerts in their cloud infrastructure using the built-in services of their cloud provider, or expect their managed service provider partner to do the same. However, more often than not, I&O does not specify alerting in a way that effectively addresses business risk.
Incorporate “cost observability” into your cloud monitoring processes. In IT, observability is defined as the ability to accurately infer the internal state of a system from its external outputs. Cloud cost observability is the ability to make accurate inferences about cost and financial impact from the system events recorded in cloud logs. I&O must expand its cloud management telemetry to include three basics:
- Event triggering rules that can spot when cloud usage patterns are trending away from budget policies, before significant mistakes occur. Base these on forecast usage, not historical usage.
- Analytics that match changes in cloud cost with known areas of budget sensitivity to make informed risk assessments.
- Differentiated alert and remediation protocols based on where the problems are occurring on your cloud cost vulnerability map.
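The first and third basics can be sketched together: project month-end spend from the month to date, then apply a tighter or looser alert threshold depending on where the workload sits on the vulnerability map. This is a minimal sketch, not a recommended model; the run-rate forecast, the zone names and the threshold ratios are all assumptions chosen for illustration.

```python
def forecast_month_end(daily_costs, days_in_month):
    """Naive run-rate forecast: average daily spend so far times days in the month."""
    if not daily_costs:
        return 0.0
    return sum(daily_costs) / len(daily_costs) * days_in_month


# Assumed policy: the more vulnerable the zone, the smaller the tolerated overrun.
ALERT_THRESHOLDS = {"high": 1.05, "medium": 1.15, "low": 1.30}


def check_budget(daily_costs, days_in_month, budget, zone):
    """Compare forecast spend with budget and flag an alert for the given zone."""
    forecast = forecast_month_end(daily_costs, days_in_month)
    ratio = forecast / budget
    return {
        "forecast": forecast,
        "ratio": ratio,
        "alert": ratio > ALERT_THRESHOLDS[zone],
    }


# Ten days into a 30-day month, spending 100 per day against a budget of 2,800.
spend_to_date = [100.0] * 10
high_zone = check_budget(spend_to_date, 30, budget=2800.0, zone="high")
low_zone = check_budget(spend_to_date, 30, budget=2800.0, zone="low")
print(high_zone)
print(low_zone)
```

Here the same forecast overrun of roughly 7% triggers an alert for a high-vulnerability workload but not for a low-vulnerability one. A production rule would likely need a forecast that accounts for seasonality rather than a flat run rate.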
No. 3: Build a cost incident response plan
In many cases, the initial assessment with business stakeholders will determine that the alert is a “false positive” — that the costs in question were actually expected, or the stakeholder is willing to tolerate them for other reasons. I&O leaders should use these false positives as “feedback mechanisms” to obtain an updated status and incorporate it into their alerting rules.
In other cases, however, the cost alert will be both valid and significant. There will be damage to date, ongoing impact, a need to restore protection quickly and a need for forensics that uncover the underlying causes so they can be addressed. For these situations, I&O must have an incident response plan to implement immediately, which is not to be confused with a security incident response plan.
The response plan is not a policy document or an operating guide. Rather, it is the emergency playbook that every organization hopes it rarely needs to use: practical instructions that will be needed by humans moving quickly under pressure. Make it simple, prescriptive, focused on realistic scenarios, tested and maintained. If an organization is using a cloud MSP, the response plan must be tightly aligned to the MSP’s own response procedures.
Cost management in the public cloud can be a challenging task for I&O teams that are not prepared. But as the previous sections outlined, I&O leaders can get ahead of cost incidents through the sensible assessment of risk points, careful monitoring of cloud budgets and spend, and the implementation of quick, practical recovery procedures in the event of a cost emergency.
David Wright is a research director at Gartner Inc., where he focuses on public, private and hybrid cloud infrastructure. Learn more about key drivers for cloud growth in the complimentary Gartner webinar “The Future of Cloud: Prepare for 2025.”