The Global Outage: A Wake-Up Call : @VMblog

By Kevin Cole, director of product and technical marketing, HPE

Most people are aware of the CrowdStrike incident in July 2024 that caused a global outage, disrupting organizations and individuals around the world. From daily operations to long-term strategic goals, businesses were left paralyzed, unable to operate without critical technology.

The scale of the disruption revealed a critical truth: most organizations are not ready for incidents of this size. That makes reliable, robust disaster recovery and business continuity solutions essential. Reflecting on the CrowdStrike/Microsoft incident, it is worth analyzing the key takeaways to understand how the appropriate disaster recovery technology can help resolve similar events.

Reflecting & Recovering from the CrowdStrike Incident

This massive outage, which left devices crashing with a blue screen of death (BSOD), is a stark reminder that most "disaster" events are caused not by natural disasters but by human error or, in many cases, a cyberattack. The outage exposed weaknesses in companies' disaster recovery strategies, such as inadequate testing, complexity, underestimated costs, and unrealistic recovery objectives:

Lack of Frequent Testing of Recovery Plans: A recovery plan is only as good as its last test. Unfortunately, many businesses do not perform regular, thorough tests of their disaster recovery plans. In the CrowdStrike incident, this lack of testing could be to blame for gaps in recovery processes that went unnoticed and ultimately prolonged the outage.
Complexity and Fragmentation: Using multiple, different systems for data protection and recovery can lead to inefficiencies and increased risk. The outage indicated that organizations with complex, fragmented recovery solutions had more difficulty restoring operations quickly. To avoid fragmentation, organizations should leverage comprehensive, unified solutions that streamline the disaster recovery process instead of complicating it.
Underestimating Downtime Costs: The true cost of downtime goes beyond immediate financial losses to include long-term reputational damage and customer trust erosion. Many organizations underestimated these impacts, as the outage made painfully clear.
Impact of established RTO and RPO: What happens when your agreed recovery time objective (RTO) and recovery point objective (RPO) don't work for the business? Businesses often find themselves in a catch-22. They have invested in business continuity/disaster recovery (BCDR) plans, only to find that the plan fails in real life: when, for example, 24 hours' worth of data loss and two to three weeks of recovery time become a reality, executing the plan can be worse than recovering manually from the outage, if that is possible at all.
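
To make the RTO/RPO point concrete, here is a minimal sketch that compares what a recovery plan can actually deliver against what the business can tolerate. The class and function names are hypothetical, and so are the business targets (one hour of data loss, four hours of downtime); only the 24 hours of data loss and roughly two weeks of recovery time echo the example above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable time to restore service


def objective_gaps(required, achievable):
    """List the ways the achievable recovery falls short of the requirement."""
    gaps = []
    if achievable.rpo > required.rpo:
        gaps.append(f"RPO exceeded by {achievable.rpo - required.rpo}")
    if achievable.rto > required.rto:
        gaps.append(f"RTO exceeded by {achievable.rto - required.rto}")
    return gaps


# Hypothetical business requirement: lose at most 1 hour of data, be back in 4 hours.
business_need = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))

# What the plan delivers in practice, using the figures from the example above.
plan_in_practice = RecoveryObjectives(rpo=timedelta(hours=24), rto=timedelta(weeks=2))

gaps = objective_gaps(business_need, plan_in_practice)
for gap in gaps:
    print(gap)
```

Running a comparison like this during planning, rather than during an outage, is one way to spot objectives that look acceptable on paper but not to the business.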

Key Takeaways: What Can We Learn?

Perhaps one of the biggest takeaways from this particular incident is that no testing system can guarantee 100% effectiveness. There will always be the risk of unexpected problems that arise from a software update or configuration change that can cause disruption.

The problem we face is how to return to the steady state that existed before the disruption. The solution varies with the cause: it could be as simple as toggling a setting or as complex as restoring a system through drastic disaster recovery efforts. The latter constitutes a disaster on some level, potentially affecting hundreds or thousands of users or customers who cannot access vital systems.

The CrowdStrike outage impacted millions of Windows devices, and many had to be fixed individually, a tedious and lengthy recovery process. In a different scenario, a single server disruption could also affect thousands of users and customers, but only that specific server would need to be fixed to restore access for everyone.
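
The contrast between per-device repair and a single server fix can be put in rough numbers. The sketch below is a back-of-the-envelope estimate only: the function name and every figure in it (5,000 workstations in one organization, 15 minutes per device, 20 technicians, one hour for a server) are illustrative assumptions, not data from the incident.

```python
import math


def elapsed_recovery_hours(units, minutes_per_unit, parallel_workers):
    """Estimate wall-clock hours when each worker repairs one unit at a time."""
    waves = math.ceil(units / parallel_workers)  # rounds of simultaneous repairs
    return waves * minutes_per_unit / 60


# Hypothetical: 5,000 workstations, 15 minutes each, 20 technicians working in parallel.
workstation_hours = elapsed_recovery_hours(5000, 15, 20)

# Hypothetical: one server, one hour to fix, one engineer.
server_hours = elapsed_recovery_hours(1, 60, 1)

print(f"Workstations: {workstation_hours} h, server: {server_hours} h")
```

Under these assumptions both scenarios affect thousands of users, but the elapsed recovery time differs by well over an order of magnitude.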

However, because every incident is different, businesses should be ready to roll back anything from a configuration setting change to one or more production workloads from recovery data. Some rare disruptions from unforeseen updates could be so severe that they require a full disaster declaration and failover to a DR site.

Unfortunately, rolling back a workload to a previous point in time can be difficult, especially if that rollback loses hours or days' worth of data. Fortunately, with continuous data protection technologies, recovery points can be available within seconds of when an update is made. When using traditional backup and snapshot technologies, it is recommended that you take a backup or snapshot before making system updates to ensure a more recent recovery point is available.
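
A small sketch shows why the spacing of recovery points matters when rolling back an update. It picks the most recent recovery point taken before the update and measures how much data the rollback would discard. The timestamps, the nightly schedule, and the five-second checkpoint interval are hypothetical values chosen for illustration, not a description of any particular product.

```python
from datetime import datetime, timedelta


def latest_point_before(recovery_points, update_time):
    """Return the newest recovery point taken strictly before the update, or None."""
    candidates = [p for p in recovery_points if p < update_time]
    return max(candidates) if candidates else None


def data_loss_window(recovery_points, update_time):
    """Return how much data a rollback to the best available point would lose."""
    point = latest_point_before(recovery_points, update_time)
    return None if point is None else update_time - point


# Hypothetical update applied at 04:00:02.
update_time = datetime(2024, 1, 1, 4, 0, 2)

# Traditional approach: one backup per night at midnight.
nightly_backups = [datetime(2024, 1, 1) - timedelta(days=d) for d in range(3)]

# Continuous approach: a checkpoint every 5 seconds for the last minute.
journal_checkpoints = [
    datetime(2024, 1, 1, 4, 0, 0) - timedelta(seconds=5 * i) for i in range(12)
]

nightly_loss = data_loss_window(nightly_backups, update_time)
journal_loss = data_loss_window(journal_checkpoints, update_time)
print(f"Nightly backup rollback loses {nightly_loss}; journal rollback loses {journal_loss}")
```

Taking a manual snapshot immediately before an update has the same effect as adding one more entry to the first list, just ahead of `update_time`.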

How to Avoid Similar Situations Moving Forward

Instead of trying to second-guess what happened and cast blame on top of an already tough situation for CrowdStrike and its customers, we can use this moment as a reminder of the collective responsibility we share to prepare for inevitable events such as this. Of course, in a perfect world, these mistakes would never have happened. We know, though, that IT is far from perfect, and we all have a responsibility not just to try to prevent these occurrences, but also to mitigate the damage in the aftermath.

Sometimes, updates that work well in one IT environment may cause problems in another due to different settings and software combinations that the vendor may not be aware of. This is why testing is essential for both the vendor and the customer to ensure compatibility and functionality.

Organizations must use management and recovery tools that allow them to test updates quickly and easily and to undo them if they cause any issues. Implement recovery solutions that let you test without disrupting your operations and revert to the point before the disruption. Make testing a priority for all software updates before applying them to production. We cannot avoid all outages, but we can prevent most and reduce the impact of those that happen.
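
The test-then-revert pattern described above can be sketched in a few lines. This is a simplified illustration, not any vendor's tooling: the configuration dictionary, the `sensor_version` and `boot_ok` keys, and all function names are hypothetical. The update is tried on a copy, and the checkpoint taken beforehand is returned if the health check fails.

```python
import copy


def apply_update_with_rollback(config, update, is_healthy):
    """Apply an update to a copy; keep it only if the health check passes."""
    checkpoint = copy.deepcopy(config)         # recovery point taken before the update
    candidate = update(copy.deepcopy(config))  # production config is never mutated
    if is_healthy(candidate):
        return candidate, True
    return checkpoint, False                   # revert to the pre-update state


def good_update(cfg):
    cfg["sensor_version"] = "2.0"
    return cfg


def bad_update(cfg):
    cfg["sensor_version"] = "2.0"
    cfg["boot_ok"] = False  # simulates an update that breaks the system
    return cfg


def is_healthy(cfg):
    return cfg.get("boot_ok", False)


production = {"sensor_version": "1.9", "boot_ok": True}

after_bad, applied_bad = apply_update_with_rollback(production, bad_update, is_healthy)
after_good, applied_good = apply_update_with_rollback(production, good_update, is_healthy)
print(applied_bad, after_bad)
print(applied_good, after_good)
```

The design choice worth noting is that the checkpoint is captured before the update runs, so reverting never depends on the broken state being readable.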

ABOUT THE AUTHOR

Kevin Cole is the Global Director, Technical Product Marketing, at Zerto, a Hewlett Packard Enterprise company. He leads teams focused on creating and sharing Zerto’s story and unique differentiators to the market through a variety of channels. With Zerto since 2015, Kevin’s recent work has focused on Zerto for Kubernetes, Zerto for ransomware resilience, and the positive outcomes that customers are seeing with the joint power of Zerto and HPE GreenLake together.

Published Wednesday, December 04, 2024 7:34 AM by David Marshall