By Kevin Cole, director of product and technical marketing, HPE
Most people are aware of the CrowdStrike incident from earlier
this year, which caused a global outage that affected organizations and
individuals around the world. From daily operations to long-term strategic
goals, businesses were left paralyzed, unable to operate without critical
technology.
The extent of the widespread disruption revealed a critical truth:
most organizations are not ready for such large-scale incidents. This
highlights the importance of having reliable and robust disaster recovery and business continuity solutions. As we
reflect on this incident, it's important to analyze key takeaways from the
CrowdStrike/Microsoft incident to understand how leveraging the appropriate
disaster recovery technology can help resolve similar incidents.
Reflecting & Recovering from the CrowdStrike Incident
This massive outage, which resulted in devices crashing with a blue
screen of death (BSOD), serves as a stark reminder that most
"disaster" events are not caused by natural disasters but, rather, by
human error or, in most cases, a cyberattack. The harsh reality of this outage
exposed weaknesses in companies' disaster recovery strategies, such as
inadequate testing, complexity, underestimation of costs, and unrealistic
recovery objectives:
- Lack of Frequent Testing of Recovery Plans: A recovery plan
is only as good as its last test. Unfortunately, many businesses do not perform
regular, thorough tests of their disaster recovery plans. Reflecting on the
recent CrowdStrike incident, this lack of testing could be to blame for the
gaps in recovery processes that went unnoticed and eventually led to the
outage.
- Complexity and Fragmentation: Using multiple, different systems for data
protection and recovery can lead to inefficiencies and increased risk. The
outage indicated that organizations with complex, fragmented recovery solutions
had more difficulty restoring operations quickly. To avoid fragmentation,
organizations should leverage comprehensive, unified solutions that streamline
the disaster recovery process instead of complicating it.
- Underestimating Downtime Costs: The true cost of downtime goes beyond immediate
financial losses to include long-term reputational damage and customer trust
erosion. Many organizations underestimated these impacts, as the outage made
painfully clear.
- Impact of Established RTO and RPO: What happens when your authorized
recovery point objective (RPO) and recovery time objective (RTO) don't work
for the business? Often businesses find themselves in a catch-22 situation.
They have invested in business continuity/disaster recovery (BCDR) plans but
then find that the plan is not successful in real life when, for example,
24 hours' worth of data loss and 2-3 weeks of recovery time become a reality
that is worse than manually recovering from the outage, if that is possible
at all.
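To make the RPO and RTO point concrete, the following is a minimal sketch of how a team might compare its agreed objectives against what its protection setup can actually deliver. The class name, function name, and all figures are illustrative assumptions, not values or tooling described in this article. It also assumes that, with periodic backups, the worst-case data loss is roughly one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business (hypothetical values for illustration)."""
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable time to restore service


def unmet_objectives(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_recovery_time: timedelta,
) -> list:
    """Return a description of each objective the current setup cannot meet.

    Assumption: a failure just before the next backup loses close to one
    full backup interval, so the interval is used as worst-case data loss.
    """
    problems = []
    if backup_interval > objectives.rpo:
        problems.append(
            f"RPO missed: up to {backup_interval} of data could be lost, "
            f"target is {objectives.rpo}"
        )
    if measured_recovery_time > objectives.rto:
        problems.append(
            f"RTO missed: last tested recovery took {measured_recovery_time}, "
            f"target is {objectives.rto}"
        )
    return problems


if __name__ == "__main__":
    targets = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
    # Nightly backups and a recovery that took two weeks in the last test.
    for problem in unmet_objectives(targets, timedelta(hours=24), timedelta(weeks=2)):
        print(problem)
```

The measured recovery time should come from an actual recovery test rather than an estimate, which ties this check back to the first point in the list above.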
Key Takeaways: What Can We Learn?
Perhaps one of the biggest takeaways from this particular incident
is that no testing system can guarantee 100% effectiveness. There will always
be the risk of unexpected problems that arise from a software update or
configuration change that can cause disruption.
The problem we are faced with is how to return to the stable
state that existed before the disruption. The solution may vary depending on
the cause of the disruption: it could be as simple as toggling a setting, or
as complex as restoring a system through drastic disaster recovery efforts.
The latter would constitute a disaster on some level, potentially affecting
hundreds or thousands of users or customers who cannot access vital systems.
The CrowdStrike outage impacted thousands of workstations, and
each one had to be fixed individually, which involved a very tedious and
lengthy recovery process. In a different scenario, a single server disruption
could also affect thousands of users and customers, but it would only require
that the specific server be fixed to restore universal user access.
Because every incident is different, businesses should be
ready to roll back anything from a configuration setting change to one or more
production workloads from recovery data. Some rare disruptions from unforeseen
updates could be so severe that they require a full disaster declaration and
a failover to a disaster recovery (DR) site.
Unfortunately, rolling back a workload to a previous point in time
can be difficult, especially if that rollback loses hours or days' worth of
data. Fortunately, with continuous data protection
technologies, recovery points can be available within seconds of when an
update is made. When using traditional backup and snapshot technologies, it is
recommended that you take a backup or snapshot before making system updates to
ensure a more recent recovery point is available.
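The advice to take a snapshot before an update can be sketched as follows. This is a simplified illustration that treats a directory copy as the "snapshot"; real backup, snapshot, or continuous data protection products work very differently, and the function name `update_with_snapshot` and its callbacks are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def update_with_snapshot(
    target: Path,
    apply_update: Callable[[Path], None],
    is_healthy: Callable[[Path], bool],
) -> bool:
    """Snapshot `target`, apply the update, and roll back if it fails.

    Returns True if the update was kept, False if the snapshot was restored.
    """
    with tempfile.TemporaryDirectory() as scratch:
        snapshot = Path(scratch) / "snapshot"
        # Recovery point taken immediately before the change.
        shutil.copytree(target, snapshot)
        try:
            apply_update(target)
            ok = is_healthy(target)
        except Exception:
            # An update that crashes is treated like a failed health check.
            ok = False
        if not ok:
            # Discard the updated state and return to the pre-update copy.
            shutil.rmtree(target)
            shutil.copytree(snapshot, target)
        return ok
```

Because the recovery point is created right before the change, a rollback loses only the update itself rather than hours or days of data.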
How to Avoid Similar Situations Moving Forward
Instead of trying to second-guess what happened and cast blame on
top of an already tough situation for CrowdStrike and its customers, we can use
this moment as a reminder of the collective responsibility we share to prepare
for inevitable events such as this. Of course, in a perfect world, these
mistakes would never have happened. We know, though, that IT is far from
perfect, and we all have a responsibility not just to try to prevent these
occurrences, but also to mitigate the damage in the aftermath.
Sometimes, updates that work well in one IT environment may cause
problems in another due to different settings and software combinations that
the vendor may not be aware of. This is why testing is essential for both the
vendor and the customer to ensure compatibility and functionality.
Organizations must use management and recovery tools that allow
them to test updates quickly and easily and to undo them if they cause any
issues. Implement recovery solutions that let you test without disrupting your
operations and revert to the point before the disruption. Make testing a
priority for all software updates before applying them to production. We cannot
avoid all outages, but we can prevent most and reduce the impact of those that
happen.
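As a rough illustration of testing an update without disrupting operations, the sketch below applies a change to an isolated copy of a configuration and promotes it only if every check passes. It is a toy model under stated assumptions: the configuration is a plain dictionary, and `promote_if_tested` and the checks are hypothetical names, not part of any product mentioned here.

```python
import copy
from typing import Callable, Dict, List

Config = Dict[str, object]


def promote_if_tested(
    production: Config,
    update: Callable[[Config], None],
    checks: List[Callable[[Config], bool]],
) -> Config:
    """Try `update` on an isolated copy and return it only if all checks pass.

    The production configuration is never modified. If the update raises or
    any check fails, the unchanged production configuration is returned.
    """
    candidate = copy.deepcopy(production)  # isolated test copy
    try:
        update(candidate)
        passed = all(check(candidate) for check in checks)
    except Exception:
        passed = False
    return candidate if passed else production
```

The same pattern applies at larger scale: the update runs against a copy of the workload, and production changes only after the copy has been validated.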
##
ABOUT THE AUTHOR
Kevin Cole is the Global Director, Technical Product Marketing, at Zerto, a Hewlett Packard Enterprise company. He leads teams focused on creating and sharing Zerto’s story and unique differentiators to the market through a variety of channels. With Zerto since 2015, Kevin’s recent work has focused on Zerto for Kubernetes, Zerto for ransomware resilience, and the positive outcomes that customers are seeing with the joint power of Zerto and HPE GreenLake together.