CrowdStrike Outage Thoughts and Considerations
On July 19, 2024, CrowdStrike released a faulty update that caused major disruptions, halting key infrastructure and corporate systems and resulting in widespread chaos and inconvenience.
- Flights were canceled, leaving passengers stranded and airports in disarray.
- Banks faced operational challenges, disrupting services and frustrating customers.
- Medical facilities also suffered, highlighting our critical dependency on cybersecurity and stable systems.
The financial impact is projected to reach billions, with businesses losing revenue and facing operational setbacks. However, the true cost goes beyond immediate financial losses. The recovery process will be lengthy and complex, spanning technical, logistical, and reputational dimensions. Let's dive in and take a look at what happened.
What Is CrowdStrike?
CrowdStrike is a cybersecurity company whose software protects computers from malware. The product impacted was its Falcon Sensor, which embeds itself into the operating system. This placement allows it to monitor all activity on the machine and take action to protect the system and alert administrators.
What Happened During the CrowdStrike Update?
Based on what we know, CrowdStrike sent an update to a Channel file; these updates are sent routinely to help the Falcon Sensor detect new threats. Unfortunately, this update caused Windows systems running the sensor to crash, restart, and crash again, repeatedly!
This required a person to intervene: reboot the system into Safe Mode, remove the updated Channel file, and restart in normal mode for the system to come back up. In other words, recovery demanded time and hands-on resources that most organizations were not prepared to commit to such an event, which resulted in widespread outages.
As of Sunday, July 21st, some systems are still impacted.
Although CrowdStrike has issued a statement regarding what happened (https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/), the technical details are still vague. The problem is that the statement doesn't really say what triggered the bug. There are reports that the file resulted in a NULL pointer issue (the Falcon Sensor is written in C++, which is not a memory-safe language like Java or Rust).
With that said, what we can conclude is that the Falcon Sensor did not handle the misconfigured Channel file correctly.
Or did it?
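To make the suspected failure class concrete, here is a minimal C++ sketch, assuming the Channel file is parsed and the parse result comes back as a pointer. This is not CrowdStrike's code; the record layout, function names, and magic byte are all invented. It only shows how an unchecked parse result turns a malformed data file into a null pointer dereference, which in a kernel-mode component means a crash on every boot that loads the file.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical channel-file record; the real format is proprietary.
struct ChannelRecord {
    uint32_t signature_id;
    uint32_t pattern_length;
    const uint8_t* pattern;
};

// Stand-in parser that returns nullptr when the bytes fail validation.
const ChannelRecord* parse_channel_record(const uint8_t* data, size_t len) {
    static ChannelRecord record;
    if (data == nullptr || len < 2 * sizeof(uint32_t) || data[0] != 0xAA) {
        return nullptr;  // malformed file: no record to return
    }
    std::memcpy(&record.signature_id, data, sizeof(uint32_t));
    std::memcpy(&record.pattern_length, data + sizeof(uint32_t), sizeof(uint32_t));
    record.pattern = data + 2 * sizeof(uint32_t);
    return &record;
}

// Unchecked path: dereferencing a null result is undefined behavior.
// In user space that is one crashed process; in a kernel-mode driver it
// is a blue screen on every boot that loads the file.
void apply_update_unchecked(const uint8_t* data, size_t len) {
    const ChannelRecord* rec = parse_channel_record(data, len);
    std::printf("loaded signature %u\n", (unsigned)rec->signature_id);  // crashes on nullptr
}

// Defensive path: reject the malformed file instead of crashing.
void apply_update_checked(const uint8_t* data, size_t len) {
    const ChannelRecord* rec = parse_channel_record(data, len);
    if (rec == nullptr) {
        std::printf("malformed channel file rejected\n");
        return;
    }
    std::printf("loaded signature %u\n", (unsigned)rec->signature_id);
}

int main() {
    const uint8_t bad_file[16] = {0};                    // malformed input
    apply_update_checked(bad_file, sizeof(bad_file));    // rejected cleanly
    // apply_update_unchecked(bad_file, sizeof(bad_file)); // would crash
    return 0;
}
```

The difference between the two paths is a single null check, which is exactly why the "or did it?" question matters: rejecting the file keeps the machine up, but it may also leave the sensor running without the new protection.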
How Could CrowdStrike Have Prevented the Issue?
Really, this applies to any security vendor. Before we talk about preventative options, though, it is important to note that CrowdStrike updates these files regularly; that needs to be taken into consideration when evaluating both the response and the actual underlying issue.
The other aspect that needs to be stated is that it is unknown whether the Channel file is parsed (and that parsing failed) or whether it is executed dynamically (and the error was in the Channel file itself).
Many in the security field have thrown out various options for what would have prevented this issue.
- Testing: This should go without saying. However, what is not clear is whether the update went out accidentally. If it did, the question is what happened that allowed the update to be released.
- Better error handling: There are two aspects to consider. If the Channel file is parsed, then the parser needs to catch the bad input and handle it more cleanly (see the sketch after this list). If, however, the Channel file is executed, then handling falls to the operating system, and at that point there are only two options: halt the system, or try to remove the offending executable and revert to a known good state.
- Canary Rollout by CrowdStrike: Some have suggested that CrowdStrike could have rolled the update out gradually and tested it that way: select a set of customers, release to them first, and confirm the update works before going wider.
- Canary Rollouts by Customers: One thing CrowdStrike seems to lack is the ability for customers to configure which systems get updates and when. Give the customer that freedom.
- Systems Crash: This is an option. There are valid reasons for letting a system crash and requiring manual intervention; the notion that we have to fail cleanly all the time doesn't hold in certain extreme cases.
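As a companion to the error-handling option above, here is a hedged sketch of the "revert to a known good state" idea, again assuming the Channel file is parsed rather than executed. The types, magic byte, and version strings are invented for illustration and are not CrowdStrike's actual format.

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

// Hypothetical in-memory form of a parsed channel file.
struct ChannelConfig {
    std::string version;
    std::vector<std::string> signatures;
};

// Stand-in parser: returns std::nullopt when the bytes fail validation.
std::optional<ChannelConfig> parse_channel_file(const std::vector<uint8_t>& bytes) {
    if (bytes.size() < 4 || bytes[0] != 0xAA) {  // made-up magic byte
        return std::nullopt;
    }
    return ChannelConfig{"291-new", {"sig-a", "sig-b"}};
}

// Load the new file if it parses; otherwise keep running on the last
// configuration known to work and surface a loud alert instead of crashing.
ChannelConfig load_with_fallback(const std::vector<uint8_t>& new_bytes,
                                 const ChannelConfig& last_known_good) {
    if (auto parsed = parse_channel_file(new_bytes)) {
        return *parsed;
    }
    std::fprintf(stderr,
                 "ALERT: channel update rejected, still on version %s\n",
                 last_known_good.version.c_str());
    return last_known_good;
}

int main() {
    ChannelConfig good{"290", {"sig-a"}};
    std::vector<uint8_t> corrupt_update(16, 0x00);  // fails validation
    ChannelConfig active = load_with_fallback(corrupt_update, good);
    std::printf("active channel version: %s\n", active.version.c_str());
    return 0;
}
```

The trade-off is the same one discussed later in this post: the host stays up, but it keeps running on older protections, so the alert has to be loud enough that someone actually acts on it.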
What Is the Expectation for a Company Like CrowdStrike Under These Circumstances: The Challenges Presented
Given how rarely these types of issues have occurred, it would appear that the Channel file update was sent out inadvertently and/or CrowdStrike didn't follow its standard operating procedures. This means CrowdStrike needs to examine its release processes, find where the gap occurred, and address it. It's hard to envision a world where this disaster happens and CrowdStrike tested the update the way it normally does.
Given that stance, another question remains: what is the appropriate release and response process to follow?
Should CrowdStrike select which companies get the updates first?
It seems like a bad idea for CrowdStrike to decide which systems get updates first without understanding each customer's deployment and risk factors. In other words, is this even something that can be done in a safe and feasible manner?
Maybe it could be done through a "discount" or partner arrangement where certain non-critical businesses are incentivized to take updates early.
Should customers be able to have fine-grained control over their own releases?
This honestly seems reasonable. While the implementation may be tricky, letting the customer decide how much risk they are willing to take on updates is important.
- Would an airline want to take an update immediately when it could result in outages that dramatically affect its bottom line as well as its reputation?
Well, that depends on the vulnerability, the servers being addressed, and more. However, giving the customer the control to slow-roll a deployment over a day or two, with enough confidence that their primary systems won't crash, might be the ideal; a sketch of what such a policy could look like follows below.
Or, at a minimum, they could stand up failover systems to compensate.
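To make "fine-grained control" less abstract, here is a small C++ sketch of what a customer-defined update policy could look like. The policy fields, tiers, and soak window are hypothetical and are not existing CrowdStrike configuration options.

```cpp
#include <cstdio>

// Hypothetical per-customer update policy. None of these fields are real
// CrowdStrike settings; they only illustrate what fine-grained customer
// control over content updates could look like.
struct UpdatePolicy {
    int soak_hours;          // how long canary hosts must run the update first
    bool hold_critical_tier; // keep the most critical hosts back for approval
};

// Decide whether a given host should receive a new channel file yet.
bool should_update(const UpdatePolicy& policy, bool is_canary,
                   bool is_critical_tier, int hours_since_release) {
    if (is_canary) {
        return true;   // canary hosts absorb the risk first
    }
    if (hours_since_release < policy.soak_hours) {
        return false;  // still inside the soak window
    }
    if (is_critical_tier && policy.hold_critical_tier) {
        return false;  // critical systems wait for explicit approval
    }
    return true;
}

int main() {
    UpdatePolicy airline_policy{/*soak_hours=*/24, /*hold_critical_tier=*/true};

    std::printf("canary lab host, 2h after release: %s\n",
                should_update(airline_policy, true, false, 2) ? "update" : "wait");
    std::printf("gate-agent kiosk, 2h after release: %s\n",
                should_update(airline_policy, false, false, 2) ? "update" : "wait");
    std::printf("crew-scheduling server, 30h after release: %s\n",
                should_update(airline_policy, false, true, 30) ? "update" : "wait");
    return 0;
}
```

Under a policy like this, an airline could enroll a handful of lab machines as canaries, hold its most critical servers back for explicit approval, and let everything else update after the soak window.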
Should the OS remove the broken code and restart?
Again, this seems reasonable, but it falls into the same pattern as CrowdStrike deciding who gets updates first: the OS simply doesn't have the context to know whether the code should be removed.
It is also possible that removing the code and bringing the server back up could leave many servers unprotected, with the customer unaware because there is no clear alerting or support for that scenario.
Should CrowdStrike remove the offending Channel file?
Assuming that CrowdStrike could do this, the idea has some validity. However, there would almost certainly need to be a high-priority alert that couldn't get missed. Otherwise, this could end up being a case where machines aren't protected and, in the long run, things end up worse due to breaches.
Was letting the crashes happen the right move?
It could have been. When enterprises have large numbers of alerts and tools triggering warnings, it is easy for things to get overlooked.
Did this Channel update warrant a system crash? It's hard to say when we don't know what threat it was meant to protect against. Imagine a massive breach that caused even more chaos; plus, we don't even know the exact details of how the Channel file works.
Ultimately, not having an alternative to compare it to makes it tough to say whether letting the crashes happen was warranted.
CrowdStrike Outage - Final Thoughts
We need to stop expecting software to be perfect; the odds of that happening are slim to none. We also need to recognize that systems will fail, and that evaluating and monitoring the security of any complex system will always be imperfect.
The result of the update was crushing. It doesn't seem like there was any loss of life, so at least it wasn't tragic. CrowdStrike will need to continue being transparent as it uncovers the root cause and disclose what it intends to do to address it.
If, however, it is determined that letting the systems crash is the right way to ensure a failure gets noticed, CrowdStrike needs to give customers much finer-grained control over the update process.