On July 19, 2024, CrowdStrike released a content update that unintentionally caused major disruptions, halting key infrastructure and corporate systems - resulting in widespread chaos and inconvenience.
The financial impact is projected to reach billions, with businesses losing revenue and facing operational setbacks. The true cost, however, goes beyond the immediate financial losses: recovery will be lengthy and complex, with technical, logistical, and reputational dimensions. Let's dive in and take a look at what happened.
CrowdStrike is a cybersecurity company whose software protects computers from malware. The impacted product was its Falcon Sensor, which embeds itself into the operating system so that it can observe activity across the machine, block threats, and raise alerts.
Based on what we know, CrowdStrike sent an update to a channel file - these updates are sent routinely to help the Falcon Sensor detect new threats. Unfortunately, this update caused Windows systems running the sensor to crash, restart, and crash again - repeatedly!
Recovery required a person to intervene: reboot the system into safe mode, remove the updated channel file, and restart in normal mode for the system to come back up. In other words, it required time and hands-on effort that most organizations were not prepared to spend on an event like this, which resulted in widespread outages.
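As a rough illustration of that manual fix, here is a minimal C++ sketch of the file-removal step, assuming the widely reported path and file name pattern for the affected channel file. It is illustrative only - in practice the cleanup was done by hand from safe mode or the Windows Recovery Environment, not by a program running on the crashed host.

```cpp
// Illustrative only: the real remediation was performed manually from
// safe mode / recovery, not by software running on the crashed host.
#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main() {
    // Path and file name pattern as widely reported for this incident.
    const fs::path driverDir = "C:/Windows/System32/drivers/CrowdStrike";
    const std::string prefix = "C-00000291";

    if (!fs::exists(driverDir)) return 0;

    for (const auto& entry : fs::directory_iterator(driverDir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind(prefix, 0) == 0 && entry.path().extension() == ".sys") {
            std::cout << "Removing " << entry.path() << "\n";
            fs::remove(entry.path());  // delete the offending channel file
        }
    }
    return 0;
}
```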
As of Sunday, July 21st, some systems are still impacted.
Although CrowdStrike has issued a statement regarding what happened (https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/), the technical details are still vague - the write-up doesn't really say what triggered the bug. There are reports that the file resulted in a NULL pointer dereference (the Falcon Sensor is written in C++, which is not a memory-safe language like Java or Rust).
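To make that reported failure mode concrete, here is a deliberately simplified, hypothetical C++ sketch of that class of bug: a parser that returns a null pointer for a malformed record, and a caller that dereferences it without checking. None of the names reflect CrowdStrike's actual code; the point is only to show why bad data can crash a C++ component that runs inside the kernel, where a null dereference means a blue screen rather than a contained error.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a record inside a channel file.
struct DetectionRecord {
    uint32_t signatureId;
    uint32_t flags;
};

// Returns a pointer into the buffer, or nullptr if the record is malformed.
const DetectionRecord* parseRecord(const std::vector<uint8_t>& buf, size_t offset) {
    if (offset + sizeof(DetectionRecord) > buf.size()) {
        return nullptr;  // truncated or corrupt data
    }
    return reinterpret_cast<const DetectionRecord*>(buf.data() + offset);
}

uint32_t unsafeLookup(const std::vector<uint8_t>& buf, size_t offset) {
    const DetectionRecord* rec = parseRecord(buf, offset);
    return rec->signatureId;  // BUG: null dereference if the file is malformed
}

uint32_t safeLookup(const std::vector<uint8_t>& buf, size_t offset) {
    const DetectionRecord* rec = parseRecord(buf, offset);
    if (rec == nullptr) {
        return 0;  // reject the bad record instead of crashing
    }
    return rec->signatureId;
}
```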
With that said, what we can conclude is that the Falcon Sensor did not handle the bad content in the channel file correctly.
Or did it?
Really, this question applies to any security vendor. However, before we talk about preventative options, it is important to note that CrowdStrike updates these channel files regularly. That needs to be taken into consideration when evaluating both the response and the actual underlying issue.
The other point worth stating is that we don't know whether the channel file is parsed as data and that parsing failed, or whether the channel file is interpreted dynamically and the error was in the file's own logic.
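If the file is parsed as data, the defensive pattern is to validate it up front and refuse to load it when anything looks wrong, falling back to the last known-good content. Here is a hypothetical sketch of that kind of gate - the header fields are invented for illustration, since the real channel file format is not public.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Invented header layout, purely for illustration.
struct ChannelFileHeader {
    char     magic[4];      // e.g. "CHNL"
    uint32_t version;
    uint32_t recordCount;
};

enum class LoadResult { Loaded, RejectedKeepPrevious };

// Validate before use: if anything looks wrong, keep running on the
// previously known-good channel file instead of crashing the host.
LoadResult loadChannelFile(const std::vector<uint8_t>& buf) {
    if (buf.size() < sizeof(ChannelFileHeader)) {
        return LoadResult::RejectedKeepPrevious;
    }
    ChannelFileHeader hdr;
    std::memcpy(&hdr, buf.data(), sizeof(hdr));

    if (std::memcmp(hdr.magic, "CHNL", 4) != 0 ||
        hdr.version == 0 ||
        hdr.recordCount > 1'000'000) {   // sanity bound on record count
        return LoadResult::RejectedKeepPrevious;
    }
    // ... parse records, bounds-checking each one ...
    return LoadResult::Loaded;
}
```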
Many in the security field have thrown out various options for what would have prevented this issue.
Given how rarely these types of issues have occurred, it would appear that the channel file update was sent out inadvertently and/or CrowdStrike didn't follow its standard operating procedures. That means CrowdStrike needs to look at its release candidate process, find out where the gap occurred, and address it. It's hard to envision this disaster happening if the update had been fully tested the way CrowdStrike normally tests.
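One common mitigation - offered here only as a generic sketch of staged rollout, not a claim about how CrowdStrike's pipeline works - is to release content to a small canary ring first and widen the blast radius only after the earlier rings stay healthy. The ring sizes and delays below are arbitrary examples.

```cpp
#include <chrono>
#include <functional>
#include <string>

// Deterministically assign a host to one of four rollout rings.
int rolloutRing(const std::string& hostId) {
    return static_cast<int>(std::hash<std::string>{}(hostId) % 4);
}

// Hours after publication at which each ring becomes eligible for the update.
constexpr int kRingDelayHours[4] = {0, 6, 24, 48};

bool eligibleForUpdate(const std::string& hostId,
                       std::chrono::system_clock::time_point publishedAt,
                       std::chrono::system_clock::time_point now) {
    const int ring = rolloutRing(hostId);
    return now - publishedAt >= std::chrono::hours(kRingDelayHours[ring]);
}
```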
Given that stance, another question remains: what is the appropriate release and response process to follow?
It seems like a bad idea for CrowdStrike to decide which systems take an update first without an understanding of the deployment and risk factors that exist for each customer. In other words, is this even something that can be done in a safe and feasible manner?
Maybe it could be done through a "discount" or partner arrangement where certain non-critical businesses are incentivized to take updates early.
This honestly seems reasonable. While the implementation may be tricky, letting the customer decide how much risk they are willing to take on updates is important.
Well, that depends on the vulnerability, the servers being addressed, and more. However, giving the customer that kind of control - the ability to slow-roll a deployment over a day or two and feel confident that primary systems won't crash - might be the ideal.
Or, at the very least, they can bring up failover systems to compensate.
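As a sketch of what that customer-side control might look like - the policy fields and function below are hypothetical, not an existing Falcon capability for content updates - a customer could express how long an update should soak and whether primary systems get an extra buffer.

```cpp
#include <chrono>

// Hypothetical per-host-group policy a customer might configure.
struct ContentUpdatePolicy {
    std::chrono::hours minSoakTime{24};   // wait this long after release
    bool requireCanaryHealthy{true};      // only proceed if canary hosts stayed up
    bool isPrimarySystem{false};          // primary systems update last
};

bool shouldApplyUpdate(const ContentUpdatePolicy& policy,
                       std::chrono::hours ageOfUpdate,
                       bool canaryHealthy) {
    if (policy.requireCanaryHealthy && !canaryHealthy) return false;
    std::chrono::hours required = policy.minSoakTime;
    if (policy.isPrimarySystem) required += std::chrono::hours(24);  // extra buffer
    return ageOfUpdate >= required;
}
```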
Again, this seems reasonable, but it falls into the same pattern as CrowdStrike deciding who gets updates first: the OS simply doesn't have the context to know whether the file should be removed.
It is also possible that removing the file and bringing the server back up would leave many servers unprotected, with the customer unaware because nothing surfaced the change.
Assuming CrowdStrike could do this, the idea has some validity. However, there would almost certainly need to be a high-priority alert that wouldn't get missed. Otherwise, this could end up being a case where machines aren't protected and, in the long run, things end up worse due to breaches.
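Here is a sketch of what that fail-safe could look like in principle: detect a crash loop attributable to the newest content, quarantine that content, and raise an alert loud enough that it can't be ignored. All of the names are hypothetical; this is not how the Falcon Sensor actually behaves.

```cpp
#include <iostream>
#include <string>

// Hypothetical boot-time check a sensor (or the OS) could run.
struct BootState {
    int consecutiveSensorCrashes = 0;   // persisted across reboots
    std::string newestChannelFile;      // most recently delivered content
};

constexpr int kCrashLoopThreshold = 3;

// Returns true if the newest channel file was quarantined this boot.
bool handleCrashLoop(BootState& state) {
    if (state.consecutiveSensorCrashes < kCrashLoopThreshold) {
        return false;  // not a crash loop; load content normally
    }
    // Quarantine the newest content and fall back to the last known-good set.
    std::cout << "Quarantining " << state.newestChannelFile << "\n";

    // CRITICAL alert: the host is now running with reduced protection,
    // so this must be surfaced in a way operators cannot miss.
    std::cout << "ALERT[critical]: sensor content quarantined on this host; "
                 "protection degraded until a fixed update is applied\n";
    state.consecutiveSensorCrashes = 0;
    return true;
}
```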
And such an alert could still have been missed. When enterprises have large numbers of alerts and tools triggering warnings, it is easy for things to get overlooked.
Did this channel update warrant a system crash? It's hard to say when we don't know what threat it was protecting against. Imagine instead a massive breach that caused even more chaos - and we don't even know the exact details of how the channel file works.
Ultimately, not having an alternative outcome to compare against makes it tough to say whether letting the crashes happen was warranted.
We need to stop expecting software to be perfect; the odds of that happening are slim to none. We also need to recognize that our systems will not be perfect, and that evaluating and monitoring the security of any complex system will be imperfect as well.
The result of the update was crushing. Thankfully, there doesn't appear to have been any loss of life, so at least it wasn't tragic. CrowdStrike will need to continue being transparent as it uncovers more of the root cause and discloses what it intends to do to address it.
If, however, it is determined that letting the systems crash is the right thing to do to ensure awareness of a failure, CrowdStrike needs to find a way to give customers much finer-grained control over the update process.