CrowdStrike Incident TLDR
CrowdStrike is a cybersecurity company
The founder and CEO is George Kurtz. He has already apologised.
Curiously, a similar incident happened 14 years ago when he was CTO of McAfee. A defective update falsely labelled svchost.exe (a core component of Windows) as malware (the Wecorl.A virus) and quarantined it, causing a worldwide shutdown of Windows XP PCs.
Relevant 2010 sources:
The incident was caused by their flagship product, Falcon
CrowdStrike Falcon is a "unified security and IT solution that stops breaches, reduces complexity, and lowers costs". Basically, it's antivirus software. Lots of people and enterprise servers use it. It seems to be really popular in Australia and New Zealand.
So what exactly happened? CrowdStrike posted a technical report here, but basically:
- Falcon is set up such that CrowdStrike can send updates to users via "channel files". Think of these files as "virus definition updates".
- They sent a channel file that's meant to protect users against newly observed exploits using named pipes, which are used by Windows for interprocess communication.
- This channel file is... problematic. Either misconfigured or corrupted somehow. They have not explained the root cause yet.
- This problematic file caused their CSAgent.sys to crash Windows.
This reddit thread was created during the chaos, and it contains lots of quality anecdotes, analysis, and memes.
The recovery procedure sounds simple: boot Windows into safe mode and delete a particular file. However, this is problematic because:
- it cannot be easily automated - most automation scripts or tools require the client machines to be in a booted state
- booting in safe mode may require a BitLocker recovery key, which may not be straightforward to retrieve depending on the company security setup
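Once a machine is actually reachable, the deletion step itself is tiny, which is what makes the situation so frustrating. Here is a minimal sketch of it. The directory and the file name pattern are assumptions taken from CrowdStrike's public remediation guidance rather than from this post, and `remove_culprit_files` is a made-up helper name. It defaults to a dry run and does nothing to solve the safe mode or BitLocker hurdles above.

```python
from pathlib import Path

# Assumptions from CrowdStrike's public remediation guidance, not from this
# post: the sensor's driver directory and the faulty channel file pattern.
CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
CULPRIT_PATTERN = "C-00000291*.sys"


def remove_culprit_files(directory: Path = CROWDSTRIKE_DIR,
                         dry_run: bool = True) -> list[Path]:
    """List (and, if dry_run is False, delete) files matching the culprit pattern."""
    if not directory.is_dir():
        return []
    matches = sorted(directory.glob(CULPRIT_PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `remove_culprit_files()` only reports what it would delete. Passing `dry_run=False` performs the deletion.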
Many sysadmins I know spent the whole of Friday manually restoring machines one by one.
The crash happened because of bad logic
Patrick Wardle - a renowned macOS (the irony) security researcher and founder of Objective-See - posted some interesting analysis here speculating why CSAgent.sys crashed.
Fun fact: if "Wardle" sounds familiar, that's because his cousin Josh Wardle created Wordle.
Anyway, the TL;DR is:
- the crash happens when CSAgent.sys tries to read from an invalid memory address - this is likely caused by faulty logic coded into the culprit channel file
- the culprit channel file is unique for each user, so it's not easy to reverse engineer exactly what the logic error is
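To make the first point concrete, here is a toy sketch of how data-driven logic can lead to an invalid read. It is purely illustrative and not CrowdStrike's code: `Rule`, `evaluate_unchecked` and `evaluate_checked` are hypothetical names, and the real root cause was not known at the time of writing.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical content rule: compare one input field against a pattern."""
    field_index: int
    pattern: str


def evaluate_unchecked(rule: Rule, fields: list[str]) -> bool:
    # Trusts the index supplied by the content file. In Python a bad index
    # raises IndexError; in a kernel driver written in C or C++ the same
    # mistake reads an invalid address and takes the whole system down.
    return fields[rule.field_index] == rule.pattern


def evaluate_checked(rule: Rule, fields: list[str]) -> bool:
    # Validates the content-supplied index before using it, and treats a
    # malformed rule as "no match" instead of crashing.
    if not 0 <= rule.field_index < len(fields):
        return False
    return fields[rule.field_index] == rule.pattern
```

The difference is a single bounds check, but in kernel mode there is no exception handler to fall back on, so the unchecked version is fatal.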
Many users who attempted to read the culprit channel file saw that it was full of zeroes. Whether that is the cause or the result of the crash is still unclear. In their technical report, CrowdStrike clarified that "this is not related to null bytes contained within Channel File 291 or any other Channel File".
Why doesn't CrowdStrike use a staged rollout?
Staged rollout is a technique for releasing new software or updates to users. For example, a rollout schedule may go like this:
- Day 1: 1% of users
- Day 2: 2%
- Day 3: 5%
- Day 4: 10%
- Day 5: 20%
- Day 6: 50%
- Day 7: 100%
This is usually accompanied by some mechanism to receive feedback or usage statistics (sometimes called "telemetry") so companies know whether the update introduced any critical bugs or regressions, and whether it achieved its objectives.
However, security updates do not share the same goals as typical software updates. Software updates favour stability; security updates prioritise velocity.
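A minimal sketch of how such a schedule could be enforced, assuming each device has a stable ID string. The hashing scheme, the function names and the crash rate threshold are all illustrative choices, not a description of how any particular vendor does it.

```python
import hashlib

# Cumulative rollout schedule from the list above: day -> percent of users.
ROLLOUT_SCHEDULE = {1: 1, 2: 2, 3: 5, 4: 10, 5: 20, 6: 50, 7: 100}


def bucket_for(device_id: str) -> int:
    """Map a device ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_receive_update(device_id: str, day: int) -> bool:
    """Return True if this device is inside the cohort for the given day."""
    if day < 1:
        return False
    percent = ROLLOUT_SCHEDULE[min(day, max(ROLLOUT_SCHEDULE))]
    return bucket_for(device_id) < percent


def rollout_should_halt(crash_reports: int, devices_updated: int,
                        max_crash_rate: float = 0.001) -> bool:
    """Telemetry gate: stop widening the rollout if too many updated devices crash."""
    if devices_updated == 0:
        return False
    return crash_reports / devices_updated > max_crash_rate
```

Because the bucket is derived from a hash of the device ID, a device that received the update on day 3 is still in the cohort on day 4, and the telemetry gate gives the vendor a chance to stop before reaching 100%.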
What about client side controls?
It seems to be possible for Falcon users to set up policies for how updates are received. Here's some code from their Python SDK that hints at a "sensor update policy".
However, this particular update bypassed such policies. Oof.
I'm not sure what to do with my pitchfork for this one. There may be legitimate reasons for doing this - such as for deploying time-sensitive, critical updates - but this feels bad.
Is this the end of CrowdStrike?
Hours before the incident, Reddit user u/King_Kunta_ made a post on r/wallstreetbets claiming that CrowdStrike is overvalued. He was mocked for his terrible analysis, but half a day later he was hailed as the Lisan al Gaib.
CRWD closed at 343 on Thursday. On Friday it opened at 294, immediately dropped to 290, but the market battled around the 300 mark for the rest of the day, finally closing at 304. That's an 11.1% drop from the previous day.
I'm using Heikin-Ashi candlesticks to show the dip more clearly.
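For anyone unfamiliar with them, Heikin-Ashi candles are computed from regular open/high/low/close candles using averages, so their values are smoothed and do not match actual traded prices. A minimal sketch of the standard formula follows. The `Candle` type and the way the first candle is seeded are my own choices, and charting tools may seed it differently.

```python
from typing import NamedTuple


class Candle(NamedTuple):
    open: float
    high: float
    low: float
    close: float


def heikin_ashi(candles: list[Candle]) -> list[Candle]:
    """Convert regular OHLC candles into Heikin-Ashi candles."""
    result: list[Candle] = []
    for i, c in enumerate(candles):
        # HA close is the average of the real candle's four prices.
        ha_close = (c.open + c.high + c.low + c.close) / 4
        if i == 0:
            # Seeding convention assumed here: midpoint of the first real candle.
            ha_open = (c.open + c.close) / 2
        else:
            # HA open is the midpoint of the previous HA candle's body.
            prev = result[-1]
            ha_open = (prev.open + prev.close) / 2
        ha_high = max(c.high, ha_open, ha_close)
        ha_low = min(c.low, ha_open, ha_close)
        result.append(Candle(ha_open, ha_high, ha_low, ha_close))
    return result
```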
Looking beyond the week, CRWD was actually at this level just a few months ago, before their June earnings call. Maybe this isn't a big deal after all.
Bonus: Lots of new domains popped up
Memes
Educational
Source. Some things in this thread are no longer accessible.