blog.chay.dev

CrowdStrike Incident TLDR

Assumed Audience: anyone interested in learning more about the CrowdStrike incident.

CrowdStrike is a cybersecurity company

The founder and CEO is George Kurtz. He has already apologised.

Curiously, a similar incident happened 14 years ago, in 2010, when he was CTO of McAfee. A defective update falsely labelled svchost.exe (a core component of Windows) as malware (the Wecorl.A virus) and quarantined it, causing a worldwide shutdown of Windows XP PCs.

Relevant 2010 sources:

The incident is caused by their flagship product Falcon

CrowdStrike Falcon is a "unified security and IT solution that stops breaches, reduces complexity, and lowers costs". Basically, it's antivirus software. It runs on lots of personal machines and enterprise servers, and it seems to be really popular in Australia and New Zealand.

So what exactly happened? CrowdStrike posted a technical report here, but basically:

This Reddit thread was created during the chaos, and it contains lots of quality anecdotes, analysis, and memes.

The recovery procedure sounds simple: boot Windows into safe mode and delete a particular file. However, this is problematic because:

Many sysadmins I know spent the whole of Friday manually restoring machines one by one.
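To show how small the "delete a particular file" step really is, here is a minimal sketch. The directory and file pattern are assumptions taken from CrowdStrike's public workaround guidance, not from anything in this post, and the function name is made up for illustration. In practice this was typically done by hand from a safe mode command prompt, so treat it as a picture of the step rather than a recovery tool.

```python
from pathlib import Path

# Assumptions (from CrowdStrike's public workaround guidance, not this post):
# the faulty channel file lives in this directory and matches this pattern.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_channel_files(directory: Path = DRIVER_DIR, dry_run: bool = True) -> list[Path]:
    """Find channel files matching PATTERN and delete them unless dry_run is set.

    Returns the list of matching paths so the caller can log what was
    (or would have been) removed.
    """
    matches = sorted(directory.glob(PATTERN))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Dry run by default: only report what would be deleted.
    for path in remove_channel_files():
        print(f"would delete: {path}")
```

The deletion itself is trivial. The pain is that someone has to get each machine into a state where this step can run at all.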

The crash happened because of bad logic

Patrick Wardle - a renowned macOS (the irony) security researcher and founder of Objective-See - posted some interesting analysis speculating why CSAgent.sys crashed here.

Fun fact: if "Wardle" sounds familiar, that's because his cousin Josh Wardle created Wordle.

Anyway, the TL;DR is:

Many users who tried to read the culprit channel file saw that it was full of zeroes. Whether that is the cause or the result of the crash is still unclear. In their technical report, CrowdStrike clarified that "this is not related to null bytes contained within Channel File 291 or any other Channel File".

Why doesn't CrowdStrike use a staged rollout?

Staged rollout is a technique for releasing new software or updates to users. For example, a rollout schedule may go like this:

This is usually accompanied by some mechanism to receive feedback or usage statistics (sometimes called "telemetry"), so companies know whether the update introduces any critical bugs or regressions, and whether it achieves its objectives.
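A staged rollout with a telemetry gate can be sketched roughly like this. Everything here is hypothetical: the stage percentages, the crash-rate threshold, and the function names are made up for illustration and say nothing about how CrowdStrike (or anyone else) actually ships updates.

```python
import hashlib
from typing import Optional

# Hypothetical schedule: cumulative percentage of the fleet at each stage.
STAGES = [1, 10, 50, 100]
# Hypothetical telemetry threshold: halt if more than 1% of updated devices crash.
MAX_CRASH_RATE = 0.01


def bucket(device_id: str) -> int:
    """Map a device ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id: str, percent: int) -> bool:
    """A device receives the update once its bucket falls under the stage percentage."""
    return bucket(device_id) < percent


def next_stage(stage_index: int, crash_rate: float) -> Optional[int]:
    """Advance one stage if telemetry looks healthy; None means halt the rollout."""
    if crash_rate > MAX_CRASH_RATE:
        return None
    return min(stage_index + 1, len(STAGES) - 1)


if __name__ == "__main__":
    fleet = [f"device-{n}" for n in range(1000)]
    for percent in STAGES:
        eligible = sum(in_rollout(d, percent) for d in fleet)
        print(f"{percent:>3}% stage: {eligible} of {len(fleet)} devices")
```

Hashing the device ID keeps each device in the same bucket across stages, so a device that received the update at 10% is still included at 50%. The gate in `next_stage` is where telemetry comes in: a bad update should stop at the first stage instead of reaching everyone.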

However, security updates do not share the same goals as typical software updates. Software updates favour stability; security updates prioritise velocity, because every hour of delay is an hour in which customers are exposed to a new threat.

What about client side controls?

It seems to be possible for Falcon users to set up policies for how updates are received. Here's some code from their Python SDK that hints at a "sensor update policy".

However, this particular update bypassed such policies. Oof.
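To make the bypass concrete, here is a purely hypothetical sketch of how a client-side policy could gate sensor (agent) version updates while content updates such as channel files skip the gate entirely. None of these names come from CrowdStrike's SDK; the "N-1" style setting and the sensor/content split are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    kind: str     # "sensor" (agent version) or "content" (e.g. a channel file)
    version: int


@dataclass(frozen=True)
class Policy:
    versions_behind: int  # 0 = always latest, 1 = "N-1", 2 = "N-2", ...


def should_install(update: Update, policy: Policy, latest_sensor_version: int) -> bool:
    """Decide whether a host installs an update under the given policy."""
    if update.kind == "content":
        # In this sketch, content updates are never held back by the policy.
        return True
    # Sensor updates are held back by the configured number of versions.
    return update.version <= latest_sensor_version - policy.versions_behind


if __name__ == "__main__":
    cautious = Policy(versions_behind=1)
    print(should_install(Update("sensor", 10), cautious, latest_sensor_version=10))
    print(should_install(Update("content", 291), cautious, latest_sensor_version=10))
```

Under a model like this, an admin who deliberately stays one sensor version behind would still receive every content update the moment it ships.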

I'm not sure what to do with my pitchfork for this one. There may be legitimate reasons for doing this - such as for deploying time-sensitive, critical updates - but this feels bad.

Is this the end of CrowdStrike?

Hours before the incident, Reddit user u/King_Kunta_ made a post on r/wallstreetbets claiming that CrowdStrike is overvalued. He was mocked for his terrible analysis, but half a day later he was hailed as the Lisan Al Ghaib.

CRWD closed at 343 on Thursday. On Friday it opened at 294 and immediately dropped to 290, then hovered around the 300 mark for the rest of the day before closing at 304. That's an 11.1% drop from the previous close.

[Chart: CRWD share price] I'm using Heikin-Ashi candlesticks to show the dip more clearly.
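For anyone unfamiliar with Heikin-Ashi, the candles are averages of the regular open/high/low/close values, so they smooth the price action instead of showing exact prices. Here is a minimal sketch of the standard calculation. The sample numbers are made up for illustration and are not CRWD prices.

```python
from typing import NamedTuple


class Candle(NamedTuple):
    open: float
    high: float
    low: float
    close: float


def heikin_ashi(candles: list[Candle]) -> list[Candle]:
    """Convert regular OHLC candles into Heikin-Ashi candles."""
    result: list[Candle] = []
    for i, c in enumerate(candles):
        # HA close is the average of the period's four prices.
        ha_close = (c.open + c.high + c.low + c.close) / 4
        if i == 0:
            # Common seeding convention: midpoint of the first open and close.
            ha_open = (c.open + c.close) / 2
        else:
            # HA open is the midpoint of the previous HA candle's body.
            prev = result[-1]
            ha_open = (prev.open + prev.close) / 2
        ha_high = max(c.high, ha_open, ha_close)
        ha_low = min(c.low, ha_open, ha_close)
        result.append(Candle(ha_open, ha_high, ha_low, ha_close))
    return result


if __name__ == "__main__":
    # Made-up prices, purely illustrative.
    sample = [Candle(10, 12, 9, 11), Candle(11, 13, 10, 12)]
    for candle in heikin_ashi(sample):
        print(candle)
```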

Looking beyond the week, CRWD was actually at this level just a few months ago, before their June earnings call. Maybe this isn't a big deal after all.

Bonus: Lots of new domains popped up

Memes

Educational

Source. Some things in this thread are no longer accessible.