How OIT Resolved a Critical Incident with Crowdstrike and Enhanced Security Protocols
- July 24th, 2024
- in Outages, Software News
On July 19, 2024, at 04:09 UTC (11:09 p.m. Central time on July 18), The University of Alabama’s Office of Information Technology (OIT) faced a significant challenge when a defective Crowdstrike update disrupted systems across campus. Crowdstrike, an industry-leading platform that UA uses to protect servers from cyber threats, had released a Rapid Response Content update for Windows systems that caused many machines to crash and enter an endless reboot loop.
OIT personnel were alerted to server outages around midnight and swiftly mobilized a cross-functional team to investigate. Working through the early hours, OIT employees from the systems and security teams collaborated with Crowdstrike support and conducted independent research to find a solution. By 4:00 a.m., most systems had been restored, and the team continued working with campus units until the remaining issues were resolved by 10:00 a.m.
What We’ve Done to Prevent Future Incidents
While OIT Security had previously set Crowdstrike to defer sensor updates for critical systems, this setting applied only to major sensor version updates, not to the nightly content updates responsible for this incident. Since the incident, Crowdstrike has introduced new controls that allow customers to defer content updates as well. UA has now adopted a staggered deployment strategy: updates are rolled out first to test systems, then to non-critical production systems, and finally to critical systems.
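The staggered order described above can be illustrated with a minimal Python sketch. It is not OIT's tooling or a Crowdstrike API: the ring names, the `Host` class, and the `staged_rollout` and `defective_update` functions are all hypothetical, and the `healthy` flag stands in for whatever health check a real deployment would use.

```python
from dataclasses import dataclass

# Hypothetical ring names that mirror the order described above.
# They are illustrative labels, not Crowdstrike console settings.
RING_ORDER = ["test", "non_critical_production", "critical"]


@dataclass
class Host:
    name: str
    ring: str
    healthy: bool = True  # stand-in for a real post-update health check


def staged_rollout(hosts, apply_update):
    """Apply an update one ring at a time, in RING_ORDER.

    Stops before the next ring if any host in the current ring is
    unhealthy after the update. Returns the rings that completed cleanly.
    """
    completed = []
    for ring in RING_ORDER:
        ring_hosts = [h for h in hosts if h.ring == ring]
        for host in ring_hosts:
            apply_update(host)
        if not all(h.healthy for h in ring_hosts):
            break  # halt the rollout; later rings are never touched
        completed.append(ring)
    return completed


def defective_update(host):
    # Simulates an update that crashes every machine it reaches.
    host.healthy = False


hosts = [
    Host("lab-01", "test"),
    Host("print-01", "non_critical_production"),
    Host("db-01", "critical"),
]
completed = staged_rollout(hosts, defective_update)
still_healthy = [h.name for h in hosts if h.healthy]
print(completed, still_healthy)
```

In this simulation the defective update reaches only the test host, so the non-critical and critical hosts remain healthy. That containment is the point of ordering the rings from least to most critical.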
How Crowdstrike Has Enhanced Its Platform
In response to this incident, Crowdstrike has implemented a series of improvements, including:
- Enhanced software testing: Using advanced testing techniques such as fault injection and stress testing to prevent similar issues.
- Improved resilience: Strengthening error-handling mechanisms in the Falcon sensor to manage content-related errors gracefully.
- Refined deployment strategy: Introducing a staggered rollout that begins with a small canary deployment, with increased monitoring of system performance during updates.
- Third-party validation: Engaging independent reviewers to assess the quality of development and deployment processes.
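As a rough illustration of the first two items, the Python sketch below shows one way a program could reject a malformed content update and keep running on its previous rules, followed by a small fault-injection check. It does not reflect how the Falcon sensor is implemented: the JSON format, the `rules` field, and the `load_content` function are assumptions made only for this example.

```python
import json


def load_content(raw: bytes, last_known_good: dict) -> dict:
    """Parse and validate a content update (assumed here to be JSON).

    On any parsing or validation error, return the previous content
    instead of raising, so the caller keeps running on known-good rules.
    """
    try:
        content = json.loads(raw)
        if not isinstance(content, dict) or not isinstance(content.get("rules"), list):
            raise ValueError("content must be an object with a 'rules' list")
        return content
    except ValueError:
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return last_known_good


# Fault injection: feed deliberately malformed payloads and confirm that
# each one is rejected in favor of the last known-good content.
LAST_GOOD = {"rules": ["rule-a"]}
MALFORMED = [
    b"",               # empty file
    b"{not json",      # truncated or corrupt syntax
    b"\x80\x81\x82",   # bytes that are not valid text
    b'{"rules": 5}',   # wrong type for the expected field
    b"[1, 2]",         # valid JSON, wrong overall shape
]
results = [load_content(payload, LAST_GOOD) for payload in MALFORMED]
valid = load_content(b'{"rules": ["rule-b"]}', LAST_GOOD)
```

Every malformed payload falls back to `LAST_GOOD`, while the well-formed update is accepted. A real test suite would cover far more failure modes than these five.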
The measures taken by both OIT and Crowdstrike underscore a commitment to security and leave UA’s critical systems better protected against future risks.