Update, Jan. 23: OIT’s networking equipment vendor has verified that a bug within a core network switch caused the widespread network outage. Unfortunately, because of the nature of the bug, the redundancy mechanisms built into the switch did not initiate as expected. The vendor has documented the issue and provided us with a patch to correct it.
While this outage appears to have been caused by a vendor software issue, OIT is using the event as a learning experience for how we can better prepare for and respond to future outages. Our teams have met to document areas for further research or improvement based on the lessons learned.
Original Post, Jan. 17: On Thursday, Jan. 16, OIT experienced a networking hardware failure that affected many services and applications on campus.
The networking hardware is designed with dual switches for redundancy; however, the faulty switch appeared to fail only partially, which prevented automatic failover to the redundant switch.
OIT is performing a full root cause analysis and has engaged the hardware vendor to assist with hardware diagnostics. Additional details on the root cause of this outage will be provided as they become available.