On October 22nd, 2020, at approximately 6 PM EDT, our Atlanta data center experienced a service degradation. As stated on the status page, the initial cause was reported as a fiber cut. Our Atlanta data center has two diverse dark fiber spans (referred to here as the A-side and the B-side) providing connectivity to our transit providers and our public and private peering points in an active/active model. This model is implemented in many of our data centers around the world. Fiber cuts are fairly common, and because of this, each span has more than enough capacity to sustain the entire data center during a cut or maintenance. In nearly every instance, failover is instant and seamless. Unfortunately, in this case, it was not.
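To make the redundancy model concrete, here is a minimal sketch of the capacity math behind an active/active span pair: either span alone can carry the full data center load, so a single cut should not degrade service. The span sizes and traffic figure below are hypothetical and chosen only for illustration; they are not 3116 Digital's actual numbers.

```python
# Minimal sketch of the active/active capacity check described above.
# All figures are hypothetical, not 3116 Digital's real values.

SPANS_GBPS = {"A": 200, "B": 200}   # assumed capacity of each diverse dark fiber span
PEAK_TRAFFIC_GBPS = 160             # assumed peak data center traffic

def survives_single_span_loss(spans, peak):
    """Return True if losing any one span still leaves enough capacity for peak traffic."""
    for failed in spans:
        remaining = sum(cap for name, cap in spans.items() if name != failed)
        if remaining < peak:
            return False
    return True

print(survives_single_span_loss(SPANS_GBPS, PEAK_TRAFFIC_GBPS))  # True: a cut on either span is survivable
```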
3116 Digital is in the process of transitioning to an upgraded data center, and some of our internet connectivity has already been moved. When the A-side fiber went down, traffic over the recently moved circuits did not fail over as expected, causing some blackholing of customer traffic. The NetOps team quickly identified this and turned down those connections, taking the entire A-side of the network hard down. This placed all data center connectivity solely on the B-side, restoring all services. NetOps engaged 3116 Digital's fiber provider to investigate and repair the outage. The provider reported a fiber cut in the area and attached our ticket to that issue. That fiber cut took many hours to repair, with work not completing until late the next morning.
10-23-2020, 12:45 PM EDT: Full data center outage
When our fiber provider declared the cut repaired, our A-side was still down, and we continued to work with them to find the cause. The provider determined that our ticket had been attached to the larger fiber cut in error; we were not actually affected by that outage. At approximately 12:45 PM EDT, we lost all connectivity to the Atlanta data center: while investigating why the A-side was down, the provider's local field technician mistakenly disconnected our B-side, cutting off the data center completely. The technician quickly reconnected that fiber, restoring connectivity to the data center, and resumed the effort to restore the A-side. Shortly after, the cause of the A-side outage was found: our fiber pair had been disconnected at the provider's panel. Reconnecting it restored the A-side. The accidental disconnection of the B-side was traced to an incorrect label on the provider's panel, and the fiber provider is still investigating why the A-side pair was disconnected, as no such work was requested by 3116.

As for the original service degradation, the cause has been determined to be BGP not failing over as expected when the A-side went down, because the new transit had already been transitioned to the upgraded data center. This has already been corrected, and in the event of another fiber cut or outage, failover will happen as expected.
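To illustrate the intended behavior, the sketch below models BGP-style best-path selection across the two sides: when one side's paths are withdrawn, traffic shifts to the other, and a prefix with no surviving path is effectively blackholed, which is the symptom customers saw on the recently moved circuits. The prefix, local-preference values, and data structure are hypothetical and greatly simplified; this is not our router configuration, only a sketch of the failover logic.

```python
# Minimal, hypothetical sketch of failover across the A- and B-sides.
# Prefixes and local-preference values are invented for illustration.

routes = {
    "198.51.100.0/24": [
        {"side": "A", "local_pref": 200},   # preferred path via the A-side
        {"side": "B", "local_pref": 100},   # backup path via the B-side
    ],
}

def best_path(paths, down_sides=()):
    """Pick the highest local-preference path whose side is still up."""
    usable = [p for p in paths if p["side"] not in down_sides]
    return max(usable, key=lambda p: p["local_pref"]) if usable else None

# Normal operation: traffic prefers the A-side.
print(best_path(routes["198.51.100.0/24"])["side"])                     # "A"

# A-side fiber cut: its paths are withdrawn, so the B-side path wins.
print(best_path(routes["198.51.100.0/24"], down_sides=("A",))["side"])  # "B"

# A prefix with no surviving path has no route at all; traffic to it is
# blackholed, illustrating the effect of the failed failover.
print(best_path([{"side": "A", "local_pref": 200}], down_sides=("A",)))  # None
```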