At 05:00 CST, our network monitoring system alerted our systems team to a potential network connectivity issue affecting several of our servers in our Dallas datacenter. Because all systems and sites were responding normally at the time, and because network usage sometimes spikes early in the morning due to the resources and services that clients run on our servers, no immediate action was taken.
At 07:15 CST, our monitoring system alerted the systems team to degraded performance on several highly available servers and applications, and the team was dispatched to determine why these servers and their hosted software applications had become unresponsive.
Our initial investigation of the system logs pointed to an upstream issue with our offsite backup provider, which connects through the AWS backbone. As servers were queued for their normal backup procedures, the first server in the queue encountered an error, which disrupted the connections for every server behind it in the queue. While this large volume of backup data was stalled in transit, normal service requests continued to arrive at our network infrastructure, and the combined load consumed all available network resources, leaving that server network unresponsive.
The issue on the server that triggered the outage has been resolved, all remaining backups have completed, and the backup schedules have been reconfigured to allow more time between jobs. Should this issue happen again, a failover will siphon any server with backup issues into a separate queue to be addressed manually, allowing other systems to complete their scheduled backups normally.
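As a rough illustration of the failover behavior described above (the function and server names here are hypothetical, not our production code), a backup run can treat each server independently and divert failures to a manual-review queue instead of letting one failed job block the servers behind it:

```python
# Sketch only: per-server backup with failover to a manual queue.
# backup_fn and the server names are illustrative assumptions.
from collections import deque


def run_backups(servers, backup_fn):
    """Attempt backup_fn(server) for each server in order.

    A failure no longer blocks the rest of the queue: the failed
    server is appended to a manual queue for later human attention,
    and the remaining backups proceed normally.
    """
    manual_queue = deque()
    completed = []
    for server in servers:
        try:
            backup_fn(server)
            completed.append(server)
        except Exception as exc:
            # Siphon the problem server off instead of halting the batch.
            manual_queue.append((server, exc))
    return completed, manual_queue
```

The key design change is that an exception on one server is caught and recorded rather than propagated, which removes the head-of-line blocking that caused this incident.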
For any questions or concerns, please reach out to support@3116digital.com.