At 05:00 CST, our network monitoring system alerted our systems team to a potential network connectivity issue affecting several of our servers in our Dallas datacenter. Because all systems and sites were responding normally at the time, and because network usage sometimes spikes early in the morning due to the resources and services that clients run on our servers, no immediate action was taken.
At 07:15 CST, our monitoring system alerted the systems team to degraded performance on several highly available servers and applications, and the team was dispatched to determine why these servers and their hosted software applications had become unresponsive.
Our initial investigation of the system logs pointed to an upstream issue with our offsite backup provider, which connects through the AWS backbone. As servers were queued for their normal backup procedures, the first server in the queue encountered an error, which disrupted the connections for every server behind it in the queue. While this large volume of backup data was stalled in transit, normal service requests continued to arrive at our network infrastructure, and the combined load consumed all available network resources, leaving that server network unresponsive.
The issue on the server that triggered the outage has been resolved, all remaining backups have completed, and the backup schedules have been reconfigured to allow more time between jobs. Should this issue happen again, a failover will siphon any server with backup issues into a separate queue to be addressed manually, allowing other systems to complete their scheduled backups normally.
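As a rough illustration of the failover behavior described above (the function and server names here are hypothetical, not our production code), a backup run can treat each server independently and divert failures to a manual-review queue instead of letting one failed job block the servers behind it:

```python
# Sketch only: per-server backup with failover to a manual queue.
# backup_fn and the server names are illustrative assumptions.
from collections import deque


def run_backups(servers, backup_fn):
    """Attempt backup_fn(server) for each server in order.

    A failure no longer blocks the rest of the queue: the failed
    server is appended to a manual queue for later human attention,
    and the remaining backups proceed normally.
    """
    manual_queue = deque()
    completed = []
    for server in servers:
        try:
            backup_fn(server)
            completed.append(server)
        except Exception as exc:
            # Siphon the problem server off instead of halting the batch.
            manual_queue.append((server, exc))
    return completed, manual_queue
```

The key design change is that an exception on one server is caught and recorded rather than propagated, which removes the head-of-line blocking that caused this incident.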
For any questions or concerns, please reach out to support@3116digital.com.