EDIT: Please see our status page for more information and for future updates.
Summary of Events
On October 26, 2016, at around 2 PM ET, we received an alert that a WAN uplink on one of the core routers was down. We determined that the optic had failed and needed to be replaced, and we dispatched a technician to replace it. At 4:05 PM ET, when the new optic was inserted into the Secondary router, the Primary router panicked and rebooted into read-only mode. Our tech and network operators immediately restarted the Primary. The reboot took 8 minutes. By 4:15 PM ET the core routers had re-initialized and were back online. Our monitoring alerts cleared for all but a few servers.
We immediately noticed that our server for magemojo.com was experiencing about 50% packet loss on its internal network. We continued to receive a few customer reports of problems. We scanned the internal network for all customer problems and isolated the packet loss to one /24 VLAN. We thought the problem was related to the abrupt failover of the core routers and decided it was best to try switching back to the original Primary. At 6:40 PM ET we initiated the core router failover back to the previous Primary. The failover did not go smoothly and resulted in another reboot, which caused another 8-minute network disruption. The core routers returned with the old Secondary as Primary, and the problem remained.
We reviewed the configuration of the new Primary for errors but did not find any. We ran diagnostics on all network hardware and interfaces to identify the problem. We found no problems. We ran diagnostics a second time to check for anything we might have missed. Again, we found no problems. At this point, we started going through the changelog, working our way backward to look for any changes that could have caused the problem. We found a change from September that looked possibly related to the packet loss. We reverted the change, and the packet loss stopped at 9:10 PM ET. The core remains stable with the new Primary.
THE NITTY GRITTY DETAILS
Our core network uses 2 x Cisco 6500E Series Switches, each with its own Supervisor 2T XL. Both Sup2Ts are combined using Cisco's Virtual Switching System (VSS) to create a single virtual switch in HA. Each rack has 2 x Cisco 3750X switches stacked for HA using StackWise, with 10G fiber uplinks to the core 6500s. Servers then have 2x1G uplinks in HA, 1 to each 3750X. Our network is fully HA all the way through, and at no point should a single networking device cause an outage. We have thoroughly tested our network HA and confirmed all failover scenarios worked perfectly. We did not expect inserting an optic into a 6500E blade to reboot the other switch, the Primary, and we do not yet know why it did. We also do not know why a single switch rebooting caused both to go down, or why the previous Primary failed to resume its role. We have Cisco support researching these events with us.
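To illustrate what "no single networking device should cause an outage" means for the design described above, here is a minimal Python sketch. It models a simplified version of the topology as a graph and checks whether removing any one device cuts a server off from the WAN. The node names and the exact cabling in `LINKS` are assumptions made for illustration, not our actual wiring, and the check covers physical paths only. It says nothing about software faults like the reboot described here.

```python
from collections import deque

# Hypothetical, simplified topology: two core switches, one rack with two
# stacked switches, and one dual-homed server. Cabling is assumed.
LINKS = [
    ("wan", "core-a"), ("wan", "core-b"),      # WAN uplinks on each core switch
    ("core-a", "core-b"),                      # link between the core pair
    ("core-a", "rack-sw-1"), ("core-b", "rack-sw-2"),  # rack uplinks
    ("rack-sw-1", "rack-sw-2"),                # stack link inside the rack
    ("server", "rack-sw-1"), ("server", "rack-sw-2"),  # server uplinks
]

def reachable(links, src, dst, failed=None):
    """Breadth-first search from src to dst, skipping links on the failed device."""
    adj = {}
    for a, b in links:
        if failed in (a, b):
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(links, src, dst):
    """Devices whose failure alone disconnects src from dst."""
    devices = {n for link in links for n in link} - {src, dst}
    return sorted(d for d in devices if not reachable(links, src, dst, failed=d))

print(single_points_of_failure(LINKS, "server", "wan"))
```

With every link healthy, the list is empty. Dropping one WAN uplink from `LINKS` makes the remaining uplink's switch a single point of failure, which is why a dead optic is worth replacing even while traffic is still flowing.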
The packet loss problem was challenging to figure out. First, it wasn't clear what the common denominator was among the servers with packet loss. Packet loss happened on servers across all racks and all physical hardware. We thought for sure the problem was related to the abrupt switchover of the core. We focused our search on isolating the packet loss and narrowed it down to one particular VLAN. Why one VLAN would have packet loss, but not be completely down, was a big mystery. We thoroughly ran diagnostics and combed through every bit of hardware, routing tables, ARP tables, etc. We isolated the exact location where the problem was happening, but the cause remained unknown. Finally, we concluded the problem was not caused by the network disturbance earlier. That's when we focused on the change log.
The change was related to an "ip redirects" statement in the config and how we use route-maps to keep internal traffic routing internally and external traffic routing through the F5 Viprion cluster. During tuning of the Sup2T CPU performance, this line was changed for one particular VLAN. At the time it created no problems and packets routed correctly. However, after the core failover, the interfaces changed, and subsequently the F5 Viprion cluster could not consistently route all packets coming from that internal VLAN back to the internal network interface from which they originated.
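As a rough illustration of how packet loss spread across racks can be narrowed down to one subnet, here is a minimal Python sketch. It groups per-server loss measurements by /24 and flags subnets whose average loss crosses a threshold. The addresses, loss figures, function names, and the 5% threshold are all hypothetical. This is not the tooling we used during the incident.

```python
import ipaddress
from collections import defaultdict

def loss_by_subnet(samples, prefix_len=24):
    """Average packet-loss percentage per subnet.

    samples: iterable of (ip_string, loss_percent) pairs, for example
    parsed from ping results against each server.
    """
    buckets = defaultdict(list)
    for ip, loss in samples:
        # strict=False lets us derive the containing network from a host address
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        buckets[net].append(loss)
    return {str(net): sum(v) / len(v) for net, v in buckets.items()}

def suspect_subnets(samples, threshold=5.0, prefix_len=24):
    """Subnets whose average loss is at or above the threshold (percent)."""
    averages = loss_by_subnet(samples, prefix_len)
    return sorted(net for net, avg in averages.items() if avg >= threshold)

# Hypothetical measurements: (server address, observed loss in percent)
samples = [
    ("10.0.1.10", 0.0),
    ("10.0.1.11", 0.0),
    ("10.0.2.10", 50.0),
    ("10.0.2.20", 48.0),
    ("10.0.3.5", 0.0),
]
print(suspect_subnets(samples))
```

Grouping by subnet rather than by rack or hardware is what exposes the pattern when the affected servers have nothing physical in common.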
WHERE WE WENT WRONG
Honestly, we did a few things wrong. First, we should not have made any changes to the core network during the afternoon unless those changes were 100% mission critical. Second, we jumped right into fixing the problem and replying to customers but never publicly acknowledged that the problem had started. Third, our support desk was overloaded with calls and tickets. We tried very hard to respond to everyone, but it's just not possible to speak to hundreds of customers on the phone at once. We were not able to communicate with everyone.
HOW WE'RE DOING BETTER
We're re-evaluating our prioritization of events to reclassify mission critical repairs. All work, even if we think there is zero chance of a problem, should be done at night and scheduled in advance. Customers will be notified, again, even if we don't expect any disruption of services.
We also know that we need to post on Twitter, our status page, and enable a pre-recorded voice message that says "We are aware that we are currently experiencing a problem and we are working on the issue." as soon as we first identify a problem. We're working on setting up a new status page where we can post event updates. We're also working with our engineers to guide them on how to provide better status updates to us internally so that we can relay information to customers. Unfortunately, during troubleshooting, there is no ETA and any given ETA will be wildly inaccurate. But we understand at the very least an update of "no new updates at this time" is better than no update at all.
Finally, we're working with Cisco engineers to find the cause of the reboot, upgrade IOS, and replace any hardware that might have caused the problem.
Thank you for being a customer here at Mage Mojo. We want you to know that we work incredibly hard every day to provide you with the highest level of support, best performance, and lowest prices. Our entire team dedicates themselves to making your experience with Magento a good one. When something like this happens, we hope you'll understand how hard we work, and how much we care about you. Everything we do is for you, the customer, and we appreciate your business. Trust in us to learn from this experience and know that we'll grow stronger, providing an even better service for you in the future. Thank you again for being a customer.
Update Nov 8, 2016
We are scheduling a network maintenance window on Thursday, November 10th, 2016, from 12 AM ET to 1 AM ET. During this window we will replace a faulty line card in our core network and upgrade our version of the router software. This line card is partially responsible for the outage that occurred on Oct 26th. The other problems were two bugs identified in IOS: CSCts44718 and CSCui91801.
For this replacement, we hope to swap the card with only a minor network interruption of a few seconds, but due to the state of the card, it may require a reboot of the network routers, which would mean approximately 15 minutes of downtime. After replacing the line card, we are going to perform an in-service software upgrade (ISSU), which should not cause any downtime. We do not anticipate a network disruption lasting the full window, but we are scheduling a one-hour window in case we run into any unanticipated problems.