Monthly Archives: October 2016

  • Post Mortem - October 26, 2016

    EDIT: Please see our status page for more information and for future updates.

    Summary of Events

    On October 26, 2016, at around 2 PM ET, we received an alert that a WAN uplink on one of the core routers was down. We determined that the optic had failed and needed to be replaced, and a technician was dispatched to replace the dead optic. At 4:05 PM ET, upon inserting the optic into the Secondary router, the Primary router panicked and rebooted into read-only mode. Our tech and network operators immediately restarted the Primary. The reboot took 8 minutes, and by 4:15 PM ET the core routers had re-initialized and were back online. Our monitoring systems cleared up except for a few servers.

    We immediately noticed that our server for magemojo.com was experiencing about 50% packet loss on its internal network, and we continued to receive a few customer reports of problems. We scanned the internal network for all reported customer problems and isolated the packet loss to one /24 VLAN. We thought the problem was related to the abrupt failover of the core routers and decided it was best to fail back to the original Primary. At 6:40 PM ET we initiated the core router failover back to the previous Primary. The failover did not go smoothly and resulted in another reboot, causing another 8-minute network disruption. The core routers came back with the old Secondary as Primary, and the problem remained.

    We reviewed the configuration to ensure there were no configuration errors with the new Primary but did not find any. We ran diagnostics on all network hardware and interfaces to identify the problem. We found no problems. We ran diagnostics a second time to check for anything we might have missed. Again, we found no problems. At this point, we started going through the changelog, working our way backward to look for any changes that could have caused the problem. We found a change from September that looked possibly related to the packet loss. We reverted the change, and the packet loss stopped at 9:10 PM ET. The core has remained stable with the new Primary since.

    THE NITTY GRITTY DETAILS

    Our core network uses 2 x Cisco 6500E Series switches, each with its own Supervisor 2T XL. Both Sup2Ts are combined using Cisco's Virtual Switching System (VSS) to create a single virtual switch in HA. Each rack has 2 x Cisco 3750X switches stacked for HA using StackWise, with 10G fiber uplinks to the core 6500s. Servers then have 2 x 1G uplinks in HA, one to each 3750X. Our network is fully HA all the way through, and at no point should a single networking device cause an outage. We have thoroughly tested our network HA and confirmed that all failover scenarios worked perfectly. Why inserting an optic into a 6500E blade would cause the other switch, the Primary, to reboot is completely unexpected and unknown at this moment. Why a single switch rebooting would cause both to go down is unknown. Why the previous Primary failed to resume its role is also unknown. We have Cisco support researching these events with us.
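
    For readers unfamiliar with VSS, the sketch below shows the general shape of the configuration that joins two Sup2T-based chassis into one virtual switch. It is illustrative only; the domain number, priorities, port-channel, and interface numbers are hypothetical and not our production values.

        ! Illustrative VSS sketch only - domain, priorities, and port-channels are hypothetical,
        ! and the chassis-conversion steps are omitted for brevity
        switch virtual domain 10
         switch 1 priority 110
         switch 2 priority 100
        !
        ! Virtual Switch Link (VSL) on chassis 1; chassis 2 mirrors this with "switch virtual link 2"
        interface Port-channel1
         switch virtual link 1
         no shutdown
        !
        interface TenGigabitEthernet1/5
         channel-group 1 mode on
         no shutdown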

    The packet loss problem was challenging to figure out. First, it wasn't clear what the common denominator was among the servers with packet loss. Packet loss happened on servers across all racks and all physical hardware. We thought for sure the problem was related to the abrupt switchover of the core. We focused our search on isolating the packet loss and narrowed it down to one particular VLAN. Why one VLAN would have packet loss, but not be completely down, was a big mystery. We ran thorough diagnostics and combed through every bit of hardware, routing tables, ARP tables, etc. We isolated the exact location where the problem was happening, but why remained unknown. Finally, we concluded the problem was not caused by the earlier network disturbance. That's when we focused on the change log.

    The change was related to an "ip redirects" statement in the config and to how we use route-maps to keep internal traffic routing internally and external traffic routing through the F5 Viprion cluster. During tuning of Sup2T CPU performance, this line was changed for one particular VLAN. At the time it created no problems and packets routed correctly. However, after the core failover the interfaces changed, and subsequently the F5 Viprion cluster could not consistently route all packets coming from that internal VLAN back to the internal network interface from which they originated.
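
    To make that concrete, below is a minimal, hypothetical sketch of the kind of SVI and policy-routing configuration involved. The VLAN number, addresses, access-list and route-map names are made up for illustration and are not our actual configuration.

        ! Hypothetical example - VLAN, addresses, and names are illustrative only
        ip access-list extended INTERNAL-NETS
         permit ip 10.0.0.0 0.255.255.255 10.0.0.0 0.255.255.255
        !
        ! Policy-route internal-to-internal traffic to an internal next hop,
        ! so it never hairpins out through the external path
        route-map KEEP-INTERNAL permit 10
         match ip address INTERNAL-NETS
         set ip next-hop 10.10.0.1
        !
        interface Vlan100
         ip address 10.10.100.1 255.255.255.0
         ! the statement at issue: suppress ICMP redirects on this SVI
         no ip redirects
         ip policy route-map KEEP-INTERNAL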

    WHERE WE WENT WRONG

    Honestly, we did a few things wrong. First, we should not have made any changes to the core network during the afternoon unless those changes were 100% mission critical. Second, we jumped right into fixing the problem and replying to customers, but never publicly acknowledged that a problem had started. Third, our support desk was overloaded with calls and tickets. We tried very hard to respond to everyone, but it's just not possible to speak to hundreds of customers on the phone at once. We were not able to communicate with everyone.

    HOW WE'RE DOING BETTER

    We're re-evaluating how we prioritize events and reclassifying what counts as a mission-critical repair. All work, even if we think there is zero chance of a problem, should be done at night and scheduled in advance. Customers will be notified, again, even if we don't expect any disruption of services.

    We also know that we need to post to Twitter and our status page, and enable a pre-recorded voice message that says "We are aware that we are currently experiencing a problem and we are working on the issue" as soon as we first identify a problem. We're working on setting up a new status page where we can post event updates. We're also working with our engineers to guide them on how to provide better status updates to us internally so that we can relay information to customers. Unfortunately, during troubleshooting there is no reliable ETA, and any ETA given would be wildly inaccurate. But we understand that, at the very least, an update of "no new updates at this time" is better than no update at all.

    Finally, we're working with Cisco engineers to find the cause of the reboot, upgrade IOS, and replace any hardware that might have caused the problem.

    THANK YOU

    Thank you for being a customer here at Mage Mojo. We want you to know that we work incredibly hard every day to provide you with the highest level of support, best performance, and lowest prices. Our entire team dedicates themselves to making your experience with Magento a good one. When something like this happens, we hope you'll understand how hard we work and how much we care about you. Everything we do is for you, the customer, and we appreciate your business. Trust that we'll learn from this experience and grow stronger, providing an even better service for you in the future. Thank you again for being a customer.

    Update Nov 8, 2016

    We are scheduling a network maintenance window for Thursday, November 10th, 2016, from 12 AM ET to 1 AM ET. During this window we will replace a faulty line card in our core network and upgrade the router software. This line card was partially responsible for the outage that occurred on Oct 26th. The other contributing factors were two IOS bugs, CSCts44718 and CSCui91801, referenced below.

    For this replacement, we are hoping to swap the card with only a minor network interruption of a few seconds, but due to the state of the card it may require a reboot of the core routers, which would mean approximately 15 minutes of downtime. After replacing the line card, we will perform an in-service software upgrade (ISSU), which should not cause any downtime. We do not anticipate a network disruption for the full window, but we are scheduling a one-hour window in case we run into any unanticipated problems.
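
    For context, an in-service software upgrade on this platform is a staged sequence rather than a single reload. A rough, heavily simplified outline is below; the image name and slot arguments are placeholders, and the exact procedure we follow will be dictated by Cisco's documented steps for our hardware.

        ! Rough ISSU outline - image names and supervisor slots are placeholders, not our actual values
        ! Stage the new image on the standby supervisor and reload it
        issu loadversion 1/5 disk0:<new-image>.bin 2/5 slavedisk0:<new-image>.bin
        ! Force a switchover so the upgraded supervisor becomes active
        issu runversion
        ! Stop the automatic rollback timer once the new image looks healthy
        issu acceptversion
        ! Upgrade the remaining supervisor and commit the change
        issu commitversion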

    Ref:
    https://tools.cisco.com/bugsearch/bug/CSCts44718/?reffering_site=dumpcr

    https://tools.cisco.com/bugsearch/bug/CSCui91801/?reffering_site=dumpcr

  • MEET MAGENTO VIETNAM 2016

     

    The second Meet Magento Vietnam conference took us to Ho Chi Minh City, Vietnam, in October 2016.

    We were grateful for a relatively uneventful trip through Shanghai, but left the airport much lighter than expected - our luggage had not yet arrived! Not that we could have cleared the customs check anyway: the airport had a power outage which took all of the equipment down. That guaranteed a quick exit from the airport, without local currency, since the ATMs went down with the power. Yay for cellphones - we were still able to get an Uber to our hotel. The complimentary dressing gown in the closet was a blessing, since it was the only clean piece of clothing we had :)

    We finally got some Vietnamese dong and were instant millionaires. With the exchange rate at about 22,500 to the dollar, we were walking around with more than a million dong! We were all set for a colorful trip to the local markets to hunt for anything worthy of a conference attendee. We would have been happier exploring the city and meeting others in town for the event, but it was intriguing to pick up some traditional local gear for the next day. Besides, traditional gear was the only practical choice - it seemed that dresses in Vietnam were typically shorter and quite comical on someone of my height. The best items are usually ordered in advance and tailor-made to fit. No time for that though. The quick and easy tourist options would have to do. I got all primped the next morning in my ao dai over a pair of jeans, only to get rained on on the way to the event. All that trouble only to arrive looking disheveled anyway!

    It was amazing to meet a whole new subset of the Magento community, as well as lots of really keen students. We made great friends who also turned out to be wonderful networkers and tour guides to Ho Chi Minh City, its hottest club spots, dinner haunts and, of course, coffee shops. Vietnamese filter coffee is not for the faint-hearted! This flavorful, strong brew is bound to keep you alert for many hours! One of the highlights of this event was getting to know Thomas Goletz and watching him interact with the community. It was truly inspiring. Pictured here is Tra My Nguyen, one of the conference organizers, towering over Thomas Goletz. The conference venue was well set up, with enormous screens that were great for large audiences.

    Since we commented on the showers in our blog post covering Magetitans UK, we felt it fitting to offer a detailed account of the bathrooms we encountered in Vietnam. They are often appropriately called wet bathrooms. Sure enough, taking a shower gets everything wet. Bathtubs and shower cubicles are not commonplace; many shower areas we encountered doubled as the area with the toilet and basin, with the shower head simply protruding from the wall - a really efficient use of space.

    We made sure we had time to explore the many charms of this amazing country. Apart from museums, street food and cafes, we wandered over to Bui Vien, a densely populated tourist and backpacker area that seemed to be alive all night. It was a great place to hang out outdoors and people-watch, and a great place to make friends and enjoy local delicacies prepared in front of you in a matter of minutes. We very quickly found our favorite spots for great coffee, meals and cocktails, and were always amazed at the enterprising, friendly and hardworking nature of the folks there, whom we miss dearly. A warning though: the electrical wiring would make anyone cringe. No, this is not your typical data-center wiring standard!

    Hahn, at the restaurant a few doors down from our hotel, was always happy to see us and kept surprising us with amazing dishes that were not even on her menu. She was even happy to leave her restaurant just to walk with me through the winding alleyways off the main drag so we could find fresh fruit.

    I was able to establish a routine in the craziness of Bui Vien as if I had always been there. We found a charming hotel managed by two young men who filled the roles of concierge, porter, hotel reception, tour and travel office, cleaning staff and all-round great guys. Our first-floor room was carefully chosen with large windows so we could always feel like we were a part of the daily activities, which were distinctly different at various parts of the day. The breakfast period was perhaps the calmest, but by no means quiet. Various store fronts and roadside vendors popped up almost everywhere, with motorcycles, bicycles, taxis, Ubers, and pedestrians buzzing around with no clear demarcation of sidewalk and roadway. Stores and street vendors set up little plastic stools or a rolling cart on the roadside for a breakfast of banh mi (sandwiches), soup, rice porridge or Vietnamese omelettes. Many of these vendors vanished after breakfast, reappeared at lunchtime, and vanished again until dinner - not that there was ever a shortage of delectable food options or a welcoming smile. We once popped our unannounced heads into a language institute that seemed to be preparing for Halloween. The students welcomed us as if they were expecting guests. We wandered through the scary maze they were setting up and left after great conversations and lots of pictures.

    We were always amazed by the incredible friendliness and perseverance of folks everywhere. One of our favorite things was offering larger-than-usual tips to various people when the opportunity arose. It was so easy to communicate with as little as a smile and gratitude, so it was wonderful to add a token of appreciation. The exchange rate was so heavily in our favor that what is a small token of gratitude in the US seemed to make someone's day in Vietnam. Our oarsman in Ha Long Bay certainly made our day, as did everyone we met in business and social interactions. He was great at finding the hot picture spots and directing us into appropriate poses using hilarious gestures. He would maneuver the boat with an oar in one hand while taking pictures of us with the other. He seemed to be skilled at using absolutely any phone or camera handed to him while still rowing with one arm :) Much to the amusement of us all, he also scolded me for my meek attempt at a smile when he tried to take my picture :D

    One rule of survival was to quickly learn how to safely cross the street. I thought I had conquered this as a kid :D Stepping into traffic and maintaining a constant pace was safest, so motorists could anticipate your next move. The throng of motorbikes, bicycles and larger vehicles simply buzzed around pedestrians, and we did not observe a single accident involving a pedestrian. It amazed us that even large and fragile items like construction materials, floral bouquets, and multiple trays of eggs were all safely transported by bike. This was perhaps a common theme - things that may have seemed daunting or impossible back home were regular occurrences in daily life. We were sad to leave all this behind and hope to return many times. We keep in touch with friends made along the way and look forward to visiting in the future. Perhaps someday we can host some of them in the US.
