Post Mortem

  • Stratus: The Legendary Journey

    The launch and ongoing improvement of Mojo Stratus have been a bumpy road. Stratus was launched just before Meet Magento New York; it was our major release and was plastered on every wall at MMNYC. We wanted to do something innovative and different with Mojo Stratus.

    Stratus is different. Rather than continuing down the path of traditional server offerings – i.e. you get a server, things are installed on it, and you have one big monolithic piece of hardware running whatever you need – we decided to use containers. Several years ago we had looked at Docker and containers while developing our panel, Mojo Host Manager. Back then containers were an unstable parlor trick: great for production if you were okay with your production site constantly being on fire.

    Support for containers is now widespread, especially since Google released its Kubernetes technology. We decided to build the services for Mojo Stratus on containers – all the services the average Magento 2 store would need – and our initial release worked despite several issues. We tweaked, and tweaked, and got through the Thanksgiving sales season unscathed. Then, in December, major systemic issues began to appear with no obvious explanation.

    First, we saw database problems. On Mojo Stratus, Amazon Aurora hosts all the databases. Aurora's main strength is scaling out read replicas. If you have ever tried to set up your own MySQL master-slave replication or other DIY clustering, you know it is not much fun. Aurora makes this easy, and we wanted read replicas for future scaling. It is still MySQL, though, and subject to the same problems you would expect from high usage and other bugs in MySQL. What we saw were patterns of locks in MySQL that would freeze all transactions on all stores for a few seconds, followed by a huge spike in active connections as all the traffic on Stratus backed up into Aurora. Site alarms would go off, the sky would fall, customers noticed, cats and dogs living together, mass hysteria. We needed deeper insight into Aurora, and fast.
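    As an aside, here is a minimal sketch (assuming pymysql and placeholder credentials, not our production tooling) of how you could poll MySQL/Aurora for the two symptoms described above: transactions stuck waiting on locks and connection counts piling up.

    ```python
    # Hedged sketch: watch Aurora/MySQL for lock waits and connection pile-ups.
    # Hostname, credentials, and the alert threshold below are placeholders.
    import time

    import pymysql

    conn = pymysql.connect(host="aurora-cluster.example.internal", user="monitor",
                           password="secret", database="information_schema",
                           autocommit=True)

    while True:
        with conn.cursor() as cur:
            # Transactions that have been stuck in a lock wait for over 2 seconds.
            cur.execute("""
                SELECT trx_mysql_thread_id, trx_query,
                       TIMESTAMPDIFF(SECOND, trx_wait_started, NOW()) AS wait_seconds
                FROM innodb_trx
                WHERE trx_state = 'LOCK WAIT'
                  AND trx_wait_started < NOW() - INTERVAL 2 SECOND
            """)
            waiting = cur.fetchall()

            # Current connection count; a sudden spike means traffic is backing up.
            cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
            _, threads = cur.fetchone()

        if waiting or int(threads) > 500:   # 500 is an arbitrary threshold
            print(f"ALERT: {len(waiting)} lock waits, {threads} connections")
        time.sleep(5)
    ```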

    Getting insight into Aurora was not easy. We needed something pre-built. Basic stats from AWS, or even stats you pull yourself from the MySQL engine, are not very useful. That is not a fault in Aurora; for an application-specific problem you want to see exactly which queries are happening during a failure event. After some trial and error we came across VividCortex (https://www.vividcortex.com/) and hooked it into Stratus. VividCortex provides tons of information about what queries are running and helped us answer questions like:

    • What queries are running?
    • How often do certain queries run?
    • Which databases run certain queries the most?
    • Which queries are the most time consuming?

    We can't give VividCortex enough love for watching database performance. For anyone curious what the raw data behind those answers looks like, see the sketch below.
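    The sketch below is only a tiny fraction of what VividCortex surfaces, and it assumes MySQL's performance_schema is enabled (connection details are placeholders), but the statement digest table can answer the same kinds of questions: which queries run, how often, against which schema, and where the time goes.

    ```python
    # Hedged sketch: top query digests by total time, from performance_schema.
    import pymysql

    conn = pymysql.connect(host="aurora-cluster.example.internal", user="monitor",
                           password="secret", autocommit=True)

    with conn.cursor() as cur:
        cur.execute("""
            SELECT schema_name,
                   digest_text,
                   count_star            AS executions,
                   sum_timer_wait / 1e12 AS total_seconds   -- timer is in picoseconds
            FROM performance_schema.events_statements_summary_by_digest
            ORDER BY sum_timer_wait DESC
            LIMIT 10
        """)
        for schema, digest, executions, total_seconds in cur.fetchall():
            print(f"{schema or '-':<20} {executions:>8}x {total_seconds:>10.1f}s  "
                  f"{(digest or '')[:60]}")
    ```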

    After gathering a lot of data, we found a pattern. Stratus would lock up during certain types of queries, which occurred during certain actions on particular stores. On top of this, the Magento 2 crons were going haywire. Magento 2 has a bug (https://github.com/magento/magento2/issues/11002) where the cron_schedule table can inflate to infinity. Crons start running all over each other and destroying your server in the process. Certain extensions ship particularly heavy cron tasks, whether out of necessity or because of inefficient code, and they all run at once.
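    As an illustration of the kind of stop-gap this pushes you toward (a hedged sketch only, not a fix for the underlying bug), pruning finished rows keeps cron_schedule from growing without bound. The table and status values are Magento 2's standard cron_schedule schema; connection details are placeholders.

    ```python
    # Hedged sketch: prune old completed/missed cron_schedule rows for one store.
    import pymysql

    conn = pymysql.connect(host="aurora-cluster.example.internal", user="admin",
                           password="secret", database="magento_store",
                           autocommit=True)

    with conn.cursor() as cur:
        # Remove finished or missed jobs older than a day; leave pending/running rows alone.
        deleted = cur.execute("""
            DELETE FROM cron_schedule
            WHERE status IN ('success', 'missed', 'error')
              AND scheduled_at < NOW() - INTERVAL 1 DAY
        """)
    print(f"pruned {deleted} stale cron_schedule rows")
    ```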

    With a bug causing locks, crons waging war against everything, and bad queries coming in, we had a recipe for poor performance, even with the massive resources of Amazon Aurora. We used multiple approaches to bring everything under control. First, we notified customers about problem extensions and site code. Second, we started limiting crons, ultimately creating our own extension to manage them. We have also pushed forward support for services like Elasticsearch; search in Magento should not run through MySQL if it can be avoided. We are also working on a MySQL reader extension for Magento 2 to take full advantage of Aurora scaling.

    Those solutions helped, but we still had issues on the file system side, and those left us baffled. We had no problems getting through the busiest days of the season, Black Friday and Cyber Monday. From the start we have been using ObjectiveFS, an amazing filesystem that stores data on S3 and pulls it to local storage on demand. Files are cached locally to speed up performance; running anything directly off S3 would be very slow, especially for Magento, where thousands of files may be opened on a single request.

    ObjectiveFS would use a lot of CPU and spike iowait, affecting every customer. That issue started in December and had not been a problem in the months before. The iowait spikes became more frequent and severe, unrelated to specific traffic, and we had to do something. We shopped around for other file system solutions and came across Weka.io, a high-performance file sharing solution that offers low latency and high throughput over the network. With most network file systems you cannot get the low latencies needed for an application like Magento.

    Weka promised it all, with file system latencies in the microsecond range; well-known shared-storage tech like Ceph all sits in the millisecond range. It used its own kernel driver and relied on i3 instances with NVMe storage. Again, when loading 1,000-plus files per page load you need low latency; that is why the switch to SSDs was so important for MageMojo early on. It looked like a drop-in replacement file system, so we fired it up and got it working.

    The initial results were promising. Weka handled 300k requests per second for files over the network without issues, and writes were no problem: you could write from one place and see the file almost instantly from another. Many frontend and backend parts of Magento write a file and then display it via an AJAX request (product image uploads, for example). On other systems the upload would work but the thumbnail would not appear, because the write had not propagated fast enough.

    After more testing, we went ahead and moved everyone off ObjectiveFS to the Weka.io filesystem. We had a few issues with its configuration and worked with their team to get everything set up correctly. For a while life was good. But Weka.io added significant latency to load times, even with its microsecond response over the network – on average about a full second compared to the original ObjectiveFS system. The load time was a trade-off for what we believed to be stability.

    In February we had a critical failure on the Weka.io cluster, a few weeks after completing our migration to their filesystem. The system was designed to be redundant so that two storage nodes could fail without data loss. In our case three nodes failed, putting the data on their ephemeral storage at risk of being unrecoverable. A bug in the Weka.io software caused the entire cluster to become unresponsive, and unfortunately we were never given a full explanation by the Weka.io team.

    We brought the stores online within 24 hours using an older copy of the data. In the following days we restored files where we could and helped bring stores back using their more recent data. We stabilized again on ObjectiveFS and got back to business. ObjectiveFS was not as bad as we recalled, since we had fixed the other, Aurora-related issues in the meantime. And not long before the Weka failure, we had learned about the Meltdown vulnerability.

    This is the real kicker on top of it all. Once Meltdown became public knowledge, we learned that Amazon had quietly patched all their systems in mid-December. The Meltdown patches coincide with the random systemic issues we had assumed were ObjectiveFS-specific. It was not until we went back to ObjectiveFS that we realized there could be a connection. We also had AWS Enterprise support confirm the patching timeline; they had been under embargo and could not reveal the vulnerability.

    In hindsight, that change severely impacted our file system performance. The Meltdown patches are known to hurt exactly the kind of load Magento creates, especially stat calls, and Magento makes thousands of them per request. Post Black Friday, multiple issues converged to create a suddenly unstable system. We failed to identify the cause correctly and tried to fix it with different technology. That was a major mistake on our part, and we paid the debt with a lot of sleepless nights.
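    To see that stat overhead in isolation, a quick benchmark like the sketch below (the directory and the 5,000-call count are arbitrary, roughly a Magento-request-sized batch) times a burst of stat() calls; run it on a patched versus an unpatched kernel and the per-call difference shows up immediately.

    ```python
    # Hedged sketch: time a Magento-request-sized burst of stat() syscalls.
    import os
    import time

    # Collect a few thousand real paths to stat (any large directory tree works).
    paths = []
    for root, _dirs, names in os.walk("/usr"):
        paths.extend(os.path.join(root, n) for n in names)
        if len(paths) >= 5000:
            break
    paths = paths[:5000]

    start = time.perf_counter()
    for p in paths:
        try:
            os.stat(p)
        except OSError:
            pass
    elapsed = time.perf_counter() - start

    print(f"{len(paths)} stat() calls took {elapsed * 1000:.1f} ms "
          f"({elapsed / len(paths) * 1e6:.1f} µs each)")
    ```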

    With the realization about Meltdown and a new look at ObjectiveFS, we resumed testing and making more tweaks. Performance was better, but not the best we had hoped for. Further updates gave us incremental improvements. In the first iteration we used multiple ObjectiveFS mounts; each covered many stores on a given physical node, and those mounts existed on all workers in the Stratus cluster. As a store scaled out, the containers already had the files available, and requests would cache the files a container needed on the respective node over time. But with many stores sharing a mount, the cache sizes became very large relative to any one store, and any given request had to fetch its specific files from a very large haystack. Testing confirmed this was a major bottleneck.

    For Stratus 2.5, the current generation, we moved to a single ObjectiveFS mount per store. Each store has its own file cache, local to the node running its containers, on disk and in memory. We launched Stratus 2.5 two weeks ago and it has solved every file system issue we had received complaints about, especially slowness in the Magento admin. Site performance is faster than ever: according to our New Relic data, every store is 30% faster now, and stores with heavy file operations on load show even more improvement.

    We’ve also added a lesser-known feature called Stratus Cache. Stratus Cache bakes most of your code base directly into the container images we use for scaling. It bypasses the file system for the majority of system calls and improves performance, while making scaling for large sales a breeze. If you are planning a large promotion or traffic influx, please let us know and we will help get that working for you.

    To contribute back to the community and improve Stratus, we’ve started building our own Magento 2 modules to address specific concerns we have about Magento 2 performance. Our first release, a complete rework of the Magento 2 cron system, is on GitHub at https://github.com/magemojo/m2-ce-cron . Under the right conditions the stock Magento 2 crons can take a server down: they constantly fight each other, run the same task multiple times, and cause vital cron tasks to be missed. Our module eliminates those problems.
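    The module itself is PHP inside Magento; the sketch below is only a minimal illustration of the core idea it applies: take one exclusive, non-blocking lock per job so the same cron task can never run twice at once. The lock path, job name, and cron group are hypothetical.

    ```python
    # Hedged sketch: run a cron job under an exclusive lock so overlapping
    # schedules skip instead of piling up. Lock path and job name are examples.
    import fcntl
    import subprocess
    import sys

    def run_exclusively(job_name, command):
        lock_path = f"/var/lock/cron-{job_name}.lock"
        with open(lock_path, "w") as lock_file:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                print(f"{job_name} is already running; skipping this schedule")
                return 0
            # Lock is held for the lifetime of the child process.
            return subprocess.call(command)

    if __name__ == "__main__":
        sys.exit(run_exclusively("index_group",
                                 ["php", "bin/magento", "cron:run", "--group=index"]))
    ```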

    Next we have our split DB extension, viewable at https://github.com/magemojo/m2-ce-splitdb . Magento 1 CE allowed merchants to easily use a master-slave database setup with a dedicated reader. Stratus uses Aurora, which scales by seamlessly adding multiple readers to a cluster. Since M2 CE does not support this at all out of the box, we had to build our own solution. We believe Community should be able to scale just as well as Enterprise.
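    The extension is PHP inside Magento, so the sketch below is only a hedged illustration of the routing rule it is built around: reads go to an Aurora reader endpoint, everything else to the writer endpoint. The endpoints, credentials, and the simple SELECT-prefix heuristic are placeholders, not the extension's actual logic.

    ```python
    # Hedged sketch: route SELECTs to the Aurora reader endpoint, writes to the writer.
    import pymysql

    WRITER = "my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com"      # placeholder
    READER = "my-cluster.cluster-ro-xxxx.us-east-1.rds.amazonaws.com"   # placeholder

    def connect(host):
        return pymysql.connect(host=host, user="magento", password="secret",
                               database="magento_store", autocommit=True)

    writer = connect(WRITER)
    reader = connect(READER)

    def query(sql, args=None):
        # Naive heuristic: statements starting with SELECT go to the reader.
        conn = reader if sql.lstrip().upper().startswith("SELECT") else writer
        with conn.cursor() as cur:
            cur.execute(sql, args)
            return cur.fetchall()

    # Example: catalog reads hit the reader and never touch the writer.
    rows = query("SELECT entity_id, sku FROM catalog_product_entity LIMIT 5")
    ```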

    As we near Magento Imagine, we are working on improving the developer experience on Stratus. We provide free dev instances that run the same CDN and stack as any production Stratus instance. Going forward, we want to include more tools, tests, and utilities to make a developer-friendly environment. The primary feature will be Live Preview: at the click of a button, customers can create exact copies of their production store, including the database. Developers can then make changes, commit them, run tests, and push to production. Preview sites will be storable, so you can save different versions of the site and refer to them as needed. After the initial release of Live Preview, we will add tools to perform Selenium and unit tests.

    Stratus is now the premier platform for Magento hosting. Nothing can scale and run your Magento store better. We've come a long way and we are grateful for our customers' patience. Now it's time to get back to business and stop worrying about your server.

  • Post Mortem - October 26, 2016

    EDIT: Please see our status page for more information and for future updates.

    Summary of Events

    On October 26, 2016, at around 2 PM ET we received an alert that a WAN uplink in one of the core routers was down. We determined the optic had failed and needed to be replaced, and a technician was dispatched to replace it. At 4:05 PM ET, upon inserting the optic in the Secondary router, the Primary router panicked and rebooted into read-only mode. Our tech and network operators immediately restarted the Primary. The reboot took 8 minutes, and by 4:15 PM ET the core routers had re-initialized and were back online. Our monitoring systems cleared up except for a few servers.

    We immediately noticed that our server for magemojo.com was experiencing about 50% packet loss on its internal network, and we continued to receive a few customer reports of problems. We scanned the internal network for all customer problems and isolated the packet loss to one /24 VLAN. We thought the problem was related to the abrupt failover of the core routers and decided it was best to switch back to the original Primary. At 6:40 PM ET we initiated the core router failover back to the previous Primary. The failover did not go smoothly; it resulted in another reboot and another 8-minute network disruption. The core routers came back with the old Secondary as Primary, and the problem remained.

    We reviewed the configuration to ensure there were no configuration errors on the new Primary but did not find any. We ran diagnostics on all network hardware and interfaces to identify the problem and found nothing. We ran diagnostics a second time to check for anything we might have missed; again, we found no problems. At this point we started going through the changelog, working backward to look for any change that could have caused the problem. We found a change from September that looked possibly related to the packet loss. We reverted it, and the packet loss stopped at 9:10 PM ET. The core has remained stable with the new Primary.

    THE NITTY GRITTY DETAILS

    Our core network uses 2 x Cisco 6500E Series switches, each with its own Supervisor 2T XL. Both Sup2Ts are combined using Cisco's Virtual Switching System (VSS) to create a single virtual switch in HA. Each rack has 2 x Cisco 3750X switches stacked for HA using StackWise, with 10G fiber uplinks to the core 6500s. Servers then have 2 x 1G uplinks in HA, one to each 3750X. Our network is fully HA all the way through, and at no point should a single networking device cause an outage. We have thoroughly tested our network HA and confirmed all failover scenarios worked perfectly. Why inserting an optic into a 6500E blade would cause the other switch, the Primary, to reboot is completely unexpected and unknown at this moment. Why a single switch rebooting would cause both to go down is unknown. Why the previous Primary failed to resume its role is also unknown. We have Cisco support researching these events with us.

    The packet loss problem was challenging to figure out. First, it was not clear what the common denominator was among the servers with packet loss; it happened on servers across all racks and all physical hardware. We thought for sure the problem was related to the abrupt switchover of the core. We focused on isolating the packet loss and narrowed it down to one particular VLAN. Why one VLAN would have packet loss, but not be completely down, was a big mystery. We ran diagnostics thoroughly and combed through every bit of hardware, routing tables, ARP tables, etc. We isolated the exact location where the problem was happening, but why remained unknown. Finally, we concluded the problem was not caused by the earlier network disturbance. That is when we focused on the change log.

    The change was related to an "ip redirects" statement in the config and how we use route-maps to keep internal traffic routing internally and external traffic routing through the F5 Viprion cluster. During tuning of Sup2T CPU performance, this line was changed for one particular VLAN. At the time it created no problems and packets routed correctly. However, after the core failover the interfaces changed, and subsequently the F5 Viprion cluster could not consistently route all packets coming from that internal VLAN back to the internal network interface from which they originated.

    WHERE WE WENT WRONG

    Honestly, we did a few things wrong. First, we should not have made any changes to the core network during the afternoon unless those changes were 100% mission critical. Second, we jumped right into fixing the problem and replying to customers but never publicly acknowledged that the problem had started. Third, our support desk was overloaded with calls and tickets. We tried very hard to respond to everyone, but it is just not possible to speak to hundreds of customers on the phone at once, and we were not able to communicate with everyone.

    HOW WE'RE DOING BETTER

    We're re-evaluating our prioritization of events to reclassify mission critical repairs. All work, even if we think there is zero chance of a problem, should be done at night and scheduled in advance. Customers will be notified, again, even if we don't expect any disruption of services.

    We also know that we need to post on Twitter, our status page, and enable a pre-recorded voice message that says "We are aware that we are currently experiencing a problem and we are working on the issue." as soon as we first identify a problem. We're working on setting up a new status page where we can post event updates. We're also working with our engineers to guide them on how to provide better status updates to us internally so that we can relay information to customers. Unfortunately, during troubleshooting, there is no ETA and any given ETA will be wildly inaccurate. But we understand at the very least an update of "no new updates at this time" is better than no update at all.

    Finally, we're working with Cisco engineers to find the cause of the reboot, upgrade IOS, and replace any hardware that might have caused the problem.

    THANK YOU

    Thank you for being a customer here at Mage Mojo. We want you to know that we work incredibly hard every day to provide you with the highest level of support, best performance, and lowest prices. Our entire team dedicates themselves to making your experience with Magento a good one. When something like this happens, we hope you'll understand how hard we work and how much we care about you. Everything we do is for you, the customer, and we appreciate your business. Trust us to learn from this experience and know that we'll grow stronger, providing an even better service for you in the future. Thank you again for being a customer.

    Update Nov 8, 2016

    We are scheduling a network maintenance window on Thursday, November 10th, 2016, from 12 AM ET to 1 AM ET. During this window we will replace a faulty line card in our core network and upgrade our router software. This line card was partially responsible for the outage that occurred on Oct 26th. The other problems were two bugs identified in IOS, CSCts44718 and CSCui91801.

    For this replacement, we are hoping to replace the card with only a minor network interruption of a few seconds, but due to the state of the card it may require a reboot of the network routers, which would mean approximately 15 minutes of downtime. After replacing the line card, we will perform an in-service software upgrade (ISSU), which should not cause any downtime. We do not anticipate a network disruption during the full window, but we are scheduling an hour in case we run into any unanticipated problems.

    Ref:
    https://tools.cisco.com/bugsearch/bug/CSCts44718/?reffering_site=dumpcr

    https://tools.cisco.com/bugsearch/bug/CSCui91801/?reffering_site=dumpcr

