The launch and improvement of Mojo Stratus have been a bumpy road. Stratus launched just before Meet Magento New York (MMNYC), where it was our major release and plastered on every wall. We wanted to do something innovative and different with Mojo Stratus.
Stratus is different. Rather than continuing down the path of traditional server offerings (you get a server, things are installed on it, and you have this big monolithic piece of hardware running whatever you need), we decided to use containers. We first looked at Docker and containers several years ago, when developing our panel, Mojo Host Manager. Back then containers were an unstable parlor trick. Great for production if you were ok with your production site constantly being on fire.
Support for containers is now widespread, especially since Google released its Kubernetes technology. We decided to use containers to build the services for Mojo Stratus: all the services your average Magento 2 store would need. Our initial release worked despite several issues. We tweaked, and tweaked, and got through the Thanksgiving sales season unscathed. Then in December major systemic issues began to appear with no obvious explanation.
First, we saw database problems. On Mojo Stratus, Amazon Aurora hosts all the databases. Aurora's main strength is scaling out read replicas. If you have ever tried to set up your own MySQL master-slave setups or other DIY clustering, then you know it is not very fun. Aurora makes this easy and we wanted to have read replicas for future scaling. It is still MySQL, though, and subject to the same problems you might expect from high usage and other bugs in MySQL. What we saw were patterns of locks in MySQL which would freeze all transactions on all stores for a few seconds, followed by a huge spike in active connections as all the traffic on Stratus backed up into Aurora. Site alarms would go off, the sky would fall, customers noticed, cats and dogs living together, mass hysteria. We needed deeper insight into Aurora and fast.
Getting insight into Aurora was not easy. We needed something pre-built. Basic stats, whether from AWS or pulled directly from the MySQL engine, are not useful here. That isn’t a fault in Aurora. For an application-specific problem you want to see what queries are happening during a failure event. After some trial and error, we came across Vivid Cortex (https://www.vividcortex.com/) and hooked it into Stratus. Vivid Cortex provides tons of information about what queries are running. Vivid Cortex helped us answer questions like:
What queries are running?
How often do certain queries run?
What databases run certain queries the most?
What queries are the most time consuming?
We can’t give Vivid Cortex enough love for watching database performance.
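To make those questions concrete, here is a minimal sketch of the kind of aggregation involved. It is not how Vivid Cortex works internally; it only assumes you have samples of (database, query, duration) and groups them by a normalized query "fingerprint". The function names and the sample data are hypothetical.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Collapse literals so queries that differ only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted strings -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare numbers -> ?
    return re.sub(r"\s+", " ", sql).strip().lower()


def summarize(samples):
    """samples: iterable of (database, sql, seconds).

    Returns (fingerprint, stats) pairs sorted by total time, slowest first.
    """
    totals = defaultdict(
        lambda: {"count": 0, "seconds": 0.0, "databases": defaultdict(int)}
    )
    for database, sql, seconds in samples:
        entry = totals[fingerprint(sql)]
        entry["count"] += 1                  # how often the query runs
        entry["seconds"] += seconds          # how much time it consumes
        entry["databases"][database] += 1    # which databases run it most
    return sorted(totals.items(), key=lambda item: item[1]["seconds"], reverse=True)


# Hypothetical samples, for illustration only.
samples = [
    ("store_a", "SELECT * FROM cron_schedule WHERE schedule_id = 17", 0.40),
    ("store_b", "SELECT * FROM cron_schedule WHERE schedule_id = 99", 0.60),
    ("store_a", "UPDATE catalog_product SET name = 'Shirt' WHERE id = 5", 0.25),
]
report = summarize(samples)
for query, stats in report:
    print(f"{stats['seconds']:.2f}s  x{stats['count']}  {query}")
```

Sorting by total time rather than by count is a deliberate choice here: a query that runs rarely but holds locks for a long time matters more during a freeze than a cheap query that runs constantly.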
After gathering a lot of data, we found a pattern. Stratus would lock up during certain types of queries. They would occur during certain actions on particular stores and lock everything up. On top of this, the Magento 2 crons were going haywire. Magento 2 has a bug (https://github.com/magento/magento2/issues/11002) where the cron_schedule table can inflate to infinity. Crons start running all over each other and destroying your server in the process. Certain extensions can have particularly heavy cron tasks within them, either due to necessity or inefficiencies in the code. And they all run at once.
With a bug causing locks, crons waging war against everything, and bad queries coming in, we had a recipe for poor performance, even with the massive resources of Amazon Aurora. We used multiple approaches to bring everything under control. First, we notified customers about problem extensions and site code. Second, we started limiting crons, ultimately creating our own extension to manage them. We've also pushed forward support for various services like Elasticsearch. Where possible, search in Magento should not run through the default MySQL engine. We are also working on a MySQL reader extension for Magento 2 to take full advantage of Aurora scaling.
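One common way to keep a table like cron_schedule from growing without bound is to prune finished rows on a schedule. The sketch below illustrates the idea only; it is not the fix we shipped or Magento's own cleanup logic. It uses an in-memory SQLite table as a stand-in for MySQL, a simplified subset of the cron_schedule columns, and a hypothetical helper name and retention window.

```python
import sqlite3
from datetime import datetime, timedelta


def prune_cron_schedule(conn, now, keep_minutes=60):
    """Delete finished rows older than the retention window; return rows removed."""
    cutoff = (now - timedelta(minutes=keep_minutes)).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.execute(
        "DELETE FROM cron_schedule "
        "WHERE status IN ('success', 'missed', 'error') AND scheduled_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table with a simplified subset of columns (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cron_schedule ("
    "schedule_id INTEGER PRIMARY KEY, job_code TEXT, status TEXT, scheduled_at TEXT)"
)
conn.executemany(
    "INSERT INTO cron_schedule (job_code, status, scheduled_at) VALUES (?, ?, ?)",
    [
        ("job_a", "success", "2018-03-01 09:00:00"),  # old and finished: pruned
        ("job_b", "missed", "2018-03-01 10:00:00"),   # old and missed: pruned
        ("job_c", "pending", "2018-03-01 08:00:00"),  # still pending: kept
        ("job_a", "success", "2018-03-01 11:30:00"),  # recent: kept
    ],
)

removed = prune_cron_schedule(conn, now=datetime(2018, 3, 1, 12, 0, 0))
remaining = conn.execute("SELECT COUNT(*) FROM cron_schedule").fetchone()[0]
print(removed, remaining)
```

Pending and running rows are left alone on purpose, so a cleanup pass never removes work that has not happened yet.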
Those solutions helped, but we still had issues on the file system side. The file issues left us baffled. We had no issues going through the busiest days of the season, Black Friday and Cyber Monday. From the start we have been using ObjectiveFS, an amazing filesystem that lets us store data on S3 and pull it locally. Files are cached to speed up performance, because running anything directly off S3 would be very slow, especially for Magento, where thousands of files may be opened and called on a single request.
ObjectiveFS would use a lot of CPU and spike iowait, affecting every customer. That issue started in December and had not been a problem in the preceding months of 2017. The iowait spikes became more frequent and severe, unrelated to specific traffic, and we had to do something. We shopped around for other file system solutions and came across Weka.io. Weka.io is a high performance file sharing solution that offers low latency and high throughput over the network. With most network file systems of this kind, you can't get the low latencies needed for an application like Magento.
Weka promised it all, with file system latencies in the microsecond range. Well-known shared storage technologies such as Ceph all have latencies in the millisecond range. Weka used its own kernel driver and relied on i3 instances and NVMe storage. Again, when loading 1000 plus files per page load, you need low latency. That’s why the switch to SSDs was so important early on for MageMojo. It looked like a drop-in replacement file system, so we fired it up and got it working.
The initial results were promising. Weka handled 300k requests per second for files over the network without issues, and writes were no problem. You could write from one place and see the file nearly instantly from another source. Many frontend and backend parts of Magento write to a file and display it via an ajax request (product image uploads, for example). On other systems an image upload would work, but the thumbnail would not appear because the write was not fast enough.
After more testing, we went ahead and moved everyone off ObjectiveFS to the Weka.io filesystem. We had a few issues with its configuration, and worked with their team to get everything set up correctly. For a while life was good. But Weka.io added significant latency to the load times, even with its microsecond response over the network: on average about a full second compared to the original ObjectiveFS system. The load time was a trade-off for what we believed to be stability.
In February we had a critical failure on the Weka.io cluster, a few weeks after completing our migration to their filesystem. The system was designed to be redundant so that 2 storage nodes can fail without data loss. In our case 3 nodes failed, putting the data in the ephemeral storage at risk of being unrecoverable. A bug in the Weka.io software caused the entire cluster to become unresponsive and, unfortunately, we were never given the full explanation by the Weka.io team.
We brought the stores online within 24 hours using an older copy of the data. In the following days we restored files as we could and helped bring back stores using their more recent data. We stabilized again on ObjectiveFS. We got back to business. ObjectiveFS was not as bad as we recalled, now that we had fixed some of the other, Aurora-related issues. And not long before the Weka failure, we learned about the Meltdown vulnerability.
This is the real kicker on top of it all. Once Meltdown became public knowledge, we learned that Amazon had secretly patched all their systems in mid-December. The Meltdown patches coincided with the random systemic issues, which we had thought were ObjectiveFS specific. It was not until we went back to ObjectiveFS that we realized there could be a connection. AWS Enterprise support also confirmed the patching timeline for us. They had been under embargo not to reveal the vulnerability.
In hindsight, that change severely impacted our file system performance. We now know the Meltdown patches can hurt the specific load Magento creates, especially stat calls, and Magento makes thousands of them per request. After Black Friday, multiple issues converged to create a suddenly unstable system. We failed to identify the cause correctly and tried to fix it with different technology. That was a major mistake on our part, and we paid that debt with a lot of sleepless nights.
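If you want a feel for how per-call stat overhead adds up, a micro-benchmark along these lines is one way to get a number you can compare before and after a kernel or file system change. This is only a sketch: it measures the local temp directory rather than ObjectiveFS, the file count and function name are made up for the example, and results will vary by machine.

```python
import os
import tempfile
import time


def time_stat_calls(file_count=1000, rounds=5):
    """Return the best observed average time, in seconds, of one os.stat call."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for i in range(file_count):
            path = os.path.join(root, f"file_{i}.php")
            with open(path, "w") as handle:
                handle.write("<?php\n")
            paths.append(path)

        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            for path in paths:
                os.stat(path)  # one stat system call per file
            best = min(best, time.perf_counter() - start)
    return best / file_count


per_call = time_stat_calls()
print(f"{per_call * 1e6:.2f} microseconds per stat call")
```

Multiply the per-call figure by the thousands of stat calls in a single Magento request and even a small increase per system call becomes visible in page load time.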
With the realization about Meltdown and a new look at ObjectiveFS, we resumed testing and making more tweaks. Performance was better but not the best we hoped for. More and more updates gave us incremental improvements. In the first iteration we used multiple ObjectiveFS mounts. They covered many stores on a given physical node, and those mounts existed on all workers in the Stratus cluster. As a store scaled out, the containers already had the files available. Requests would cache the files a container needed on the respective node over time. But with many stores sharing a mount, the cache sizes became very large relative to a store. With such a large cache, any given request needed to fetch a lot of specific files from a large haystack. Testing confirmed it was a major bottleneck.
For Stratus 2.5, the current generation, we moved to having a single ObjectiveFS file mount per store. Each store has its own file cache, on disk and in memory, local to the node running its containers. We launched Stratus 2.5 two weeks ago and it has solved every file system issue we’ve received complaints about, especially update slowness in Magento admin. Site performance is faster than ever: according to our New Relic data, every store is 30% faster now. Stores with heavy file operations on load show even more improvement.
We’ve also added a lesser known feature called Stratus Cache. Stratus Cache directly adds most of your code base into the container images we use for scaling. Stratus Cache bypasses the file system for a majority of the system calls and improves performance while making scaling for large sales a breeze. If you are planning a large promotion or traffic influx, please let us know and we will help get that working for you.
To contribute back to the community and improve Stratus, we’ve started making our own Magento 2 modules to address specific concerns we have about Magento 2 performance. Our first release was a complete re-work of the cron system in Magento 2, available on GitHub at https://github.com/magemojo/m2-ce-cron . By default the Magento 2 crons can take a server down in the right conditions, and they constantly fight each other and run the same task multiple times. That causes issues with stores and means vital cron tasks are missed. Our module eliminates that problem.
Next we have our split DB extension, viewable at https://github.com/magemojo/m2-ce-splitdb . Magento 1 CE allowed merchants to easily use a master-slave database setup with a dedicated reader. Stratus uses Aurora, which scales by seamlessly running multiple readers in a cluster. Since M2 CE does not support this at all out of the box, we had to build our own solution. We believe Community should be able to scale just as well as Enterprise.
As we near Magento Imagine, we are working on improving the dev experience on Stratus. We provide free dev instances which run on the same CDN and stack used by any production Stratus instance. Going forward, we want to include more tools, tests and utilities to make a developer-friendly environment. The primary feature will be Live Preview. At the click of a button, customers can create exact copies of their production store, including the database. Then developers can go in and make changes, commit them, run tests, and push to production. Preview sites will be storable so you can save different versions of the site and refer to them as needed. After the initial release of Live Preview, we will be adding tools to perform Selenium and unit tests.
Stratus is now the premier platform for Magento hosting. Nothing can scale and run your Magento store better. We've come a long way and we are grateful for our customers' patience. Now it's time to get back to business and stop worrying about your server.
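The general idea behind a reader/writer split can be sketched in a few lines. To be clear, this is not the code in m2-ce-splitdb, which is a Magento 2 module; it is a simplified illustration in Python, and the class name and endpoint names are hypothetical. Plain SELECTs rotate across readers, while writes, locking reads, and anything inside a transaction go to the writer.

```python
import itertools


class SplitRouter:
    """Pick a database endpoint for each statement (illustrative only)."""

    def __init__(self, writer, readers):
        self.writer = writer
        # Rotate through readers round-robin; fall back to the writer if none.
        self._readers = itertools.cycle(readers) if readers else None
        self.in_transaction = False

    def route(self, sql):
        statement = sql.lstrip().lower()
        if statement.startswith(("begin", "start transaction")):
            self.in_transaction = True
            return self.writer
        if statement.startswith(("commit", "rollback")):
            self.in_transaction = False
            return self.writer
        # SELECT ... FOR UPDATE takes locks, so it must go to the writer.
        is_read = statement.startswith("select") and "for update" not in statement
        if is_read and not self.in_transaction and self._readers is not None:
            return next(self._readers)
        return self.writer


# Hypothetical endpoint names, for illustration only.
router = SplitRouter("writer.example", ["reader-1.example", "reader-2.example"])
first = router.route("SELECT * FROM catalog_product_entity")
second = router.route("SELECT * FROM sales_order")
third = router.route("UPDATE sales_order SET state = 'complete' WHERE entity_id = 1")
print(first, second, third)
```

Keeping reads inside a transaction on the writer avoids the classic pitfall of a split setup: reading your own uncommitted or just-committed write from a replica that has not caught up yet.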