2019-03-25

New KB released, old MHM content removed, Stratus guides updated for UI v2.10, release notes cleaned up.

This release focused mainly on stability. We had seen random reboots on z1d.12xlarge instances. According to AWS, these were caused by their Nitro hypervisor and KVM under high load. We migrated all instances to the z1d.metal instance type and stability returned.

This was difficult to diagnose. Console output on AWS Nitro-based instances is limited to a 64k screenshot, whereas previously the full console log could be pulled. Netconsole cannot be used to send console logs remotely because the ENA driver on Nitro does not support polling. Lastly, kdump required help from AWS support to install correctly on Ubuntu on the Nitro instances.

The following changes were made:

  • Replaced Samba with NFS
  • Disabled unattended upgrades, which were causing different kernel versions with different ENA and ZFS driver versions
  • Increased the size of ST1-type EBS devices to 6 TB; they had been running out of burst credits when backups ran, causing ZFS lockups and subsequent crashes
  • Ensured that swap is enabled after reboots and on new servers
  • Installed kdump, which records crashes in /var/crash
  • Raised the NVMe I/O timeout (nvme_core.io_timeout) from the default 30s to the recommended highest setting
  • Ensured termination protection for instances and EBS devices
  • Enabled additional real-time logs in UI
  • Fixed an NGINX double save bug in UI
  • Fixed a bug in Kubernetes CoreDNS, which is used for internal DNS resolution
  • Added new sections in Stratus UI: Cloudfront, Sphinx, Elasticsearch, Varnish, Redis, Memcache, Nginx, MySQL, PHP, Logs, Cron, Magento
  • Added mssql, ssh2 for PHP, and sass gem for Ruby
  • Fixed an issue with Autoscaling and symlinks
  • Fixed a bug in Autoscaling exclude directories
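
As a rough illustration of why enlarging the ST1 volumes helps with burst credits: st1 volumes earn baseline throughput in proportion to their size, and credits are only spent while I/O runs above that baseline. The sketch below assumes the per-TiB figures AWS has published for st1 (40 MiB/s baseline and 250 MiB/s burst per TiB, capped at 500 MiB/s per volume); the function name is hypothetical and the figures should be verified against current AWS documentation.

```python
from typing import Tuple

# Assumed st1 figures; verify against current AWS documentation.
ST1_BASELINE_MIB_S_PER_TIB = 40
ST1_BURST_MIB_S_PER_TIB = 250
ST1_MAX_MIB_S = 500


def st1_throughput(size_tib: float) -> Tuple[float, float]:
    """Return (baseline, burst) throughput in MiB/s for an st1 volume."""
    baseline = min(size_tib * ST1_BASELINE_MIB_S_PER_TIB, ST1_MAX_MIB_S)
    burst = min(size_tib * ST1_BURST_MIB_S_PER_TIB, ST1_MAX_MIB_S)
    return baseline, burst


if __name__ == "__main__":
    # Compare a few volume sizes, including the new 6 TB size.
    for size in (1, 2, 6):
        baseline, burst = st1_throughput(size)
        print(f"{size} TiB: baseline {baseline:.0f} MiB/s, burst {burst:.0f} MiB/s")
```

Under these assumptions a volume of roughly 6 TiB sustains about 240 MiB/s without drawing on credits, so a backup that stays below that rate no longer drains the burst bucket.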

The following internal issues were also fixed:

  • STRAT-1065 Disable zfs scrub on node creation
  • STRAT-1085 Alert on hugepages not being disabled
  • STRAT-1085 Disable hugepages
  • STRAT-1088 STRAT-1224 STRAT-1257 Enable Swap, kubelet and kubeadm
  • STRAT-1146 Change tmp locations to customer zfs shares
  • STRAT-1221 l2arc cache files in unavailable state after node stop
  • STRAT-1224 Swap and cache settings not persisting after node stop
  • STRAT-1257 142f reboot 1/29/2019 Ref: New Kernel, linux-crashdump, zfs, kvm, ena, swap, smb
  • STRAT-1257 Install kdump and record crashes in /var/crash/
  • STRAT-1257 KVM double_fault bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1744199/comments/3
  • STRAT-1257 Update ENA driver to 2.0.2K https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/RELEASENOTES.md
  • STRAT-1257 Update Kernel to 4.15.0-1032-aws which will update ENA, ZFS, KVM drivers info below / New AMI builds, disable unattended upgrades related
  • STRAT-1257 Update ZFS driver to 0.7.5-1ubuntu16.4 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781364/comments/25
  • STRAT-1266 Remove Samba* confirmed to cause Kernel Panics ref: Marty Simmons Eric Hileman Aleks Loz
  • STRAT-1270 Larger root device / 1 TB to curb kicking nodes on disk pressure
  • STRAT-1270 ephemeral storage pod eviction
  • STRAT-1272 Broken /stratus/cache and /var/lib/docker mounts in AU
  • STRAT-1282 /tmp setting a symlink in each container to reroute /tmp to their /tmp
  • STRAT-1307 Install nfs-kernel packages in launch template and in salt
  • STRAT-1309 migration script with enhancements
  • STRAT-1310 automated snapshots of st1 volumes
  • STRAT-1313 Remove all 12x testing workers from dev cluster
  • STRAT-1324 Mojo plans instance types and templates
  • STRAT-1325 standardized launch template names
  • STRAT-1334 Remove unattended upgrades
  • STRAT-1334 Remove unattended-upgrades package from the template in All regions
  • STRAT-1335 Varnish cache is going to emptyDir and thus to /var/lib/kubelet
  • STRAT-1336 Orphaned dir in var/lib/kubelet
  • STRAT-1345 i-0df35473dcb77c50e down would not restart by itself
  • STRAT-1347 Dev-Flight for STRAT-1297 Migration to new nodes with limits and correct mounts
  • STRAT-1367 Migration script not working 100% re: Flex volume
  • STRAT-1367 STRAT-1309 STRAT-1297 Optimise migration script so that it does not lose files
  • STRAT-708 Disable hugepages on new worker creation
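
Several of the items above (swap enabled, hugepages disabled, kdump writing to /var/crash) are node settings that can be spot-checked after a reboot. The following is a minimal sketch of such a check, not the tooling used for this release. It assumes "hugepages" refers to transparent hugepages as exposed under /sys/kernel/mm/transparent_hugepage/enabled, and all function names are hypothetical.

```python
import re
from pathlib import Path


def swap_enabled(proc_swaps: str) -> bool:
    """/proc/swaps has one header line, then one line per active swap area."""
    lines = [line for line in proc_swaps.splitlines() if line.strip()]
    return len(lines) > 1


def thp_mode(thp_enabled: str) -> str:
    """Return the active mode from text such as 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", thp_enabled)
    return match.group(1) if match else "unknown"


def read_text(path: str) -> str:
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def crash_entries(crash_dir: str = "/var/crash") -> list:
    """List whatever kdump has left in the crash directory."""
    try:
        return sorted(entry.name for entry in Path(crash_dir).iterdir())
    except OSError:
        # Directory missing or unreadable: report nothing rather than fail.
        return []


if __name__ == "__main__":
    print("swap enabled:", swap_enabled(read_text("/proc/swaps")))
    # Assumption: the hugepages items refer to transparent hugepages.
    thp = read_text("/sys/kernel/mm/transparent_hugepage/enabled")
    print("transparent hugepages mode:", thp_mode(thp))
    print("crash entries:", crash_entries())
```

The parsing functions take file contents as strings so they can be exercised without a live node; only the small main block touches /proc, /sys and /var/crash.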