Current Mapping Platform and Network Status: Online
Incident Report - Service disruption event June 12th, 13th, and 14th.
Our Data Center (DC) team, working closely with Dell engineers, determined that the platform degradation was caused by a bug in the firmware on our Dell hardware. This bug caused the SAN (Storage Area Network) to perform a hard shutdown, which corrupted the database. Dell’s RAID architecture is expected to provide a failsafe against database corruption: when one disk is corrupted, another disk provides access to clean data. Unfortunately, the RAID system was subject to the same firmware bug and did not work properly.
We also maintain online replication to a second data center for DR (Disaster Recovery) purposes, which covers us if one data center goes down. However, online replication is not designed to protect against database corruption, because a corrupted database is copied to the other data center along with everything else. With the RAID protection unavailable because of the firmware bug, the only way to bring the database back online was a manual restore from backup. The restore was successful, but it was time consuming because of the failure on the primary SAN.
We are now aware of this bug in Dell’s firmware, and Dell has provided us with a way to avoid triggering it so that a similar situation does not arise in the future.
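To illustrate why replication alone could not protect against the corruption described above, here is a minimal Python sketch. It is a toy model, not our actual replication or backup tooling: the `Site` class, the `replicate`, `take_backup`, and `restore` functions, and the sample rows are all hypothetical names invented for this example.

```python
import copy


class Site:
    """Hypothetical stand-in for the database at one data center."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


def replicate(primary, secondary):
    # Online replication mirrors the primary's current state,
    # whether that state is clean or corrupted.
    secondary.rows = copy.deepcopy(primary.rows)


def take_backup(site):
    # A backup is a point-in-time copy that later writes cannot alter.
    return copy.deepcopy(site.rows)


def restore(site, backup):
    # Restoring replaces the site's current state with the backup copy.
    site.rows = copy.deepcopy(backup)


primary = Site("primary")
secondary = Site("secondary")
primary.rows = {1: "position A", 2: "position B"}

replicate(primary, secondary)
backup = take_backup(primary)

# Simulate corruption on the primary (illustrative only).
primary.rows[2] = None
replicate(primary, secondary)

# The corruption has now reached the secondary site as well.
corrupted_on_secondary = secondary.rows[2] is None

# Only the point-in-time backup still holds clean data.
restore(primary, backup)
replicate(primary, secondary)
```

In this model, replication keeps the secondary site available if the primary goes down, but it faithfully copies bad data too. Recovering clean data requires a copy taken before the corruption occurred, which is why a restore from backup was needed.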
Notably, this is an extremely rare hardware-level event. Dell’s SANs are usually quite reliable and are used by a large number of Fortune 500 companies. We are confident that this will not occur again. We apologize for this service interruption and any disruption it has caused to your business. Our technical team strives to provide a world-class service, and we take any service interruption seriously.
- At approximately 8:15 AM EDT on June 12, 2020, DC engineers responded to alerts generated by our internal monitoring systems indicating platform performance problems. Given the unusual behavior the platform was exhibiting, staff from the network engineering, development, and database administration teams joined the troubleshooting effort within minutes of problem detection.
- By approximately 10:15 AM, the team determined that the source of the platform degradation was the primary SAN, where the Position Logic platform databases are stored.
- After verifying the source of the problem, system engineers followed standard troubleshooting work instructions and initiated a failover to the secondary (backup) controllers on the SAN. DC engineers also contacted the SAN vendor and Dell for additional support. The controllers of the first four nodes in the SAN failed over successfully, but the failover on the fifth node failed and the SAN experienced an unclean shutdown.
- Dell would later confirm that a known issue in the firmware running on the SAN could cause the controllers to shut down unexpectedly or hard reset if the management port of the controller was not enabled, which it was not at the time the failover was attempted. After an attempt to bring the SAN back online following the shutdown, Dell engineers ran routine diagnostics on the system, which indicated lost blocks on several volumes.
- Once Dell engineers determined that the storage systems had suffered a hard shutdown and lost blocks, recovery of the SAN became much more challenging: multiple patches had to be applied to the controllers, one by one, until the SAN reached a state where a repair attempt could be made. As this SAN recovery process began, a disaster was declared and DC engineers started recovery procedures in the backup data center.
- The scattered database corruption present on the primary storage systems had also replicated to the secondary Dell SAN in the backup data center. Engineers attempted to repair the corrupted data, but eventually we were forced to initiate our second data recovery contingency, which is to restore any corrupted databases from the latest backup. Given the complexity and scale of an all-client impact event, these database restorations took longer than anticipated.
- By 7:11 PM on June 12, the primary SAN was back online and many client platform services were restored.
- Restoration of the remaining client databases with known corruption continued throughout the weekend and was completed by 3:00 PM on Sunday, June 14.
- Our team and our DC team continually monitor the stability of our system. In addition, further maintenance checks will be carried out over the next few days and weeks to make sure we maintain proper performance and stability of the platform service.
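
The firmware condition described in the timeline (a controller failover attempted while the management port is not enabled) lends itself to a simple pre-failover check. The following Python sketch shows one way such a guard could look. It is an illustration only and is not necessarily the mitigation Dell provided: the `Controller` class, the `failover_blockers` and `safe_to_fail_over` functions, and the sample values are hypothetical and do not describe our actual SAN configuration or any Dell API.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """Hypothetical record of one SAN node's controller state."""

    node: int
    management_port_enabled: bool


def failover_blockers(controllers):
    """Return the node numbers whose management port is disabled."""
    return [c.node for c in controllers if not c.management_port_enabled]


def safe_to_fail_over(controllers):
    """Allow a failover only when no controller would hit the known condition."""
    return not failover_blockers(controllers)


# Illustrative data only: four controllers with the port enabled, one without.
controllers = [Controller(node=n, management_port_enabled=True) for n in range(1, 5)]
controllers.append(Controller(node=5, management_port_enabled=False))

blockers = failover_blockers(controllers)
if not safe_to_fail_over(controllers):
    print(f"Do not fail over: management port disabled on node(s) {blockers}")
```

A check like this would run before any controller failover is initiated, so that the failover is held until every management port has been enabled.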