Storage latency
Incident Report for IdealCloud
Postmortem

Summary Problem Description

  • A storage controller failover caused increased storage latency, impacting services and storage IO on the affected array.

Immediate Remediation

  • We engaged our storage vendor, who escalated the issue. During this time latency began to return to normal; however, it did not remain stable.

Resolution

  • We migrated busy and high-priority workloads to a new array better able to handle the traffic, reducing the load on the affected array.

Analysis Findings/Review

  • The array was becoming overtaxed with both disk IO and CPU contention from the VM workloads. This triggered a bug that restarted the data service when the controller was under heavy use. The restart caused already heavy IO to back up and compound into a situation where the array could no longer serve IO fast enough for certain VM workloads. The manufacturer adjusted some CPU settings and latency began to return to normal; however, the bug was triggered again and the controller failed over a second time, exacerbating the issue.
  • The manufacturer's only remaining recommendation at that point was to continue moving workloads off the array to reduce IO contention. We expedited those moves, although this drew the latency issues out longer than expected: the migrations caused additional capacity churn, which prompted some of the array's garbage collection services to consume more CPU than usual. As workloads continued to move off and the capacity cleanup completed, the array returned to a normal level of latency and workload performance was restored.

Recommendations

  • Our manufacturer has released a fix for the bug, which we will be applying to the array.
  • Internally, we will monitor all arrays on CPU and IO metrics and trigger resource upgrades or workload migrations as needed, so that balanced IO and CPU headroom is available before this issue can recur; an illustrative sketch of such a threshold check follows.
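
As an illustration of the kind of check we intend to run, the sketch below flags any array whose controller CPU or IO latency exceeds a threshold. The metric names, threshold values, and sample arrays are hypothetical placeholders, not our actual monitoring configuration or tooling.

    # Hypothetical per-array resource check. Metric names, thresholds, and the
    # sample data are illustrative only and do not reflect a specific product
    # or the exact values that will be used in production.
    from dataclasses import dataclass

    # Assumed thresholds: sustained controller CPU above 70% or average IO
    # latency above 5 ms flags an array for workload migration or an upgrade.
    CPU_THRESHOLD_PCT = 70.0
    IO_LATENCY_THRESHOLD_MS = 5.0

    @dataclass
    class ArrayMetrics:
        name: str
        controller_cpu_pct: float   # sustained controller CPU utilization, percent
        avg_io_latency_ms: float    # average front-end IO latency, milliseconds

    def needs_action(m: ArrayMetrics) -> bool:
        """Return True if the array is running hot on CPU or IO latency."""
        return (m.controller_cpu_pct > CPU_THRESHOLD_PCT
                or m.avg_io_latency_ms > IO_LATENCY_THRESHOLD_MS)

    def review(arrays: list[ArrayMetrics]) -> list[str]:
        """List arrays that should be considered for migration or upgrade."""
        return [m.name for m in arrays if needs_action(m)]

    if __name__ == "__main__":
        # Example values only; real metrics would come from the arrays' APIs.
        sample = [
            ArrayMetrics("PIT01-ARRAY-A", controller_cpu_pct=82.0, avg_io_latency_ms=9.5),
            ArrayMetrics("PIT01-ARRAY-B", controller_cpu_pct=35.0, avg_io_latency_ms=1.2),
        ]
        for name in review(sample):
            print(f"{name}: over CPU/IO thresholds, evaluate workload migration or upgrade")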
Posted Jul 21, 2022 - 12:27 EDT

Resolved
This issue has been resolved. The storage vendor has confirmed that the array is in a safe, healthy, and normal operational state. An RCA will be available upon request once it's completed.
Posted Jul 07, 2022 - 12:00 EDT
Update
The array continues to remain stable at this time. We are continuing to closely monitor and assess with the vendor.
Posted Jul 04, 2022 - 14:18 EDT
Update
The block reclamation revalidation has completed successfully. The array continues to remain stable at this time. We are continuing to closely monitor and assess with the vendor.
Posted Jul 03, 2022 - 09:16 EDT
Update
The array has continued to remain stable with minimal latency while the background process continues to run. We will continue to monitor along with our vendor at this time.
Posted Jul 02, 2022 - 10:28 EDT
Update
Storage latency has remained at expected levels throughout the day today. At the request of our vendor, we are starting a process to assist with block reclamation. This change is non-service-affecting, is not expected to create additional latency, and is planned to run only for a limited period of time. The process will be stopped if any unexpected issues occur.
Posted Jul 01, 2022 - 22:09 EDT
Update
The lower latency we saw this morning has continued throughout the day today and the system has remained stable. We are continuing to monitor and adjust based on recommendations from the storage vendor's engineering team.
Posted Jun 30, 2022 - 19:16 EDT
Update
We are continuing to monitor and adjust based on recommendations from the storage vendor's engineering team. Further work overnight and into this morning has led to a significant further reduction in latency on the array and the system remains stable at this time.
Posted Jun 30, 2022 - 12:21 EDT
Update
At the request of the storage vendor we are postponing this planned failover event. We will update prior to a future scheduled failover.
Posted Jun 29, 2022 - 17:22 EDT
Update
Overall latency has been stable, but still elevated. We are continuing to move workloads off of the affected array. We are forcing a failover of the affected SAN's controllers at 5:30 PM EDT today. Customers will experience high disk latency as the controllers fail over and IO is transferred to the standby controller.
Posted Jun 29, 2022 - 16:48 EDT
Update
Latency has improved, but is still elevated. We are continuing to work with the storage vendor while moving workloads off of the affected array.
Posted Jun 29, 2022 - 07:47 EDT
Update
We continue to work with the storage vendor to address the latency issue. Latency has been reduced further overnight but is still elevated. While we continue to work with the storage vendor, we are still moving workloads off of the affected array.
Posted Jun 29, 2022 - 05:44 EDT
Update
We continue to work with the storage vendor to address the latency issue. Latency has been reduced, but is still elevated. While we continue to work with the storage vendor, we are still moving workloads off of the affected array.
Posted Jun 28, 2022 - 23:27 EDT
Update
The original bug that this array encountered when the latency issue started has been triggered again. We are continuing to work with the storage vendor's engineering team to resolve the issue. While that work continues, we are also moving additional workloads off the array to try to stabilize the array's performance.
Posted Jun 28, 2022 - 18:58 EDT
Investigating
SAN latency had been stable throughout the day. However, at approximately 4:30PM the SAN experienced an unexpected service restart. This has increased latency while the services catch back up. We are working with the vendor to determine the cause of the latest service restart.
Posted Jun 28, 2022 - 16:48 EDT
Update
Latency has remained stable since the last changes. We are continuing to monitor the status while we work to move workloads off of this array.
Posted Jun 28, 2022 - 08:43 EDT
Monitoring
The recent changes have had a positive impact on the array and performance has improved. We are continuing to monitor and review with the vendor.
Posted Jun 28, 2022 - 04:58 EDT
Update
We continue to work with the storage vendor's engineering department. An additional change has been recommended, which we are applying at 3:30 AM. This requires a restart of the data service. Customers may experience slightly increased latency as the services restart.
Posted Jun 28, 2022 - 03:14 EDT
Update
The changes have improved latency, and it continues to improve over time. We are also taking additional measures to reduce overall IO on the array. We will provide additional updates as conditions change.
Posted Jun 27, 2022 - 18:35 EDT
Update
The storage vendor has identified a setting that can help reduce the latency. We will be implementing the change at 5PM EDT. Customers may see slightly higher latency as services are restarted to apply the change.
Posted Jun 27, 2022 - 16:53 EDT
Update
We are continuing to work with the storage vendor to resolve the issue. It has been escalated to their engineering team to continue working on a resolution. We have implemented some workarounds to try to reduce IO on the array until the root cause can be resolved. We are reviewing potential additional changes to reduce the latency even further. Additional updates will be provided as they become available.
Posted Jun 27, 2022 - 16:12 EDT
Update
Latency is still elevated and sporadic. We are continuing to work through the issue. More information will be provided as it becomes available.
Posted Jun 27, 2022 - 14:31 EDT
Investigating
We have continued to monitor the status and review the cause with our storage vendor. Disk latency has started to slowly rise over the last 30 minutes. We are reviewing additional steps to get the latency back to appropriate levels.
Posted Jun 27, 2022 - 13:58 EDT
Monitoring
The failover completed successfully and we have seen latency remain stable. We are continuing to monitor the situation. We will continue to work with the vendor to investigate the root cause and actions to prevent this issue in the future.
Posted Jun 27, 2022 - 12:53 EDT
Investigating
Per the vendor's recommendation, we will be initiating a failover of the controllers at 12:15 PM EDT today to try to alleviate the continued latency issues. During the failover, latency may increase briefly as IO is moved to the redundant controller. An update will be sent out when that is completed.
Posted Jun 27, 2022 - 12:05 EDT
Update
We continue to see fluctuations in latency and are continuing to troubleshoot with the vendor. More information will be provided as it becomes available.
Posted Jun 27, 2022 - 11:37 EDT
Update
We started to see an increase in storage latency again. We are continuing to investigate the issue with the storage vendor.
Posted Jun 27, 2022 - 10:58 EDT
Monitoring
Disk latency is returning to normal. We are still investigating the root cause and monitoring the situation. Additional updates will be provided as available.
Posted Jun 27, 2022 - 10:45 EDT
Update
The active SAN controller experienced a software failure that caused it to crash and fail over. The failover should not cause high latency; we are reviewing the issue with the vendor. The controllers are in a fully redundant state. Additional details will be provided as available.
Posted Jun 27, 2022 - 10:18 EDT
Investigating
We are currently investigating a storage latency issue with one of our arrays. Customers on the affected array may see slow response times for storage. We will provide more information as we have it.
Posted Jun 27, 2022 - 09:43 EDT
This incident affected: PIT01 (Pittsburgh, PA) (Storage Infrastructure).