Storage latency
Incident Report for IdealCloud
Postmortem

Summary Problem Description

  • A storage controller failover caused increased storage latency, impacting services and storage IO on the affected array.

Immediate Remediation

  • We engaged our storage vendor, who escalated the issue. During this time latency began to return to normal; however, it did not remain stable.

Resolution

  • We migrated busy and high-priority workloads to a new array better able to handle the traffic, reducing the load on the affected array.

Analysis Findings/Review

  • The array was becoming overtaxed with both disk IO and CPU contention from the VM workloads. This triggered a bug that restarted the data service when the controller was under heavy use. The restart caused already heavy IO to back up and compound into a situation where the array could no longer serve IO fast enough for certain VM workloads. The manufacturer adjusted some CPU settings and latency began to return to normal; however, the bug was triggered again and the controller failed over a second time, exacerbating the issue.
  • The manufacturer's only remaining recommendation at that point was to continue moving workloads off the array to reduce IO contention. We expedited those moves, although this drew the latency issues out longer than expected: the migrations caused additional capacity churn, which prompted some of the array's garbage collection services to consume more CPU than usual. As workloads continued to move off and the capacity cleanup completed, the array returned to a normal level of latency and workload performance was restored.

Recommendations

  • Our manufacturer has released a fix for the bug, which we will be applying to the array.
  • Internally, we will monitor all arrays on CPU and IO metrics and trigger resource upgrades or workload migrations as needed, so that balanced IO and CPU headroom is available before this issue can recur; an illustrative sketch of such a threshold check follows.
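
As an illustration of the kind of check we intend to run, the sketch below flags any array whose controller CPU or IO latency exceeds a threshold. The metric names, threshold values, and sample arrays are hypothetical placeholders, not our actual monitoring configuration or tooling.

    # Hypothetical per-array resource check. Metric names, thresholds, and the
    # sample data are illustrative only and do not reflect a specific product
    # or the exact values that will be used in production.
    from dataclasses import dataclass

    # Assumed thresholds: sustained controller CPU above 70% or average IO
    # latency above 5 ms flags an array for workload migration or an upgrade.
    CPU_THRESHOLD_PCT = 70.0
    IO_LATENCY_THRESHOLD_MS = 5.0

    @dataclass
    class ArrayMetrics:
        name: str
        controller_cpu_pct: float   # sustained controller CPU utilization, percent
        avg_io_latency_ms: float    # average front-end IO latency, milliseconds

    def needs_action(m: ArrayMetrics) -> bool:
        """Return True if the array is running hot on CPU or IO latency."""
        return (m.controller_cpu_pct > CPU_THRESHOLD_PCT
                or m.avg_io_latency_ms > IO_LATENCY_THRESHOLD_MS)

    def review(arrays: list[ArrayMetrics]) -> list[str]:
        """List arrays that should be considered for migration or upgrade."""
        return [m.name for m in arrays if needs_action(m)]

    if __name__ == "__main__":
        # Example values only; real metrics would come from the arrays' APIs.
        sample = [
            ArrayMetrics("PIT01-ARRAY-A", controller_cpu_pct=82.0, avg_io_latency_ms=9.5),
            ArrayMetrics("PIT01-ARRAY-B", controller_cpu_pct=35.0, avg_io_latency_ms=1.2),
        ]
        for name in review(sample):
            print(f"{name}: over CPU/IO thresholds, evaluate workload migration or upgrade")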
Posted Jul 21, 2022 - 12:27 EDT

Resolved
This issue has been resolved. The storage vendor has confirmed that the array is in a safe, healthy, and normal operational state. An RCA will be available upon request once it's completed.
Posted Jul 07, 2022 - 12:00 EDT
Update
The array continues to remain stable at this time. We are continuing to closely monitor and assess with the vendor.
Posted Jul 04, 2022 - 14:18 EDT
Update
The block reclamation revalidation has completed successfully. The array continues to remain stable at this time. We are continuing to closely monitor and assess with the vendor.
Posted Jul 03, 2022 - 09:16 EDT
Update
The array has continued to remain stable with minimal latency while the background process continues to run. We will continue to monitor along with our vendor at this time.
Posted Jul 02, 2022 - 10:28 EDT
Update
Storage latency has remained at expected levels throughout the day today. At the request of our vendor, we are starting a process to assist with block reclamation. This change is non-service-affecting, is not expected to create additional latency, and is planned to run only for a limited period of time. The process will be stopped if any unexpected issues occur.
Posted Jul 01, 2022 - 22:09 EDT
Update
The lower latency we saw this morning has continued throughout the day today and the system has remained stable. We are continuing to monitor and adjust based on recommendations from the storage vendor's engineering team.
Posted Jun 30, 2022 - 19:16 EDT
Update
We are continuing to monitor and adjust based on recommendations from the storage vendor's engineering team. Further work overnight and into this morning has led to a significant further reduction in latency on the array and the system remains stable at this time.
Posted Jun 30, 2022 - 12:21 EDT
Update
At the request of the storage vendor we are postponing this planned failover event. We will update prior to a future scheduled failover.
Posted Jun 29, 2022 - 17:22 EDT
Update
Overall latency has been stable, but still elevated. We are continuing to move workloads off of the affected array. We are forcing a failover of the affected SAN's controllers at 5:30 PM EDT today. Customers will experience high disk latency as the controllers fail over and IO is transferred to the standby controller.
Posted Jun 29, 2022 - 16:48 EDT
Update
Latency has improved, but is still elevated. We are continuing to work with the storage vendor while moving workloads off of the affected array.
Posted Jun 29, 2022 - 07:47 EDT
Update
We continue to work with the storage vendor to address the latency issue. Latency has been reduced further overnight but is still elevated. While we continue to work with the storage vendor, we are still moving workloads off of the affected array.
Posted Jun 29, 2022 - 05:44 EDT
Update
We continue to work with the storage vendor to address the latency issue. Latency has been reduced, but is still elevated. While we continue to work with the storage vendor, we are still moving workloads off of the affected array.
Posted Jun 28, 2022 - 23:27 EDT
Update
The original bug that this array encountered when the latency issue started has been triggered again. We are continuing to work with the storage vendor's engineering team to resolve the issue. While that work continues, we are also moving additional workloads off the array to try to stabilize the array's performance.
Posted Jun 28, 2022 - 18:58 EDT
Investigating
SAN latency had been stable throughout the day. However, at approximately 4:30PM the SAN experienced an unexpected service restart. This has increased latency while the services catch back up. We are working with the vendor to determine the cause of the latest service restart.
Posted Jun 28, 2022 - 16:48 EDT
Update
Latency has remained stable since the last changes. We are continuing to monitor the status while we work to move workloads off of this array.
Posted Jun 28, 2022 - 08:43 EDT
Monitoring
The recent changes have had a positive impact on the array and performance has improved. We are continuing to monitor and review with the vendor.
Posted Jun 28, 2022 - 04:58 EDT
Update
We continue to work with the storage vendor's engineering department. An additional change has been recommended, which we are applying at 3:30 AM. This requires a restart of the data service. Customers may experience slightly increased latency as the services restart.
Posted Jun 28, 2022 - 03:14 EDT
Update
The changes have improved latency, and it continues to improve over time. We are also taking additional measures to reduce overall IO on the array. We will provide additional updates as conditions change.
Posted Jun 27, 2022 - 18:35 EDT
Update
The storage vendor has identified a setting that can help reduce the latency. We will be implementing the change at 5PM EDT. Customers may see slightly higher latency as services are restarted to apply the change.
Posted Jun 27, 2022 - 16:53 EDT
Update
We are continuing to work with the storage vendor to resolve the issue. It has been escalated to their engineering team to continue working on a resolution. We have implemented some workarounds to try to reduce IO on the array until the root cause can be resolved. We are reviewing potential additional changes to reduce the latency even further. Additional updates will be provided as they become available.
Posted Jun 27, 2022 - 16:12 EDT
Update
Latency is still elevated and sporadic. We are continuing to work through the issue. More information will be provided as it becomes available.
Posted Jun 27, 2022 - 14:31 EDT
Investigating
We have continued to monitor the status and review the cause with our storage vendor. Disk latency has started to slowly rise over the last 30 minutes. We are reviewing additional steps to get the latency back to appropriate levels.
Posted Jun 27, 2022 - 13:58 EDT
Monitoring
The failover completed successfully and we have seen latency remain stable. We are continuing to monitor the situation. We will continue to work with the vendor to investigate the root cause and actions to prevent this issue in the future.
Posted Jun 27, 2022 - 12:53 EDT
Investigating
Per the vendor's recommendation, we will be initiating a failover of the controllers at 12:15 PM EDT today to try to alleviate the continued latency issues. During the failover, latency may increase briefly as IO is moved to the redundant controller. An update will be sent out when that is completed.
Posted Jun 27, 2022 - 12:05 EDT
Update
We continue to see fluctuations in latency and are continuing to troubleshoot with the vendor. More information will be provided as it becomes available.
Posted Jun 27, 2022 - 11:37 EDT
Update
We started to see an increase in storage latency again. We are continuing to investigate the issue with the storage vendor.
Posted Jun 27, 2022 - 10:58 EDT
Monitoring
Disk latency is returning to normal. We are still investigating the root cause and monitoring the situation. Additional updates will be provided as available.
Posted Jun 27, 2022 - 10:45 EDT
Update
The active SAN controller experienced a software failure that caused it to crash and fail over. The failover should not cause high latency; we are reviewing the issue with the vendor. The controllers are in a fully redundant state. Additional details will be provided as available.
Posted Jun 27, 2022 - 10:18 EDT
Investigating
We are currently investigating a storage latency issue with one of our arrays. Customers on the affected array may see slow response times for storage. We will provide more information as we have it.
Posted Jun 27, 2022 - 09:43 EDT
This incident affected: PIT01 (Pittsburgh, PA) (Storage Infrastructure).