RMS Maintenance Wrap-Up: Fixes and Next Steps
The Remote Management System (RMS) has been fully restored to pre-maintenance levels. However, devices running significantly older firmware versions may still experience connectivity issues. Please report any such cases to us, and we will assist you immediately!
Summary of the Incident
Unexpected complications during planned maintenance led to extended downtime. Server updates caused connection issues with devices, and despite our efforts, these could not be resolved within a reasonable timeframe. The system was rolled back, but connection problems persisted, resulting in approximately 24 hours of downtime. Devices then gradually reconnected over the following 72 hours.
Root Cause
The primary issue stemmed from devices running firmware versions older than 07.07, which did not handle disconnection gracefully following the RMS server restart. These devices ignored standard reconnection protocols, instead initiating rapid and continuous reconnection attempts.
Due to AWS load balancer limitations, even successfully connected devices were subsequently disconnected, creating a cycle of disconnections and reconnections involving hundreds of thousands of devices. This flood of requests rendered the system temporarily inoperable. Over the weekend, we increased the number of active load balancers, redirected traffic, and made backend adjustments to gradually restore service.
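To illustrate the difference between rapid, continuous reconnection attempts and a more graceful approach, here is a minimal Python sketch of exponential backoff with full jitter, a common technique for spreading reconnection attempts out over time. It is not the actual RMS or device firmware implementation; the function name and the base, cap, and attempts values are assumptions chosen purely for illustration.

```python
import random


def reconnect_delays(base=1.0, cap=300.0, attempts=8, rng=None):
    """Yield wait times (in seconds) before each reconnection attempt.

    Illustrative only: base, cap and attempts are assumed values,
    not settings taken from RMS or any device firmware.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The upper bound doubles with every failed attempt, up to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay in [0, ceiling] so that many
        # devices disconnected at the same moment do not retry in lockstep.
        yield rng.uniform(0, ceiling)


if __name__ == "__main__":
    for number, delay in enumerate(reconnect_delays(), start=1):
        print(f"attempt {number}: wait {delay:.1f} s before reconnecting")
```

With a scheme like this, each failed attempt raises the maximum wait, and the random component keeps a large fleet of devices from hitting the server at the same instant after a restart.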
Actions Taken and Next Steps
This incident has been a significant learning experience for our team and has underscored areas for improvement. While we are still finalizing our long-term action plan, here are several steps we are committed to taking:
Enhanced Staging Testing
Despite months of preparation in our staging environment, this issue took us by surprise. We will thoroughly review and update our staging procedures. Specifically, we will increase the number of demo staging devices to better simulate high-load scenarios on our system.
Improved Load Management
To prevent similar incidents, we will enhance our on-demand load balancing capabilities and implement a DNS/Proxy/port-redirection subsystem to manage potentially large traffic streams more effectively.
RMS Status Page
To improve communication during future maintenance or downtime, we will establish an RMS status page to keep our customers informed in real time.
Need help?
If you need assistance in the meantime, our support team is ready to help. Reach out via our Helpdesk or Community Forum, and we’ll answer all of your questions.
Thank you for your understanding and patience as we work to improve the resilience of our systems and prevent future disruptions!