Investigating connectivity issues.
Resolved
Aug 01 at 12:45pm EDT
Please find the post-mortem details for the incident that occurred on July 18th below:
What happened?
The Wattch Edge Controller (WEC) runs an operating system (OS) called Ubuntu Core. While Wattch maintains an application on the WEC (the “firmware”) that implements all Wattch user-facing functionality, the underlying Ubuntu Core OS is maintained by our partner Canonical. Canonical is a global leader in enterprise computing and cybersecurity; their engineering efforts help ensure the operating system remains secure by shipping periodic patches and constantly evaluating threat vectors.
On July 18th, Canonical deployed a security update to the Ubuntu Core OS. As WECs picked up the software update, they began to fall offline. We were able to identify that only the first ~100 devices sold by Wattch were affected. These devices had an earlier version of the underlying hardware (Broadcom BCM2710A1 vs. BCM2711 CPU core), and we concluded that this earlier hardware version was incompatible with the updated OS.
Wattch’s engineering team worked from Thursday through Sunday with Canonical’s 24/7 incident response team to identify the root cause of the issue. Canonical froze and eventually globally rolled back their deployment of the security patch during this investigation process.
Once the root cause was identified, Wattch created a recovery process for the two subtypes of hardware devices affected and worked with customers to deploy the fixes across any affected sites.
What are we doing to prevent this in the future?
Wattch implements a multi-month, three-tier approach to rolling out changes to our application payload (the “firmware”):
- Extensive in-lab bench testing on known-good hardware systems
- “Canary” hardware owned by Wattch installed at real PV/BESS sites
- Gradual rollout to customer sites based on geography and asset-type
Historically, security and reliability patches to Canonical’s Ubuntu Core OS have bypassed Wattch’s rollout procedure, instead relying on Canonical’s testing process.
Moving forward, Wattch will fold all patches issued by Canonical into our standard testing procedure. This will allow us to maintain control of the rollout of any new software and catch any breaking bugs before they reach customer devices.
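As a rough illustration of what gating an OS patch through the same three tiers could look like, here is a minimal sketch. It is not Wattch's actual tooling: the tier names, the `Patch` class, and the idea of tracking results per hardware variant are all assumptions made for this example, the latter motivated by the fact that this incident only affected one CPU variant.

```python
from dataclasses import dataclass, field

# Hypothetical tier names mirroring the three-tier process above; order matters.
TIERS = ["bench", "canary", "customer"]


@dataclass
class Patch:
    name: str
    # For each tier, the set of hardware variants on which the patch has passed.
    passed: dict = field(default_factory=lambda: {t: set() for t in TIERS})


def record_pass(patch: Patch, tier: str, hardware: str) -> None:
    """Record that the patch passed the given tier on the given hardware."""
    patch.passed[tier].add(hardware)


def may_deploy(patch: Patch, tier: str, hardware: str) -> bool:
    """Allow entry to a tier only if this hardware passed every earlier tier."""
    earlier = TIERS[: TIERS.index(tier)]
    return all(hardware in patch.passed[t] for t in earlier)


# Illustrative usage: the patch was bench-tested on one CPU variant only.
patch = Patch("os-security-update")
record_pass(patch, "bench", "BCM2711")
canary_ok_new_hw = may_deploy(patch, "canary", "BCM2711")
canary_ok_old_hw = may_deploy(patch, "canary", "BCM2710A1")
```

In this sketch, the patch may proceed to canary sites on the hardware it was bench-tested on, but not on the untested variant, which is the kind of gap the folded-in procedure is meant to close.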
What was the impact?
- Site monitoring was unavailable for affected sites during the incident. All affected sites required a site visit to recover communications and data flow. Sites with Revolution series hardware (orange faceplate) were power cycled, and sites with Strato series devices required the WEC to be physically repaired by Wattch staff or replaced.
- Inverter-level production data was NOT recorded in Wattch for the period the WEC was offline. Wattch cellular modems remained functional during the incident. If your devices are connected to an inverter-native platform via the Wattch modem, that platform should have continued to collect device data.
- Revenue Grade Meter data WAS recorded for the period the WEC was offline. Cumulative values from the meter backfill in Wattch once the WEC is back online. The energy accumulated during the outage will, however, be attributed to the day the WEC was recovered rather than to the days on which it was actually produced.
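The attribution effect in the last point can be shown with a small sketch. This is an illustration only, not Wattch's actual backfill logic: the `daily_energy` function, the day labels, and the kWh values are all made up for the example. It assumes daily energy is computed as the difference between consecutive cumulative (odometer) readings.

```python
def daily_energy(readings):
    """Difference consecutive cumulative readings.

    readings: list of (day_label, cumulative_kwh) in chronological order,
    possibly with missing days. Each delta is attributed to the later reading.
    """
    deltas = {}
    for (_, previous), (day, current) in zip(readings, readings[1:]):
        deltas[day] = current - previous
    return deltas


# Made-up values: the WEC is offline from day 3 through day 7.
readings = [
    ("day-1", 1000.0),
    ("day-2", 1050.0),
    # no readings while the WEC is offline
    ("day-8", 1350.0),
]

daily = daily_energy(readings)
# The full outage production (300 kWh here) lands on day-8, the recovery day.
```

The total across the period is preserved because the meter keeps counting while the WEC is offline; only the per-day breakdown for the outage window is lost.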
Affected services
Updated
Jul 23 at 04:35pm EDT
A final fix has been identified and we are working with all affected customers to implement the solution.
Updated
Jul 23 at 01:19pm EDT
A final fix has been identified and all affected customers have been contacted with a solution.
Updated
Jul 22 at 02:54pm EDT
Over the weekend, we continued to work with Canonical to fully identify the root cause of the issue preventing a subset of WECs from connecting to Wattch. We have been able to successfully recover a StratoPi device on our test bench.
Currently, we are working to develop an easy in-field recovery solution for sites with CMP3 core StratoPi devices. Affected customers will be notified with remediation steps once confirmed.
NOTE: If you have not received an email from Wattch about this issue, your account is unaffected.
Updated
Jul 19 at 05:36pm EDT
We have worked with our partner, Canonical, to build a remediation strategy for the two types of affected WEC devices. We have identified a solution for "RevPi" devices and are still working on a solution for "StratoPi" devices. We have tested extensively on our in-office bench devices as well as at several sites in the field.
Customers with affected devices have been emailed with specific instructions and next steps.
Further communication and updates will be sent on Monday.
Updated
Jul 19 at 11:21am EDT
Yesterday, we identified a bug in a regularly scheduled security patch update from our third-party provider, Canonical, that caused older-model WECs to fall offline.
We’re actively working with our third-party provider to establish a recovery process for bringing these WECs back online. Please DO NOT try to physically reboot affected devices at this time. We will communicate the recovery process to customers with offline sites as soon as it is established.
For affected sites, we expect timeseries inverter/device data to be lost for the period of the outage, but cumulative (odometer) revenue grade meter data will NOT be lost.
Updated
Jul 18 at 04:19pm EDT
We've identified an issue and are working with our third-party vendor to fix impacted WECs.
Updated
Jul 18 at 12:18pm EDT
Wattch is currently investigating issues involving Edge Controller authentication with the Cloud that are preventing devices from periodically refreshing their secure connection. This is manifesting as a small subset of Wattch Core (Edge Controller-connected) sites appearing to go offline. Wattch Lite (API Connected) sites are not impacted. We do not anticipate any loss of data - missing points will backfill once connection is reestablished.
Resolution is currently expected within the next 3-4 hours, but we will share more updates as the investigation continues.
Created
Jul 18 at 11:29am EDT
We are currently investigating issues with connectivity to WECs. We will update as we learn more.