Teleport Cloud connectivity issues in US West & Europe
Incident Report for Teleport Cloud
Postmortem

Summary

Teleport Cloud customers experienced connectivity issues between 17:10 UTC and 23:14 UTC on 2022-06-08. The outage was triggered by increased memory consumption in Teleport Cloud Worldwide components.

Impact

An estimated 23% of Teleport Cloud customers experienced connectivity issues between 17:10 UTC and 20:20 UTC on 2022-06-08. 

All Teleport Cloud customers experienced connectivity issues from 21:00 UTC to 23:14 UTC on 2022-06-08.
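
For reference, the length of each impact window can be derived from the timestamps above. The following is a minimal sketch in Python; the window labels and the helper name minutes_between are illustrative assumptions, and only the timestamps come from this report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Impact windows as stated in this report (all times UTC).
# The dictionary keys are illustrative labels, not official terminology.
windows = {
    "partial": ("2022-06-08 17:10", "2022-06-08 20:20"),  # est. 23% of customers
    "full": ("2022-06-08 21:00", "2022-06-08 23:14"),     # all customers
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole number of minutes between two UTC timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


durations = {name: minutes_between(s, e) for name, (s, e) in windows.items()}

for name, mins in durations.items():
    print(f"{name}: {mins // 60}h {mins % 60:02d}m")
```

Under these timestamps, the partial window lasted 3 hours 10 minutes and the full window lasted 2 hours 14 minutes.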

Root Cause

Increased memory consumption by the software that enables Teleport Cloud Worldwide triggered the initial connectivity issue. This memory consumption interrupted adjacent workloads on the same host. To alleviate the load, a core component of our infrastructure was upgraded, which caused a service interruption for all users.

Recovery

The initial connectivity issue was resolved by scaling Teleport Cloud. The failed upgrade of the core component was resolved by rolling back to the previous version. The extended outage was due to incompatibility between versions.

Corrective Actions

  1. The capacity of the Teleport Cloud fleet will be increased in all regions during the next maintenance window.
  2. Service health monitoring will be modified to improve early detection.
  3. Enhancements scheduled for Teleport 10 will reduce memory consumption for Teleport Cloud Worldwide components.
Posted Jun 09, 2022 - 22:54 UTC

Resolved
This incident has been resolved.
Posted Jun 08, 2022 - 23:20 UTC
Update
The fix has been implemented and we have seen recovery.
Posted Jun 08, 2022 - 23:18 UTC
Identified
We have identified the issue and are implementing a fix. A subset of customers may see a brief disruption of their connectivity while we're restarting some processes.
Posted Jun 08, 2022 - 19:29 UTC
Investigating
We are seeing elevated error rates for connections to our nodes in the US West and Europe regions. We're currently investigating the cause of this issue.
Posted Jun 08, 2022 - 18:51 UTC
This incident affected: Cloud Service and Cloud Signups.