Degraded Connectivity

Incident Report for Teleport Cloud

Postmortem

Teleport Cloud customers experienced a service disruption that intermittently impacted connectivity for about 40 minutes. The root cause was a bug in the Envoy Gateway ingress service that caused excessive memory utilization and, ultimately, restarts. The incident was triggered when all tenant services restarted because of an undocumented background EKS storage migration, which occurred during a scheduled maintenance period to upgrade the Kubernetes control plane.

In response to the incident, the Teleport team shut down the ingress controllers, which prevented routing updates from triggering out-of-memory (OOM) events in the ingress service. This reduced customer impact by stabilizing ingress connectivity. Once the EKS storage migration and tenant service restarts had completed, the team rolled back the Envoy Gateway version, which stabilized the platform.

Posted Mar 03, 2025 - 22:01 UTC

Resolved

This incident has been resolved.

Posted Feb 21, 2025 - 02:14 UTC

Monitoring

Services have been restored in all regions.

Posted Feb 21, 2025 - 01:06 UTC

Identified

We've identified the issue and confirmed that the connectivity disruption is affecting only a subset of tenants. Additionally, new signups are unavailable until the issue is resolved.

Posted Feb 20, 2025 - 20:34 UTC

Investigating

We are currently investigating connectivity degradation on our infrastructure.

Posted Feb 20, 2025 - 20:14 UTC

This incident affected: Cloud Service and Cloud Signups.