Teleport Cloud customers experienced a service disruption intermittently impacting connectivity for about 40 minutes. The root cause of the incident was a bug in the Envoy Gateway ingress service that resulted in excessive memory utilization and ultimately restarts. The incident was triggered by the restart of all tenant services due to an undocumented background EKS storage migration that occurred during a scheduled maintenance period to upgrade the Kubernetes control plane.
In response to the incident, the Teleport team shut down ingress controllers which prevented routing updates from triggering ingress service OOM events. This reduced customer impact by stabilizing ingress connectivity. Once the EKS storage migration and tenant service restarts were completed, the team rolled back the Envoy gateway version which stabilized the platform.