On March 25th, 2025, Teleport Cloud experienced a brief service interruption, after which some customers were unable to load the Teleport web UI for an undetermined period of time. The root cause was the combination of an internal service/load balancer configuration and a missing timeout within the Teleport web UI.
The Teleport web UI queries an internal API when a user logs in to ensure that an onboarding questionnaire has been completed. This internal API relies on the AWS Load Balancer Controller to manage ingress and was configured to use an NLB with an instance-based target group.
At 12:56 UTC on March 25th, the cluster autoscaler in one of our production EKS clusters began the routine process of removing an underutilized instance from service. The removal of this instance, in combination with the instance-based target group, caused TCP connections to be dropped (kubernetes/autoscaler#5532). Because the Teleport web UI did not set a timeout on this request, web browsers with a previously established connection to this instance would hang on the dropped connection for an undetermined amount of time, likely until the OS or browser keep-alive logic determined the connection was dead.
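To illustrate the kind of client-side timeout that was missing, here is a minimal sketch of bounding a browser request with `AbortController`. It is not Teleport's actual code: the helper name `fetchWithTimeout`, the `FetchLike` type, and the injectable `fetchImpl` parameter are assumptions made for this example.

```typescript
// Hypothetical helper for illustration only; not taken from the Teleport web UI.
type FetchLike = (input: string, init?: RequestInit) => Promise<Response>;

async function fetchWithTimeout(
  url: string,
  timeoutMs: number,
  init: RequestInit = {},
  fetchImpl: FetchLike = fetch, // injectable so the helper can be exercised without a network
): Promise<Response> {
  const controller = new AbortController();
  // Abort the request if no response has arrived within timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl(url, { ...init, signal: controller.signal });
  } finally {
    // Always clear the timer so it cannot fire after the request settles.
    clearTimeout(timer);
  }
}
```

With a bound like this, a request stuck on a silently dropped TCP connection rejects after `timeoutMs` instead of waiting on OS or browser keep-alive logic, and the caller can decide whether to retry or continue loading the page without the result.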
This outage identified two key improvements Teleport can make to ensure this doesn’t happen again: