Intermittent Teleport Web UI loading failures

Incident Report for Teleport Cloud

Postmortem

On March 25th, 2025, Teleport Cloud experienced a brief service interruption during which some customers were unable to load the Teleport web UI for an undetermined period of time. The root cause was the combination of the load balancer configuration of an internal service and a missing timeout within the Teleport web UI.

The Teleport web UI queries an internal API when a user logs in to ensure that an onboarding questionnaire has been completed. This internal API relies on the AWS Load Balancer Controller to manage ingress and was configured to use an NLB with an instance-based target group.

At 12:56 UTC on March 25th, the cluster autoscaler in one of our production EKS clusters began the routine process of removing an underutilized instance from service. The removal of this instance, in combination with the instance-based target group, caused TCP connections to be dropped (see kubernetes/autoscaler#5532). Because of the missing timeout within the Teleport web UI, web browsers with a previously established connection to this instance would hang on the dropped connection for an undetermined amount of time, likely until the OS or browser keep-alive logic determined that the connection was dead.
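To illustrate the kind of client-side timeout that was missing, here is a minimal TypeScript sketch. It is not Teleport's actual code: the helper name `withTimeout`, the endpoint path in the usage note, and the 5-second limit are all assumptions chosen for illustration. The idea is to race the request against a timer and abort the request if the timer fires first, so a silently dropped connection cannot hang the page.

```typescript
// Hypothetical helper (not Teleport's code): run `task` and reject if it
// has not settled within `ms` milliseconds. The AbortSignal passed to
// `task` is aborted on timeout so an in-flight request can be cancelled.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // This promise never resolves; it only rejects once the deadline passes.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${ms} ms`));
    }, ms);
  });

  try {
    // Whichever settles first wins: the task or the deadline.
    return await Promise.race([task(controller.signal), deadline]);
  } finally {
    // Always clear the timer so it does not fire after the task settles.
    clearTimeout(timer);
  }
}
```

A caller could wrap the questionnaire query as `withTimeout((signal) => fetch("/hypothetical/onboarding-status", { signal }), 5000)`, where the URL is a placeholder rather than a real Teleport endpoint. With a bound like this, the request fails after a known interval instead of waiting on OS or browser keep-alive behavior.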

This outage identified two key improvements Teleport can make to ensure this doesn’t happen again:

Migrate the internal API to IP-based target groups - IP-based target groups allow pods to be deregistered from the NLB before being terminated, leaving time for graceful termination of TCP connections.
Remove the internal API dependency from the Teleport web UI - The Teleport team is working on removing this internal API dependency from the web UI initialization flow so that a failure of this API cannot block the UI from loading.
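One way to picture the second improvement is an initialization flow that renders first and treats the onboarding check as optional. The TypeScript sketch below rests on stated assumptions: `initializeUi`, `OnboardingStatus`, and the callback names are hypothetical and do not reflect how the Teleport web UI is actually structured.

```typescript
// Hypothetical names throughout; this is not Teleport's actual code.
type OnboardingStatus = "completed" | "pending" | "unknown";

function initializeUi(
  render: () => void,
  queryOnboarding: () => Promise<boolean>,
  onStatus: (status: OnboardingStatus) => void,
): void {
  // Render immediately: loading the UI does not wait on the internal API.
  render();

  // Run the optional check in the background. Any failure is reported as
  // "unknown" instead of blocking or breaking the page.
  queryOnboarding()
    .then((done) => onStatus(done ? "completed" : "pending"))
    .catch(() => onStatus("unknown"));
}
```

In this arrangement, a query that hangs or fails only delays or skips the questionnaire prompt. The rest of the UI has already loaded.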

Posted Mar 31, 2025 - 18:48 UTC

Resolved

Between 13:00 and 13:15, a subset of customers might have experienced intermittent failures while loading the Teleport web UI.

Posted Mar 25, 2025 - 13:00 UTC