Postman’s internal service that handles authentication experienced elevated latency followed by a number of intermittent outages on Monday, August 13 between 03:40 UTC (20:40 Sunday PT / 09:10 IST) and 11:00 UTC (04:00 PT / 16:30 IST). During this period, users of Postman were intermittently unable to log in and use Postman Pro services. We apologise for the inconvenience caused.
We traced the outage to a faulty instance of the in-memory storage associated with the service and replaced it. Investigation into the cause of the in-memory storage component’s failure is ongoing.
Incident: https://status.getpostman.com/incidents/3j17rlgt8m62
17:00 UTC
We noticed latency spikes on the service and began our investigation. We traced the first indication back to high CPU usage at 16:40 UTC, which was being countered by the service’s self-healing mechanisms.
17:10 UTC
Tracing the symptoms to a connectivity failure on one of our in-memory storage nodes, we hot-rebooted the node. A hot reboot of the in-memory storage usually results in degraded performance of our authentication servers until the caches are rehydrated.
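To illustrate why that degradation happens, here is a minimal sketch of a generic cache-aside read path (the `Cache` interface, `MapCache`, and the database lookup are hypothetical stand-ins, not our production code): after a hot reboot the cache starts empty, so every lookup falls through to the slower primary store until the working set is repopulated.

```typescript
// Minimal illustrative sketch (not production code): a generic cache-aside
// read path. After a hot reboot the cache starts empty, so every call below
// takes the slow database path until the working set is rehydrated.

interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Trivial in-memory stand-in so the sketch runs end to end (hypothetical).
class MapCache implements Cache {
  private store = new Map<string, string>();
  async get(key: string) { return this.store.get(key) ?? null; }
  async set(key: string, value: string, _ttlSeconds: number) { this.store.set(key, value); }
}

// Stand-in for the primary database lookup (hypothetical).
async function loadSessionFromDatabase(sessionId: string): Promise<string> {
  return JSON.stringify({ sessionId, userId: "example-user" });
}

async function getSession(cache: Cache, sessionId: string): Promise<string> {
  const cached = await cache.get(sessionId);
  if (cached !== null) return cached; // warm path: served from memory

  // Cold path: slower database read, then repopulate the cache.
  const fromDb = await loadSessionFromDatabase(sessionId);
  await cache.set(sessionId, fromDb, 3600);
  return fromDb;
}

getSession(new MapCache(), "session-123").then(console.log);
```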
18:00 UTC
System statistics had returned to normal ranges and we began monitoring the system. We had observed elevated CPU usage on the in-memory storage node, but did not react to it because such behaviour is not unusual after a hot reboot.
18:50 UTC
The incident was closed, tracing it back to a plausible network outage and a probable delayed side effect of the maintenance process on the in-memory storage node.

Incident: https://status.getpostman.com/incidents/jy7fv1qs1k4b
03:40 UTC
We observed elevated latency and anomalous metrics from the authentication system and began an investigation. A subset of users were intermittently unable to log in to our services.
04:00 UTC
We added horizontal scaling capacity to compensate while we continued our investigation. Users were still facing intermittent login issues, which at times persisted beyond 10 minutes.
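As a rough illustration of that mitigation (not our actual autoscaling configuration; the numbers and target are hypothetical), a target-tracking estimate of the capacity needed to bring average CPU back to a target looks like this:

```typescript
// Illustrative only: estimate how many instances are needed to bring
// average CPU utilisation back down to a target. Thresholds are hypothetical.
function desiredInstanceCount(
  currentInstances: number,
  averageCpuPercent: number,
  targetCpuPercent: number,
): number {
  const desired = Math.ceil(currentInstances * (averageCpuPercent / targetCpuPercent));
  return Math.max(desired, currentInstances); // only scale out during an incident
}

// Example: 8 instances at 85% CPU with a 50% target -> 14 instances.
console.log(desiredInstanceCount(8, 85, 50));
```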
05:20 UTC
We reproduced the issue within a quarantined clone of the service and isolated the anomaly to a small subset of the response sequence.
06:00 UTC
We reconfigured the services to bypass, or reduce their dependency on, the in-memory storage to give us breathing room for additional investigation. Knowing that our August 12 fix was not permanent, we decided to invest additional time in identifying the root cause.
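To sketch what bypassing or reducing dependency on the cache can look like (a simplified illustration under assumed names — the `CACHE_BYPASS` flag and both helper functions are hypothetical, not our service code), reads can be routed straight to the primary store whenever the bypass flag is set or the cache call fails:

```typescript
// Illustrative sketch only: routing reads around an unhealthy cache.
// The CACHE_BYPASS flag and both helpers are hypothetical names.
const CACHE_BYPASS = process.env.CACHE_BYPASS === "true";

async function readFromCache(key: string): Promise<string | null> {
  return null; // stand-in for the in-memory storage client
}

async function readFromPrimaryStore(key: string): Promise<string> {
  return `value-for-${key}`; // stand-in for the primary database
}

async function read(key: string): Promise<string> {
  if (!CACHE_BYPASS) {
    try {
      const hit = await readFromCache(key);
      if (hit !== null) return hit;
    } catch {
      // Swallow cache errors so the request is still served from the
      // primary store instead of failing the login outright.
    }
  }
  return readFromPrimaryStore(key);
}

read("session-123").then(console.log);
```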
07:30 UTC
Having still not found the root cause of the issue, we cloned our authentication service application infrastructure and allowed part of the workload to be served by the new setup.
08:30 UTC
We further cloned our in-memory storage infrastructure and allowed part of the workload to be served by the new components.
09:30 UTC
The cloned infrastructure (along with the cloned in-memory storage) performed within normal parameters, which allowed us to focus our attention on an internal engine or infrastructure issue within our original in-memory storage. This was something we had not expected and, consequently, had not paid attention to during our investigation.
10:50 UTC
Having validated the root cause as an infrastructure failure of the in-memory storage node, we began swapping out all in-memory storage nodes for completely new components.
11:40 UTC
All instances were operating within normal parameters. We continued to monitor the system closely.
14:45 UTC
Instances continued to perform within normal parameters under peak load, at which point we closed the incident.
The faulty in-memory storage node has been quarantined, and we are still investigating it with our platform vendors to determine what events led to its state.
We have set up initiatives to add additional monitoring parameters to our in-memory storage nodes, to ensure pre-emptive discovery of malfunctions like this one.
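As an example of the kind of check we mean (a sketch with hypothetical metric names and thresholds, not our actual monitoring configuration), a periodic probe can look beyond simple reachability and alert on latency, memory pressure, and error rate:

```typescript
// Illustrative sketch only: a periodic health probe for an in-memory
// storage node. Metric names, thresholds, and the alerting hook are
// hypothetical; the point is to catch degradation before it becomes
// a connectivity failure.

interface CacheNodeMetrics {
  pingLatencyMs: number;
  memoryUsedPercent: number;
  errorRatePerMinute: number;
}

// Stand-in for whatever collects metrics from the node (hypothetical).
async function collectMetrics(nodeId: string): Promise<CacheNodeMetrics> {
  return { pingLatencyMs: 2, memoryUsedPercent: 40, errorRatePerMinute: 0 };
}

function evaluate(nodeId: string, m: CacheNodeMetrics): string[] {
  const alerts: string[] = [];
  if (m.pingLatencyMs > 50) alerts.push(`${nodeId}: ping latency ${m.pingLatencyMs} ms`);
  if (m.memoryUsedPercent > 85) alerts.push(`${nodeId}: memory at ${m.memoryUsedPercent}%`);
  if (m.errorRatePerMinute > 10) alerts.push(`${nodeId}: ${m.errorRatePerMinute} errors/min`);
  return alerts;
}

async function probe(nodeId: string): Promise<void> {
  for (const alert of evaluate(nodeId, await collectMetrics(nodeId))) {
    console.warn(alert); // stand-in for the paging / alerting integration
  }
}

probe("cache-node-1");
```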