Anomalous metrics for Authentication Service
Incident Report for Postman
Postmortem

Postman’s internal authentication service experienced elevated latency followed by intermittent outages on Monday, August 13 between 03:40 UTC (20:40 Sunday PT / 09:10 IST) and 11:00 UTC (04:00 PT / 16:30 IST). During this period, users of Postman were intermittently unable to log in to Postman Pro services. We apologise for the inconvenience caused.

We traced the outage to a faulty instance of the in-memory storage associated with the service and replaced it. Investigation into what caused the in-memory storage component to fail is ongoing.

Sunday August 12

Incident: https://status.getpostman.com/incidents/3j17rlgt8m62

  1. 17:00 UTC We noticed latency spikes on the service and began our investigation. We traced the first indication back to high CPU usage at 16:40 UTC, which the service's self-healing mechanisms were countering.
  2. 17:10 UTC We traced the symptoms to a connectivity failure to one of our in-memory storage nodes and hot rebooted the node. A hot reboot of in-memory storage usually degrades the performance of our authentication servers until the caches are hydrated; a sketch of this behaviour follows after this list.
  3. 18:00 UTC System statistics had returned to normal ranges and we began monitoring the system. We had observed elevated CPU usage on the in-memory storage node, but we did not react to it because such behaviour is not unusual after a hot reboot.
  4. 18:50 UTC We closed the incident, attributing it to a probable network outage and a possible delayed side effect of the in-memory storage node's maintenance process.
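
For context on why the hot reboot in step 2 degrades authentication performance, below is a minimal TypeScript sketch of a cache-aside session lookup, assuming a hypothetical SessionStore interface and getSession helper rather than Postman's actual code. Immediately after a reboot the in-memory node is empty, so every lookup falls through to the slower durable store until the cache has been hydrated again.

```typescript
// Hypothetical cache-aside session lookup. After a hot reboot the cache starts
// empty, so every request takes the slow database path until the working set
// is re-populated ("hydrated").

interface SessionStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getSession(
  cache: SessionStore,      // in-memory storage node
  database: SessionStore,   // durable backing store, much slower
  sessionId: string
): Promise<string | null> {
  // Fast path: served entirely from the in-memory node when it is warm.
  const cached = await cache.get(sessionId);
  if (cached !== null) {
    return cached;
  }

  // Cold-cache path: falls through to the database and re-hydrates the cache.
  // Immediately after a reboot, almost all traffic lands here, which is why
  // authentication latency stays elevated until the cache warms up.
  const fromDb = await database.get(sessionId);
  if (fromDb !== null) {
    await cache.set(sessionId, fromDb, 3600); // TTL chosen for illustration only
  }
  return fromDb;
}
```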

Monday August 13

Incident: https://status.getpostman.com/incidents/jy7fv1qs1k4b

  1. 03:40 UTC We observed elevated latency and anomalous metrics from the authentication system and began investigating. A subset of users was intermittently unable to log in to our services.
  2. 04:00 UTC We added horizontal scaling capacity to compensate while we continued our investigation. Users were still facing intermittent login issues, which at times persisted beyond 10 minutes.
  3. 05:20 UTC We reproduced the issue within a quarantined clone of the service and isolated the anomaly to a small subset of the response sequence.
  4. 06:00 UTC We reconfigured the services to bypass the in-memory storage, or reduce their dependency on it, to give us breathing room for additional investigation; a sketch of this kind of reconfiguration follows after this list. Knowing that our August 12 fix was not permanent, we decided to invest additional time in identifying the root cause.
  5. 07:30 UTC Having still not found the root cause of the issue, we cloned our authentication service application infrastructure and allowed part of the workload to be served by the new setup.
  6. 08:30 UTC We further cloned our in-memory storage infrastructure and allowed part of the workload to be served by the new components.
  7. 09:30 UTC The cloned infrastructure (along with the cloned in-memory storage) performed within normal parameters, which allowed us to focus our attention on an internal engine or infrastructure issue within our in-memory storage. This was something we had not expected and consequently had not paid attention to during our investigation.
  8. 10:50 UTC Having validated that the root cause was an infrastructure failure of the in-memory storage node, we began swapping out all in-memory storage nodes for completely new components.
  9. 11:40 UTC All instances were operating within normal parameters. We continued to monitor the system closely.
  10. 14:45 UTC Instances were performing within normal parameters under peak load, at which point we closed the incident.
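
To illustrate the kind of reconfiguration described in step 4, here is a hedged TypeScript sketch. The bypassCache flag, lookupSession helper, and SessionStore interface are assumptions for illustration, not Postman's actual implementation: with the flag enabled, or whenever the in-memory node errors or misses, the lookup is answered directly from the durable store, reducing dependency on the suspect component at the cost of extra latency.

```typescript
// Hypothetical configuration flag that lets the authentication path skip the
// suspect in-memory node and read directly from the durable store.

interface SessionStore {
  get(key: string): Promise<string | null>;
}

interface AuthConfig {
  bypassCache: boolean; // assumed flag name, not Postman's actual configuration
}

async function lookupSession(
  config: AuthConfig,
  cache: SessionStore,
  database: SessionStore,
  sessionId: string
): Promise<string | null> {
  if (!config.bypassCache) {
    try {
      const cached = await cache.get(sessionId);
      if (cached !== null) {
        return cached;
      }
    } catch {
      // A failing in-memory node should degrade the request, not fail it.
    }
  }
  // With bypassCache enabled, or on any cache miss or error, the durable
  // store answers the lookup directly.
  return database.get(sessionId);
}
```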

Summary

The faulty in-memory storage node has been quarantined, and we are still investigating it with our platform vendors to determine what events led to its state.

Initiatives have been set up to add additional monitoring parameters to our in-memory storage nodes so that malfunctions like this are discovered pre-emptively.
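
As an example of what such additional node-level checks could look like, the sketch below samples a few health indicators and raises alerts before a degraded node starts affecting authentication traffic. The NodeMetrics shape, the thresholds, and the checkNodeHealth function are illustrative assumptions, not a description of Postman's monitoring stack.

```typescript
// Hypothetical health check over sampled in-memory storage node metrics.
// Metric names and thresholds are illustrative only.

interface NodeMetrics {
  cpuPercent: number;        // host CPU utilisation
  usedMemoryPercent: number; // memory used by the storage engine
  connectedClients: number;  // active client connections
  latencyMs: number;         // round-trip time of a probe command
}

function checkNodeHealth(metrics: NodeMetrics): string[] {
  const alerts: string[] = [];
  if (metrics.cpuPercent > 80) alerts.push("sustained high CPU on storage node");
  if (metrics.usedMemoryPercent > 90) alerts.push("memory pressure on storage node");
  if (metrics.latencyMs > 50) alerts.push("slow responses from storage node");
  if (metrics.connectedClients === 0) alerts.push("no clients connected; possible network fault");
  return alerts;
}

// Example: a sample resembling the August 12 event would have flagged the CPU
// spike before users noticed elevated login latency.
console.log(checkNodeHealth({
  cpuPercent: 92,
  usedMemoryPercent: 40,
  connectedClients: 120,
  latencyMs: 12,
}));
// -> ["sustained high CPU on storage node"]
```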

Posted Aug 13, 2018 - 18:04 UTC

Resolved
This incident has been resolved.
Posted Aug 13, 2018 - 17:53 UTC
Update
We are closing this incident after extended monitoring. All metrics are operating within normal parameters.
Posted Aug 13, 2018 - 17:52 UTC
Update
Systems are operational and we are waiting for another round of peak traffic to ensure that the problem does not recur.
Posted Aug 13, 2018 - 13:24 UTC
Monitoring
We've applied a fix that should restore connectivity for most users. We'll continue to monitor the situation for any abnormalities.
Posted Aug 13, 2018 - 11:33 UTC
Identified
We have identified high latency while connecting to the in-memory storage associated with the authentication service. We suspect a possible hardware or software fault and are in the process of updating the configuration of the systems that refer to this storage. You should now be able to log in, though intermittent errors may still occur.
Posted Aug 13, 2018 - 11:05 UTC
Update
We are continuing to investigate this issue.
Posted Aug 13, 2018 - 09:24 UTC
Investigating
We're currently investigating high error rates for the authentication service. We've identified that they are associated with an issue accessing its in-memory storage system. Some service disruptions should be expected. We will continue to post frequent updates as the investigation progresses.
Posted Aug 13, 2018 - 04:28 UTC
This incident affected: Postman Platform on Desktop, Postman Monitors, and Postman Mocks.