Anomalous metrics for authentication service
Incident Report for Postman
Postmortem
  1. 17:00 UTC We noticed latency spikes on the service and began our investigation. We traced the first indication back to high CPU usage at 16:40 UTC, which the service's self-healing mechanisms had been countering.
  2. 17:10 UTC We traced the symptoms to a connectivity failure with one of our in-memory storage nodes and hot rebooted the node. A hot reboot of in-memory storage usually degrades the performance of our authentication servers until the caches are hydrated.
  3. 18:00 UTC System statistics had returned to normal ranges and we began monitoring the system. We observed elevated CPU usage on the in-memory storage node but did not react to it, because this behaviour is not unusual after a hot reboot.
  4. 18:50 UTC The incident was closed and attributed to a plausible network outage and a probable delayed side effect of the maintenance process of the in-memory storage node.
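
For readers unfamiliar with the failure mode in step 2, here is a minimal sketch of how a client might reconnect to a storage node using capped exponential backoff with jitter. It is illustrative only and is not the authentication service's actual code or the fix that was rolled out. The names `backoff_delays`, `reconnect_with_backoff`, and `connect`, and all delay values, are assumptions made for this example.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, cap: float, attempts: int) -> List[float]:
    """Return capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def reconnect_with_backoff(
    connect: Callable[[], T],          # hypothetical: opens a connection or raises
    attempts: int = 5,
    base: float = 0.1,
    cap: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or the attempts are exhausted."""
    last_error: Optional[Exception] = None
    delays = backoff_delays(base, cap, attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                # Random jitter spreads retries out so that many clients do not
                # hit a recovering node at the same instant.
                sleep(random.uniform(0, delay))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Capping the delay and adding jitter is one common way to avoid piling load onto a node whose caches are still being hydrated after a reboot.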

Detailed postmortem that resulted in the discovery of the root cause: https://status.getpostman.com/incidents/jy7fv1qs1k4b

Posted Aug 13, 2018 - 18:05 UTC

Resolved
This incident has been resolved.
Posted Aug 12, 2018 - 18:33 UTC
Monitoring
The system is now stable and login should work as expected. We are monitoring the system.
Posted Aug 12, 2018 - 18:19 UTC
Identified
We've traced the issue to the re-connection logic for the authentication service's in-memory storage. We are rolling out a temporary fix and will look into a permanent solution once the service is fully operational.
Posted Aug 12, 2018 - 17:38 UTC
Investigating
We are investigating high error rates for the authentication service. We've identified that they are associated with an issue accessing its in-memory storage system. Some service disruptions should be expected. We will continue to post frequent updates on the progress of the investigation.
Posted Aug 12, 2018 - 17:07 UTC
This incident affected: Postman Platform on Desktop, Postman Monitors, and Postman Mocks.