At 10:11am on 10th January, an incident occurred with the main IP Telecom database cluster. The database stopped responding to queries, which caused calls to drop and SIP registrations to fail. The failure was caused by an edge case in the database cluster platform in which a node fails without the remaining nodes being informed of the failure. The remaining nodes then block until the failed node either dies completely or recovers.
The infrastructure team made a number of attempts to re-establish service. Unfortunately, one of these interventions compounded the issue and caused a cascading failure across a number of other components in the HostedPBX platform. These secondary failures were recovered within a few minutes.
At 10:40am the database layer was successfully restarted. Calls and registrations immediately started working again. No further issues have been observed since that time.
We have a ticket open with our database vendor, and are actively working with them to ensure this issue does not occur again.
Jan 10, 13:39 GMT
We are continuing to monitor for any further issues.
Jan 10, 12:18 GMT
Service has been restored. We are investigating the root cause of the issue and continue to monitor the system.
Jan 10, 10:48 GMT
We are continuing to investigate this issue.
Jan 10, 10:47 GMT
There are issues with calls on trunks and hosted services at the moment. We are investigating.
Jan 10, 10:31 GMT