Service Interruption - Voice platform
Incident Report for CM.com
Postmortem

Executive Summary:

On Friday morning, Oct 15, a Microsoft Windows Server 2012/2012 R2 update containing a corrupt VMware driver was installed via auto-update on the three Voice Active Directory (AD) DNS servers, causing all three servers to become unresponsive. After the servers were restored from backup, services recovered only partly. A second restore of the primary AD DNS server was required before services recovered completely. The crash of the AD DNS servers caused various Voice services to be partly or completely unresponsive:

·       At least 50% of all inbound phone calls failed

·       Outbound phone calls via our Outbound-02 (NL) failed to set up

·       CM.com Voice platform apps were unresponsive

·       Voice API and Voice Bot calls had significant delays and/or time-outs

In addition to the outage of customer-facing services, the crash of the AD DNS servers also affected internal monitoring, tooling and access, which delayed the response to and mitigation of the outage.

Planned mitigation improvements:

·       Disable the auto-update feature on all AD DNS servers (immediate)

·       Add additional Linux-based DNS servers for improved redundancy (week 42)

·       Upgrade all Windows AD DNS servers to the latest Windows version (week 42)

·       Migrate all AD servers to the latest CM.com cloud infrastructure (Q4)

·       Additional monitoring and testing capabilities (Q4 2021)

·       Improvements to back-up and restore procedures (Q4 2021)
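
As an illustration of the monitoring item above, the following is a minimal sketch of a DNS probe that does not itself depend on DNS: it contacts each server by IP address and sends a raw query. It is not CM.com's actual tooling. The function names, the "ok"/"degraded"/"critical" levels, and the addresses and hostname in the usage comment are assumptions made for the example.

```python
import socket
import struct


def build_query(hostname, txid=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 (A record), QCLASS 1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server_ip, hostname, timeout=2.0):
    """Return True if the server at server_ip sends a DNS response in time."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    # A valid reply has a full header, echoes our ID, and has the response bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def assess(results):
    """Map per-server probe results (server -> bool) to an alert level."""
    healthy = [server for server, ok in results.items() if ok]
    if not healthy:
        return "critical"  # no DNS server answers: expect a full outage
    if len(healthy) < len(results):
        return "degraded"  # redundancy is reduced, act before it gets worse
    return "ok"


# Hypothetical usage (documentation-range addresses, placeholder hostname):
# servers = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
# print(assess({ip: probe(ip, "example.com") for ip in servers}))
```

Probing by IP address matters here because, during this incident, monitoring that relied on the failed DNS servers became unreachable together with them.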

Root Cause:

On Tuesday, October 12, Microsoft released an update for Windows Server 2012/2012 R2 virtual machines that included a corrupt driver for VMware PVSCSI. This update was installed on the three Voice AD DNS servers on Friday, Oct 15. Because auto-update was enabled on all Voice AD DNS servers, the update was installed unnoticed and without prior testing.

Restoring services from back-up required two different back-up instances, because the auto-update had executed on different days. As a result, restoring services took longer than expected.
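
The reason a single back-up instance may not be enough can be sketched as follows: for each server, the usable restore point is the most recent back-up taken before the faulty update reached that server. This is a simplified illustration only. The server names, the helper `pick_restore_point`, and all timestamps are invented for the example and are not taken from the incident.

```python
from datetime import datetime


def pick_restore_point(backups, update_installed_at):
    """Return the latest backup taken before the update was installed, or None."""
    candidates = [taken_at for taken_at in backups if taken_at < update_installed_at]
    return max(candidates, default=None)


# Arbitrary illustrative timestamps, NOT taken from the incident.
nightly_backups = [
    datetime(2000, 1, 1, 1, 0),
    datetime(2000, 1, 2, 1, 0),
    datetime(2000, 1, 3, 1, 0),
]

# Hypothetical install times: one server received the update a day earlier.
update_installed = {
    "dns-a": datetime(2000, 1, 2, 4, 0),
    "dns-b": datetime(2000, 1, 3, 4, 0),
    "dns-c": datetime(2000, 1, 3, 4, 0),
}

restore_points = {
    server: pick_restore_point(nightly_backups, installed_at)
    for server, installed_at in update_installed.items()
}

for server, point in restore_points.items():
    print(server, "->", point)
```

In this made-up scenario the newest back-up already contains the faulty update on "dns-a", so that server needs the back-up from one day earlier while the other two can use the newest one.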

Events Timeline (CEST):

  • 15-10-2021 06:52:
    CM.com NOC reported a customer complaint about voice platform apps not being available, as well as internal monitoring and tooling not being reachable
  • 15-10-2021 07:16:
    CM.com networking team confirmed that the Voice AD DNS servers had crashed
  • 15-10-2021 07:43:
    Restore from back-up initiated
  • 15-10-2021 07:54:
    First AD server restored
  • 15-10-2021 08:12:
    Second AD server restored
  • 15-10-2021 08:20:
    Third AD server restored
  • 15-10-2021 08:44:
    Various Voice services restarted and restored, but traffic was still not fully reliable
  • 15-10-2021 09:18:
    First AD server restored with earlier back-up
  • 15-10-2021 09:23:
    All services fully restored
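
The intervals between the timeline entries above can be derived with a short calculation. This is a sketch for the reader's convenience. The timestamps are copied from the timeline, while the helper `minutes_between` and the variable names are assumptions of the example.

```python
from datetime import datetime

FMT = "%d-%m-%Y %H:%M"  # timestamp format used in the timeline


def minutes_between(start, end):
    """Return the number of whole minutes between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


first_report = "15-10-2021 06:52"
crash_confirmed = "15-10-2021 07:16"
restore_started = "15-10-2021 07:43"
third_server_restored = "15-10-2021 08:20"
fully_restored = "15-10-2021 09:23"

print(minutes_between(first_report, crash_confirmed))          # 24
print(minutes_between(crash_confirmed, restore_started))       # 27
print(minutes_between(restore_started, third_server_restored)) # 37
print(minutes_between(first_report, fully_restored))           # 151
```

From the first customer report to full restoration, the timeline therefore spans 151 minutes (2 hours 31 minutes).
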
Posted Oct 20, 2021 - 11:33 CEST

Resolved
This incident has been resolved.
Posted Oct 15, 2021 - 09:50 CEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 15, 2021 - 09:22 CEST
Update
We are continuing to work on a fix for this issue.
Posted Oct 15, 2021 - 09:07 CEST
Update
Inbound traffic is still unreliable, we are working to restore it.
Posted Oct 15, 2021 - 08:56 CEST
Identified
The cause has been found and fixed. Services are being restored now.
Posted Oct 15, 2021 - 08:38 CEST
Update
We are continuing to investigate this issue.
Posted Oct 15, 2021 - 07:26 CEST
Investigating
We are currently investigating a service interruption on our Voice platform.
Our engineers are looking into this with the highest priority.
Posted Oct 15, 2021 - 07:11 CEST
This incident affected: Voice (Voice Platform - The Netherlands, Voice API's, Voice Apps).