Conversational AI Cloud - Outage

Incident Report for CM.com

Postmortem

Incident Report 01-09-2025 Conversational AI Cloud – Production Gateway

What:

Yesterday, September 1st, we had a long-running incident with periods of partial downtime and periods of degraded performance on otherwise successful traffic to the Conversational AI Cloud Production Gateway. This is our central entry point, which means all channels were potentially affected. Over the duration of the incident, about 1.2% of requests failed and 98.8% received a successful response. At varying intervals, response times, which are usually well below 100 milliseconds, degraded to multiple seconds. The service remained functional, but this is not the performance that we stand for.

When:

The full span of the incident started at 09:00 CET and ended at 15:24 CET. Impact may have varied per client and per unit of time.

How:

Both a core cache for session management and a platform service that provides project information degraded under load. Several deployments were made during the day to mitigate the issues and to make these components more scalable going forward. To fully observe the effects of the releases, the components were also restarted after the deployments.

  • A new version of the Redis Client in the Engine behind the Production Gateway was introduced
  • The minimum baseline of a computational resource for broad workloads was increased
  • The cache implementation was improved to reduce load by cutting out unnecessary redundancies
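
For illustration only, one common way to cut redundant cache or service lookups is a short-lived in-process layer that answers repeated reads for the same key locally. The report does not describe the actual change, so everything in this sketch is an assumption: the class name `ShortLivedLocalCache`, the `fetch_project_info` function, the key names, and the 5-second time-to-live are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ShortLivedLocalCache:
    """Hypothetical in-process layer in front of a shared cache or service.

    Repeated reads of the same key within `ttl_seconds` are served from
    local memory, so the backend sees one call instead of many.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the value was stored, value)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            # Still fresh: skip the backend call entirely.
            return entry[1]
        # Missing or expired: fetch once and remember the result.
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


# Hypothetical backend call; records each invocation so the saving is visible.
backend_calls = []


def fetch_project_info(project_id: str) -> str:
    backend_calls.append(project_id)
    return f"info-for-{project_id}"


cache = ShortLivedLocalCache(fetch_project_info, ttl_seconds=5.0)
results = [cache.get("project-a") for _ in range(3)]
print(len(backend_calls))  # number of backend calls made for three reads
```

The trade-off in a layer like this is staleness: a value can be up to `ttl_seconds` old, so the time-to-live has to suit how quickly the underlying data changes.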

There were follow-up releases on September 2nd to improve the platform service specifically. For both components, the growth of traffic over time proved them to be under-optimized for the current scale they serve.

Going forward

We will diligently and carefully continue with further releases this week and next to make both services fit for future scale and to prevent such incidents from happening again. We apologize for the impact this may have had on you and your work. Thank you for your patience and understanding as we worked to resolve this.

Posted Sep 02, 2025 - 21:24 CEST

Resolved

This incident is resolved. We have not seen a single exception over the past period of monitoring. We apologize for any impact this might have had on you today and will publish a post-mortem tomorrow.
Posted Sep 01, 2025 - 16:40 CEST

Update

The system performance still looks stable. We will continue to monitor on high alert.
Posted Sep 01, 2025 - 16:07 CEST

Update

The current system performance appears to be stable and responsive. We will continue to monitor closely over the next 30 minutes.
Posted Sep 01, 2025 - 15:13 CEST

Update

The new release is scheduled to take place within the next few minutes. We will provide the next update within 30 minutes at the latest, or sooner if circumstances require.
Posted Sep 01, 2025 - 14:53 CEST

Update

A new release is currently being built as a follow-up. We will provide the next update within 30 minutes, or sooner if developments warrant.
Posted Sep 01, 2025 - 14:17 CEST

Update

We have observed some improvements; however, the issue has not been fully resolved. Currently, most requests are performing well, but a subset remains significantly slower. We are continuing to investigate these slow requests.
Posted Sep 01, 2025 - 14:05 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 01, 2025 - 13:39 CEST

Identified

A fix—expected to be structural rather than temporary—is currently being prepared. The release is anticipated to begin at approximately 13:15.
Posted Sep 01, 2025 - 12:50 CEST

Update

We are actively working to resolve the issue as quickly as possible and will provide a further update as soon as we can.
Posted Sep 01, 2025 - 11:47 CEST

Investigating

We are currently experiencing issues with the Production Gateway of Conversational AI Cloud (CAIC). We are investigating further.
Posted Sep 01, 2025 - 10:04 CEST
This incident affected: Conversational AI Cloud (Production Gateway).