Incident Report 01-09-2025 Conversational AI Cloud – Production Gateway
What:
Yesterday, September 1st, we had a long-running incident on the Conversational AI Cloud Production Gateway, with periods of partial downtime and periods of degraded performance on otherwise successful traffic. The gateway is our central entry point, which means all channels were potentially affected. Over the course of the incident, a little over 1% of requests failed, while 98.8% received a successful response. At varying intervals, response times, which are usually well below 100 milliseconds, degraded to multiple seconds. The service was still functional, but this is not the performance we stand for.
When:
The incident started at 09:00 CET and ended at 15:24 CET. Impact may have varied per client and over time.
How:
Both a core cache for session management and a platform service that provides project information degraded under load. We made several deployments during the day to mitigate the issues and to make these components more scalable going forward. To fully observe the effects of the releases, these components were also restarted after the deployments.
Follow-up releases on September 2nd improved the platform service specifically. Traffic growth over time showed that both components were under-optimized for the scale they currently serve.
Going forward:
We will diligently and carefully continue with further releases this week and next to make both services fit for future scale and to prevent such incidents from happening again. We apologize for the impact this may have had on you and your work. Thank you for your patience and understanding as we worked to resolve this.