Post-Incident Review - Email Service Outage
Incident Date: January 8, 2026
Duration: 16:05:22 - 18:41:03 CET (approximately 2.5 hours)
Summary
A deployment to update our software frameworks resulted in an email service outage lasting approximately 2.5 hours. During this period, all email traffic for the affected services listed below was lost because a misconfiguration caused messages to be silently discarded from our queue without processing.
Impact
Affected Services:
- Campaigns triggered by CDP (Workflows)
- Campaigns triggered by API
- Mails sent by API
Unaffected Services:
- Newsletter traffic remained functional throughout the incident
Data Loss: All email requests received during the incident window were accepted with success status codes but were not processed or delivered. Due to the nature of the issue, these messages could not be recovered or resent.
Timeline
- 16:05 CET - New release deployed as part of maintenance project
- ~18:40 CET - Issue reported: triggered emails not being sent
- 18:40 CET - Service rolled back to previous version
- 18:41 CET - Normal operations restored
Root Cause
Incoming email requests are accepted by our API and queued for downstream processing. The deployment introduced a misconfiguration in the connection between our API service and the queueing system. This caused:
- Requests to be accepted successfully by the API (returning success status codes to clients)
- Messages to be discarded by the queue without any error being raised
- No data reaching downstream email processors, leading to no traffic being sent out
- No request persistence, preventing recovery of lost messages
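The failure mode above can be illustrated with a minimal sketch of the opposite behavior: the API reports success only after the queue confirms that the message was stored. This is not our production code. The names InMemoryQueue, accept_email_request, and the route "email.send" are hypothetical, and the in-memory queue only stands in for a real queueing system under the assumption that it can report whether a publish succeeded.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryQueue:
    """Hypothetical stand-in for a queueing system that confirms publishes."""

    known_routes: set
    stored: list = field(default_factory=list)

    def publish(self, route: str, message: dict) -> bool:
        # Return True only when the message was actually stored.
        if route not in self.known_routes:
            return False  # misrouted message: nothing is stored
        self.stored.append(message)
        return True


def accept_email_request(queue: InMemoryQueue, route: str, message: dict) -> int:
    """Return an HTTP-style status code for an incoming email request."""
    try:
        confirmed = queue.publish(route, message)
    except Exception:
        # Assumption: any publish failure is treated as "not confirmed".
        confirmed = False
    # 202 Accepted only if the queue confirmed; otherwise 503 so the client can retry.
    return 202 if confirmed else 503


queue = InMemoryQueue(known_routes={"email.send"})
ok_status = accept_email_request(queue, "email.send", {"to": "a@example.com"})
bad_status = accept_email_request(queue, "email.sned", {"to": "b@example.com"})
print(ok_status, bad_status, len(queue.stored))
```

In this sketch a misrouted request returns an error status instead of a success code, so the client learns that the message was not queued and can retry.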
Resolution Actions
Immediate:
- Rolled back to last known healthy state, restoring service
- Investigated recovering and resending the affected traffic, then discontinued these attempts for two reasons: (1) because requests were not persisted, we could not recover all affected data, which made a complete automatic resend impossible; and (2) without knowing how time-sensitive individual customer communications were, we judged that delayed delivery could cause more harm than benefit
Preventative Measures Implemented:
- Fixed the queue connection misconfiguration
- Implemented a fallback/dead-letter queue to capture improperly routed messages and prevent data loss
- Postponed all planned releases pending thorough validation beyond our standard release process
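The fallback/dead-letter queue listed above can be sketched as follows. This is an illustration of the general technique under stated assumptions, not a description of our actual implementation: the Router class, its method names, and the route "email.send" are hypothetical, and plain in-memory deques stand in for real queues.

```python
from collections import deque


class Router:
    """Hypothetical router that never drops a message it cannot route."""

    def __init__(self, routes):
        self.queues = {name: deque() for name in routes}
        self.dead_letters = deque()

    def publish(self, route, message):
        queue = self.queues.get(route)
        if queue is None:
            # Unknown route: keep the message (and the route it asked for)
            # in the dead-letter queue instead of discarding it.
            self.dead_letters.append((route, message))
            return "dead-lettered"
        queue.append(message)
        return "routed"

    def replay_dead_letters(self, corrected_route):
        """Move all dead-lettered messages to a valid route; return the count."""
        target = self.queues[corrected_route]
        replayed = 0
        while self.dead_letters:
            _, message = self.dead_letters.popleft()
            target.append(message)
            replayed += 1
        return replayed


router = Router(["email.send"])
print(router.publish("email.sned", {"id": 1}))  # misconfigured route
print(router.replay_dead_letters("email.send"))
```

With a fallback of this kind, messages affected by a routing misconfiguration remain available for inspection and replay once the configuration is corrected.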
Additional Information
Limited logging information on missed requests is available upon request. Customers experiencing business impact are encouraged to contact us for further assistance.
Closing Remark
We deeply regret this incident and the impact it has had on our customers. We are committed to preventing similar issues in the future through the measures outlined above.