Investigating database issues on EU 01 causing dashboard errors, delayed data processing
Incident Report for Braze, Inc.
Resolved
At this time, data processing remains nominal and there continues to be no delay in message sending. We are still working with our vendor on the RCA and future prevention; however, this incident is now resolved.
Posted 2 months ago. Nov 12, 2018 - 19:02 EST
Update
We have processed the backlog for the majority of customers at this time, but we are still monitoring the remaining customers' backlogs.
Posted 2 months ago. Nov 12, 2018 - 18:00 EST
Monitoring
We have continued scaling operations towards full capacity and have processed the majority of queued message sending. New messaging campaigns are going out in real time at a nominal speed. We are still heavily queued on incoming data processing at this time.
Posted 2 months ago. Nov 12, 2018 - 17:01 EST
Update
We were able to bring our data collection APIs back online about 15 minutes ago and have been monitoring their status. We are still processing the large backlog and currently do not have enough capacity to process incoming data collection requests; as such, the backlog, even while being processed, is growing.
Posted 2 months ago. Nov 12, 2018 - 15:28 EST
Identified
We have triggered a recurrence of the database instability that we saw at the beginning of this incident and have had to stop data collection and data processing at this time.
Posted 2 months ago. Nov 12, 2018 - 14:58 EST
Update
We are continuing to work towards resolution of this; however, we have not yet been able to scale to fully operational levels. Because of this, data processing and messaging remain delayed by a number of hours.
Posted 2 months ago. Nov 12, 2018 - 14:45 EST
Monitoring
We have resumed a normal level of data processing and message sending at this point; however, the data processing queue is behind by multiple hours. We will be slowly scaling up capacity to process the backlog.
Posted 2 months ago. Nov 12, 2018 - 10:55 EST
Update
Most message sending that was queued has gone out, and we are continuing to work through the message sending backlog.
Posted 2 months ago. Nov 12, 2018 - 10:44 EST
Update
Currently, Data Processing and Outbound Messaging are severely impacted; however, we are queuing all inbound data and messages. Once those services are restored, we will be able to process that queue, ensuring that no messages or collected data are lost.
Posted 2 months ago. Nov 12, 2018 - 09:27 EST
Update
Our cloud hosting provider is continuing to investigate the issues with provisioning more capacity.
Posted 2 months ago. Nov 12, 2018 - 09:22 EST
Update
We escalated the provisioning issue to our cloud hosting provider as their provisioning system malfunctioned. Message sending is still delayed while we provision more capacity.
Posted 2 months ago. Nov 12, 2018 - 09:02 EST
Update
We are in the process of provisioning additional capacity on this disk. Message sending is also delayed while we provision more capacity.
Posted 2 months ago. Nov 12, 2018 - 08:20 EST
Update
We are continuing to work on a fix for this issue.
Posted 2 months ago. Nov 12, 2018 - 08:18 EST
Update
We are continuing to work on a fix for this issue.
Posted 2 months ago. Nov 12, 2018 - 07:59 EST
Identified
We have identified the issue as a non-performant disk on one of our databases. We are in the process of provisioning additional capacity on this disk. This appears to have improved the issue and we are no longer seeing errors.
Posted 2 months ago. Nov 12, 2018 - 07:30 EST
Investigating
We are investigating database issues on EU 01 that are causing dashboard errors and delayed data processing.
Posted 2 months ago. Nov 12, 2018 - 07:19 EST
This incident affected: EU 01 Cluster (Dashboard, Data Collection, Data Processing, Outbound Messaging).