Investigating database issues on EU 01 causing dashboard errors, delayed data processing
Incident Report for Braze, Inc.
Resolved
At this time, data processing remains nominal and there continues to be no delay in message sending. We are still working with our vendor on a root cause analysis (RCA) and future prevention; however, this incident is now resolved.
Posted Nov 12, 2018 - 19:02 EST
Update
We have processed the backlog for the majority of customers at this time, but we are still monitoring the remaining customers' backlogs.
Posted Nov 12, 2018 - 18:00 EST
Monitoring
We have continued scaling operations towards full capacity and have sent the majority of queued messages. New messaging campaigns are going out in real time at a nominal speed. We are still heavily queued on incoming data processing at this time.
Posted Nov 12, 2018 - 17:01 EST
Update
We were able to bring our data collection APIs back online about 15 minutes ago and have been monitoring their status. We are still processing the large backlog and currently do not have enough capacity to process incoming data collection requests, so the backlog, even while being processed, is growing.
Posted Nov 12, 2018 - 15:28 EST
Identified
We have triggered a recurrence of the database instability that we saw at the beginning of this incident and have had to stop data collection and data processing at this time.
Posted Nov 12, 2018 - 14:58 EST
Update
We are continuing to work towards resolution of this issue; however, we have not yet been able to scale to fully operational levels. As a result, data processing and messaging remain delayed by a number of hours.
Posted Nov 12, 2018 - 14:45 EST
Monitoring
We have resumed a normal level of data processing and message sending at this point; however, the data processing queue is behind by multiple hours. We will be slowly scaling up capacity to process the backlog.
Posted Nov 12, 2018 - 10:55 EST
Update
Most message sending that was queued has gone out, and we are continuing to work through the message sending backlog.
Posted Nov 12, 2018 - 10:44 EST
Update
Currently, Data Processing and Outbound Messaging are severely impacted. However, we are queuing all inbound data and messages, so once those services are restored we will be able to process that queue, ensuring that no messages or collected data are lost.
Posted Nov 12, 2018 - 09:27 EST
Update
Our cloud hosting provider is continuing to investigate the issues with provisioning more capacity.
Posted Nov 12, 2018 - 09:22 EST
Update
We escalated the provisioning issue to our cloud hosting provider as their provisioning system malfunctioned. Message sending is still delayed while we provision more capacity.
Posted Nov 12, 2018 - 09:02 EST
Update
We are in the process of provisioning additional capacity on this disk. Message sending is also delayed while we provision more capacity.
Posted Nov 12, 2018 - 08:20 EST
Update
We are continuing to work on a fix for this issue.
Posted Nov 12, 2018 - 08:18 EST
Update
We are continuing to work on a fix for this issue.
Posted Nov 12, 2018 - 07:59 EST
Identified
We have identified the issue as a non-performant disk on one of our databases. We are in the process of provisioning additional capacity on this disk. This appears to have improved the issue, and we are no longer seeing errors.
Posted Nov 12, 2018 - 07:30 EST
Investigating
We are investigating database issues on EU 01 that are causing dashboard errors and delayed data processing.
Posted Nov 12, 2018 - 07:19 EST
This incident affected: EU 01 Cluster (Dashboard, SDK Data Collection, Data Processing, Outbound Messaging).