Widespread Internet Issue

Incident Report for Braze, Inc.

Resolved

EU-01 is now fully recovered. Data Processing and Outbound Messaging backlogs have been fully processed.

With that, Braze is fully operational.

Posted Jun 12, 2025 - 17:30 EDT

Update

US-06 is now fully recovered. Data Processing and Outbound Messaging backlogs have been fully processed.

Posted Jun 12, 2025 - 17:13 EDT

Update

Google has confirmed that Android Push / Firebase messaging is fully operational, and backlogs have been processed.

Posted Jun 12, 2025 - 17:07 EDT

Update

Our infrastructure has fully scaled up, so we are processing well above our typical speeds to work through the backlogs.
Several of our clusters are now fully operational, as the Data Processing and Outbound Messaging backlogs have been fully processed.

US-01, US-02, US-03, US-04, US-05, US-07, and EU-02 are fully recovered.

Posted Jun 12, 2025 - 16:58 EDT

Update

We continue to see signs of recovery and are ramping services back to normal as we are able to.

Our REST APIs are operational; however, because some inbound internet traffic is still disrupted, some requests may time out before they reach Braze. Per standard practice, any REST API call that times out or receives a 50X response code should be retried.
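For illustration only, the retry guidance above could be sketched on the client side roughly as follows. This is a minimal sketch, not Braze's implementation: `call_with_retries`, `is_retryable`, and the `send` callable are hypothetical names, "50X" is assumed to mean any status from 500 to 599, and the backoff parameters are arbitrary. Whether a given request is safe to repeat should be checked before retrying it.

```python
import random
import time


def is_retryable(status_code):
    # None stands for a timeout or connection failure (no response received).
    # Assumption: "50X" is read as any server-error status, 500 through 599.
    return status_code is None or 500 <= status_code <= 599


def call_with_retries(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical zero-argument callable that performs one request
    and returns the HTTP status code, or None if the request timed out.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if not is_retryable(status):
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter, capped at max_delay,
            # so that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    # Every attempt failed; hand the last status back to the caller.
    return status
```

In practice `send` would wrap the actual HTTP request made with whatever client library is already in use; the jittered delay is a common way to avoid adding a synchronized burst of retries while a service is still recovering.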

Posted Jun 12, 2025 - 16:46 EDT

Update

We are now actively scaling up our infrastructure and expect to return to normal performance soon. As this happens, we are working through the Outbound Messaging and Data Processing backlogs. We expect latency to dissipate as the queues are processed.

Posted Jun 12, 2025 - 16:43 EDT

Update

Our SDK API has resumed normal processing. Events and attribute updates will not be lost; we are now automatically retrying and processing those events.

Dashboard login services have now recovered, and users can log in through https://dashboard.braze.com.

Posted Jun 12, 2025 - 16:38 EDT

Update

Multiple Braze Services are experiencing degradation due to the aforementioned incident with Cloudflare and Google Cloud Platform. While all services are available and operating correctly, we are unable to scale up our infrastructure and have taken some precautions to prevent a cascading failure once the Google and Cloudflare incidents resolve and traffic starts flowing at higher rates again.

Our SDK API is mostly returning error response codes to SDKs. This directs them to retry requests at a later time, which throttles processing demand. Events and attribute updates will not be lost, and our SDKs will periodically retry requests automatically.

Our REST APIs are operational; however, because inbound internet traffic is disrupted, some requests may time out before they reach Braze. Per standard practice, any REST API call that times out or receives a 50X response code should be retried.

Dashboard users may experience sporadic login timeouts when visiting https://dashboard.braze.com until upstream providers recover.

Outbound messaging and data processing are continuing, but may experience latency until the incident is fully resolved upstream and Braze can re-scale our infrastructure.

We are starting to see some signs of recovery and will be proactively ramping services back to normal behavior as we are able to. We will also continue to post regular updates.

Posted Jun 12, 2025 - 16:30 EDT

Identified

The Google Cloud outage is preventing us from scaling up our Kubernetes clusters. As a result, customers will experience message sending and data processing latency until this recovers. We are also seeing SDK API errors across multiple clusters.

Posted Jun 12, 2025 - 15:52 EDT

Investigating

There appears to be a widespread internet issue affecting multiple services.

Cloudflare, a Braze sub-processor, is having an incident (https://www.cloudflarestatus.com/incidents/25r9t0vz99rp) and Google Cloud is having a major outage (https://status.cloud.google.com/).

This is affecting our ability to deliver push notifications via Google Firebase. AWS is not provisioning new servers, which is preventing us from scaling up and may be related to this overall internet issue.

We will provide additional updates soon.

Posted Jun 12, 2025 - 15:01 EDT

This incident affected: US 01 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 02 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 03 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 04 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), EU 01 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 06 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 05 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), EU 02 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 07 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), US 10 Cluster (Dashboard, SDK Data Collection, Data Processing, REST APIs, Outbound Messaging), and Global Messaging Channels (Push Notifications - Android).