Yesterday's outage disrupted real work for our customers, and we are sorry for that. For many of you, this didn't feel like a one-off, and we heard that clearly.
Here’s what happened, what we fixed, and what we're committing to going forward to ensure Webflow is a reliable platform for our customers to do their best work.
What happened
Yesterday at 14:06 UTC, one of our CMS database clusters went offline. Customers began seeing errors across Webflow: the Dashboard, the Designer, hosted websites, and our APIs became inaccessible, and form submissions failed. Sites already cached by our CDN stayed up, but anything recently published or not yet cached was impacted.
This was not caused by a security vulnerability, a malicious attack, or a Webflow feature release. It was a hidden infrastructure constraint, one that was not visible in any metrics available to us or reported by our cloud provider's dashboard. That single failure cascaded in a way that made the impact broader and the recovery slower than it ever should have been.
Our team identified the problem quickly and began working to bring the cluster back online. We drafted the status page within 5 minutes of opening the incident, and it went live at 14:15 UTC, 9 minutes after the incident was created.
We immediately established a crisis call with our cloud provider. After extensive escalation, we learned that the cluster had hit an undocumented capacity limit on the cloud provider's database engine, one that was invisible to us. The provider's dashboard reported only about 1.74 TiB of used storage on the cluster, roughly 1.35% utilization. But behind the scenes, the database engine had been silently reserving logical space for each of the approximately 66 million database files on the cluster, consuming the full 128 TiB allocation cap over time. When we attempted to restart the cluster, it had no capacity left to allocate and entered a crash loop.
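To give a rough sense of the gap between what the dashboard showed and what the engine had reserved, here is a back-of-the-envelope sketch using only the numbers above. The arithmetic is illustrative; the exact per-file reservation behavior is internal to the provider's database engine.

```python
# Illustrative arithmetic only; the exact per-file reservation behavior is
# internal to the cloud provider's database engine.
TIB = 2 ** 40  # bytes in one tebibyte

allocation_cap = 128 * TIB      # the cluster's logical allocation cap
reported_usage = 1.74 * TIB     # storage the dashboard reported as used
file_count = 66_000_000         # approximate database files on the cluster

print(f"reported utilization: {reported_usage / allocation_cap:.2%}")    # ~1.36%
print(f"implied reservation per file: "
      f"{allocation_cap / file_count / 2**20:.1f} MiB")                  # ~2.0 MiB
```

In other words, a couple of megabytes of silent logical reservation per file, multiplied across tens of millions of files, was enough to exhaust a 128 TiB cap while the dashboard still showed the cluster as nearly empty.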
By 16:21 UTC, we had deployed a fix that removed the failing cluster and allowed the platform to come back online for the large majority of customers (approximately 96%).
Throughout the incident, we shared regular updates on status.webflow.com so customers and partners had a single place to track what was happening in real time. However, a small number of customers (approximately 4%, those on the directly affected cluster) were still unable to access Webflow until 04:01 UTC on April 15. This is unacceptable and does not meet our standards.
We remained on a continuous video conference call with our cloud provider for the full duration of the event. As of 04:01 UTC on April 15, service has been fully restored to 100% of our customers with no CMS data loss. We continue to work with our cloud provider to understand how this limit went undetected and to ensure it does not happen again.
What we're doing about it
We took two immediate actions. First, we removed the affected cluster from the active group so the remaining clusters could resume serving traffic. This is what restored service for the large majority of customers by 16:21 UTC. Second, we initiated a point-in-time restore of the affected cluster, working directly with our cloud provider to provision the additional capacity needed to complete it. We have validated that there was no CMS data loss as a result of this incident.
Our cloud provider has applied manual workarounds to double the logical storage allocation on the affected cluster from 128 TiB to 256 TiB. We also performed manual operations with them and validated, across all CMS database clusters, an engine version upgrade that raises this limit permanently.
We have already re-enabled the automated backup and restore operations that were temporarily paused during recovery, and we have deployed new monitoring and alerting that tracks actual logical storage consumption on every cluster, rather than only the reported data usage that masked this problem. This gives us visibility into the metric that caused this failure.
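As a minimal sketch of the kind of check we mean, the snippet below alerts when logical reservations approach a cluster's cap. The threshold, cluster name, and helper shape are assumptions for illustration, not our production monitoring code.

```python
# Minimal sketch of a logical-storage check; thresholds and names are
# assumptions for the example, not our actual monitoring code.
TIB = 2 ** 40

def logical_storage_alert(cluster_id: str,
                          logical_bytes_reserved: int,
                          allocation_cap_bytes: int = 256 * TIB,
                          threshold: float = 0.80) -> str | None:
    """Return an alert message if logical reservations cross the threshold."""
    utilization = logical_bytes_reserved / allocation_cap_bytes
    if utilization >= threshold:
        return (f"cluster {cluster_id}: logical storage at {utilization:.0%} "
                f"of its {allocation_cap_bytes // TIB} TiB cap")
    return None

# Example: a hypothetical cluster that has silently reserved 210 TiB pages
# immediately, even if its reported data usage looks tiny.
print(logical_storage_alert("cms-cluster-07", 210 * TIB))
```

The key difference from before is the input: the check watches logical reservations, not the data-usage number the dashboard reports.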
We are also adding shards to each of our database clusters. This will reduce the number of files per cluster, provide horizontal scaling for the database clusters, and support future customer growth.
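For readers curious why more shards helps, here is a toy illustration. The routing scheme and shard counts are assumptions made purely for the example; the point is that spreading the same workload across more shards lowers the file count each shard has to carry.

```python
# Toy illustration only; the real routing scheme and shard counts are assumptions.
import hashlib

def shard_for(site_id: str, shard_count: int) -> int:
    """Deterministically route a site's data to one of shard_count shards."""
    digest = hashlib.sha256(site_id.encode()).hexdigest()
    return int(digest, 16) % shard_count

# Spreading the same workload across more shards lowers the file count per shard.
total_files = 66_000_000
for shard_count in (4, 8, 16):
    print(f"{shard_count} shards -> ~{total_files // shard_count:,} files each")

print(shard_for("example-site-id", 16))  # which shard a given site would land on
```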
What we're working on
Beyond the immediate fixes, we are investing in broader reliability improvements. We're changing how our application handles database failures so that a single unhealthy cluster can no longer prevent the entire platform from starting. No single shard failure should cascade the way this one did. We're building operational circuit breakers that let us pause high-write operations during incidents, reducing load on degraded clusters and increasing isolation to prevent cascading failures. And we're continuing the engine version upgrade across all production clusters to ensure every shard has the higher storage limits in place.
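To make the circuit-breaker idea concrete, here is a minimal sketch under our own description above: during an incident, operators can pause high-write operations against a degraded cluster while reads and healthy clusters continue. The class and cluster names are hypothetical, not our production implementation.

```python
# Minimal sketch of an operational circuit breaker for high-write operations
# (illustrative; names and structure are assumptions, not production code).
class WriteCircuitBreaker:
    def __init__(self) -> None:
        self.paused_clusters: set[str] = set()

    def pause_writes(self, cluster_id: str) -> None:
        """Operator action during an incident: shed write load from a degraded cluster."""
        self.paused_clusters.add(cluster_id)

    def resume_writes(self, cluster_id: str) -> None:
        self.paused_clusters.discard(cluster_id)

    def allow_write(self, cluster_id: str) -> bool:
        """Only writes are gated here; reads can continue from cache or replicas."""
        return cluster_id not in self.paused_clusters


breaker = WriteCircuitBreaker()
breaker.pause_writes("cms-cluster-07")           # isolate the unhealthy shard
assert not breaker.allow_write("cms-cluster-07")
assert breaker.allow_write("cms-cluster-01")     # healthy clusters keep serving writes
```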
Reliability is the foundation everything we build depends on. We've improved reliability in a lot of areas, but we know that consistency over time is what matters most, and that's where we need to be better. We're focused on making Webflow a platform you can rely on, every day.
If you're still seeing issues, please reach out to Webflow Support. We'll continue to publish updates on status.webflow.com so you have a clear, reliable place to see what's happening and how we're responding in real time.
If you're looking for a more detailed breakdown of the investigation and fixes, our engineering team has a full technical root cause analysis here.
This kind of disruption doesn't wait for a convenient moment, and we're sorry it interrupted your work. We'll do better here.
Thank you for building on Webflow.