Webflow experienced a sustained availability disruption across the Designer, Dashboard, Marketplace, and user sign ups between July 28 and July 31, 2025.
While hosted sites remained live, some core platform functionality was unavailable. We’ve restored reliability, and here’s how we got there. For a broader reflection on how this impacted our customers and what we’re changing as a company, read our CEO Linda’s post here. Below is a technical breakdown of the incident from an engineering perspective.
Overview
The first two phases of disruption were caused by a malicious attacker producing sustained load on our systems and specific API endpoints, resulting in elevated latency and intermittent service outages. These attacks were mitigated through firewall protections, IP blocking, and backend infrastructure investigation.
The third and most disruptive phase was triggered by ongoing attack traffic, compounded by performance degradation after we scaled up a critical backend database cluster to a dual-socket CPU hardware architecture to gain operational headroom. Although the larger cluster size had been shown to deliver higher performance, we later learned that the single-socket CPU architecture performs better for our services.
The fourth phase began with another malicious attack, which caused a critical database cluster to experience severe write latency and replication lag. This phase was mitigated through two sets of changes: configuration adjustments made by our database vendor, followed by scaling the database cluster up to a higher-capacity, single-socket CPU architecture.
Full stability was restored after these changes, and no further disruptions have been observed since. Throughout all four phases of the incident, all Webflow-hosted websites maintained 100 percent availability.
Incident summary
- Start time: July 28, 2025 at 1:27 PM UTC
- End time: July 31, 2025 at 4:00 PM UTC
- Total duration: 3 days, 2 hours, and 33 minutes
- Impact: The Webflow Designer, Dashboard, Marketplace, and user sign ups experienced elevated latency, partial outages, and degraded performance in four distinct phases
- Root causes:
  - Targeted malicious attacker producing sustained load on our systems
  - Performance issues following the scale-up of a critical backend database cluster
  - Configuration issues and a known software bug in a critical database cluster
What happened
Malicious traffic and early mitigation
At 1:27 PM UTC on July 28, we began receiving internal and external reports of increased latency when loading the Webflow Designer and Dashboard. Some customers experienced long wait times to publish sites and errors when attempting to load core parts of the platform.
We identified a malicious attacker producing sustained load on our systems. In response, we implemented Web Application Firewall protections, blocked suspect IP address ranges, and opened a high-severity case with our third-party database provider. We also took action to improve database efficiency. These attacks were mitigated by 4:55 PM UTC, and service returned to normal.
At 9:03 AM UTC on July 29, we detected a second set of attacks targeting similar API endpoints. Latency in the Designer and Dashboard increased again. Additional firewall protections and IP blocks were applied, and backend database investigations continued. By 10:59 AM UTC, Webflow Designer and Dashboard were once again fully responsive and stable.
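For readers unfamiliar with this kind of mitigation: the blocking itself happened at the WAF and network edge rather than in application code, but the underlying check is simple. A minimal TypeScript sketch of matching a client IP against blocked CIDR ranges (the ranges and names below are illustrative, not the ones we used) looks like this:

```typescript
// Illustrative only: the real blocking happened at the WAF/edge layer, not in app code.
// A minimal IPv4 CIDR matcher that a request handler could consult before doing work.

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split("/");
  const bits = Number(bitsStr);
  // Mask of `bits` leading ones; /0 matches everything.
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}

// Hypothetical block list; real ranges came from attack-traffic analysis.
const blockedRanges = ["203.0.113.0/24", "198.51.100.0/24"];

export function isBlocked(clientIp: string): boolean {
  return blockedRanges.some((range) => inCidr(clientIp, range));
}

// Example: isBlocked("203.0.113.42") === true
```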
System changes that increased load under pressure
The third phase began at 12:13 PM UTC on July 29. While malicious traffic continued, we also began seeing normal weekday load. To increase operational headroom, we scaled up a critical backend database cluster onto a new dual-socket CPU hardware architecture using our vendor’s automation. This action, completed at 2:50 PM UTC, quickly introduced severe degradation: write latency climbed to 300 times baseline and replication lag to 500 times baseline.
Availability of the Designer and Dashboard was intermittent for the next eight hours. To reduce load on the system, we turned off data pipelines, disabled SCIM, paused new user sign ups, and turned off several newly launched features. All other engineering and operational work was paused to focus on the incident.
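Shedding load this way relies on configuration-driven kill switches that let operators turn subsystems off without a deploy. A minimal sketch of the pattern, with hypothetical flag names rather than our actual configuration system:

```typescript
// Illustrative kill-switch pattern: non-critical subsystems check a flag before doing work,
// so operators can shed load without deploying code. Flag names here are hypothetical.

type KillSwitches = {
  dataPipelinesEnabled: boolean;
  scimEnabled: boolean;
  signupsEnabled: boolean;
};

// In practice this would read from a config service or environment, not a constant.
const switches: KillSwitches = {
  dataPipelinesEnabled: false, // turned off during the incident
  scimEnabled: false,
  signupsEnabled: false,
};

export function handleSignupRequest(): { status: number; body: string } {
  if (!switches.signupsEnabled) {
    // Fail fast with a clear message instead of adding database load.
    return { status: 503, body: "Sign ups are temporarily paused." };
  }
  return { status: 200, body: "ok" };
}
```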
At 8:00 PM UTC, our database vendor recommended scaling the cluster back down to a smaller, single-socket CPU architecture. This operation was completed at 10:09 PM UTC. The Designer and Dashboard returned to stable performance immediately afterward.
Final recovery and fix validation
The fourth phase of the incident began at 9:32 AM UTC on July 30 when a malicious attack on the Webflow Marketplace triggered elevated database write latency and degraded performance in the Designer and Dashboard. We mitigated the issue by taking the Marketplace offline, disabling new user sign ups, optimizing read operations, and working closely with our third-party database vendor. We failed over the database cluster at 10:18 AM UTC.

The vendor also identified a known session count bug and recommended configuration changes, including reducing slow query logging and later disabling aggressive memory decommit, both of which contributed to system recovery. To ensure long-term resilience, we upgraded the critical database cluster to a higher-capacity, single-socket CPU architecture. This final step brought the system back to full stability by 5:59 PM UTC, with sustained improvements observed thereafter.
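For illustration of what configuration changes like these look like in practice: the exact commands depend on the vendor’s product and were applied by the vendor on our behalf, but assuming a MongoDB-style document database for the sake of example, raising the slow-query threshold and disabling aggressive memory decommit would look roughly like this:

```typescript
// Illustration only, assuming a MongoDB-style cluster; our vendor applied the actual
// changes, and their parameter names/values may differ from what is shown here.
import { MongoClient } from "mongodb";

export async function applyMitigations(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  const admin = client.db("admin");

  // Reduce slow-query logging pressure by raising the slow-operation threshold,
  // so far fewer operations are written to the log under heavy load.
  await admin.command({ setParameter: 1, slowOpThresholdMs: 1000 });

  // Disable aggressive memory decommit so the allocator stops eagerly returning
  // freed memory to the OS, which is expensive on hot write paths.
  await admin.command({ setParameter: 1, tcmallocAggressiveMemoryDecommit: 0 });

  await client.close();
}
```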
Out of an abundance of caution, given the long duration and multiple phases of this incident, we maintained continuous monitoring and video calls with our database vendor, and remained on high alert until 4:00 PM UTC on July 31.
What went well
- We mitigated two initial attacks quickly using firewall rules, IP blocking, and backend analysis.
- Webflow-hosted websites remained fully operational and unaffected throughout the incident.
- Engineering and operational teams coordinated to reduce backend load and minimize system pressure.
- Live video chat sessions with our database vendor ensured clear communication and alignment on mitigation steps.
- Every available resource was redirected to resolving the incident during the third phase.
Where we fell short
- Our infrastructure did not have circuit breakers or rate limits in place to prevent cascading strain during recovery.
- Key features were disabled mid-incident without proactive notice to customers or partners.
- The scale-up to the dual-socket CPU hardware architecture introduced unexpected performance issues related to write latency and replication lag.
- The database vendor was unable to resolve their session count bug in time to help with mitigation.
What we’ve changed
Since the incident, we’ve completed the following improvements:
- Added an index to a collection in the impacted database cluster to increase query efficiency and reduce database load
- Implemented stricter rate-limiting for our user sign up systems to better mitigate spikes in traffic
- Increased rate-limit protections in our Web Application Firewall to block malicious traffic in targeted areas
- Introduced circuit breakers for key traffic flows that contribute significant load to the affected database cluster (a simplified sketch of the pattern follows this list)
- Enhanced monitoring to ensure more reliable alerting when database latency issues arise
- Upgraded our critical database cluster to a higher-capacity, single-socket CPU architecture
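The circuit breakers mentioned in this list follow a common pattern: track recent failures for a given traffic flow, and once they cross a threshold, fail fast for a cool-down period instead of sending more load to the struggling database. A simplified, self-contained sketch of that pattern (not our production implementation):

```typescript
// Simplified circuit breaker: not our production code, just the general pattern.
// While "open", calls fail fast instead of adding load to a struggling dependency.

type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000,
  ) {}

  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open: failing fast to shed load");
      }
      this.state = "half-open"; // allow a single trial request through
    }
    try {
      const result = await operation();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}

// Usage sketch: wrap a database-heavy call site in a breaker.
// const breaker = new CircuitBreaker();
// const doc = await breaker.call(() => fetchDashboardData(userId));
```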
What we’re still working on
- Replay missed form submissions for all customer forms (by August 4)
- Tune heartbeat configurations to improve the health of database connection pools (by August 1)
- Adjust backup and snapshot schedules to avoid load during peak usage hours (by August 4)
- Evaluate and potentially move additional read-only queries to a dedicated replica (by August 4)
- Evaluate the use of a queuing system for non-critical write requests to allow for eventual consistency (by August 4; a minimal sketch of the idea follows this list)
- Complete a detailed root cause analysis with our database vendor for performance recommendations (by August 1)
- Finalize internal root cause analysis with additional follow-up actions (by August 4)
- Upgrade database clusters to the latest software version (by August 15)
- Deploy the fix for the session count bug identified by our database vendor when it becomes available
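On the queuing item above: the idea under evaluation is to acknowledge non-critical writes immediately, enqueue them, and apply them to the database at a controlled rate, trading immediate consistency for smoother load. A minimal in-memory sketch of the shape of that approach (a production version would use a durable queue; all names are illustrative):

```typescript
// Minimal sketch of deferring non-critical writes through a queue so the database
// absorbs them at a controlled rate. Names are illustrative; a real implementation
// would use a durable queue and an actual database client rather than in-memory state.

type WriteJob = { collection: string; payload: Record<string, unknown> };

const queue: WriteJob[] = [];

// Producer: acknowledge the request immediately and defer the write.
export function enqueueWrite(job: WriteJob): void {
  queue.push(job);
}

// Placeholder for the real database write.
async function applyWrite(job: WriteJob): Promise<void> {
  console.log(`writing to ${job.collection}`, job.payload);
}

// Consumer: drain a bounded batch on an interval, keeping write pressure predictable.
export function startWorker(batchSize = 50, intervalMs = 1000): NodeJS.Timeout {
  return setInterval(async () => {
    const batch = queue.splice(0, batchSize);
    for (const job of batch) {
      await applyWrite(job);
    }
  }, intervalMs);
}
```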
Closing
We understand the trust placed in us to power high-stakes work. We know we didn’t meet that expectation this time, and we’re applying every lesson from this incident to build a stronger, more resilient Webflow. We’re continuing to monitor performance closely, and we will share updates as our work continues.
-Allan
Webflow CTO
For anyone who wants more details, there's a more technical deep dive now available here.