Form follows function: Building resilient form submissions at scale

Written by
Jake Byman
Senior Software Engineer

Over the last few months, we have been actively investing in our company’s resiliency. This blog post covers how one of those investments helped us with a recent incident where database operations were failing.

Form Submissions

Webflow powers both our customers and their customers. For many businesses, sales pipelines depend on inbound form submissions, so availability and durability of this feature are critical.

A typical form submission comes into the Webflow API, passes through spam checks, and is persisted to the customer’s site. Customers can view and export these submissions in their Webflow dashboard. Customers often connect these forms to their own internal systems via webhooks, allowing them to take custom actions on new form submissions.
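At a high level, that path can be sketched as follows. This is a simplified illustration rather than our actual code; every name in it is hypothetical.

// Simplified sketch of the ingestion path; all names here are hypothetical.
interface FormSubmission {
  siteId: string;
  body: Record<string, unknown>;
  createdOn: Date;
}

declare function runSpamChecks(submission: FormSubmission): Promise<boolean>;
declare function persistSubmission(submission: FormSubmission): Promise<void>;
declare function dispatchWebhooks(submission: FormSubmission): Promise<void>;

async function handleFormSubmission(submission: FormSubmission): Promise<void> {
  const isSpam = await runSpamChecks(submission); // network + application-layer checks
  if (isSpam) {
    return;                                       // spam is dropped before persistence
  }
  await persistSubmission(submission);            // saved to the customer's site
  await dispatchWebhooks(submission);             // customer-configured integrations
}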

Design Goals

We set out to ensure that, even during downstream failures, form submissions would be:

  1. Durable: never lost, even if persisting to the database failed
  2. Non-blocking: recovery mechanisms must not slow the critical request path
  3. Idempotent: safe to replay without creating duplicates
  4. Operable: easy to target a single customer or run system-wide

Phase 1 – Write-Ahead Backups

The first question we needed to ask was: where should backups live? We wanted enough context to replay correctly, without adding downstream dependencies that could fail before a backup was recorded.

We opted to record backups as high up in the API layer as possible: after basic middleware validation, but before any database calls were made. Prioritizing backup availability, we store backups for all submissions, both those that would later be filtered as spam and those that would be successfully ingested.

We decided to use Amazon S3 to store these backups. As with any distributed system, we needed to consider what would happen if the backup request itself failed, and whether it should be a blocking network call. The priority was serving the form submission in the critical path: protecting live traffic takes precedence over backup success.
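As a rough sketch, a non-blocking backup write could look like the following, assuming the AWS SDK v3 S3 client. The bucket name, key layout, and fire-and-forget choice shown here are illustrative rather than our exact implementation.

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Illustrative only: bucket name and key layout are made up for this sketch.
function backupSubmission(siteId: string, submissionId: string, payload: unknown): void {
  const command = new PutObjectCommand({
    Bucket: "form-submission-backups",
    Key: `${siteId}/${submissionId}.json`,
    Body: JSON.stringify(payload),
  });

  // Not awaited in the critical path: serving the live submission takes
  // precedence over backup success, so failures are logged and monitored
  // rather than surfaced to the submitter.
  s3.send(command).catch((err) => {
    console.error("form submission backup failed", err);
  });
}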

Phase 2 – Replay

With backups persisted in S3, we set out to build a system to re-ingest form submissions that could not be saved to the database during a downstream outage. We called this system replay. The operation needed to run in two modes: per-site replay and global replay.

Mode 1 – Per Site Replay

This mode would be run when a single site, or a handful of sites, needed to be replayed individually. It is the faster, targeted approach for a form submission outage experienced by a single customer.

Mode 2 – Global Replay

This is the heavier-handed mode: it traverses backups for all customers and replays them across the board. It is slower, but is meant for a systemic outage felt by a large portion of customers.
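Conceptually the two modes share the same core loop and differ only in scope. A hedged sketch, in which all helper and option names are hypothetical:

// Hypothetical entry point illustrating the two replay scopes.
interface ReplayOptions {
  from: Date;          // start of the replay window
  to: Date;            // end of the replay window
  siteIds?: string[];  // provided => per-site replay; omitted => global replay
}

declare function listAllSiteIds(): Promise<string[]>;
declare function replaySiteBackups(siteId: string, from: Date, to: Date): Promise<void>;

async function runReplay(options: ReplayOptions): Promise<void> {
  // Global replay walks every site; per-site replay targets just the affected ones.
  const siteIds = options.siteIds ?? (await listAllSiteIds());
  for (const siteId of siteIds) {
    await replaySiteBackups(siteId, options.from, options.to);
  }
}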

Submission Hashes

Outages rarely fail 100% of requests. A typical pattern looks like:

✅✅⛔✅⛔⛔✅⛔⛔⛔⛔✅

Here ✅ represents a successfully processed submission, and ⛔ represents one that failed before being saved to the database. We wanted to make sure that if we replayed the window from start to finish, we did not duplicate the ✅ submissions that were already ingested.

To do this, we implemented a concept called submission hashes. Each form submission is represented by a unique, stable SHA-256 hash computed from everything that makes the submission unique: the timestamp it came into our system, the customer details, the submission body, request metadata, and more. These hashes are computed and stored in the critical path for every form submission, regardless of replay context.
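A minimal sketch of computing such a hash with Node’s built-in crypto module; the exact set of fields and their serialization are simplified here.

import { createHash } from "node:crypto";

// Sketch: derive a stable SHA-256 hash from the fields that make a submission unique.
// The real hash covers more metadata than shown; a deterministic serialization
// (stable field order) is what keeps the hash reproducible across replays.
function computeSubmissionHash(input: {
  siteId: string;
  createdOn: Date;
  body: Record<string, unknown>;
}): string {
  const canonical = JSON.stringify({
    siteId: input.siteId,
    createdOn: input.createdOn.toISOString(),
    body: input.body,
  });
  return createHash("sha256").update(canonical).digest("hex");
}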

We also stored the hash on each submission record and added a database index on the field, so that we could efficiently query by submission hash to determine whether a submission had already been successfully ingested.

By deduping on submission hashes, we built an idempotent replay process: we can replay submissions across a broad range, and the only ones actually re-ingested are those that were not successfully ingested the first time around. It also means that when choosing the replay time window, we can safely pick timestamps well before and well after the outage to cover the entire range, without worrying about scanning over records we have already ingested.

// Before re-ingesting a backup, check whether a submission with the same
// hash inputs was already successfully saved the first time around.
const isDuplicate = await checkIfDuplicateFormSubmission({
   siteId,
   workspace,
   body,
   createdOn,
});

// Skip duplicates; this check is what makes replay idempotent.
if (isDuplicate) {
   return;
}
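Putting the pieces together, a per-site replay pass might look roughly like this. Apart from checkIfDuplicateFormSubmission, which appears above, the helpers and types here are hypothetical.

interface FormBackup {
  siteId: string;
  workspace: string;
  body: Record<string, unknown>;
  createdOn: Date;
}

declare function listBackups(siteId: string, from: Date, to: Date): AsyncIterable<FormBackup>;
declare function ingestFormSubmission(backup: FormBackup): Promise<void>;
declare function checkIfDuplicateFormSubmission(args: {
  siteId: string;
  workspace: string;
  body: Record<string, unknown>;
  createdOn: Date;
}): Promise<boolean>;

// Hypothetical per-site replay pass built on the dedupe check above.
async function replaySiteBackups(siteId: string, from: Date, to: Date): Promise<void> {
  for await (const backup of listBackups(siteId, from, to)) {
    const isDuplicate = await checkIfDuplicateFormSubmission({
      siteId,
      workspace: backup.workspace,
      body: backup.body,
      createdOn: backup.createdOn,
    });
    if (isDuplicate) {
      continue; // already ingested the first time around
    }
    await ingestFormSubmission(backup); // re-run the normal ingestion path
  }
}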

Chaos Testing

In addition to unit and integration testing, we needed to simulate a real outage to know that this tool worked. To accomplish this, we ran a test on our staging environment in which we deliberately broke form submissions. We did so by throwing an uncaught exception in the form submission ingestion path, simulating a standard outage.

We then submitted several form submissions which, as expected, were not processed. Next, we deployed a build that fixed the environment (simulating a revert of a breaking change). At that point we had our replay scenario: form submissions missing from the customer database but present in our backups.

From there, we ran the replay job. First, outside the outage window, to confirm that nothing happened. Then we ran the job squarely in the outage window to prove that we could replay the missing submissions. Next, we ran a replay spanning before and after the outage period, proving that only missing submissions were replayed. Lastly, we ran the replay job multiple times, proving that the operation was idempotent and that replays were correctly deduplicated on submission hash.

After this, we had confidence that this was a tool we could use in an incident.

Spam Considerations

Forms attract spam. We use multiple defenses at the network and application layers. One app-layer check depends on a downstream token with a 1-hour TTL (time to live); during longer incidents that signal can’t be trusted. Replay falls back to the other layers, but we chose “more data over less” in recovery, meaning some additional spam may reach customers in certain replay scenarios.
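As an illustration of the TTL constraint, a replay run can only lean on the token-based check for very recent submissions; everything older falls back to the remaining layers. The helper below is hypothetical.

// Hypothetical: decide whether the token-based spam check is still meaningful
// for a backed-up submission being replayed after an incident.
const SPAM_TOKEN_TTL_MS = 60 * 60 * 1000; // tokens are only valid for 1 hour

function canUseTokenCheck(submissionCreatedOn: Date, now: Date = new Date()): boolean {
  return now.getTime() - submissionCreatedOn.getTime() < SPAM_TOKEN_TTL_MS;
}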

Incident Management

During our recent incident we found ourselves in a situation where this resiliency work was tested. Database writes were failing, and we needed to assess whether there was replay work to be done.

To evaluate this, we looked at our metrics. Generally speaking, there should be one backup job per stored form submission; the expected ratio during normal traffic is loosely 1:1. In the event of a service disruption, however, metrics would show a gap between backup volume and persisted submission volume, indicating that there was replay work to be done.
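As a rough illustration (the threshold and names here are made up, not our actual monitors), the check reduces to comparing two counters over the same window:

// Illustrative only: compares backup volume to persisted submission volume
// over the same window. Threshold and function name are hypothetical.
function hasReplayWork(backupCount: number, persistedCount: number): boolean {
  if (backupCount === 0) {
    return false; // no traffic, nothing to replay
  }
  // Normal traffic is loosely 1:1; a sustained gap suggests missing submissions.
  return persistedCount / backupCount < 0.99;
}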

We had a runbook, monitors, and the muscle memory of having run this before. So we went over the state of the world and came up with a plan for executing a global replay operation.

Chart: form submissions vs. form backup jobs

The scale of the issue, and the need for a replay operation, was significant: there were several hours of backups to go through. Thankfully, the replay operation went off without a hitch, chugging through millions of backups and routing the form submissions correctly across our customers.

What’s Next

There’s more to do here. 

  1. Move backups even earlier in the stack to reduce pre-backup failure windows
  2. Smarter spam filtering during replay and in the critical path
  3. Faster, cheaper replay: streaming pipelines, backpressure, and more granular partitioning to reduce compute, while retaining existing idempotency and resilience

Conclusion

Losing customer form data is catastrophic. Resiliency in this area is critical to our customers, so having systems in place to perform data recovery is essential. Just as critical is having a runbook and muscle memory for running these systems safely. We take service disruptions very seriously. When they do occur, we prioritize safe, verifiable recovery. In this incident, we recovered nearly a million backed-up form submissions and delivered them to the correct recipients.

Does building resilient distributed systems sound interesting to you? If so, come work with us!

Last Updated: September 9, 2025