Step 1: Improving S3’s reliability
To safeguard against future S3 outages, we’re implementing multi-region redundancy and active-passive failover. Basically, that means we’ll be storing even more copies of your site’s content in our server network, so if one area fails, we can keep serving your site without a hitch.
Even when an entire AWS region becomes less efficient, or goes down entirely.
To help us get to multi-region S3, we’ve taken the following steps:
- We’ve copied billions of files over to another Amazon datacenter located in Ohio. (The main one is in Virginia.)
- We then set up a replication process to transfer new file uploads to the new (Ohio) datacenter as soon as they’re done uploading to the main (Virginia) one.
- We’ve deployed another cluster of web servers to use in case the main web server cluster is unavailable. (More on this later.)
Also, by configuring a DNS Failover Routing Policy, we’ll be able to automatically route traffic over to the new datacenter if our primary datacenter ever goes down. This will improve our redundancy considerably, since we’ll now be able to route traffic to a completely new datacenter, and in the future, several others in different regions around the world.
Step 2: Streamlining our devops
To date, our engineering team has been focused on building solutions that can scale automatically.
Given we’re in the unpredictable world of web hosting, where any number of our sites can experience a spike in traffic, our infrastructure has to be elastic to scale with the demand.
However, during the scramble caused by the latest S3 outage, we discovered that there were several improvements that we could make to our devops.
- First, we’re adding Terraform, an infrastructure coordination technology that will enhance the automation of deploying our hosting stack. During the outage, we realized that standing up servers still involved a couple error-prone, manual steps. Even though we use Chef to build, install, and configure many virtual instances on AWS EC2, provisioning them still required a few manual keystrokes.
- Second, we’re changing the way we’re building and launching virtual machines with Docker and Rancher. In the past four years, the complexity of Webflow hosting has increased drastically. We offer the most robust hosting service, as well as a variety of powerful features that enable thousands of businesses to run their online presence on Webflow. And due to the complexity we’ve added, our tooling did not cover certain use cases.
So in addition to focusing on S3 reliability, we’ve also started to transition our infrastructure to Docker, Terraform, and Rancher. Have no idea what these are? Read on.
Docker: A shipping container for our code
Docker is a new and novel way to package software into a standardized container that can be easily transported across different platforms and environments.
You can think of Docker as a shipping container. Shipping containers all look the same from the outside, and any ship, harbour, or crane knows how to operate one. Same goes for Docker.
With it, we can now package our code into a Docker container, and know that it will run on any computing environment that also supports Docker. This allows us to deploy our code easily and predictably to new servers at a scale that wasn’t previously possible.
Rancher: A crane to organize containers
Now that we have all these different containers running different parts of our code, we need a way to orchestrate how they work with each other. For that, we rely on Rancher. Rancher is the crane that organizes all the containers, so the right ones can be stacked in the right order and position. It also allows us to monitor the health of each container, and to make sure we have enough computing resources to provide each service.
Terraform: Charting the course
Terraform enables us to quickly provision and maintain our production environments.
This is essentially our container ship’s chart and instruction manual. It allows us to stand up completely new datacenters with our container fleet in different parts of the world in a fraction of the time. Since Terraform has a language that allows us to describe our infrastructure with code, it can talk to several different cloud providers to quickly and easily configure our infrastructure so it’s much more predictable and scalable. It also has the side benefit of making sure that engineers review all of our infrastructure changes, which means fewer developer errors, and better change-management procedures.
Over the next year, Webflow will be investing close to a million dollars to make sure our managed hosting infrastructure is state of the art. Recent advancements in cloud technology are rapidly changing the web hosting world, and we want to make sure that every one of you (and your clients) can capitalize on its benefits.
We hope this article sheds some light on the investments we’re making in our hosting technology, so that Webflow Hosting remains the fastest, most reliable option for your business.