Reliability

Overview

We know that storage and service reliability are important to our users. Ensuring that stored emails and files are always available and never lost, damaged or corrupted is one of our primary concerns.

Compare the systems and precautions described below with those of your current provider or in-house solution, and see if you can afford not to change.

Email Storage

We use a number of systems to ensure the reliable storage of all email on our servers.

File Storage

Our file storage system uses a custom-built database and storage infrastructure to provide reliable, feature-rich file storage.

Database

While the majority of our data is stored as emails and files, a significant amount of data still lives in our regular SQL database, such as user data, user preferences, address books, notepages, etc.

Failover

To avoid single points of failure, all stateless frontend services use DNS load balancing with automatic failover via Linux high-availability services. This system is simple, works for all protocols, and ensures that no single piece of hardware is a point of failure. All web servers, incoming email servers, spam checking servers, etc. are automatically balanced by the frontend servers.

Each service is assigned multiple IP addresses via DNS. The IP addresses are handed out randomly, so on average systems connecting to us via web, IMAP, POP, etc. get one of the service's IPs at random. When all servers are working, these IP addresses are distributed around the frontend servers to keep the load amongst them balanced. The system continuously monitors the accessibility of the frontend servers, and if any server becomes unavailable, it moves the IP addresses of the affected service to one of the remaining working servers. This allows for easy scaling of frontend servers as well as easy failover of services to other frontend servers if one machine becomes unavailable.
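
The IP reassignment described above can be illustrated with a short Python sketch. It is a simplified model rather than the production mechanism: the function names, the server names and the rule of moving each address to the least-loaded remaining server are assumptions made for illustration, and the addresses come from the reserved documentation range 192.0.2.0/24.

```python
def assign_ips(ips, servers):
    """Spread a service's IP addresses round-robin across frontend servers."""
    return {ip: servers[i % len(servers)] for i, ip in enumerate(ips)}


def fail_over(assignment, failed, healthy):
    """Move every IP held by `failed` to the least-loaded healthy server.

    Picking the least-loaded target is an assumption of this sketch; the
    text only says the addresses move to one of the remaining servers.
    """
    load = {server: 0 for server in healthy}
    for server in assignment.values():
        if server in load:
            load[server] += 1

    result = dict(assignment)
    for ip in sorted(assignment):
        if assignment[ip] == failed:
            target = min(healthy, key=lambda s: (load[s], s))
            result[ip] = target
            load[target] += 1
    return result


if __name__ == "__main__":
    ips = [f"192.0.2.{n}" for n in range(1, 7)]       # hypothetical service IPs
    servers = ["frontend1", "frontend2", "frontend3"]  # hypothetical hosts
    before = assign_ips(ips, servers)
    after = fail_over(before, "frontend2", ["frontend1", "frontend3"])
    for ip in ips:
        print(ip, before[ip], "->", after[ip])
```

Because clients only ever see the service's IP addresses, the move is invisible to them: the same addresses keep answering, just from a different machine.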

For backend services that are highly stateful (e.g. email IMAP/POP servers), we need to use a completely different setup. In this case we distribute email amongst a large number of "email slots" that are then spread around our servers. These slots are paired up to create a redundant, replicated "email store" pair. In the case of a server failure, failover is not automatic. An engineer is paged to check the status of the server and determine whether it can be brought back online quickly. If it cannot, we "failover" the affected email slots to their replica slots to restore service. In the cases where this has happened (twice in the last year), maximum downtime has been about 1 hour, and because only one server was involved each time, only about 10% of users were affected. This represents a 99.997% uptime average for all users.
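
The uptime figure above can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and treats both incidents as a full hour affecting exactly 10% of users, which are simplifications of the "about" figures quoted; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def average_uptime_percent(outages, hours_each, fraction_affected,
                           period_hours=HOURS_PER_YEAR):
    """Average uptime across all users, weighting each outage by the
    fraction of users who actually experienced it."""
    lost_user_hours = outages * hours_each * fraction_affected
    return 100.0 * (1.0 - lost_user_hours / period_hours)


uptime = average_uptime_percent(outages=2, hours_each=1.0,
                                fraction_affected=0.10)
print(f"{uptime:.4f}%")  # prints 99.9977%
```

Two one-hour incidents that each touch a tenth of users cost 0.2 user-hours out of 8760, which gives roughly 99.9977%, consistent with the figure quoted above when truncated to three decimal places.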

In the cases where we need to do maintenance on a machine, we can use the same failover mechanism to switch the master/replica roles of all email slots on a machine in a matter of seconds. This means that performing maintenance on our email server machines results in no downtime for users.

An added advantage of our "email slots" setup is that in the case of failover, the additional load generated on the remaining servers is distributed evenly over those servers, rather than all the load being forced onto one backup server. This ensures that even during system failure or maintenance periods, there's no slowing of service to users and our email servers remain fast.
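
To see why the extra load spreads out, consider the following Python sketch. It is a toy model under stated assumptions: the layout rule (each server's replicas rotate through all the other servers), the server names and the function names are hypothetical and are not taken from the actual slot placement.

```python
from collections import Counter


def build_slot_pairs(servers, slots_per_server):
    """Return one (master_server, replica_server) pair per email slot.

    Assumption of this sketch: the replicas of a server's slots rotate
    through all the other servers, so no single machine is its backup.
    """
    pairs = []
    for i, master in enumerate(servers):
        others = servers[i + 1:] + servers[:i]
        for k in range(slots_per_server):
            pairs.append((master, others[k % len(others)]))
    return pairs


def fail_over_server(pairs, failed):
    """Swap master/replica roles for every slot whose master is on `failed`."""
    return [(replica, master) if master == failed else (master, replica)
            for master, replica in pairs]


def master_load(pairs):
    """Count how many master slots each server is currently serving."""
    return Counter(master for master, _ in pairs)


if __name__ == "__main__":
    servers = ["store1", "store2", "store3", "store4"]  # hypothetical hosts
    pairs = build_slot_pairs(servers, slots_per_server=3)
    print("before:", dict(master_load(pairs)))
    print("after: ", dict(master_load(fail_over_server(pairs, "store1"))))
```

In this model each server starts with three master slots. When one server's slots fail over, each of the three remaining servers picks up exactly one extra slot rather than a single backup machine absorbing all three. The same role swap models the maintenance case, where a machine's slots are switched to their replicas before work begins.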

Monitoring

Even with all the redundancy we have in place, things can still go wrong, so it is important to have someone deal with the problem promptly when they do.

To ensure that any problems are dealt with as quickly as possible, we've implemented an extensive monitoring system. Every 2 minutes we test every port on every service (e.g. IMAP, POP, SMTP, Web, etc.) on every server to see that it's responding as expected. Also every 2 minutes, we test key parts of the computers' hardware and operating systems to see that the values are within acceptable limits. Additionally, every 10 minutes a script tests a significant portion of the web interface (login, send email, read email, external POP download, etc.) for every backend server. If any of these tests fail, 2 standby engineers are paged to look into the problem.
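
The simplest of these checks, confirming that a service port accepts connections, might look like the Python sketch below. This is a minimal illustration and not the monitoring system itself: it only verifies that a TCP connection can be opened, whereas a fuller check would also inspect the protocol response. The function names and the example host are hypothetical, and the 2-minute scheduling and paging steps are left out.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def failing_checks(targets, timeout=5.0):
    """Return the (host, port) targets that did not accept a connection."""
    return [(host, port) for host, port in targets
            if not port_is_open(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical host; the ports are the standard IMAP, POP3, SMTP and HTTP ports.
    targets = [("mail.example.com", p) for p in (143, 110, 25, 80)]
    for host, port in failing_checks(targets, timeout=2.0):
        print(f"ALERT: {host}:{port} is not responding")
```

A scheduler would run such a check at a fixed interval and hand any non-empty result to whatever paging system is in use.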

All these tests ensure that even in the case of a problem, it will be dealt with promptly.

While any problem is ongoing, we aim to keep users up to date on exactly what is happening via our status blog at http://status.fastmail.fm.

Location, Network & Power

Our main servers are located at NYI in New York City, USA. Their facility is a high-security, video-monitored location with backup power, air conditioning and fire systems, plus 24x7x365 monitoring and onsite technical support.

NYI use high quality Cisco switches for all networking, including completely separate switches for a secure internal network between all our servers.

NYI have one of the most reliable network infrastructures in the industry, with multiple Internet backbone providers carried by different telecommunication carriers for added redundancy, and a 100% uptime guarantee for external connectivity to their network.

You can see a layout of our cabinets if you're really interested.

People

Reliability also comes from having people who care about making their systems reliable. We believe in a proactive culture of solving problems for the long term. If a problem ever affects users, we fix the specific problem as quickly as possible in the short term, but we also consider why it actually occurred and how we can stop it, and any similar problems, from occurring in the future.