We know that storage and service reliability is important to our users. Ensuring that stored emails and files are always available and are never lost, damaged or corrupted is always one of our primary concerns.
Compare all the systems and precautions we take below with your current provider or in-house solution, and see if you can afford not to change.
We use a number of systems to ensure the reliable storage of all email on our servers.
Reliability starts with the hardware we choose. To ensure that all IMAP/POP email servers stay up and running continuously and reliably we use IBM servers. All servers have 24x7 support contracts with 4 hour response times from an IBM technician. All servers have dual power supplies and are connected to separate power circuits. All circuits are monitored to ensure there's no overuse that might trip a circuit breaker.
RAID storage is the first level of redundancy. RAID ensures that a single disk failure in any of our systems has no effect on running services: a disk can fail, and the system keeps running fine. When a drive does fail, we're alerted, and because we use hot-swap systems, we can immediately replace the failed disk and the system can immediately rebuild the redundant data.
We use high quality SATA-to-SCSI RAID storage units (based on high performance ARECA controllers with battery-backed non-volatile RAM caching). We keep additional spare units and drives on hand in case of any failures. As with the servers, they all have dual power supplies and are connected to separate monitored power circuits.
RAID is a great first level, but it doesn't help if a server fails, or if there's some form of file system corruption. That's where replication comes in. When any action is performed on a mailbox (eg email delivered, email copied, email deleted, etc), that action is immediately replicated to a completely separate second server. So if any server fails, we can immediately switch over to the replica server and continue providing service.
Most replication systems replicate at the disk block level. However this doesn't protect against operating system errors that introduce file system corruption (for example, corruption bugs in the XFS filesystem). Our replication system replicates at the email level: it understands the structure of mailboxes and what's happening to individual emails. This allows it to be more efficient, and it also protects the replica side from any low-level filesystem corruption.
On top of RAID and replication, we also maintain a completely separate backup system. The backup system runs nightly and takes an incremental backup of any new emails put in each mailbox over the last 24 hours. If any emails are deleted, it keeps a copy of the deleted email for 7 days.
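In outline, a nightly incremental pass with a 7-day retention window for deletions can be sketched like this (a simplified Python illustration, not our actual backup code; the data structures are illustrative stand-ins):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # deleted emails are kept for 7 days

def incremental_backup(mailbox: dict, backup: dict, deleted_at: dict, now: float) -> None:
    """One nightly pass: copy new emails into the backup store and
    expire deleted emails once their retention window has passed.

    mailbox/backup map message-id -> content; deleted_at records the
    time a backed-up email was first seen missing from the live mailbox.
    """
    # Back up any emails added since the last run.
    for msg_id, content in mailbox.items():
        if msg_id not in backup:
            backup[msg_id] = content
    # Note deletions, and drop copies older than the retention window.
    for msg_id in list(backup):
        if msg_id not in mailbox:
            when = deleted_at.setdefault(msg_id, now)
            if now - when > RETENTION_SECONDS:
                del backup[msg_id]
                del deleted_at[msg_id]

# A deleted email survives in the backup until 7 days have elapsed.
mailbox = {"a": b"first message"}
backup, deleted = {}, {}
incremental_backup(mailbox, backup, deleted, now=0)
del mailbox["a"]
incremental_backup(mailbox, backup, deleted, now=1)          # still retained
incremental_backup(mailbox, backup, deleted, now=8 * 24 * 3600)  # expired
```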
The backup system runs on completely separate servers (Sun x4500), with completely separate power circuits and separate RAID storage. The servers use a different operating system, have completely separate login credentials, and only access the main servers via a custom protocol that's designed just to feed incremental backup data to them. This keeps the backup servers as separated from the main servers as possible, so even in the worst case scenarios of power surges and/or physical destruction of a number of our IMAP servers, there is still a backup on a completely separate server.
RAID, replication and backups are great, but we also need to be able to protect emails from corruption. Most people don't think corruption is an issue, but recent research by CERN has shown that with today's large hard drives, this is a potentially serious problem, with an estimated corruption rate of 3 files in every TB of data. In most cases, corruption of data is a silent problem that people don't realise has happened until they need the data.
To deal with this, we ensure that as soon as an email is delivered to a mailbox, a SHA-1 checksum of that email is generated and stored in the email index.
When the email is replicated, the email content and the checksum are sent separately. We then generate the checksum on the replicated email content and check that it matches the original checksum, confirming that the email was replicated correctly.
We also repeat this procedure when the email is backed up, ensuring that the backup of the email is correct.
We also run a regular check process that takes blocks of emails and recomputes their checksums to see that they match what is in the index. If there are any issues, we're alerted and can determine which of the master, replica or backup copies is correct and fix the problem.
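The verification flow is straightforward (a minimal Python sketch; the function names and sample message are illustrative):

```python
import hashlib

def email_checksum(content: bytes) -> str:
    """Compute the SHA-1 checksum stored in the email index."""
    return hashlib.sha1(content).hexdigest()

def verify_copy(original_checksum: str, copy_content: bytes) -> bool:
    """Recompute the checksum on a replica or backup copy and compare.

    The email content and its checksum travel separately, so a
    mismatch indicates corruption during replication, backup or storage.
    """
    return email_checksum(copy_content) == original_checksum

# An intact copy passes; a corrupted copy is caught.
msg = b"From: alice@example.com\r\nSubject: hi\r\n\r\nHello"
digest = email_checksum(msg)
assert verify_copy(digest, msg)                     # intact copy
assert not verify_copy(digest, msg + b"corrupted")  # corrupted copy
```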
This checksumming gives an unsurpassed level of integrity checking to all our email data.
We use the widely distributed and used Cyrus IMAP/POP server software. This software is used by some of the largest educational institutions in the world to run their email systems for staff and students (eg Carnegie Mellon University and the University of Cambridge), ensuring that the software is continuously being used and stressed by millions of users around the world every day.
Cyrus also has a community of active developers, users and maintainers, ensuring the software is always being updated and any bugs are quickly dealt with and fixed. We actively contribute to this process and maintain a sizeable set of patches against the Cyrus IMAP/POP server. Over time these patches are being accepted into the main Cyrus code base, as most of them are related to reliability, consistency and performance improvements.
By building on widely used open software, we have a reliable base for our email server.
Our file storage system uses a custom-built database and storage infrastructure to provide reliable, feature-rich file storage.
RAID, Replication, Backups & Checksums
It uses all of the same features that our email storage system uses to provide the same level of reliability and redundancy, such as RAID as a first level redundancy, replication of all files to secondary servers, a nightly incremental backup of all files, and checksums of all files to ensure ongoing integrity.
Unlike our email servers which use a custom index and caching format to speed up IMAP queries, our file storage system uses a standard SQL database server. See below for more details.
While the majority of our data is stored as emails and files, there is still a significant amount of data in our regular SQL database such as user data, user preferences, address books, notepages, etc.
Our databases are stored on IBM servers, with IBM RAID controllers and IBM SCSI drives for the highest reliability. Our experience with IBM hardware has been excellent. In over 5 years of continuous use, these servers have excelled in their reliability and continuous operation.
Our database is continuously replicated to a second identical server, as well as to a third off-site secure server. The replication is continuously monitored to see that it stays up to date, and we're paged if it falls behind for any reason. In case of failure of the main database, we are paged and can easily switch to the replica.
We also do nightly hot backups of the database to another secure external site.
InnoDB Table Engine
We were early adopters of the InnoDB table engine and have found it one of the best storage engines available.
Transactions and the doublewrite buffer ensure that even in crash situations, no data is corrupted and the database recovers to a consistent state.
Hot backups allow nightly snapshots of the database to be made and taken offsite. Clustered indexing allows us to organise our data in exactly the way we want to ensure user data is accessed quickly.
To avoid single points of failure, all stateless frontend services use DNS load balancing with automatic failover via Linux high availability services. This system is simple, works for all protocols, and ensures there's no single piece of hardware that's a point of failure. All web servers, incoming email servers, spam checking servers, etc are automatically balanced by the frontend servers.
The way it works is that each service is assigned multiple IP addresses via DNS. The IP addresses are handed out randomly, so on average systems connecting to us via web, IMAP, POP, etc get one of the service's IPs at random. When all servers are working, these IP addresses are distributed around the frontend servers to keep the load amongst them balanced. The system continuously monitors the accessibility of the frontend servers, and if any server becomes unavailable, it moves the IP addresses of the affected service to one of the remaining up servers. This allows for easy scaling of frontend servers, as well as easy failover of services to other frontend servers if one machine becomes unavailable.
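The IP reassignment step can be sketched like this (a simplified Python illustration that redistributes all of a service's IPs round-robin over the live servers, rather than moving only the affected ones; server and IP names are made up):

```python
def assign_ips(service_ips: list, servers: list, up: set) -> dict:
    """Spread a service's IP addresses across the frontend servers that
    are currently up; IPs owned by a failed server move to survivors.
    Returns a mapping of ip -> server."""
    alive = [s for s in servers if s in up]
    if not alive:
        raise RuntimeError("no frontend servers available")
    # Round-robin the service IPs over the live servers.
    return {ip: alive[i % len(alive)] for i, ip in enumerate(service_ips)}

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
servers = ["fe1", "fe2", "fe3"]

# All servers up: one IP per frontend server.
healthy = assign_ips(ips, servers, up={"fe1", "fe2", "fe3"})

# fe2 fails: every IP is still served, now by fe1 and fe3 only.
degraded = assign_ips(ips, servers, up={"fe1", "fe3"})
```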
For backend services that are highly stateful (eg email IMAP/POP servers), we need a completely different setup. In this case we distribute email amongst a large number of "email slots" that are then spread around our servers. These slots are paired up to create a redundant replicated "email store" pair. In the case of a server failure, failover is not automatic. An engineer is paged to check the status of the server and whether it can be brought back online quickly. If it cannot, then we "failover" the affected email slots to their replica slots to restore service. In the cases where this has happened (twice in the last year), maximum downtime has been about 1 hour, and because only one server was involved, only about 10% of users were affected. This represents a 99.997% uptime average for all users.
In the cases where we need to do maintenance on a machine, we can use the same failover mechanism to switch the master/replica roles of all email slots on a machine in a matter of seconds. This means that performing maintenance on our email server machines results in no downtime for users.
An added advantage of our "email slots" setup is that in the case of failover, the additional load generated on the remaining servers is distributed evenly over those servers, rather than all the load being forced onto one backup server. This ensures that even during system failure or maintenance periods, there's no slowing of service to users and our email servers remain fast.
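The slot failover and its load-spreading property can be sketched like this (a simplified Python illustration, not our actual failover tooling; slot and server names are made up):

```python
def failover(slots: dict, failed: str) -> dict:
    """Promote the replica of every slot whose master lives on the
    failed server. slots maps slot -> (master, replica)."""
    out = {}
    for slot, (master, replica) in slots.items():
        if master == failed:
            out[slot] = (replica, master)  # replica takes over as master
        else:
            out[slot] = (master, replica)  # unaffected slots are unchanged
    return out

# Server s1's slots are paired with replicas spread over s2 AND s3, so
# when s1 fails its load fans out across both survivors rather than all
# landing on a single standby machine.
slots = {
    "slot1": ("s1", "s2"),
    "slot2": ("s1", "s3"),
    "slot3": ("s2", "s3"),
}
after = failover(slots, "s1")
# slot1 now runs on s2, slot2 on s3; slot3 is untouched.
```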
With all the redundancy we have in place, things can still go wrong, so when they do having someone to deal with the problem promptly is important.
To ensure that any problems are dealt with as quickly as possible, we've implemented an extensive monitoring system. Every 2 minutes we test every port on every service (eg IMAP, POP, SMTP, Web, etc) on every server to see that it's responding as expected. Also every 2 minutes we test key parts of each computer's hardware and operating system to see that the values are within acceptable limits. Additionally, every 10 minutes a script tests a significant portion of the web interface (login, send email, read email, external POP download, etc) for every backend server. If any of these tests fail, 2 standby engineers are paged to look into the problem.
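The per-port liveness test amounts to a timed TCP connection attempt against each service, something like this (a minimal Python sketch; the hostnames are placeholders, not our real service addresses):

```python
import socket

# Placeholder service list: (hostname, port) pairs to probe each sweep.
SERVICES = [("imap.example.com", 143), ("pop.example.com", 110),
            ("smtp.example.com", 25), ("www.example.com", 80)]

def port_alive(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checks() -> list:
    """One monitoring sweep; returns the services that failed the probe.
    A non-empty result is what would trigger a page to the engineers."""
    return [(host, port) for host, port in SERVICES
            if not port_alive(host, port)]
```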
All these tests ensure that even in the case of a problem, it will be dealt with promptly.
Also while there are any problems, we aim to keep users up to date with what exactly is happening on our status blog at http://status.fastmail.fm.
Our main servers are located at NYI in New York City, USA. Their facility is a high security, video monitored location, with backup power, air conditioning and fire suppression systems, and 24x7x365 monitoring and onsite technical support.
NYI use high quality Cisco switches for all networking, including completely separate switches for a secure internal network between all our servers.
NYI have one of the most reliable network infrastructures in the industry, with multiple internet backbone providers carried by different telecommunications carriers for added redundancy, and a 100% uptime guarantee for external connectivity to their network.
You can see a layout of our cabinets if you're really interested.
Reliability also comes from having people that care about making their systems reliable. We believe in a pro-active culture of solving problems for the long term. If there are ever problems affecting users, we ensure that we fix the specific problem as quickly as possible in the short term, but we also consider why the problem occurred in the first place, and think about how we can stop it, and any similar problems, occurring in the future.