During the next month, all Wikimedia wikis will be placed into read-only mode for a short period on two days. This action will allow the Wikimedia Foundation’s engineers to test services in the secondary data center in Texas (referred to as “codfw”) and to do planned maintenance.
The secondary data center is a replica of our primary cluster in Virginia. The main purpose of this data center is to improve the reliability and failover capabilities of Wikipedia and all of our sites for users around the world. Both data centers maintain full, up-to-date copies of the databases for Wikipedia and other projects, plus many other services. In case of any type of disaster at the primary data center in Virginia, the Technical Operations team expects to be able to transfer all traffic to the secondary data center in Texas within minutes.
We are planning a test to find out how quickly and reliably we can transfer all application server traffic and tightly coupled service dependencies to the secondary data center. Teams in Technology, and several outside of it, first performed this type of test in April 2016. Since then, the Technology department has improved its procedures and automated several steps, and we are now planning to run this test for a minimum of two weeks. This two-week window should also permit us to do some planned maintenance at the primary data center. At the end of the test period, we will transfer all of the traffic and services back to the primary data center.
Effect of this test on editors and other contributors to our sites
Ideally we’d make this switchover without affecting our users, but limitations in MediaWiki, the software that powers our wikis, prevent that at this time. When we switch from one data center to the other, we will have to place all wikis in read-only mode for a short time. We expect this step to take approximately 20 to 30 minutes each time.
During the two test weeks, we will also halt all non-essential code deployments. This means that the regular MediaWiki deployment process will be stopped, and no other non-critical deployments will take place.
The process for this test is complex, but we learned a lot from doing this last year, and we are hoping to make this process even simpler, faster, and more secure in the future. We hope to not only greatly reduce the disruption for our users and the time needed to make the switch, but also to reduce the amount of manual effort necessary. We appreciate your patience while we improve this essential infrastructure that helps us to keep useful information from the projects available on the Internet, free of charge, in perpetuity.
Faidon Liambotis, Principal Operations Engineer
You can read about a previous, successful failover test of this kind in a blog post from April 2016.