Site Outage Update — July 20, 2009
As previously reported, we have been having problems with site outages; another such outage occurred on July 19th. The good news is that we were able to trace this latest outage to the underlying problem: a hard disk malfunction on our database server. In the next couple of days, we will need to replace the hard disk, during which time the site will again be unavailable for several hours. We hope that replacement will resolve all of these problems properly this time, rather than just temporarily.
During the most recent outage, we also discovered corruption in our database, which caused the temporary loss of 25% of the wiki data. The wiki was put into read-only mode while the missing data was recovered. The corruption made our first set of backups unusable, and the system outages caused problems with our second set. Nevertheless, we maintain enough different, redundant sets of backups that we were able to recover virtually all of the lost data. In the end, only 1 of the 110,787 affected records could not be recovered (and that single lost record contained no important information).
For those who may be wondering, it is very unlikely that these site outages were caused by our recent site upgrades (either of the wiki or of the forums). Software upgrades cannot cause hardware failures (such as these hard disk problems), and especially cannot cause hardware failures on computers where the software is not even installed (the database server with the bad hard disk does not have any wiki or forum software installed on it). Furthermore, these problems are not symptoms of Denial-of-Service (DoS) attacks or any other type of cyber attack. Unfortunately, computer hardware does occasionally fail, and in some cases, such as this one, pinpointing the specific failure can take some time.
Update: Another temporary outage occurred on August 1, 2009; service was restored with no data loss.