Thanks, a2j :-)
Harry's put together a blog post about what went in, but the real biggie was the switch of our file server image to Ubuntu.
Here's what we thought would go wrong, and what actually did... probably in more detail than you want or need, but perhaps you'll find it interesting.
The risk we were concerned about was that it might take a long time to mount the old user storage volumes. When we go live with a new cluster, all of the volumes that store everyone's data are detached from the old cluster, attached to the new one, then mounted.
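To give a flavour of what "detach, attach, mount" actually involves, here's a minimal sketch of the shuffle for a single volume. To be clear, this isn't our real deploy code -- it just assumes EBS-style volumes driven by boto3, and all of the IDs, device names and mount points below are made up.

```python
import subprocess

import boto3

ec2 = boto3.client("ec2")

VOLUME_ID = "vol-0123456789abcdef0"   # hypothetical user storage volume
NEW_SERVER = "i-0fedcba9876543210"    # hypothetical file server in the new cluster

# Detach the volume from the old cluster's file server and wait until it's free...
ec2.detach_volume(VolumeId=VOLUME_ID)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# ...attach it to the new cluster's file server...
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_SERVER, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])

# ...and finally mount it.  This last step really happens on the file server
# itself (and needs root); it's shown inline here just to keep the sketch in
# one place.  On Xen-based instances /dev/sdf typically shows up as /dev/xvdf.
subprocess.check_call(["mount", "/dev/xvdf", "/mnt/user-storage"])
```

The real thing loops over a lot of volumes and has rather more error handling, but that's the basic dance.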
In our tests, we'd discovered that if we created a fresh storage volume from one of our backup snapshots of a real live user storage volume, then attached it to a file server based on the new image, it could take almost three hours to mount the first time. All of our final tests that the new system could mount the old system's volumes were based on such snapshots, so this was worrying: it could have meant a three-hour outage.
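For the curious, the test rig was doing the equivalent of this -- same caveats as the sketch above: boto3-flavoured, made-up IDs, not the code we actually ran.

```python
import boto3

ec2 = boto3.client("ec2")

# Restore a copy of a user storage volume from a backup snapshot...
volume = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",  # made-up snapshot ID
    AvailabilityZone="us-east-1a",        # made-up availability zone
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# ...and attach it to a test file server built from the new image.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",     # made-up test file server
    Device="/dev/sdf",
)
```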
But then we discovered that when we tried to mount such a snapshot on a server based on the old Debian image, it also took a very long time. So we thought that perhaps the slowness was an artifact of the "created from a snapshot" nature of the volumes we were mounting, and not something to do with the new file server image. In other words, it was the way we were testing, not the thing we were testing.
Unfortunately, the only way to find out whether this was the case was to do a deploy to the live environment, which is why we did it at 4am -- 4am-7am UTC is our slowest time of day in terms of users, so a three-hour outage then would cause the least harm.
When we did the deploy, it turned out that mounting the old user storage onto the new cluster was almost instant. We were OK!
What caught us out was something quite unrelated. We deploy new code from a checkout on a workstation in the office, as upgrading PythonAnywhere is the one thing we can't do from PythonAnywhere [image of someone performing brain surgery on himself]. This local checkout was on an Arch Linux VM I have running on my Windows 7 workstation; that's pretty much our standard procedure. But what I'd missed was that this time I'd checked out the codebase into a directory that was actually on the Windows filesystem, mounted into the Arch VM using VirtualBox's directory-sharing feature. That meant the checkout did not have the right file permissions -- everything in such a volume has 600 perms. And those incorrect permissions were on the files that were copied up to the machines in the new cluster.
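For anyone who wants to avoid the same trap: a few lines of Python are enough to show the symptom. This is just a sketch with a made-up checkout path, not part of our actual deploy tooling -- but on a native Linux filesystem a checkout has a mixture of modes, whereas on a VirtualBox shared folder every file reports the same mode (600 in our case), with no exec bits anywhere.

```python
# Print the distinct permission modes found in a checkout.
import stat
from pathlib import Path

CHECKOUT = Path("/home/user/checkouts/pythonanywhere")  # made-up path

modes = {
    oct(stat.S_IMODE(path.stat().st_mode))
    for path in CHECKOUT.rglob("*")
    if path.is_file()
}
print("Distinct file modes:", sorted(modes))
# A healthy checkout prints several modes (0o755, 0o644, ...); a checkout
# sitting on a VirtualBox shared folder prints just one, which is the
# red flag to look for before deploying.
```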
This meant that when we brought up the new cluster, a bunch of things that relied on our uploaded code having the right file permissions were broken. Most of this was easy enough to fix; we have relatively few dependencies on file permissions. But some of it wasn't; in particular, the console servers (which double as ssh servers) have some code that's executed before anyone logs in over ssh, and the file for that code had the wrong perms, so it couldn't execute. Which meant that we couldn't ssh into the console servers at all. Which, of course, made it impossible to finish off the configuration and make them live.
We only discovered this problem once the site was already down, so we had to recreate the console servers during the outage, which took a while. Once the new servers were up, we had to carefully slot them into the new cluster; that's not something we normally do before a cluster has gone live, but we were able to adapt the code we use to add a new console server to a running live cluster (when we're scaling up for heavier load), and after some scary moments, that did the job.
Anyway, all should be fine now. Harry and I are handing over to the other guys, and it's time to go home :-)