Drowning in data at the Wash-U Genome Center

So, how about a little bit more info on the Wash-U Genome Center, where I am visiting until the end of this week?

Walking past the guard into the anonymous-looking building, past the drab physical therapy suites to the elevator, you would be forgiven for not twigging that this was the hub for some of the most exciting work in science right now.

Things look up when you hit the 4th floor. A placard next to a first-generation ABI sequencer tells you it was one of the first machines used on the original Human Genome Project. Another one in the corner serves as a table for a vase. Moving down the corridor, genome nerds will get a frisson of excitement at the tantalising sight of large, blue machines glimpsed through the small square windows. To the genome nerd, those blue machines mean just one thing: data.

Here, genome data is being produced on a scale which would have been unimaginable even a few years ago.

Those blue machines are Solexa machines, what we used to call “next-generation sequencers”. Wash-U has 29 of them (soon to be 35). There are also eight 454 machines, the competing technology from Roche.

My original calculations figured that these machines were producing around 1 to 2 gigabases of DNA sequence per run, as per the manufacturer presentations. However, a customer as big as Wash-U works closely with Illumina, the manufacturer, and these numbers have already been rendered obsolete.

Each machine routinely produces 10 gigabases per run (running 75 cycles to make DNA fragments 75 nucleotide bases long). Each run takes around 10 days to complete. So, with it all running 24/7, that’s roughly 11 to 13 terabases a year, depending on whether you count the current 29 machines or the planned 35.

With a human genome commonly quoted as being 3 gigabases long, you are looking at 400 human genomes (or cancer genomes, or chimpanzee genomes) per year when sequenced at 10-fold coverage (i.e. every nucleotide in the genome sequenced 10 times on average, to improve accuracy).
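
For anyone who wants to check the sums, here is a back-of-envelope sketch in Python using only the figures quoted above: 10 gigabases per run, about 10 days per run, 29 machines now and 35 soon, and a 3-gigabase genome at 10-fold coverage. It assumes every machine runs back-to-back all year with no downtime, which is optimistic, and the function names are just for illustration.

```python
# Figures quoted in the post
GIGABASES_PER_RUN = 10   # output of one Solexa run
DAYS_PER_RUN = 10        # approximate duration of one run
GENOME_GIGABASES = 3     # commonly quoted human genome size
COVERAGE = 10            # each base sequenced 10 times on average


def yearly_gigabases(machines, days_per_year=365):
    """Total gigabases per year, assuming no downtime (an idealisation)."""
    runs_per_machine = days_per_year / DAYS_PER_RUN
    return machines * runs_per_machine * GIGABASES_PER_RUN


def genomes_per_year(machines):
    """Human-sized genomes per year at the stated coverage."""
    return yearly_gigabases(machines) / (GENOME_GIGABASES * COVERAGE)


for machines in (29, 35):
    terabases = yearly_gigabases(machines) / 1000
    print(f"{machines} machines: {terabases:.1f} terabases/year, "
          f"about {genomes_per_year(machines):.0f} genomes at {COVERAGE}x coverage")
```

The two fleet sizes bracket the figure of 400 genomes a year: a little over 350 with 29 machines, and a little over 425 with 35.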

And that’s just the tip of the iceberg. Sequencing throughput is out-pacing Moore’s Law right now by some margin, and there is no reason to think that the amount of data being pushed from each machine will not continue to increase.

With all that data generated, you need some pretty serious IT infrastructure. This year the genome center finished building a dedicated data center, located just over the road. It is connected to the machines through a dedicated 10-gigabit fibre connection. The data center is spacious, with plenty of room for expansion. In common with data centers in the commercial world, it is equipped with resilient power and air-conditioning.

What’s inside? The big numbers are 3 petabytes (3 thousand thousand gigabytes) of storage, connected to 3,000 processor cores! That’s a lot of hard drives when you see them all lined up. Gary Stiehr told me that at least one disk failed every 2 days.

So, all those sequencers and all that storage and processing capacity. What have I missed? Of course, people – there are around 300 staff in the genome center.

I’ll tell you more about how all this is co-ordinated, a huge task, in another update.