Big data: Welcome to the petacentre
Cory Doctorow has written an awesome article for Nature on the absolutely huge, and hugely expensive, petacentres.
Here’s what he wrote in the email notification for the article:
I wrote a feature for this week’s issue of the journal *Nature* on “petascale” data-centers: giant data-centers used in scholarship and science, from Google to the Large Hadron Collider to the Human Genome and Thousand Genome projects to the Internet Archive. The issue is on stands now and also available free online. Yesterday, I popped into Nature’s offices in London and recorded a special podcast on the subject, too. This was one of the coolest writing assignments I’ve ever been on, pure sysadmin porn. It was worth doing just to see the giant, Vader-cube tape-robots at CERN.
And here is an excerpt from the article itself:
At this scale, memory has costs. It costs money — 168 million Swiss francs (US$150 million) for data management at the new Large Hadron Collider (LHC) at CERN, the European particle-physics lab near Geneva. And it also has costs that are more physical. Every watt that you put into retrieving data and calculating with them comes out in heat, whether it be on a desktop or in a data centre; in the United States, the energy used by computers has more than doubled since 2000. Once you’re conducting petacalculations on petabytes, you’re into petaheat territory. Two floors of the Sanger data centre are devoted to cooling. The top one houses the current cooling system. The one below sits waiting for the day that the centre needs to double its cooling capacity. Both are sheathed in dramatic blue glass; the scientists call the building the Ice Cube.
The fallow cooling floor is matched in the compute centre below (these people all use ‘compute’ as an adjective). When Butcher was tasked with building the Sanger’s data farm he decided to implement a sort of crop rotation. A quarter of the data centre — 250 square metres — is empty, waiting for the day when the centre needs to upgrade to an entirely new generation of machines. When that day comes, Butcher and his team will set up in that empty space the yet-to-be-specified systems for power, cooling and the rest of it. Once the new centre is up, they’ll be able to shift operations from the obsolete old centre in sections, dismantling and rebuilding without a service interruption, leaving a new patch of the floor fallow — in anticipation of doing it all again in a distressingly short space of time.
The first rotation may come soon. Sequencing at the Sanger, and elsewhere, is getting faster at a dizzying pace — a pace made possible by the data storage facilities that are inflating to ever greater sizes. Take the human genome: the fact that there is now a reference genome sitting in digital storage brings a new generation of sequencing hardware into its own. The crib that the reference genome provides makes the task of adding together the tens of millions of short samples those machines produce a tractable one. It is what makes the 1000 Genomes Project, which the Sanger is undertaking in concert with the Beijing Genomics Institute in China and the US National Human Genome Research Institute, possible — and with it the project’s extraordinary aim of identifying every gene-variant present in at least 1% of Earth’s population.
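To get a feel for why a stored reference genome makes the job of placing millions of short samples tractable, here is a minimal sketch in Python. It is not how the Sanger or the 1000 Genomes Project actually does it; the function names (`build_index`, `place_read`), the toy sequences and the one-mismatch tolerance are all assumptions made up for illustration. The idea is simply that once the reference is indexed, each short read can be looked up rather than compared against every other read.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-letter substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def place_read(read, reference, index, k, max_mismatches=1):
    """Return the first reference position where the read fits, or None.

    Hypothetical helper: uses non-overlapping k-letter seeds from the read
    to find candidate positions, then counts mismatches over the full read.
    """
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # read would hang off the end of the reference
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                return start
    return None


# Toy data, invented for this example.
reference = "ACGTTAGCCGATAGGCTTACGATCGGA"
index = build_index(reference, k=4)

print(place_read("TAGCCGAT", reference, index, k=4))    # exact match
print(place_read("GGCTTACCAT", reference, index, k=4))  # one-letter variant
print(place_read("TTTTTTTT", reference, index, k=4))    # does not fit anywhere
```

The index is built once and reused for every read, which is the "crib" effect in miniature: the cost per read depends on how many seed hits it has, not on how many other reads there are. A read that differs from the reference by a letter still lands in the right place, and that difference is exactly the kind of variant a project like 1000 Genomes is looking for.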
Nature magazine:
http://www.nature.com/news/2008/080903/full/455016a.html
Podcast:
http://nature.edgeboss.net/download/nature/nature/podcast/extras/big-data-2008-09-04.mp3?ewk13=1
Flickr photos from the research:
http://www.flickr.com/photos/doctorow/sets/72157606675048531/
Cory did a great job on all of it: the article, the podcast and the Flickr photo gallery!
Get a feeling for “…the relentless march from kilo to mega to giga to tera to peta to exa to zetta to yotta. The mad, inconceivable growth of computer performance and data storage is changing science, knowledge, surveillance, freedom, literacy, the arts: everything that can be represented as data, or built on those representations. And in doing so it is putting endless strain on the people and machines that store the exponentially growing wealth of data involved. I’ve set out to see how the system administrators, or sysadmins, at some of the biggest scientific data centres take that strain, and to get a sense of how it feels to work with some of the biggest, coolest IT toys on the planet.”
Never in known history has it been possible to store so much data on so many people, along with the knowledge of mankind, marketing and banking: the intellectual data of the world. Awesome and scary at the same time. Hitler would have loved to have such capabilities during his reign…