Normally, when you run a compute project in AWS, you first have to move your data into the environment and then compute. AWS hosts the 1000 Genomes Project data, over 200 TB of it, so you can run compute jobs against it without moving the data into the environment yourself.

The 1000 Genomes Project

We're very pleased to welcome the 1000 Genomes Project data to Amazon S3.

The original human genome project was a huge undertaking. It aimed to identify every letter of our genetic code, 3 billion DNA bases in total, to help guide our understanding of human biology. The project ran for over a decade, cost billions of dollars and became the cornerstone of modern genomics. The techniques and tools developed for the human genome were also put into practice in sequencing other species, from the mouse to the gorilla, from the hedgehog to the platypus. By comparing the genetic code between species, researchers can identify biologically interesting genetic regions for all species, including us.

This is a lot of data.

The data is vast (the current set weighs in at over 200 TB). Hosting it on S3, close to the computational resources of EC2, means that anyone with an AWS account can start using it in their research, from anywhere with internet access and at any scale, while paying only for the compute power they need, as and when they use it. This enables researchers from laboratories of all sizes to start exploring and working with the data straight away. The Cloud BioLinux AMIs are ready to roll with the necessary tools and packages, and are a great place to get going.

Making the data available via a bucket in S3 also means that customers can crunch the information using Hadoop via Elastic MapReduce, and take advantage of the growing collection of tools for running bioinformatics job flows, such as CloudBurst and Crossbow.
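Because the data sits in an S3 bucket, a quick way to see what is there is to ask S3 for a listing over plain HTTPS. The snippet below is a minimal sketch using only the Python standard library. It assumes the public bucket is named 1000genomes and allows anonymous listing; neither is stated above, so check the AWS page before relying on it. The helper names (list_url, parse_keys, list_objects) are made up for this example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the public bucket is called "1000genomes" and permits
# anonymous listing. Confirm the name on the AWS page before use.
BUCKET = "1000genomes"
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def list_url(prefix="", max_keys=10, bucket=BUCKET):
    """Build an anonymous S3 ListObjectsV2 request URL for a public bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_text):
    """Extract (key, size_in_bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_objects(prefix="", max_keys=10):
    """Fetch one page of the bucket listing over HTTPS (needs network access)."""
    with urllib.request.urlopen(list_url(prefix, max_keys)) as response:
        return parse_keys(response.read().decode("utf-8"))
```

Calling list_objects(max_keys=5) should return the first few object keys with their sizes if the assumptions hold. From an EC2 instance the same request stays inside AWS, which is the point of hosting the data next to the compute.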

It is interesting to think that AWS is hosting data that is too expensive for people to move around.
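To put a rough number on "too expensive to move", here is a back-of-the-envelope sketch of how long a copy of the data would take. The 200 TB figure comes from the text above; the link speeds and the efficiency factor are illustrative assumptions only, and the function name transfer_days is made up for this example.

```python
def transfer_days(size_tb, link_gbps, efficiency=1.0):
    """Estimate days needed to move size_tb terabytes over a link.

    size_tb    -- data size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps  -- nominal link speed in gigabits per second (assumed value)
    efficiency -- fraction of the nominal speed actually achieved (assumed)
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400  # seconds per day


# 200 TB over a fully used, dedicated 1 Gbps link
print(round(transfer_days(200, 1), 1))
# The same copy if only half the nominal speed is achieved
print(round(transfer_days(200, 1, efficiency=0.5), 1))
```

Under these assumptions the copy takes about 18.5 days at a perfectly sustained 1 Gbps and about 37 days at half that rate, before counting storage on the receiving end.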

More information can be found here: http://aws.amazon.com/1000genomes/

If you want to get the data yourself, it is also available from the sources below.

Other Sources

The 1000 Genomes project data are also freely accessible through the 1000 Genomes website, and from each of the two institutions that work together as the project Data Coordination Centre (DCC).

The NIH National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine at NIH:

ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes

ftp6.ncbi.nlm.nih.gov (for IPv6 access)

http://www.ncbi.nlm.nih.gov/projects/faspftp/1000genomes/ (via Aspera)

The European Bioinformatics Institute (EMBL-EBI), with support from the Wellcome Trust:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/

http://www.1000genomes.org/aspera (via Aspera)
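If you do want to pull individual files from one of the FTP mirrors above, the following is a small sketch using only the Python standard library. The base URLs are the ones listed above. The helper names (mirror_url, download) are made up for this example, and the directory layout below each base URL is not described here, so any relative path you pass in is your own assumption to verify.

```python
import urllib.request

# Base URLs copied from the mirror list above.
MIRRORS = {
    "ncbi": "ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes",
    "ebi": "ftp://ftp.1000genomes.ebi.ac.uk/vol1/",
}


def mirror_url(mirror, relative_path):
    """Join a mirror's base URL and a relative path with exactly one slash."""
    base = MIRRORS[mirror]
    return base.rstrip("/") + "/" + relative_path.lstrip("/")


def download(mirror, relative_path, destination):
    """Download one file from the chosen mirror (needs network access)."""
    url = mirror_url(mirror, relative_path)
    urllib.request.urlretrieve(url, destination)
    return destination
```

For example, mirror_url("ebi", "some/dir/file.txt") gives ftp://ftp.1000genomes.ebi.ac.uk/vol1/some/dir/file.txt, where some/dir/file.txt is a placeholder and not a real path on the server.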