Google Research has a post reaching out to the academic community.

Google Cluster Data

Thursday, January 07, 2010 at 08:11:00 AM

Posted by Joseph L. Hellerstein, Manager of Google Performance Analytics
Google faces a large number of technical challenges in the evolution of its applications and infrastructure. In particular, as we increase the size of our compute clusters and scale the work that they process, many issues arise in how to schedule the diversity of work that runs on Google systems.

The areas of interest for Google are:

We have distilled these challenges into the following research topics that we feel are interesting to the academic community and important to Google:

Workload characterizations: How can we characterize Google workloads in a way that readily generates synthetic work that is representative of production workloads so that we can run stand-alone benchmarks?

Predictive models of workload characteristics: What is normal and what is abnormal workload? Are there "signals" that can indicate problems in a time-frame that is possible for automated and/or manual responses?

New algorithms for machine assignment: How can we assign tasks to machines so that we make best use of machine resources, avoid excess resource contention on machines, and manage power efficiently?

Scalable management of cell work: How should we design the future cell management system to efficiently visualize work in cells, to aid in problem determination, and to provide automation of management tasks?

The Google Cluster data is here.

This project is intended for the distribution of data of production workloads running on Google clusters.

The first dataset (data-1) provides traces over a 7-hour period. The workload consists of a set of tasks, where each task runs on a single machine. Tasks consume memory and one or more cores (in fractional units). Each task belongs to a single job; a job may have multiple tasks (e.g., mappers and reducers).

The data have been anonymized in several ways: there are no task or job names, just numeric identifiers; timestamps are relative to the start of data collection; the consumption of CPU and memory is obscured using a linear transformation. However, even with these transformations of the data, researchers will be able to do workload characterizations (up to a linear transformation of the true workload) and workload generation.
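To illustrate why characterization survives the obscuring step, here is a minimal Python sketch. It assumes the transformation has the form scale * x + offset with a positive scale; the post does not give the actual coefficients, and the values, the `obscure` helper, and the coefficients below are all made up for illustration. Under that assumption, the correlation between CPU and memory and the ordering of tasks by usage are the same in the obscured data as in the true data.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def obscure(values, scale, offset):
    """Hypothetical obscuring step: scale * v + offset (an assumed form)."""
    return [scale * v + offset for v in values]


# Made-up "true" usage values; not taken from the real trace.
true_cpu = [0.5, 1.0, 2.0, 4.0, 1.5]
true_mem = [1.0, 1.5, 4.0, 6.0, 2.0]

# Made-up coefficients; the real ones are not published.
hidden_cpu = obscure(true_cpu, scale=0.1, offset=0.02)
hidden_mem = obscure(true_mem, scale=0.05, offset=0.0)

r_true = pearson(true_cpu, true_mem)
r_hidden = pearson(hidden_cpu, hidden_mem)


def order(values):
    """Indices of the values sorted from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)


same_ranking = order(true_cpu) == order(hidden_cpu)
print(abs(r_true - r_hidden) < 1e-9, same_ranking)
```

Absolute quantities, such as the number of cores a task really used, cannot be recovered this way, which is what "up to a linear transformation" implies.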

The data are structured as blank-separated columns. Each row reports on the execution of a single task during a five-minute period.

Time (int) - time in seconds since the start of data collection

JobID (int) - Unique identifier of the job to which this task belongs

TaskID (int) - Unique identifier of the executing task

Job Type (0, 1, 2, 3) - class of job (a categorization of work)

Normalized Task Cores (float) - normalized value of the average number of cores used by the task

Normalized Task Memory (float) - normalized value of the average memory consumed by the task
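The column layout above can be read with a few lines of Python. This is a sketch under assumptions: the `TaskSample` type, the function names, and the sample rows are hypothetical, and whether the real file starts with a header line is not stated in the post, so non-numeric rows are simply skipped.

```python
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple


class TaskSample(NamedTuple):
    time: int      # seconds since the start of data collection
    job_id: int    # job to which the task belongs
    task_id: int   # the executing task
    job_type: int  # 0, 1, 2, or 3
    cores: float   # normalized average number of cores used
    memory: float  # normalized average memory consumed


def parse_rows(lines: Iterable[str]) -> Iterator[TaskSample]:
    """Yield one TaskSample per well-formed, blank-separated row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        try:
            yield TaskSample(
                int(fields[0]), int(fields[1]), int(fields[2]),
                int(fields[3]), float(fields[4]), float(fields[5]),
            )
        except ValueError:
            continue  # skip non-numeric rows, e.g. a header if present


def tasks_per_job(samples: Iterable[TaskSample]) -> dict:
    """Count the distinct tasks observed for each job."""
    tasks = defaultdict(set)
    for s in samples:
        tasks[s.job_id].add(s.task_id)
    return {job: len(ids) for job, ids in tasks.items()}


# Made-up rows for illustration only; not taken from the real trace.
sample_lines = [
    "time job task type cores memory",  # hypothetical header line
    "90000 1 10 0 0.25 0.5",
    "90000 1 11 0 0.5 0.25",
    "90300 2 20 3 0.125 0.75",
]
samples = list(parse_rows(sample_lines))
job_sizes = tasks_per_job(samples)
print(job_sizes)
```

For the downloaded archive, the same `parse_rows` function could be fed from `gzip.open(path, "rt")`, where `path` points at the local copy of the file.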

Please let us know about issues you have with the data.

So far there have been 230 downloads.

Filename: google-cluster-data-1.csv.gz
Summary: 7+ hours of workload traces from a Google production cluster
Uploaded: Dec 18
Size: 29.8 MB
Download count: 230
