Crafting Big Data Benchmarks

Tilmann Rabl.

International Supercomputing Conference, 2014.
Invited Talk.


The Workshops for Big Data Benchmarking (WBDB), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios. These scenarios involve processing pipelines with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and subsequent analysis, including visualization, data mining, predictive analytics, and, eventually, decision making. One outcome of the workshops has been the formation of a Transaction Processing Performance Council (TPC) subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on TeraSort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also under active development, including BigBench, which extends the TPC-DS benchmark for big data scenarios; the Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics at different steps in the pipeline. For example, early steps may process very large volumes of data and may benefit from a Hadoop and MapReduce style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.


Tags: big data, benchmarking

Readers who enjoyed the above work may also like the following:

  • Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data.
    Chaitanya Baru, Milind Bhandarkar, Carlo Curino, Manuel Danisch, Michael Frank, Bhaskar Gowda, Hans-Arno Jacobsen, Huang Jie, Dileep Kumar, Raghunath Nambiar, Meikel Poess, Francois Raab, Tilmann Rabl, Nishkam Ravi, Kai Sachs, Saptak Sen, Lan Yi, and Choonhan Youn.
    In Sixth TPC Technology Conference on Performance Evaluation & Benchmarking, pages 44-63, 2014. Springer Berlin Heidelberg.
    Tags: bigbench, big data, benchmarking
  • BigBench Specification V0.1.
    Tilmann Rabl, Ahmad Ghazal, Minqing Hu, Alain Crolotte, Francois Raab, Meikel Poess, and Hans-Arno Jacobsen.
    In Proceedings of the 2012 Workshop on Big Data Benchmarking, pages 164-202, 2013.
    Tags: bigbench, big data, benchmarking
  • Big Data Generation.
    Tilmann Rabl and Hans-Arno Jacobsen.
    In Proceedings of the Workshop on Big Data Benchmarking, pages 20-27, 2013.
    Tags: pdgf, big data, benchmarking