Big Data Generation
Abstract
Big data challenges are end-to-end problems. Big data usually has to be
preprocessed, moved, loaded, processed, and stored many times, which
has led to the creation of big data pipelines. Current big data benchmarks
focus only on isolated aspects of this pipeline, usually the
processing, storage, and loading aspects. To date, no benchmark has been
presented that covers the end-to-end aspect of big
data systems.
In this paper, we discuss the necessity of ETL-like tasks in big data
benchmarking and propose the Parallel Data Generation Framework (PDGF)
for the required data generation. PDGF is a generic data generator that was
implemented at the University of Passau and is currently adopted in TPC benchmarks.
Tags: pdgf, big data, benchmarking
Readers who enjoyed the above work may also like the following:
- Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query Performance.
Tilmann Rabl, Meikel Poess, Hans-Arno Jacobsen, Patrick O'Neil, and Elizabeth O'Neil. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, 2013.
Tags: star schema benchmark, ssb, parallel data generation framework, pdgf, benchmarking, skew
- Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data.
Chaitanya Baru, Milind Bhandarkar, Carlo Curino, Manuel Danisch, Michael Frank, Bhaskar Gowda, Hans-Arno Jacobsen, Huang Jie, Dileep Kumar, Raghunath Nambiar, Meikel Poess, Francois Raab, Tilmann Rabl, Nishkam Ravi, Kai Sachs, Saptak Sen, Lan Yi, and Choonhan Youn. In Sixth TPC Technology Conference on Performance Evaluation & Benchmarking, pages 44-63, 2014. Springer Berlin Heidelberg.
Tags: bigbench, big data, benchmarking
- BigBench Specification V0.1.
Tilmann Rabl, Ahmad Ghazal, Minqing Hu, Alain Crolotte, Francois Raab, Meikel Poess, and Hans-Arno Jacobsen. In Proceedings of the 2012 Workshop on Big Data Benchmarking, pages 164-202, 2013.
Tags: bigbench, big data, benchmarking