Efficient Update Data Generation for DBMS Benchmarks

Michael Frank, Meikel Poess, and Tilmann Rabl.

In ICPE '12: Proceedings of the ACM International Conference on Performance Engineering, 2012.


Industry standard benchmarks have proven crucial to the innovation and productivity of the computing industry. They are important to the fair and standardized assessment of performance across different vendors, different system versions from the same vendor, and different architectures. Good benchmarks are even meant to drive industry and technology forward. However, once all reasonable advances have been made using a particular benchmark, even a good benchmark becomes obsolete. This is why standard consortia periodically overhaul their existing benchmarks or develop new ones. An extremely time- and resource-consuming task in the creation of new benchmarks is the development of benchmark generators, especially because benchmarks tend to become more and more complex. The Parallel Data Generation Framework (PDGF) is a generic data generator that is capable of generating the data for the initial load of arbitrary relational schemas. It was, however, not able to generate data for the actual workload, i.e., transactions, incremental load, etc., mainly because it did not understand the notion of updates. Updates are data changes that occur over time, e.g., a customer changes address, switches jobs, gets married, or has children. Many benchmarks need to reflect these changes during their workloads. In this paper we describe extensions to the first version of PDGF that enable the generation of update data.


Readers who enjoyed the above work may also like the following:

  • Grand Challenge: High Performance Stream Queries in Scala.
    Dantong Song, Kaiwen Zhang, Tilmann Rabl, Prashanth Menon, and Hans-Arno Jacobsen.
    In DEBS, 2015.
    Tags: grand challenge, spark, scala, taxi monitoring
  • Just can't get enough - Synthesizing Big Data.
    Tilmann Rabl, Manuel Danisch, Michael Frank, Sebastian Schindler, and Hans-Arno Jacobsen.
    In Proceedings of the ACM SIGMOD Conference, 2015.
    Demonstration Track.
    Tags: pdgf, dbsynth, data generation
  • DualTable: A Hybrid Storage Model for Update Optimization in Hive.
    Songlin Hu, Wantao Liu, Tilmann Rabl, Shuo Huang, Ying Liang, Zhang Xiao, Hans-Arno Jacobsen, Xubin Pei, and Jiye Wang.
    In Proceedings of the 31st International Conference on Data Engineering, 2015.
    Tags: big data, hadoop, dualtable