TPC-DI: The First Industry Benchmark for Data Integration

Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, and Brian Caufield.

VLDB 2014, Proceedings of the VLDB Endowment, 7(13):1367-1378, 2014.

Abstract

Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting such a process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive term data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation, and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing, and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable, and easy-to-maintain data integration system. This is especially important as the complexity, variety, and volume of data are constantly increasing and the performance of data integration systems is becoming critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing the performance of such systems. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics, including workload, run rules, and metric, and explains key decisions.

Tags: benchmarking, tpc, tpc-di, etl, data integration


Readers who enjoyed the above work may also like the following:


  • PSBench: A Benchmark for Content- and Topic-based Publish/Subscribe Systems.
    Kaiwen Zhang, Tilmann Rabl, Yi Ping Sun, Rushab Kumar, Nayeem Zen, and Hans-Arno Jacobsen.
    In Middleware Demos, 2014.
    Tags: pub/sub, pub/sub applications, publish/subscribe, benchmarking
  • Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data.
    Chaitanya Baru, Milind Bhandarkar, Carlo Curino, Manuel Danisch, Michael Frank, Bhaskar Gowda, Hans-Arno Jacobsen, Huang Jie, Dileep Kumar, Raghunath Nambiar, Meikel Poess, Francois Raab, Tilmann Rabl, Nishkam Ravi, Kai Sachs, Saptak Sen, Lan Yi, and Choonhan Youn.
    In Sixth TPC Technology Conference on Performance Evaluation & Benchmarking, pages 44-63, 2014. Springer Berlin Heidelberg.
    Tags: bigbench, big data, benchmarking
  • BigBench Specification V0.1.
    Tilmann Rabl, Ahmad Ghazal, Minqing Hu, Alain Crolotte, Francois Raab, Meikel Poess, and Hans-Arno Jacobsen.
    In Proceedings of the 2012 Workshop on Big Data Benchmarking, pages 164-202, 2013.
    Tags: bigbench, big data, benchmarking