High throughput data redundancy removal algorithm with scalable performance

Souvik Bhattacherjee*, Ankur Narang, Vikas K. Garg

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference article in proceedingsScientificpeer-review

12 Citations (Scopus)

Abstract

The ever growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, web pages, stock markets, medical records and other domains has triggered worldwide research in data intensive computing. A key requirement here involves removing redundancy from data, as this enhances the compute efficiency for downstream data processing. These application domains have an intense need for high throughput data deduplication for huge volumes of data flowing at the rate of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm. We also present a queueing theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500M records, our parallel algorithm can perform complete deduplication in 255s, on 16 core Intel Xeon 5570 architecture. This gives a throughput of around 2M records/s. For 2048 byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm for both multi-core Power6 and Intel Xeon 5570 architectures.

Original languageEnglish
Title of host publicationHiPEAC'11 - Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Pages87-96
Number of pages10
DOIs
Publication statusPublished - 2011
MoE publication typeA4 Conference publication
EventInternational Conference on High Performance and Embedded Architectures and Compilers - Heraklion, Greece
Duration: 24 Jan 201126 Jan 2011
Conference number: 6

Conference

ConferenceInternational Conference on High Performance and Embedded Architectures and Compilers
Abbreviated titleHiPEAC
Country/TerritoryGreece
CityHeraklion
Period24/01/201126/01/2011

Keywords

  • Bloom filter
  • Data redundancy removal
  • Deduplication
  • Multicore architecture
  • Parallel algorithms
  • Queueing theory

Fingerprint

Dive into the research topics of 'High throughput data redundancy removal algorithm with scalable performance'. Together they form a unique fingerprint.

Cite this