Abstract
The ever growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, web pages, stock markets, medical records and other domains has triggered worldwide research in data intensive computing. A key requirement here involves removing redundancy from data, as this enhances the compute efficiency for downstream data processing. These application domains have an intense need for high throughput data deduplication for huge volumes of data flowing at the rate of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm. We also present a queueing theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500M records, our parallel algorithm can perform complete deduplication in 255s, on 16 core Intel Xeon 5570 architecture. This gives a throughput of around 2M records/s. For 2048 byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm for both multi-core Power6 and Intel Xeon 5570 architectures.
Original language | English |
---|---|
Title of host publication | HiPEAC'11 - Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers |
Pages | 87-96 |
Number of pages | 10 |
DOIs | |
Publication status | Published - 2011 |
MoE publication type | A4 Conference publication |
Event | International Conference on High Performance and Embedded Architectures and Compilers - Heraklion, Greece Duration: 24 Jan 2011 → 26 Jan 2011 Conference number: 6 |
Conference
Conference | International Conference on High Performance and Embedded Architectures and Compilers |
---|---|
Abbreviated title | HiPEAC |
Country/Territory | Greece |
City | Heraklion |
Period | 24/01/2011 → 26/01/2011 |
Keywords
- Bloom filter
- Data redundancy removal
- Deduplication
- Multicore architecture
- Parallel algorithms
- Queueing theory