High throughput data redundancy removal algorithm with scalable performance

Souvik Bhattacherjee*, Ankur Narang, Vikas K. Garg

*Tämän työn vastaava kirjoittaja

Tutkimustuotos: Artikkeli kirjassa/konferenssijulkaisussaConference contributionScientificvertaisarvioitu

11 Sitaatiot (Scopus)

Abstrakti

The ever growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, web pages, stock markets, medical records and other domains has triggered worldwide research in data intensive computing. A key requirement here involves removing redundancy from data, as this enhances the compute efficiency for downstream data processing. These application domains have an intense need for high throughput data deduplication for huge volumes of data flowing at the rate of 1 GB/s or more. In this paper, we present the design of a novel parallel data redundancy removal algorithm. We also present a queueing theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500M records, our parallel algorithm can perform complete deduplication in 255s, on 16 core Intel Xeon 5570 architecture. This gives a throughput of around 2M records/s. For 2048 byte records, we achieve a throughput of 0.81 GB/s. To the best of our knowledge, this is the highest throughput for data redundancy removal on such massive datasets. We also demonstrate strong and weak scalability of our algorithm for both multi-core Power6 and Intel Xeon 5570 architectures.

AlkuperäiskieliEnglanti
OtsikkoHiPEAC'11 - Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Sivut87-96
Sivumäärä10
DOI - pysyväislinkit
TilaJulkaistu - 2011
OKM-julkaisutyyppiA4 Artikkeli konferenssijulkaisuussa
TapahtumaInternational Conference on High Performance and Embedded Architectures and Compilers - Heraklion, Kreikka
Kesto: 24 tammikuuta 201126 tammikuuta 2011
Konferenssinumero: 6

Conference

ConferenceInternational Conference on High Performance and Embedded Architectures and Compilers
LyhennettäHiPEAC
MaaKreikka
KaupunkiHeraklion
Ajanjakso24/01/201126/01/2011

Sormenjälki Sukella tutkimusaiheisiin 'High throughput data redundancy removal algorithm with scalable performance'. Ne muodostavat yhdessä ainutlaatuisen sormenjäljen.

Siteeraa tätä