Streaming similarity self-join

Gianmarco De Francisci Morales, Aristides Gionis

Research output: Article in book/conference proceedings, Conference contribution, Scientific, peer-reviewed

20 Citations (Scopus)
163 Downloads (Pure)

Abstract

We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deeply in the working of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
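The idea of time-dependent similarity can be illustrated with a minimal Python sketch. It is not the MB or STR framework and does not use the L2 index: it is a brute-force scan over a sliding window. The exponential decay form, the use of cosine similarity, and all names (`cosine`, `streaming_self_join`, `theta`, `decay`) are assumptions made for illustration; the abstract only states that similarity decreases with the difference in arrival time. Under these assumptions, an item older than `ln(1/theta) / decay` can no longer exceed the threshold with any new arrival, so it can be evicted and memory stays bounded.

```python
import math
from collections import deque


def cosine(x, y):
    """Cosine similarity of two sparse vectors given as {dimension: weight} dicts."""
    dot = sum(w * y.get(d, 0.0) for d, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def streaming_self_join(stream, theta, decay):
    """Yield (t_old, t_new, sim) for pairs whose time-decayed similarity exceeds theta.

    stream: iterable of (timestamp, vector) pairs in non-decreasing timestamp order.
    theta:  similarity threshold, assumed to lie in (0, 1).
    decay:  assumed exponential decay rate (> 0); hypothetical parameter.
    """
    # Cosine similarity is at most 1, so once exp(-decay * dt) drops below theta
    # no pair that far apart in time can qualify. This gives a time horizon.
    horizon = math.log(1.0 / theta) / decay
    window = deque()  # items still within the horizon, oldest first
    for t, x in stream:
        # Evict items that are too old to match the current or any later item.
        while window and t - window[0][0] > horizon:
            window.popleft()
        # Brute-force comparison against the window (an index would prune this).
        for t_old, y in window:
            sim = cosine(x, y) * math.exp(-decay * (t - t_old))
            if sim > theta:
                yield (t_old, t, sim)
        window.append((t, x))


if __name__ == "__main__":
    demo = [(0.0, {"a": 1.0}), (1.0, {"a": 1.0}), (100.0, {"a": 1.0})]
    for pair in streaming_self_join(demo, theta=0.5, decay=0.1):
        print(pair)
```

In the demo stream, the third item arrives long after the horizon has passed, so the first two items are evicted before it is compared against anything.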

Original language: English
Title: Proceedings of the VLDB Endowment
Publisher: ACM
Pages: 792-803
Number of pages: 12
Status: Published - 2016
MoE publication type: A4 Article in conference proceedings
Event: International Conference on Very Large Databases - New Delhi, India
Duration: 5 Sep 2016 to 9 Sep 2016
Conference number: 42

Publication series

Name: Proceedings of the VLDB Endowment
Publisher: Association for Computing Machinery
Number: 10
Volume: 9
ISSN (print): 2150-8097

Conference

Conference: International Conference on Very Large Databases
Country/Territory: India
City: New Delhi
Period: 05/09/2016 to 09/09/2016
