Efficient set intersection counting algorithm for text similarity measures

Preethi Lahoti, Patrick K. Nicholson, Bilyana Taneva

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

2 Citations (Scopus)

Abstract

Set intersection counting appears as a subroutine in many techniques used in natural language processing, in which similarity is often measured as a function of document cooccurence counts between pairs of noun phrases or entities. Such techniques include clustering of text phrases and named entities, topic labeling, entity disambiguation, sentiment analysis, and search for synonyms. These techniques can have real-time constraints that require very fast computation of thousands of set intersection counting queries with little space overhead and minimal error. On one hand, while sketching techniques for approximate intersection counting exist and have very fast query time, many have issues with accuracy, especially for pairs of lists that have low Jaccard similarity. On the other hand, space-efficient computation of exact intersection sizes is particularly challenging in real-time. In this paper, we show how an efficient spacetime trade-off can be achieved for exact set intersection counting, by combining state-of-the-art algorithms with precomputation and judicious use of compression. In addition, we show that the performance can be further improved by combining the best aspects of these algorithms. We present experimental evidence that realtime computation of exact intersection sizes is feasible with low memory overhead: we improve the mean query time of baseline approaches by over a factor of 100 using a data structure that takes merely twice the size of an inverted index. Overall, in our experiments, we achieve running times within the same order of magnitude as well-known approximation techniques.

Original languageEnglish
Title of host publication19th Workshop on Algorithm Engineering and Experiments 2017, ALENEX 2017
Pages146-158
Number of pages13
ISBN (Electronic)9781611974768
DOIs
Publication statusPublished - 2017
MoE publication typeA4 Article in a conference publication
EventWorkshop on Algorithm Engineering and Experiments - Barcelona, Spain
Duration: 17 Jan 201718 Jan 2017
Conference number: 19

Workshop

WorkshopWorkshop on Algorithm Engineering and Experiments
Abbreviated titleALENEX
Country/TerritorySpain
CityBarcelona
Period17/01/201718/01/2017

Fingerprint

Dive into the research topics of 'Efficient set intersection counting algorithm for text similarity measures'. Together they form a unique fingerprint.

Cite this