ϵ-Coresets for clustering (with Outliers) in doubling metrics

Lingxiao Huang, Shaofeng H.C. Jiang, Jian Li, Xuan Wu

Research output: Chapter in Book/Report/Conference proceedingConference contributionScientificpeer-review

11 Citations (Scopus)

Abstract

We study the problem of constructing ϵ-coresets for the (k, z)-clustering problem in a doubling metric M(X, d). An ϵ-coreset is a weighted subset S ⊆ X with weight function w: S → R ≥0 , such that for any k-subset C ϵ [X] k , it holds that Σ x ∈S w(x) · d z (x, C) ∈ (1 ± ϵ) · Σ x∈X d z (x, C). We present an efficient algorithm that constructs an ϵ-coreset for the (k, z)-clustering problem in M(X, d), where the size of the coreset only depends on the parameters k, z, e and the doubling dimension ddim(M). To the best of our knowledge this is the first efficient 6-coreset construction of size independent of |X| for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of M (X, d) and the shattering dimension (or VC-dimension) of the range space induced by the distance d. Such a relation is not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1 ± ϵ)-distortion of the distance function d (the distorted distance is called the smoothed distance function), the shattering dimension can be upper bounded by O(ϵ O (ddim(M)) ). For the purpose of coreset construction, the above bound does not suffice as it only works for unweighted spaces. Therefore, we introduce the notion of τ-error probabilistic shattering dimension, and prove a (drastically better) upper bound of O(ddim(M)·log(1/ϵ)+log log1/τ) for the probabilistic shattering dimension for weighted doubling metrics. As it turns out, an upper bound for the probabilistic shattering dimension is enough for constructing a small coreset. We believe the new relation between doubling and shattering dimensions is of independent interest and may find other applications. Furthermore, we study robust coresets for (k, z)-clustering with outliers in a doubling metric. We show an improved connection between α-approximation and robust coresets. This also leads to improvement upon the previous best known bound of the size of robust coreset for Euclidean space [Feldman and Langberg, STOC 11]. The new bound entails a few new results in clustering and property testing. As another application, we show constant-sized (ϵ, k, z)-centroid sets in doubling metrics can be constructed by extending our coreset construction. Prior to our result, constant-sized centroid sets for general clustering problems were only known for Euclidean spaces. We can apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.

Original languageEnglish
Title of host publicationProceedings - 59th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2018
EditorsMikkel Thorup
PublisherIEEE Computer Society
Pages814-825
Number of pages12
ISBN (Electronic)9781538642306
DOIs
Publication statusPublished - 30 Nov 2018
MoE publication typeA4 Article in a conference publication
EventAnnual Symposium on Foundations of Computer Science - Paris, France
Duration: 7 Oct 20189 Oct 2018
Conference number: 59

Publication series

NameProceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS
Volume2018-October
ISSN (Print)0272-5428

Conference

ConferenceAnnual Symposium on Foundations of Computer Science
Abbreviated titleFOCS
CountryFrance
CityParis
Period07/10/201809/10/2018

Keywords

  • Centroid set
  • Clustering
  • Coreset
  • Doubling dimension
  • Outlier

Fingerprint Dive into the research topics of 'ϵ-Coresets for clustering (with Outliers) in doubling metrics'. Together they form a unique fingerprint.

Cite this