Describe the bug
Not sure whether this is a bug or just something that needs to be better documented.
The "k-means++" init on very sparse data can select initial centroids that never get updated, which causes problems in the silhouette clustering evaluation with sample_size=1000.
The problem disappears when sample_size is increased or when particular random states are set.
Steps/Code to Reproduce
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn import metrics

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)

vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=2,
    stop_words="english",
)
X = vectorizer.fit_transform(dataset.data)

km = KMeans(
    n_clusters=4,
    init="k-means++",
    max_iter=100,
    n_init=1,
    random_state=0,  # notice the fixed random state
).fit(X)

metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=3)  # works
metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=4)  # fails
Expected Results
A silhouette score is always returned.
Actual Results
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-1 in <module>
----> 38 metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=43)

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:117, in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
    115 else:
    116     X, labels = X[indices], labels[indices]
--> 117 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:231, in silhouette_samples(X, labels, metric, **kwds)
    229 n_samples = len(labels)
    230 label_freqs = np.bincount(labels)
--> 231 check_number_of_labels(len(le.classes_), n_samples)
    233 kwds["metric"] = metric
    234 reduce_func = functools.partial(
    235     _silhouette_reduce, labels=labels, label_freqs=label_freqs
    236 )

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:33, in check_number_of_labels(n_labels, n_samples)
     22 """Check that number of labels are valid.
     23
     24 Parameters
    (...)
     30     Number of samples.
     31 """
     32 if not 1 < n_labels < n_samples:
---> 33 raise ValueError(
     34     "Number of labels is %d. Valid values are 2 to n_samples - 1 (inclusive)"
     35     % n_labels
     36 )

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
Versions
System:
python: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17) [GCC 9.4.0]
executable: /home/arturoamor/miniforge3/envs/dev-scikit-learn/bin/python
machine: Linux-5.14.0-1038-oem-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.2.dev0
pip: 21.3.1
setuptools: 60.5.0
numpy: 1.22.1
scipy: 1.7.3
Cython: 0.29.26
pandas: 1.3.5
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: SkylakeX
num_threads: 8
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
version: 0.3.17
threading_layer: pthreads
architecture: SkylakeX
num_threads: 8
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/libgomp.so.1.0.0
version: None
num_threads: 8
Since random_state is not set in KMeans in the reproducer, I can't reproduce it yet. We need to find a seed that triggers it.
I updated it.
Isn't it just bad luck? It just happens that it's selecting sample_size data points that all correspond to the same label?
I can reproduce with this:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics
X, _ = make_blobs(n_samples=10000, centers=2, n_features=2, random_state=0)
km = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(X)
metrics.silhouette_score(X, km.labels_, sample_size=5, random_state=48)
Here I set 2 clusters and pick only 5 samples to maximize the chance that all sampled points correspond to the same label (a quick probability check is below).
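A back-of-the-envelope check (my own arithmetic; the sizes come from the make_blobs call above): with two balanced clusters of 5000 points each, a sample of 5 drawn without replacement lands entirely in one cluster with probability roughly 2 * (1/2)^5.

from math import comb

# Probability that all 5 sampled points fall in a single one of the two
# balanced clusters (5000 points each), sampling without replacement:
p = 2 * comb(5000, 5) / comb(10000, 5)
print(p)  # ~0.062, so a failing random_state is easy to find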
In your example, the clustering is highly imbalanced:
np.unique(km.labels_, return_counts=True)
>>> (array([0, 1, 2, 3], dtype=int32), array([   1,    1, 3382,    3]))
Almost all points are in the same cluster, hence a much higher chance of picking only points corresponding to a single label (quantified just below).
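To quantify that (assuming the cluster sizes printed above): the subsample of 1000 points sees a single label exactly when it misses all 5 minority points, which is a hypergeometric event:

from math import comb

# Probability that a subsample of 1000 (drawn without replacement from
# 3387 points) contains only points from the majority cluster of 3382:
p = comb(3382, 1000) / comb(3387, 1000)
print(p)  # ~0.17, i.e. roughly one random_state in six fails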
Looking at the results for different seeds of KMeans(n_clusters=4, init="k-means++", max_iter=100, n_init=1, random_state=seed).fit(X), this kind of imbalance happens sometimes. I guess it's because k-means++ picks centroids with probability proportional to the squared distance to the nearest already-chosen centroid, which means it will pick extremely isolated points if there are any, and those then stay their own centroids all along (a simplified sketch of the seeding rule is below).
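For intuition, here is a minimal sketch of that seeding rule (my own simplification, dense data only; scikit-learn's actual implementation additionally uses greedy local trials):

import numpy as np

def kmeans_pp_init(X, n_clusters, rng):
    # First center drawn uniformly; each next one drawn with probability
    # proportional to the squared distance to the nearest chosen center,
    # so an extremely isolated point is almost certain to be selected.
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        c = np.asarray(centers)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)

# e.g. kmeans_pp_init(X_dense, 4, np.random.default_rng(0))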
Note that this doesn't mean it's a bad clustering. The inertia is close to the inertia obtained with a random init; it just reflects the structure of the dataset.
It doesn't look like there's an issue with k-means++. It's just that KMeans (Euclidean distance, in fact) is not well suited to such a high-dimensional dataset.
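If one just wants a score despite the imbalance, a possible user-side workaround (illustrative only; safe_silhouette is a made-up helper, not a scikit-learn API) is to resample until the subset contains at least two labels:

import numpy as np
from sklearn import metrics

def safe_silhouette(X, labels, sample_size=1000, random_state=0, max_tries=100):
    # Redraw the subsample until it contains at least two distinct labels,
    # which is what silhouette_samples requires.
    rng = np.random.default_rng(random_state)
    labels = np.asarray(labels)
    for _ in range(max_tries):
        idx = rng.choice(len(labels), size=sample_size, replace=False)
        if np.unique(labels[idx]).size > 1:
            return metrics.silhouette_score(X[idx], labels[idx])
    raise ValueError("could not draw a subsample containing >= 2 labels")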