Describe the bug
Not sure whether this is a bug or just something that needs to be better documented.
The "k-means++" init on very sparse data can select initial centroids that never get updated, which causes problems in the silhouette clustering evaluation with sample_size=1000.
The problem disappears when sample_size is increased or when particular random states are set.
Steps/Code to Reproduce
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn import metrics

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)

vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=2,
    stop_words="english",
)
X = vectorizer.fit_transform(dataset.data)

km = KMeans(
    n_clusters=4,
    init="k-means++",
    max_iter=100,
    n_init=1,
    random_state=0,  # notice the fixed random state
).fit(X)

metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=3)  # works
metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=4)  # fails
Expected Results
A silhouette score is always returned.
Actual Results
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-1 in <module>
----> 38 metrics.silhouette_score(X, km.labels_, sample_size=1000, random_state=43)

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:117, in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
    115 else:
    116     X, labels = X[indices], labels[indices]
--> 117 return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:231, in silhouette_samples(X, labels, metric, **kwds)
    229 n_samples = len(labels)
    230 label_freqs = np.bincount(labels)
--> 231 check_number_of_labels(len(le.classes_), n_samples)
    233 kwds["metric"] = metric
    234 reduce_func = functools.partial(
    235     _silhouette_reduce, labels=labels, label_freqs=label_freqs
    236 )

File ~/scikit-learn/sklearn/metrics/cluster/_unsupervised.py:33, in check_number_of_labels(n_labels, n_samples)
     22 """Check that number of labels are valid.
     23
     24 Parameters
    (...)
     30     Number of samples.
     31 """
     32 if not 1 < n_labels < n_samples:
---> 33 raise ValueError(
     34     "Number of labels is %d. Valid values are 2 to n_samples - 1 (inclusive)"
     35     % n_labels
     36 )

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
Versions
System:
python: 3.9.9 | packaged by conda-forge | (main, Dec 20 2021, 02:40:17) [GCC 9.4.0]
executable: /home/arturoamor/miniforge3/envs/dev-scikit-learn/bin/python
machine: Linux-5.14.0-1038-oem-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.2.dev0
pip: 21.3.1
setuptools: 60.5.0
numpy: 1.22.1
scipy: 1.7.3
Cython: 0.29.26
pandas: 1.3.5
matplotlib: 3.5.1
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-2f7c42d4.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: SkylakeX
num_threads: 8
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-8b9e111f.3.17.so
version: 0.3.17
threading_layer: pthreads
architecture: SkylakeX
num_threads: 8
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/arturoamor/miniforge3/envs/dev-scikit-learn/lib/libgomp.so.1.0.0
version: None
num_threads: 8
Since random_state is not set in KMeans in the reproducer, I can't reproduce it yet. We need to find a seed that triggers it.
I updated it.
Isn't it just bad luck? It just happens that it's selecting sample_size data points that all correspond to the same label?
I can reproduce with this:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn import metrics
X, _ = make_blobs(n_samples=10000, centers=2, n_features=2, random_state=0)
km = KMeans(n_clusters=2, init="random", n_init=1, random_state=0).fit(X)
metrics.silhouette_score(X, km.labels_, sample_size=5, random_state=48)
Here I set 2 clusters and pick only 5 samples to maximize the chance that all sampled points correspond to the same label (a quick probability check is below).
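A back-of-the-envelope check (my own arithmetic; the sizes come from the make_blobs call above): with two balanced clusters of 5000 points each, a sample of 5 drawn without replacement lands entirely in one cluster with probability roughly 2 * (1/2)^5.

from math import comb

# Probability that all 5 sampled points fall in a single one of the two
# balanced clusters (5000 points each), sampling without replacement:
p = 2 * comb(5000, 5) / comb(10000, 5)
print(p)  # ~0.062, so a failing random_state is easy to find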
In your example, the clustering is highly imbalanced:
np.unique(km.labels_, return_counts=True)
>>> (array([0, 1, 2, 3], dtype=int32), array([   1,    1, 3382,    3]))
Almost all points are in the same cluster, hence a much higher chance of picking only points corresponding to a single label (quantified just below).
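To quantify that (assuming the cluster sizes printed above): the subsample of 1000 points sees a single label exactly when it misses all 5 minority points, which is a hypergeometric event:

from math import comb

# Probability that a subsample of 1000 (drawn without replacement from
# 3387 points) contains only points from the majority cluster of 3382:
p = comb(3382, 1000) / comb(3387, 1000)
print(p)  # ~0.17, i.e. roughly one random_state in six fails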
Looking at the results for different seeds of KMeans(n_clusters=4, init="k-means++", max_iter=100, n_init=1, random_state=seed).fit(X), this kind of imbalance happens sometimes. I guess it's because k-means++ picks centroids with probability proportional to the squared distance to the nearest already-chosen centroid, which means it will pick extremely isolated points if there are any, and those then stay their own centroids all along (a simplified sketch of the seeding rule is below).
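For intuition, here is a minimal sketch of that seeding rule (my own simplification, dense data only; scikit-learn's actual implementation additionally uses greedy local trials):

import numpy as np

def kmeans_pp_init(X, n_clusters, rng):
    # First center drawn uniformly; each next one drawn with probability
    # proportional to the squared distance to the nearest chosen center,
    # so an extremely isolated point is almost certain to be selected.
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        c = np.asarray(centers)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)

# e.g. kmeans_pp_init(X_dense, 4, np.random.default_rng(0))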
Note that this doesn't mean it's a bad clustering. The inertia is close to the inertia obtained with a random init; it just reflects the structure of the dataset.
It doesn't look like there's an issue with k-means++. It's just that KMeans (Euclidean distance, in fact) is not well suited to such a high-dimensional dataset.
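If one just wants a score despite the imbalance, a possible user-side workaround (illustrative only; safe_silhouette is a made-up helper, not a scikit-learn API) is to resample until the subset contains at least two labels:

import numpy as np
from sklearn import metrics

def safe_silhouette(X, labels, sample_size=1000, random_state=0, max_tries=100):
    # Redraw the subsample until it contains at least two distinct labels,
    # which is what silhouette_samples requires.
    rng = np.random.default_rng(random_state)
    labels = np.asarray(labels)
    for _ in range(max_tries):
        idx = rng.choice(len(labels), size=sample_size, replace=False)
        if np.unique(labels[idx]).size > 1:
            return metrics.silhouette_score(X[idx], labels[idx])
    raise ValueError("could not draw a subsample containing >= 2 labels")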