There are issues with numpy.unique when an array has dtype "object":
- when there are incomparable types in the array, an exception is raised
- when the types are all numeric, but the array contains NaN, the result is incorrect.

Both of those issues seem to be caused by issues with sort.
Are unique and sort known to not work when the dtype is "object"? I didn't see any indication in the documentation for either function.
Reproducing code example:
Here, np.unique works as expected:
In [1]: import numpy as np
In [2]: arr = np.array([0.0, 1, np.nan, 1.0, 0])
In [3]: np.sort(arr)
Out[3]: array([ 0., 0., 1., 1., nan])
In [4]: np.unique(arr)
Out[4]: array([ 0., 1., nan])
Here, mixed types in an array with dtype="object" cause an exception:
In [5]: arr = np.array(["a", "b", 1.0, "b", "a"], dtype="object")
In [6]: np.sort(arr)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-3b220bbb055a> in <module>
----> 1 np.sort(arr)
<__array_function__ internals> in sort(*args, **kwargs)
~/anaconda3/envs/tabnet/lib/python3.8/site-packages/numpy/core/fromnumeric.py in sort(a, axis, kind, order)
989 else:
990 a = asanyarray(a).copy(order="K")
--> 991 a.sort(axis=axis, kind=kind, order=order)
992 return a
TypeError: '<' not supported between instances of 'float' and 'str'
Here, no exception is raised, but the results are incorrect:
In [7]: arr = np.array([0.0, 1, np.nan, 1.0, 0], dtype="object")
In [8]: np.sort(arr)
Out[8]: array([0.0, 1, nan, 0, 1.0], dtype=object)
In [9]: np.unique(arr)
Out[9]: array([0.0, 1, nan, 0, 1.0], dtype=object)
In [10]: arr = np.array([0.0, 1.0, np.nan, 1.0, 0.0], dtype="object")
In [11]: np.sort(arr)
Out[11]: array([0.0, 1.0, nan, 0.0, 1.0], dtype=object)
In [12]: np.unique(arr)
Out[12]: array([0.0, 1.0, nan, 0.0, 1.0], dtype=object)
The documentation for sort says "In numpy versions >= 1.4.0 nan values are sorted to the end."
NumPy/Python version information:
Seen in NumPy version 1.19.5, Python version 3.8.5
> Here, no exception is raised, but the results are incorrect:
Both of these are expected behavior, and match the behavior of the builtin sorted function on lists.
We should probably update the documentation for sort to clarify that nans are sorted to the end only for arrays of subtypes of np.inexact.
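To illustrate why object arrays behave like plain lists here, the following minimal sketch uses only the standard library. It shows that every comparison involving NaN is False, which leaves a comparison-based sort with no defined position for it. The helper sort_nan_last is a hypothetical name, not part of NumPy or Python; it is one possible way to get "NaN last" ordering for purely numeric Python objects.

```python
import math

nan = math.nan

# Every comparison involving NaN is False, so NaN has no
# well-defined position in a comparison-based sort.
nan_comparisons = [nan < 1.0, nan > 1.0, nan == nan]

def sort_nan_last(values):
    """Sort numeric values, placing NaN entries at the end (hypothetical helper)."""
    # The key compares the "is NaN" flag first, so NaN itself is only
    # ever compared against another NaN.
    return sorted(values, key=lambda v: (math.isnan(v), v))

values = [0.0, 1, nan, 1.0, 0]
result = sort_nan_last(values)
```

With an explicit key like this, the ordering no longer depends on how the sort happens to interleave NaN with the other elements. This assumes all elements are numbers; mixed strings and floats would still raise a TypeError.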
I can confirm this issue. I also had problems with the numpy.unique function when the elements of the object array were of type set, and also frozenset. A small example follows.
My dataset is a dataframe of sets of sequences of strings. In the following image, the unique function is applied to the values in an array. Sorry, the example is somewhat big.
In the image one can see that the unique function did not remove duplicates of the set {'E', 'K', 'L'}, which is incorrect.
Moreover, counting occurrences in the result of the unique function on this array gives a count > 1:
import collections
collections.Counter(np.unique(sf.values))[{'E', 'K', 'L'}]
Out: 3
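A likely explanation, offered here as an assumption rather than a confirmed diagnosis, is that < on Python sets tests for proper subset, which is only a partial order, so a sort-based unique cannot reliably place equal sets next to each other. The sketch below shows that behavior and a hash-based alternative using only the standard library. The helper unique_sets and the sample data are hypothetical and stand in for the dataframe values above.

```python
from collections import Counter

a = {"E", "K", "L"}
b = {"A", "B"}

# For sets, < and > test for proper subset/superset, so two unrelated
# sets are neither less than, greater than, nor equal to each other.
unordered = (a < b, a > b, a == b)

def unique_sets(items):
    """Return the distinct sets in first-seen order (hypothetical helper)."""
    # dict keys are deduplicated by hash and equality, not by ordering.
    return list(dict.fromkeys(frozenset(s) for s in items))

data = [{"E", "K", "L"}, {"A", "B"}, {"L", "K", "E"}, {"E", "K", "L"}]
distinct = unique_sets(data)

# Counting on the raw data (as frozensets) gives the true multiplicity.
counts = Counter(frozenset(s) for s in data)
```

Converting to frozenset is needed because plain sets are unhashable and cannot be used as dict or Counter keys.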