There are issues with numpy.unique when an array has dtype "object":

  • When the array contains incomparable types, an exception is raised.
  • When the types are all numeric but the array contains NaN, the result is incorrect.
  • Both problems appear to be caused by sort.

    Are unique and sort known to not work when the dtype is "object"? I didn't see any indication in the documentation for either function.

    Reproducing code example:

    Here, np.unique works as expected:

    In [1]: import numpy as np
    In [2]: arr = np.array([0.0, 1, np.nan, 1.0, 0])
    In [3]: np.sort(arr)
    Out[3]: array([ 0.,  0.,  1.,  1., nan])
    In [4]: np.unique(arr)
    Out[4]: array([ 0.,  1., nan])

    Here, mixed type inside of dtype="object" causes an exception:

    In [5]: arr = np.array(["a", "b", 1.0, "b", "a"], dtype="object")
    In [6]: np.sort(arr)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-6-3b220bbb055a> in <module>
    ----> 1 np.sort(arr)
    <__array_function__ internals> in sort(*args, **kwargs)
    ~/anaconda3/envs/tabnet/lib/python3.8/site-packages/numpy/core/fromnumeric.py in sort(a, axis, kind, order)
        989     else:
        990         a = asanyarray(a).copy(order="K")
    --> 991     a.sort(axis=axis, kind=kind, order=order)
        992     return a
    TypeError: '<' not supported between instances of 'float' and 'str'
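
np.unique sorts its input, which is why the '<' comparison between 'float' and 'str' surfaces here. When the elements are hashable, duplicates can be removed without any ordering comparison. The following is a minimal sketch under that assumption; `unique_unsorted` is a hypothetical helper, not part of NumPy, and it returns elements in first-seen order rather than sorted order.

```python
import numpy as np

def unique_unsorted(arr):
    """Return the distinct elements of a 1-D object array in first-seen order.

    Hypothetical helper: it relies on hashing and ==, never on <,
    so mixed types such as str and float do not raise.
    """
    # dict keys are unique and keep insertion order (Python 3.7+).
    distinct = list(dict.fromkeys(arr.tolist()))
    out = np.empty(len(distinct), dtype=object)
    out[:] = distinct
    return out

arr = np.array(["a", "b", 1.0, "b", "a"], dtype="object")
result = unique_unsorted(arr)
print(result)
```

Note that this approach treats two elements as duplicates only if they hash alike and compare equal, so distinct NaN objects would not be collapsed.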

    Here, no exception is raised, but the results are incorrect:

    In [7]: arr = np.array([0.0, 1, np.nan, 1.0, 0], dtype="object")
    In [8]: np.sort(arr)
    Out[8]: array([0.0, 1, nan, 0, 1.0], dtype=object)
    In [9]: np.unique(arr)
    Out[9]: array([0.0, 1, nan, 0, 1.0], dtype=object)
    In [10]: arr = np.array([0.0, 1.0, np.nan, 1.0, 0.0], dtype="object")
    In [11]: np.sort(arr)
    Out[11]: array([0.0, 1.0, nan, 0.0, 1.0], dtype=object)
    In [12]: np.unique(arr)
    Out[12]: array([0.0, 1.0, nan, 0.0, 1.0], dtype=object)

    The documentation for sort says "In numpy versions >= 1.4.0 nan values are sorted to the end."

    NumPy/Python version information:

    Seen with NumPy 1.19.5 and Python 3.8.5.

    Regarding "Here, no exception is raised, but the results are incorrect":

    Both of these are expected behavior, and match the behavior of the builtin sorted function on lists.
    We should probably update the documentation for sort to clarify that NaN values are sorted to the end only for arrays whose dtype is a subtype of np.inexact.
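
To illustrate the point about object arrays, the sketch below shows why a comparison-based sort cannot place NaN consistently, and that casting to a floating-point dtype (a subtype of np.inexact) restores the documented behavior. The variable names are illustrative, and the cast assumes every element is numeric.

```python
import numpy as np

nan = float("nan")
# Every ordering comparison with NaN is False, and NaN is not equal to itself,
# so a sort that relies on Python's < has no consistent position for it.
assert not (nan < 1.0) and not (nan > 1.0) and nan != nan

obj = np.array([0.0, 1, np.nan, 1.0, 0], dtype="object")

# Casting to float lets NumPy use its NaN-aware sort for inexact dtypes.
as_float = obj.astype(float)
sorted_vals = np.sort(as_float)
unique_vals = np.unique(as_float)
print(sorted_vals)
print(unique_vals)
```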

    I can confirm this issue. I also had problems with numpy.unique when the elements of the object array were of type set, and also frozenset. A small example follows.

    My dataset is a dataframe of sets of sequences of strings. In the following image, the unique function is applied to the values in an array. Sorry, the example is somewhat large.

    In the image one can see that the unique function did not remove duplicates of the set {'E', 'K', 'L'}, which is incorrect.

    Moreover, counting occurrences in the result of the unique function on this array gives a count greater than 1:

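    import collections
    import numpy as np
    # sf is the poster's DataFrame (not shown in the thread).
    # Counter needs hashable keys, so the lookup uses a frozenset.
    collections.Counter(np.unique(sf.values))[frozenset({'E', 'K', 'L'})]
    Out: 3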
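
A likely explanation, consistent with the earlier reply about sort, is that < on sets tests for a proper subset and therefore defines only a partial order, so sorting cannot be relied on to place equal sets next to each other. The sketch below demonstrates that comparison and a hash-based way to deduplicate instead. It assumes the elements are frozenset (plain set is not hashable), and the sample values are made up for illustration.

```python
import numpy as np

a = frozenset({"E", "K", "L"})
b = frozenset({"A", "B"})
# Neither set is a subset of the other, so both orderings are False
# even though the sets are unequal.
print(a < b, b < a, a == b)  # False False False

# An object array with a repeated set, built element by element.
items = [a, b, frozenset({"E", "K", "L"}), b]
values = np.empty(len(items), dtype=object)
values[:] = items

# Deduplicate by hash and equality instead of by sorting.
distinct = list(dict.fromkeys(values.tolist()))
print(distinct)
```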