>>> scal = pa.scalar(6312878760374611856, type=pa.int64())
>>> scal.cast(pa.float64())
<pyarrow.DoubleScalar: 6.312878760374612e+18>
>>> arr = pa.array([6312878760374611856], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow\array.pxi", line 926, in pyarrow.lib.Array.cast
File "C:\pandas2_ps_04323\lib\site-packages\pyarrow\compute.py", line 391, in cast
return call_function("cast", [arr], options)
File "pyarrow\_compute.pyx", line 560, in pyarrow._compute.call_function
File "pyarrow\_compute.pyx", line 355, in pyarrow._compute.Function.call
File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 6312878760374611856 not in range: -9007199254740992 to 9007199254740992
Behavior is not consistent between array and scalar casting. The array behavior of raising does not seem correct, since it seems an int64 should always be able to be cast to float64.
Component(s)
Python
Hi @rohanjain101,
I am able to reproduce this example on pyarrow v11 using macos 13.3.
What you are experiencing is the difference between safe vs. unsafe casting, since the number you chose probably cannot be fully represented in the new type. It is not true that all int64 values can be safely converted to float64: due to the way floating-point precision works, there are numbers representable as int64 that float64 has to skip. See https://en.wikipedia.org/wiki/Double-precision_floating-point_format, which states that only integers from −2^53 to 2^53 (−9,007,199,254,740,992 to 9,007,199,254,740,992) can be exactly represented. My guess is the underlying implementation enforces this hard limit, since technically I believe there are some int64 numbers larger than that which can still be represented safely when cast to float64 (such as 18,014,398,509,481,984, but not 18,014,398,509,481,983).
>>> arr = pa.array([18014398509481984], type=pa.int64())
>>> arr.cast(pa.float64())
Traceback (most recent call last):
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
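For reference, plain Python (no pyarrow involved) shows that 18,014,398,509,481,984 is in fact exactly representable as a double, while 18,014,398,509,481,983 is not (a small illustration):
>>> int(float(18014398509481984)) == 18014398509481984   # 2**54: exactly representable as float64
True
>>> int(float(18014398509481983)) == 18014398509481983   # 2**54 - 1: rounds to 2**54
False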
It appears the scalar cast defaults to allow unsafe casting, while the array defaults to safe casting. You can allow unsafe casting in the array like this:
>>> arr.cast(pa.float64(), safe=False)
<pyarrow.lib.DoubleArray object at 0x126a40ee0>
[
  6.312878760374612e+18
]
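Equivalently (a sketch; pc here stands for pyarrow.compute), the compute-level cast function accepts the same safe flag:
>>> import pyarrow.compute as pc
>>> pc.cast(arr, pa.float64(), safe=False)   # unsafe cast: value gets rounded
<pyarrow.lib.DoubleArray object at 0x...>
[
  6.312878760374612e+18
]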
There are no options to choose safe vs unsafe cast in scalar APIs at the moment. The documentation does state the scalar will perform a safe cast, though, which it is not doing: https://arrow.apache.org/docs/python/generated/pyarrow.Int64Scalar.html#pyarrow.Int64Scalar
This is either a bug in scalar safe casting or the documentation is wrong. Ideally, scalars would also allow you to choose safe vs. unsafe casting with an option. Either way, some more investigation is still needed.
Is there an internal difference in how double values are stored between Arrow and numpy that would cause the difference?
In your example, 18,014,398,509,481,984 can be converted to float64 safely according to the floating-point specification, so it is not a good example to use. Instead, let's try 18,014,398,509,481,983, which is not a multiple of 2 (integers between 2^53 and 2^54 must be multiples of 2 to be exactly representable).
You will lose data in this numpy cast. (And yes, my guess is they adhere to the floating point spec slightly differently purely based on the different behavior).
>>> np.array([18014398509481983]).astype("float64").astype("int64") == 18014398509481983
array([False])
>>> np.array([18014398509481983]).astype("float64").astype("int64")
array([18014398509481984])
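For comparison (an illustrative sketch, traceback abbreviated), pyarrow's default safe cast refuses the same value rather than silently rounding it:
>>> pa.array([18014398509481983], type=pa.int64()).cast(pa.float64())
Traceback (most recent call last):
  ...
pyarrow.lib.ArrowInvalid: Integer value 18014398509481983 not in range: -9007199254740992 to 9007199254740992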
Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.
>>> type(np.array([18014398509481984])[0])
<class 'numpy.int64'>
# Bug? No safety error for int64 -> float64
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64") == 18014398509481983
array([False])
# Good? Errors out on float64 -> int64, but the bug happened in the int64 -> float64 cast and was somehow propagated.
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype("int64", casting="safe") == 18014398509481983
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
For pyarrow, we should probably:
1. Allow both safe and unsafe conversion options for scalar APIs (feature)
2. Default to safe conversion for scalars, which does not appear to be happening (bug)
3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)
@AlenkaF @jorisvandenbossche what do you think?
But in the example where the cast is safe, for 18,014,398,509,481,984, shouldn't pyarrow then succeed since the cast can be done safely? In my example, the array case still raises even though the cast is safe. Should it only raise for 18,014,398,509,481,983?
If pyarrow were to follow the floating point specification exactly, then yes it would. Right now, it seems to be a limitation of the implementation. You could argue that option (3) above should be a bug instead of a feature.
I'd recommend filing an issue with numpy about this, too:
>>> import numpy as np
>>> np.__version__
'1.24.2'
# Bug: No safety error for initial int64 -> float64 conversion
>>> np.array([18014398509481983]).astype("float64", casting="safe").astype(str)
array(['1.8014398509481984e+16'], dtype='<U32')
(The issue was retitled to "[Python] Inconsistent cast behavior between array and scalar for int64" on Apr 11, 2023.)
Numpy defaults to unsafe casting (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.astype.html), but it seems it also doesn't perform safety checks properly all of the time.
Sidenote: numpy doesn't really have the same concept of "safe" casting as how we use this in pyarrow. In pyarrow this depends on the values, while in numpy this is just a property of a cast between two dtypes. So to say if a cast from one dtype to another is safe or not, numpy needs to make some generalization/assumption, and so it seems it decided that casting int to float is generally safe (indeed, except for large ints) and casting floats to ints is generally not safe (indeed, except if you have rounded floats):
>>> np.can_cast(np.int64(), np.float64(), casting="safe")
True
>>> np.can_cast(np.float64(), np.int64(), casting="safe")
False
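To make that difference concrete: pyarrow's safety check is value-dependent, so the same int64 -> float64 cast can succeed or fail depending on the data (an illustrative sketch; traceback abbreviated):
>>> pa.array([1], type=pa.int64()).cast(pa.float64()).type    # value inside +/- 2**53: safe cast succeeds
DataType(double)
>>> pa.array([2**54], type=pa.int64()).cast(pa.float64())     # value outside +/- 2**53: safe cast raises
Traceback (most recent call last):
  ...
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992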
[C++] Change Scalar::CastTo to do safe cast by default and allow to specify cast options? (#35040)
For pyarrow, we should probably:
1. Allow both safe and unsafe conversion options for scalar APIs (feature)
2. Default to safe conversion for scalars, which does not appear to be happening (bug)
Yes, fully agreed, I opened a separate issue for specifically this aspect: #35040
3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)
For checking the safety of casting int to float, we indeed use this fixed range: arrow/cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc, lines 171 to 250 (at commit e488942).
I am not fully sure this is something we should change. First, I think it is a lot simpler in the implementation to just check for values within the range, compared to checking for the specific integers outside of that range that can still be represented as float. But also for the user this seems easier to understand and gives more consistent behaviour? (Everything outside of that range will fail with the default safe=True, not depending on the exact value.)
Regarding 3, the current error is not very clear:
>>> pa.array([18014398509481984], type=pa.float64())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow\array.pxi", line 320, in pyarrow.lib.array
File "pyarrow\array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Integer value 18014398509481984 is outside of the range exactly representable by a IEEE 754 double precision value
Since this integer meets the IEEE 754 specification for what can be represented as a double, should the error also clarify that it's an internal limitation?
3. Look into allowing safe conversion from int <-> float for valid numbers larger than 2^53 (feature)
The options within the C++ lib are very fine-grained already:
bool allow_int_overflow;
bool allow_time_truncate;
bool allow_time_overflow;
bool allow_decimal_truncate;
bool allow_float_truncate;
// Indicate if conversions from Binary/FixedSizeBinary to string must
// validate the utf8 payload.
bool allow_invalid_utf8;
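On the Python side these map onto keyword arguments of pyarrow.compute.CastOptions; for example, re-running the cast from the original report with float truncation allowed (a sketch along the same lines as the allow_float_truncate example further down in this thread):
>>> import pyarrow.compute as pc
>>> opts = pc.CastOptions(target_type=pa.float64(), allow_float_truncate=True)
>>> pa.array([6312878760374611856], type=pa.int64()).cast(options=opts)
<pyarrow.lib.DoubleArray object at 0x...>
[
  6.312878760374612e+18
]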
If we are discussing ideal behavior then I think something like...
// Allow converting integers to floats when the integer cannot be exactly represented by
// an IEEE-754 float and must be rounded.
bool allow_float_inexact;
...would be very reasonable. If we wanted to be even more extreme 😄 we could have:
enum class IeeeRoundingMode : int8_t {
  TIE_TO_EVEN = 0,
  TIE_AWAY_FROM_ZERO = 1,
  TOWARD_ZERO = 2,
  TOWARD_POSITIVE_INFINITY = 3,
  TOWARD_NEGATIVE_INFINITY = 4,
  ERROR = 5
};
However, all of point number 3 sounds like a separate issue from this one.
If we are discussing ideal behavior then I think something like...
// Allow converting integers to floats when the integer cannot be exactly represented by
// an IEEE-754 float and must be rounded.
bool allow_float_inexact;
It's not super clear from the name, but we already use the existing allow_float_truncate option for the int->float cast (the name suggests to me this is mostly about float->int, truncating the float if it is not a "round" float).
>>> pa.array([18014398509481984], type=pa.int64()).cast(pa.float64())
ArrowInvalid: Integer value 18014398509481984 not in range: -9007199254740992 to 9007199254740992
>>> pa.array([18014398509481984], type=pa.int64()).cast(options=pc.CastOptions(pa.float64(), allow_float_truncate=True))
<pyarrow.lib.DoubleArray object at 0x7f4e1f7dee00>
[
  1.8014398509481984e+16
]
But your suggestion would be that an option like allow_float_inexact would use different logic to determine when the cast is allowed (actually check whether the result value is inexact, instead of checking whether it is outside of the generally safe range)?
(Personally I would say that allow_float_inexact is a better name for what we currently call allow_float_truncate for the int->float cast.)
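As a hypothetical sketch of that value-based logic (not the actual C++ kernel), the check could simply round-trip each value:
>>> def int64_to_float64_is_exact(value):
...     # Hypothetical per-value check: does the integer survive a round trip through float64?
...     return int(float(value)) == value
...
>>> int64_to_float64_is_exact(18014398509481984)   # 2**54: exactly representable
True
>>> int64_to_float64_is_exact(18014398509481983)   # 2**54 - 1: would be rounded
False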
If we wanted to be even more extreme 😄 we could have:
enum class IeeeRoundingMode : int8_t {
  TIE_TO_EVEN = 0,
  ...
I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?
I would say that if a user wants to control the rounding, they should use the round kernel instead of a cast?
The rounding kernel allows for conversion from one valid IEEE value to another. This rounding is about going from an infinite precision value that cannot be represented in IEEE to a valid IEEE value.
I'll walk my comment back though.
Technically, IEEE rounding is something that has to be considered in just about any operation (e.g. addition, subtraction) because the infinite-precision result isn't representable.
In practice, we'd probably be better off just saying we always use TIE_TO_EVEN and it's not configurable. This is what every other engine seems to do (TIE_TO_EVEN is the default for most / all modern CPUs).
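For illustration, plain Python int-to-float conversion already applies round-to-nearest, ties-to-even at this scale, where the spacing between consecutive doubles is 2:
>>> float(2**53 + 1) == float(2**53)       # halfway case: rounds to the neighbour with an even significand
True
>>> float(2**53 + 3) == float(2**53 + 4)   # the other halfway case rounds up, again to the even significand
True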
It's not super clear from the name, but we already use the existing allow_float_truncate option for the int->float cast (the name suggests to me this is mostly about float->int, truncating the float if it is not a "round" float).
I didn't realize this. So this issue is about allowing these sorts of casts when safe=True and the value happens to be representable? I thought this was an ask for a new kind of unsafe cast.
### Rationale for this change
Scalar cast should use the compute kernel just like Arrays, instead of its own custom implementation.
### Are these changes tested?
Added test cases for GH-35370, GH-34901, and GH-35040
### Are there any user-facing changes?
The Scalar.cast() API is enhanced and backwards compatible.
* Closes: #35040
Authored-by: Dane Pitkin <dane@voltrondata.com>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
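For illustration, after this change Scalar.cast is expected to accept the same safety controls as Array.cast, along these lines (a sketch, assuming the enhanced signature mirrors Array.cast's safe parameter; the exact behavior is defined by the PR):
>>> scal = pa.scalar(6312878760374611856, type=pa.int64())
>>> scal.cast(pa.float64(), safe=False)   # explicitly opt in to the unsafe cast
<pyarrow.DoubleScalar: 6.312878760374612e+18>
>>> scal.cast(pa.float64())               # safe by default, now raises like the array cast
Traceback (most recent call last):
  ...
pyarrow.lib.ArrowInvalid: Integer value 6312878760374611856 not in range: -9007199254740992 to 9007199254740992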
So this issue is about allowing these sorts of casts when safe=True and the value happens to be representable?
Yes, from my understanding.
The original issue reported is now fixed (#35395). We can either repurpose this issue for the above feature request, or close this issue and file a new one.
[C++] Allow safe cast from int64 to float64 in compute kernel for numerics that can be accurately represented above 2^53 (#35563)
I captured the feature request to safely cast representable int64s larger than 2^53 to float64 here: #35563.
Closing this issue since the original bug report is resolved in #35395