添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接
  • Parallel computing with Dask
  • A Tour of Xarray Customizations
  • Interactive plots using hvplot
  • Remote Data
    • Zarr in Cloud Object Storage
    • Access Patterns to Remote Data with fsspec
    • Data Tidying
      • Examples
      • Re-organize InSAR ice velocity data
      • Contributing
      • Presentations
      • Advanced

      • Parallelizing custom functions
      • apply_ufunc
        • A gentle introduction
        • Core dimensions
        • Handling complex output
        • Automatic Vectorization
        • Handling dask arrays
        • Fast vectorization with Numba
        • np.interp : An end-to-end example
        • map_blocks
          • A gentle introduction
          • Reading data using backends
            • Binary data without lazy loading
            • Binary data with lazy loading
            • Creating custom accessors
            • Workshops

            • SciPy 2024
            • SciPy 2023
            • Thinking like Xarray 2022
              • High-level computation patterns
              • Oceanhackweek 2020
                • Xarray in 45 minutes
                • Xarray Online Tutorial 2020
                  • Xarray Fundamentals
                  • Indexing and Selecting Data
                  • Computation
                  • Reference

                  • Contributing Guide
                  • Keep Exploring!
                  • Glossary
                  • Automatic Vectorization #

                    Previously we looked at applying functions on numpy arrays, and the concept of core dimensions . We learned that functions commonly support specifying “core dimensions” through the axis keyword argument.

                    However many functions exist, that implicitly have core dimensions, but do not provide an axis keyword argument. Applying such functions to a nD array usually involves one or multiple loops over the other dimensions — termed “loop dimensions” or “broadcast dimensions”.

                    A good example is numpy’s 1D interpolate function numpy.interp :

                        Signature: np.interp(x, xp, fp, left=None, right=None, period=None)
                        Docstring:
                            One-dimensional linear interpolation.
                        Returns the one-dimensional piecewise linear interpolant to a function
                        with given discrete data points (`xp`, `fp`), evaluated at `x`.
                    

                    This function expects 1D arrays as input, so there is one core dimension and we cannot easily apply it to a nD array since there is no axis keyword argument.

                    Our goal here is to

                  • Understand the difference between core dimensions and loop dimensions

                  • Understand vectorization

                  • Learn how to apply such functions without loops using apply_ufunc by providing the vectorize keyword argument.

                  • Core dimensions and looping#

                    Let’s say we want to interpolate an array with two dimensions (space, time) over the time dimension, we might

                  • loop over the space dimension,

                  • subset the array to a 1D array at that space location,

                  • Interpolate the 1D arrays to the new time vector, and

                  • Assign that new interpolated 1D array to the appropriate location of a 2D output array

                  • In pseudo-code this might look like

                    for index in range(size_of_space_axis):
                        out[index, :] = np.interp(..., array[index, :], ...)
                    

                    Exercise

                    Consider the example problem of interpolating a 2D array with dimensions space and time along the time dimension. Which dimension is the core dimension, and which is the “loop dimension”?

                    Solution

                    time is the core dimension, and space is the loop dimension.

                    Vectorization#

                    The pattern of looping over any number of “loop dimensions” and applying a function along “core dimensions” is so common that numpy provides wrappers that automate these steps:

                  • numpy.apply_along_axis

                  • numpy.apply_over_axes

                  • numpy.vectorize

                  • apply_ufunc provides an easy interface to numpy.vectorize through the keyword argument vectorize. Here we see how to use that to automatically apply np.interp along a single axis of a nD array

                    Load data#

                    First lets load an example dataset

                    We’ll reduce the length of error messages using %xmode minimal See the ipython documentation for details.

                    %xmode minimal
                    import xarray as xr
                    import numpy as np
                    xr.set_options(display_expand_data=False)
                    air = (
                        xr.tutorial.load_dataset("air_temperature")
                        .air.sortby("lat")  # np.interp needs coordinate in ascending order
                        .isel(time=slice(4), lon=slice(3))  # choose a small subset for convenience
                    
                    <xarray.DataArray 'air' (time: 4, lat: 25, lon: 3)> Size: 2kB
                    296.3 296.8 297.1 295.9 296.2 296.8 ... 246.3 245.3 244.2 241.9 241.8 241.8
                    Coordinates:
                      * lat      (lat) float32 100B 15.0 17.5 20.0 22.5 25.0 ... 67.5 70.0 72.5 75.0
                      * lon      (lon) float32 12B 200.0 202.5 205.0
                      * time     (time) datetime64[ns] 32B 2013-01-01 ... 2013-01-01T18:00:00
                    Attributes:
                        long_name:     4xDaily Air temperature at sigma level 995
                        units:         degK
                        precision:     2
                        GRIB_id:       11
                        GRIB_name:     TMP
                        var_desc:      Air temperature
                        dataset:       NMC Reanalysis
                        level_desc:    Surface
                        statistic:     Individual Obs
                        parent_stat:   Other
                        actual_range:  [185.16 322.1 ]
                    xarray.DataArray
                    'air'
                    • time: 4
                    • lat: 25
                    • lon: 3
                    • 296.3 296.8 297.1 295.9 296.2 296.8 ... 245.3 244.2 241.9 241.8 241.8
                      array([[[296.29, 296.79, 297.1 ],
                              [295.9 , 296.2 , 296.79],
                              [296.6 , 296.2 , 296.4 ],
                              [297.  , 296.7 , 296.1 ],
                              [295.4 , 295.7 , 295.79],
                              [293.79, 294.1 , 294.6 ],
                              [293.1 , 293.29, 293.29],
                              [290.2 , 290.79, 291.4 ],
                              [287.9 , 288.  , 288.29],
                              [286.5 , 286.5 , 285.7 ],
                              [284.6 , 284.9 , 284.2 ],
                              [282.79, 283.2 , 282.6 ],
                              [280.  , 280.7 , 280.2 ],
                              [278.4 , 279.  , 279.  ],
                              [277.29, 277.4 , 277.79],
                              [276.7 , 277.4 , 277.7 ],
                              [275.9 , 276.9 , 276.9 ],
                              [274.79, 275.2 , 275.6 ],
                              [273.7 , 273.6 , 273.79],
                              [272.1 , 270.9 , 270.  ],
                              [293.  , 293.5 , 294.29],
                              [291.9 , 291.9 , 292.2 ],
                              [289.2 , 289.4 , 289.9 ],
                              [286.6 , 287.1 , 287.9 ],
                              [284.79, 284.79, 285.4 ],
                              [282.79, 282.  , 282.7 ],
                              [281.2 , 280.2 , 280.6 ],
                              [279.5 , 278.7 , 278.6 ],
                              [278.  , 277.7 , 277.6 ],
                              [276.4 , 275.9 , 276.4 ],
                              [275.6 , 275.7 , 276.1 ],
                              [274.5 , 275.6 , 276.29],
                              [273.4 , 274.5 , 275.5 ],
                              [274.1 , 274.  , 273.5 ],
                              [273.29, 272.6 , 271.5 ],
                              [272.79, 272.4 , 271.9 ],
                              [267.7 , 266.29, 264.4 ],
                              [256.6 , 254.7 , 252.1 ],
                              [246.3 , 245.3 , 244.2 ],
                              [241.89, 241.8 , 241.8 ]]])
                      • lat
                        (lat)
                        float32
                        15.0 17.5 20.0 ... 70.0 72.5 75.0
                        standard_name :
                        latitude
                        long_name :
                        Latitude
                        units :
                        degrees_north
                        axis :
                        Y
                        array([15. , 17.5, 20. , 22.5, 25. , 27.5, 30. , 32.5, 35. , 37.5, 40. , 42.5,
                               45. , 47.5, 50. , 52.5, 55. , 57.5, 60. , 62.5, 65. , 67.5, 70. , 72.5,
                               75. ], dtype=float32)
                      • lon
                        (lon)
                        float32
                        200.0 202.5 205.0
                        standard_name :
                        longitude
                        long_name :
                        Longitude
                        units :
                        degrees_east
                        axis :
                        X
                        array([200. , 202.5, 205. ], dtype=float32)
                      • time
                        (time)
                        datetime64[ns]
                        2013-01-01 ... 2013-01-01T18:00:00
                        standard_name :
                        time
                        long_name :
                        Time
                        array(['2013-01-01T00:00:00.000000000', '2013-01-01T06:00:00.000000000',
                               '2013-01-01T12:00:00.000000000', '2013-01-01T18:00:00.000000000'],
                              dtype='datetime64[ns]')
                      • lat
                        PandasIndex
                        PandasIndex(Index([15.0, 17.5, 20.0, 22.5, 25.0, 27.5, 30.0, 32.5, 35.0, 37.5, 40.0, 42.5,
                               45.0, 47.5, 50.0, 52.5, 55.0, 57.5, 60.0, 62.5, 65.0, 67.5, 70.0, 72.5,
                               75.0],
                              dtype='float32', name='lat'))
                      • lon
                        PandasIndex
                        PandasIndex(Index([200.0, 202.5, 205.0], dtype='float32', name='lon'))
                      • time
                        PandasIndex
                        PandasIndex(DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 06:00:00',
                                       '2013-01-01 12:00:00', '2013-01-01 18:00:00'],
                                      dtype='datetime64[ns]', name='time', freq=None))
                    • long_name :
                      4xDaily Air temperature at sigma level 995
                      units :
                      degK
                      precision :
                      2
                      GRIB_id :
                      11
                      GRIB_name :
                      TMP
                      var_desc :
                      Air temperature
                      dataset :
                      NMC Reanalysis
                      level_desc :
                      Surface
                      statistic :
                      Individual Obs
                      parent_stat :
                      Other
                      actual_range :
                      [185.16 322.1 ]

                    Review#

                    We’ll work with the apply_ufunc call from the section on handling dimensions that change size. See the “Handling Complex Output” section for how to get here.

                    This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions.

                    newlat = np.linspace(15, 75, 100)
                    xr.apply_ufunc(
                        np.interp,  # first the function
                        newlat,
                        air.lat,
                        air.isel(lon=0, time=0),  # this version only works with 1D vectors
                        input_core_dims=[["lat"], ["lat"], ["lat"]],
                        output_core_dims=[["lat"]],
                        exclude_dims={"lat"},
                    
                    <xarray.DataArray (lat: 100)> Size: 800B
                    296.3 296.2 296.1 296.0 295.9 296.0 ... 245.1 243.7 243.1 242.5 241.8 241.2
                    Coordinates:
                        lon      float32 4B 200.0
                        time     datetime64[ns] 8B 2013-01-01
                    Dimensions without coordinates: lat
                    xarray.DataArray
                    • lat: 100
                    • 296.3 296.2 296.1 296.0 295.9 296.0 ... 243.7 243.1 242.5 241.8 241.2
                      array([296.29      , 296.19545455, 296.10090909, 296.00636364,
                             295.91181818, 296.04848485, 296.21818182, 296.38787879,
                             296.55757576, 296.67272727, 296.76969697, 296.86666667,
                             296.96363636, 296.75757576, 296.36969697, 295.98181818,
                             295.59393939, 295.20484848, 294.81454545, 294.42424242,
                             294.03393939, 293.72727273, 293.56      , 293.39272727,
                             293.22545455, 292.92424242, 292.22121212, 291.51818182,
                             290.81515152, 290.13030303, 289.57272727, 289.01515152,
                             288.45757576, 287.9       , 287.56060606, 287.22121212,
                             286.88181818, 286.54242424, 286.0969697 , 285.63636364,
                             285.17575758, 284.71515152, 284.27090909, 283.83212121,
                             283.39333333, 282.95454545, 282.36727273, 281.69090909,
                             281.01454545, 280.33818182, 279.80606061, 279.41818182,
                             279.03030303, 278.64242424, 278.29909091, 278.03      ,
                             277.76090909, 277.49181818, 277.25424242, 277.11121212,
                             276.96818182, 276.82515152, 276.67575758, 276.48181818,
                             276.28787879, 276.09393939, 275.9       , 275.63090909,
                             275.36181818, 275.09272727, 274.82363636, 274.55878788,
                             274.29454545, 274.03030303, 273.76606061, 273.40909091,
                             273.02121212, 272.63333333, 272.24545455, 272.46363636,
                             273.04545455, 273.62727273, 274.20909091, 273.53030303,
                             271.59090909, 269.65151515, 267.71212121, 265.        ,
                             261.        , 257.        , 253.        , 249.62424242,
                             248.12121212, 246.61818182, 245.11515152, 243.72121212,
                             243.09090909, 242.46060606, 241.83030303, 241.2       ])
                      • lon
                        ()
                        float32
                        200.0
                        array(200., dtype=float32)
                      • time
                        ()
                        datetime64[ns]
                        2013-01-01
                        array('2013-01-01T00:00:00.000000000', dtype='datetime64[ns]')

                      Try nD input#

                      Our goal is to interpolate latitude at every longitude and time, such that we go from a dataset with dimensions (time: 4, lat: 25, lon: 3) to (time: 4, lat: 100, lon: 3).

                      If we blindly try passing air (a 3D DataArray), we get a hard-to-understand error

                      newlat = np.linspace(15, 75, 100)
                      xr.apply_ufunc(
                          np.interp,  # first the function
                          newlat,
                          air.lat,
                          air,
                          input_core_dims=[["lat"], ["lat"], ["lat"]],
                          output_core_dims=[["lat"]],
                          exclude_dims={"lat"},
                      

                      We will use a “wrapper” function debug_interp to examine what gets passed to numpy.interp.

                      Such wrapper functions are a great way to understand and debug apply_ufunc use cases.

                      def debug_interp(xi, x, data):
                          print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
                          return np.interp(xi, x, data)
                      interped = xr.apply_ufunc(
                          debug_interp,  # first the function
                          newlat,
                          air.lat,
                          air,
                          input_core_dims=[["lat"], ["lat"], ["lat"]],
                          output_core_dims=[["lat"]],
                          exclude_dims={"lat"},  # dimensions allowed to change size. Must be set!
                      

                      That’s a hard-to-interpret error from NumPy but our print call helpfully printed the shapes of the input data:

                      data: (4, 3, 25) | x: (25,) | xi: (100,)
                      

                      We see that apply_ufunc passes the full 3D array to interp1d_np which in turn passes that on to numpy.interp. But numpy.interp requires a 1D input, and thus the error.

                      Instead of passing the full 3D array we want loop over all combinations of lon and time; and apply our function to each corresponding vector of data along lat.

                      Vectorization with np.vectorize#

                      apply_ufunc makes it easy to loop over the loop dimensions by specifying vectorize=True:

                      vectorize : bool, optional
                          If True, then assume ``func`` only takes arrays defined over core
                          dimensions as input and vectorize it automatically with
                          :py:func:`numpy.vectorize`. This option exists for convenience, but is
                          almost always slower than supplying a pre-vectorized function.
                          Using this option requires NumPy version 1.12 or newer.
                      

                      Warning

                      Also see the numpy documentation for numpy.vectorize. Most importantly

                      The vectorize function is provided primarily for convenience, not for performance. 
                      The implementation is essentially a for loop.
                          air,
                          input_core_dims=[["lat"], ["lat"], ["lat"]],
                          output_core_dims=[["lat"]],
                          exclude_dims={"lat"},  # dimensions allowed to change size. Must be set!
                          vectorize=True,
                      interped
                      
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      data: (25,) | x: (25,) | xi: (100,)
                      
                      <xarray.DataArray (time: 4, lon: 3, lat: 100)> Size: 10kB
                      296.3 296.2 296.1 296.0 295.9 296.0 ... 245.9 244.1 243.5 243.0 242.4 241.8
                      Coordinates:
                        * lon      (lon) float32 12B 200.0 202.5 205.0
                        * time     (time) datetime64[ns] 32B 2013-01-01 ... 2013-01-01T18:00:00
                      Dimensions without coordinates: lat
                      xarray.DataArray
                      • time: 4
                      • lon: 3
                      • lat: 100
                      • 296.3 296.2 296.1 296.0 295.9 296.0 ... 244.1 243.5 243.0 242.4 241.8
                        array([[[296.29      , 296.19545455, 296.10090909, ..., 242.46060606,
                                 241.83030303, 241.2       ],
                                [296.79      , 296.6469697 , 296.50393939, ..., 243.46969697,
                                 242.98484848, 242.5       ],
                                [297.1       , 297.02484848, 296.94969697, ..., 244.08181818,
                                 243.79090909, 243.5       ]],
                               [[296.29      , 296.26818182, 296.24636364, ..., 242.82727273,
                                 242.46363636, 242.1       ],
                                [297.2       , 297.07878788, 296.95757576, ..., 243.37878788,
                                 243.03939394, 242.7       ],
                                [297.4       , 297.25212121, 297.10424242, ..., 243.63333333,
                                 243.36666667, 243.1       ]],
                               [[296.4       , 296.35151515, 296.3030303 , ..., 243.41515152,
                                 242.85757576, 242.3       ],
                                [296.29      , 296.34090909, 296.39181818, ..., 243.26181818,
                                 242.73090909, 242.2       ],
                                [296.4       , 296.37333333, 296.34666667, ..., 243.12424242,
                                 242.71212121, 242.3       ]],
                               [[297.5       , 297.37878788, 297.25757576, ..., 244.02818182,
                                 242.95909091, 241.89      ],
                                [297.7       , 297.65151515, 297.6030303 , ..., 243.4969697 ,
                                 242.64848485, 241.8       ],
                                [297.5       , 297.4030303 , 297.30606061, ..., 242.96363636,
                                 242.38181818, 241.8       ]]])
                        • lon
                          (lon)
                          float32
                          200.0 202.5 205.0
                          array([200. , 202.5, 205. ], dtype=float32)
                        • time
                          (time)
                          datetime64[ns]
                          2013-01-01 ... 2013-01-01T18:00:00
                          array(['2013-01-01T00:00:00.000000000', '2013-01-01T06:00:00.000000000',
                                 '2013-01-01T12:00:00.000000000', '2013-01-01T18:00:00.000000000'],
                                dtype='datetime64[ns]')
                        • lon
                          PandasIndex
                          PandasIndex(Index([200.0, 202.5, 205.0], dtype='float32', name='lon'))
                        • time
                          PandasIndex
                          PandasIndex(DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 06:00:00',
                                         '2013-01-01 12:00:00', '2013-01-01 18:00:00'],
                                        dtype='datetime64[ns]', name='time', freq=None))

                      Wow that worked!

                      Notice that

                    • the printed input shapes are all 1D and correspond to one vector of size 25 along the lat dimension.

                    • debug_interp was called 4x3 = 12 times which is the total number lat vectors since the size along time is 4, and the size along lon is 3.

                    • The result interped is now an xarray object with coordinate values copied over from data.

                    • lat is now the last dimension in interped. This is a “property” of core dimensions: they are moved to the end before being sent to interp1d_np as noted in the docstring for input_core_dims

                          Core dimensions are automatically moved to the last axes of input
                          variables before applying ``func``, which facilitates using NumPy style
                          generalized ufuncs [2]_.
                      

                      Conclusion#

                      This is why apply_ufunc is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.

                      The vectorize keyword argument, when set to True, will use numpy.vectorize to apply the function by looping over the “loop dimensions” — dimensions that are not the core dimensions for the applied function.

    •