添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接
相关文章推荐
谦虚好学的企鹅  ·  Costa Maria Vino ...·  4 月前    · 
听话的牛肉面  ·  4K ...·  1 年前    · 

Xarray Interpolation, Groupby, Resample, Rolling, and Coarsen #

In this lesson, we cover some more advanced aspects of xarray.

import numpy as np
import xarray as xr
from matplotlib import pyplot as plt
%xmode Minimal

Interpolation#

In the previous lesson on Xarray, we learned how to select data based on its dimension coordinates and align data with dimension different coordinates. But what if we want to estimate the value of the data variables at different coordinates. This is where interpolation comes in.

# we write it out explicitly so we can see each point.
x_data = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
f = xr.DataArray(x_data**2, dims=['x'], coords={'x': x_data})
<xarray.DataArray (x: 11)>
array([  0,   1,   4,   9,  16,  25,  36,  49,  64,  81, 100])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10
f.plot(marker='o')

We only have data on the integer points in x. But what if we wanted to estimate the value at, say, 4.5?

f.sel(x=4.5)
array(20.5)
Coordinates:
    x        float64 4.5

Interpolation uses scipy.interpolate under the hood. There are different modes of interpolation.

# default
f.interp(x=4.5, method='linear').values
x_new = x_data + 0.5
f_interp_linear = f.interp(x=x_new, method='linear')
f_interp_cubic = f.interp(x=x_new, method='cubic')
f.plot(marker='o', label='original')
f_interp_linear.plot(marker='o', label='linear')
f_interp_cubic.plot(marker='o', label='cubic')
plt.legend()

You can apply interpolation to any dimension, and even to multiple dimensions at a time. (Multidimensional interpolation only supports mode='nearest' and mode='linear'.) But keep in mind that Xarray has no built-in understanding of geography. If you use interp on lat / lon coordinates, it will just perform naive interpolation of the lat / lon values. More sophisticated treatment of spherical geometry requires another package such as xesmf.

Groupby#

Xarray copies Pandas’ very useful groupby functionality, enabling the “split / apply / combine” workflow on xarray DataArrays and Datasets. In the first part of the lesson, we will learn to use groupby by analyzing sea-surface temperature data.

First we load a dataset. We will use the NOAA Extended Reconstructed Sea Surface Temperature (ERSST) v5 product, a widely used and trusted gridded compilation of of historical data going back to 1854.

Since the data is provided via an OPeNDAP server, we can load it directly without downloading anything:

url = 'http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc'
ds = xr.open_dataset(url, drop_variables=['time_bnds'])
ds = ds.sel(time=slice('1960', '2018')).load()
Dimensions:  (lat: 89, lon: 180, time: 708)
Coordinates:
  * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
Data variables:
    sst      (time, lat, lon) float32 -1.8 -1.8 -1.8 -1.8 ... nan nan nan nan
Attributes: (12/38)
    climatology:                     Climatology is based on 1971-2000 SST, X...
    description:                     In situ data: ICOADS2.5 before 2007 and ...
    keywords_vocabulary:             NASA Global Change Master Directory (GCM...
    keywords:                        Earth Science > Oceans > Ocean Temperatu...
    instrument:                      Conventional thermometers
    source_comment:                  SSTs were observed by conventional therm...
    ...                              ...
    license:                         No constraints on data access or use
    comment:                         SSTs were observed by conventional therm...
    summary:                         ERSST.v5 is developed based on v4 after ...
    dataset_title:                   NOAA Extended Reconstructed SST V5
    data_modified:                   2021-11-07
    DODS_EXTRA.Unlimited_Dimension:  time

Let’s do some basic visualizations of the data, just to make sure it looks reasonable.

ds.sst[0].plot(vmin=-2, vmax=30)

Note that xarray correctly parsed the time index, resulting in a Pandas datetime index on the time dimension.

ds.time
<xarray.DataArray 'time' (time: 708)>
array(['1960-01-01T00:00:00.000000000', '1960-02-01T00:00:00.000000000',
       '1960-03-01T00:00:00.000000000', ..., '2018-10-01T00:00:00.000000000',
       '2018-11-01T00:00:00.000000000', '2018-12-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
Attributes:
    long_name:        Time
    delta_t:          0000-01-00 00:00:00
    avg_period:       0000-01-00 00:00:00
    prev_avg_period:  0000-00-07 00:00:00
    standard_name:    time
    axis:             T
    actual_range:     [19723. 80992.]
    _ChunkSizes:      1
ds.sst.sel(lon=300, lat=50).plot()

As we can see from the plot, the timeseries at any one point is totally dominated by the seasonal cycle. We would like to remove this seasonal cycle (called the “climatology”) in order to better see the long-term variaitions in temperature. We will accomplish this using groupby.

The syntax of Xarray’s groupby is almost identical to Pandas. We will first apply groupby to a single DataArray.

ds.sst.groupby?
----------
group : str, DataArray or IndexVariable
    Array whose unique values should be used to group this array. If a
    string, must be the name of a variable contained in this dataset.
squeeze : bool, optional
    If "group" is a dimension of any arrays in this dataset, `squeeze`
    controls whether the subarrays have a dimension of length 1 along
    that dimension or if the dimension is squeezed out.
restore_coord_dims : bool, optional
    If True, also restore the dimension order of multi-dimensional
    coordinates.
Returns
-------
grouped
    A `GroupBy` object patterned after `pandas.GroupBy` that can be
    iterated over in the form of `(unique_value, grouped_array)` pairs.
Examples
--------
Calculate daily anomalies for daily data:
>>> da = xr.DataArray(
...     np.linspace(0, 1826, num=1827),
...     coords=[pd.date_range("1/1/2000", "31/12/2004", freq="D")],
...     dims="time",
... )
<xarray.DataArray (time: 1827)>
array([0.000e+00, 1.000e+00, 2.000e+00, ..., 1.824e+03, 1.825e+03,
       1.826e+03])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2004-12-31
>>> da.groupby("time.dayofyear") - da.groupby("time.dayofyear").mean("time")
<xarray.DataArray (time: 1827)>
array([-730.8, -730.8, -730.8, ...,  730.2,  730.2,  730.5])
Coordinates:
  * time       (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2004-12-31
    dayofyear  (time) int64 1 2 3 4 5 6 7 8 ... 359 360 361 362 363 364 365 366
See Also
--------
core.groupby.DataArrayGroupBy
core.groupby.DatasetGroupBy
File:      /srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/common.py
Type:      method

Split Step#

The most important argument is group: this defines the unique values we will us to “split” the data for grouped analysis. We can pass either a DataArray or a name of a variable in the dataset. Lets first use a DataArray. Just like with Pandas, we can use the time indexe to extract specific components of dates and times. Xarray uses a special syntax for this .dt, called the DatetimeAccessor.

ds.time.dt
<xarray.DataArray 'month' (time: 708)>
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,
        4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
       12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,
        5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9,
       10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,
        3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,
        8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,
        4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
       12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,
        3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,
        8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,
        4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,
        7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
       12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,
        5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9,
       10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,
        3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,
        8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,
        4,  5,  6,  7,  8,  9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,
        9, 10, 11, 12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  1,
        2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
Coordinates:
  * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01

ds.time.dt.year

We can use these arrays in a groupby operation:

gb = ds.sst.groupby(ds.time.dt.month)

Xarray also offers a more concise syntax when the variable you’re grouping on is already present in the dataset. This is identical to the previous line:

gb = ds.sst.groupby('time.month')

Now that the data are split, we can manually iterate over the group. The iterator returns the key (group name) and the value (the actual dataset corresponding to that group) for each group.

for group_name, group_da in gb:
    # stop iterating after the first loop
    break 
print(group_name)
group_da
<xarray.DataArray 'sst' (time: 59, lat: 89, lon: 180)>
array([[[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
       [[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
       [[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
       [[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]],
       [[-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [-1.8, -1.8, -1.8, ..., -1.8, -1.8, -1.8],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan],
        [ nan,  nan,  nan, ...,  nan,  nan,  nan]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
  * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
  * time     (time) datetime64[ns] 1960-01-01 1961-01-01 ... 2018-01-01
Attributes:
    long_name:     Monthly Means of Sea Surface Temperature
    units:         degC
    var_desc:      Sea Surface Temperature
    level_desc:    Surface
    statistic:     Mean
    dataset:       NOAA Extended Reconstructed SST V5
    parent_stat:   Individual Values
    actual_range:  [-1.8     42.32636]
    valid_range:   [-1.8 45. ]
    _ChunkSizes:   [  1  89 180]

Map & Combine#

Now that we have groups defined, it’s time to “apply” a calculation to the group. Like in Pandas, these calculations can either be:

  • aggregation: reduces the size of the group

  • transformation: preserves the group’s full size

  • At then end of the apply step, xarray will automatically combine the aggregated / transformed groups back into a single object.

    Warning

    Xarray calls the “apply” step map. This is different from Pandas!

    The most fundamental way to apply is with the .map method.

    gb.map?
    
    Signature: gb.map(func, shortcut=False, args=(), **kwargs)
    Docstring:
    Apply a function to each array in the group and concatenate them
    together into a new array.
    `func` is called like `func(ar, *args, **kwargs)` for each array `ar`
    in this group.
    Apply uses heuristics (like `pandas.GroupBy.apply`) to figure out how
    to stack together the array. The rule is:
    1. If the dimension along which the group coordinate is defined is
       still in the first grouped array after applying `func`, then stack
       over this dimension.
    2. Otherwise, stack over the new dimension given by name of this
       grouping (the argument to the `groupby` function).
    Parameters
    ----------
    func : callable
        Callable to apply to each array.
    shortcut : bool, optional
        Whether or not to shortcut evaluation under the assumptions that:
        (1) The action of `func` does not depend on any of the array
            metadata (attributes or coordinates) but only on the data and
            dimensions.
        (2) The action of `func` creates arrays with homogeneous metadata,
            that is, with the same dimensions and attributes.
        If these conditions are satisfied `shortcut` provides significant
        speedup. This should be the case for many common groupby operations
        (e.g., applying numpy ufuncs).
    *args : tuple, optional
        Positional arguments passed to `func`.
    **kwargs
        Used to call `func(ar, **kwargs)` for each array `ar`.
    Returns
    -------
    applied : DataArray or DataArray
        The result of splitting, applying and combining this array.
    File:      /srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/groupby.py
    Type:      method
    

    Aggregations#

    .apply accepts as its argument a function. We can pass an existing function:

    gb.map(np.mean)
    
    <xarray.DataArray 'sst' (month: 12)>
    array([13.659641, 13.768647, 13.76488 , 13.684034, 13.642146, 13.713043,
           13.921847, 14.093956, 13.982147, 13.691116, 13.506494, 13.529454],
          dtype=float32)
    Coordinates:
      * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12

    Because we specified no extra arguments (like axis) the function was applied over all space and time dimensions. This is not what we wanted. Instead, we could define a custom function. This function takes a single argument–the group dataset–and returns a new dataset to be combined:

    def time_mean(a):
        return a.mean(dim='time')
    gb.apply(time_mean)
    
    <xarray.DataArray 'sst' (month: 12, lat: 89, lon: 180)>
    array([[[-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan]],
           [[-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan]],
           [[-1.7995025, -1.7995973, -1.7998415, ..., -1.7997988,
             -1.7996519, -1.7995045],
            [-1.7995876, -1.7997634, -1.8000009, ..., -1.8000009,
             -1.7998358, -1.7996247],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan]]], dtype=float32)
    Coordinates:
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
      * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12

    Like Pandas, xarray’s groupby object has many built-in aggregation operations (e.g. mean, min, max, std, etc):

    # this does the same thing as the previous cell
    sst_mm = gb.mean(dim='time')
    sst_mm
    
    <xarray.DataArray 'sst' (month: 12, lat: 89, lon: 180)>
    array([[[-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan]],
           [[-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan]],
           [[-1.7995025, -1.7995973, -1.7998415, ..., -1.7997988,
             -1.7996519, -1.7995045],
            [-1.7995876, -1.7997634, -1.8000009, ..., -1.8000009,
             -1.7998358, -1.7996247],
            [-1.8000009, -1.8000009, -1.8000009, ..., -1.8000009,
             -1.8000009, -1.8000009],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan],
            [       nan,        nan,        nan, ...,        nan,
                    nan,        nan]]], dtype=float32)
    Coordinates:
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
      * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12

    So we did what we wanted to do: calculate the climatology at every point in the dataset. Let’s look at the data a bit.

    Climatlogy at a specific point in the North Atlantic

    sst_mm.sel(lon=300, lat=50).plot()
    

    Transformations#

    Now we want to remove this climatology from the dataset, to examine the residual, called the anomaly, which is the interesting part from a climate perspective. Removing the seasonal climatology is a perfect example of a transformation: it operates over a group, but doesn’t change the size of the dataset. Here is one way to code it.

    def remove_time_mean(x):
        return x - x.mean(dim='time')
    ds_anom = ds.groupby('time.month').apply(remove_time_mean)
    ds_anom
    Dimensions:  (lat: 89, lon: 180, time: 708)
    Coordinates:
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
      * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
    Data variables:
        sst      (time, lat, lon) float32 9.537e-07 9.537e-07 9.537e-07 ... nan nan

    In the above example, we applied groupby to a Dataset instead of a DataArray.

    Xarray makes these sorts of transformations easy by supporting groupby arithmetic. This concept is easiest explained with an example:

    gb = ds.groupby('time.month')
    ds_anom = gb - gb.mean(dim='time')
    ds_anom
    Dimensions:  (lat: 89, lon: 180, time: 708)
    Coordinates:
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
      * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
        month    (time) int64 1 2 3 4 5 6 7 8 9 10 11 ... 2 3 4 5 6 7 8 9 10 11 12
    Data variables:
        sst      (time, lat, lon) float32 9.537e-07 9.537e-07 9.537e-07 ... nan nan

    Now we can view the climate signal without the overwhelming influence of the seasonal cycle.

    Timeseries at a single point in the North Atlantic

    ds_anom.sst.sel(lon=300, lat=50).plot()
    

    Resample#

    Resample in xarray is nearly identical to Pandas. It can be applied only to time-index dimensions. Here we compute the five-year mean. It is effectively a group-by operation, and uses the same basic syntax. Note that resampling changes the length of the the output arrays.

    ds_anom_resample = ds_anom.resample(time='5Y').mean(dim='time')
    ds_anom_resample
    Dimensions:  (time: 13, lat: 89, lon: 180)
    Coordinates:
      * time     (time) datetime64[ns] 1960-12-31 1965-12-31 ... 2020-12-31
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
    Data variables:
        sst      (time, lat, lon) float32 -0.0005707 -0.0005493 ... nan nan
    ds_anom.sst.sel(lon=300, lat=50).plot()
    ds_anom_resample.sst.sel(lon=300, lat=50).plot(marker='o')
    

    Rolling is also similar to pandas. It does not change the length of the arrays. Instead, it allows a moving window to be applied to the data at each point.

    ds_anom_rolling = ds_anom.rolling(time=12, center=True).mean()
    ds_anom_rolling
    Dimensions:  (lat: 89, lon: 180, time: 708)
    Coordinates:
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
      * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
        month    (time) int64 1 2 3 4 5 6 7 8 9 10 11 ... 2 3 4 5 6 7 8 9 10 11 12
    Data variables:
        sst      (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
    ds_anom.sst.sel(lon=300, lat=50).plot(label='monthly anom')
    ds_anom_resample.sst.sel(lon=300, lat=50).plot(marker='o', label='5 year resample')
    ds_anom_rolling.sst.sel(lon=300, lat=50).plot(label='12 month rolling mean', color='k')
    plt.legend()
    

    Coarsen#

    coarsen is a simple way to reduce the size of your data along one or more axes. It is very similar to resample when operating on time dimensions; the key differnce is that coarsen only operates on fixed blocks of data, irrespective of the coordinate values, while resample actually looks at the coordinates to figure out, e.g. what month a particular data point is in.

    For regularly-spaced monthly data beginning in January, the following should be equivalent to annual resampling. However, results would different for irregularly-spaced data.

    ds.coarsen(time=12).mean()
    Dimensions:  (time: 59, lat: 89, lon: 180)
    Coordinates:
      * lat      (lat) float32 88.0 86.0 84.0 82.0 80.0 ... -82.0 -84.0 -86.0 -88.0
      * lon      (lon) float32 0.0 2.0 4.0 6.0 8.0 ... 350.0 352.0 354.0 356.0 358.0
      * time     (time) datetime64[ns] 1960-06-16T08:00:00 ... 2018-06-16T12:00:00
    Data variables:
        sst      (time, lat, lon) float32 -1.8 -1.8 -1.8 -1.8 ... nan nan nan nan
    Attributes: (12/38)
        climatology:                     Climatology is based on 1971-2000 SST, X...
        description:                     In situ data: ICOADS2.5 before 2007 and ...
        keywords_vocabulary:             NASA Global Change Master Directory (GCM...
        keywords:                        Earth Science > Oceans > Ocean Temperatu...
        instrument:                      Conventional thermometers
        source_comment:                  SSTs were observed by conventional therm...
        ...                              ...
        license:                         No constraints on data access or use
        comment:                         SSTs were observed by conventional therm...
        summary:                         ERSST.v5 is developed based on v4 after ...
        dataset_title:                   NOAA Extended Reconstructed SST V5
        data_modified:                   2021-11-07
        DODS_EXTRA.Unlimited_Dimension:  time

    Coarsen also works on spatial coordinates (or any coordiantes).

    ds_coarse = ds.coarsen(lon=4, lat=4, boundary='pad').mean()
    ds_coarse.sst.isel(time=0).plot(vmin=2, vmax=30, figsize=(12, 5), edgecolor='k')
    

    An Advanced Example#

    In this example we will show a realistic workflow with Xarray. We will

  • Load a “basin mask” dataset

  • Interpolate the basins to our SST dataset coordinates

  • Group the SST by basin

  • Convert to Pandas Dataframe and plot mean SST by basin

  • basin = xr.open_dataset('http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NODC/.WOA09/.Masks/.basin/dods')
    basin
    Dimensions:  (Z: 33, X: 360, Y: 180)
    Coordinates:
      * Z        (Z) float32 0.0 10.0 20.0 30.0 50.0 ... 4e+03 4.5e+03 5e+03 5.5e+03
      * X        (X) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
      * Y        (Y) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
    Data variables:
        basin    (Z, Y, X) float32 ...
    Attributes:
        Conventions:  IRIDL
    basin = basin.rename({'X': 'lon', 'Y': 'lat'})
    basin
    Dimensions:  (Z: 33, lon: 360, lat: 180)
    Coordinates:
      * Z        (Z) float32 0.0 10.0 20.0 30.0 50.0 ... 4e+03 4.5e+03 5e+03 5.5e+03
      * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
      * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
    Data variables:
        basin    (Z, lat, lon) float32 ...
    Attributes:
        Conventions:  IRIDL
    basin_surf = basin.basin[0]
    basin_surf
    
    <xarray.DataArray 'basin' (lat: 180, lon: 360)>
    array([[nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [nan, nan, nan, ..., nan, nan, nan],
           [11., 11., 11., ..., 11., 11., 11.],
           [11., 11., 11., ..., 11., 11., 11.],
           [11., 11., 11., ..., 11., 11., 11.]], dtype=float32)
    Coordinates:
        Z        float32 0.0
      * lon      (lon) float32 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
      * lat      (lat) float32 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
    Attributes:
        long_name:  basin code
        CLIST:      Atlantic Ocean\nPacific Ocean \nIndian Ocean\nMediterranean S...
        valid_min:  1
        valid_max:  58
        scale_min:  1
        units:      ids
        scale_max:  58
    basin_surf.plot(vmax=10)
    
    <xarray.DataArray 'sst' (time: 708, basin: 14)>
    array([[-1.8       , -1.8       , 23.455315  , ..., -1.8       ,
             3.3971915 , 24.182198  ],
           [-1.8       , -1.8       , 23.722523  , ..., -1.8       ,
             0.03573781, 24.59657   ],
           [-1.8       , -1.8       , 24.601315  , ..., -1.8       ,
            -0.26487017, 26.234186  ],
           [ 0.6758132 ,  6.504184  , 29.279463  , ..., 10.920228  ,
            15.955025  , 29.41976   ],
           [-0.7937442 ,  3.0715032 , 27.608435  , ...,  5.4078875 ,
            10.673693  , 27.7558    ],
           [-1.8       , -0.06063586, 25.881481  , ...,  0.5253569 ,
             7.267694  , 26.163145  ]], dtype=float32)
    Coordinates:
      * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
        Z        float32 0.0
      * basin    (basin) float64 1.0 2.0 3.0 4.0 5.0 ... 10.0 11.0 12.0 53.0 56.0
    Attributes:
        long_name:     Monthly Means of Sea Surface Temperature
        units:         degC
        var_desc:      Sea Surface Temperature
        level_desc:    Surface
        statistic:     Mean
        dataset:       NOAA Extended Reconstructed SST V5
        parent_stat:   Individual Values
        actual_range:  [-1.8     42.32636]
        valid_range:   [-1.8 45. ]
        _ChunkSizes:   [  1  89 180]
    basin_mean_sst = ds.sst.groupby(basin_surf_interp).mean()
    basin_mean_sst
    
    <xarray.DataArray 'sst' (time: 708, basin: 14)>
    array([[18.585493 , 20.757555 , 21.572067 , ...,  6.238062 ,  6.889794 ,
            26.49982  ],
           [18.705065 , 20.81674  , 21.902279 , ...,  4.8877654,  5.44638  ,
            26.577093 ],
           [18.845842 , 20.865038 , 22.031416 , ...,  4.686406 ,  5.5322194,
            27.908558 ],
           [19.84992  , 21.960493 , 20.389412 , ..., 17.571943 , 18.184528 ,
            29.336565 ],
           [19.424026 , 21.722925 , 21.061403 , ..., 13.461868 , 13.863244 ,
            28.755905 ],
           [19.265354 , 21.512274 , 21.814356 , ...,  9.417906 , 10.607256 ,
            27.905243 ]], dtype=float32)
    Coordinates:
      * time     (time) datetime64[ns] 1960-01-01 1960-02-01 ... 2018-12-01
        Z        float32 0.0
      * basin    (basin) float64 1.0 2.0 3.0 4.0 5.0 ... 10.0 11.0 12.0 53.0 56.0
    df = basin_mean_sst.mean('time').to_dataframe()
    
    import pandas as pd
    basin_names = basin_surf.attrs['CLIST'].split('\n')
    basin_df = pd.Series(basin_names, index=np.arange(1, len(basin_names)+1))
    basin_df
    
    1                 Atlantic Ocean
    2                 Pacific Ocean 
    3                   Indian Ocean
    4              Mediterranean Sea
    5                     Baltic Sea
    6                      Black Sea
    7                        Red Sea
    8                   Persian Gulf
    9                     Hudson Bay
    10                Southern Ocean
    11                  Arctic Ocean
    12                  Sea of Japan
    13                      Kara Sea
    14                      Sulu Sea
    15                    Baffin Bay
    16            East Mediterranean
    17            West Mediterranean
    18                Sea of Okhotsk
    19                     Banda Sea
    20                 Caribbean Sea
    21                 Andaman Basin
    22               North Caribbean
    23                Gulf of Mexico
    24                  Beaufort Sea
    25               South China Sea
    26                   Barents Sea
    27                   Celebes Sea
    28                Aleutian Basin
    29                    Fiji Basin
    30          North American Basin
    31           West European Basin
    32        Southeast Indian Basin
    33                     Coral Sea
    34             East Indian Basin
    35          Central Indian Basin
    36      Southwest Atlantic Basin
    37      Southeast Atlantic Basin
    38       Southeast Pacific Basin
    39               Guatemala Basin
    40           East Caroline Basin
    41                Marianas Basin
    42                Philippine Sea
    43                   Arabian Sea
    44                   Chile Basin
    45                  Somali Basin
    46               Mascarene Basin
    47                  Crozet Basin
    48                  Guinea Basin
    49                  Brazil Basin
    50               Argentine Basin
    51                    Tasman Sea
    52         Atlantic Indian Basin
    53                   Caspian Sea
    54                   Sulu Sea II
    55               Venezuela Basin
    56                 Bay of Bengal
    57                      Java Sea
    58    East Indian Atlantic Basin
    dtype: object