Parallel computing with dask

xarray integrates with dask to support parallel computations and streaming computation on datasets that don’t fit into memory.

Currently, dask is an entirely optional feature for xarray. However, the benefits of using dask are sufficiently strong that dask may become a required dependency in a future version of xarray.

For a full example of how to use xarray’s dask integration, read the blog post introducing xarray and dask.

What is a dask array?

Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.

Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
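
As a minimal sketch of this lazy, block-wise model using dask directly (the array shape and chunk sizes below are arbitrary illustrative choices):

import dask.array as da

# Build a chunked array: four blocks of shape (5000, 5000) each.
x = da.ones((10000, 10000), chunks=(5000, 5000))

# This only records a graph of tasks mapped over the blocks; nothing is computed yet.
y = (x + 1).mean(axis=0)

# Computation runs block-by-block only when the values are requested.
result = y.compute()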

The actual computation is controlled by a multi-processing or thread pool, which allows dask to take full advantage of multiple processors available on most modern computers.
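
With recent dask versions, the scheduler can also be chosen explicitly; a rough sketch (the array and chunk sizes are again arbitrary):

import dask
import dask.array as da

x = da.random.random((10000, 10000), chunks=(2500, 2500))
total = x.sum()

# The threaded scheduler is dask's default for dask arrays.
print(total.compute(scheduler='threads'))

# A local process pool can help for workloads limited by the GIL.
with dask.config.set(scheduler='processes'):
    print(total.compute())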

For more details on dask, read its documentation.

Reading and writing data

The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function.

In [1]: ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
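
If the data is split across many files, a sketch of the multi-file case might look like the following (the glob pattern, chunk sizes, and output filename are hypothetical):

import xarray as xr

# Open many netCDF files as a single dataset backed by dask arrays.
ds = xr.open_mfdataset('example-data-*.nc', chunks={'time': 10})

# Writing back to disk triggers a streaming, chunk-by-chunk computation.
ds.to_netcdf('combined-data.nc')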