import os
import cupy as cp
import pandas as pd
import cudf
import dask_cudf
cp.random.seed(12)
#### Portions of this were borrowed and adapted from the
#### cuDF cheatsheet, existing cuDF documentation,
#### and 10 Minutes to Pandas.
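# A small Series with one missing value, so that `s` is defined before use.
s = cudf.Series([1, 2, 3, None, 4])
ds = dask_cudf.from_cudf(s, npartitions=2)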
# Note the call to head here to show the first few entries. Unlike
# cuDF objects, dask-cuDF objects do not have a printing
# representation that shows values, since they may not be in local
# memory.
ds.head(n=3)
Creating a cudf.DataFrame and a dask_cudf.DataFrame by specifying values for each column.
df = cudf.DataFrame(
    {
        "a": list(range(20)),
        "b": list(reversed(range(20))),
        "c": list(range(20)),
    }
)
Now we will convert our cuDF dataframe into a dask-cuDF equivalent. Here we call out a key difference: to inspect the data we must call a method (here .head() to look at the first few values). In the general case (see the end of this notebook), the data in ddf will be distributed across multiple GPUs.
In this small case, we could call ddf.compute() to obtain a cuDF object from the dask-cuDF object. In general, we should avoid calling .compute() on large dataframes, and restrict ourselves to using it when we have some (relatively) small postprocessed result that we wish to inspect. Hence, throughout this notebook we will generally call .head() to inspect the first few values of a dask-cuDF dataframe, occasionally calling out places where we use .compute() and why.
To understand more of the differences between how cuDF and dask-cuDF behave here, visit the 10 Minutes to Dask tutorial after this one.
ddf = dask_cudf.from_cudf(df, npartitions=2)
ddf.head()
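The difference between .head() and .compute() can be pictured with a toy partitioned collection. This is a pure-Python sketch and not how dask-cuDF is implemented; the class ToyPartitioned and its partitions_touched counter are hypothetical names invented for the illustration.

```python
class ToyPartitioned:
    """Hypothetical stand-in for a partitioned collection (not a dask-cuDF class)."""

    def __init__(self, values, npartitions):
        size = -(-len(values) // npartitions)  # ceiling division
        self.partitions = [values[i:i + size] for i in range(0, len(values), size)]
        self.partitions_touched = 0

    def head(self, n=5):
        # Only the first partition is looked at.
        self.partitions_touched += 1
        return self.partitions[0][:n]

    def compute(self):
        # Every partition is gathered into one local object.
        self.partitions_touched += len(self.partitions)
        return [v for part in self.partitions for v in part]


toy = ToyPartitioned(list(range(20)), npartitions=2)
first_rows = toy.head(n=5)
touched_by_head = toy.partitions_touched
everything = toy.compute()
print(first_rows, touched_by_head, len(everything))
```

In this toy, head touches one partition while compute touches all of them, which is why compute is best reserved for results known to be small.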
Creating a cudf.DataFrame from a pandas DataFrame and a dask_cudf.DataFrame from a cudf.DataFrame.
Note that best practice for using dask-cuDF is to read data directly into a dask_cudf.DataFrame with read_csv or other builtin I/O routines (discussed below).
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.3]})
gdf = cudf.DataFrame.from_pandas(pdf)
Selecting a Column
Selecting a single column, which initially yields a cudf.Series or dask_cudf.Series. Calling compute results in a cudf.Series (equivalent to df.a).
df["a"]
Selecting Rows by Label
Selecting rows from index 2 to index 5 from columns ‘a’ and ‘b’.
df.loc[2:5, ["a", "b"]]
Selecting Rows by Position
Selecting via integers and integer slices, like numpy/pandas. Note that this functionality is not available for Dask-cuDF DataFrames.
df.iloc[0]
Note here we call compute() rather than head() on the dask-cuDF dataframe since we are happy that the number of matching rows will be small (and hence it is reasonable to bring the entire result back).
ddf.query("b == 3").compute()
You can also pass local variables to Dask-cuDF queries, via the local_dict keyword. With standard cuDF, you may either use the local_dict keyword or reference the variable directly in the query string with the @ prefix. Supported logical operators include >, <, >=, <=, ==, and !=.
cudf_comparator = 3
df.query("b == @cudf_comparator")
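The two ways of passing a variable into a query can be compared side by side. The sketch below uses pandas as a CPU stand-in for cuDF, on the assumption that the query syntax is shared between them; the names comparator and threshold are hypothetical.

```python
import pandas as pd  # CPU stand-in for cuDF in this sketch

pdf = pd.DataFrame({"a": list(range(20)), "b": list(reversed(range(20)))})

# Style 1: reference a variable from the calling scope with the @ prefix.
comparator = 3
via_at = pdf.query("b == @comparator")

# Style 2: supply the value explicitly through local_dict. The name
# "threshold" exists only inside the dictionary, not in the calling scope.
via_local_dict = pdf.query("b == @threshold", local_dict={"threshold": 3})

print(via_at)
print(via_at.equals(via_local_dict))
```

The local_dict form does not depend on what happens to be in scope where the query is evaluated, which fits the tutorial's use of it for Dask-cuDF queries.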
MultiIndex
cuDF supports hierarchical indexing of DataFrames using MultiIndex. Grouping hierarchically (see Grouping below) automatically produces a DataFrame with a MultiIndex.
arrays = [["a", "a", "b", "b"], [1, 2, 3, 4]]
tuples = list(zip(*arrays))
idx = cudf.MultiIndex.from_tuples(tuples)
gdf1 = cudf.DataFrame(
    {"first": cp.random.rand(4), "second": cp.random.rand(4)}
)
gdf1.index = idx
gdf2 = cudf.DataFrame(
    {"first": cp.random.rand(4), "second": cp.random.rand(4)}
).T
gdf2.columns = idx
This serves as a prototypical example of when we might want to call .compute(). The result of computing the mean and variance is a single number in each case, so it is definitely reasonable to look at the entire result!
ds.mean().compute(), ds.var().compute()
Applymap
Applying functions to a Series. Note that applying user-defined functions directly with Dask-cuDF is not yet implemented. For now, you can use map_partitions to apply a function to each partition of the distributed dataframe.
def add_ten(num):
    return num + 10
df["a"].apply(add_ten)
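Conceptually, the map_partitions workaround means writing a function that receives one whole partition and applies the user-defined function inside it. The sketch below imitates that on the CPU with pandas and a manual two-way split; the helper add_ten_to_a and the hand-made partitions list are assumptions made for illustration, not part of the dask-cuDF API.

```python
import pandas as pd  # CPU stand-in; a real partition would be a cuDF object


def add_ten(num):
    return num + 10


def add_ten_to_a(partition):
    # Receives one whole partition and returns the transformed partition.
    out = partition.copy()
    out["a"] = out["a"].apply(add_ten)
    return out


pdf = pd.DataFrame({"a": list(range(6))})
# Hypothetical manual partitioning into two contiguous row blocks.
partitions = [pdf.iloc[:3], pdf.iloc[3:]]
result = pd.concat([add_ten_to_a(part) for part in partitions])
print(result["a"].tolist())
```

With a dask-cuDF dataframe, the analogous call would pass the per-partition function to map_partitions, for example ddf.map_partitions(add_ten_to_a).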
String Methods
Like pandas, cuDF provides string processing methods in the str attribute of Series. Full documentation of string methods is a work in progress. Please see the cuDF API documentation for more information.
s = cudf.Series(["A", "B", "C", "Aaba", "Baca", None, "CABA", "dog", "cat"])
s.str.lower()
Join
Performing SQL style merges. Note that the dataframe order is not maintained, but may be restored post-merge by sorting by the index.
df_a = cudf.DataFrame()
df_a["key"] = ["a", "b", "c", "d", "e"]
df_a["vals_a"] = [float(i + 10) for i in range(5)]
df_b = cudf.DataFrame()
df_b["key"] = ["a", "c", "e"]
df_b["vals_b"] = [float(i + 100) for i in range(3)]
merged = df_a.merge(df_b, on=["key"], how="left")
merged
ddf_a = dask_cudf.from_cudf(df_a, npartitions=2)
ddf_b = dask_cudf.from_cudf(df_b, npartitions=2)
merged = ddf_a.merge(ddf_b, on=["key"], how="left").head(n=4)
merged
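One way to make the post-merge reordering explicit is to carry the original row position through the merge as an ordinary column and sort on it afterwards. The sketch below uses pandas as a CPU stand-in (pandas left merges already keep the left-hand order, so the sort is only there to make the step visible); the column name row_id is hypothetical.

```python
import pandas as pd  # CPU stand-in for cuDF in this sketch

df_a = pd.DataFrame({"key": ["a", "b", "c", "d", "e"],
                     "vals_a": [float(i + 10) for i in range(5)]})
df_b = pd.DataFrame({"key": ["a", "c", "e"],
                     "vals_b": [float(i + 100) for i in range(3)]})

# Keep the original row position as a regular column before merging.
left = df_a.reset_index().rename(columns={"index": "row_id"})
merged = left.merge(df_b, on="key", how="left")

# Restore the left-hand row order, whatever order the merge produced.
merged = merged.sort_values("row_id").set_index("row_id")
print(merged)
```

Keys that appear only on the left ("b" and "d" here) keep their row and receive a missing value in vals_b, which is the defining behaviour of a left join.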
Grouping
Like pandas, cuDF and Dask-cuDF support the Split-Apply-Combine groupby paradigm.
df["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(df))]
df["agg_col2"] = [1 if x % 3 == 0 else 0 for x in range(len(df))]
ddf = dask_cudf.from_cudf(df, npartitions=2)
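As a rough illustration of split-apply-combine on the columns just defined, the sketch below runs two groupby aggregations with pandas as a CPU stand-in, assuming the groupby API shown carries over to cuDF. The particular aggregations chosen are only examples.

```python
import pandas as pd  # CPU stand-in for cuDF in this sketch

pdf = pd.DataFrame({"a": list(range(20)), "b": list(reversed(range(20)))})
pdf["agg_col1"] = [1 if x % 2 == 0 else 0 for x in range(len(pdf))]
pdf["agg_col2"] = [1 if x % 3 == 0 else 0 for x in range(len(pdf))]

# Split on one key, apply sum to each group, combine into one frame.
by_one = pdf.groupby("agg_col1").sum()

# Grouping on two keys yields a hierarchical (MultiIndex) result.
by_two = pdf.groupby(["agg_col1", "agg_col2"]).agg({"a": "max", "b": "mean"})

print(by_one)
print(by_two.index.nlevels)
```

The second result is the kind of hierarchically indexed frame mentioned in the MultiIndex section.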
Transpose
Transposing a dataframe, using either the transpose method or T property. Currently, all columns must have the same type. Transposing is not currently implemented in Dask-cuDF.
sample = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sample
Time Series
DataFrames support datetime-typed columns, which allow users to interact with and filter data based on specific timestamps.
import datetime as dt
date_df = cudf.DataFrame()
date_df["date"] = pd.date_range("11/20/2018", periods=72, freq="D")
date_df["value"] = cp.random.sample(len(date_df))
search_date = dt.datetime.strptime("2018-11-23", "%Y-%m-%d")
date_df.query("date <= @search_date")
date_ddf = dask_cudf.from_cudf(date_df, npartitions=2)
date_ddf.query(
    "date <= @search_date", local_dict={"search_date": search_date}
).compute()
gdf = cudf.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "grade": ["a", "b", "b", "a", "a", "e"]}
)
gdf["grade"] = gdf["grade"].astype("category")
Accessing the categories of a column. Note that this is currently not supported in Dask-cuDF.
gdf.grade.cat.categories
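A categorical column stores each distinct label once and an integer code per row. The sketch below shows both pieces with pandas as a CPU stand-in, assuming the cat accessor behaves the same way for this simple case.

```python
import pandas as pd  # CPU stand-in for cuDF in this sketch

pdf = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "grade": ["a", "b", "b", "a", "a", "e"]}
)
pdf["grade"] = pdf["grade"].astype("category")

# Distinct labels, stored once.
categories = pdf["grade"].cat.categories.tolist()
# One integer per row, pointing into the list of categories.
codes = pdf["grade"].cat.codes.tolist()

print(categories)
print(codes)
```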
To convert the first few entries to pandas, we similarly call .head() on the dask-cuDF dataframe to obtain a local cuDF dataframe, which we can then convert.
ddf.head().to_pandas()
In contrast, if we want to convert the entire frame, we need to call .compute() on ddf to get a local cuDF dataframe, and then call to_pandas(), followed by subsequent processing. This workflow is less recommended, since it both puts high memory pressure on a single GPU (the .compute() call) and does not take advantage of GPU acceleration for processing (the computation happens in pandas).
ddf.compute().to_pandas().head()
a: [[0,1,2,3,4,...,15,16,17,18,19]]
b: [[19,18,17,16,15,...,4,3,2,1,0]]
c: [[0,1,2,3,4,...,15,16,17,18,19]]
agg_col1: [[1,0,1,0,1,...,0,1,0,1,0]]
agg_col2: [[1,0,0,1,0,...,1,0,0,1,0]]
Note that for the dask-cuDF case, we use dask_cudf.read_csv in preference to dask_cudf.from_cudf(cudf.read_csv) since the former can parallelize across multiple GPUs and handle CSV files larger than would fit in memory on a single GPU.
ddf = dask_cudf.read_csv("example_output/foo_dask.csv")
ddf.head()
Reading all CSV files in a directory into a single dask_cudf.DataFrame, using the star wildcard.
ddf = dask_cudf.read_csv("example_output/*.csv")
ddf.head()
Reading/Writing Parquet Files
Writing to parquet files with cuDF’s GPU-accelerated parquet writer.
df.to_parquet("example_output/temp_parquet")
Writing to parquet files from a dask_cudf.DataFrame using cuDF’s parquet writer under the hood.
ddf.to_parquet("example_output/ddf_parquet_files")
Dask Performance Tips
Like Apache Spark, Dask operations are lazy. Instead of being executed immediately, most operations are added to a task graph and the actual evaluation is delayed until the result is needed.
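The idea of recording work in a graph and running it later can be shown with a few lines of plain Python. This is a toy sketch of deferred evaluation only, not Dask's scheduler; the Lazy class and its executed log are hypothetical names.

```python
class Lazy:
    """Hypothetical deferred value: records work now, runs it on compute()."""

    executed = []  # log of the operations that have actually run

    def __init__(self, func, *deps):
        self.func = func
        self.deps = deps

    def compute(self):
        # Evaluate upstream nodes first, then this node.
        args = [dep.compute() for dep in self.deps]
        Lazy.executed.append(self.func.__name__)
        return self.func(*args)


def load():
    return list(range(5))


def add_five(values):
    return [v + 5 for v in values]


graph = Lazy(add_five, Lazy(load))  # builds the graph; nothing runs yet
ran_before = list(Lazy.executed)
result = graph.compute()            # now both tasks execute, upstream first
print(ran_before, result, Lazy.executed)
```

Nothing appears in the log until compute is called, at which point the upstream task runs before the one that depends on it.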
Sometimes, though, we want to force the execution of operations. Calling persist on a Dask collection fully computes it (or actively computes it in the background), persisting the result into memory. When we’re using distributed systems, we may want to wait until persist is finished before beginning any downstream operations. We can enforce this contract by using wait. Wrapping an operation with wait will ensure it doesn’t begin executing until all necessary upstream operations have finished.
The snippets below provide basic examples, using LocalCUDACluster to create one dask-worker per GPU on the local machine. For more detailed information about persist and wait, please see the Dask documentation for persist and wait. Wait relies on the concept of Futures, which is beyond the scope of this tutorial. For more information on Futures, see the Dask Futures documentation. For more information about multi-GPU clusters, please see the dask-cuda library (documentation is in progress).
First, we set up a GPU cluster. With our client set up, Dask-cuDF computation will be distributed across the GPUs in the cluster.
import time
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()
client = Client(cluster)
Persisting Data
Next, we create our Dask-cuDF DataFrame and apply a transformation, storing the result as a new column.
nrows = 10000000
df2 = cudf.DataFrame({"a": cp.arange(nrows), "b": cp.arange(nrows)})
ddf2 = dask_cudf.from_cudf(df2, npartitions=16)
ddf2["c"] = ddf2["a"] + 5
Mon Sep 23 17:35:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-32GB Off | 00000000:02:00.0 Off | 0 |
| N/A 26C P0 37W / 250W | 643MiB / 32768MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Because Dask is lazy, the computation has not yet occurred. We can see that there are sixty-four tasks in the task graph and we’re using about 330 MB of device memory on each GPU. We can force computation by using persist. By forcing execution, the result is now explicitly in memory and our task graph only contains one task per partition (the baseline).
ddf2 = ddf2.persist()
/opt/conda/envs/docs/lib/python3.11/site-packages/distributed/client.py:3357: UserWarning: Sending large graph of size 152.61 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
warnings.warn(
Dask DataFrame Structure:
Mon Sep 23 17:35:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-PCIE-32GB Off | 00000000:02:00.0 Off | 0 |
| N/A 26C P0 37W / 250W | 1433MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Because we forced computation, we now have a larger object in distributed GPU memory. Note that actual numbers will differ between systems (for example depending on how many devices are available).
Wait
Depending on our workflow or distributed computing setup, we may want to wait until all upstream tasks have finished before proceeding with a specific function. This section shows an example of this behavior, adapted from the Dask documentation.
First, we create a new Dask DataFrame and define a function that we’ll map to every partition in the dataframe.
import random
nrows = 10000000
df1 = cudf.DataFrame({"a": cp.arange(nrows), "b": cp.arange(nrows)})
ddf1 = dask_cudf.from_cudf(df1, npartitions=100)
def func(df):
    # Sleep for a random 1-10 seconds so partitions finish at different times.
    time.sleep(random.randint(1, 10))
    return (df + 5) * 3 - 11
This function will do a basic transformation of every column in the dataframe, but the time spent in the function will vary due to the time.sleep statement randomly adding 1-10 seconds of time. We’ll run this on every partition of our dataframe using map_partitions, which adds the task to our task-graph, and store the result. We can then call persist to force execution.
results_ddf = ddf2.map_partitions(func)
results_ddf = results_ddf.persist()
However, some partitions will be done much sooner than others. If we had downstream processes that should wait for all partitions to be completed, we can enforce that behavior using wait.
wait(results_ddf)
DoneAndNotDoneFutures(done={<Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 0)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 3)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 10)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 5)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 11)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 15)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 6)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 13)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 1)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 9)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 14)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 4)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 7)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 12)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 2)>, <Future: finished, type: cudf.core.dataframe.DataFrame, key: ('func-931c70a1496f22a9c0a7cbc675197582', 8)>}, not_done=set())
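The result printed above has two fields, done and not_done. The standard library's concurrent.futures.wait returns a named tuple of the same shape, so it can serve as a CPU-only stand-in for experimenting with the pattern. The sketch below is an analogy under that assumption, not the Dask implementation; slow_task and its delays are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_task(index, delay):
    # Stand-in for per-partition work that takes a varying amount of time.
    time.sleep(delay)
    return index


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(slow_task, i, 0.05 * (i % 3)) for i in range(6)]
    finished = wait(futures)  # blocks until every future has completed

results = sorted(f.result() for f in finished.done)
print(len(finished.done), len(finished.not_done))
print(results)
```

Once wait returns with an empty not_done set, downstream steps can safely assume every partition's work has finished.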