"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
"C": [1, 2, 2, 3, 3, 4, 5, 6, 7],
Running the following with pandas:
df.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)
results:
Which is the required result. However, running the same call on a Dask DataFrame fails:
ddf = dd.from_pandas(df, npartitions=3)
ddf.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)
results:
ValueError: 'index' must be the name of an existing column
It seems the Dask implementation only accepts a single column name (a scalar) for index, not a list of columns (see the Dask documentation for dask.dataframe.reshape.pivot_table).
Is there another way to achieve this?
Hi @jadeidev,
This isn't exactly the same, since you'll get a Series instead of a DataFrame, but you can get the same values with a groupby:
res = ddf.groupby(["A", "B"]).C.median()
# Optional, depends on what you want to do
pd_series = res.compute()
Does that help?