Hello guys,
I just discovered Dask. I have to deal with huge data (1.6 TB per CSV file), and I think Dask can help me.
I need to apply a “basic” data transformation, and I am using the apply() function to do so.
I have this function:
def extract_data(row):
    ret = dict()
    # do regexp stuff on a specific column
    # generate a few values, store them in the dict ret
    return ret['value1'], ret['value2']
Then I apply this function to the Dask DataFrame:
meta = [('value1', str), ('value2', str)]
newddf = ddf.apply(extract_data, axis=1, meta=meta)
print(newddf) gives me something like this:
Dask DataFrame Structure:
               value1  value2
npartitions=1
               object  object
                  ...     ...
Dask Name: apply, 12 tasks
When I try to run newddf.head(), I get an error:
**AttributeError** : 'DataFrame' object has no attribute 'name'
What did I do wrong?
I can run exactly the same code on a pandas dataframe with no issue.
Thanks for your help!
@pfrenard Welcome to Discourse!
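I was able to reproduce this, and the error is in how you’re defining meta. With axis=1, extract_data returns a tuple for each row, so the result of apply is a Series of tuples rather than a DataFrame with two columns. Your meta is a list of (name, dtype) pairs, which describes a DataFrame, so it doesn’t match the actual output. meta needs to describe a Series instead. You can use something like: meta = ("Result", object)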
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({'x': list(range(5))})
ddf = dd.from_pandas(df, npartitions=2)
def extract_data(row):
    ret = {'value1': 'p', 'value2': 'q'}
    return ret['value1'], ret['value2']
meta = ("Result", object)
newddf = ddf.apply(extract_data, axis=1, meta=meta)
newddf.compute()
Ref docs: dask.dataframe.DataFrame.apply — Dask documentation