Large Input Data
If you are working with large time series data, you are probably facing multiple problems.
The two most important ones are:
long execution times for feature extraction
large memory consumption, even beyond what a single machine can handle
To solve the first problem, you can parallelize the computation as described in
Parallelization
.
Note, that parallelization on your local computer is already turned on by default.
However, for larger data sets you need to handle both problems at the same time.
You have multiple options to do so, which we will discuss in the following paragraphs.
Dask - the simple way
tsfresh
accepts a
dask dataframe
instead of a
pandas dataframe as input for the
tsfresh.extract_features()
function.
Dask dataframes allow you to scale your computation beyond your local memory (via partitioning the data internally)
and even to large clusters of machines.
Its dataframe API is very similar to pandas dataframes and might even be a drop-in replacement.
All arguments discussed in
Data Formats
are also valid for dask dataframes.
The input data will be transformed into the correct format for
tsfresh
using dask methods
and the feature extraction will be added as additional computations to the computation graph.
You can then add additional computations to the result or trigger the computation as usual with
.compute()
.
The last step of the feature extraction is to bring all features into a tabular format.
Especially for very large data samples, this computation can be a large
performance bottleneck.
We therefore recommend to turn the pivoting off, if you do not really need it
and work with the un-pivoted data as much as possible.
For example, to read in data from parquet and do the feature extraction:
import dask.dataframe as dd
from tsfresh import extract_features
df = dd.read_parquet(...)
X = extract_features(df,
column_id="id",
column_sort="time",
pivot=False)
result = X.compute()
Dask - more control
The feature extraction method needs to perform some data transformations before it
can call the actual feature calculators.
If you want to optimize your data flow, you might want to have more control on how
exactly the feature calculation is added to you dask computation graph.
Therefore, it is also possible to add the feature extraction directly:
from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk
features = dask_feature_extraction_on_chunk(df_grouped,
column_id="id",
column_kind="kind",
column_sort="time",
column_value="value")
In this case however, df_grouped
must already be in the correct format.
Check out the documentation of tsfresh.convenience.bindings.dask_feature_extraction_on_chunk()
for more information.
No pivoting will be performed in this case.
PySpark
Similar to dask, it is also possible to pass the feature extraction into a Spark
computation graph.
You can find more information in the documentation of tsfresh.convenience.bindings.spark_feature_extraction_on_chunk()
.