How to Select a Model For Your Time Series Prediction Task [Guide]

link管理

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

Stationarity

Another important definition in time series is stationarity. A stationary time series is a time series that has no trend. Some time series models are not able to deal with trends (more on this later). You can detect non-stationarity using the Dickey-Fuller Test and you can remove non-stationarity using differencing.

Dickey-Fuller test

The Dickey-Fuller test is a statistical hypothesis test that allows you to detect non-stationarity. You can use the following Python code to apply a Dickey-Fuller test to the CO2 data:

from statsmodels.tsa.stattools import adfuller
adf, pval, usedlag, nobs, crit_vals, icbest =  adfuller(co2_data.co2.values)
print('ADF test statistic:', adf)
print('ADF p-values:', pval)
print('ADF number of lags used:', usedlag)
print('ADF number of observations:', nobs)
print('ADF critical values:', crit_vals)
print('ADF best information criterion:', icbest)

The null hypothesis of the ADF test is that a unit root is present in the time series. The alternative hypothesis is that the data is stationary.

The second value is the p-value. If this p-value is smaller than 0.05 you can reject the null hypothesis (reject non-stationarity) and accept the alternative hypothesis (stationarity). In this case, we cannot reject the null hypothesis and will have to assume that the data is non-stationary. As you have seen the data, you know that there is a trend, so this also confirms the result we obtained.

Differencing

You can remove the trend from your time series. The goal is to have only seasonal variation: this can be a way to use certain models that work with seasonality but not with trends.

adf, pval, usedlag, nobs, crit_vals, icbest =  adfuller(differenced_co2.dropna())
print('ADF test statistic:', adf)
print('ADF p-values:', pval)
print('ADF number of lags used:', usedlag)
print('ADF number of observations:', nobs)
print('ADF critical values:', crit_vals)
print('ADF best information criterion:', icbest)

The last concept that is important to understand before going into modeling is the concept of one-step models versus multi-step models.

Some models work great for predicting the next step for a time series, but do not have the capacity to predict multiple steps at once. These models are one-step models. You can make multi-step models with them by windowing over your predictions, but there is a risk: when using predicted values to make predictions, your errors can quickly add up and become very large.

Multistep models are models that have the intrinsic capacity to predict multiple steps at once. They are generally the better choice for long-term forecasts, and sometimes for one-step forecasts as well. It is key that you decide on the number of steps that you want to predict before starting to build models. This purely depends on your use case.

Classical time series models are a family of models that have been traditionally used a lot in many domains of forecasting. They are strongly based on temporal variation inside a time series and they work well with univariate time series. Some advanced options exist to add external variables into the models as well. These models are generally only applicable to time series and are not useful for other types of machine learning.

Supervised models

Supervised models are a family of models that are used for many machine learning task. A machine learning model is supervised when it uses clearly defined input variables and one or more output (target) variables.

Supervised models can be used for time series, as long as you have a way to extract seasonality and put it into a variable. Examples include creating a variable for a year, a month, or a day of the week, etc. These are then used as the X variables in your supervised model and the ‘y’ is the actual value of the time series. You can also include lagged versions of y (the past value of y) into the X data, in order to add autocorrelation effects.

Deep learning and recent models

The increasing popularity of deep learning over the past years has opened new doors for forecasting as well, as specific deep learning architectures have been invented that work very well on sequence data.

Cloud computing and the popularization of AI as a service have also provided a number of new inventions in the domain. Facebook, Amazon, and other big tech companies are open-sourcing their forecasting products, or making them available on their cloud platforms. The availability of those new “black- box” models gives forecasting practitioners new tools to try and test, and can sometimes even beat previous models.

Going deeper into classical time series models

In this part, you will discover classical time series models in depth.

ARIMA family

The ARIMA family of models is a set of smaller models that can be combined. Each part of the ARMIA model can be used as a stand-alone component, or the different building blocks can be combined. When all of the individual components are put together, you obtain the SARIMAX model. You will now see each of the building blocks separately.

1. Autoregression (AR)

Autoregression is the first building block of the SARIMAX family. You can see the AR model as a regression model that explains a variable’s future value using its past (lagged) values.

The order of an AR model is denoted as p, and it represents the number of lagged values to include in the model. The simplest model is the AR(1) model: it uses only the value of the previous timestep to predict the current value. The maximum number of values that you can use is the total length of the time series (i.e. you use all previous time steps).

2. Moving average (MA)

The Moving Average is the second building block of the larger SARIMAX model. It works in a comparable way to the AR model: it uses past values to predict the current value of the variable.

The past values that the Moving Average model uses are not the values of the variable. Rather, the Moving Average uses the prediction error in previous time steps to predict the future.

This sounds counter-intuitive, but there is a logic behind it. When a model has some unknown but regular external perturbations, your model may have a seasonality or other pattern in the error of the model. The MA model is a method to capture this pattern without even having to identify where it comes from.

The MA model can use multiple steps back in time as well. This is represented in the order parameter called q. For example, an MA(1) model has an order of 1 and uses only one time step back.

3. Autoregressive moving average (ARMA)

The Autoregressive Moving Average, or ARMA, model combines the two previous building blocks into one model. ARMA can therefore use both the value and the prediction errors from the past.

ARMA can have different values for the lag of the AR and MA processes. For example an ARMA(1, 0) model has an AR order of 1 ( p = 1) and an MA order of 0 (q=0). This is actually just an AR(1) model. The MA(1) model is the same as the ARMA(0, 1) model. Other combinations are possible: ARMA(3, 1) for example has an AR order of 3 lagged values and uses 1 lagged value for the MA.

4. Autoregressive integrated moving average (ARIMA)

The ARMA model needs a stationary time series. As you have seen earlier on, stationarity means that a time series remains stable. You can use the Augmented Dickey-Fuller test to test whether your time series is stable and apply differencing if it is not the case.

The ARIMA model adds automatic differencing to the ARMA model. It has an additional parameter that you can set to the number of times that the time series needs to be differenced. For example, an ARMA(1,1) that needs to be differenced one time would result in the following notation: ARIMA(1, 1, 1). The first 1 is for the AR order, the second one is for the differencing, and the third 1 is for the MA order. ARIMA(1, 0, 1) would be the same as ARMA(1, 1).

5. Seasonal autoregressive integrated moving-average (SARIMA)

SARIMA adds seasonal effects into the ARIMA model. If seasonality is present in your time series, it is very important to use it in your forecast.

SARIMA notation is quite a bit more complex than ARIMA, as each of the components receives a seasonal parameter on top of the regular parameter.

For example, let’s consider the ARIMA(p, d, q) as seen before. In SARIMA notation, this becomes SARIMA(p, d, q)(P, D, Q)m.

m is simply the number of observations per year: monthly data has m=12, quarterly data has m=4 etc. The small letters (p, d, q) represent the non-seasonal orders. The capital letters (P, D, Q) represent the seasonal orders.

import pmdarima as pm
from pmdarima.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

You could see Vector Autoregression, or VAR as a multivariate alternative to Arima. Rather than predicting one dependent variable, you predict multiple time series at the same time. This can be especially useful when there are strong relationships between your different time series. The Vector Autoregression, like the standard AR model, contains only an autoregressive component.

The VARMA model is the multivariate equivalent of the ARMA model. VARMA is to ARMA what VAR is to AR: it adds a Moving Average component to the model.

If you want to go a step further, you can use VARMAX . The X represents external (exogenous) variables . Exogenous variables are variables that can help your model to make better forecasts, but that do not need to be forecasted themselves. The statsmodels VARMAX implementation is a good way to get started with implementing VARMAX models.

More advanced versions like seasonal VARMAX (SVARMAX) exist, but they become so complex and specific that it will be hard to find implementations that do this easily and efficiently. Once models become so complex, it gets hard to understand what is happening inside the model and is often better to start looking into other familiar models.

Smoothing

Exponential Smoothing is a basic statistical technique that can be used to smoothen out time series. Time series patterns often have a lot of long term variability, but also short term (noisy) variability. Smoothening allows you to make your curve smoother so that long term variability becomes more evident and short term (noisy) patterns are removed.

This smooth version of the time series can then be used for analysis.

1. Simple moving average

The simple moving average is the simplest smoothing technique. It consists of replacing the current value by the average of the current and a few past values. The exact number of past values to take into account is a parameter. The more values you use, the smoother the curve will become. At the same time, you will lose more and more variation.

2. Simple exponential smoothing (SES)

Exponential smoothing is an adaptation of this simple moving average. Rather than taking the average, it takes a weighted average of past values. A value that is further back will count less and a more recent value will count more.

3. Double exponential smoothing

When trends are present in your time series data, you should avoid using Simple Exponential Smoothing: it does not work well in this case, as the model cannot make the distinction between variation and trend correctly. However, you can use double exponential smoothing .

In DES, there is a recursive application of an exponential filter. This allows you to remove trend problems. This works using the following formulas for time zero:

If you want to go even further, you can use Triple Exponential Smoothing, which is also called Holt Winter’s exponential smoothing . You should use this only when there are three important signals in your time series data. For example, one signal could be the trend, another one could be a weekly seasonality and a third one could be a monthly seasonality.

An example of exponential smoothing in Python

In the following example, you see how to apply a Simple Exponential Smoothing to the CO2 data. The smoothing level indicates how smooth your curve should become. In the example it is set very low, indicating a very smooth curve. Feel free to play around with this parameter and see what less smooth versions look like.

from statsmodels.tsa.api import SimpleExpSmoothing
es = SimpleExpSmoothing(co2_data.co2.values)
es.fit(smoothing_level=0.01)
plt.plot(co2_data.co2.values)
plt.plot(es.predict(es.params, start=0, end=None))
plt.show()

Supervised Machine Learning models work very differently than classical machine learning models. The main difference is that they consider that variables are either dependent variables or independent variables. Dependent variables, or target variables, are the variables that you want to predict. Independent variables are the variables that help you to predict.

Supervised Machine Learning models are not made especially for time series data. After all, there are often no independent variables in time series data. Yet it is fairly simple to adapt them to time series by converting the seasonality (based on your time stamps for example) into independent variables.

Linear regression

Linear Regression is arguably the simplest supervised machine learning model. Linear Regression estimates linear relationships: each independent variable has a coefficient that indicates how this variable affects the target variable.

Simple Linear Regression is a Linear Regression in which there is only one independent variable. An example of a Simple Linear Regression model in non-time series data could be the following: hot chocolate sales that depend on the outside temperature (degrees Celsius).

The colder the temperature, the higher the hot chocolate sales. Visually, this could look like the graph below.

In Multiple Linear Regression, rather than using only one independent variable, you use multiple independent variables. You could imagine the 2d graph converting into a 3d graph, where the third axis represents the variable Price. In this case, you would build a Linear Model that explains the sales using temperature and price. You can add as many variables as you need.

Now, of course, this is not a time series data set: there is no time variable present. So, how could you use this technique for time series? The answer is fairly straightforward. Rather than only using temperature and price in this data set, you could add the variables year, month, day of the week, etc.

If you build a supervised model on time series, you have the disadvantage that you need to do a little bit of feature engineering to extract seasonality into variables in a way or another. An advantage is, however, that adding exogenous variables becomes much easier.

Let’s now see how to apply a Linear Regression on the CO2 dataset. You can prepare the CO2 data as follows:

# extract the seasonality data months = [x.month for x in co2_data.index] years = [x.year for x in co2_data.index] day = [x.day for x in co2_data.index] # convert into one matrix X = np.array([day, months, years]).T

The XGBoost model is a third model that you should absolutely know. There are many other models out there, but Random Forests and XGBoost are considered absolute classics among the supervised machine learning family.

XGBoost is a machine learning model that is based on the gradient boosting framework. This model is an ensemble model of weak learners just like the Random Forest but with an interesting advantage. In standard gradient boosting, the individual trees are fit in sequence and each consecutive decision tree is fit in such a way to minimize the error of previous trees. XGBoost obtains the same result but is still able to do parallel learning.

You can use the XGBoost package as follows:

LSTMs are Recurrent Neural Networks. Neural Networks are very complex machine learning models that pass input data through a network. Each node in the network learns a very simple operation. The neural network consists of many such nodes. The fact that the model can use a large number of simple nodes makes the overall prediction very complex. Neural Networks can therefore fit very complex and nonlinear data sets.

RNNs are a special type of Neural network, in which the network can learn from sequence data. This can be useful for multiple use cases, including understanding time series (which are clearly sequences of values over time), but also text (sentences are sequences of words).

LSTMs are a specific type of RNNs. They have proven useful for time series forecasting on multiple occasions. They require some data and are more complicated to learn than supervised models. Once you master them, they can prove to be very powerful depending on your data and your specific use case.

To go into LSTMs, the Keras library in Python is a great starting point.

Prophet

Prophet is a time series library that was open-sourced by Facebook. It is a black-box model, as it will generate forecasts without much user specification. This can be an advantage, as you can almost automatically generate forecasting models without much knowledge or effort.

On the other hand, there is a risk here as well: if you do not pay close enough attention, you may very well be producing a model that seems good to the automated model building tool, but that in reality does not work well.

Extensive model validation and evaluation are recommended when using such black-box models, yet if you find that it works well on your specific use case, you may find a lot of added value here.

You can find a lot of resources on Facebook’s GitHub .

The first thing to define when selecting models is the metric that you want to look at. In the previous part, you have seen multiple fits with different qualities (think about the linear regression vs the random forest).

To go further with model selection , you will need to define a metric to evaluate your models. A very often used model in forecasting is the Mean Squared Error. This metric measures the error at each point in time and takes the square of it. The average of those squared errors is called the Mean Squared Error. An often-used alternative is the Root Mean Squared Error: the square root of the Mean Squared Error.

Another frequently used metric is the Mean Absolute Error: rather than taking the square of each error, it takes the absolute value here. The Mean Absolute Percent Error is a variation on this where the Absolute Error at each point in time is expressed as a percentage of the actual value. This yields a metric that is a percentage, which is very easy to interpret.

Time series train test split

The second thing to think about when evaluating Machine Learning is to consider that a model that works well on the training data, does not necessarily work well on new, out-of-sample data. Such a model is called an overfitting model.

There are two common approaches that can help you to estimate whether a model is generalizing correctly: the train-test-split and cross-validation.

The train test split means that you remove a part of your data before fitting the model. As an example, you could remove the last 3 years from the CO2 database and use the remaining 40 years for fitting the model. You’d then forecast the three years of test data and measure the evaluation metric of your choice between your predictions and the actual values of the last three years.

To benchmark and choose models, you could build multiple models on the 40 years of data and do the test set evaluation on all of them. Based on this test-performance, you could select the model that is most performant.

Of course, if you are building a short-term forecasting model, using three years of data would not make sense: you’d choose an evaluation period that is comparable to the period that you’d forecast in reality.

Time series cross-validation

A risk with the train test split is that you measure only at one point in time. In non-time series data, the test set is generally generated by a random selection of data points. In time series, however, this does not work in many cases: when sequences are used, you can not remove one point in the sequence and still expect the model to work.

Time series train test splits are therefore best applied by selecting the final period as the test set. The risk here is that this can go wrong if your last period is not very reliable. In recent covid periods, you can imagine that many business forecasts have gone completely off: the underlying trends have changed.

Cross-validation is a method that does a repeated train test evaluation. Rather than making one train test split, it makes multiple (the exact number is a user-defined parameter). For example, if you use 3-fold cross-validation, you will split your data set into three equal parts. You will then fit three times the same model on two-thirds of the data set and use the other third for evaluation. In the end, you have three evaluation scores (each on a different test set) and you can use the average as the final metric.

In time series, however, you cannot apply a random selection to obtain multiple test sets. If you’d do this, you’d end up with sequences with many missing data points.

A solution can be found in time series cross-validation. What it does is create multiple train test sets, but each of the test sets is the end of the period. For example, the first train test split could be built on the first 10 years of data (5 train, 5 test). The second model would be done on the first 15 years of data (10 train, 5 test), etc. This can work well but has the disadvantage that each of the models does not use the same number of years in the training data.

An alternative is to do a rolling split (always 5 years train, 5 years test), but here the disadvantage is that you can never use more than 5 years for training data.

Time series model experiments

In conclusion, when doing time series model selection, the following questions are key to define before starting to experiment:

In this part, you will work on a forecast for the next day of the S&P 500. You could imagine running your model every night and then the next day you would know if the stock market is going up or down. If you have a very accurate model for doing this, you could easily make a lot of money (don’t treat it as financial advice ;)).

Stock market forecasting data and definition of the evaluation method

Obtaining stock market data

You can use the Yahoo Finance package in Python to automatically download stock data.

# taking the close price (end of day) sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21") sp500_data = sp500_data[['Close']] sp500_data.plot(figsize=(12, 12))

As you want to predict for one day only, you can understand that the test set will be very small (one day only). Therefore, it would be best to create a lot of test splits to make sure that there is an acceptable amount of model evaluation.

This can be obtained by the Time Series Split that was explained earlier. For example, you can set up a Time Series Split that will make 100 train test sets, in which each train test set uses three months of training data and one day of test data. This will work fine for this example to understand the principle of model selection in time series.

Building a classical time series model

Let’s start with a classical time series model on this problem: the Arima model. In this code, you will set up the automated creation of Arima models with orders ranging from (0, 0, 0) to (4, 4, 4). Each of the models will be built and evaluated using a Time Series Split with 100 splits, in which the train size is a maximum of three months and the test size is always one day.

Because of the sheer number of runs involved, the results are logged to neptune.ai for ease of comparison. For following along, you can set up a free account and get more info from this tutorial .

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
import neptune
from neptune.utils import stringify_unsupported
param_list = [(x, y, z) for x in range(5) for y in range(5) for z in range(5)]
for order in param_list:
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    run['order'] = stringify_unsupported(order)
    # for each param combi do a ts split
    # max 3 months training data
    # 1 day test size
    mses = []
    tscv = TimeSeriesSplit(n_splits=100,
                           max_train_size = 3*31,
                           test_size=1)
    for train_index, test_index in tscv.split(y):
          train = y[train_index]
          test = y[test_index]
          # for each ts split do a model
          mod = sm.tsa.ARIMA(train, order=order)
          res = mod.fit()
          pred = res.forecast(1)[0]
          mse = mean_squared_error(test, pred)
          mses.append(mse)
        except:
	    # ignore models that error
        average_mse = np.mean(mses)
        std_mse = np.std(mses)
        run['average_mse'] = average_mse
        run['std_mse'] = std_mse
    except:
        run['average_mse'] = None
        run['std_mse'] = None
    run.stop()

The model with the lowest average MSE is the model with order (0, 1, 3). However, as you can see, the standard deviation of this model is suspiciously 0. The next two models in the line are ARIMA(1, 0, 3) and ARIMA(1, 0, 2). They are very similar and this would indicate that the result is reliable. The best guess here would be to take the ARIMA(1, 0, 3) as the best model which has an average MSE of 0.00000131908 and an average standard deviation of 0.00000197007.

In some use cases, you may have a lot of data about the future. For example, if you want to predict the number of customers of a restaurant, you could use external data on the number of reservations that have been made for future dates as independent variables.

For the current stock market use case, you do not have those data: you only have the stock price over a period of time. However, supervised models cannot be built using only a target variable. You’ll need to find a way to extract seasonality from the data and use feature engineering to create independent variables. As the stock market is known to have a lot of autocorrelation effects, let’s try a model that uses the values of the past 30 days as predictor variables to predict the 31st day.

You can create a data set that has all of the possible combinations of 30 training days and 1 test day (always consecutive) of the S&P500 and you would be able to create an enormous training database this way:

import yfinance as yf
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
sp500_data = sp500_data[['Close']]
difs = (sp500_data.shift() - sp500_data) / sp500_data
difs = difs.dropna()
y = difs.Close.values
# window through the data
X_data = []
y_data = []
for i in range(len(y) - 31):
    X_data.append(y[i:i+30])
    y_data.append(y[i+30])
X_windows = np.vstack(X_data)

from sklearn.model_selection import KFold import neptune from neptune.utils import stringify_unsupported from sklearn.metrics import mean_squared_error # specify the grid for the grid search of hyperparameter tuning parameters={'max_depth': list(range(2, 20, 4)), 'gamma': list(range(0, 10, 2)), 'min_child_weight' : list(range(0, 10, 2)), 'eta': [0.01,0.05, 0.1, 0.15,0.2,0.3,0.5] param_list = [(x, y, z, a) for x in parameters['max_depth'] for y in parameters['gamma'] for z in parameters['min_child_weight'] for a in parameters['eta']] for params in param_list: mses = [] run = neptune.init_run( project="YOU/YOUR_PROJECT", api_token="YOUR_API_TOKEN", run['params'] = stringify_unsupported(params) my_kfold = KFold(n_splits=10, shuffle=True, random_state=0) for train_index, test_index in my_kfold.split(X_windows): X_train, X_test = X_windows[train_index], X_windows[test_index] y_train, y_test = np.array(y_data)[train_index], np.array(y_data)[test_index] xgb_model = xgb.XGBRegressor(max_depth=params[0],gamma=params[1], min_child_weight=params[2], eta=params[3]) xgb_model.fit(X_train, y_train) preds = xgb_model.predict(X_test) mses.append(mean_squared_error(y_test, preds)) average_mse = np.mean(mses) std_mse = np.std(mses) run['average_mse'] = average_mse run['std_mse'] = std_mse run.stop()

For more reading on XGBoost tuning check the official XGBoost documentation on the topic over here .

The best (lowest) MSE obtained by this XGBoost is 0.000129982. There are multiple hyperparameter combinations that obtain this score. As you can see, the XGBoost model is much less performant than the classical time series model, at least in the current configuration. Another method of organizing the data may be necessary to get better results from XGBoost.

Building a deep learning-based time series model

As a third model for the model comparison, let’s take an LSTM and see whether this can beat the ARIMA model. You can do a model comparison using cross-validation as well. However, this can be fairly long to run. In this case, you see how a train/test split has been used instead.

You can build the LSTM using the following code:

import yfinance as yf
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
sp500_data = sp500_data[['Close']]
difs = (sp500_data.shift() - sp500_data) / sp500_data
difs = difs.dropna()
y = difs.Close.values
# create windows
X_data = []
y_data = []
for i in range(len(y) - 3*31):
    X_data.append(y[i:i+3*31])
    y_data.append(y[i+3*31])
X_windows = np.vstack(X_data)
# create train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_windows, np.array(y_data), test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val  = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
# build LSTM using tensorflow keras
import numpy as np
import neptune
archi_list = [
              [tf.keras.layers.LSTM(32, return_sequences=True,  input_shape=(3*31,1)),
               tf.keras.layers.LSTM(32, return_sequences=True),
               tf.keras.layers.Dense(units=1)
              [tf.keras.layers.LSTM(64, return_sequences=True,  input_shape=(3*31,1)),
               tf.keras.layers.LSTM(64, return_sequences=True),
               tf.keras.layers.Dense(units=1)
              [tf.keras.layers.LSTM(128, return_sequences=True,  input_shape=(3*31,1)),
               tf.keras.layers.LSTM(128, return_sequences=True),
               tf.keras.layers.Dense(units=1)
              [tf.keras.layers.LSTM(32, return_sequences=True,  input_shape=(3*31,1)),
               tf.keras.layers.LSTM(32, return_sequences=True),
               tf.keras.layers.LSTM(32, return_sequences=True),
               tf.keras.layers.Dense(units=1)
              [tf.keras.layers.LSTM(64, return_sequences=True,  input_shape=(3*31,1)),
               tf.keras.layers.LSTM(64, return_sequences=True),
               tf.keras.layers.LSTM(64, return_sequences=True),
               tf.keras.layers.Dense(units=1)
for archi in archi_list:
    run = neptune.init_run(
          project="YOU/YOUR_PROJECT",
          api_token="YOUR_API_TOKEN",
    run['params'] = f'{str(len(archi) - 1)} times {str(archi[0].units)}'
    run['Tags'] = 'lstm'
    lstm_model = tf.keras.models.Sequential(archi)
    lstm_model.compile(loss=tf.losses.MeanSquaredError(),
                      optimizer=tf.optimizers.Adam(),
                      metrics=[tf.metrics.MeanSquaredError()]
    history = lstm_model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))
    run['last_mse'] = history.history['val_mean_squared_error'][-1]
    run.stop()

The LSTM performed the same as the XGBoost model. Again, there can be multiple things to tune further if you’d want to work more on this. You could think about using longer or shorter training periods. You may also want to work on standardizing the data differently: this often plays a role in neural network performances.

Selecting the best model

As a conclusion of this case study, you could say that the best performance was obtained by the ARIMA model. This has been based on using comparative data for each: three months training period and a one-day forecast.

Time series models comparison — *Comparison of 3 models in Neptune | See in the app*

If you want to take this model further, there are a lot of things that you could improve. For example, you could try working with longer or shorter training periods. You could also try adding additional data, like seasonal data (day of the week, month, etc.) or additional predictor variables like market sentiment or others. In this case, you would need to switch to the SARIMAX model.

I hope that this article has shown you how to go about model selection in the case of time series data. You have now got an idea of the different models and model categories that could be interesting to work with. You have also seen the tools like windowing and the time series split that are specific to time series model evaluation.

For more advanced reading, I suggest the following sources:

Thanks for your vote! It's been noted. | What topics you would like to see for your next read? Thanks for your vote! It's been noted. | Let us know what should be improved.

Introduction to time series datasets and forecasting
Time series models specifics
Types of time series models
Going deeper into classical time series models
Going deeper into supervised machine learning models
Going deeper into advanced and specific time series models
Going deeper into deep learning-based time series models
Time series model selection
An example use case for time series modeling

neptune.ai demo [20min]

How deepsense.ai Tracked and Analyzed 120K+ Models Using Neptune

How ReSpo.Vision Uses Neptune to Easily Track Training Pipelines at Scale

Building a Machine Learning Platform

Learnings From Building the ML Platform at Mailchimp