Let’s start by reviewing what exactly time series are.
A time series is a special type of data set in which one or more variables are measured over time.
Most data sets that we work with are based on independent observations. Examples are data sets in which each row (data point) represents an individual observation. For example, on a website, you could track each visitor. Each visitor has a user id and he or she will be independent of the other visitors.
In time series, however, observations are measured over time. Each data point in your data set corresponds to a point in time. This means that there is a relation between different data points of your dataset. This has important implications for the types of machine learning algorithms that you can apply to the time series dataset.
In the next part of this article, you will discover the specifics of time series data in more detail.
The first distinguishing feature of time series is that the timestamp that identifies each data point has intrinsic meaning.
Univariate time series models are forecasting models that use only one variable (the target variable) and its temporal variation to forecast the future. Univariate models are specific to time series.
In other situations, you may have additional explanatory data about the future. For example, imagine that you want to factor the weather forecast into your product demand forecast, or that you have some other data that will influence your predictions. In this case, you can use multivariate time series models. Multivariate time series models are univariate time series models that are adapted to integrate external variables. You can also use supervised machine learning for this task.
Seasonality is a recurring movement that is present in your time series variable. For example, the temperature of a place will be higher in the summer months and lower in the winter months. You could compute average monthly temperatures and use this seasonality as a basis for forecasting future values.
A trend can be a long-term upward or downward pattern. In a temperature time series, a trend could be present due to global warming. For example, on top of the summer/winter seasonality, you may well see a slight increase in average temperatures over time.
Noise is the part of the variability in a time series that can neither be explained by seasonality nor by a trend. When building models, you end up combining different components into a mathematical formula. Two parts of such a formula can be seasonality and trend. A model that combines both will never represent the values of temperature perfectly: an error will always remain. This is represented by the noise factor.
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(co2_data)
result.plot()
Positive autocorrelation means that a high value now is likely to yield a high value in the future and vice versa. You can think about the stock market: if everybody is buying a stock, then the price goes up. When the price goes up, people think that this is a good stock to buy and they buy it too, thereby driving the price even higher. However, if the price goes down, then everybody is fearful of a crash, sells their stocks and the price becomes lower.
Negative autocorrelation is the opposite: a high value today implies a low value tomorrow and a low value today implies a high value tomorrow. A common example is rabbit populations in natural environments. If there are a lot of wild rabbits in the summer of one year, they will eat all of the natural resources available. During winter, there will be nothing left to eat, so many of them will die and the surviving rabbit population will be small. During this year with a small rabbit population, the natural resources will grow back and allow the rabbit population to grow in the following year.
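To make the two cases concrete, here is a minimal sketch that computes the lag-1 autocorrelation of two small synthetic series with NumPy. The helper name autocorr and both example series are illustrative assumptions and are not part of the CO2 example.

```python
import numpy as np

def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # covariance between the series and its lagged copy, scaled by the variance
    return np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2)

# slowly drifting series: high values tend to follow high values
persistent = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 2], dtype=float)
# alternating series: a high value is always followed by a low one
alternating = np.array([5, 1, 5, 1, 5, 1, 5, 1, 5, 1, 5, 1], dtype=float)

r_persistent = autocorr(persistent)    # positive
r_alternating = autocorr(alternating)  # negative
print(r_persistent, r_alternating)
```

The drifting series gives a clearly positive value and the alternating series a value close to -1, which matches the two definitions above.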
The PACF is an alternative to the ACF. Rather than giving the autocorrelations, it gives you the partial autocorrelation. This autocorrelation is called partial, because with each step back in the past, only additional autocorrelation is listed. This is different from the ACF, as the ACF contains duplicate correlations when variability can be explained by multiple points in time.
For example, if the value of today is the same as the value of yesterday, but also the same as the day before yesterday, the ACF would show two highly correlated steps. The PACF would only show yesterday and remove the day before yesterday.
You can compute a PACF plot using Python as follows:
You can see below that this PACF plot gives a much better representation of the autocorrelation in the CO2 data. There is a strong positive autocorrelation with lag 1: a high value now means that you are very likely to observe a high value in the next step. Since the autocorrelation shown here is partial, you do not see any duplicate effects with earlier lags, making the PACF plot neater and clearer.
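The difference between the ACF and the PACF can also be checked numerically. The following sketch is an illustration on simulated data, not on the CO2 series: it simulates a process in which each value depends only on the previous one and then compares the lag-2 autocorrelation with the lag-2 partial autocorrelation. The function names acf and pacf_lag2 are hypothetical helpers, and the partial autocorrelation is obtained here by a simple regression on the two previous values.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate a process in which each value depends only on the previous one
phi, n = 0.8, 5000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + noise[t]

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    c = x - x.mean()
    return np.sum(c[lag:] * c[:-lag]) / np.sum(c ** 2)

def pacf_lag2(x):
    """Partial autocorrelation at lag 2 via regression on the two previous values."""
    c = x - x.mean()
    # columns: value one step back, value two steps back
    design = np.column_stack([c[1:-1], c[:-2]])
    coefs, *_ = np.linalg.lstsq(design, c[2:], rcond=None)
    return coefs[1]  # coefficient of the value two steps back

acf_2 = acf(y, 2)
pacf_2 = pacf_lag2(y)
print(f"ACF lag 2: {acf_2:.3f}, PACF lag 2: {pacf_2:.3f}")
```

The ACF at lag 2 is still large because the lag-1 effect is carried over, whereas the PACF at lag 2 is close to zero: once yesterday is accounted for, the day before yesterday adds almost nothing.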
Stationarity
Another important concept in time series is stationarity. A stationary time series is one whose statistical properties, such as its mean and variance, do not change over time; in particular, it has no trend. Some time series models are not able to deal with trends (more on this later). You can detect non-stationarity using the Dickey-Fuller test and you can remove non-stationarity using differencing.
Dickey-Fuller test
The Dickey-Fuller test is a statistical hypothesis test that allows you to detect non-stationarity. You can use the following Python code to apply a Dickey-Fuller test to the CO2 data:
from statsmodels.tsa.stattools import adfuller
adf, pval, usedlag, nobs, crit_vals, icbest = adfuller(co2_data.co2.values)
print('ADF test statistic:', adf)
print('ADF p-values:', pval)
print('ADF number of lags used:', usedlag)
print('ADF number of observations:', nobs)
print('ADF critical values:', crit_vals)
print('ADF best information criterion:', icbest)
The null hypothesis of the ADF test is that a unit root is present in the time series. The alternative hypothesis is that the data is stationary.
The second value is the p-value. If this p-value is smaller than 0.05 you can reject the null hypothesis (reject non-stationarity) and accept the alternative hypothesis (stationarity). In this case, we cannot reject the null hypothesis and will have to assume that the data is non-stationary. As you have seen the data, you know that there is a trend, so this also confirms the result we obtained.
Differencing
You can remove the trend from your time series. The goal is to have only seasonal variation: this can be a way to use certain models that work with seasonality but not with trends.
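As an illustration of what differencing does, the sketch below builds a synthetic monthly series with a linear trend and a yearly seasonality (all values are assumptions chosen for the example) and differences it once with NumPy.

```python
import numpy as np

# synthetic monthly series: linear trend plus a yearly seasonal pattern
t = np.arange(49)
series = 10.0 + 0.5 * t + 2.0 * np.sin(2 * np.pi * t / 12)

# first-order differencing: subtract the previous value from each observation
differenced = np.diff(series)  # 48 values

# compare the average level of the first and second half of each series
level_shift_original = series[24:48].mean() - series[:24].mean()
level_shift_differenced = differenced[24:].mean() - differenced[:24].mean()
print(level_shift_original, level_shift_differenced)
```

The original series drifts upward between the two halves, while the differenced series keeps the same average level: the trend is gone and only the seasonal movement remains.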
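# assumed definition of differenced_co2: first-order differencing of the CO2 series
differenced_co2 = co2_data.co2.diff()
adf, pval, usedlag, nobs, crit_vals, icbest = adfuller(differenced_co2.dropna())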
print('ADF test statistic:', adf)
print('ADF p-values:', pval)
print('ADF number of lags used:', usedlag)
print('ADF number of observations:', nobs)
print('ADF critical values:', crit_vals)
print('ADF best information criterion:', icbest)
The last concept that is important to understand before going into modeling is the concept of one-step models versus multi-step models.
Some models work great for predicting the next step for a time series, but do not have the capacity to predict multiple steps at once. These models are one-step models. You can make multi-step models with them by windowing over your predictions, but there is a risk: when using predicted values to make predictions, your errors can quickly add up and become very large.
Multistep models are models that have the intrinsic capacity to predict multiple steps at once. They are generally the better choice for long-term forecasts, and sometimes for one-step forecasts as well. It is key that you decide on the number of steps that you want to predict before starting to build models. This purely depends on your use case.
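The windowing approach for one-step models can be sketched as follows. The one-step model used here is a deliberately trivial, hypothetical one (the next value is a fixed fraction phi of the last value); the point is only to show how predictions are fed back in as inputs.

```python
import numpy as np

def one_step_forecast(history, phi=0.9):
    """Hypothetical one-step model: next value = phi * last known value."""
    return phi * history[-1]

def recursive_forecast(history, n_steps, phi=0.9):
    """Forecast n_steps ahead by feeding each prediction back in as input."""
    window = list(history)
    forecasts = []
    for _ in range(n_steps):
        next_value = one_step_forecast(window, phi)
        forecasts.append(next_value)
        window.append(next_value)  # the prediction becomes the latest "observation"
    return np.array(forecasts)

history = [8.0, 9.0, 10.0]
forecasts = recursive_forecast(history, n_steps=3)
print(forecasts)
```

Every forecast after the first one is computed from a predicted value rather than an observed one, which is why an error made in the first step carries over to all later steps.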
Classical time series models are a family of models that have been traditionally used a lot in many domains of forecasting. They are strongly based on temporal variation inside a time series and they work well with univariate time series. Some advanced options exist to add external variables into the models as well. These models are generally only applicable to time series and are not useful for other types of machine learning.
Supervised models
Supervised models are a family of models that are used for many machine learning tasks. A machine learning model is supervised when it uses clearly defined input variables and one or more output (target) variables.
Supervised models can be used for time series, as long as you have a way to extract seasonality and put it into a variable. Examples include creating a variable for a year, a month, or a day of the week, etc. These are then used as the X variables in your supervised model and the ‘y’ is the actual value of the time series. You can also include lagged versions of y (the past value of y) into the X data, in order to add autocorrelation effects.
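A minimal sketch of this kind of feature engineering with pandas is shown below. The synthetic daily series and the column names (month, day_of_week, y_lag_1, y_lag_2) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# synthetic daily series standing in for a real target variable
index = pd.date_range("2021-01-01", periods=10, freq="D")
data = pd.DataFrame({"y": np.arange(10, dtype=float)}, index=index)

# calendar features derived from the timestamp
data["month"] = data.index.month
data["day_of_week"] = data.index.dayofweek  # Monday = 0, Sunday = 6

# lagged versions of the target add autocorrelation information
data["y_lag_1"] = data["y"].shift(1)
data["y_lag_2"] = data["y"].shift(2)

# the first rows have no complete history, so they are dropped
data = data.dropna()

X = data[["month", "day_of_week", "y_lag_1", "y_lag_2"]].to_numpy()
y = data["y"].to_numpy()
print(X.shape, y.shape)
```

The resulting X and y can be passed to any supervised model. Note that the lagged columns are only available at prediction time if the past values are already known.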
Deep learning and recent models
The increasing popularity of deep learning over the past years has opened new doors for forecasting as well, as specific deep learning architectures have been invented that work very well on sequence data.
Cloud computing and the popularization of AI as a service have also provided a number of new inventions in the domain. Facebook, Amazon, and other big tech companies are open-sourcing their forecasting products, or making them available on their cloud platforms. The availability of those new “black-box” models gives forecasting practitioners new tools to try and test, and these tools can sometimes even beat previous models.
Going deeper into classical time series models
In this part, you will discover classical time series models in depth.
ARIMA family
The ARIMA family of models is a set of smaller models that can be combined. Each part of the ARIMA model can be used as a stand-alone component, or the different building blocks can be combined. When all of the individual components are put together, you obtain the SARIMAX model. You will now see each of the building blocks separately.
1. Autoregression (AR)
Autoregression is the first building block of the SARIMAX family. You can see the AR model as a regression model that explains a variable’s future value using its past (lagged) values.
The order of an AR model is denoted as p, and it represents the number of lagged values to include in the model. The simplest model is the AR(1) model: it uses only the value of the previous timestep to predict the current value. The maximum number of values that you can use is the total length of the time series (i.e. you use all previous time steps).
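To illustrate the idea of regressing a series on its own lagged values, here is a minimal sketch that estimates an AR(p) model by ordinary least squares with NumPy. The helper fit_ar and the simulated data are assumptions for the example; dedicated libraries use more refined estimation methods.

```python
import numpy as np

def fit_ar(x, p):
    """Estimate AR(p) coefficients (intercept first) by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    # each row holds the p previous values, most recent lag first
    lags = np.column_stack([x[p - k : len(x) - k] for k in range(1, p + 1)])
    design = np.column_stack([np.ones(len(lags)), lags])
    coefs, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coefs

# simulate an AR(1) series with a known coefficient
rng = np.random.default_rng(42)
n, true_phi = 3000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_phi * x[t - 1] + rng.normal()

intercept, phi_hat = fit_ar(x, p=1)
next_value = intercept + phi_hat * x[-1]  # one-step-ahead forecast
print(phi_hat, next_value)
```

With p=1 the design matrix contains only the previous value, so the estimated coefficient should land close to the value used in the simulation.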
2. Moving average (MA)
The Moving Average is the second building block of the larger SARIMAX model. It works in a comparable way to the AR model: it uses past values to predict the current value of the variable.
The past values that the Moving Average model uses are not the values of the variable. Rather, the Moving Average uses the prediction error in previous time steps to predict the future.
This sounds counter-intuitive, but there is a logic behind it. When a model has some unknown but regular external perturbations, your model may have a seasonality or other pattern in the error of the model. The MA model is a method to capture this pattern without even having to identify where it comes from.
The MA model can use multiple steps back in time as well. This is represented in the order parameter called q. For example, an MA(1) model has an order of 1 and uses only one time step back.
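The structure of an MA(1) process can be illustrated by simulation. In the sketch below (the parameter values are assumptions for the example), each value is a mean plus the current error plus a fraction theta of the previous error. Fitting an MA model to real data is harder, because the errors are not observed and have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, mu = 20000, 0.6, 5.0

errors = rng.normal(size=n + 1)
# MA(1): mean + current error + theta * previous error
y = mu + errors[1:] + theta * errors[:-1]

def acf(x, lag):
    """Sample autocorrelation at the given lag."""
    c = x - x.mean()
    return np.sum(c[lag:] * c[:-lag]) / np.sum(c ** 2)

r1, r2 = acf(y, 1), acf(y, 2)
expected_r1 = theta / (1 + theta ** 2)  # theoretical lag-1 autocorrelation of an MA(1)
print(f"lag 1: {r1:.3f} (theory {expected_r1:.3f}), lag 2: {r2:.3f}")
```

Because only one past error enters the formula, the autocorrelation is clearly non-zero at lag 1 and close to zero at lag 2. This is the kind of cut-off that is often used as a hint for choosing q.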
3. Autoregressive moving average (ARMA)
The Autoregressive Moving Average, or ARMA, model combines the two previous building blocks into one model. ARMA can therefore use both the value and the prediction errors from the past.
ARMA can have different values for the lag of the AR and MA processes. For example, an ARMA(1, 0) model has an AR order of 1 (p=1) and an MA order of 0 (q=0). This is actually just an AR(1) model. The MA(1) model is the same as the ARMA(0, 1) model. Other combinations are possible: ARMA(3, 1), for example, has an AR order of 3 lagged values and uses 1 lagged value for the MA.
4. Autoregressive integrated moving average (ARIMA)
The ARMA model needs a stationary time series. As you have seen earlier on, stationarity means that a time series remains stable. You can use the Augmented Dickey-Fuller test to test whether your time series is stable and apply differencing if it is not the case.
The ARIMA model adds automatic differencing to the ARMA model. It has an additional parameter that you can set to the number of times that the time series needs to be differenced. For example, an ARMA(1,1) that needs to be differenced one time would result in the following notation: ARIMA(1, 1, 1). The first 1 is for the AR order, the second one is for the differencing, and the third 1 is for the MA order. ARIMA(1, 0, 1) would be the same as ARMA(1, 1).
SARIMA adds seasonal effects into the ARIMA model. If seasonality is present in your time series, it is very important to use it in your forecast.
SARIMA notation is quite a bit more complex than ARIMA, as each of the components receives a seasonal parameter on top of the regular parameter.
For example, let’s consider the ARIMA(p, d, q) as seen before. In SARIMA notation, this becomes SARIMA(p, d, q)(P, D, Q)m.
m is the number of observations per seasonal cycle: for a yearly seasonality, monthly data has m=12, quarterly data has m=4, etc. The lowercase letters (p, d, q) represent the non-seasonal orders. The capital letters (P, D, Q) represent the seasonal orders.
import pmdarima as pm
from pmdarima.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
You could see Vector Autoregression, or VAR, as a multivariate alternative to ARIMA. Rather than predicting one dependent variable, you predict multiple time series at the same time. This can be especially useful when there are strong relationships between your different time series. The Vector Autoregression, like the standard AR model, contains only an autoregressive component.
The VARMA model is the multivariate equivalent of the ARMA model. VARMA is to ARMA what VAR is to AR: it adds a Moving Average component to the model.
If you want to go a step further, you can use VARMAX. The X represents external (exogenous) variables. Exogenous variables are variables that can help your model to make better forecasts, but that do not need to be forecasted themselves. The statsmodels VARMAX implementation is a good way to get started with implementing VARMAX models.
More advanced versions like seasonal VARMAX (SVARMAX) exist, but they become so complex and specific that it will be hard to find implementations that do this easily and efficiently. Once models become so complex, it gets hard to understand what is happening inside the model, and it is often better to start looking into other, more familiar models.
Smoothing
Exponential Smoothing is a basic statistical technique that can be used to smooth out time series. Time series patterns often have a lot of long-term variability, but also short-term (noisy) variability. Smoothing allows you to make your curve smoother so that long-term variability becomes more evident and short-term (noisy) patterns are removed.
This smooth version of the time series can then be used for analysis.
1. Simple moving average
The simple moving average is the simplest smoothing technique. It consists of replacing the current value by the average of the current and a few past values. The exact number of past values to take into account is a parameter. The more values you use, the smoother the curve will become. At the same time, you will lose more and more variation.
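A simple moving average can be sketched in a few lines of NumPy. The helper name simple_moving_average and the example values are assumptions; the window parameter is the number of values (current plus past) that are averaged.

```python
import numpy as np

def simple_moving_average(x, window):
    """Trailing moving average; returns len(x) - window + 1 values."""
    kernel = np.ones(window) / window
    # "valid" keeps only positions where a full window of values is available
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="valid")

values = [2.0, 4.0, 6.0, 8.0, 10.0, 3.0]
smoothed = simple_moving_average(values, window=3)
print(smoothed)
```

Each output value is the average of three consecutive inputs, so the sudden drop at the end of the series is dampened in the smoothed version.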
2. Simple exponential smoothing (SES)
Exponential smoothing is an adaptation of this simple moving average. Rather than taking the average, it takes a weighted average of past values. A value that is further back will count less and a more recent value will count more.
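The weighting can be written as a simple recursion: the smoothed value is alpha times the current observation plus (1 - alpha) times the previous smoothed value. The sketch below implements this by hand. Starting the recursion at the first observation is an assumption made for the example; libraries may initialise differently.

```python
import numpy as np

def simple_exponential_smoothing(x, alpha):
    """Exponentially weighted smoothing with smoothing level alpha in (0, 1]."""
    x = np.asarray(x, dtype=float)
    smoothed = np.empty_like(x)
    smoothed[0] = x[0]  # assumed initialisation: start at the first observation
    for t in range(1, len(x)):
        # recent observation gets weight alpha, the smoothed history gets 1 - alpha
        smoothed[t] = alpha * x[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

values = [10.0, 20.0, 10.0, 20.0]
smoothed = simple_exponential_smoothing(values, alpha=0.5)
print(smoothed)
```

Because the previous smoothed value already contains all older observations, the weight of an observation shrinks by a factor (1 - alpha) with every step back in time.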
3. Double exponential smoothing
When trends are present in your time series data, you should avoid using Simple Exponential Smoothing: it does not work well in this case, as the model cannot make the distinction between variation and trend correctly. However, you can use double exponential smoothing.
In DES, there is a recursive application of an exponential filter. This allows you to remove trend problems. This works using the following formulas for time zero:
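One widely used way to write double exponential smoothing is Holt’s linear trend method, sketched here as an assumption about the intended formulation. The level s and the trend b are initialised at time zero as s_0 = x_0 and b_0 = x_1 - x_0, and for t > 0 they are updated as s_t = alpha * x_t + (1 - alpha) * (s_{t-1} + b_{t-1}) and b_t = beta * (s_t - s_{t-1}) + (1 - beta) * b_{t-1}.

```python
import numpy as np

def double_exponential_smoothing(x, alpha, beta):
    """Holt's linear trend method: returns the smoothed level and trend."""
    x = np.asarray(x, dtype=float)
    level = np.empty_like(x)
    trend = np.empty_like(x)
    # initialisation at time zero (one common choice)
    level[0] = x[0]
    trend[0] = x[1] - x[0]
    for t in range(1, len(x)):
        level[t] = alpha * x[t] + (1 - alpha) * (level[t - 1] + trend[t - 1])
        trend[t] = beta * (level[t] - level[t - 1]) + (1 - beta) * trend[t - 1]
    return level, trend

# a perfectly linear series: the method should track it exactly
x = 3.0 + 2.0 * np.arange(10)
level, trend = double_exponential_smoothing(x, alpha=0.3, beta=0.2)
forecast = level[-1] + trend[-1]  # one-step-ahead forecast
print(forecast)
```

On a purely linear series the level follows the data and the trend stays equal to the slope, which is exactly the behaviour that Simple Exponential Smoothing cannot reproduce.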
If you want to go even further, you can use Triple Exponential Smoothing, which is also called Holt-Winters exponential smoothing. It adds a seasonal component on top of the level and the trend, so you should use it when your time series data contain both a trend and a seasonality. The three signals that it smooths are the level, the trend, and the seasonal pattern.
An example of exponential smoothing in Python
In the following example, you see how to apply a Simple Exponential Smoothing to the CO2 data. The smoothing level indicates how smooth your curve should become. In the example it is set very low, indicating a very smooth curve. Feel free to play around with this parameter and see what less smooth versions look like.
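import matplotlib.pyplot as plt
from statsmodels.tsa.api import SimpleExpSmoothing
# fit a Simple Exponential Smoothing model with a fixed, very low smoothing level
es = SimpleExpSmoothing(co2_data.co2.values)
es_fit = es.fit(smoothing_level=0.01)
# plot the original series together with the smoothed (fitted) values
plt.plot(co2_data.co2.values)
plt.plot(es_fit.fittedvalues)
plt.show()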
Supervised Machine Learning models work very differently than classical time series models. The main difference is that they consider that variables are either dependent variables or independent variables. Dependent variables, or target variables, are the variables that you want to predict. Independent variables are the variables that help you to predict.
Supervised Machine Learning models are not made especially for time series data. After all, there are often no independent variables in time series data. Yet it is fairly simple to adapt them to time series by converting the seasonality (based on your time stamps for example) into independent variables.
Linear regression
Linear Regression is arguably the simplest supervised machine learning model. Linear Regression estimates linear relationships: each independent variable has a coefficient that indicates how this variable affects the target variable.
Simple Linear Regression is a Linear Regression in which there is only one independent variable. An example of a Simple Linear Regression model in non-time series data could be the following: hot chocolate sales that depend on the outside temperature (degrees Celsius).
The colder the temperature, the higher the hot chocolate sales. Visually, this could look like the graph below.
In Multiple Linear Regression, rather than using only one independent variable, you use multiple independent variables. You could imagine the 2d graph converting into a 3d graph, where the third axis represents the variable Price. In this case, you would build a Linear Model that explains the sales using temperature and price. You can add as many variables as you need.
Now, of course, this is not a time series data set: there is no time variable present. So, how could you use this technique for time series? The answer is fairly straightforward. Rather than only using temperature and price in this data set, you could add the variables year, month, day of the week, etc.
If you build a supervised model on time series, you have the disadvantage that you need to do a little bit of feature engineering to extract seasonality into variables in one way or another. An advantage is, however, that adding exogenous variables becomes much easier.
Let’s now see how to apply a Linear Regression on the CO2 dataset. You can prepare the CO2 data as follows:
# extract the seasonality data
months = [x.month for x in co2_data.index]
years = [x.year for x in co2_data.index]
day = [x.day for x in co2_data.index]
# convert into one matrix
X = np.array([day, months, years]).T
The XGBoost model is a third model that you should absolutely know. There are many other models out there, but Random Forests and XGBoost are considered absolute classics among the supervised machine learning family.
XGBoost is a machine learning model that is based on the gradient boosting framework. This model is an ensemble model of weak learners just like the Random Forest but with an interesting advantage. In standard gradient boosting, the individual trees are fit in sequence and each consecutive decision tree is fit in such a way to minimize the error of previous trees. XGBoost obtains the same result but is still able to do parallel learning.
You can use the XGBoost package as follows:
LSTMs are Recurrent Neural Networks. Neural Networks are very complex machine learning models that pass input data through a network. Each node in the network learns a very simple operation. The neural network consists of many such nodes. The fact that the model can use a large number of simple nodes makes the overall prediction very complex. Neural Networks can therefore fit very complex and nonlinear data sets.
RNNs are a special type of Neural network, in which the network can learn from sequence data. This can be useful for multiple use cases, including understanding time series (which are clearly sequences of values over time), but also text (sentences are sequences of words).
LSTMs are a specific type of RNNs. They have proven useful for time series forecasting on multiple occasions. They require some data and are more complicated to learn than supervised models. Once you master them, they can prove to be very powerful depending on your data and your specific use case.
To go into LSTMs, the Keras library in Python is a great starting point.
Prophet
Prophet is a time series library that was open-sourced by Facebook. It is a black-box model, as it will generate forecasts without much user specification. This can be an advantage, as you can almost automatically generate forecasting models without much knowledge or effort.
On the other hand, there is a risk here as well: if you do not pay close enough attention, you may very well be producing a model that seems good to the automated model building tool, but that in reality does not work well.
Extensive model validation and evaluation are recommended when using such black-box models, yet if you find that it works well on your specific use case, you may find a lot of added value here.
You can find a lot of resources on Facebook’s GitHub.
The first thing to define when selecting models is the metric that you want to look at. In the previous part, you have seen multiple fits with different qualities (think about the linear regression vs the random forest).
To go further with model selection, you will need to define a metric to evaluate your models. A frequently used metric in forecasting is the Mean Squared Error. This metric measures the error at each point in time and takes the square of it. The average of those squared errors is called the Mean Squared Error. An often-used alternative is the Root Mean Squared Error: the square root of the Mean Squared Error.
Another frequently used metric is the Mean Absolute Error: rather than taking the square of each error, it takes the absolute value here. The Mean Absolute Percent Error is a variation on this where the Absolute Error at each point in time is expressed as a percentage of the actual value. This yields a metric that is a percentage, which is very easy to interpret.
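The four metrics can be computed directly with NumPy. The actual and predicted values below are made-up numbers for illustration only.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])
errors = actual - predicted

mse = np.mean(errors ** 2)                     # Mean Squared Error
rmse = np.sqrt(mse)                            # Root Mean Squared Error
mae = np.mean(np.abs(errors))                  # Mean Absolute Error
mape = np.mean(np.abs(errors / actual)) * 100  # Mean Absolute Percent Error, in %

print(mse, rmse, mae, mape)
```

Note that the MAPE divides by the actual values, so it is undefined when an actual value is zero and becomes unstable when actual values are close to zero.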
Time series train test split
The second thing to think about when evaluating Machine Learning is to consider that a model that works well on the training data, does not necessarily work well on new, out-of-sample data. Such a model is called an overfitting model.
There are two common approaches that can help you to estimate whether a model is generalizing correctly: the train-test-split and cross-validation.
The train test split means that you remove a part of your data before fitting the model. As an example, you could remove the last 3 years from the CO2 database and use the remaining 40 years for fitting the model. You’d then forecast the three years of test data and measure the evaluation metric of your choice between your predictions and the actual values of the last three years.
To benchmark and choose models, you could build multiple models on the 40 years of data and do the test set evaluation on all of them. Based on this test-performance, you could select the model that is most performant.
Of course, if you are building a short-term forecasting model, using three years of data would not make sense: you’d choose an evaluation period that is comparable to the period that you’d forecast in reality.
Time series cross-validation
A risk with the train test split is that you measure only at one point in time. In non-time series data, the test set is generally generated by a random selection of data points. In time series, however, this does not work in many cases: when sequences are used, you can not remove one point in the sequence and still expect the model to work.
Time series train test splits are therefore best applied by selecting the final period as the test set. The risk here is that this can go wrong if your last period is not very reliable. In recent covid periods, you can imagine that many business forecasts have gone completely off: the underlying trends have changed.
Cross-validation is a method that does a repeated train test evaluation. Rather than making one train test split, it makes multiple (the exact number is a user-defined parameter). For example, if you use 3-fold cross-validation, you will split your data set into three equal parts. You will then fit three times the same model on two-thirds of the data set and use the other third for evaluation. In the end, you have three evaluation scores (each on a different test set) and you can use the average as the final metric.
In time series, however, you cannot apply a random selection to obtain multiple test sets. If you’d do this, you’d end up with sequences with many missing data points.
A solution can be found in time series cross-validation. What it does is create multiple train test sets, but each of the test sets is the end of the period. For example, the first train test split could be built on the first 10 years of data (5 train, 5 test). The second model would be done on the first 15 years of data (10 train, 5 test), etc. This can work well but has the disadvantage that each of the models does not use the same number of years in the training data.
An alternative is to do a rolling split (always 5 years train, 5 years test), but here the disadvantage is that you can never use more than 5 years for training data.
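Both variants can be sketched with a small index generator. The function time_series_splits is a hypothetical helper written for this illustration; scikit-learn’s TimeSeriesSplit, with its max_train_size and test_size parameters, offers this kind of splitting out of the box.

```python
def time_series_splits(n_samples, n_splits, test_size, max_train_size=None):
    """Yield (train_indices, test_indices); each test block directly follows its training data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for test_start in range(first_test_start, n_samples, test_size):
        train_start = 0
        if max_train_size is not None:
            # rolling window: keep only the most recent max_train_size points
            train_start = max(0, test_start - max_train_size)
        train = list(range(train_start, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

# 20 "years" of data with 5-year test blocks
expanding = list(time_series_splits(20, n_splits=3, test_size=5))
rolling = list(time_series_splits(20, n_splits=3, test_size=5, max_train_size=5))

print([len(train) for train, _ in expanding])  # training size grows with each split
print([len(train) for train, _ in rolling])    # training size stays fixed
```

In the expanding version, later splits are trained on more data than earlier ones. In the rolling version, every split is trained on the same amount of data, at the cost of discarding older observations.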
Time series model experiments
In conclusion, when doing time series model selection, the following questions are key to define before starting to experiment:
In this part, you will work on a forecast for the next day of the S&P 500. You could imagine running your model every night and then the next day you would know if the stock market is going up or down. If you have a very accurate model for doing this, you could easily make a lot of money (don’t treat it as financial advice ;)).
Stock market forecasting data and definition of the evaluation method
Obtaining stock market data
You can use the Yahoo Finance package in Python to automatically download stock data.
import yfinance as yf
# download the daily S&P 500 data
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
# taking the close price (end of day)
sp500_data = sp500_data[['Close']]
sp500_data.plot(figsize=(12, 12))
As you want to predict for one day only, you can understand that the test set will be very small (one day only). Therefore, it would be best to create a lot of test splits to make sure that there is an acceptable amount of model evaluation.
This can be obtained by the Time Series Split that was explained earlier. For example, you can set up a Time Series Split that will make 100 train test sets, in which each train test set uses three months of training data and one day of test data. This will work fine for this example to understand the principle of model selection in time series.
Building a classical time series model
Let’s start with a classical time series model on this problem: the ARIMA model. In this code, you will set up the automated creation of ARIMA models with orders ranging from (0, 0, 0) to (4, 4, 4). Each of the models will be built and evaluated using a Time Series Split with 100 splits, in which the train size is a maximum of three months and the test size is always one day.
Because of the sheer number of runs involved, the results are logged to neptune.ai for ease of comparison. For following along, you can set up a free account and get more info from this tutorial.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
import neptune
from neptune.utils import stringify_unsupported

# y is assumed to be the 1-D array of daily relative changes (difs.Close.values)
param_list = [(p, d, q) for p in range(5) for d in range(5) for q in range(5)]

for order in param_list:
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    )
    run['order'] = stringify_unsupported(order)

    # for each param combi do a ts split
    # max 3 months training data
    # 1 day test size
    mses = []
    tscv = TimeSeriesSplit(n_splits=100,
                           max_train_size=3*31,
                           test_size=1)
    for train_index, test_index in tscv.split(y):
        train = y[train_index]
        test = y[test_index]
        try:
            # for each ts split do a model
            mod = sm.tsa.ARIMA(train, order=order)
            res = mod.fit()
            # one-step forecast, kept as an array so that it matches the shape of test
            pred = res.forecast(1)
            mse = mean_squared_error(test, pred)
            mses.append(mse)
        except Exception:
            # ignore models that error
            pass

    try:
        average_mse = np.mean(mses)
        std_mse = np.std(mses)
        run['average_mse'] = average_mse
        run['std_mse'] = std_mse
    except Exception:
        run['average_mse'] = None
        run['std_mse'] = None

    run.stop()
The model with the lowest average MSE is the model with order (0, 1, 3). However, as you can see, the standard deviation of this model is suspiciously 0. The next two models in line are ARIMA(1, 0, 3) and ARIMA(1, 0, 2). They are very similar and this would indicate that the result is reliable. The best guess here would be to take the ARIMA(1, 0, 3) as the best model, which has an average MSE of 0.00000131908 and a standard deviation of 0.00000197007.
In some use cases, you may have a lot of data about the future. For example, if you want to predict the number of customers of a restaurant, you could use external data on the number of reservations that have been made for future dates as independent variables.
For the current stock market use case, you do not have those data: you only have the stock price over a period of time. However, supervised models cannot be built using only a target variable. You’ll need to find a way to extract seasonality from the data and use feature engineering to create independent variables. As the stock market is known to have a lot of autocorrelation effects, let’s try a model that uses the values of the past 30 days as predictor variables to predict the 31st day.
You can create a data set that has all of the possible combinations of 30 training days and 1 test day (always consecutive) of the S&P500 and you would be able to create an enormous training database this way:
import numpy as np
import yfinance as yf

sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
sp500_data = sp500_data[['Close']]

# relative change between consecutive closing prices
difs = (sp500_data.shift() - sp500_data) / sp500_data
difs = difs.dropna()
y = difs.Close.values

# window through the data: 30 consecutive days as input, the next day as target
X_data = []
y_data = []
for i in range(len(y) - 31):
    X_data.append(y[i:i+30])
    y_data.append(y[i+30])
X_windows = np.vstack(X_data)
import numpy as np
import xgboost as xgb
import neptune
from neptune.utils import stringify_unsupported
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# specify the grid for the grid search of hyperparameter tuning
parameters = {
    'max_depth': list(range(2, 20, 4)),
    'gamma': list(range(0, 10, 2)),
    'min_child_weight': list(range(0, 10, 2)),
    'eta': [0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5],
}
# all combinations of (max_depth, gamma, min_child_weight, eta)
param_list = [(depth, gamma, weight, eta)
              for depth in parameters['max_depth']
              for gamma in parameters['gamma']
              for weight in parameters['min_child_weight']
              for eta in parameters['eta']]

y_windows = np.array(y_data)

for params in param_list:
    mses = []
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    )
    run['params'] = stringify_unsupported(params)

    my_kfold = KFold(n_splits=10, shuffle=True, random_state=0)
    for train_index, test_index in my_kfold.split(X_windows):
        X_train, X_test = X_windows[train_index], X_windows[test_index]
        y_train, y_test = y_windows[train_index], y_windows[test_index]

        xgb_model = xgb.XGBRegressor(max_depth=params[0], gamma=params[1],
                                     min_child_weight=params[2], eta=params[3])
        xgb_model.fit(X_train, y_train)
        preds = xgb_model.predict(X_test)
        mses.append(mean_squared_error(y_test, preds))

    average_mse = np.mean(mses)
    std_mse = np.std(mses)
    run['average_mse'] = average_mse
    run['std_mse'] = std_mse
    run.stop()
For more reading on XGBoost tuning, check the official XGBoost documentation on the topic.
The best (lowest) MSE obtained by this XGBoost model is 0.000129982, and multiple hyperparameter combinations obtain this score. As you can see, the XGBoost model performs much worse than the classical time series model, at least in the current configuration: its MSE is roughly 100 times higher than that of ARIMA(1, 0, 3). Another way of organizing the data may be necessary to get better results from XGBoost.
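One way of organizing the data differently, offered here as an assumption rather than something tested in this case study, is to replace the shuffled `KFold` with scikit-learn's `TimeSeriesSplit`, so that every fold trains only on windows that come before its test windows. The toy arrays below are placeholders for `X_windows` and `y_data`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# toy stand-ins for X_windows / y_data: 20 windows of 3 lags each
X_toy = np.arange(60, dtype=float).reshape(20, 3)
y_toy = np.arange(20, dtype=float)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X_toy))
for train_index, test_index in splits:
    # the last training row always precedes the first test row
    print(train_index.max(), test_index.min(), test_index.max())
```

Whether this improves the XGBoost score on the S&P500 data is an open question; the sketch only shows how the folds would be built.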
Building a deep learning-based time series model
As a third model for the comparison, let’s take an LSTM and see whether it can beat the ARIMA model. You could use cross-validation for this comparison as well, but that can take fairly long to run. For that reason, a train/test split (with a validation set held out from the training data) is used here instead.
You can build the LSTM using the following code:
import numpy as np
import yfinance as yf
from sklearn.model_selection import train_test_split

# download the daily S&P500 data and keep only the closing price
sp500_data = yf.download('^GSPC', start="1980-01-01", end="2021-11-21")
sp500_data = sp500_data[['Close']]

# relative difference between consecutive closing prices
difs = (sp500_data.shift() - sp500_data) / sp500_data
difs = difs.dropna()
y = difs.Close.values

# create windows: 3*31 consecutive values as input, the next one as target
X_data = []
y_data = []
for i in range(len(y) - 3*31):
    X_data.append(y[i:i+3*31])
    y_data.append(y[i+3*31])
X_windows = np.vstack(X_data)

# create train test split, then hold out a validation set from the training data
X_train, X_test, y_train, y_test = train_test_split(X_windows, np.array(y_data), test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
# build LSTM using tensorflow keras
import numpy as np
import tensorflow as tf
import neptune

# candidate architectures: two or three stacked LSTM layers followed by a Dense output.
# Note: every LSTM layer keeps return_sequences=True, so the Dense layer
# outputs one value per time step of the input window.
archi_list = [
    [tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(3*31, 1)),
     tf.keras.layers.LSTM(32, return_sequences=True),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(3*31, 1)),
     tf.keras.layers.LSTM(64, return_sequences=True),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(128, return_sequences=True, input_shape=(3*31, 1)),
     tf.keras.layers.LSTM(128, return_sequences=True),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(3*31, 1)),
     tf.keras.layers.LSTM(32, return_sequences=True),
     tf.keras.layers.LSTM(32, return_sequences=True),
     tf.keras.layers.Dense(units=1)],
    [tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(3*31, 1)),
     tf.keras.layers.LSTM(64, return_sequences=True),
     tf.keras.layers.LSTM(64, return_sequences=True),
     tf.keras.layers.Dense(units=1)],
]

for archi in archi_list:
    # one Neptune run per architecture
    run = neptune.init_run(
        project="YOU/YOUR_PROJECT",
        api_token="YOUR_API_TOKEN",
    )
    # log the number of LSTM layers and their size, e.g. "2 times 32"
    run['params'] = f'{len(archi) - 1} times {archi[0].units}'
    run['Tags'] = 'lstm'

    lstm_model = tf.keras.models.Sequential(archi)
    lstm_model.compile(loss=tf.losses.MeanSquaredError(),
                       optimizer=tf.optimizers.Adam(),
                       metrics=[tf.metrics.MeanSquaredError()])
    history = lstm_model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))

    # log the validation MSE of the last epoch
    run['last_mse'] = history.history['val_mean_squared_error'][-1]
    run.stop()
The LSTM performed about the same as the XGBoost model. Again, there are multiple things that you could tune further if you want to work more on this. You could think about using longer or shorter training periods. You may also want to standardize the data differently: this often plays a role in neural network performance.
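As one example of what standardizing the data could look like, the sketch below scales the windows using the mean and standard deviation of the training set only, so that no information from the validation data leaks into the scaling. This is an assumption about a possible next step, not something that was run for this article; `standardize_with_train_stats` and the random toy arrays are hypothetical.

```python
import numpy as np

def standardize_with_train_stats(X_train, X_other):
    """Scale both arrays with the mean and std of the training windows."""
    mean = X_train.mean()
    std = X_train.std()
    return (X_train - mean) / std, (X_other - mean) / std

rng = np.random.default_rng(0)
# placeholder windows with the same width as the LSTM input (3*31 values)
X_train_toy = rng.normal(0.0, 0.01, size=(100, 3 * 31))
X_val_toy = rng.normal(0.0, 0.01, size=(20, 3 * 31))

X_train_s, X_val_s = standardize_with_train_stats(X_train_toy, X_val_toy)
print(X_train_s.shape, X_val_s.shape)
```

A single global mean and standard deviation is used because every column of a window holds the same variable (the daily relative change), just at a different lag.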
Selecting the best model
To conclude this case study, the best performance was obtained by the ARIMA model. This conclusion is based on using comparable data for each model: a three-month training period and a one-day forecast.
If you want to take this model further, there are a lot of things that you could improve. For example, you could try working with longer or shorter training periods. You could also try adding additional data, like seasonal data (day of the week, month, etc.) or additional predictor variables like market sentiment or others. In this case, you would need to switch to the SARIMAX model.
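To give an idea of what such seasonal data could look like, here is a minimal sketch that builds day-of-week and month indicator columns from a date index with pandas. The helper `calendar_features` and the toy dates are hypothetical. A frame like this is the kind of input that the `exog` argument of the statsmodels `SARIMAX` class accepts, but that step is not shown or tested here.

```python
import pandas as pd

def calendar_features(index):
    """Day-of-week and month indicator columns for a DatetimeIndex."""
    feats = pd.DataFrame(
        {
            "dayofweek": index.dayofweek.to_numpy(),
            "month": index.month.to_numpy(),
        },
        index=index,
    )
    # drop_first avoids perfectly collinear indicator columns
    return pd.get_dummies(
        feats, columns=["dayofweek", "month"], drop_first=True, dtype=float
    )

# placeholder business days spanning the end of January and start of February
toy_index = pd.bdate_range("2021-01-25", periods=10)
exog = calendar_features(toy_index)
print(exog.shape)  # (10, 5)
print(list(exog.columns))
```

The first level of each variable (Monday and January in this toy example) is dropped and acts as the reference category.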
I hope that this article has shown you how to go about model selection in the case of time series data. You now have an idea of the different models and model categories that could be interesting to work with. You have also seen tools like windowing and the time series split, which are specific to time series model evaluation.
For more advanced reading, I suggest the following sources: