- Minimalist Example of Linear Regression
- Regression Example
- Classification Example
- Other Validation Functionalities
- Conclusion
One of the key aspects of supervised machine learning is model evaluation and validation. When you evaluate the predictive performance of your model, it’s essential that the process be unbiased. Using train_test_split() from the data science library scikit-learn, you can split your dataset into subsets that minimize the potential for bias in your evaluation and validation process.
In this tutorial, you’ll learn:

- Why you need to split your dataset in supervised machine learning
- Which subsets of the dataset you need for an unbiased evaluation of your model
- How to use train_test_split() to split your data
- How to combine train_test_split() with prediction methods

In addition, you’ll get information on related tools from sklearn.model_selection.
The Importance of Data Splitting
Supervised machine learning is about creating models that precisely map the given inputs (independent variables, or predictors) to the given outputs (dependent variables, or responses).
How you measure the precision of your model depends on the type of problem you’re trying to solve. In regression analysis, you typically use the coefficient of determination, root-mean-square error, mean absolute error, or similar quantities. For classification problems, you often apply accuracy, precision, recall, F1 score, and related indicators.
The acceptable numeric values that measure precision vary from field to field. You can find detailed explanations from Statistics By Jim, Quora, and many other resources.
What’s most important to understand is that you usually need unbiased evaluation to properly use these measures, assess the predictive performance of your model, and validate the model.

This means that you can’t evaluate the predictive performance of a model with the same data you used for training. You need to evaluate the model with fresh data that the model hasn’t seen before. You can accomplish that by splitting your dataset before you use it.
Training, Validation, and Test Sets
Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, you randomly split your dataset into three subsets:

- The training set is applied to train, or fit, your model. For example, you use the training set to find the optimal weights, or coefficients, for linear regression, logistic regression, or neural networks.
- The validation set is used for unbiased model evaluation during hyperparameter tuning. For example, when you want to find the optimal number of neurons in a neural network or the best kernel for a support vector machine, you experiment with different values. For each considered setting of hyperparameters, you fit the model with the training set and assess its performance with the validation set.
- The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.

In less complex cases, when you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets.
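As a quick sketch (not from the original text), you can obtain all three subsets with two successive calls to train_test_split(); the split sizes here are arbitrary illustration values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 20 samples with one feature each
x = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# First carve out the test set, then split the rest into training and validation
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=0)

print(len(x_train), len(x_val), len(x_test))  # 12 4 4
```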
Underfitting and Overfitting
Splitting a dataset can also be important for detecting whether your model suffers from one of two very common problems, called underfitting and overfitting:

- Underfitting is usually the consequence of a model being unable to encapsulate the relations among data. For example, this can happen when trying to represent nonlinear relations with a linear model. Underfitted models will likely have poor performance with both training and test sets.
- Overfitting usually takes place when a model has an excessively complex structure and learns both the existing relations among data and noise. Such models often have bad generalization capabilities. Although they work well with training data, they usually yield poor performance with unseen (test) data.

You can find a more detailed explanation of underfitting and overfitting in Linear Regression in Python.
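To illustrate, here’s a small sketch (not part of the original article) that fits a straight line and a high-degree polynomial to made-up noisy quadratic data; the exact scores depend on this assumed synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy quadratic data, used only for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 40).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.3, size=40)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

def r2_scores(degree):
    # Fit a polynomial model of the given degree and report train/test R²
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    return (model.score(poly.transform(x_train), y_train),
            model.score(poly.transform(x_test), y_test))

train_r2_line, test_r2_line = r2_scores(1)    # a line tends to underfit this data
train_r2_poly, test_r2_poly = r2_scores(15)   # degree 15 tends to overfit it
```

Comparing training and test scores for both models is what reveals the problem: the underfitted line scores poorly on both sets, while the overfitted polynomial scores well on training data but worse on the test set.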
Application of train_test_split()
You need to import train_test_split() and NumPy before you can use them, so you can start with the import statements:
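The original code block isn’t reproduced here; the imports presumably looked like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
```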
Now that you have both imported, you can use them to split data into training sets and test sets. You’ll split inputs and outputs at the same time, with a single function call.

With train_test_split(), you need to provide the sequences that you want to split as well as any optional arguments. It returns a list of NumPy arrays, other sequences, or SciPy sparse matrices if appropriate:
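A minimal sketch of such a call, using two made-up parallel sequences:

```python
from sklearn.model_selection import train_test_split

# Two sequences of the same length (illustrative values)
x = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Each input sequence contributes a (train, test) pair to the returned list
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
print(len(x_train), len(x_test))  # 6 2 with the default test_size of 0.25
```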
arrays is the sequence of lists, NumPy arrays, pandas DataFrames, or similar array-like objects that hold the data you want to split. All these objects together make up the dataset and must be of the same length.
In supervised machine learning applications, you’ll typically work with two such sequences:

- A two-dimensional array with the inputs (x)
- A one-dimensional array with the outputs (y)
options are the optional keyword arguments that you can use to get desired behavior:

- train_size is the number that defines the size of the training set. If you provide a float, then it must be between 0.0 and 1.0 and will define the share of the dataset used for training. If you provide an int, then it will represent the total number of training samples. The default value is None.
- test_size is the number that defines the size of the test set. It’s very similar to train_size. You should provide either train_size or test_size. If neither is given, then the default share of the dataset that will be used for testing is 0.25, or 25 percent.
- random_state is the object that controls randomization during splitting. It can be either an int or an instance of RandomState. The default value is None.
- shuffle is the Boolean object (True by default) that determines whether to shuffle the dataset before applying the split.
- stratify is an array-like object that, if not None, determines how to use a stratified split.
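For example, a stratified split preserves the class proportions in both subsets. This sketch uses made-up labels with a two-to-one class ratio:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

x = [[i] for i in range(12)]
y = [0] * 8 + [1] * 4  # made-up labels with a 2:1 class ratio

# Passing the labels to stratify keeps the 2:1 ratio in both subsets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0, stratify=y
)
print(Counter(y_train), Counter(y_test))
```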
Now it’s time to try data splitting! You’ll start by creating a simple dataset to work with. The dataset will contain the inputs in the two-dimensional array x and outputs in the one-dimensional array y:
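Since the original code block isn’t reproduced here, a plausible sketch with arbitrary example values might be:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Arbitrary example data: twelve observations with two features each
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# test_size=4 is an int, so exactly four samples go to the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4, random_state=4)
print(x_train.shape, x_test.shape)  # (8, 2) (4, 2)
```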
Supervised Machine Learning With train_test_split()
First, import LinearRegression and train_test_split():
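The import statements would be along these lines:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```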
Now that you’ve imported everything you need, you can create two small arrays, x and y, to represent the observations and then split them into training and test sets just as you did before:
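A sketch of this step, with hypothetical observation values and illustrative split parameters, followed by fitting a regression model on the training set only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical observations: one feature, roughly linear response
x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=8, random_state=0)

# Fit on the training set, then score on the held-out test set
model = LinearRegression().fit(x_train, y_train)
print(model.score(x_test, y_test))
```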
Regression Example
This example uses the Boston house prices dataset, which is included in sklearn. This dataset has 506 samples, 13 input variables, and the house values as the output. You can retrieve it with load_boston().

First, import train_test_split() and load_boston():
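A sketch of the loading step. Note that load_boston() was removed in scikit-learn 1.2, so the fallback below fabricates arrays of the same shape purely so the example still runs on recent versions:

```python
from sklearn.model_selection import train_test_split

try:
    # Available only in scikit-learn versions before 1.2
    from sklearn.datasets import load_boston
    x, y = load_boston(return_X_y=True)
except ImportError:
    # Stand-in arrays with the Boston dataset's shape (506 samples, 13 features)
    import numpy as np
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(506, 13)), rng.normal(size=506)
```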
As you can see, load_boston() with the argument return_X_y=True returns a tuple with two NumPy arrays:

- A two-dimensional array with the inputs
- A one-dimensional array with the outputs

The next step is to split the data the same way as before:
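The split step might look like this; the test_size value is illustrative, and a stand-in with the Boston dataset’s shape replaces the real arrays, since load_boston() is gone from recent scikit-learn releases:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in with the Boston dataset's shape; substitute the real x and y if available
rng = np.random.default_rng(0)
x, y = rng.normal(size=(506, 13)), rng.normal(size=506)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
print(x_train.shape, x_test.shape)  # (303, 13) (203, 13)
```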
Other Validation Functionalities
sklearn.model_selection offers a lot of functionalities related to model selection and validation, including the following:

- Cross-validation
- Learning curves
- Hyperparameter tuning
Cross-validation is a set of techniques that combine the measures of prediction performance to get more accurate model estimations.

One of the widely used cross-validation methods is k-fold cross-validation. In it, you divide your dataset into k (often five or ten) subsets, or folds, of equal size and then perform the training and test procedures k times. Each time, you use a different fold as the test set and all the remaining folds as the training set. This provides k measures of predictive performance, and you can then analyze their mean and standard deviation.
You can implement cross-validation with KFold, StratifiedKFold, LeaveOneOut, and a few other classes and functions from sklearn.model_selection.
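For instance, here’s a sketch of five-fold cross-validation with KFold and cross_val_score, on made-up linear data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up, perfectly linear data, so each fold should score close to 1.0
x = np.arange(30).reshape(-1, 1)
y = 2 * x.ravel() + 1

# Five folds: each iteration trains on four folds and tests on the fifth
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), x, y, cv=folds)
print(scores.mean(), scores.std())
```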
A learning curve, sometimes called a training curve, shows how the prediction score of training and validation sets depends on the number of training samples. You can use learning_curve() to get this dependency, which can help you find the optimal size of the training set, choose hyperparameters, compare models, and so on.
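A minimal sketch of learning_curve(), again on made-up linear data with illustrative training-set sizes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Made-up linear data for illustration
x = np.arange(50).reshape(-1, 1)
y = 3 * x.ravel() + 2

# Scores for training sets of 10, 20, and 30 samples, each cross-validated 5 ways
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), x, y, train_sizes=[10, 20, 30], cv=5,
    shuffle=True, random_state=0
)
print(train_sizes, valid_scores.mean(axis=1))
```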
Hyperparameter tuning, also called hyperparameter optimization, is the process of determining the best set of hyperparameters to define your machine learning model. sklearn.model_selection provides you with several options for this purpose, including GridSearchCV, RandomizedSearchCV, validation_curve(), and others. Splitting your data is also important for hyperparameter tuning.
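As a closing sketch, here’s a grid search over an illustrative grid of Ridge regularization strengths, with the data split beforehand so the test set stays untouched during tuning:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Made-up, roughly linear data
rng = np.random.default_rng(0)
x = np.arange(40).reshape(-1, 1)
y = 2 * x.ravel() + 3 + rng.normal(scale=0.5, size=40)

# Hold out a test set before tuning so the final evaluation stays unbiased
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Cross-validated search over an illustrative grid of alpha values
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.score(x_test, y_test))
```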