What Is 'random_state' in sklearn.model_selection.train_test_split? An Example

As a data scientist or software engineer, you’re probably familiar with splitting your data into training and testing sets to validate the accuracy of your models. However, you may have come across the random_state parameter of the train_test_split function in the sklearn.model_selection module and wondered what it means.

In this article, we’ll explore what random_state is and why it’s important in data science. We’ll also demonstrate how you can use it in your projects to make your results reproducible.

Table of Contents

  1. Introduction
  2. What is train_test_split?
  3. What is random_state?
  4. Why is random_state important?
  5. Conclusion

What is train_test_split?

Before we dive into random_state, let’s first understand what train_test_split does. It’s a function in the sklearn.model_selection module that splits a dataset into two subsets: one for training and one for testing. The training set is used to train a machine learning model, while the testing set is used to evaluate its performance. Here’s an example of how to use train_test_split:

from sklearn.model_selection import train_test_split
# Example of dummy data for X (features) and y (labels)
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]
# Use train_test_split to split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Display training and testing data
print("Training data X:", X_train)
print("Training labels y:", y_train)
print("Testing data X:", X_test)
print("Testing labels y:", y_test)

In the example above, X and y are the dataset to be split, and test_size is the proportion of the dataset to be allocated to the testing set. The remaining data is used for training.

Output (one possible result, since random_state is not set):

Training data X: [[7, 8, 9], [4, 5, 6]]
Training labels y: [0, 1]
Testing data X: [[1, 2, 3]]
Testing labels y: [0]
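
To make the role of test_size more concrete, here is a minimal sketch. The toy dataset of 8 samples and the variable names are assumptions chosen for illustration. It shows that a float test_size is treated as a fraction of the dataset, while an integer test_size is treated as an absolute number of test samples.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with one feature each
X = [[i] for i in range(8)]
y = [i % 2 for i in range(8)]

# A float test_size is a fraction of the dataset: 0.25 * 8 = 2 test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(len(X_train), len(X_test))  # 6 2

# An int test_size is an absolute number of test samples
X_train_abs, X_test_abs, y_train_abs, y_test_abs = train_test_split(
    X, y, test_size=2
)
print(len(X_train_abs), len(X_test_abs))  # 6 2
```

Only the sizes are fixed here. Which samples land in each subset still depends on the random shuffle.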

Once the data is split, you can use the subsets to train and evaluate your model. However, because the data is shuffled randomly before it is split, the results you obtain may differ each time you run the code. This is where random_state comes in.

What is random_state?

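random_state is a parameter in train_test_split that controls the random number generator used to shuffle the data before splitting it. Setting it to a fixed integer ensures that the same shuffle is applied each time you run the code, resulting in the same split of the data. If it is left at its default of None, the shuffle draws on NumPy's global random state, so the split can change from run to run.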
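
The behaviour described above can be checked with a short sketch. The data, the seed value 0, and the variable names are assumptions chosen for illustration. Two calls with the same integer seed should return identical subsets, while a call without a seed is free to return a different one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same integer seed produce the same split
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_test_a, X_test_b))  # True

# Without random_state, the test set may change from one run to the next
_, X_test_c, _, y_test_c = train_test_split(X, y, test_size=0.3)
print("Unseeded test labels:", y_test_c)
```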

Let’s look at an example to demonstrate this. Suppose you have a dataset of 100 samples, and you want to split it into a training set of 70 samples and a testing set of 30 samples. Here’s how to do it:
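
The listing for this example does not appear in the text, so what follows is a reconstruction under stated assumptions rather than the original code. The printed output contains 70 training values and 30 testing values drawn from 0 to 99, with labels identical to the features, so the sketch assumes X and y are both np.arange(100) and test_size=0.3. The seed 42 is also an assumption. Any fixed integer would make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed data: 100 samples whose labels equal their features (0..99)
X = np.arange(100)
y = np.arange(100)

# random_state=42 is an assumed seed; a fixed integer makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training set X:", X_train)
print("Testing set X:", X_test)
print("Training labels y:", y_train)
print("Testing labels y:", y_test)
```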

Output:

Training set X: [11 47 85 28 93  5 66 65 35 16 49 34  7 95 27 19 81 25 62 13 24  3 17 38
  8 78  6 64 36 89 56 99 54 43 50 67 46 68 61 97 79 41 58 48 98 57 75 32
 94 59 63 84 37 29  1 52 21  2 23 87 91 74 86 82 20 60 71 14 92 51]
Testing set X: [83 53 70 45 44 39 22 80 10  0 18 30 73 33 90  4 76 77 12 31 55 88 26 42
 69 15 40 96  9 72]
Training labels y: [11 47 85 28 93  5 66 65 35 16 49 34  7 95 27 19 81 25 62 13 24  3 17 38
  8 78  6 64 36 89 56 99 54 43 50 67 46 68 61 97 79 41 58 48 98 57 75 32
 94 59 63 84 37 29  1 52 21  2 23 87 91 74 86 82 20 60 71 14 92 51]