What Is 'random_state' in sklearn.model_selection.train_test_split? (With Example)

As a data scientist or software engineer, you’re probably familiar with splitting your data into training and testing sets to validate the accuracy of your models. However, you may have come across the random_state parameter of the train_test_split function in the sklearn.model_selection module and wondered what it means.

By Saturn Cloud | Thursday, July 06, 2023 | Miscellaneous | Updated: Wednesday, January 24, 2024

In this article, we’ll explore what random_state is and why it’s important in data science. We’ll also demonstrate how you can use it in your projects to ensure your results are reproducible.

Introduction
What is train_test_split?
What is random_state?
Why is random_state important?
Conclusion

What is train_test_split?

Before we dive into random_state, let’s first understand what train_test_split does. It’s a function in the sklearn.model_selection module that splits a dataset into two subsets: one for training and one for testing. The training set is used to train a machine learning model, while the testing set is used to evaluate its performance. Here’s an example of how to use train_test_split:

from sklearn.model_selection import train_test_split
# Example of dummy data for X (features) and y (labels)
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]
# Use train_test_split to split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Display training and testing data
print("Training data X:", X_train)
print("Training labels y:", y_train)
print("Testing data X:", X_test)
print("Testing labels y:", y_test)

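In the example above, X holds the features and y the labels to be split, and test_size is the proportion of the dataset allocated to the testing set. The remaining data is used for training. With only three samples, test_size=0.3 rounds up to one test sample, leaving two for training.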

Output (one possible run):

Training data X: [[7, 8, 9], [4, 5, 6]]
Training labels y: [0, 1]
Testing data X: [[1, 2, 3]]
Testing labels y: [0]
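
To make the effect of test_size easier to see, here is a small sketch on a slightly larger, hypothetical dataset (the 10-row data and the variable names are made up for illustration). For a fractional test_size, scikit-learn rounds the number of test samples up, so 0.3 of 10 rows should give 3 test rows and 7 training rows.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration: 10 samples with one feature each
X_demo = [[i] for i in range(10)]
y_demo = [i % 2 for i in range(10)]  # alternating 0/1 labels

# Hold out 30% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3)

# Which rows land in each subset varies between runs, but the sizes do not
print("Training samples:", len(X_tr))
print("Testing samples:", len(X_te))
```

Only the subset sizes are fixed here. Which rows land in each subset is still decided by a random shuffle.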

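Once the data is split, you can use the subsets to train and evaluate your model. However, because train_test_split shuffles the data randomly before splitting by default, the rows that end up in each subset, and therefore your results, may differ each time you run the code. This is where random_state comes in.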
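
As a minimal sketch of that idea, the snippet below reuses the toy data from the example above. Passing the same integer to random_state makes repeated calls return the same split, while leaving it out lets the shuffle vary. The value 42 is an arbitrary choice for illustration; any integer would do.

```python
from sklearn.model_selection import train_test_split

# Same toy data as in the example above
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [0, 1, 0]

# Two calls with the same seed (42 is arbitrary) produce the same split
split_a = train_test_split(X, y, test_size=0.3, random_state=42)
split_b = train_test_split(X, y, test_size=0.3, random_state=42)
print("Same split with a fixed seed:", split_a == split_b)

# Without random_state, the test row may change from call to call
for _ in range(3):
    _, X_test, _, _ = train_test_split(X, y, test_size=0.3)
    print("Unseeded test data:", X_test)
```

The first print should always report True. The unseeded loop may or may not show different rows on a given run, since each call shuffles independently.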