In this article, we will discuss how to split a dataset using scikit-learn's train_test_split().

sklearn.model_selection.train_test_split() function:

The train_test_split() function is used to split our data into train and test sets. First, we divide our data into features (X) and labels (y). The dataframe is then divided into X_train, X_test, y_train, and y_test. The X_train and y_train sets are used for training and fitting the model, while the X_test and y_test sets are used to test whether the model predicts the right outputs/labels. We can explicitly set the sizes of the train and test sets; it is recommended to keep the train set larger than the test set.
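
For instance, a minimal sketch of the basic call looks like this (the arrays below are toy data invented purely for illustration):

Python3

# a minimal, self-contained sketch of the basic call
# (the toy arrays below are invented purely for illustration)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(5, 2)   # 5 samples, 2 features
y = np.array([0, 1, 0, 1, 0])     # 5 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

print(X_train.shape, X_test.shape)   # (3, 2) (2, 2)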

  • Train set: The subset of the data used to fit (train) the model. This data is seen and learned by the model.
  • Test set: The held-out subset of the data used to give an unbiased evaluation of the final model fit; the model never sees it during training.
  • Validation set: A subset of the data held back from training that is used to estimate model performance while tuning the model's hyperparameters.
  • Underfitting: An under-fitted model has a high error rate on both the training set and unseen data because it is unable to capture the relationship between the input and output variables.
  • Overfitting: An overfitted model matches its training data so closely that it fails to generalize, performing poorly against unseen data.
  • Syntax: sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

    Parameters:

  • *arrays: sequence of indexables. Lists, numpy arrays, scipy-sparse matrices, and pandas dataframes are all valid inputs.
  • test_size: int or float, by default None. If float, it should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, it represents the absolute number of test samples. If None, the value is set to the complement of the train size. It will be set to 0.25 if train_size is also None.
  • train_size: int or float, by default None. If float, it should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, it represents the absolute number of train samples. If None, the value is set to the complement of the test size.
  • random_state: int, by default None. Controls the shuffling applied to the data before the split. Pass an int for reproducible output across multiple function calls.
  • shuffle: boolean, by default True. Whether or not to shuffle the data before splitting. If shuffle=False, stratify must be None.
  • stratify: array-like, by default None. If not None, the data is split in a stratified fashion, using this as the class labels.
  • Returns:

    splitting: a list containing the train-test split of the inputs.
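
    To make these parameters concrete, here is a small sketch (the arrays are toy data invented for illustration) contrasting test_size as a float with test_size as an int, and showing how stratify preserves class proportions:

    Python3

    import numpy as np
    from sklearn.model_selection import train_test_split

    # toy data invented for illustration: 10 samples, imbalanced labels (7:3)
    X = np.arange(20).reshape(10, 2)
    y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

    # float test_size: 30% of the samples (3 of 10) go to the test set
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print(len(X_te))          # 3

    # int test_size: exactly 4 samples go to the test set
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=4, random_state=0)
    print(len(X_te))          # 4

    # stratify=y keeps the 7:3 class ratio in both splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    print(np.bincount(y_te))  # [2 1]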

    Steps to split the dataset:

    Step 1: Import the necessary packages or modules:

    In this step, we import the necessary packages or modules into the working Python environment; the imports below are the same ones used in the complete example later in the article.

    Python3
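
    # importing the packages used throughout this walkthrough
    # (the same imports as in the complete example below)
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split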

    Step 2: Import the dataframe/dataset:

    Here, we load the CSV using the pd.read_csv() method from pandas and get the shape of the dataset using the shape attribute of the dataframe.


    Python3
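
    # loading the CSV and printing its shape
    # (prediction.csv is the same file used in the complete example below)
    df = pd.read_csv('prediction.csv')
    print(df.shape)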

    (13, 3)

    Step 3: Get the X and y feature variables:

    Here, we assign the feature and target variables: the X variable holds the independent variable and the y variable holds the dependent (target) variable.

    Python3
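
    # X holds the independent variable, y the dependent one
    # (the same columns as in the complete example below)
    X = df['area']
    y = df['prices']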

    Step 4: Use the train_test_split() function to split data into train and test sets:

    Here, the train_test_split() function from sklearn.model_selection is used to split our data into train and test sets, with the feature variables given as input to the method. test_size determines the portion of the data that goes into the test set, and random_state is used for reproducibility of the split.

    Python3

    # using the train test split function
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=104, test_size=0.25, shuffle=True)

    Example:

    In this example, the 'prediction.csv' file is imported. The df.shape attribute is used to retrieve the shape of the dataframe, which is (13, 3). The feature column is taken in the X variable and the outcome column is taken in the y variable. X and y are passed into the train_test_split() method to split the dataframe into train and test sets. The random_state parameter is used for reproducibility. test_size is given as 0.25, which means 25% of the data goes into the test set: 4 out of the 13 rows in the dataframe. The remaining 75% of the data, 9 of the 13 rows, goes into the train set. The train set is used to fit and train the machine learning model; the test set is used for evaluation.


    Python3

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # importing data
    df = pd.read_csv('prediction.csv')
    print(df.shape)

    # head of the data
    print('Head of the dataframe : ')
    print(df.head())

    print(df.columns)

    X = df['area']
    y = df['prices']

    # using the train test split function
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=104, test_size=0.25, shuffle=True)

    # printing out train and test sets
    print('X_train : ')
    print(X_train.head())
    print(X_train.shape)
    print('')
    print('X_test : ')
    print(X_test.head())
    print(X_test.shape)
    print('')
    print('y_train : ')
    print(y_train.head())
    print(y_train.shape)
    print('')
    print('y_test : ')
    print(y_test.head())
    print(y_test.shape)
       Unnamed: 0  area         prices
    0           0  1000  316404.109589
    1           1  1500  384297.945205
    2           2  2300  492928.082192
    3           3  3540  661304.794521
    4           4  4120  740061.643836
    Index(['Unnamed: 0', 'area', 'prices'], dtype='object')
    X_train : 
    3    3540
    7    3460
    4    4120
    0    1000
    8    4750
    Name: area, dtype: int64
    X_test : 
    12    7100
    2     2300
    11    8600
    10    9000
    Name: area, dtype: int64
    y_train : 
    3    661304.794521
    7    650441.780822
    4    740061.643836
    0    316404.109589
    8    825607.876712
    Name: prices, dtype: float64
    y_test : 
    12    1.144709e+06
    2     4.929281e+05
    11    1.348390e+06
    10    1.402705e+06
    Name: prices, dtype: float64

    Example:

    In this example, the following steps are executed:

  • The necessary packages are imported.
  • The Advertising.csv data set is loaded and cleaned, and null values are dropped.
  • Feature and target arrays (X and y) are created.
  • The arrays created are split into train and test sets. 30% of the dataset goes into the test set, which means the remaining 70% of the data is the train set.
  • A standard scaler object is created.
  • The scaler is fit on X_train.
  • X_train and X_test are transformed using the transform() method.
  • A simple linear regression model is created.
  • The train sets are fit into the model.
  • The predict() method is used to carry out predictions on the X_test set.
  • The mean_squared_error() metric is used to evaluate the model.

    Python3

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    df = pd.read_csv('Advertising.csv')

    # dropping rows which have null values
    df.dropna(inplace=True, axis=0)

    y = df['sales']
    X = df.drop('sales', axis=1)

    # splitting the dataframe into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)

    # scaling: fit on the train set only, then transform both sets
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(y_pred)
    print(mean_squared_error(y_test, y_pred))

    array([19.82000933, 14.23636718, 12.80417236,  7.75461569,  8.31672266,
           15.4001915 , 11.6590983 , 15.22650923, 15.53524916, 19.46415132,
           17.21364106, 16.69603229, 16.46449309, 10.15345178, 13.44695953,
           24.71946196, 18.67190453, 15.85505154, 14.45450049,  9.91684409,
           10.41647177,  4.61335238, 17.41531451, 17.31014955, 21.72288151,
            5.87934089, 11.29101265, 17.88733657, 21.04225992, 12.32251227,
           14.4099317 , 15.05829814, 10.2105313 ,  7.28532072, 12.66133397,
           23.25847491, 18.87101505,  4.55545854, 19.79603707,  9.21203026,
           10.24668718,  8.96989469, 13.33515217, 20.69532628, 12.17013119,
           21.69572633, 16.7346457 , 22.16358256,  5.34163764, 20.43470231,
            7.58252563, 23.38775769, 10.2270323 , 12.33473902, 24.10480458,
            9.88919804, 21.7781076 ])

    2.7506859249500466

    Example:

    In this example, we will use the K-nearest neighbors (KNN) classifier model.

    In this example, the following steps are executed:

  • The necessary packages are imported.
  • The iris data is loaded from sklearn.datasets.
  • Feature and target arrays (X and y) are created.
  • The arrays created are split into train and test sets. 20% of the dataset goes into the test set, which means the remaining 80% of the data is the train set.
  • A basic KNN model is created using the KNeighborsClassifier class.
  • The train sets are fit into the KNN model.
  • The predict() method is used to carry out predictions on the X_test set.

    Python3

    # Import packages
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris

    # Load the data
    irisData = load_iris()

    # Create feature and target arrays
    X = irisData.data
    y = irisData.target

    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train, y_train)

    # predicting on the X_test data set
    print(knn.predict(X_test))
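
    As a follow-up, the fitted classifier can also be scored on the held-out test set. The snippet below is a small sketch of that evaluation; this step is an addition, not part of the original example, though accuracy_score() and the classifier's score() method are standard scikit-learn APIs.

    Python3

    # follow-up sketch: measuring accuracy on the held-out test set
    # (this evaluation step is an addition, not part of the original example)
    from sklearn.metrics import accuracy_score

    y_pred = knn.predict(X_test)
    print(accuracy_score(y_test, y_pred))   # fraction of correct predictions
    # equivalently: knn.score(X_test, y_test)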