python split test |
https://www.geeksforgeeks.org/how-to-split-the-dataset-with-scikit-learns-train_test_split-function/ |
玩命的小虾米
1 month ago
In this article, we will discuss how to split a dataset using scikit-learn's train_test_split().
The train_test_split() function is used to split our data into train and test sets. First, we need to divide our data into features (X) and labels (y). The dataframe then gets divided into X_train, X_test, y_train, and y_test. The X_train and y_train sets are used for training and fitting the model. The X_test and y_test sets are used for testing whether the model predicts the right outputs/labels. We can explicitly set the size of the train and test sets. It is suggested to keep the train sets larger than the test sets.
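As a minimal sketch of the idea above, the snippet below splits a small made-up dataset into the four sets. The toy arrays and the chosen test_size and random_state values are illustrative assumptions, not data from this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 8 samples with 2 features each, one label per sample.
# Row i of X is [2*i, 2*i + 1] and its label is i, so pairs are easy to check.
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

# Hold out 25% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (6,) (2,)

# Features and labels are shuffled together, so each row keeps its own label.
print((X_train[:, 0] // 2 == y_train).all())  # True
```

The last check shows why X and y are passed in a single call: the same shuffled row order is applied to both.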
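Syntax: sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameters:
*arrays: Sequence of indexables with the same length, such as lists, NumPy arrays, SciPy sparse matrices, or pandas dataframes.
test_size: A float between 0.0 and 1.0 gives the proportion of the dataset to include in the test split; an int gives the absolute number of test samples. If None, it is set to the complement of train_size, or to 0.25 if train_size is also None.
train_size: A float between 0.0 and 1.0 gives the proportion of the dataset to include in the train split; an int gives the absolute number of train samples. If None, it is set to the complement of test_size.
random_state: Controls the shuffling applied to the data before the split. Pass an int for reproducible output across multiple function calls.
shuffle: Whether to shuffle the data before splitting. If shuffle=False, then stratify must be None.
stratify: If not None, the data is split in a stratified fashion, using this array as the class labels.
Returns:
splitting: A list containing the train-test split of the inputs. Its length is twice the number of input arrays.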
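The sketch below illustrates two points from the signature: the returned list has two entries per input array, and stratify keeps the class proportions in both splits. The arrays, including the third `weights` array, are made-up examples chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, balanced binary labels, plus a third array
# (for example per-sample weights) that must be split the same way.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)
weights = np.linspace(0.0, 1.0, 20)

# Three input arrays give a list of six outputs (train and test for each).
parts = train_test_split(
    X, y, weights, test_size=0.2, stratify=y, random_state=1)
print(len(parts))  # 6

X_train, X_test, y_train, y_test, w_train, w_test = parts

# With stratify=y, both classes keep their 50/50 share in each split.
print(np.bincount(y_train))  # [8 8]
print(np.bincount(y_test))   # [2 2]
```

Stratifying is mostly useful for classification problems with imbalanced classes, where a purely random split could leave a rare class out of the test set.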
In this step, we import the necessary packages or modules into the working Python environment.
Here, we load the CSV using the pd.read_csv() function from pandas and get the shape of the data set using the shape attribute of the dataframe.
CSV Used:
(13, 3)
Here, we assign the X and y variables: X holds the independent (feature) variables and y holds the dependent (target) variable.
Here, the train_test_split() function from sklearn.model_selection is used to split our data into train and test sets, with the feature and target variables given as input. test_size determines the portion of the data that goes into the test sets, and random_state is used for reproducibility.
# using the train test split function
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25, shuffle=True)
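To make the role of random_state and shuffle more concrete, here is a small sketch on a made-up array of 12 values (an assumption for illustration, not the article's CSV). It shows that the same random_state gives the same split, and that shuffle=False simply cuts off the tail of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: the integers 0..11 stand in for 12 rows.
data = np.arange(12)

# Two calls with the same random_state produce identical splits.
a_train, a_test = train_test_split(data, test_size=0.25, random_state=104)
b_train, b_test = train_test_split(data, test_size=0.25, random_state=104)
print(np.array_equal(a_train, b_train))  # True
print(np.array_equal(a_test, b_test))    # True

# With shuffle=False the order is preserved: the first 9 values form the
# train set and the last 3 values (9, 10, 11) form the test set.
c_train, c_test = train_test_split(data, test_size=0.25, shuffle=False)
print(c_train.tolist())
print(c_test.tolist())
```

Leaving the data unshuffled can be appropriate for ordered data such as time series, where the test set should come after the training period.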
Example:
In this example, the 'prediction.csv' file is imported. The df.shape attribute is used to retrieve the shape of the dataframe, which is (13, 3). The feature column is taken as the X variable and the outcome column as the y variable. X and y are passed to the train_test_split() function to split the data into train and test sets. The random_state parameter is used for reproducibility. test_size is given as 0.25, which means roughly 25% of the data goes into the test sets. Because 0.25 × 13 = 3.25 is not a whole number, the test set size is rounded up to 4 of the 13 rows, and the remaining 9 rows go into the train sets. The train sets are used to fit and train the machine learning model. The test sets are used for evaluation.
CSV Used:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# importing data
df = pd.read_csv('prediction.csv')
print(df.shape)

# head of the data
print('Head of the dataframe : ')
print(df.head())
print(df.columns)

X = df['area']
y = df['prices']

# using the train test split function
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25, shuffle=True)

# printing out train and test sets
print('X_train : ')
print(X_train.head())
print(X_train.shape)
print('')
print('X_test : ')
print(X_test.head())
print(X_test.shape)
print('')
print('y_train : ')
print(y_train.head())
print(y_train.shape)
print('')
print('y_test : ')
print(y_test.head())
print(y_test.shape)
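The row counts in this example can be checked without the CSV file. The sketch below builds a stand-in dataframe with 13 rows; the 'area' and 'prices' values are invented for illustration and are not the contents of the original file.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: made-up values, NOT the contents of the article's CSV.
df = pd.DataFrame({
    'area': [1000 + 100 * i for i in range(13)],
    'prices': [200000 + 15000 * i for i in range(13)],
})

X = df['area']
y = df['prices']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25, shuffle=True)

# 0.25 * 13 = 3.25, and the test set size is rounded up to a whole row count.
print(math.ceil(0.25 * len(df)))  # 4
print(X_train.shape, X_test.shape)  # (9,) (4,)

# pandas objects keep their original row labels, so X and y stay aligned.
print(X_train.index.equals(y_train.index))  # True
```

Because the original row labels are preserved, the index of each split shows which rows of the dataframe ended up in it.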
Example:
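In this example, the following steps are executed: the necessary packages are imported; the Advertising.csv file is loaded into a dataframe; rows with null values are dropped; the 'sales' column is taken as the target y and the remaining columns as the features X; the data is split with 30% held out for testing; the features are standardized with a StandardScaler fitted on the training set only; a LinearRegression model is fitted on the training set; and the predictions on the test set are evaluated with the mean squared error.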
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

df = pd.read_csv('Advertising.csv')

# dropping rows which have null values
df.dropna(inplace=True, axis=0)

# target column and feature columns
y = df['sales']
X = df.drop('sales', axis=1)

# splitting the dataframe into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# fit the scaler on the train set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# fit the model on the train set and evaluate it on the test set
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred)
print(mean_squared_error(y_test, y_pred))
array([19.82000933, 14.23636718, 12.80417236, 7.75461569, 8.31672266,
15.4001915 , 11.6590983 , 15.22650923, 15.53524916, 19.46415132,
17.21364106, 16.69603229, 16.46449309, 10.15345178, 13.44695953,
24.71946196, 18.67190453, 15.85505154, 14.45450049, 9.91684409,
10.41647177, 4.61335238, 17.41531451, 17.31014955, 21.72288151,
5.87934089, 11.29101265, 17.88733657, 21.04225992, 12.32251227,
14.4099317 , 15.05829814, 10.2105313 , 7.28532072, 12.66133397,
23.25847491, 18.87101505, 4.55545854, 19.79603707, 9.21203026,
10.24668718, 8.96989469, 13.33515217, 20.69532628, 12.17013119,
21.69572633, 16.7346457 , 22.16358256, 5.34163764, 20.43470231,
7.58252563, 23.38775769, 10.2270323 , 12.33473902, 24.10480458,
9.88919804, 21.7781076 ])
2.7506859249500466
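The same split, scale, fit, and evaluate sequence can also be written with a scikit-learn Pipeline, which fits the scaler on the training data only when fit() is called. The sketch below is an alternative arrangement, not the code used above, and it runs on synthetic data generated with made-up coefficients, since the Advertising.csv file is not included here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 3 features, and a target
# that is a linear function of the features plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
y = 3.0 + X @ np.array([0.05, 0.2, 0.01]) + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# The pipeline learns the scaling statistics from X_train during fit()
# and reuses them unchanged when predicting on X_test.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred.shape)  # (60,)
print(mean_squared_error(y_test, y_pred))
```

Keeping the scaler inside the pipeline avoids accidentally fitting it on the test set, which would leak information from the test data into training.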
Example:
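In this example, we use the K-nearest neighbors classifier model.
In this example, the following steps are executed: the necessary packages are imported; the iris dataset is loaded; the feature and target arrays are created; the data is split with 20% held out for testing; a KNeighborsClassifier with n_neighbors=1 is fitted on the training set; and predictions are made on the test set.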
# Import packages
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the data
irisData = load_iris()

# Create feature and target arrays
X = irisData.data
y = irisData.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# predicting on the X_test data set
print(knn.predict(X_test))
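The example above prints the predicted labels but does not compare them with y_test. As a possible follow-up, the sketch below uses the same split and classifier settings and measures accuracy on the held-out set with accuracy_score. This is an added illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the data and split it exactly as in the example above.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# Fit on the train set only.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Compare predictions on the unseen test rows with their true labels.
y_pred = knn.predict(X_test)
print(len(y_pred))  # 30
print(accuracy_score(y_test, y_pred))
```

The accuracy is computed only on rows the model did not see during fitting, which is the purpose of holding out a test set.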