Dataset Splitting - JobPlus

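Hold-out

Split the original data in a ratio n : m with n + m = 1, for example train : test = 0.7 : 0.3 or train : test = 0.65 : 0.35. This fairly primitive approach does not work very well. Its drawbacks are:

Drawback 1: it wastes data, because the held-out samples are never used for training.
Drawback 2: it is prone to overfitting, and this is awkward to correct.

In that case we need another way of splitting: cross-validation, or leave-P-out.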
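To make the weakness of a single hold-out split more concrete, here is a minimal sketch that repeats the same 70/30 split with a few different seeds and collects the test accuracy each time. The seeds 0 to 4 and the name holdout_scores are arbitrary choices for illustration, not part of the original examples.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear")

# Repeat the same 70/30 hold-out split with different seeds.
# The seeds 0..4 are arbitrary and only serve as an illustration.
holdout_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    clf.fit(X_train, y_train)
    holdout_scores.append(clf.score(X_test, y_test))

print(holdout_scores)
print("spread:", max(holdout_scores) - min(holdout_scores))
```

Any difference between the scores comes only from which samples happened to land in the test set, which is what cross-validation tries to average out.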

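LOO (Leave One Out) or LPO (Leave P Out)

LOO: over the whole dataset, each time one sample is selected as the validation set and the remaining samples form the training set.
LPO: over the whole dataset, each time P samples are selected as the validation set and the remaining samples form the training set.

The advantage of LOO is that it avoids wasting data, but it also has a higher computational cost.
As an estimator of the test error, LOO generally has higher variance than K-Fold. However, when the learning curve is still steep at the training size in question, so that training on fewer samples noticeably hurts the model, LOO may give a better estimate than 5- or 10-fold cross-validation.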

K-Fold

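KFold divides all samples into k groups, called folds, of equal size where possible (if k = n, this is equivalent to the Leave One Out strategy). The prediction function is learned from k - 1 folds, and the remaining fold is used for testing. The Stacking ensemble method uses this scheme (Bagging uses subsampling instead, which is also an interesting approach and was covered in an earlier post).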
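The fold mechanics described above can be sketched with plain NumPy. This is an illustrative re-implementation under simple assumptions (consecutive folds, no shuffling), not Sklearn's own code; k_fold_indices is a hypothetical helper name.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k consecutive folds."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        # Train on the other k - 1 folds.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(6, 3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [2 3 4 5] [0 1]
# [0 1 4 5] [2 3]
# [0 1 2 3] [4 5]

# With k equal to the number of samples, every test fold holds one sample,
# which is the Leave One Out case.
loo_like = list(k_fold_indices(4, 4))
```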

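Note

That the data is i.i.d. is a common assumption in machine learning theory, but it rarely holds in practice. If the samples are known to have been generated by a time-dependent process, it is safer to use a time-series aware cross-validation scheme. Likewise, if we know that the generative process has a group structure (samples collected from different subjects, experiments, or measurement devices), it is safer to use group-wise cross-validation.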
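As a rough illustration of the two cases in this note, the sketch below uses Sklearn's TimeSeriesSplit and GroupKFold on toy data. The arrays and the group labels (imagine three subjects) are invented for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(6, 2)   # six toy samples, assumed to be in time order
y = np.array([0, 1, 0, 1, 0, 1])

# Time-ordered data: every test index comes after all training indices.
tscv = TimeSeriesSplit(n_splits=3)
ts_splits = list(tscv.split(X))
for train_idx, test_idx in ts_splits:
    print("time series:", train_idx, test_idx)

# Grouped data: the group labels are made up (e.g. three subjects).
groups = np.array([0, 0, 1, 1, 2, 2])
gkf = GroupKFold(n_splits=3)
group_splits = list(gkf.split(X, y, groups=groups))
for train_idx, test_idx in group_splits:
    print("grouped:", train_idx, test_idx)
```

In the first case the model is never trained on samples from the future; in the second, no group appears on both sides of a split.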

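Repetition and stratification

Stratified: for K-Fold, each fold keeps roughly the same class proportions as the full dataset.
Repeated: sampling with replacement, as in Bagging, where some samples appear more than once in the training set and others never appear. Note that Sklearn's RepeatedKFold means something different: it reruns K-Fold several times with a different randomization in each repetition.
Repeated stratified: for K-Fold in Sklearn, stratified K-Fold is repeated several times, so that in every split the class proportions are roughly equal to those of the original dataset.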
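A small sketch of the repeated stratified case, using Sklearn's RepeatedStratifiedKFold on toy labels. The data and the parameter values (2 folds, 3 repeats) are chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.zeros((8, 1))                     # feature values are irrelevant for splitting
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # balanced toy labels

# 2 folds repeated 3 times gives 6 train/test splits in total.
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=0)
splits = list(rskf.split(X, y))
for train_idx, test_idx in splits:
    # Each test fold should contain two samples of each class.
    print(test_idx, np.bincount(y[test_idx]))
```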

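Cross-validation

With LOO and LPO, cross-validation means that each sample (or each set of P samples) serves as the validation set once; the scores are then averaged to give the final score. K-Fold is similar, except that the data is split into K folds.

Sklearn provides convenient helper functions for cross-validation (CV).

Quick and easy usage

Load the data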

from sklearn.model_selection import train_test_split, LeaveOneOut, LeavePOut
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import accuracy_score
import numpy as np

# Iris dataset and a linear-kernel SVM, used in all examples below
iris = datasets.load_iris()
clf_svc = svm.SVC(kernel='linear')
iris.data.shape, iris.target.shape

((150, 4), (150,))

hold out

# Hold out 40% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
clf_svc.fit(X_train, y_train)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, clf_svc.predict(X_test))

0.9666666666666667

Leave One Out

loo = LeaveOneOut()
loo.get_n_splits(iris.data)

mean_accuracy_score_list = []
for train_index, test_index in loo.split(iris.data):
    # Train on all samples but one, then test on the single held-out sample
    clf_svc.fit(iris.data[train_index], iris.target[train_index])
    prediction = clf_svc.predict(iris.data[test_index])
    mean_accuracy_score_list.append(accuracy_score(iris.target[test_index], prediction))
print(np.average(mean_accuracy_score_list))

0.98
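The manual loop above can also be expressed with cross_val_score by passing a LeaveOneOut instance as the cv argument. This is a sketch of an equivalent shortcut; the names clf and loo_scores are chosen here for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear")

# One score per left-out sample: 1.0 if it was classified correctly, else 0.0.
loo_scores = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())
```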

Leave P Out

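LeavePOut is very similar to LeaveOneOut: it creates all possible training/test sets by removing p samples from the complete set. For n samples this produces m train-test pairs, where m is the number of unordered combinations of p samples chosen from n, that is m = C(n, p). Note that this greatly increases the computational cost: the example below needs m - n more model fits than the LOO example above (for n = 150 and p = 3, m = 551,300 compared with 150).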

lpo = LeavePOut(p=3)

mean_accuracy_score_list = []
for train_index, test_index in lpo.split(iris.data):
    # Train on n - 3 samples, then test on the 3 held-out samples
    clf_svc.fit(iris.data[train_index], iris.target[train_index])
    prediction = clf_svc.predict(iris.data[test_index])
    mean_accuracy_score_list.append(accuracy_score(iris.target[test_index], prediction))
print(np.average(mean_accuracy_score_list))

0.9793627184231215
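The number of splits m discussed above can be checked without training anything, as in the following sketch. It uses an all-zeros array with the same shape as iris.data as a stand-in, since only the number of samples matters for counting splits.

```python
import math
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.zeros((150, 4))  # same shape as iris.data; the values are irrelevant here

n_loo = LeaveOneOut().get_n_splits(X)      # one split per sample
n_lpo = LeavePOut(p=3).get_n_splits(X)     # one split per 3-sample combination

print(n_loo, n_lpo, math.comb(150, 3))
# 150 551300 551300
```

This is why the LeavePOut(p=3) loop takes so much longer than the LeaveOneOut loop on the same data.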

The following small example shows more clearly how the splits are generated:

X = np.ones(4)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]

K-Fold

Plain K-Fold only splits the data into folds. Stratified K-Fold (StratifiedKFold) additionally keeps the class proportions in each fold.

from sklearn.model_selection import KFold, StratifiedKFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=4)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

X = np.array([[1, 2, 3, 4],
              [11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34],
              [41, 42, 43, 44],
              [51, 52, 53, 54],
              [61, 62, 63, 64],
              [71, 72, 73, 74]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])

# random_state only has an effect when shuffle=True, so it is omitted here;
# recent scikit-learn versions raise an error if it is set with shuffle=False
stratified_folder = StratifiedKFold(n_splits=4, shuffle=False)
for train_index, test_index in stratified_folder.split(X, y):
    print("Stratified Train Index:", train_index)
    print("Stratified Test Index:", test_index)
    print("Stratified y_train:", y[train_index])
    print("Stratified y_test:", y[test_index], '\n')

Stratified Train Index: [1 3 4 5 6 7]
Stratified Test Index: [0 2]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 2 4 5 6 7]
Stratified Test Index: [1 3]
Stratified y_train: [1 0 1 1 0 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 5 7]
Stratified Test Index: [4 6]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

Stratified Train Index: [0 1 2 3 4 6]
Stratified Test Index: [5 7]
Stratified y_train: [1 1 0 0 1 0]
Stratified y_test: [1 0]

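In practice, however, we more often use cross_val_score, a ready-made cross-validation helper, for model selection. With an integer cv its default splitter is K-Fold (the stratified variant when the estimator is a classifier). We can also use cross_val_predict to obtain the predictions themselves, although a score computed from those pooled predictions may differ from the cross_val_score average and is not necessarily the best estimate.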

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; one accuracy score per fold
scores_clf_svc_cv = cross_val_score(clf_svc, iris.data, iris.target, cv=5)
print(scores_clf_svc_cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_clf_svc_cv.mean(), scores_clf_svc_cv.std() * 2))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
Accuracy: 0.98 (+/- 0.03)

from sklearn.model_selection import cross_val_predict

# Each sample is predicted by the model trained on the folds that exclude it
predicted = cross_val_predict(clf_svc, iris.data, iris.target, cv=10)
accuracy_score(iris.target, predicted)

0.9733333333333334
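Instead of an integer, the cv argument also accepts a splitter object, which gives explicit control over shuffling and stratification. The sketch below assumes a shuffled 5-fold stratified split; the values n_splits=5 and random_state=0 are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear")

# shuffle=True with a fixed random_state gives reproducible shuffled folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Passing the splitter explicitly also makes it easy to swap in LeaveOneOut, a grouped splitter, or a time-series splitter without changing the rest of the evaluation code.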
