# Cross-Validation

## Overview

Cross-validation is a technique for assessing model performance: the model is trained and validated several times on different splits of the data, yielding a more reliable performance estimate than a single train/test split.
## Why Cross-Validation?

| Issue | Single hold-out split | Cross-validation |
|---|---|---|
| Data usage | Only part of the data is ever used for validation | Every sample is used for both training and validation |
| Evaluation stability | Depends on one particular split | Scores are averaged over several splits |
| Overfitting detection | Hard | Effective |
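To make the difference concrete, here is a minimal sketch (the variable names are illustrative, not from any fixed API) that scores the same model on three different single splits and then with 5-fold cross-validation; the single-split scores typically scatter more than the fold average:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Single splits: the score depends on which rows happen to land in the test set
for seed in range(3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f"split seed={seed}: {model.fit(X_tr, y_tr).score(X_te, y_te):.4f}")

# Cross-validation: average over 5 folds for a more stable estimate
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold mean: {scores.mean():.4f} +/- {scores.std():.4f}")
```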
## Types of Cross-Validation

| Type | Description | Typical use |
|---|---|---|
| K-Fold | K-fold cross-validation | General purpose |
| Stratified K-Fold | Stratified K-fold | Classification |
| Leave-One-Out | Leave-one-out | Small datasets |
| Time Series Split | Forward-chaining splits | Time-series data |
## K-Fold Cross-Validation

### Basic usage

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
model = RandomForestClassifier(random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)

print(f"Per-fold scores: {scores}")
print(f"Mean score: {scores.mean():.4f}")
print(f"Std deviation: {scores.std():.4f}")
```
### Parameters

| Parameter | Description | Default |
|---|---|---|
| `n_splits` | Number of folds | 5 |
| `shuffle` | Whether to shuffle before splitting | False |
| `random_state` | Random seed (only used when `shuffle=True`) | None |
### Iterating over folds manually

```python
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    print(f"Fold {fold + 1}: {score:.4f}")
```
## Stratified K-Fold Cross-Validation

### Basic usage

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(f"Mean score: {scores.mean():.4f}")
```
### Why stratify?

Each fold produced by `StratifiedKFold` preserves the overall class proportions; plain `KFold` makes no such guarantee, which matters when classes are imbalanced or the data is sorted by label (as iris is). The plot below visualizes which classes land in each validation fold:

```python
import matplotlib.pyplot as plt

kf = KFold(n_splits=5)
skf = StratifiedKFold(n_splits=5)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (name, cv) in zip(axes, [('KFold', kf), ('StratifiedKFold', skf)]):
    for fold, (_, val_idx) in enumerate(cv.split(X, y)):
        ax.scatter([fold] * len(val_idx), y[val_idx], alpha=0.5, s=10)
    ax.set_xlabel('Fold')
    ax.set_ylabel('Class')
    ax.set_title(name)
plt.tight_layout()
plt.show()
```
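The plot makes the point visually; a quick numeric check (a sketch, not part of the plotting code above) shows the same thing. Because the iris labels are sorted, unshuffled `KFold` puts at most two classes in each validation fold, while `StratifiedKFold` keeps 10 samples per class in every fold:

```python
import numpy as np

for name, cv in [('KFold', KFold(n_splits=5)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5))]:
    print(name)
    for fold, (_, val_idx) in enumerate(cv.split(X, y)):
        # np.bincount counts how many samples of each class fall into this fold
        print(f"  Fold {fold + 1} class counts: {np.bincount(y[val_idx], minlength=3)}")
```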
## Leave-One-Out Cross-Validation

### Basic usage

```python
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)   # one fit per sample: 150 fits on iris

print(f"Number of iterations (= samples): {len(scores)}")
print(f"Mean score: {scores.mean():.4f}")
```
### Leave-P-Out

```python
from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=2)
# Restricted to 50 samples because Leave-P-Out enumerates every size-p subset
scores = cross_val_score(model, X[:50], y[:50], cv=lpo)
print(f"Number of iterations: {len(scores)}")
```
## Repeated K-Fold

### Basic usage

```python
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rkf)

print(f"Total evaluations: {len(scores)}")   # 5 folds x 10 repeats = 50
print(f"Mean score: {scores.mean():.4f}")

# Stratified variant for classification
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
```
## Time Series Split

### Basic usage

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Synthetic ordered data (named X_ts/y_ts so the iris X, y above stay intact)
X_ts = np.arange(100).reshape(100, 1)
y_ts = np.arange(100)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_ts)):
    print(f"Fold {fold + 1}: Train={train_idx.min()}-{train_idx.max()}, Val={val_idx.min()}-{val_idx.max()}")
```
### Parameters

| Parameter | Description |
|---|---|
| `n_splits` | Number of splits |
| `max_train_size` | Maximum size of the training set |
| `test_size` | Size of the test set |
| `gap` | Number of samples skipped between train and test |
```python
tscv = TimeSeriesSplit(
    n_splits=5,
    max_train_size=50,
    test_size=10,
    gap=2
)
for train_idx, val_idx in tscv.split(X_ts):
    # With gap=2, two samples between the training end and the test start are skipped
    print(f"Train={train_idx.min()}-{train_idx.max()}, Val={val_idx.min()}-{val_idx.max()}")
```
## Group Cross-Validation

### GroupKFold

```python
from sklearn.model_selection import GroupKFold

# Toy data (named X_g/y_g to keep the iris X, y from earlier sections)
X_g = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y_g = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X_g, y_g, groups):
    print(f"Train groups: {groups[train_idx]}, Val groups: {groups[val_idx]}")
```
### GroupShuffleSplit

```python
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=5, test_size=0.3, random_state=42)
for train_idx, val_idx in gss.split(X_g, y_g, groups):
    print(f"Train groups: {set(groups[train_idx])}, Val groups: {set(groups[val_idx])}")
```
## ShuffleSplit

### Basic usage

```python
from sklearn.model_selection import ShuffleSplit

ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
scores = cross_val_score(model, X, y, cv=ss)
print(f"Mean score: {scores.mean():.4f}")
```
### StratifiedShuffleSplit

```python
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
scores = cross_val_score(model, X, y, cv=sss)
print(f"Mean score: {scores.mean():.4f}")
```
## cross_validate

### Multi-metric evaluation

```python
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    print(f"{metric}: {results['test_' + metric].mean():.4f}")
```
### Returning training scores

```python
results = cross_validate(
    model, X, y, cv=5,
    scoring='accuracy',
    return_train_score=True
)
print(f"Train score: {results['train_score'].mean():.4f}")
print(f"Validation score: {results['test_score'].mean():.4f}")
```
### Returning the fitted estimators

```python
results = cross_validate(
    model, X, y, cv=5,
    return_estimator=True
)
estimators = results['estimator']   # one fitted estimator per fold
```
## cross_val_predict

### Basic usage

```python
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(model, X, y, cv=5)   # out-of-fold prediction for every sample
print(f"Predictions: {y_pred[:10]}")
```
### Getting predicted probabilities

```python
y_proba = cross_val_predict(model, X, y, cv=5, method='predict_proba')
print(f"Probability shape: {y_proba.shape}")   # (n_samples, n_classes)
```
## Nested Cross-Validation

### Basic structure

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=42)   # hyperparameter selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)   # performance estimation

clf = GridSearchCV(
    estimator=SVC(),
    param_grid=param_grid,
    cv=inner_cv
)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"Nested CV score: {nested_scores.mean():.4f}")
```
### Why nest?

If the same cross-validation both tunes hyperparameters and reports the final score, the score is optimistically biased; nesting keeps the two roles separate:

```text
Outer cross-validation (estimates performance)
├── Fold 1
│   └── Inner cross-validation (selects hyperparameters)
├── Fold 2
│   └── Inner cross-validation
└── ...
```
## Learning Curves

### Plotting a learning curve

```python
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

plt.plot(train_sizes, train_mean, label='Training Score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.plot(train_sizes, val_mean, label='Validation Score')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1)
plt.xlabel('Training Size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()
```
## Validation Curves

### Plotting a validation curve

```python
from sklearn.model_selection import validation_curve

param_range = range(1, 11)   # candidate values for max_depth
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.plot(param_range, train_mean, label='Training Score')
plt.plot(param_range, val_mean, label='Validation Score')
plt.xlabel('max_depth')
plt.ylabel('Score')
plt.legend()
plt.title('Validation Curve')
plt.show()
```
## Custom Cross-Validation

### Writing a custom splitter

```python
from sklearn.model_selection import BaseCrossValidator

class CustomCV(BaseCrossValidator):
    """Sequential (non-shuffled) K-fold splitter."""

    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        fold_size = n_samples // self.n_splits
        for i in range(self.n_splits):
            val_start = i * fold_size
            # The last fold absorbs any remainder
            val_end = (i + 1) * fold_size if i < self.n_splits - 1 else n_samples
            train_idx = np.concatenate([np.arange(0, val_start), np.arange(val_end, n_samples)])
            val_idx = np.arange(val_start, val_end)
            yield train_idx, val_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

custom_cv = CustomCV(n_splits=5)
scores = cross_val_score(model, X, y, cv=custom_cv)
print(f"Mean score: {scores.mean():.4f}")
```
## Best Practices

### 1. Choose an appropriate CV method

| Scenario | Recommended method |
|---|---|
| Classification | StratifiedKFold |
| Regression | KFold |
| Time series | TimeSeriesSplit |
| Small datasets | LeaveOneOut |
| Grouped data | GroupKFold |
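If you find yourself making this choice repeatedly, a small helper can encode the table; `make_cv` below is a hypothetical convenience function (my own name, not a scikit-learn API):

```python
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     LeaveOneOut, GroupKFold)

def make_cv(task, n_splits=5, random_state=42):
    """Map a scenario from the table above to a splitter (hypothetical helper)."""
    if task == 'classification':
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    if task == 'regression':
        return KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    if task == 'time_series':
        return TimeSeriesSplit(n_splits=n_splits)
    if task == 'small_dataset':
        return LeaveOneOut()
    if task == 'grouped':
        return GroupKFold(n_splits=n_splits)
    raise ValueError(f"Unknown task: {task}")

scores = cross_val_score(model, X, y, cv=make_cv('classification'))
```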
### 2. Set a random seed

```python
kf = KFold(n_splits=5, shuffle=True, random_state=42)   # reproducible folds
```
### 3. Use parallel computation

```python
scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)   # use all CPU cores
```
### 4. Check for overfitting

```python
results = cross_validate(model, X, y, cv=5, return_train_score=True)
gap = results['train_score'].mean() - results['test_score'].mean()
if gap > 0.1:   # a rough rule of thumb, not a hard threshold
    print("Possible overfitting")
```
## Next Steps

With cross-validation in hand, continue to Performance Metrics to learn how to evaluate models comprehensively!

Last updated: 2026-04-04