Core Concepts #

Scikit-learn's Design Philosophy #

Scikit-learn uses a unified API design: every algorithm follows the same interface pattern, which makes the library straightforward to learn and use.
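
The unified interface means the same workflow works for any estimator. A minimal sketch (the dataset and split here are illustrative, not from the text):

```python
# Swapping the estimator class leaves the rest of the workflow unchanged.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    model.fit(X_train, y_train)   # identical interface for every estimator
    print(type(model).__name__, model.score(X_test, y_test))
```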

Core Interface Hierarchy #

text
BaseEstimator
    │
    ├── ClassifierMixin (classifiers)
    │       └── predict()
    │
    ├── RegressorMixin (regressors)
    │       └── predict()
    │
    ├── ClusterMixin (clustering)
    │       └── fit_predict()
    │
    └── TransformerMixin (transformers)
            └── transform()
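
The hierarchy above can be checked directly with isinstance, since each mixin is importable from sklearn.base. A quick illustrative sketch:

```python
# Each concrete estimator inherits from BaseEstimator plus a task mixin.
from sklearn.base import ClassifierMixin, RegressorMixin, ClusterMixin, TransformerMixin
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

print(isinstance(LogisticRegression(), ClassifierMixin))  # True
print(isinstance(LinearRegression(), RegressorMixin))     # True
print(isinstance(KMeans(), ClusterMixin))                 # True
print(isinstance(StandardScaler(), TransformerMixin))     # True
```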

Estimator #

Definition #

Estimators are the foundation of every object in Scikit-learn: any object that can learn parameters from data is an estimator.

Core Methods #

| Method | Description |
| --- | --- |
| `fit(X, y)` | Learns model parameters from the data |
| `set_params(**params)` | Sets estimator parameters |
| `get_params()` | Returns the estimator's parameters |

Example #

python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=100)
model.fit(X_train, y_train)

print(model.get_params())
print(model.coef_)

Estimator Rules #

  1. The constructor only stores parameters and performs no computation
  2. fit() returns self (enabling method chaining)
  3. All learned attributes end with an underscore (_)

python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)

print(clf.tree_)
print(clf.feature_importances_)
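
Rule 2 (fit() returns self) is what makes one-line chaining possible. A small sketch using the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# fit() returns the estimator itself, so training and prediction chain
y_pred = DecisionTreeClassifier(max_depth=3).fit(X, y).predict(X)
print(y_pred[:5])
```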

Predictor #

Definition #

A predictor is an estimator that can make predictions on new data, typically used in supervised learning tasks.

Core Methods #

| Method | Description |
| --- | --- |
| `predict(X)` | Predicts target values |
| `predict_proba(X)` | Predicts class probabilities (classifiers) |
| `predict_log_proba(X)` | Predicts log class probabilities |
| `decision_function(X)` | Returns decision function values |
| `score(X, y)` | Evaluates model performance |

Classification Predictors #

python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)
accuracy = clf.score(X_test, y_test)

Regression Predictors #

python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
r2_score = reg.score(X_test, y_test)

Clustering Predictors #

python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

labels = kmeans.predict(X)
labels = kmeans.labels_
centers = kmeans.cluster_centers_
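
Because clustering estimators mix in ClusterMixin, they also expose fit_predict(), which is equivalent to calling fit(X) and then reading labels_. An illustrative sketch:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# fit_predict() clusters X and returns the training labels in one call
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(np.unique(labels))  # [0 1 2]
```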

Transformer #

Definition #

A transformer is an estimator that transforms data, used mainly for data preprocessing and feature engineering.

Core Methods #

| Method | Description |
| --- | --- |
| `fit(X)` | Learns the transformation parameters |
| `transform(X)` | Applies the transformation |
| `fit_transform(X)` | Fits and transforms in one call |

Common Transformers #

python
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

fit_transform vs fit + transform #

python
scaler = StandardScaler()

# Option 1: fit on the training set, then transform both sets separately
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Option 2: fit_transform combines fit() and transform() on the training
# set; the test set must still use transform() only
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Custom Transformers #

python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, base=np.e):
        self.base = base
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.log(X) / np.log(self.base)

transformer = LogTransformer(base=10)
X_log = transformer.fit_transform(X)
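
Inheriting from BaseEstimator is what gives the custom class get_params()/set_params() for free, so it plugs into Pipeline and GridSearchCV. A sketch (repeating the class so the example is self-contained):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, base=np.e):
        self.base = base

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log(X) / np.log(self.base)

# get_params() is inherited from BaseEstimator and reads __init__ arguments
print(LogTransformer(base=10).get_params())  # {'base': 10}
```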

Meta-Estimator #

Definition #

A meta-estimator is an estimator that takes other estimators as parameters, used to enhance or combine base estimators.

Common Meta-Estimators #

| Meta-estimator | Purpose |
| --- | --- |
| `Pipeline` | Chains multiple processing steps |
| `GridSearchCV` | Hyperparameter tuning |
| `VotingClassifier` | Voting ensemble |
| `BaggingClassifier` | Bootstrap aggregating |
| `AdaBoostClassifier` | Boosting |

Pipeline Example #

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
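
The steps of a Pipeline can be inspected via named_steps, and nested parameters are addressed with the 'step__param' convention. An illustrative sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

print(pipe.named_steps['classifier'])      # the LogisticRegression step
pipe.set_params(classifier__C=0.5)         # '<step>__<param>' addresses nested params
print(pipe.get_params()['classifier__C'])  # 0.5
```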

GridSearchCV Example #

python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)
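
With the default refit=True, GridSearchCV refits the best configuration on the full training set after the search, so the fitted object can predict directly. A self-contained sketch (the dataset here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

grid_search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_estimator_)    # the SVC refit with the best params
y_pred = grid_search.predict(X_test)  # delegates to best_estimator_
```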

Model Persistence #

Using joblib #

python
from joblib import dump, load

dump(model, 'model.joblib')
model = load('model.joblib')

Using pickle #

python
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Dataset Interface #

Built-in Datasets #

python
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.datasets import load_diabetes  # note: load_boston was removed in scikit-learn 1.2
from sklearn.datasets import fetch_20newsgroups, fetch_california_housing

iris = load_iris()
X, y = iris.data, iris.target
print(iris.feature_names)
print(iris.target_names)
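
The loaders also accept return_X_y=True to get arrays directly, and as_frame=True to get pandas objects (pandas must be installed):

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch object and returns (data, target) directly
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)

# as_frame=True exposes the data as a pandas DataFrame via .frame
iris = load_iris(as_frame=True)
print(iris.frame.head())
```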

Synthetic Datasets #

python
from sklearn.datasets import make_classification, make_regression, make_blobs

X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_classes=3,
    random_state=42
)

X, y = make_regression(
    n_samples=1000,
    n_features=10,
    noise=0.1,
    random_state=42
)

X, y = make_blobs(
    n_samples=500,
    centers=3,
    n_features=2,
    random_state=42
)

Dataset Splitting #

python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,
    train_size=0.8,
    random_state=42,
    stratify=y
)

Parameter Conventions #

Common Parameters #

| Parameter | Description |
| --- | --- |
| `random_state` | Random seed for reproducibility |
| `n_jobs` | Number of parallel jobs (`-1` uses all cores) |
| `verbose` | Verbosity of logging output |
| `warm_start` | Whether to reuse the solution of the previous fit |

Parameter Naming Conventions #

python
model = SomeEstimator(
    n_estimators=100,
    max_depth=5,
    min_samples_split=2,
    learning_rate=0.01
)

Model Inspection #

Getting Parameters #

python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, solver='lbfgs')
params = model.get_params()
print(params)

Setting Parameters #

python
model.set_params(C=0.5, max_iter=200)

Checking Whether a Model Is Fitted #

python
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

try:
    check_is_fitted(model)
    print("Model is fitted")
except NotFittedError:  # a bare except would also swallow unrelated errors
    print("Model is not fitted")

Visualization Configuration #

Setting the Display Mode #

python
from sklearn import set_config

set_config(display='text')     # plain-text representation
set_config(display='diagram')  # interactive HTML diagram (e.g. in notebooks)

Pipeline Visualization #

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

set_config(display='diagram')

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

pipe  # in a notebook, this renders an interactive HTML diagram of the pipeline

Best Practices #

1. Avoid Data Leakage #

python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn statistics from training data only
X_test_scaled = scaler.transform(X_test)        # apply the same statistics to test data

2. Use Pipelines #

python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('preprocessor', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)

3. Set Random Seeds #

python
model = RandomForestClassifier(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

4. Use Cross-Validation #

python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"Mean score: {scores.mean():.4f}")

Next Steps #

With the core concepts in hand, continue to Data Preprocessing to master data-handling techniques!

Last updated: 2026-04-04