Core Concepts #
Design Philosophy of Scikit-learn #
Scikit-learn uses a unified API design: every algorithm follows the same interface pattern, which makes the library easy to learn and use.
Core Interface Hierarchy #
```text
BaseEstimator
│
├── ClassifierMixin (classifiers)
│   └── predict()
│
├── RegressorMixin (regressors)
│   └── predict()
│
├── ClusterMixin (clustering)
│   └── fit_predict()
│
└── TransformerMixin (transformers)
    └── transform()
```
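The shared hierarchy can be verified directly in code; a minimal sketch checking that concrete estimators carry the expected mixins:

```python
from sklearn.base import ClassifierMixin, RegressorMixin, TransformerMixin
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler

# Every estimator derives from BaseEstimator; the mixins contribute
# the task-specific methods (predict, transform, ...).
print(isinstance(LogisticRegression(), ClassifierMixin))  # True
print(isinstance(LinearRegression(), RegressorMixin))     # True
print(isinstance(StandardScaler(), TransformerMixin))     # True
```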
Estimator #
Definition #
The estimator is the base abstraction for all objects in Scikit-learn: any object that can learn parameters from data is an estimator.
Core Methods #
| Method | Description |
|---|---|
| `fit(X, y)` | Learn model parameters from data |
| `set_params(**params)` | Set parameters |
| `get_params()` | Get parameters |
Example #
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, max_iter=100)
model.fit(X_train, y_train)
print(model.get_params())
print(model.coef_)
```
Estimator Conventions #
- The constructor only stores parameters; it performs no computation
- `fit()` returns `self` (supports method chaining)
- All learned attributes end with an underscore (`_`)

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)
print(clf.tree_)                 # learned attributes end with _
print(clf.feature_importances_)
```
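Because `fit()` returns `self`, training and prediction can be chained in a single expression; a small sketch using the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# fit() returns the estimator itself, so calls can be chained
y_pred = DecisionTreeClassifier(max_depth=3).fit(X, y).predict(X)
print(y_pred[:5])
```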
Predictor #
Definition #
A predictor is an estimator that can make predictions on new data, typically used in supervised learning tasks.
Core Methods #
| Method | Description |
|---|---|
| `predict(X)` | Predict target values |
| `predict_proba(X)` | Predict class probabilities (classifiers) |
| `predict_log_proba(X)` | Predict log probabilities |
| `decision_function(X)` | Decision function values |
| `score(X, y)` | Evaluate model performance |
Classification Predictor #
```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)
accuracy = clf.score(X_test, y_test)
```
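Not every classifier exposes `predict_proba`; `decision_function` is the alternative on margin-based models. A sketch using `SVC` on synthetic data (the dataset here is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=42)
svc = SVC().fit(X, y)

# SVC has no predict_proba unless constructed with probability=True,
# but decision_function returns signed distances to the boundary
scores = svc.decision_function(X)
print(scores.shape)  # one score per sample for binary problems
```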
Regression Predictor #
```python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2 = reg.score(X_test, y_test)  # score() returns R² for regressors
```
Clustering Predictor #
```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.predict(X)         # assign samples to the nearest cluster
labels = kmeans.labels_            # labels of the training data, set by fit()
centers = kmeans.cluster_centers_
```
Transformer #
Definition #
A transformer is an estimator that can transform data, used mainly for data preprocessing and feature engineering.
Core Methods #
| Method | Description |
|---|---|
| `fit(X)` | Learn the transformation parameters |
| `transform(X)` | Apply the transformation |
| `fit_transform(X)` | Fit and transform in one step |
Common Transformers #
```python
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
```
fit_transform vs fit + transform #
```python
# Option 1: fit and transform separately
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Option 2: fit_transform on the training data (equivalent, more concise)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # never fit on the test data
```
Custom Transformer #
```python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, base=np.e):
        self.base = base

    def fit(self, X, y=None):
        # stateless transformer: nothing to learn
        return self

    def transform(self, X):
        return np.log(X) / np.log(self.base)

transformer = LogTransformer(base=10)
X_log = transformer.fit_transform(X)
```
Meta-Estimator #
Definition #
A meta-estimator is an estimator that takes other estimators as parameters, used to enhance or combine base estimators.
Common Meta-Estimators #
| Meta-estimator | Purpose |
|---|---|
| Pipeline | Chain multiple steps |
| GridSearchCV | Hyperparameter tuning |
| VotingClassifier | Voting ensemble |
| BaggingClassifier | Bootstrap aggregating |
| AdaBoostClassifier | Boosting |
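As one sketch of the pattern, VotingClassifier takes a list of named base estimators in its constructor (the synthetic dataset below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)

# A meta-estimator receives other estimators as constructor arguments
voting = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=3)),
])
voting.fit(X, y)
print(voting.predict(X)[:5])
```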
Pipeline Example #
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
```
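Pipeline steps expose their parameters under the `<step_name>__<param_name>` naming convention, which both `set_params()` and GridSearchCV understand; a small sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Nested parameters are addressed as <step_name>__<param_name>
pipe.set_params(classifier__C=0.5)
print(pipe.get_params()['classifier__C'])  # 0.5
```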
GridSearchCV Example #
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf']
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)
```
Model Persistence #
Using joblib #
```python
from joblib import dump, load

dump(model, 'model.joblib')
model = load('model.joblib')
```
Using pickle #
```python
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
```
Dataset API #
Built-in Datasets #
```python
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.datasets import load_diabetes
from sklearn.datasets import fetch_20newsgroups, fetch_california_housing

# Note: load_boston was removed in scikit-learn 1.2;
# fetch_california_housing is the suggested replacement

iris = load_iris()
X, y = iris.data, iris.target
print(iris.feature_names)
print(iris.target_names)
```
Generated Datasets #
```python
from sklearn.datasets import make_classification, make_regression, make_blobs

# synthetic classification data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_classes=3,
    random_state=42
)

# synthetic regression data
X, y = make_regression(
    n_samples=1000,
    n_features=10,
    noise=0.1,
    random_state=42
)

# Gaussian blobs for clustering
X, y = make_blobs(
    n_samples=500,
    centers=3,
    n_features=2,
    random_state=42
)
```
Train/Test Split #
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # train_size defaults to the complement, 0.8
    random_state=42,
    stratify=y          # preserve class proportions in both splits
)
```
Parameter Conventions #
Common Parameters #
| Parameter | Description |
|---|---|
| `random_state` | Random seed for reproducibility |
| `n_jobs` | Number of parallel jobs |
| `verbose` | Logging verbosity |
| `warm_start` | Reuse the result of the previous fit() call |
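A sketch of what `warm_start` does in practice, using RandomForestClassifier on synthetic data: with `warm_start=True`, refitting after raising `n_estimators` only grows the additional trees instead of retraining from scratch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)

rf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=42)
rf.fit(X, y)

rf.set_params(n_estimators=100)
rf.fit(X, y)  # adds 50 more trees, keeping the existing ones

print(len(rf.estimators_))  # 100
```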
Parameter Naming Conventions #
```python
# parameter names are lowercase snake_case (placeholder estimator shown)
model = SomeEstimator(
    n_estimators=100,
    max_depth=5,
    min_samples_split=2,
    learning_rate=0.01
)
```
Model Inspection #
Getting Parameters #
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1.0, solver='lbfgs')
params = model.get_params()
print(params)
```
Setting Parameters #
```python
model.set_params(C=0.5, max_iter=200)
```
Checking Whether a Model Is Fitted #
```python
from sklearn.utils.validation import check_is_fitted
from sklearn.exceptions import NotFittedError

try:
    check_is_fitted(model)
    print("Model is fitted")
except NotFittedError:
    print("Model is not fitted")
```
Visualization Configuration #
Setting the Display Mode #
```python
from sklearn import set_config

set_config(display='text')     # plain-text representation
set_config(display='diagram')  # interactive HTML diagram (notebooks)
```
Pipeline Visualization #
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

set_config(display='diagram')
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipe  # in a notebook, this renders as an interactive diagram
```
Best Practices #
1. Avoid Data Leakage #
```python
# fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
2. Use Pipelines #
```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('preprocessor', StandardScaler()),
    ('model', LogisticRegression())
])
pipe.fit(X_train, y_train)
```
3. Set the Random Seed #
```python
model = RandomForestClassifier(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)
```
4. Use Cross-Validation #
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(f"Mean score: {scores.mean():.4f}")
```
Next Steps #
Once you understand the core concepts, continue with Data Preprocessing to master data-handling techniques!
Last updated: 2026-04-04