Data Preprocessing #
Why Preprocess Data? #
The performance of a machine learning model depends heavily on the quality of its data, which makes preprocessing one of the most critical steps in the machine learning workflow.
Common Data Problems #
| Problem | Description | Remedy |
|---|---|---|
| Differing feature scales | Features span very different numeric ranges | Standardization/normalization |
| Missing values | Incomplete records | Imputation or deletion |
| Categorical variables | Non-numeric data | Encoding |
| Outliers | Extreme values distort the model | Detection and handling |
| Skewed distributions | Uneven value distribution | Transformation |
Standardization and Normalization #
StandardScaler (Standardization) #
Transforms each feature to have mean 0 and standard deviation 1.

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Mean:", scaler.mean_)    # per-feature mean learned from X
print("Std:", scaler.scale_)    # per-feature standard deviation
```

Formula: z = (x - μ) / σ
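The formula can be checked by hand against NumPy, using the same toy matrix as above. Note that StandardScaler uses the population standard deviation (ddof=0):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

# z = (x - mu) / sigma, computed directly with NumPy (ddof=0 matches sklearn)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(X_scaled, manual)
```

After scaling, each column has mean 0 and standard deviation 1 by construction.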
MinMaxScaler (Normalization) #
Rescales each feature to a given range (default [0, 1]).

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print("Min:", scaler.data_min_)
print("Max:", scaler.data_max_)
```

Formula: x_scaled = (x - min) / (max - min)
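As a sanity check, the same result falls out of the formula applied directly, and `inverse_transform` recovers the original units:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1, 10], [2, 20], [3, 30], [4, 40]], dtype=float)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# x_scaled = (x - min) / (max - min), column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# The fitted scaler can map the result back to the original scale
X_back = scaler.inverse_transform(X_scaled)
assert np.allclose(X_scaled, manual)
assert np.allclose(X_back, X)
```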
MaxAbsScaler #
Scales each feature by its maximum absolute value. Well suited to sparse data, since it preserves zeros.

```python
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

X = np.array([[1, -10], [2, 20], [3, -30]])
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
```
RobustScaler #
Scales using the median and interquartile range, making it robust to outliers.

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

X = np.array([[1, 10], [2, 20], [3, 30], [4, 400]])
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
print("Median:", scaler.center_)
print("IQR:", scaler.scale_)
```
Normalizer (Unit-Norm Scaling) #
Scales each sample (row) to unit norm. Unlike the scalers above, it works per sample rather than per feature, and despite the name it has nothing to do with regularization.

```python
from sklearn.preprocessing import Normalizer
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])
normalizer = Normalizer(norm='l2')
X_normalized = normalizer.fit_transform(X)
```
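To confirm what "unit norm" means in practice, every row of the transformed matrix has L2 norm 1:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1, 2, 3], [4, 5, 6]], dtype=float)
X_normalized = Normalizer(norm='l2').fit_transform(X)

# Each row is divided by its own Euclidean length
row_norms = np.linalg.norm(X_normalized, axis=1)
assert np.allclose(row_norms, 1.0)
```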
Scaling Methods Compared #
| Method | Typical use | Outliers |
|---|---|---|
| StandardScaler | Roughly normal data | Sensitive |
| MinMaxScaler | Data with known bounds | Sensitive |
| MaxAbsScaler | Sparse data | Sensitive |
| RobustScaler | Data containing outliers | Robust |
| Normalizer | Text/image features | - |
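The difference in outlier sensitivity is easy to see on a column containing one extreme value; a small illustrative sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Four ordinary values and one extreme value
X = np.array([[1], [2], [3], [4], [400]], dtype=float)

std = StandardScaler().fit_transform(X)
rob = RobustScaler().fit_transform(X)

# The outlier inflates the standard deviation, so StandardScaler squashes
# the ordinary points together; RobustScaler's median/IQR ignore it
print("spread of ordinary points (StandardScaler):", np.ptp(std[:4]))
print("spread of ordinary points (RobustScaler):", np.ptp(rob[:4]))
```

The robustly scaled values keep the ordinary points well separated, while the standardized ones collapse toward zero.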
Categorical Feature Encoding #
LabelEncoder #
Encodes class labels as integers from 0 to n_classes-1. Intended for the target variable, not for input features.

```python
from sklearn.preprocessing import LabelEncoder

y = ['cat', 'dog', 'cat', 'bird', 'dog']
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print("Encoded:", y_encoded)        # integers assigned in alphabetical class order
print("Classes:", encoder.classes_)
print("Decoded:", encoder.inverse_transform(y_encoded))
```
OrdinalEncoder #
Encodes categorical features as ordered integers.

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['low'], ['medium'], ['high'], ['medium']]
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_encoded = encoder.fit_transform(X)
```
OneHotEncoder #
Converts categorical features into one-hot (dummy) columns.

```python
from sklearn.preprocessing import OneHotEncoder

X = [['cat'], ['dog'], ['cat'], ['bird']]
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X)
print("Encoded:\n", X_encoded)
print("Feature names:", encoder.get_feature_names_out())
```
Comparison #
| Encoder | Pros | Cons | Use case |
|---|---|---|---|
| LabelEncoder | Simple | Implies an ordering | Target variable |
| OrdinalEncoder | Preserves order | Order must be specified | Ordinal categories |
| OneHotEncoder | No implied order | Adds dimensions | Nominal categories |
Handling Missing Values #
SimpleImputer #
Imputation with simple statistics.

```python
from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print("Fill values:", imputer.statistics_)  # per-column means used for filling
```
Strategy Options #
| Strategy | Description |
|---|---|
| mean | Fill with the mean (numeric only) |
| median | Fill with the median (numeric only) |
| most_frequent | Fill with the mode |
| constant | Fill with a constant |
```python
imputer_mean = SimpleImputer(strategy='mean')
imputer_median = SimpleImputer(strategy='median')
imputer_mode = SimpleImputer(strategy='most_frequent')
imputer_const = SimpleImputer(strategy='constant', fill_value=0)
```
KNNImputer #
Imputes missing values from the values of the k nearest neighbors.

```python
from sklearn.impute import KNNImputer
import numpy as np

X = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```
IterativeImputer #
Predicts each missing value from the other features. Still experimental, so the explicit enable import is required.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the feature)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
import numpy as np

X = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
imputer = IterativeImputer(estimator=BayesianRidge())
X_imputed = imputer.fit_transform(X)
```
Polynomial Features #
PolynomialFeatures #
Generates polynomial and interaction features.

```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.array([[1, 2], [3, 4]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print("Original shape:", X.shape)         # (2, 2)
print("Polynomial shape:", X_poly.shape)  # (2, 5): x0, x1, x0^2, x0*x1, x1^2
print("Feature names:", poly.get_feature_names_out())
```
Example #

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2]])
poly = PolynomialFeatures(degree=3, include_bias=True)
X_poly = poly.fit_transform(X)
print(X_poly)  # [[1. 2. 4. 8.]] -> bias, x, x^2, x^3
```
Function Transformers #
FunctionTransformer #
Applies an arbitrary function to the data.

```python
from sklearn.preprocessing import FunctionTransformer
import numpy as np

X = np.array([[1, 2], [3, 4]])
log_transformer = FunctionTransformer(np.log1p, validate=True)
X_log = log_transformer.fit_transform(X)
sqrt_transformer = FunctionTransformer(np.sqrt, validate=True)
X_sqrt = sqrt_transformer.fit_transform(X)
```
Custom Transformations #

```python
def custom_transform(X):
    return X * 2 + 1

transformer = FunctionTransformer(custom_transform)
X_transformed = transformer.fit_transform(X)
```
Binarization #
Binarizer #
Thresholds values to 0/1: entries greater than the threshold become 1, the rest become 0.

```python
from sklearn.preprocessing import Binarizer
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])
binarizer = Binarizer(threshold=3)
X_binary = binarizer.fit_transform(X)  # [[0 0 0], [1 1 1]] -- 3 is not > 3
```
Binning #
KBinsDiscretizer #
Discretizes continuous features into bins.

```python
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
discretizer = KBinsDiscretizer(
    n_bins=3,
    encode='onehot',
    strategy='uniform'
)
X_binned = discretizer.fit_transform(X)
```
Strategy Options #
| Strategy | Description |
|---|---|
| uniform | Equal-width bins |
| quantile | Equal-frequency bins |
| kmeans | Bins from K-Means cluster centers |
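On skewed data the strategies produce very different bin edges; a quick sketch comparing the first two:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed data: uniform bins split the *range*, quantile bins split the *counts*
X = np.array([[1], [1], [2], [2], [3], [100]], dtype=float)

uniform = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(X)
quantile = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile').fit(X)

print("uniform edges:", uniform.bin_edges_[0])    # evenly spaced from 1 to 100
print("quantile edges:", quantile.bin_edges_[0])  # crowded near the small values
```

With one extreme value, the uniform strategy leaves most samples in the first bin, while the quantile strategy keeps the bins equally populated.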
Outlier Handling #
Detecting Outliers #

```python
from sklearn.ensemble import IsolationForest
import numpy as np

X = np.array([[1], [2], [3], [4], [100]])
detector = IsolationForest(contamination=0.1, random_state=42)
outliers = detector.fit_predict(X)
print("Outlier labels:", outliers)  # 1 = inlier, -1 = outlier
```
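Since fit_predict returns 1 for inliers and -1 for outliers, a boolean mask drops the flagged rows; a sketch, where contamination=0.2 is an assumed fraction for this toy data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[1], [2], [3], [4], [100]], dtype=float)
labels = IsolationForest(contamination=0.2, random_state=42).fit_predict(X)

# Keep only the rows labelled as inliers (1)
X_clean = X[labels == 1]
print(X_clean.ravel())
```

Whether to drop, cap, or merely down-weight detected outliers depends on the application; removal is only one option.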
Using RobustScaler #

```python
from sklearn.preprocessing import RobustScaler
import numpy as np

X = np.array([[1], [2], [3], [4], [100]])
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```
Preprocessing Pipelines #
ColumnTransformer #
Applies different transformations to different columns.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35, 40],
    'income': [50000, 60000, 70000, 80000],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou']
})
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(), ['city'])
    ]
)
X_transformed = preprocessor.fit_transform(df)
```
make_column_transformer #
A shorthand that names the steps automatically.

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = make_column_transformer(
    (StandardScaler(), ['age', 'income']),
    (OneHotEncoder(), ['city'])
)
```
make_column_selector #
Selects columns automatically by dtype.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('cat', OneHotEncoder(), make_column_selector(dtype_include=object))
    ]
)
```
Complete Preprocessing Example #

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data for illustration; substitute your own DataFrame
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 28, 33],
    'income': [50000, 60000, 70000, 80000, 55000, 65000],
    'city': ['Beijing', 'Shanghai', 'Beijing', 'Guangzhou', 'Shanghai', 'Beijing'],
    'gender': ['M', 'F', 'F', 'M', 'M', 'F'],
    'label': [0, 1, 0, 1, 0, 1],
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='label'), df['label'], test_size=0.33, random_state=42
)

numeric_features = ['age', 'income']
categorical_features = ['city', 'gender']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
Best Practices #
1. Avoid Data Leakage #
Fit scalers on the training set only, then apply the same fitted transform to the test set.

```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse them; never refit on test
```

2. Use a Pipeline #

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
```

3. Handle Unknown Categories #

```python
encoder = OneHotEncoder(handle_unknown='ignore')
```

4. Persist the Preprocessor #

```python
from joblib import dump, load

dump(preprocessor, 'preprocessor.joblib')
preprocessor = load('preprocessor.joblib')
```
Next Steps #
With preprocessing under your belt, continue to Linear Models to start your supervised learning journey!
Last updated: 2026-04-04