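Packaging Projects

MLflow Projects Overview

MLflow Projects provides a standard format for packaging data science code so that it can be run reproducibly on other machines and platforms.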

text

┌─────────────────────────────────────────────────────────────┐
│                MLflow Projects architecture                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                     Project code                     │   │
│  │  ├── train.py                                        │   │
│  │  ├── preprocess.py                                   │   │
│  │  └── utils.py                                        │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                    MLproject file                    │   │
│  │  ├── Project name                                    │   │
│  │  ├── Environment configuration                       │   │
│  │  └── Entry point definitions                         │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│                          ▼                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                 Runtime environment                  │   │
│  │  ├── Conda environment                               │   │
│  │  ├── Docker container                                │   │
│  │  └── System environment                              │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Project Structure

Basic Project Layout

text

my_project/
├── MLproject           # Project description file
├── conda.yaml          # Conda environment definition
├── requirements.txt    # Python dependencies
├── train.py            # Training script
├── preprocess.py       # Preprocessing script
├── config/             # Configuration files
│   └── config.yaml
└── data/               # Data directory
    └── sample.csv

The MLproject File

yaml

name: my_project

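# A project declares a single environment; conda_env and docker_env cannot be combined.
conda_env: conda.yaml

# Docker alternative (use instead of conda_env, not alongside it):
# docker_env:
#   image: python:3.10-slim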

entry_points:
  main:
    parameters:
      data_path: {type: str, default: "data/train.csv"}
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 100}
    command: "python train.py --data_path {data_path} --lr {learning_rate} --epochs {epochs}"
  
  preprocess:
    parameters:
      input_path: {type: str}
      output_path: {type: str}
    command: "python preprocess.py --input {input_path} --output {output_path}"
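
The `{name}` placeholders in a `command` are filled from the entry point's parameters: declared defaults first, then any values supplied at run time. The sketch below imitates that substitution with Python's `str.format`. It is an illustration under that assumption rather than MLflow's own code, and `render_command` is a hypothetical helper.

```python
# Hypothetical helper: imitates how an entry-point command template
# could be filled from declared defaults plus user-supplied values.
def render_command(template, defaults, overrides=None):
    params = {**defaults, **(overrides or {})}  # overrides win over defaults
    return template.format(**params)


template = "python train.py --data_path {data_path} --lr {learning_rate} --epochs {epochs}"
defaults = {"data_path": "data/train.csv", "learning_rate": 0.01, "epochs": 100}

# Equivalent in spirit to: mlflow run . -P learning_rate=0.05
command = render_command(template, defaults, {"learning_rate": 0.05})
print(command)
```

Every value ends up as text on the command line, so the script itself is responsible for converting it back to the right type (for example with `argparse`).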

MLproject File in Detail

Project Name

yaml

name: customer-churn-prediction

Environment Configuration

text

┌─────────────────────────────────────────────────────────────┐
│               Environment configuration types               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Conda environment                                       │
│     ─────────────────────────────────────────────────────   │
│     conda_env: conda.yaml                                   │
│     The value is a path to a Conda environment file;        │
│     channels and dependencies are listed in that file:      │
│       name: mlflow-env                                      │
│       channels: [conda-forge]                               │
│       dependencies:                                         │
│         - python=3.10                                       │
│         - numpy                                             │
│         - pandas                                            │
│                                                             │
│  2. Docker environment                                      │
│     ─────────────────────────────────────────────────────   │
│     docker_env:                                             │
│       image: python:3.10-slim                               │
│       volumes: ["/host/path:/container/path"]               │
│       environment: [["CUDA_VISIBLE_DEVICES", "0"]]          │
│                                                             │
│  3. System environment                                      │
│     ─────────────────────────────────────────────────────   │
│     Run with --env-manager=local to use the current         │
│     system environment as-is                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Entry Point Definitions

yaml

entry_points:
  train:
    parameters:
      data_path: {type: str, default: "data/train.csv"}
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 100}
      batch_size: {type: int, default: 32}
    command: "python train.py --data {data_path} --lr {learning_rate} --epochs {epochs} --batch_size {batch_size}"
  
  evaluate:
    parameters:
      model_path: {type: str}
      test_data: {type: str}
    command: "python evaluate.py --model {model_path} --test {test_data}"
  
  predict:
    parameters:
      model_path: {type: str}
      input_data: {type: str}
      output_path: {type: str, default: "predictions.csv"}
    command: "python predict.py --model {model_path} --input {input_data} --output {output_path}"

Parameter Types

Declaring Parameters

yaml

parameters:
  string_param: {type: str, default: "hello"}
  int_param: {type: int, default: 100}
  float_param: {type: float, default: 0.01}
  bool_param: {type: bool, default: true}
  required_param: {type: str}

Parameter Type Reference

text

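┌─────────────────────────────────────────────────────────────┐
│                       Parameter types                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  str (string)                                               │
│  ─────────────────────────────────────────────────────────  │
│  {type: str, default: "value"}                              │
│  File paths, model names, and similar text                  │
│                                                             │
│  int (integer)                                              │
│  ─────────────────────────────────────────────────────────  │
│  {type: int, default: 100}                                  │
│  epochs, batch_size, and similar counts                     │
│                                                             │
│  float (floating point)                                     │
│  ─────────────────────────────────────────────────────────  │
│  {type: float, default: 0.01}                               │
│  learning_rate, dropout, and similar values                 │
│                                                             │
│  bool (boolean)                                             │
│  ─────────────────────────────────────────────────────────  │
│  {type: bool, default: true}                                │
│  On/off switches                                            │
│                                                             │
│  path, uri (files and storage locations)                    │
│  ─────────────────────────────────────────────────────────  │
│  {type: path} or {type: uri}                                │
│  path: local file; remote URIs are downloaded first         │
│  uri: location in local or distributed storage              │
│                                                             │
│  Required parameter (no default)                            │
│  ─────────────────────────────────────────────────────────  │
│  {type: str}                                                │
│  Must be supplied at run time                               │
│                                                             │
│  Note: MLflow documents string, float, path, and uri;       │
│  values under other labels reach the command as text.       │
│                                                             │
└─────────────────────────────────────────────────────────────┘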
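
Parameters with a default may be omitted at run time, while parameters without one must be supplied. The rule can be sketched with plain dictionaries as follows; `resolve_parameters` is a hypothetical helper written for illustration and does not reproduce MLflow's internals.

```python
# Hypothetical helper: merges user-supplied values with declared defaults
# and reports required parameters that were left out.
def resolve_parameters(declared, supplied):
    resolved, missing = {}, []
    for name, spec in declared.items():
        if name in supplied:
            resolved[name] = supplied[name]      # explicit value wins
        elif "default" in spec:
            resolved[name] = spec["default"]     # fall back to the default
        else:
            missing.append(name)                 # required but not given
    if missing:
        raise ValueError("Missing required parameters: " + ", ".join(missing))
    return resolved


declared = {
    "learning_rate": {"type": "float", "default": 0.01},
    "dropout": {"type": "float"},  # no default, so it is required
}
resolved = resolve_parameters(declared, {"dropout": 0.5})
print(resolved)
```

Calling `resolve_parameters(declared, {})` raises a `ValueError` that names `dropout`, which mirrors the "must be supplied at run time" rule.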

Running Projects

Using the mlflow run Command

bash

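# Run the default entry point (main) of the project in the current directory
mlflow run . -P learning_rate=0.01 -P epochs=100

# Run a specific entry point
mlflow run . -e train -P data_path=data/train.csv

# Run a project directly from a Git repository
mlflow run https://github.com/user/mlflow-project.git -P param=value

# Pass several parameters at once
mlflow run . -e main -P param1=value1 -P param2=value2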
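
When runs are launched from scripts, it can help to assemble the same command line programmatically. The sketch below builds the argument list for the `-e` and `-P` options shown above, plus `-v` for a Git version. `build_run_argv` is a hypothetical helper, not part of MLflow, and nothing is executed here.

```python
import shlex


# Hypothetical helper: assembles an `mlflow run` argument list from Python values.
def build_run_argv(uri, entry_point=None, version=None, parameters=None):
    argv = ["mlflow", "run", uri]
    if entry_point:
        argv += ["-e", entry_point]
    if version:
        argv += ["-v", version]
    for name, value in (parameters or {}).items():
        argv += ["-P", f"{name}={value}"]  # one -P flag per parameter
    return argv


argv = build_run_argv(
    ".",
    entry_point="train",
    parameters={"data_path": "data/train.csv", "epochs": 100},
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)`; passing a list avoids shell-quoting problems with parameter values.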

Using the Python API

python

import mlflow

mlflow.run(
    uri=".",
    entry_point="train",
    parameters={
        "learning_rate": 0.01,
        "epochs": 100
    }
)

mlflow.run(
    uri="https://github.com/user/mlflow-project.git",
    entry_point="main",
    version="main",
    parameters={
        "data_path": "data/train.csv"
    }
)

Run Options

text

┌─────────────────────────────────────────────────────────────┐
│                     mlflow run options                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  -P, --param-list         Set a parameter                   │
│  ─────────────────────────────────────────────────────────  │
│  -P learning_rate=0.01                                      │
│  -P epochs=100                                              │
│                                                             │
│  -e, --entry-point        Select the entry point            │
│  ─────────────────────────────────────────────────────────  │
│  -e train                                                   │
│  -e evaluate                                                │
│                                                             │
│  -v, --version            Select the Git version            │
│  ─────────────────────────────────────────────────────────  │
│  -v v1.0                                                    │
│  -v main                                                    │
│                                                             │
│  --experiment-name        Select the experiment             │
│  ─────────────────────────────────────────────────────────  │
│  --experiment-name my-experiment                            │
│                                                             │
│  MLFLOW_TRACKING_URI      Tracking store (env variable)     │
│  ─────────────────────────────────────────────────────────  │
│  MLFLOW_TRACKING_URI=sqlite:///mlflow.db mlflow run .       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Environment Configuration

Conda Environment

yaml

name: mlflow-project
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - numpy=1.24
  - pandas=2.0
  - scikit-learn=1.3
  - pip
  - pip:
    - mlflow==2.10.0
    - torch==2.0.0

Docker Environment

yaml

docker_env:
  image: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
  volumes:
    - "/host/data:/workspace/data"
    - "/host/models:/workspace/models"
  # Each entry is a [name, value] pair, or a bare name to copy from the host.
  environment:
    - ["CUDA_VISIBLE_DEVICES", "0"]
    - ["PYTHONPATH", "/workspace/src"]
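
MLflow's documentation describes two forms for `environment` entries under `docker_env`: a `[name, value]` pair that defines a new variable, and a bare name whose value is copied from the host. The sketch below shows how those two forms could translate into `docker run -e` flags. It is an illustration of the documented behaviour, not MLflow's implementation; `docker_env_flags` and the `AWS_PROFILE` entry are hypothetical.

```python
# Sketch only: mirrors the two documented entry forms, not MLflow's own code.
def docker_env_flags(entries, host_env):
    flags = []
    for entry in entries:
        if isinstance(entry, (list, tuple)):
            name, value = entry          # [name, value]: define a new variable
        else:
            name = entry                 # bare name: copy the host's value
            if name not in host_env:
                raise KeyError(f"{name} is not set on the host")
            value = host_env[name]
        flags += ["-e", f"{name}={value}"]
    return flags


entries = [["CUDA_VISIBLE_DEVICES", "0"], "AWS_PROFILE"]
flags = docker_env_flags(entries, host_env={"AWS_PROFILE": "dev"})
print(flags)
```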

Dockerfile Example

dockerfile

FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "train.py"]

Complete Project Example

Project Structure

text

customer_churn/
├── MLproject
├── conda.yaml
├── requirements.txt
├── src/
│   ├── __init__.py
│   ├── train.py
│   ├── evaluate.py
│   ├── predict.py
│   └── utils.py
├── config/
│   └── config.yaml
├── data/
│   └── .gitkeep
└── README.md

MLproject File

yaml

name: customer-churn-prediction

conda_env: conda.yaml

entry_points:
  train:
    parameters:
      data_path: {type: str, default: "data/train.csv"}
      model_type: {type: str, default: "random_forest"}
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 10}
      test_size: {type: float, default: 0.2}
    command: >
      python src/train.py
      --data_path {data_path}
      --model_type {model_type}
      --n_estimators {n_estimators}
      --max_depth {max_depth}
      --test_size {test_size}
  
  evaluate:
    parameters:
      model_path: {type: str}
      test_data: {type: str}
    command: >
      python src/evaluate.py
      --model_path {model_path}
      --test_data {test_data}
  
  predict:
    parameters:
      model_path: {type: str}
      input_data: {type: str}
      output_path: {type: str, default: "predictions.csv"}
    command: >
      python src/predict.py
      --model_path {model_path}
      --input_data {input_data}
      --output_path {output_path}

conda.yaml File

yaml

name: customer-churn
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - numpy=1.24
  - pandas=2.0
  - scikit-learn=1.3
  - matplotlib=3.7
  - seaborn=0.12
  - pyyaml=6.0
  - pip
  - pip:
    - mlflow==2.10.0
    - xgboost==2.0.0
    - lightgbm==4.0.0

train.py Example

python

import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default="data/train.csv")
    parser.add_argument("--model_type", type=str, default="random_forest")
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    parser.add_argument("--test_size", type=float, default=0.2)
    return parser.parse_args()

def main():
    args = parse_args()
    
    data = pd.read_csv(args.data_path)
    X = data.drop("target", axis=1)
    y = data["target"]
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=args.test_size, random_state=42
    )
    
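    # The experiment is chosen by the caller, for example:
    #   mlflow run . -e train --experiment-name customer-churn
    # Calling mlflow.set_experiment() here with a different experiment makes
    # mlflow.start_run() fail when the script runs under `mlflow run`.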
    
    with mlflow.start_run():
        mlflow.log_params({
            "model_type": args.model_type,
            "n_estimators": args.n_estimators,
            "max_depth": args.max_depth,
            "test_size": args.test_size
        })
        
        model = RandomForestClassifier(
            n_estimators=args.n_estimators,
            max_depth=args.max_depth,
            random_state=42
        )
        model.fit(X_train, y_train)
        
        y_pred = model.predict(X_test)
        
        metrics = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1_score": f1_score(y_test, y_pred)
        }
        
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(model, "model")
        
        print(f"Accuracy: {metrics['accuracy']:.4f}")
        print(f"F1 Score: {metrics['f1_score']:.4f}")

if __name__ == "__main__":
    main()
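
The wiring between the MLproject `command` and the script's arguments can be checked without training a model by handing `argparse` an explicit argument list. The sketch repeats the parser from `train.py`; the `argv` parameter is an addition made here for testing and is not in the script above.

```python
import argparse


def parse_args(argv=None):
    # Same options as train.py; argv=None falls back to sys.argv as usual.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default="data/train.csv")
    parser.add_argument("--model_type", type=str, default="random_forest")
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=10)
    parser.add_argument("--test_size", type=float, default=0.2)
    return parser.parse_args(argv)


# Values arrive as strings on the command line; argparse converts them.
args = parse_args(["--n_estimators", "200", "--test_size", "0.3"])
print(args.n_estimators, args.test_size, args.data_path)
```

Each flag name here has to match the one used in the entry point's `command`, otherwise `argparse` exits with an "unrecognized arguments" error.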

Running Remote Projects

Running from a Git Repository

bash

mlflow run https://github.com/user/mlflow-project.git \
    -e train \
    -P data_path=s3://bucket/data/train.csv \
    -P learning_rate=0.01

Running from a Specific Branch

bash

mlflow run https://github.com/user/mlflow-project.git \
    -v feature-branch \
    -e train \
    -P param=value

Running from a Specific Tag

bash

mlflow run https://github.com/user/mlflow-project.git \
    -v v1.0.0 \
    -e train

Workflow Orchestration

Multi-Step Workflows

python

import mlflow

with mlflow.start_run() as parent_run:
    preprocess_run = mlflow.run(
        ".",
        entry_point="preprocess",
        parameters={
            "input_path": "data/raw.csv",
            "output_path": "data/processed.csv"
        }
    )
    
    train_run = mlflow.run(
        ".",
        entry_point="train",
        parameters={
            "data_path": "data/processed.csv",
            "learning_rate": 0.01
        }
    )
    
    evaluate_run = mlflow.run(
        ".",
        entry_point="evaluate",
        parameters={
            # Model URI for the artifact the training step logged under "model"
            "model_path": f"runs:/{train_run.run_id}/model",
            "test_data": "data/test.csv"
        }
    )
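
The chaining logic can be separated from MLflow itself so that the step order and the hand-off between steps are easy to test. In the sketch below, `run_step` is any callable with the same keyword interface as `mlflow.run`; `run_pipeline`, `FakeRun`, and `fake_run` are hypothetical names, and the `runs:/<run_id>/model` URI assumes the training step logs its model under the artifact path `model`.

```python
# Hypothetical helper: chains the three steps through an injected `run_step`
# callable that accepts the same keywords as mlflow.run.
def run_pipeline(run_step, raw_path, processed_path, test_path):
    run_step(".", entry_point="preprocess",
             parameters={"input_path": raw_path, "output_path": processed_path})
    train = run_step(".", entry_point="train",
                     parameters={"data_path": processed_path})
    # Assumes the training step logged its model under the path "model".
    model_uri = f"runs:/{train.run_id}/model"
    run_step(".", entry_point="evaluate",
             parameters={"model_path": model_uri, "test_data": test_path})
    return model_uri


# Stand-in for mlflow.run so the wiring can be exercised without MLflow.
class FakeRun:
    run_id = "abc123"


calls = []


def fake_run(uri, entry_point, parameters):
    calls.append((entry_point, parameters))
    return FakeRun()


model_uri = run_pipeline(fake_run, "data/raw.csv", "data/processed.csv", "data/test.csv")
print(model_uri)
```

In a real workflow, `mlflow.run` would be passed in place of `fake_run`.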

Using Databricks Jobs

python

import mlflow

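# Submits the project as a Databricks job. Assumes MLFLOW_TRACKING_URI is set
# to "databricks" and workspace credentials are configured; cluster-spec.json
# is a placeholder for a JSON file describing the job cluster.
mlflow.run(
    "https://github.com/user/mlflow-project.git",
    entry_point="train",
    parameters={"param": "value"},
    backend="databricks",
    backend_config="cluster-spec.json",
)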

Best Practices

1. Project Structure Conventions

text

project/
├── MLproject           # Required
├── conda.yaml          # Recommended
├── requirements.txt    # Optional
├── src/                # Source code
├── config/             # Configuration files
├── tests/              # Test code
├── data/               # Data directory
└── README.md           # Documentation

2. Parameter Naming Conventions

yaml

parameters:
  data_path: {type: str, default: "data/train.csv"}
  model_learning_rate: {type: float, default: 0.01}
  model_batch_size: {type: int, default: 32}
  train_epochs: {type: int, default: 100}
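
Parameter names end up as `{placeholders}` in the `command` string. Assuming the template is filled with Python's `str.format` (an assumption about the implementation that is worth verifying against your MLflow version), a name containing a dot is read as attribute access rather than as a single key, so grouping prefixes are safer written with underscores:

```python
template = "python train.py --lr {model.learning_rate}"
try:
    template.format(**{"model.learning_rate": 0.01})
    outcome = "ok"
except KeyError as exc:
    # str.format looks up "model" first, then its attribute "learning_rate".
    outcome = f"KeyError: {exc}"

# The same grouping expressed with an underscore prefix substitutes cleanly.
safe = "python train.py --lr {model_learning_rate}".format(model_learning_rate=0.01)
print(outcome)
print(safe)
```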

3. Environment Isolation

yaml

# conda.yaml, referenced from MLproject with `conda_env: conda.yaml`
name: project-env-v1
channels: [conda-forge]
dependencies:
  - python=3.10
  - pip
  - pip:
    - mlflow
    - scikit-learn

4. Version Control

bash

git tag v1.0.0
git push origin v1.0.0

mlflow run https://github.com/user/project.git -v v1.0.0

Next Steps

Now that you have covered the core features of MLflow Projects, the next topic is the Model Registry, which manages model versions and their lifecycle.