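Cluster Management #

Ray Cluster Architecture #

A Ray cluster consists of one Head Node and any number of Worker Nodes. The Head Node runs the cluster's control services (such as the GCS and the Dashboard), and the Worker Nodes execute tasks and actors.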

text

┌─────────────────────────────────────────────────────────────┐
│                  Ray Cluster Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                   Head Node                          │   │
│  │  ┌─────────────┐  ┌─────────────┐                   │   │
│  │  │    GCS      │  │  Dashboard  │                   │   │
│  │  │ Global Ctrl │  │Monitoring UI│                   │   │
│  │  └─────────────┘  └─────────────┘                   │   │
│  │  ┌─────────────┐  ┌─────────────┐                   │   │
│  │  │  Scheduler  │  │Object Store │                   │   │
│  │  │             │  │             │                   │   │
│  │  └─────────────┘  └─────────────┘                   │   │
│  └─────────────────────────────────────────────────────┘   │
│                          │                                  │
│           ┌──────────────┼──────────────┐                  │
│           ▼              ▼              ▼                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │Worker Node 1│  │Worker Node 2│  │Worker Node 3│        │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │        │
│  │ │ Workers │ │  │ │ Workers │ │  │ │ Workers │ │        │
│  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │        │
│  │ ┌─────────┐ │  │ ┌─────────┐ │  │ ┌─────────┐ │        │
│  │ │  Store  │ │  │ │  Store  │ │  │ │  Store  │ │        │
│  │ └─────────┘ │  │ └─────────┘ │  │ └─────────┘ │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Local Cluster #

Start the Cluster #

bash

ray start --head --port=6379

Add Worker Nodes #

bash

ray start --address=head-node-ip:6379

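Connect to the Cluster #

python

import ray

# Option 1: run on a cluster node and attach to the running cluster.
ray.init(address="auto")

# Option 2: connect from a remote machine through Ray Client.
# Use one option per process; calling ray.init() a second time without
# ray.shutdown() raises an error.
# ray.init(address="ray://head-node-ip:10001")

print(ray.cluster_resources())

ray.shutdown()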
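
The two connection styles above differ in where the driver runs: `address="auto"` attaches from a machine inside the cluster, while a `ray://` address goes through Ray Client from elsewhere. The following is a minimal sketch of a helper that picks between them. `resolve_address` and `connect` are hypothetical names introduced here for illustration, and port 10001 is assumed to be the Ray Client port as in the example above.

```python
from typing import Optional


def resolve_address(head_host: Optional[str] = None, client_port: int = 10001) -> str:
    """Return the address string to pass to ray.init().

    With no host, fall back to "auto" (driver runs on a cluster node).
    With a host, build a Ray Client URL (driver runs on another machine).
    """
    if not head_host:
        return "auto"
    return f"ray://{head_host}:{client_port}"


def connect(head_host: Optional[str] = None) -> None:
    """Hypothetical wrapper: connect, print cluster resources, disconnect."""
    import ray  # imported lazily so resolve_address works without Ray installed

    ray.init(address=resolve_address(head_host))
    try:
        print(ray.cluster_resources())
    finally:
        ray.shutdown()
```

Calling `connect()` on a cluster node and `connect("head-node-ip")` from a laptop would then exercise the two paths with the same code.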

Stop the Cluster #

bash

ray stop

Cluster Configuration File #

Basic Configuration #

yaml

cluster_name: my-ray-cluster

max_workers: 10

provider:
    type: local
    head_ip: 192.168.1.100
    worker_ips:
        - 192.168.1.101
        - 192.168.1.102

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/id_rsa

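# Per-node resources are declared on the `ray start` command line.
head_setup_commands:
    - pip install "ray[default]"

worker_setup_commands:
    - pip install "ray[default]"

# --autoscaling-config lets the head node launch and manage the workers.
head_start_ray_commands:
    - ray start --head --port=6379 --num-cpus=8 --num-gpus=2 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray start --address=$RAY_HEAD_IP:6379 --num-cpus=8 --num-gpus=2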
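
Typos in the `provider` section (a malformed IP, or the head IP repeated among the workers) tend to surface only once `ray up` is already running. As a rough sketch, the provider and auth fields can be generated and sanity-checked in Python before the file is written. `build_local_config` is a hypothetical helper, not part of Ray, and setting `max_workers` to the number of worker IPs is an assumption made for this example.

```python
import ipaddress
import json


def build_local_config(cluster_name, head_ip, worker_ips, ssh_user="ubuntu"):
    """Build a dict mirroring the provider/auth part of the local config."""
    # ipaddress.ip_address raises ValueError on malformed addresses.
    for ip in [head_ip, *worker_ips]:
        ipaddress.ip_address(ip)
    if head_ip in worker_ips:
        raise ValueError("head_ip must not also be listed in worker_ips")
    if len(set(worker_ips)) != len(worker_ips):
        raise ValueError("worker_ips contains duplicates")
    return {
        "cluster_name": cluster_name,
        # Assumption: one worker per listed IP.
        "max_workers": len(worker_ips),
        "provider": {
            "type": "local",
            "head_ip": head_ip,
            "worker_ips": list(worker_ips),
        },
        "auth": {"ssh_user": ssh_user, "ssh_private_key": "~/.ssh/id_rsa"},
    }


config = build_local_config(
    "my-ray-cluster", "192.168.1.100", ["192.168.1.101", "192.168.1.102"]
)
# Printed as JSON for inspection; the setup and start commands are not covered.
print(json.dumps(config, indent=4))
```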

Start the Cluster #

bash

ray up cluster.yaml

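Update the Cluster #

bash

# Re-run `ray up` to apply changes made to cluster.yaml.
ray up cluster.yaml

# Restart the Ray services only, skipping the setup commands.
ray up cluster.yaml --restart-only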

Shut Down the Cluster #

bash

ray down cluster.yaml

Cloud Deployment #

AWS Deployment #

yaml

cluster_name: aws-ray-cluster

max_workers: 10

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a

auth:
    ssh_user: ubuntu

# Node types are declared under available_node_types;
# head_node_type names the entry used for the head node.
available_node_types:
    head:
        node_config:
            InstanceType: m5.xlarge
        resources:
            CPU: 4
    worker:
        node_config:
            InstanceType: m5.xlarge
        min_workers: 2
        max_workers: 10
        resources:
            CPU: 4

head_node_type: head

setup_commands:
    - pip install "ray[default]"

GCP Deployment #

yaml

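cluster_name: gcp-ray-cluster

max_workers: 10

provider:
    type: gcp
    region: us-central1
    availability_zone: us-central1-a
    project_id: my-project

auth:
    ssh_user: ubuntu

# node_config takes Compute Engine instance fields; a real config also
# needs boot disk and image settings, which are omitted here.
available_node_types:
    head:
        node_config:
            machineType: n1-standard-4
        resources:
            CPU: 4
    worker:
        node_config:
            machineType: n1-standard-4
        min_workers: 2
        max_workers: 10
        resources:
            CPU: 4

head_node_type: head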

Azure Deployment #

yaml

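cluster_name: azure-ray-cluster

max_workers: 10

provider:
    type: azure
    location: eastus
    resource_group: my-resource-group

auth:
    ssh_user: azureuser
    # The Azure provider also expects an SSH key pair.
    ssh_private_key: ~/.ssh/id_rsa
    ssh_public_key: ~/.ssh/id_rsa.pub

# VM settings go under node_config.azure_arm_parameters; image
# settings are omitted here.
available_node_types:
    head:
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D4s_v3
        resources:
            CPU: 4
    worker:
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D4s_v3
        min_workers: 2
        max_workers: 10
        resources:
            CPU: 4

head_node_type: head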

Kubernetes Deployment #

Install with Helm #

bash

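helm repo add ray https://ray-project.github.io/kuberay-helm/
# The KubeRay operator must be installed before a RayCluster can be created.
helm install kuberay-operator ray/kuberay-operator
helm install ray-cluster ray/ray-cluster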

KubeRay Configuration #

yaml

# ray.io/v1 needs KubeRay 1.0 or later; older operators use ray.io/v1alpha1.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
    name: ray-cluster
spec:
    rayVersion: '2.9.0'
    headGroupSpec:
        rayStartParams:
            dashboard-host: '0.0.0.0'
        template:
            spec:
                containers:
                    - name: ray-head
                      image: rayproject/ray:2.9.0
                      ports:
                          - containerPort: 6379
                          - containerPort: 8265
                      resources:
                          limits:
                              cpu: 2
                              memory: 4Gi
                          requests:
                              cpu: 2
                              memory: 4Gi
    workerGroupSpecs:
        - replicas: 3
          minReplicas: 1
          maxReplicas: 10
          groupName: worker
          rayStartParams: {}
          template:
              spec:
                  containers:
                      - name: ray-worker
                        image: rayproject/ray:2.9.0
                        resources:
                            limits:
                                cpu: 2
                                memory: 4Gi
                            requests:
                                cpu: 2
                                memory: 4Gi

Connect to the Kubernetes Cluster #

python

import ray

ray.init(address="ray://ray-cluster-head-svc:10001")

ray.shutdown()

Autoscaling #

Configure Autoscaling #

yaml

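cluster_name: autoscaling-cluster

max_workers: 100

# Top-level setting: remove workers that stay idle this long.
idle_timeout_minutes: 5

provider:
    type: aws
    region: us-west-2

# Head node type omitted for brevity.
available_node_types:
    worker:
        node_config:
            InstanceType: m5.xlarge
            # Request Spot capacity instead of On-Demand.
            InstanceMarketOptions:
                MarketType: spot
        min_workers: 0
        max_workers: 100
        resources:
            CPU: 4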

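Autoscaling Policy #

text

┌─────────────────────────────────────────────────────────────┐
│                     Autoscaling Policy                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Scale-up triggers:                                         │
│  ├── Pending tasks or actors increase                       │
│  ├── Resource demand exceeds available capacity             │
│  └── Manual request by the user                             │
│                                                             │
│  Scale-down triggers:                                       │
│  ├── Node is idle: no tasks or actors running               │
│  ├── Idle time exceeds idle_timeout_minutes                 │
│  └── Manual request by the user                             │
│                                                             │
│  Configuration parameters:                                  │
│  ├── min_workers: minimum number of workers                 │
│  ├── max_workers: maximum number of workers                 │
│  ├── idle_timeout_minutes: idle timeout                     │
│  └── upscaling_speed: scale-up rate                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘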
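
To make the interplay of these parameters concrete, here is a toy model of one scaling round. It is a simplified illustration under stated assumptions, not Ray's actual autoscaler algorithm: demand is reduced to a single CPU count, every worker is assumed to offer the same number of CPUs, and `upscaling_speed` is modelled as "add at most `upscaling_speed * current` workers per round". Both function names are hypothetical.

```python
import math


def workers_after_scale_up(pending_cpus, free_cpus, cpus_per_worker,
                           current, max_workers, upscaling_speed=1.0):
    """Toy model: worker count after one scale-up round."""
    # Demand that the cluster cannot serve with its free capacity.
    shortfall = max(0.0, pending_cpus - free_cpus)
    needed = math.ceil(shortfall / cpus_per_worker)
    # Assumption: growth per round is capped relative to the current size
    # (at least one worker, so an empty cluster can still grow).
    per_round_cap = max(1, math.ceil(upscaling_speed * current))
    return min(max_workers, current + min(needed, per_round_cap))


def workers_after_scale_down(idle_minutes, idle_timeout_minutes, min_workers):
    """Toy model: drop workers idle past the timeout, but keep min_workers.

    idle_minutes holds one entry per worker: minutes since it last did work.
    """
    kept = [m for m in idle_minutes if m < idle_timeout_minutes]
    return max(min_workers, len(kept))


# 16 CPUs of pending demand, 2 workers of 4 CPUs each, nothing free.
print(workers_after_scale_up(16, 0, 4, current=2, max_workers=10))
# Four workers, two of them idle for longer than the 5-minute timeout.
print(workers_after_scale_down([0, 7, 12, 1], idle_timeout_minutes=5, min_workers=0))
```

In this model a higher `upscaling_speed` only changes how quickly the target is reached, while `min_workers` and `max_workers` bound the result in every round.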

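Cluster Monitoring #

Dashboard #

By default, the Ray Dashboard is served on port 8265 of the head node: http://head-node:8265. It binds to localhost unless the head node is started with `--dashboard-host=0.0.0.0`.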

text

┌─────────────────────────────────────────────────────────────┐
│                     Dashboard Features                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Overview                                                   │
│  ├── Cluster status summary                                 │
│  ├── Resource usage                                         │
│  └── Task statistics                                        │
│                                                             │
│  Jobs                                                       │
│  ├── Job list                                               │
│  ├── Job details                                            │
│  └── Log viewer                                             │
│                                                             │
│  Actors                                                     │
│  ├── Actor list                                             │
│  ├── Status monitoring                                      │
│  └── Resource usage                                         │
│                                                             │
│  Nodes                                                      │
│  ├── Node list                                              │
│  ├── Health status                                          │
│  └── Resource details                                       │
│                                                             │
│  Metrics                                                    │
│  ├── Performance metrics                                    │
│  ├── Memory usage                                           │
│  └── Custom metrics                                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
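
The Jobs view is also reachable over HTTP on the same port, which is handy for scripts. The sketch below assumes the Jobs REST endpoint lives at `/api/jobs/` on the dashboard port; verify the path against the Ray version in use. `dashboard_url` and `list_jobs` are hypothetical helper names, and only the standard library is used.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def dashboard_url(host: str, path: str = "", port: int = 8265) -> str:
    """Build a URL on the dashboard, e.g. http://head-node:8265/api/jobs/."""
    base = f"http://{host}:{port}/"
    return urljoin(base, path.lstrip("/"))


def list_jobs(host: str, timeout: float = 5.0):
    """Fetch the job list as parsed JSON (endpoint path is an assumption)."""
    with urlopen(dashboard_url(host, "api/jobs/"), timeout=timeout) as resp:
        return json.load(resp)


print(dashboard_url("head-node"))
print(dashboard_url("head-node", "api/jobs/"))
```

`list_jobs("head-node")` needs a reachable dashboard, so it is left uncalled here.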

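Command-Line Monitoring #

bash

# Cluster nodes, resource usage, and autoscaler state
ray status

# Actors grouped by class and state
ray summary actors

# Tasks grouped by function and state
ray summary tasks

# Object store usage and object references
ray memory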

Prometheus Integration #

yaml

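cluster_name: monitored-cluster

# Illustrative only: `monitoring_config` is not a key in the standard Ray
# cluster launcher schema. Ray itself exposes Prometheus metrics when nodes
# are started with `ray start --metrics-export-port=<port>`.
monitoring_config:
    prometheus:
        enabled: true
        port: 9090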

Cluster Operations #

Log Management #

python

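import logging

import ray

# Set the driver's log level and forward worker logs to the driver in a
# single call; ray.init() must not be called twice in one process.
ray.init(logging_level=logging.DEBUG, log_to_driver=True)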

bash

ray logs

ray logs --node-id=<node_id>

Health Check #

python

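import ray

ray.init()

def check_cluster_health():
    """Return True if at least one CPU is currently free."""
    # available_resources() reports unused capacity, so a fully busy
    # cluster also returns False here.
    resources = ray.available_resources()
    return resources.get("CPU", 0) >= 1

print(f"Cluster healthy: {check_cluster_health()}")

ray.shutdown()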
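
A check based on free CPUs cannot tell a busy cluster from a broken one. A slightly sturdier sketch looks at node liveness instead, using records shaped like the dicts returned by `ray.nodes()` (the `Alive`, `Resources`, and `NodeManagerAddress` keys). `cluster_health` and `check_live_cluster` are hypothetical helpers, and the thresholds are arbitrary defaults chosen for illustration.

```python
def cluster_health(nodes, min_alive_nodes=1, min_total_cpus=1.0):
    """Summarize liveness and total CPU capacity from node records."""
    alive = [n for n in nodes if n.get("Alive")]
    # Total (not free) CPUs across alive nodes, so load does not matter.
    total_cpus = sum(n.get("Resources", {}).get("CPU", 0.0) for n in alive)
    dead = [n.get("NodeManagerAddress", "?") for n in nodes if not n.get("Alive")]
    return {
        "healthy": len(alive) >= min_alive_nodes and total_cpus >= min_total_cpus,
        "alive_nodes": len(alive),
        "total_cpus": total_cpus,
        "dead_nodes": dead,
    }


def check_live_cluster():
    """Run the check against a running cluster (requires Ray)."""
    import ray  # imported lazily so cluster_health can be used without Ray

    ray.init(address="auto")
    try:
        return cluster_health(ray.nodes())
    finally:
        ray.shutdown()


# Sample records standing in for ray.nodes() output.
sample = [
    {"Alive": True, "NodeManagerAddress": "192.168.1.100", "Resources": {"CPU": 8.0}},
    {"Alive": False, "NodeManagerAddress": "192.168.1.101", "Resources": {}},
]
print(cluster_health(sample))
```

Keeping the decision logic separate from the Ray call makes it easy to unit-test with recorded node lists.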

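Fault Recovery #

text

┌─────────────────────────────────────────────────────────────┐
│                  Fault Recovery Mechanisms                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Task failure:                                              │
│  ├── Automatic retry (configurable count)                   │
│  ├── Rescheduled onto another node                          │
│  └── Lost dependencies rebuilt automatically                │
│                                                             │
│  Actor failure:                                             │
│  ├── Automatic restart (configurable)                       │
│  ├── State recovery (requires checkpoints)                  │
│  └── References updated automatically                       │
│                                                             │
│  Node failure:                                              │
│  ├── Tasks rescheduled                                      │
│  ├── Actors recreated                                       │
│  └── Objects reconstructed                                  │
│                                                             │
│  Head Node failure:                                         │
│  ├── Requires manual recovery                               │
│  ├── High-availability setup recommended                    │
│  └── Back up GCS data regularly                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘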
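
The "configurable" retry and restart behavior for tasks and actors is set through options on `ray.remote`. The sketch below shows one way to wire them up. The function, the class, and the specific values are illustrative assumptions, and the Ray import is deferred so the plain Python parts run on their own.

```python
# Task options: retry up to 3 times, including on application exceptions.
TASK_OPTIONS = {"max_retries": 3, "retry_exceptions": True}
# Actor options: restart the actor up to 2 times and retry its method calls.
ACTOR_OPTIONS = {"max_restarts": 2, "max_task_retries": 2}


def square(x):
    """Example task body; stands in for work that may fail transiently."""
    return x * x


class Counter:
    """Example actor. A restart re-runs __init__, so `value` resets to 0
    unless the state is checkpointed somewhere external."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


def make_remote():
    """Wrap the task and actor with the fault-tolerance options (requires Ray)."""
    import ray

    remote_task = ray.remote(**TASK_OPTIONS)(square)
    remote_actor_cls = ray.remote(**ACTOR_OPTIONS)(Counter)
    return remote_task, remote_actor_cls
```

With a cluster connected, `remote_task.remote(3)` and `remote_actor_cls.remote()` would then run with these retry and restart limits.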

Best Practices #

1. Plan Resources Carefully #

yaml

# node_config omitted; memory is given in bytes.
available_node_types:
    head:
        resources:
            CPU: 4
            memory: 17179869184   # 16 GiB
    worker:
        resources:
            CPU: 8
            GPU: 1
            memory: 34359738368   # 32 GiB

2. Use Spot Instances #

yaml

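# AWS example: Spot settings live inside node_config.
available_node_types:
    spot-worker:
        node_config:
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: "0.5"   # maximum hourly price in USD
        min_workers: 0
        max_workers: 100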

3. Configure Autoscaling #

yaml

# Both are top-level keys in cluster.yaml.
idle_timeout_minutes: 10
upscaling_speed: 1.0

4. Monitoring and Alerting #

yaml

# Illustrative keys, not part of the standard cluster launcher schema.
monitoring_config:
    prometheus:
        enabled: true
    alerting:
        enabled: true
        slack_webhook: https://hooks.slack.com/...

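Next Steps #

Now that you have covered cluster management, continue with the advanced features section to explore Ray's more advanced capabilities.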