Installation and Configuration #

System Requirements #

Supported Operating Systems #

text

┌─────────────────────────────────────────────────────────────┐
│                    Ray System Support                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Operating systems:                                         │
│  ├── Linux (Ubuntu 18.04+, CentOS 7+)                       │
│  ├── macOS (10.15+)                                         │
│  └── Windows (Windows 10+, WSL2)                            │
│                                                             │
│  Python versions (vary by Ray release):                     │
│  ├── Python 3.8                                             │
│  ├── Python 3.9                                             │
│  ├── Python 3.10                                            │
│  ├── Python 3.11                                            │
│  └── Python 3.12                                            │
│                                                             │
│  Hardware requirements:                                     │
│  ├── Minimum: 2 CPUs, 4 GB RAM                              │
│  ├── Recommended: 4+ CPUs, 16+ GB RAM                       │
│  └── GPU: NVIDIA GPU + CUDA 11.x+                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
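The requirements above can be compared against the current machine before installing anything. The following is a minimal sketch using only the standard library; the version bounds and the names `python_supported` and `preflight` are assumptions made for illustration and should be adjusted to the Ray release you intend to install.

```python
import os
import platform
import sys

# Assumed bounds taken from the support table; adjust them for the
# Ray release you plan to install.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 12)
SUPPORTED_SYSTEMS = {"Linux", "Darwin", "Windows"}  # Darwin is macOS


def python_supported(version=None):
    """Return True if the (major, minor) version is inside the listed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def preflight():
    """Collect basic facts about this machine for comparison with the table."""
    system = platform.system()
    return {
        "system": system,
        "system_supported": system in SUPPORTED_SYSTEMS,
        "python_supported": python_supported(),
        "cpus": os.cpu_count(),
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

The script only reports facts; it does not check memory or GPU availability, which need platform-specific tools.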

Dependencies #

text

Core dependencies:
├── Python 3.8+
├── pip or conda
└── OS-specific libraries

Optional dependencies:
├── CUDA (GPU support)
├── Docker (containerized deployment)
└── Kubernetes (cluster deployment)

Installation Methods #

1. Install with pip (recommended) #

bash

pip install ray

To install a specific version:

bash

pip install ray==2.9.0
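After pinning a version it can be useful to confirm which version actually ended up in the environment. The sketch below relies on the standard-library `importlib.metadata` module; the helper names `installed_version` and `matches_pin` are illustrative and not part of Ray.

```python
from importlib import metadata


def installed_version(package="ray"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(package, expected):
    """Return True only if the package is installed at exactly `expected`."""
    return installed_version(package) == expected


if __name__ == "__main__":
    print("ray version:", installed_version("ray"))
    print("pinned to 2.9.0:", matches_pin("ray", "2.9.0"))
```

Because the lookup reads package metadata rather than importing Ray, it works even when the installation is broken enough that `import ray` fails.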

2. Install with conda #

bash

conda install -c conda-forge ray

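3. Install the AI Libraries #

bash

# Core plus the Dashboard and cluster launcher
pip install "ray[default]"

# Ray Data
pip install "ray[data]"

# Ray Train
pip install "ray[train]"

# Ray Serve
pip install "ray[serve]"

# Ray Tune
pip install "ray[tune]"

# Everything
pip install "ray[all]"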
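Several extras can be combined in one requirement, and a version pin can be added to it. As a small sketch, the hypothetical helper `ray_requirement` below assembles such a string; the set of accepted extras is limited to the ones listed in this section, which is an assumption rather than a complete list.

```python
# Extras named in this section; Ray may define others.
KNOWN_EXTRAS = {"default", "data", "train", "serve", "tune", "all"}


def ray_requirement(extras=(), version=None):
    """Build a pip requirement such as 'ray[data,train]==2.9.0'."""
    unknown = set(extras) - KNOWN_EXTRAS
    if unknown:
        raise ValueError(f"unknown extras: {sorted(unknown)}")
    spec = "ray"
    if extras:
        spec += "[" + ",".join(extras) + "]"
    if version:
        spec += f"=={version}"
    return spec


if __name__ == "__main__":
    # Quote the result when passing it to a shell, because of the brackets.
    print(ray_requirement(["data", "train"], "2.9.0"))
```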

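4. Install from Source #

bash

git clone https://github.com/ray-project/ray.git
# The Python package lives in the python/ subdirectory of the repository.
cd ray/python
# A full source build compiles Ray's C++ core, so the build toolchain (including Bazel) must be installed first.
pip install -e . --verbose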

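GPU Support #

Installing for GPU Use #

bash

# Ray has no separate GPU build; these are the same packages as above.
# GPU workloads also need a CUDA-enabled framework such as PyTorch.
pip install "ray[default]" "ray[data]" "ray[train]" "ray[tune]"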

CUDA Configuration #

bash

export CUDA_VISIBLE_DEVICES=0,1,2,3
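To see how a `CUDA_VISIBLE_DEVICES` value will be interpreted, it can be split into individual device identifiers. This is a minimal sketch with a hypothetical helper, `visible_gpu_ids`; identifiers are kept as strings because the variable may contain GPU UUIDs as well as indices.

```python
import os


def visible_gpu_ids(value=None):
    """Parse a CUDA_VISIBLE_DEVICES string into a list of device identifiers.

    Returns None when the variable is unset (no restriction applied) and an
    empty list when it is set to an empty string (all GPUs hidden).
    """
    if value is None:
        value = os.environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    print(visible_gpu_ids("0,1,2,3"))
    print(visible_gpu_ids())
```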

Verifying GPU Support #

python

import ray

ray.init()

# The output includes a "GPU" entry when Ray has detected at least one GPU.
print(ray.available_resources())

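Specifying GPU Resources #

python

import ray

@ray.remote(num_gpus=1)  # reserve one GPU for each invocation
def train_model():
    import torch  # assumes PyTorch with CUDA support is installed
    assert torch.cuda.is_available()
    return "GPU available"

ray.get(train_model.remote())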

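Environment Configuration #

Basic Configuration #

python

import ray

ray.init(
    num_cpus=8,   # logical CPUs Ray may schedule on
    num_gpus=2,   # logical GPUs Ray may schedule on
    # ray.init() rejects a plain `memory` keyword; the underscore-prefixed
    # form is accepted but is not a stable public option.
    _memory=16 * 1024 * 1024 * 1024,             # 16 GiB
    object_store_memory=4 * 1024 * 1024 * 1024,  # 4 GiB
)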
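Sizes passed to `ray.init()` are given in bytes, which makes long multiplication chains easy to mistype. A small sketch of one way to keep them readable follows; `gib` and `init_kwargs` are illustrative names, not part of Ray.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def gib(n):
    """Convert a size in GiB to whole bytes."""
    return int(n * GIB)


# Hypothetical settings; pass them to Ray as ray.init(**init_kwargs).
init_kwargs = {
    "num_cpus": 8,
    "num_gpus": 2,
    "object_store_memory": gib(4),
}

if __name__ == "__main__":
    print(init_kwargs)
```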

Configuration Parameters #

text

┌─────────────────────────────────────────────────────────────┐
│                Ray Initialization Parameters                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  num_cpus                                                   │
│  ├── Number of CPUs available to Ray                        │
│  └── Default: number of CPUs on the machine                 │
│                                                             │
│  num_gpus                                                   │
│  ├── Number of GPUs available to Ray                        │
│  └── Default: number of GPUs detected                       │
│                                                             │
│  _memory (private, may change)                              │
│  ├── Available memory (bytes)                               │
│  └── Default: available system memory                       │
│                                                             │
│  object_store_memory                                        │
│  ├── Object store memory (bytes)                            │
│  └── Default: 30% of available memory                       │
│                                                             │
│  include_dashboard                                          │
│  ├── Whether to start the Dashboard                         │
│  └── Default: None (start if installed)                     │
│                                                             │
│  dashboard_host                                             │
│  ├── Dashboard listen address                               │
│  └── Default: 127.0.0.1                                     │
│                                                             │
│  dashboard_port                                             │
│  ├── Dashboard port                                         │
│  └── Default: 8265                                          │
│                                                             │
│  logging_level                                              │
│  ├── Log level                                              │
│  └── Default: INFO                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Environment Variables #

bash

export RAY_OBJECT_STORE_MEMORY=4000000000
export RAY_BACKEND_LOG_LEVEL=debug
export RAY_DEDUP_LOGS=0

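Commonly used environment variables (availability varies by Ray version):

Variable	Description
RAY_OBJECT_STORE_MEMORY	Object store memory
RAY_BACKEND_LOG_LEVEL	Backend log level
RAY_DEDUP_LOGS	Whether to deduplicate logs
RAY_DISABLE_MEMORY_MONITOR	Disable the memory monitor
RAY_ENABLE_RECORD_ACTOR_REF	Enable actor reference recording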
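Environment variables have to be in place before Ray starts its processes. When Ray is launched from Python, one approach is to build the environment explicitly and hand it to the process being started. The helper `ray_env` below is a hypothetical sketch; the variable names are copied from the table above.

```python
import os


def ray_env(overrides, base=None):
    """Return a copy of the environment with Ray-related variables applied.

    Values are converted to strings because environment variables are text.
    """
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


settings = {"RAY_BACKEND_LOG_LEVEL": "debug", "RAY_DEDUP_LOGS": 0}
env = ray_env(settings, base={})
print(env)
```

With `base=None` the result extends the current environment and can be passed as the `env` argument of `subprocess.run`.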

Cluster Configuration #

Local Cluster #

bash

ray start --head --port=6379

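Connecting to a Cluster #

python

import ray

# On a cluster node: attach to the Ray instance already running there.
ray.init(address="auto")

# Alternatively, connect from a remote machine through Ray Client.
# Use one call or the other; a second ray.init() in the same process raises an error.
# ray.init(address="ray://head-node:10001")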
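The address forms used here differ in what they mean: `auto` attaches to a local running instance, while a `ray://` URL goes through Ray Client. The sketch below classifies an address string with the standard library only; `describe_address` and the mode labels are assumptions for illustration, not Ray APIs.

```python
from urllib.parse import urlparse


def describe_address(address):
    """Classify an address string of the kinds shown in this section."""
    if address == "auto":
        return {"mode": "auto"}
    if address.startswith("ray://"):
        parsed = urlparse(address)
        return {"mode": "client", "host": parsed.hostname, "port": parsed.port}
    host, sep, port = address.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return {"mode": "direct", "host": host, "port": int(port)}


if __name__ == "__main__":
    for value in ("auto", "ray://head-node:10001", "10.0.0.5:6379"):
        print(value, "->", describe_address(value))
```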

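Cluster Configuration File #

yaml

cluster_name: my-cluster

max_workers: 10

# The launcher requires a provider section. AWS and the instance types below
# are assumed examples; substitute your own provider settings.
provider:
    type: aws
    region: us-west-2

available_node_types:
    head:
        resources: {}
        node_config:
            InstanceType: m5.large
    worker:
        min_workers: 2
        max_workers: 10
        resources: {}
        node_config:
            InstanceType: m5.large

head_node_type: head

head_setup_commands:
    - pip install "ray[default]"

worker_setup_commands:
    - pip install "ray[default]"

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379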
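The worker limits in a cluster file are easy to make inconsistent, for example a per-type minimum above the cluster-wide maximum. A rough sanity check can catch this before launching. The helper below is hypothetical and not part of Ray; it works on plain `(name, min_workers, max_workers)` tuples rather than parsing YAML, and its rules are reasonable assumptions, not Ray's own validation.

```python
def check_worker_bounds(global_max, node_types):
    """Return a list of problems found in (name, min_workers, max_workers) entries."""
    problems = []
    for name, min_workers, max_workers in node_types:
        if min_workers < 0:
            problems.append(f"{name}: min_workers is negative")
        if min_workers > max_workers:
            problems.append(f"{name}: min_workers exceeds max_workers")
        if max_workers > global_max:
            problems.append(f"{name}: max_workers exceeds the cluster-wide max_workers")
    if sum(entry[1] for entry in node_types) > global_max:
        problems.append("sum of min_workers exceeds the cluster-wide max_workers")
    return problems


if __name__ == "__main__":
    # Values taken from the configuration in this section.
    print(check_worker_bounds(10, [("worker", 2, 10)]))
```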

Starting the Cluster #

bash

ray up cluster.yaml

Stopping the Cluster #

bash

ray down cluster.yaml

Docker Deployment #

Using the Official Image #

bash

docker pull rayproject/ray:latest

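# --block keeps ray start in the foreground so the container stays running;
# --dashboard-host=0.0.0.0 makes the Dashboard reachable through the published port.
docker run -d --name ray-head \
    --shm-size=4g \
    -p 6379:6379 \
    -p 8265:8265 \
    rayproject/ray:latest \
    ray start --head --port=6379 --dashboard-host=0.0.0.0 --block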

Docker Compose Configuration #

yaml

version: '3.8'

services:
  ray-head:
    image: rayproject/ray:latest
    command: ray start --head --port=6379 --dashboard-host=0.0.0.0 --block
    ports:
      - "6379:6379"
      - "8265:8265"
    shm_size: '4gb'
    
  ray-worker:
    image: rayproject/ray:latest
    command: ray start --address=ray-head:6379 --block
    depends_on:
      - ray-head
    shm_size: '4gb'
    deploy:
      replicas: 2

Verifying the Installation #

Basic Verification #

python

import ray

ray.init()

@ray.remote
def hello():
    return "Hello, Ray!"

print(ray.get(hello.remote()))

ray.shutdown()

Resource Verification #

python

import ray

ray.init()

print("Available resources:")
print(ray.available_resources())

print("\nCluster resources:")
print(ray.cluster_resources())

ray.shutdown()
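The dictionaries printed above report memory as raw numbers, which are hard to read at a glance. The following sketch formats such a dictionary, assuming the `memory` and `object_store_memory` entries are expressed in bytes; `summarize` and the sample values are illustrative only.

```python
GIB = 1024 ** 3
BYTE_KEYS = {"memory", "object_store_memory"}  # assumed to be reported in bytes


def summarize(resources):
    """Return readable 'name: value' lines for a Ray resources dictionary."""
    lines = []
    for name in sorted(resources):
        value = resources[name]
        if name in BYTE_KEYS:
            lines.append(f"{name}: {value / GIB:.2f} GiB")
        else:
            lines.append(f"{name}: {value:g}")
    return lines


# Sample values for illustration only; real output depends on the machine.
sample = {"CPU": 8.0, "GPU": 2.0, "memory": 16 * GIB, "object_store_memory": 4 * GIB}
print("\n".join(summarize(sample)))
```

In a live session the same function can be applied to the result of `ray.available_resources()` or `ray.cluster_resources()`.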

GPU Verification #

python

import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.5)
def check_gpu():
    import torch
    return torch.cuda.is_available()

print(f"GPU available: {ray.get(check_gpu.remote())}")

ray.shutdown()

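Common Issues #

1. Out of Memory #

text

Error: ObjectStoreFullError

Solutions:
├── Increase object_store_memory
├── Reduce the number of objects held in the store
├── Store large objects once with ray.put() and pass the reference
└── Release references to objects that are no longer needed

python

import ray

ray.init(object_store_memory=8 * 1024 * 1024 * 1024)  # 8 GiB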

2. Port Conflict #

text

Error: Port already in use

Solutions:
├── Change the port number
├── Stop the process that occupies the port
└── Run ray stop to clean up leftover Ray processes

bash

ray stop
ray start --head --port=6380 --dashboard-port=8266
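Before choosing a replacement port, it helps to confirm that the candidate is actually free. The sketch below tests this by attempting to bind a TCP socket on the local machine; `is_port_free` and `first_free_port` are hypothetical helpers, and the result can change between the check and the moment Ray starts.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start, attempts=20):
    """Return the first free port at or after `start`, or None if none is found."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    return None


if __name__ == "__main__":
    print("6379 free:", is_port_free(6379))
    print("next free port from 6379:", first_free_port(6379))
```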

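3. GPU Detection Failure #

text

Error: No GPUs detected

Solutions:
├── Check the CUDA installation
├── Set CUDA_VISIBLE_DEVICES
├── Specify num_gpus manually
└── Check the driver version

python

import ray

# Overrides autodetection. This sets only the logical GPU count that Ray
# schedules against; it does not repair a missing driver or CUDA install.
ray.init(num_gpus=2)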
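Two of the checks above can be gathered quickly from Python before digging further. This is a minimal sketch with a hypothetical helper, `gpu_diagnostics`; it only looks for the `nvidia-smi` utility on the search path and reads `CUDA_VISIBLE_DEVICES`, and does not query the driver itself.

```python
import os
import shutil


def gpu_diagnostics():
    """Gather quick facts that help explain why no GPU was detected."""
    return {
        # Path to the NVIDIA driver utility, or None if it is not on PATH.
        "nvidia_smi": shutil.which("nvidia-smi"),
        # None means the variable is unset; "" hides every GPU from CUDA.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value!r}")
```

A missing `nvidia-smi` usually points at the driver installation, while an empty `CUDA_VISIBLE_DEVICES` points at the environment.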

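4. Connection Timeout #

text

Error: Connection timeout

Solutions:
├── Check network connectivity
├── Check firewall settings
├── Increase the timeout
└── Check the cluster status

python

import ray

# If the head node was started with a password, pass the same value here.
ray.init(address="auto", _redis_password="your_password")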
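A plain TCP probe separates network and firewall problems from Ray-level ones: if the head node's port cannot be reached at all, no Ray setting will help. The sketch below uses only the standard library; `can_reach` is a hypothetical helper, and the host name in the usage lines is a placeholder.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # "head-node" is a placeholder; replace it with the real head node address.
    print("GCS port reachable:", can_reach("head-node", 6379))
    print("Ray Client port reachable:", can_reach("head-node", 10001))
```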

Next Steps #

With installation complete, continue to the Quick Start guide and write your first Ray application!