Notebooks #
Overview #
Kubeflow Notebooks runs interactive development environments on Kubernetes and supports several IDEs, including JupyterLab, VS Code Server, and RStudio.
Core features #
text
Notebooks core features

Multi-IDE support:
├── JupyterLab (default)
├── VS Code Server
├── RStudio
└── Custom IDE

Resource management:
├── CPU/memory configuration
├── GPU support
├── Persistent storage
└── Resource limits

Multi-user support:
├── Namespace isolation
├── Independent workspaces
├── Access control
└── Resource quotas

Customization:
├── Custom images
├── Preinstalled dependencies
├── Environment variable configuration
└── Startup scripts
Notebook architecture #
Architecture diagram #
text
Notebook architecture

User access layer
┌──────────────────────────────────────────────────┐
│ Browser → Istio Gateway → Notebook Service       │
└──────────────────────────────────────────────────┘
                          │
                          ▼
Notebook Controller
┌──────────────────────────────────────────────────┐
│ Notebook Controller → manages Notebook CRD       │
└──────────────────────────────────────────────────┘
                          │
                          ▼
Notebook Pod
┌──────────────────────────────────────────────────┐
│ ┌───────────────┐   ┌───────────────────┐        │
│ │ IDE container │   │ Sidecar container │        │
│ │ (Jupyter)     │   │ (proxy)           │        │
│ └───────────────┘   └───────────────────┘        │
└──────────────────────────────────────────────────┘
                          │
                          ▼
Storage layer
┌──────────────────────────────────────────────────┐
│ PVC (persistent storage)                         │
└──────────────────────────────────────────────────┘
Notebook CRD #
yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook
  namespace: kubeflow-user-example-com
  labels:
    app: my-notebook
spec:
  template:
    spec:
      serviceAccountName: default-editor
      containers:
        - name: notebook
          image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow:v1.8.0
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          volumeMounts:
            - name: data
              mountPath: /home/jovyan
          env:
            - name: NOTEBOOK_ARGS
              value: "--NotebookApp.default_url=/lab"
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: my-notebook-pvc
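When several similar Notebooks are needed, the manifest above can be generated rather than copied by hand. The following is a minimal sketch using only the Python standard library; `build_notebook` is a hypothetical helper written for this page, not part of any Kubeflow SDK, and it only covers the fields shown in the example.

```python
import json


def build_notebook(name, namespace, image, pvc_name, cpu="2", memory="4Gi"):
    """Return a Notebook custom resource as a plain dict (hypothetical helper)."""
    container = {
        "name": "notebook",
        "image": image,
        "resources": {"requests": {"cpu": cpu, "memory": memory}},
        "volumeMounts": [{"name": "data", "mountPath": "/home/jovyan"}],
    }
    volume = {"name": "data", "persistentVolumeClaim": {"claimName": pvc_name}}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Notebook",
        "metadata": {"name": name, "namespace": namespace, "labels": {"app": name}},
        "spec": {
            "template": {
                "spec": {
                    "serviceAccountName": "default-editor",
                    "containers": [container],
                    "volumes": [volume],
                }
            }
        },
    }


manifest = build_notebook(
    name="my-notebook",
    namespace="kubeflow-user-example-com",
    image="public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow:v1.8.0",
    pvc_name="my-notebook-pvc",
)
# Print the manifest as JSON, which kubectl accepts in place of YAML.
print(json.dumps(manifest, indent=2))
```

Because kubectl reads JSON as well as YAML, the output can be piped straight into `kubectl apply -f -`.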
Creating a Notebook #
Create from the Dashboard #
text
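1. Log in to the Kubeflow Dashboard
2. Click "Notebooks" in the left navigation bar
3. Click the "New Notebook" button
4. Configure the Notebook:
   ├── Name: enter a name
   ├── Namespace: select a namespace
   ├── Image: select an image type
   │   ├── TensorFlow
   │   ├── PyTorch
   │   ├── Base Python
   │   └── Custom
   ├── CPU: set the number of CPU cores
   ├── Memory: set the memory size
   ├── GPU: set the number of GPUs
   ├── Workspace Volume: configure storage
   │   ├── Size: storage size
   │   ├── Access Mode: access mode
   │   └── Storage Class: storage class
   └── Affinity/Tolerations: scheduling configuration
5. Click "LAUNCH" to create the Notebook
6. Wait for the status to change to "Running"
7. Click "CONNECT" to open it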
Create from YAML #
yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: ml-notebook
  namespace: kubeflow-user-example-com
  labels:
    app: ml-notebook
spec:
  template:
    spec:
      serviceAccountName: default-editor
      containers:
        - name: notebook
          image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow:v1.8.0
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: workspace
              mountPath: /home/jovyan
            - name: data
              mountPath: /data
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: workspace-pvc
        - name: data
          persistentVolumeClaim:
            claimName: data-pvc
bash
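# Apply the manifest (the PVCs it references, workspace-pvc and data-pvc, must already exist)
kubectl apply -f notebook.yaml
# Check the Notebook status
kubectl get notebooks -n kubeflow-user-example-com
# Check the Pod status
kubectl get pods -n kubeflow-user-example-com -l app=ml-notebook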
Choosing an image #
Official images #
text
TensorFlow images:
├── jupyter-tensorflow      - TensorFlow + JupyterLab
├── jupyter-tensorflow-cuda - TensorFlow + GPU support
└── jupyter-tensorflow-full - full TensorFlow environment
PyTorch images:
├── jupyter-pytorch      - PyTorch + JupyterLab
├── jupyter-pytorch-cuda - PyTorch + GPU support
└── jupyter-pytorch-full - full PyTorch environment
Base images:
├── jupyter-scipy       - scientific computing environment
├── jupyter-datascience - data science environment
└── jupyter-minimal     - minimal environment
Image addresses #
yaml
# Registry paths and tags differ between Kubeflow releases; verify them against your release
# TensorFlow image
image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-tensorflow:v1.8.0
# PyTorch image
image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/jupyter-pytorch:v1.8.0
# VS Code Server image
image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/codeserver-python:v1.8.0
# RStudio image
image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/rstudio-tidyverse:v1.8.0
Custom images #
dockerfile
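# Dockerfile
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies
RUN pip install --no-cache-dir \
    jupyterlab \
    numpy \
    pandas \
    scikit-learn \
    tensorflow \
    torch
# Kubeflow expects the server to run as user jovyan (UID 1000) with home /home/jovyan
RUN useradd --create-home --uid 1000 --shell /bin/bash jovyan
USER jovyan
# Set the working directory
WORKDIR /home/jovyan
EXPOSE 8888
# Start JupyterLab under the URL prefix that Kubeflow passes in NB_PREFIX
# (run through sh so the variable is expanded; authentication is left to Kubeflow)
CMD ["sh", "-c", "jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --ServerApp.base_url=${NB_PREFIX} --ServerApp.token='' --ServerApp.password='' --ServerApp.allow_origin='*'"]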
bash
# Build the image (tag it with the registry name so it can be pushed)
docker build -t my-registry/my-notebook:latest .
# Push it to the image registry
docker push my-registry/my-notebook:latest
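A custom image has to serve its IDE under the URL path that Kubeflow routes to the Notebook, which Kubeflow passes to the container in the `NB_PREFIX` environment variable. The sketch below shows how a startup script might normalise that value before handing it to the server. The function name is hypothetical, and the `/notebook/<namespace>/<name>` shape of the prefix is an assumption that should be checked against your Kubeflow release.

```python
import os


def notebook_base_url(environ=None):
    """Return the URL prefix the IDE should listen under (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("NB_PREFIX", "/")
    # Normalise to exactly one leading slash and no trailing slash;
    # an empty or missing prefix collapses to the root path "/".
    return "/" + prefix.strip("/")


# Assumed example value; the real one is injected by Kubeflow at runtime.
example_env = {"NB_PREFIX": "/notebook/kubeflow-user-example-com/my-notebook/"}
print(notebook_base_url(example_env))
print(notebook_base_url({}))
```

Falling back to `/` keeps the same image usable outside Kubeflow, for example when testing it locally with `docker run`.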
Resource configuration #
CPU and memory #
yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: cpu-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: notebook
          # Placeholder; a real Notebook needs an image that serves an IDE on port 8888
          image: python:3.9
          resources:
            requests:
              cpu: "2"        # request 2 CPU cores
              memory: "4Gi"   # request 4 GiB of memory
            limits:
              cpu: "4"        # at most 4 CPU cores
              memory: "8Gi"   # at most 8 GiB of memory
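Kubernetes rejects a container whose request exceeds its limit, so it can be worth checking a `resources` block before applying it. The following is a rough sketch of such a check; `parse_quantity` and `check_resources` are hypothetical helpers that handle only the common quantity suffixes (`m`, `k`/`M`/`G`/`T`, `Ki`/`Mi`/`Gi`/`Ti`), not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Multipliers for the most common Kubernetes quantity suffixes.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal(1) / 1000,  # milli, as in "500m" CPU
}


def parse_quantity(text):
    """Convert a quantity such as "4Gi", "500m" or "2" to a Decimal."""
    # Try two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def check_resources(resources):
    """Return the names of resources whose request exceeds the limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return [
        name
        for name, value in requests.items()
        if name in limits and parse_quantity(value) > parse_quantity(limits[name])
    ]


good = {"requests": {"cpu": "2", "memory": "4Gi"}, "limits": {"cpu": "4", "memory": "8Gi"}}
bad = {"requests": {"cpu": "4500m", "memory": "4Gi"}, "limits": {"cpu": "4", "memory": "8Gi"}}
print(check_resources(good))
print(check_resources(bad))
```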
GPU configuration #
yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: gpu-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: notebook
          # Placeholder; a real Notebook needs an image that serves an IDE on port 8888
          image: tensorflow/tensorflow:latest-gpu
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"   # request 1 GPU (must equal the limit)
            limits:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: "compute,utility"
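GPUs are an extended resource in Kubernetes and follow stricter rules than CPU and memory: a GPU limit may be given on its own, a GPU request requires a limit, and when both are present they must be equal. A small sketch of that rule follows; `gpu_errors` is a hypothetical helper, and it assumes whole-number GPU counts as in the manifest above.

```python
def gpu_errors(resources, key="nvidia.com/gpu"):
    """Return a list of problems with the GPU entries of a resources dict."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if key in requests and key not in limits:
        return [f"{key}: a request needs a matching limit"]
    if key in requests and int(requests[key]) != int(limits[key]):
        return [f"{key}: request and limit must be equal"]
    return []


single = {"requests": {"nvidia.com/gpu": "1"}, "limits": {"nvidia.com/gpu": "1"}}
limit_only = {"limits": {"nvidia.com/gpu": "4"}}
mismatch = {"requests": {"nvidia.com/gpu": "1"}, "limits": {"nvidia.com/gpu": "2"}}

for label, resources in [("single", single), ("limit_only", limit_only), ("mismatch", mismatch)]:
    print(label, gpu_errors(resources) or "ok")
```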
Multi-GPU configuration #
yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: multi-gpu-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: notebook
          # Placeholder; a real Notebook needs an image that serves an IDE on port 8888
          image: pytorch/pytorch:latest
          resources:
            limits:
              nvidia.com/gpu: "4"   # 4 GPUs (the request defaults to the limit)
Storage configuration #
Persistent storage #
yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: notebook-pvc
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard
---
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: storage-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: notebook
          # Placeholder; a real Notebook needs an image that serves an IDE on port 8888
          image: python:3.9
          volumeMounts:
            - name: workspace
              mountPath: /home/jovyan
            - name: datasets
              mountPath: /datasets
              readOnly: true
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: notebook-pvc
        - name: datasets
          persistentVolumeClaim:
            # An existing PVC that is not created by this manifest
            claimName: shared-datasets-pvc
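Every entry under `volumeMounts` must name a volume declared under `volumes`, and a typo in either place only shows up when the Notebook is applied. A minimal sketch of a pre-flight check is shown below; `unmatched_mounts` is a hypothetical helper that takes the pod spec (the inner `spec.template.spec`) as a plain dict.

```python
def unmatched_mounts(pod_spec):
    """Return "container:mount" pairs whose mount has no volume of the same name."""
    declared = {volume["name"] for volume in pod_spec.get("volumes", [])}
    missing = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in declared:
                missing.append(f'{container["name"]}:{mount["name"]}')
    return missing


pod_spec = {
    "containers": [
        {
            "name": "notebook",
            "volumeMounts": [
                {"name": "workspace", "mountPath": "/home/jovyan"},
                {"name": "datasets", "mountPath": "/datasets", "readOnly": True},
            ],
        }
    ],
    "volumes": [
        {"name": "workspace", "persistentVolumeClaim": {"claimName": "notebook-pvc"}},
        # The "datasets" volume is left out on purpose to trigger the check.
    ],
}
print(unmatched_mounts(pod_spec))
```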
Using ConfigMaps and Secrets #
yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: notebook-config
  namespace: kubeflow-user-example-com
data:
  config.yaml: |
    database:
      host: mysql-service
      port: 3306
---
apiVersion: v1
kind: Secret
metadata:
  name: notebook-secrets
  namespace: kubeflow-user-example-com
type: Opaque
stringData:
  db-password: "my-secret-password"
  api-key: "my-api-key"
---
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: config-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: notebook
          # Placeholder; a real Notebook needs an image that serves an IDE on port 8888
          image: python:3.9
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
            - name: secrets
              mountPath: /etc/secrets
              readOnly: true
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: notebook-secrets
                  key: db-password
      volumes:
        - name: config
          configMap:
            name: notebook-config
        - name: secrets
          secret:
            secretName: notebook-secrets
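Inside the Notebook, the values arrive in two ways: `DB_PASSWORD` as an environment variable, and each Secret key as a file under `/etc/secrets`. The sketch below shows one way notebook code might read them; `read_setting` is a hypothetical helper, and `API_KEY` is an assumed variable name that the manifest above does not define, so that lookup falls through to the mounted file.

```python
import os
from pathlib import Path


def read_setting(env_name, file_path, environ=None):
    """Prefer the environment variable, then fall back to the mounted file."""
    environ = os.environ if environ is None else environ
    if env_name in environ:
        return environ[env_name]
    path = Path(file_path)
    if path.is_file():
        # Mounted Secret files hold the raw value; strip a trailing newline if any.
        return path.read_text().strip()
    return None


db_password = read_setting("DB_PASSWORD", "/etc/secrets/db-password")
api_key = read_setting("API_KEY", "/etc/secrets/api-key")
# Report only whether the values were found, never the values themselves.
print("db password available:", db_password is not None)
print("api key available:", api_key is not None)
```

Reading the mounted file has the advantage that a rotated Secret is eventually picked up without restarting the Pod, whereas an environment variable is fixed at container start.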
Network configuration #
Service configuration #
yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: network-notebook
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: notebook
          # Placeholder; a real Notebook needs an image that serves an IDE on port 8888
          image: python:3.9
          ports:
            - containerPort: 8888
              name: notebook
              protocol: TCP
            - containerPort: 6006
              name: tensorboard
              protocol: TCP
Network policy #
yaml
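apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: notebook-network-policy
  namespace: kubeflow-user-example-com
spec:
  podSelector:
    matchLabels:
      app: my-notebook
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Allow traffic from the istio-system namespace, matched by the label
    # that Kubernetes sets on every namespace automatically
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system
      ports:
        - port: 8888
          protocol: TCP
  egress:
    # Only DNS over UDP is allowed out; add rules for anything else the Notebook
    # must reach (package indexes, object storage, the Istio control plane, ...)
    - to:
        - namespaceSelector: {}
      ports:
        - port: 53
          protocol: UDP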
VS Code Server #
Create a VS Code Notebook #
yaml
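apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: vscode-notebook
  namespace: kubeflow-user-example-com
  annotations:
    # code-server serves from "/", so the Notebook URL prefix has to be rewritten;
    # the dashboard sets this annotation itself, verify it against your release
    notebooks.kubeflow.org/http-rewrite-uri: "/"
spec:
  template:
    spec:
      containers:
        - name: vscode
          image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/codeserver-python:v1.8.0
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          volumeMounts:
            - name: workspace
              # Kubeflow images run as user jovyan, whose home is /home/jovyan
              mountPath: /home/jovyan
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: vscode-pvc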
VS Code configuration #
yaml
# Custom VS Code settings
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: vscode-custom
  namespace: kubeflow-user-example-com
spec:
  template:
    spec:
      containers:
        - name: vscode
          image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/codeserver-python:v1.8.0
          env:
            - name: VSCODE_ARGS
              value: "--disable-telemetry"
          volumeMounts:
            - name: vscode-settings
              # Kubeflow images run as user jovyan, whose home is /home/jovyan
              mountPath: /home/jovyan/.local/share/code-server/User
      volumes:
        - name: vscode-settings
          # A ConfigMap holding settings.json; it is mounted read-only
          configMap:
            name: vscode-settings
RStudio #
Create an RStudio Notebook #
yaml
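apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: rstudio-notebook
  namespace: kubeflow-user-example-com
  annotations:
    # RStudio serves from "/", so the URL prefix is rewritten and passed in a header;
    # the dashboard sets these annotations itself, verify them against your release
    notebooks.kubeflow.org/http-rewrite-uri: "/"
    notebooks.kubeflow.org/http-headers-request-set: '{"X-RStudio-Root-Path":"/notebook/kubeflow-user-example-com/rstudio-notebook/"}'
spec:
  template:
    spec:
      containers:
        - name: rstudio
          image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/rstudio-tidyverse:v1.8.0
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          volumeMounts:
            - name: workspace
              # Kubeflow images run as user jovyan, whose home is /home/jovyan
              mountPath: /home/jovyan
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: rstudio-pvc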
Managing Notebooks #
Checking status #
bash
# List all Notebooks
kubectl get notebooks -n kubeflow-user-example-com
# Show the details of a Notebook
kubectl describe notebook my-notebook -n kubeflow-user-example-com
# Check the Pod status
kubectl get pods -n kubeflow-user-example-com -l app=my-notebook
# View the logs
kubectl logs -n kubeflow-user-example-com -l app=my-notebook -c notebook
Stopping and starting #
bash
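# Stop the Notebook: the controller scales it to zero while this annotation is set
kubectl annotate notebook my-notebook -n kubeflow-user-example-com \
  kubeflow-resource-stopped="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Start it again by removing the annotation
kubectl annotate notebook my-notebook -n kubeflow-user-example-com kubeflow-resource-stopped-
# Restart the Notebook: delete the Pod and it is recreated automatically
kubectl delete pod -n kubeflow-user-example-com -l app=my-notebook
# Delete the Notebook entirely
kubectl delete notebook my-notebook -n kubeflow-user-example-com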
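Deleting the Pod only restarts a Notebook, because the Pod is recreated. To keep a Notebook stopped, the Notebook controller looks for the `kubeflow-resource-stopped` annotation on the Notebook resource. The sketch below builds the JSON merge patches that set and clear it; the helper names are hypothetical, and the timestamp format is an assumption modelled on what the dashboard writes.

```python
import json
from datetime import datetime, timezone

STOP_ANNOTATION = "kubeflow-resource-stopped"


def stop_patch(now=None):
    """Merge patch that marks the Notebook as stopped."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"metadata": {"annotations": {STOP_ANNOTATION: stamp}}}


def start_patch():
    """Merge patch that removes the annotation so the Notebook starts again."""
    # In a JSON merge patch, a null value deletes the key.
    return {"metadata": {"annotations": {STOP_ANNOTATION: None}}}


print(json.dumps(stop_patch()))
print(json.dumps(start_patch()))
```

Either JSON document can be applied with `kubectl patch notebook my-notebook -n kubeflow-user-example-com --type merge -p '<json>'`.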
Changing resources #
bash
# Edit the Notebook configuration
kubectl edit notebook my-notebook -n kubeflow-user-example-com
# After the resource settings are changed, the Pod is recreated automatically
Best practices #
Resource management #
text
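1. Size resources sensibly
   ├── Set CPU/memory according to the workload
   ├── Stop Notebooks when they are not in use
   └── Set resource limits to prevent resource exhaustion
2. Storage management
   ├── Use persistent storage to keep data
   ├── Regularly clean up files you no longer need
   └── Choose an appropriate storage size
3. Image management
   ├── Use prebuilt images
   ├── Or build custom images with dependencies preinstalled
   └── Update image versions regularly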
Security practices #
text
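1. Access control
   ├── Isolate users with namespaces
   ├── Configure RBAC permissions
   └── Do not share accounts
2. Data security
   ├── Store sensitive information in Secrets
   ├── Do not hard-code keys in code
   └── Rotate passwords regularly
3. Network security
   ├── Configure network policies
   ├── Restrict external access
   └── Use HTTPS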
Development efficiency #
text
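1. Environment setup
   ├── Preinstall common dependencies
   ├── Configure environment variables
   └── Use startup scripts
2. Code management
   ├── Use Git for version control
   ├── Commit code regularly
   └── Use branches
3. Collaboration
   ├── Share datasets
   ├── Use shared storage
   └── Record experiment results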
Troubleshooting #
Common problems #
bash
# The Notebook fails to start
kubectl describe notebook my-notebook -n kubeflow-user-example-com
kubectl get events -n kubeflow-user-example-com --sort-by='.lastTimestamp'
# Insufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Image pull failures
kubectl describe pod -n kubeflow-user-example-com -l app=my-notebook
# Storage problems
kubectl get pvc -n kubeflow-user-example-com
kubectl describe pvc my-notebook-pvc -n kubeflow-user-example-com
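The commands above surface the same few failure modes again and again, and they can also be read directly from the Pod status. The sketch below classifies a Pod from the JSON printed by `kubectl get pod my-notebook-0 -n kubeflow-user-example-com -o json`. `diagnose` is a hypothetical helper covering only the cases listed here, and the sample status is hand-written for illustration, not captured from a cluster.

```python
def diagnose(pod):
    """Return a short hint for a pod dict shaped like `kubectl get pod -o json`."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason")
        last_reason = cs.get("lastState", {}).get("terminated", {}).get("reason")
        if reason in ("ImagePullBackOff", "ErrImagePull"):
            return f"{cs['name']}: image pull failed, check the image name and registry access"
        if last_reason == "OOMKilled":
            return f"{cs['name']}: killed for exceeding its memory limit, raise limits.memory"
        if reason == "CrashLoopBackOff":
            return f"{cs['name']}: container keeps crashing, check its logs"
    for condition in status.get("conditions", []):
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            return "unschedulable: " + condition.get("message", "no message")
    return "no known problem detected"


# Hand-written sample status for illustration only.
sample = {
    "status": {
        "containerStatuses": [
            {"name": "notebook", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ]
    }
}
print(diagnose(sample))
```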
Viewing logs #
bash
# View the Notebook container logs (the Pod is named <notebook-name>-0)
kubectl logs -n kubeflow-user-example-com my-notebook-0 -c notebook
# View the sidecar container logs
kubectl logs -n kubeflow-user-example-com my-notebook-0 -c istio-proxy
# Follow the logs in real time
kubectl logs -f -n kubeflow-user-example-com my-notebook-0 -c notebook
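Next steps #
Now that you know how to work with Notebooks, continue with Katib hyperparameter tuning to learn how to optimize model parameters automatically.
Last updated: 2026-04-05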