Core Concepts

1. Data Model Overview

text

Milvus data model hierarchy:

┌─────────────────────────────────────────┐
│                Database                 │
├─────────────────────────────────────────┤
│  ┌───────────────────────────────────┐  │
│  │            Collection             │  │
│  │  ┌─────────────────────────────┐  │  │
│  │  │          Partition          │  │  │
│  │  │  ┌───────────────────────┐  │  │  │
│  │  │  │        Segment        │  │  │  │
│  │  │  │  ┌─────────────────┐  │  │  │  │
│  │  │  │  │     Entity      │  │  │  │  │
│  │  │  │  └─────────────────┘  │  │  │  │
│  │  │  └───────────────────────┘  │  │  │
│  │  └─────────────────────────────┘  │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
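The nesting above can be mirrored with a few plain Python dataclasses. This is only a toy model meant to make the containment relationships concrete: the `Toy*` classes and the `count_entities` helper are illustrative names, not part of the pymilvus API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToySegment:
    # A segment holds the actual entities (each entity is a dict of field values).
    entities: List[dict] = field(default_factory=list)


@dataclass
class ToyPartition:
    name: str
    segments: List[ToySegment] = field(default_factory=list)


@dataclass
class ToyCollection:
    name: str
    partitions: Dict[str, ToyPartition] = field(default_factory=dict)


@dataclass
class ToyDatabase:
    name: str
    collections: Dict[str, ToyCollection] = field(default_factory=dict)


def count_entities(database: ToyDatabase) -> int:
    """Walk Database -> Collection -> Partition -> Segment and count entities."""
    return sum(
        len(segment.entities)
        for collection in database.collections.values()
        for partition in collection.partitions.values()
        for segment in partition.segments
    )


segment = ToySegment(entities=[{"id": 1}, {"id": 2}])
partition = ToyPartition(name="_default", segments=[segment])
collection = ToyCollection(name="products", partitions={"_default": partition})
database = ToyDatabase(name="default", collections={"products": collection})

print(count_entities(database))
```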

2. Database

2.1 Concept

A Database is the top-level unit of data isolation in Milvus.

text

┌─────────────────────────────────────────┐
│            Milvus Instance              │
├─────────────────────────────────────────┤
│                                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐ │
│  │  DB 1   │  │  DB 2   │  │  DB 3   │ │
│  │ecommerce│  │recommend│  │ search  │ │
│  └─────────┘  └─────────┘  └─────────┘ │
│                                         │
└─────────────────────────────────────────┘

2.2 Database Operations

python

from pymilvus import connections, db

connections.connect("default", host="localhost", port="19530")

# Create a database
db.create_database("ecommerce")

# List all databases
print(db.list_databases())
# ['default', 'ecommerce']

# Switch the current connection to this database
db.using_database("ecommerce")

# Drop the database
db.drop_database("ecommerce")

3. Collection

3.1 Concept

A Collection is the logical unit that stores entities in Milvus, analogous to a table in a relational database.

text

Collection structure:

┌─────────────────────────────────────────┐
│         Collection: products            │
├─────────────────────────────────────────┤
│  Field definitions:                     │
│  ├── id (INT64, primary key)            │
│  ├── name (VARCHAR)                     │
│  ├── price (FLOAT)                      │
│  ├── category (VARCHAR)                 │
│  └── embedding (FLOAT_VECTOR, dim=128)  │
├─────────────────────────────────────────┤
│  Entity data:                           │
│  ├── {id:1, name:"...", embedding:[...]}│
│  ├── {id:2, name:"...", embedding:[...]}│
│  └── ...                                │
└─────────────────────────────────────────┘

3.2 Collection Operations

python

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, utility

# Define the fields
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="name", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]

# Create the schema
schema = CollectionSchema(
    fields=fields,
    description="Product vector collection",
    enable_dynamic_field=True
)

# Create the collection
collection = Collection(
    name="products",
    schema=schema,
    using='default',
    shards_num=2
)

# List all collections
print(utility.list_collections())

# Drop the collection
utility.drop_collection("products")

4. Schema

4.1 Concept

A Schema defines the field structure and attributes of a Collection.

text

Schema components:

┌─────────────────────────────────────────┐
│                 Schema                  │
├─────────────────────────────────────────┤
│  Field list (fields)                    │
│  ├── Field name (name)                  │
│  ├── Data type (dtype)                  │
│  ├── Primary key flag (is_primary)      │
│  ├── Auto-generated ID (auto_id)        │
│  └── Other (max_length, dim, etc.)      │
├─────────────────────────────────────────┤
│  Description (description)              │
├─────────────────────────────────────────┤
│  Dynamic fields (enable_dynamic_field)  │
└─────────────────────────────────────────┘

4.2 Field Types

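Type	Description	Example
BOOL	Boolean	True/False
INT8	8-bit integer	-128 ~ 127
INT16	16-bit integer	-32768 ~ 32767
INT32	32-bit integer	-2147483648 ~ 2147483647
INT64	64-bit integer	-2^63 ~ 2^63-1
FLOAT	32-bit floating point	3.14
DOUBLE	64-bit floating point	3.141592653589793
VARCHAR	Variable-length string	"hello"
ARRAY	Array	[1, 2, 3]
JSON	JSON object	{"key": "value"}
FLOAT_VECTOR	Float vector	[0.1, 0.2, ...]
FLOAT16_VECTOR	16-bit float vector (lower precision)	[0.1, 0.2, ...]
BINARY_VECTOR	Binary vector	[0, 1, 0, ...]
SPARSE_FLOAT_VECTOR	Sparse vector	{1: 0.5, 100: 0.3}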

4.3 Schema Example

python

from pymilvus import FieldSchema, DataType

# Primary key field
id_field = FieldSchema(
    name="id",
    dtype=DataType.INT64,
    is_primary=True,
    auto_id=False,
    description="Primary key ID"
)

# Vector field
vector_field = FieldSchema(
    name="embedding",
    dtype=DataType.FLOAT_VECTOR,
    dim=768,
    description="Text embedding vector"
)

# Scalar field
name_field = FieldSchema(
    name="name",
    dtype=DataType.VARCHAR,
    max_length=256,
    description="Name"
)

# JSON field
metadata_field = FieldSchema(
    name="metadata",
    dtype=DataType.JSON,
    description="Metadata"
)

# Array field: up to 10 VARCHAR elements, each limited by max_length
tags_field = FieldSchema(
    name="tags",
    dtype=DataType.ARRAY,
    element_type=DataType.VARCHAR,
    max_capacity=10,
    max_length=64,
    description="Tag list"
)
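To see what these field attributes actually constrain, here is a small pure-Python check of an entity against a simplified copy of the definitions above. The dict-based `SCHEMA` and the `validate_entity` helper are hypothetical stand-ins for illustration; this is not how Milvus validates data internally, and measuring `max_length` in UTF-8 bytes is an assumption of this sketch.

```python
# Simplified stand-in for the field definitions above (not pymilvus objects).
SCHEMA = {
    "id": {"dtype": "INT64"},
    "name": {"dtype": "VARCHAR", "max_length": 256},
    "embedding": {"dtype": "FLOAT_VECTOR", "dim": 768},
    "tags": {"dtype": "ARRAY", "max_capacity": 10, "max_length": 64},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def validate_entity(entity: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means the entity fits."""
    problems = []
    for name, spec in schema.items():
        if name not in entity:
            problems.append(f"missing field: {name}")
            continue
        value = entity[name]
        dtype = spec["dtype"]
        if dtype == "INT64":
            if not isinstance(value, int) or not INT64_MIN <= value <= INT64_MAX:
                problems.append(f"{name}: not a 64-bit integer")
        elif dtype == "VARCHAR":
            # Assumption: length is measured in UTF-8 bytes.
            if not isinstance(value, str) or len(value.encode("utf-8")) > spec["max_length"]:
                problems.append(f"{name}: must be a string of at most {spec['max_length']} bytes")
        elif dtype == "FLOAT_VECTOR":
            if len(value) != spec["dim"]:
                problems.append(f"{name}: expected dim={spec['dim']}, got {len(value)}")
        elif dtype == "ARRAY":
            if len(value) > spec["max_capacity"]:
                problems.append(f"{name}: more than {spec['max_capacity']} elements")
            elif any(len(item.encode("utf-8")) > spec["max_length"] for item in value):
                problems.append(f"{name}: element longer than {spec['max_length']} bytes")
    return problems


good = {"id": 1, "name": "iPhone", "embedding": [0.0] * 768, "tags": ["phone"]}
bad = {"id": 2, "name": "x" * 300, "embedding": [0.0] * 128, "tags": []}

print(validate_entity(good))
print(validate_entity(bad))
```

The second entity fails twice: its name exceeds `max_length`, and its vector has 128 dimensions where the field expects 768.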

5. Partition

5.1 Concept

A Partition is a subset of a Collection, used to improve query performance and simplify data management.

text

Partition layout:

┌─────────────────────────────────────────┐
│         Collection: orders              │
├─────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐      │
│  │ Partition:  │  │ Partition:  │      │
│  │  _default   │  │  2024_01    │      │
│  │  (default)  │  │  (Jan 2024) │      │
│  └─────────────┘  └─────────────┘      │
│                                         │
│  ┌─────────────┐  ┌─────────────┐      │
│  │ Partition:  │  │ Partition:  │      │
│  │  2024_02    │  │  2024_03    │      │
│  │  (Feb 2024) │  │  (Mar 2024) │      │
│  └─────────────┘  └─────────────┘      │
└─────────────────────────────────────────┘

5.2 Partition Operations

python

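from pymilvus import Collection, Partition

# Assumes the "products" collection already exists
collection = Collection("products")

# Create a partition
partition = Partition(collection, "2024_01", description="Data for January 2024")

# List all partitions
print(collection.partitions)

# Check whether a partition exists
print(collection.has_partition("2024_01"))

# Drop the partition
collection.drop_partition("2024_01")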

5.3 Partition Use Cases

text

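Partition use cases:

1. Time-based partitions
   ├── Partition by date or month
   └── Simplifies managing historical data

2. Business partitions
   ├── Partition by business type
   └── Improves query efficiency

3. Geographic partitions
   ├── Partition by region
   └── Narrows the search scope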
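For the time-based case, the application has to decide which partition a record belongs to and which partitions a time-range query should touch. The sketch below shows one way to do that with monthly names such as `2024_01`; the helper names `partition_for` and `partitions_between` are made up for this example.

```python
from datetime import date


def partition_for(day: date) -> str:
    """Map a date to a monthly partition name such as "2024_01"."""
    return f"{day.year:04d}_{day.month:02d}"


def partitions_between(start: date, end: date) -> list:
    """List the monthly partitions that cover the inclusive range [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


print(partition_for(date(2024, 1, 15)))
print(partitions_between(date(2024, 1, 20), date(2024, 3, 5)))
```

The first result could be passed as `partition_name` when inserting, and the second as `partition_names` when searching, so that only the relevant months are scanned.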

6. Segment

6.1 Concept

A Segment is the smallest unit of data storage in Milvus; data is organized into segments automatically in the background.

text

Segment types:

┌─────────────────────────────────────────┐
│              Segment types              │
├─────────────────────────────────────────┤
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  Growing segment                  │  │
│  │  - Newly written data             │  │
│  │  - Held in memory                 │  │
│  │  - Accepts appended writes        │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  Sealed segment                   │  │
│  │  - Closed to further writes       │  │
│  │  - Persisted to storage           │  │
│  │  - Can be indexed                 │  │
│  └───────────────────────────────────┘  │
│                                         │
└─────────────────────────────────────────┘
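The growing-to-sealed transition can be illustrated with a toy model. This sketch is not Milvus code: the `ToySegmentManager` class is hypothetical, and a simple row-count threshold stands in for Milvus's actual sealing policies.

```python
class ToySegmentManager:
    """Toy model: rows go to a growing segment, which is sealed once it is full."""

    def __init__(self, max_rows_per_segment: int = 3):
        # Assumption: a row-count threshold replaces the real sealing policies.
        self.max_rows = max_rows_per_segment
        self.growing = []   # rows in the current growing segment
        self.sealed = []    # sealed segments, stored as immutable tuples

    def insert(self, row: dict) -> None:
        self.growing.append(row)
        if len(self.growing) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        """Seal the growing segment (if non-empty) and start a new one."""
        if self.growing:
            self.sealed.append(tuple(self.growing))
            self.growing = []


manager = ToySegmentManager(max_rows_per_segment=3)
for i in range(7):
    manager.insert({"id": i})

# Seven inserts with a threshold of three: two sealed segments, one row still growing.
print(len(manager.sealed), len(manager.growing))
```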

6.2 Segment Management

python

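# Segments are managed automatically by the system;
# their state can be inspected through the API.

from pymilvus import Collection, utility

# Get collection statistics
collection = Collection("products")
print(collection.num_entities)

# Manually trigger compaction, which merges small segments
collection.compact()

# load_balance does not merge segments: it moves sealed segments
# from one query node to others (the node IDs below are examples)
utility.load_balance(
    collection_name="products",
    src_node_id=1,
    dst_node_ids=[2, 3]
)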

7. Index

7.1 Concept

An index is a data structure that accelerates vector search, and it is key to Milvus's high performance.

text

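What an index does:

Search without an index:
┌─────────────────────────────────────────┐
│  Query vs. all vectors (brute force)    │
│  Time complexity: O(N)                  │
│  Recall: 100%                           │
└─────────────────────────────────────────┘

Search with an index:
┌─────────────────────────────────────────┐
│  Query vs. index structure (ANN)        │
│  Time complexity: sub-linear in N       │
│  Recall: typically 95%+ (tunable)       │
└─────────────────────────────────────────┘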
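The trade-off can be made concrete with a toy comparison of exact search against an IVF-style search that scans only a few inverted lists. This is a simplified sketch, not Milvus's implementation: randomly sampled points stand in for trained k-means centroids, and all function names are illustrative.

```python
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force(query, vectors, k):
    """Exact search: compare the query against every vector (N distance computations)."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return ranked[:k]


def build_ivf(vectors, centroids):
    """Assign each vector to its nearest centroid (one inverted list per centroid)."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Approximate search: scan only the nprobe inverted lists closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k], len(candidates)


random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(1000)]
# Assumption: sampled points stand in for trained k-means centroids.
centroids = random.sample(vectors, 16)
lists = build_ivf(vectors, centroids)

query = [0.5, 0.5]
exact = brute_force(query, vectors, k=5)
approx, scanned = ivf_search(query, vectors, centroids, lists, k=5, nprobe=4)

recall = len(set(exact) & set(approx)) / len(exact)
print(f"scanned {scanned} of {len(vectors)} vectors, recall = {recall:.2f}")
```

Raising `nprobe` scans more lists, which moves recall toward 100% at the cost of more distance computations; probing every list reproduces the exact result.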

7.2 Index Types

text

Index categories:

┌─────────────────────────────────────────┐
│               Index types               │
├─────────────────────────────────────────┤
│                                         │
│  CPU indexes                            │
│  ├── FLAT     exact brute-force search  │
│  ├── IVF_FLAT inverted lists, balanced  │
│  ├── IVF_PQ   PQ-compressed, low memory │
│  ├── IVF_SQ8  scalar-quantized, compact │
│  ├── HNSW     graph-based, fast search  │
│  └── DISKANN  on-disk, large datasets   │
│                                         │
│  GPU indexes                            │
│  ├── GPU_IVF_FLAT                       │
│  ├── GPU_IVF_PQ                         │
│  └── GPU_BRUTE_FORCE                    │
│                                         │
└─────────────────────────────────────────┘

7.3 Index Operations

python

from pymilvus import Collection

collection = Collection("products")

# Create an index
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128}
}

collection.create_index(
    field_name="embedding",
    index_params=index_params,
    index_name="embedding_idx"
)

# Inspect the indexes
print(collection.indexes)

# Drop the index
collection.drop_index(index_name="embedding_idx")
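The `params` entry of `index_params` depends on the index type: IVF indexes take `nlist`, while HNSW takes `M` and `efConstruction`. The helper below is a hypothetical convenience function, not part of pymilvus, that builds such dicts. The `4 * sqrt(N)` starting point for `nlist` is a commonly quoted rule of thumb and the HNSW values are illustrative; both should be treated as assumptions to tune for your own data.

```python
import math


def make_index_params(index_type: str, num_entities: int, metric_type: str = "L2") -> dict:
    """Build an index_params dict of the shape passed to create_index."""
    if index_type == "IVF_FLAT":
        # Rule-of-thumb starting point (an assumption): nlist ~ 4 * sqrt(N),
        # clamped to the range 1..65536.
        nlist = min(65536, max(1, int(4 * math.sqrt(num_entities))))
        params = {"nlist": nlist}
    elif index_type == "HNSW":
        # Illustrative values only; larger values trade build time for recall.
        params = {"M": 16, "efConstruction": 200}
    elif index_type == "FLAT":
        params = {}  # FLAT needs no build parameters
    else:
        raise ValueError(f"unsupported index type in this sketch: {index_type}")
    return {"metric_type": metric_type, "index_type": index_type, "params": params}


print(make_index_params("IVF_FLAT", num_entities=1_000_000))
print(make_index_params("HNSW", num_entities=1_000_000, metric_type="COSINE"))
```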

8. Entity

8.1 Concept

An Entity is a single record in a Collection, holding a value for each field.

python

# Example entity
entity = {
    "id": 1,
    "name": "iPhone 15",
    "price": 6999.0,
    "category": "electronics",
    "embedding": [0.1, 0.2, 0.3, ...]  # 128-dimensional vector (truncated)
}

8.2 Entity Operations

python

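# Column-based insert: one list per field, in schema order
data = [
    [1, 2, 3],  # id
    ["iPhone", "iPad", "Mac"],  # name
    [[0.1]*128, [0.2]*128, [0.3]*128]  # embedding
]

collection.insert(data)

# Row-based insert: one dict per entity
# (use one style or the other; insert() does not deduplicate primary keys)
entities = [
    {"id": 1, "name": "iPhone", "embedding": [0.1]*128},
    {"id": 2, "name": "iPad", "embedding": [0.2]*128}
]

collection.insert(entities)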
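The two insert formats carry the same information, one organized by field and the other by entity. A short pure-Python sketch of converting between them follows; the helper names are illustrative and 4-dimensional vectors are used only to keep the output short.

```python
def rows_to_columns(rows, field_names):
    """Convert row-based entities (list of dicts) into column-based lists."""
    return [[row[name] for row in rows] for name in field_names]


def columns_to_rows(columns, field_names):
    """Convert column-based lists back into row-based entities."""
    return [dict(zip(field_names, values)) for values in zip(*columns)]


field_names = ["id", "name", "embedding"]
rows = [
    {"id": 1, "name": "iPhone", "embedding": [0.1] * 4},
    {"id": 2, "name": "iPad", "embedding": [0.2] * 4},
]

columns = rows_to_columns(rows, field_names)
print(columns[0])  # all ids
print(columns[1])  # all names
```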

9. Distance Metrics

9.1 Metric Types

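Metric	Formula	Typical use
L2 (Euclidean distance)	√(Σ(ai-bi)²)	Image search, general purpose
IP (inner product)	Σai*bi	Normalized vectors, recommender systems
COSINE (cosine similarity)	Σai*bi / (‖a‖·‖b‖)	Text similarity
HAMMING (Hamming distance)	Number of differing bits (XOR)	Binary vectors
JACCARD (Jaccard distance)	1 - intersection/union	Set similarity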
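For reference, the five metrics can be written out in a few lines of plain Python. This is a minimal sketch for checking intuition on small inputs; it operates on Python lists (bit lists for HAMMING and JACCARD) and says nothing about how Milvus computes or scales these values internally.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def hamming_distance(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x ^ y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two equal-length bit lists viewed as sets."""
    intersection = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return 1 - intersection / union if union else 0.0


print(l2_distance([0, 0], [3, 4]))                    # 5.0
print(inner_product([1, 2], [3, 4]))                  # 11
print(cosine_similarity([1, 0], [0, 1]))              # 0.0
print(hamming_distance([0, 1, 0, 1], [1, 1, 0, 0]))   # 2
print(jaccard_distance([0, 1, 0, 1], [1, 1, 0, 0]))   # 1 - 1/3
```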

9.2 Choosing a Metric

text

Metric selection guide:

┌─────────────────────────────────────────┐
│       Choosing a distance metric        │
├─────────────────────────────────────────┤
│                                         │
│  Are the vectors normalized?            │
│  ├── Yes → IP or COSINE                 │
│  └── No → L2                            │
│                                         │
│  Application scenario?                  │
│  ├── Image search → L2                  │
│  ├── Text similarity → COSINE           │
│  ├── Recommendation → IP                │
│  └── Binary vectors → HAMMING           │
│                                         │
└─────────────────────────────────────────┘
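The reason normalization matters in the guide above is that, for unit-length vectors, IP, COSINE, and L2 all produce the same ranking. A short derivation, writing a and b for the two vectors being compared:

```latex
\begin{align}
\|a\| = \|b\| = 1 \;&\Longrightarrow\;
\cos(a, b) = \frac{\sum_i a_i b_i}{\|a\|\,\|b\|} = \sum_i a_i b_i = \mathrm{IP}(a, b) \\
\|a - b\|^2 &= \|a\|^2 + \|b\|^2 - 2\sum_i a_i b_i = 2 - 2\,\mathrm{IP}(a, b)
\end{align}
```

So on normalized vectors a larger inner product always means a smaller L2 distance, and IP equals COSINE exactly. The choice of metric only changes the ranking when vector lengths differ.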

10. Data Flow

10.1 Write Path

text

Write path:

┌──────────┐     ┌──────────┐     ┌───────────┐
│  Client  │────▶│  Proxy   │────▶│ DataCoord │
└──────────┘     └──────────┘     └─────┬─────┘
                                        │
                                        ▼
                                  ┌───────────┐
                                  │ DataNode  │
                                  │ (writes   │
                                  │ segments) │
                                  └───────────┘

10.2 Query Path

text

Query path:

┌──────────┐     ┌──────────┐     ┌────────────┐
│  Client  │────▶│  Proxy   │────▶│ QueryCoord │
└──────────┘     └──────────┘     └─────┬──────┘
                                        │
                                        ▼
                                  ┌────────────┐
                                  │ QueryNode  │
                                  │ (executes  │
                                  │ the search)│
                                  └────────────┘
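One step the diagram leaves implicit is how partial results are combined when several QueryNodes (or segments) each return their own top-k. A minimal sketch of that merge step follows, assuming smaller distances are better (as with L2). The function name and the sample hits are made up for illustration and do not reflect Milvus's internal code.

```python
import heapq
from itertools import islice


def merge_top_k(partial_results, k):
    """Merge per-node result lists (each sorted by ascending distance) into a global top-k."""
    merged = heapq.merge(*partial_results, key=lambda hit: hit[0])
    return list(islice(merged, k))


# Each hit is (distance, entity_id); each list is one node's local top-3.
node_a = [(0.10, 7), (0.40, 2), (0.90, 5)]
node_b = [(0.05, 11), (0.50, 3), (0.70, 8)]

print(merge_top_k([node_a, node_b], k=3))
```

Because every input list is already sorted, the merge only needs to look at the head of each list, so the global top-k is found without re-sorting all hits.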

11. Summary

Core concepts at a glance:

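Concept	Analogy	Description
Database	Database	Unit of data isolation
Collection	Table	Logical unit that stores entities
Schema	Table schema	Defines fields and their attributes
Partition	Partition	Subset of a Collection
Segment	Data block	Smallest unit of data storage
Index	Index	Data structure that accelerates search
Entity	Row/record	A single data record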

Next, let's move on to the basic syntax of Milvus!