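Collection Management #

A collection is the central concept in Qdrant: a named set of points (vectors plus optional payload) that share one vector configuration. This chapter walks through creating, inspecting, updating, and deleting collections, along with aliases and tuning advice.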

Collection overview #

text

Collection structure:

┌─────────────────────────────────────────────────────────────┐
│                         Collection                          │
├─────────────────────────────────────────────────────────────┤
│  Configuration                                              │
│  ├── Vector parameters (vectors_config)                     │
│  ├── HNSW parameters (hnsw_config)                          │
│  ├── Optimizer parameters (optimizers_config)               │
│  ├── WAL parameters (wal_config)                            │
│  └── Quantization (quantization_config)                     │
├─────────────────────────────────────────────────────────────┤
│  Runtime information                                        │
│  ├── Point count (points_count)                             │
│  ├── Indexed vectors (indexed_vectors_count)                │
│  ├── Segment count (segments_count)                         │
│  └── Status (status)                                        │
└─────────────────────────────────────────────────────────────┘

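Creating a Collection #

Basic creation #

python

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# ":memory:" runs Qdrant in local in-process mode, handy for experiments;
# pass a URL such as "http://localhost:6333" to reach a running server.
client = QdrantClient(":memory:")

# One unnamed vector per point: 384 dimensions, cosine similarity.
client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(
        size=384,
        distance=Distance.COSINE
    )
)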
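
For readers who call the REST API directly, the request body behind the call above can be sketched as follows. This is a minimal sketch: `build_create_body` is a hypothetical helper, and it assumes the `PUT /collections/{collection_name}` endpoint with a `vectors` object that holds `size` and `distance`.

```python
import json

# Distance names as spelled in Qdrant's REST API.
REST_DISTANCES = {"Cosine", "Euclid", "Dot", "Manhattan"}


def build_create_body(size: int, distance: str) -> dict:
    """Build the JSON body for PUT /collections/{collection_name} (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    if distance not in REST_DISTANCES:
        raise ValueError(f"unknown distance: {distance!r}")
    return {"vectors": {"size": size, "distance": distance}}


body = build_create_body(384, "Cosine")
print(json.dumps(body))  # {"vectors": {"size": 384, "distance": "Cosine"}}
```

Validating the size and the metric name before sending the request keeps typos such as "cosine" from surfacing only as a server-side error.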

Creation with full configuration #

python

from qdrant_client.models import (
    VectorParams,
    Distance,
    HnswConfigDiff,
    OptimizersConfigDiff,
    WalConfigDiff,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType
)

client.create_collection(
    collection_name="advanced_collection",
    vectors_config=VectorParams(
        size=768,
        distance=Distance.COSINE
    ),
    # Graph index settings; see the HNSW section for what each field means.
    hnsw_config=HnswConfigDiff(
        m=16,
        ef_construct=100,
        full_scan_threshold=10000,
        max_indexing_threads=2,
        on_disk=False
    ),
    # Segment optimizer settings. Size thresholds are expressed in kilobytes.
    optimizers_config=OptimizersConfigDiff(
        deleted_threshold=0.2,
        vacuum_min_vector_count=1000,
        default_segment_number=5,
        max_segment_size=100000,
        memmap_threshold=50000,
        indexing_threshold=20000,
        flush_interval_sec=5,
        max_optimization_threads=2
    ),
    # Write-ahead log settings.
    wal_config=WalConfigDiff(
        wal_capacity_mb=32,
        wal_segments_ahead=0
    ),
    # int8 scalar quantization, with the quantized vectors kept in RAM.
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    ),
    shard_number=1,
    replication_factor=1,
    write_consistency_factor=1,
    on_disk_payload=False
)

print("Collection created with advanced configuration")

Collection with multiple named vectors #

python

# Each point can carry several named vectors, each with its own size and metric.
client.create_collection(
    collection_name="multi_vector_collection",
    vectors_config={
        "text": VectorParams(size=384, distance=Distance.COSINE),
        "image": VectorParams(size=512, distance=Distance.EUCLID),
        "audio": VectorParams(size=128, distance=Distance.DOT)
    }
)

print("Multi-vector collection created")
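
Every named vector supplied with a point has to match the size declared for that name. A client-side check along these lines can catch mismatches before upload. It is only a sketch: `VECTOR_SIZES` and `check_named_vectors` are hypothetical names that mirror the configuration above, not part of qdrant-client.

```python
# Expected dimensions per named vector, mirroring the collection defined above.
VECTOR_SIZES = {"text": 384, "image": 512, "audio": 128}


def check_named_vectors(vectors: dict) -> list:
    """Return a list of problems; an empty list means the vectors fit the schema."""
    problems = []
    for name, values in vectors.items():
        expected = VECTOR_SIZES.get(name)
        if expected is None:
            problems.append(f"unknown vector name: {name}")
        elif len(values) != expected:
            problems.append(f"{name}: expected {expected} dims, got {len(values)}")
    return problems


point_vectors = {"text": [0.1] * 384, "image": [0.2] * 100}
print(check_named_vectors(point_vectors))  # ['image: expected 512 dims, got 100']
```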

Sparse vector Collection #

python

from qdrant_client.models import SparseVectorParams, SparseIndexParams

# Sparse vectors are always named and need no fixed size.
client.create_collection(
    collection_name="sparse_collection",
    sparse_vectors_config={
        "text-sparse": SparseVectorParams(
            index=SparseIndexParams(
                on_disk=False
            )
        )
    }
)

print("Sparse vector collection created")
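
A sparse vector stores only its non-zero entries as parallel lists of indices and values, and sparse vectors are compared by dot product. The pure-Python sketch below illustrates that representation; `to_sparse` and `sparse_dot` are hypothetical helpers written for this example, not library functions.

```python
def to_sparse(dense):
    """Convert a dense list into (indices, values), keeping only non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return indices, values


def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors: only indices present in both contribute."""
    b = dict(zip(b_idx, b_val))
    return sum(v * b[i] for i, v in zip(a_idx, a_val) if i in b)


indices, values = to_sparse([0.0, 0.5, 0.0, 0.0, 2.0])
print(indices, values)  # [1, 4] [0.5, 2.0]

# Only index 4 is shared, so the score is 2.0 * 3.0.
print(sparse_dot(indices, values, [4, 7], [3.0, 1.0]))  # 6.0
```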

Configuration parameters in detail #

Vector parameters #

python

VectorParams(
    size=384,
    distance=Distance.COSINE,
    hnsw_config=HnswConfigDiff(
        m=16,
        ef_construct=100
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8
        )
    ),
    on_disk=False
)

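| Parameter | Description | Default |
|---|---|---|
| size | Vector dimensionality | Required |
| distance | Distance metric | Required |
| hnsw_config | Per-vector HNSW overrides | Collection-level settings |
| quantization_config | Per-vector quantization settings | None |
| on_disk | Keep the original vectors on disk instead of in RAM | False |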

HNSW configuration #

python

HnswConfigDiff(
    m=16,
    ef_construct=100,
    full_scan_threshold=10000,
    max_indexing_threads=2,
    on_disk=False,
    payload_m=None
)

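| Parameter | Description | Recommended value |
|---|---|---|
| m | Maximum number of edges per node in the graph | 16-64 |
| ef_construct | Number of candidate neighbours considered while building the index | 100-200 |
| full_scan_threshold | Size threshold (in KB) below which a full scan is preferred over the index | 10000 |
| max_indexing_threads | Threads used for index building | Number of CPU cores (0 lets Qdrant choose) |
| on_disk | Store the HNSW index on disk | False |
| payload_m | Edges per node for the additional payload-aware links | m/2 (None falls back to m) |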

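Optimizer configuration #

python

# Size thresholds are expressed in kilobytes.
OptimizersConfigDiff(
    deleted_threshold=0.2,
    vacuum_min_vector_count=1000,
    default_segment_number=5,
    max_segment_size=100000,
    memmap_threshold=50000,
    indexing_threshold=20000,
    flush_interval_sec=5,
    max_optimization_threads=2
)

The values below mirror the example above. Server defaults differ between Qdrant versions, and several of these settings are chosen automatically when left unset.

| Parameter | Description | Example value |
|---|---|---|
| deleted_threshold | Minimum fraction of deleted vectors in a segment before it is optimized | 0.2 |
| vacuum_min_vector_count | Minimum number of vectors in a segment for that cleanup to apply | 1000 |
| default_segment_number | Target number of segments | 5 |
| max_segment_size | Maximum segment size, in KB | 100000 |
| memmap_threshold | Segment size, in KB, above which vectors are stored as memory-mapped files | 50000 |
| indexing_threshold | Segment size, in KB, above which a vector index is built | 20000 |
| flush_interval_sec | Interval between flushes to disk, in seconds | 5 |
| max_optimization_threads | Maximum number of optimization threads | 2 |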
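
To make the thresholds concrete, the sketch below models two of the decisions they control: whether a segment has enough deleted vectors to be cleaned up, and whether a segment is large enough to get a vector index. This is a simplified reading of the parameter descriptions, not Qdrant's actual optimizer logic; the function names and the float32 size estimate (4 bytes per component) are assumptions made for illustration.

```python
def needs_vacuum(total_vectors, deleted_vectors,
                 deleted_threshold=0.2, vacuum_min_vector_count=1000):
    """Simplified model: a segment qualifies if it is big enough AND has enough deletions."""
    if total_vectors < vacuum_min_vector_count:
        return False
    return deleted_vectors / total_vectors >= deleted_threshold


def estimated_size_kb(num_vectors, dim, bytes_per_value=4):
    """Approximate raw vector storage in KB, assuming float32 components."""
    return num_vectors * dim * bytes_per_value / 1024


def exceeds_indexing_threshold(num_vectors, dim, threshold_kb=20000):
    """Simplified model: an index is built once the segment's vectors exceed the threshold."""
    return estimated_size_kb(num_vectors, dim) > threshold_kb


print(needs_vacuum(5000, 1500))                 # True: 30% deleted, segment large enough
print(needs_vacuum(500, 400))                   # False: below vacuum_min_vector_count
print(estimated_size_kb(10_000, 384))           # 15000.0
print(exceeds_indexing_threshold(10_000, 384))  # False
print(exceeds_indexing_threshold(20_000, 384))  # True
```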

Retrieving Collection information #

Retrieving a single Collection #

python

collection_info = client.get_collection("my_collection")

print(f"Points: {collection_info.points_count}")
print(f"Indexed vectors: {collection_info.indexed_vectors_count}")
print(f"Segments: {collection_info.segments_count}")
print(f"Status: {collection_info.status}")

# With a single unnamed vector, `vectors` is a VectorParams object;
# with named vectors it is a dict keyed by vector name.
print(f"Vector size: {collection_info.config.params.vectors.size}")
print(f"Distance metric: {collection_info.config.params.vectors.distance}")

Listing all Collections #

python

collections = client.get_collections()

print(f"Total collections: {len(collections.collections)}")
for col in collections.collections:
    print(f"  - {col.name}")

Checking whether a Collection exists #

python

# Recent qdrant-client releases also provide client.collection_exists(name).
def collection_exists(collection_name):
    collections = client.get_collections()
    return any(col.name == collection_name for col in collections.collections)

if collection_exists("my_collection"):
    print("Collection exists")
else:
    print("Collection does not exist")

Updating Collection configuration #

Updating Collection parameters #

python

from qdrant_client.models import OptimizersConfigDiff, HnswConfigDiff

# Only the fields set in each *Diff object are changed; the rest keep their values.
client.update_collection(
    collection_name="my_collection",
    optimizers_config=OptimizersConfigDiff(
        deleted_threshold=0.3,
        indexing_threshold=10000
    ),
    hnsw_config=HnswConfigDiff(
        ef_construct=150
    )
)

print("Collection configuration updated")

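Updating vector parameters #

python

from qdrant_client.models import HnswConfigDiff, VectorParamsDiff

# Keys are vector names; "text" is defined in multi_vector_collection.
# For a collection with a single unnamed vector, use the empty string "" as the key.
client.update_collection(
    collection_name="multi_vector_collection",
    vectors_config={
        "text": VectorParamsDiff(
            hnsw_config=HnswConfigDiff(
                m=32,
                ef_construct=200
            )
        )
    }
)

print("Vector configuration updated")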

Deleting a Collection #

Deleting a single Collection #

python

# Deletion removes the collection and all of its data.
client.delete_collection("my_collection")

print("Collection deleted")

Deleting multiple Collections #

python

collections_to_delete = ["test1", "test2", "test3"]

# Delete one by one so that a single failure does not stop the rest.
for name in collections_to_delete:
    try:
        client.delete_collection(name)
        print(f"Deleted: {name}")
    except Exception as e:
        print(f"Failed to delete {name}: {e}")

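Collection aliases #

An alias gives a collection an additional name. Because an alias can be re-pointed from one collection to another, clients can keep querying a stable name while the underlying version changes.

Creating an alias #

python

from qdrant_client.models import CreateAlias, CreateAliasOperation

# Aliases are managed through update_collection_aliases.
create_op = CreateAliasOperation(
    create_alias=CreateAlias(collection_name="my_collection", alias_name="production")
)
client.update_collection_aliases(change_aliases_operations=[create_op])

print("Alias created")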

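Listing aliases #

python

# Aliases of one collection; client.get_aliases() lists the aliases of all collections.
aliases = client.get_collection_aliases("my_collection")

for alias in aliases.aliases:
    print(f"Alias: {alias.alias_name}")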

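Deleting an alias #

python

from qdrant_client.models import DeleteAlias, DeleteAliasOperation

delete_op = DeleteAliasOperation(delete_alias=DeleteAlias(alias_name="production"))
client.update_collection_aliases(change_aliases_operations=[delete_op])

print("Alias deleted")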

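Switching an alias to another Collection #

python

from qdrant_client.models import (
    CreateAlias, CreateAliasOperation, DeleteAlias, DeleteAliasOperation
)

# Both operations travel in one request, so the alias is re-pointed atomically.
client.update_collection_aliases(
    change_aliases_operations=[
        DeleteAliasOperation(delete_alias=DeleteAlias(alias_name="production")),
        CreateAliasOperation(
            create_alias=CreateAlias(
                collection_name="my_collection_v2",
                alias_name="production"
            )
        )
    ]
)

print("Alias switched to the new version")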
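
For reference, the same switch expressed as a REST request body might look like the sketch below. It assumes the `POST /collections/aliases` endpoint, which takes an `actions` list; `build_alias_switch` is a hypothetical helper introduced only for this example.

```python
import json


def build_alias_switch(alias: str, new_collection: str) -> dict:
    """Build the body for POST /collections/aliases that re-points an alias in one request."""
    return {
        "actions": [
            {"delete_alias": {"alias_name": alias}},
            {"create_alias": {"collection_name": new_collection, "alias_name": alias}},
        ]
    }


payload = build_alias_switch("production", "my_collection_v2")
print(json.dumps(payload, indent=2))
```

Keeping the delete and create actions in a single request is the point of the design: clients querying `production` never observe a moment in which the alias is missing.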

Collection statistics #

Retrieving detailed statistics #

python

info = client.get_collection("my_collection")

# Collect the runtime counters and status fields into one dict for reporting.
stats = {
    "points_count": info.points_count,
    "indexed_vectors_count": info.indexed_vectors_count,
    "segments_count": info.segments_count,
    "status": info.status,
    "optimizer_status": info.optimizer_status,
}

print("Collection statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

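Monitoring indexing progress #

python

import time

from qdrant_client.models import CollectionStatus

def wait_for_indexing(collection_name, timeout=60):
    """Poll until the collection status is green or the timeout (in seconds) expires."""
    start_time = time.time()

    while time.time() - start_time < timeout:
        info = client.get_collection(collection_name)

        if info.status == CollectionStatus.GREEN:
            print("Indexing complete")
            return True

        # Both counters are approximate and may be None, so fall back to 0.
        indexed = info.indexed_vectors_count or 0
        total = info.points_count or 0

        if total > 0:
            progress = (indexed / total) * 100
            print(f"Indexing progress: {progress:.1f}% ({indexed}/{total})")

        time.sleep(1)

    print("Timed out waiting for indexing")
    return False

wait_for_indexing("my_collection")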
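
The polling loop above mixes timing logic with client calls, which makes it awkward to test. One possible refactoring, shown as a sketch, separates the two: `poll_until` is a hypothetical helper that accepts any zero-argument check plus injectable clock and sleep functions, so it can be exercised without a Qdrant instance.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated status sequence standing in for client.get_collection(name).status.
states = iter(["yellow", "yellow", "green"])
ready = poll_until(lambda: next(states) == "green", timeout=5, interval=0,
                   sleep=lambda _: None)
print(ready)  # True
```

In real use the check would wrap the client call, for example a lambda that compares the collection status with green.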

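Best practices #

Choosing a suitable vector dimensionality #

text

Choosing vector dimensionality (rules of thumb; the size must match the output of the embedding model):

Small datasets (< 100,000 points):
├── High dimensionality (1024+) is affordable
└── Favor accuracy

Medium datasets (100,000 to 1 million points):
├── 384-768 dimensions recommended
└── Balance accuracy and performance

Large datasets (> 1 million points):
├── 384 dimensions or fewer recommended
├── Consider quantization
└── Favor performance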
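
The trade-off above is easier to judge with numbers. The sketch below uses a rough rule of thumb (vectors × dimensions × 4 bytes × 1.5) to estimate the RAM needed to keep float32 vectors in memory. The overhead factor and the helper name `estimate_vector_memory_bytes` are assumptions for illustration, and real usage depends on index settings, payloads, and storage options.

```python
def estimate_vector_memory_bytes(
    num_vectors: int,
    dim: int,
    bytes_per_value: int = 4,  # float32; use 1 to approximate int8-quantized vectors
    overhead: float = 1.5,     # rule-of-thumb allowance for index and metadata
) -> int:
    """Rough RAM estimate for vectors held in memory (an approximation, not a guarantee)."""
    return int(num_vectors * dim * bytes_per_value * overhead)


for dim in (384, 768, 1536):
    gib = estimate_vector_memory_bytes(1_000_000, dim) / 2**30
    print(f"1M vectors x {dim} dims: about {gib:.2f} GiB")

# Approximate footprint of the int8-quantized copy alone.
quantized = estimate_vector_memory_bytes(1_000_000, 384, bytes_per_value=1)
print(f"int8 copy of 1M x 384: about {quantized / 2**30:.2f} GiB")
```

Doubling the dimensionality doubles the estimate, which is why lower dimensions and quantization become attractive as the collection grows.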

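Choosing a distance metric #

text

Choosing a distance metric:

Cosine similarity (COSINE):
├── Text embeddings
├── Semantic similarity
└── Compares direction only (Qdrant normalizes vectors on upload)

Euclidean distance (EUCLID):
├── Image features
├── Physical distances
└── Unnormalized vectors

Dot product (DOT):
├── Recommender systems
├── Equivalent to cosine when vectors are already normalized
└── Skips the normalization step that COSINE applies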
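
The relationship between the three metrics can be checked with a few lines of plain Python. The helpers below are written for this illustration and are not part of qdrant-client. They show that two vectors pointing the same way have cosine similarity 1 regardless of length, and that dot product matches cosine once both vectors are normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(f"cosine:         {cosine(a, b):.3f}")   # direction only
print(f"dot:            {dot(a, b):.3f}")      # grows with magnitude
print(f"euclid:         {euclid(a, b):.3f}")   # non-zero despite equal direction
print(f"dot normalized: {dot(normalize(a), normalize(b)):.3f}")
```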

HNSW parameter tuning #

text

Suggested HNSW parameters:

Accuracy first:
├── m: 32-64
├── ef_construct: 200-400
└── High memory usage

Balanced:
├── m: 16
├── ef_construct: 100
└── Recommended default

Speed and memory first:
├── m: 8-16
├── ef_construct: 50-100
└── Low memory usage

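Memory optimization #

python

client.create_collection(
    collection_name="memory_optimized",
    vectors_config=VectorParams(
        size=384,
        distance=Distance.COSINE
    ),
    # A smaller graph kept on disk lowers RAM use at some cost in recall and latency.
    hnsw_config=HnswConfigDiff(
        m=8,
        ef_construct=50,
        on_disk=True
    ),
    # int8 quantization shrinks each vector component from 4 bytes to 1.
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            always_ram=False
        )
    ),
    # Payloads are read from disk instead of being held in RAM.
    on_disk_payload=True
)

print("Memory-optimized collection created")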

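FAQ #

Collection status values #

text

Status colors:

green:
├── All segments are optimized and indexed
└── Best query performance

yellow:
├── Optimization (including indexing) is in progress
└── Queries still work but may be slower

red:
├── An error occurred that the engine could not recover from
└── Check the logs

grey:
├── Optimizations are pending, for example after a restart
└── They resume with the next update operation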

Handling many Collections #

python

def batch_create_collections(base_name, count, config):
    """Create `count` collections named base_name_0, base_name_1, ... with the same config."""
    for i in range(count):
        name = f"{base_name}_{i}"
        try:
            client.create_collection(
                collection_name=name,
                vectors_config=config
            )
            print(f"Created: {name}")
        except Exception as e:
            print(f"Failed to create {name}: {e}")

batch_create_collections(
    "shard",
    10,
    VectorParams(size=384, distance=Distance.COSINE)
)

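Summary #

This chapter covered Collection management in detail:

Creating collections (basic, full configuration, multiple vectors)
Configuration parameters in detail
Retrieving and updating collection information
Alias management
Best practices

Next steps #

With Collection management covered, continue to vector operations to learn how to insert and manage vector data efficiently.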