Core Concepts #

This chapter covers Qdrant's core concepts and data model, which the rest of the guide builds on.

Data Model Overview #

text

Hierarchy of the Qdrant data model:

┌─────────────────────────────────────────────────────────────┐
│                      Qdrant Instance                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                   Collection 1                       │   │
│  │  ┌─────────────────────────────────────────────┐    │   │
│  │  │                  Segment                     │    │   │
│  │  │  ┌────────┐ ┌────────┐ ┌────────┐          │    │   │
│  │  │  │ Point  │ │ Point  │ │ Point  │  ...     │    │   │
│  │  │  │ id: 1  │ │ id: 2  │ │ id: 3  │          │    │   │
│  │  │  │ vector │ │ vector │ │ vector │          │    │   │
│  │  │  │ payload│ │ payload│ │ payload│          │    │   │
│  │  │  └────────┘ └────────┘ └────────┘          │    │   │
│  │  └─────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────┘   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                   Collection 2                       │   │
│  │                      ...                             │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

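Collection #

A Collection is the top-level container in Qdrant, comparable to a table in a relational database.

Collection Properties #

text

A Collection is defined by:

├── Name: unique identifier
├── Vector configuration:
│   ├── Vector dimensionality (size)
│   └── Distance metric (distance)
├── Point count: number of stored points
├── Configuration parameters:
│   ├── HNSW parameters
│   ├── Optimizer parameters
│   └── WAL parameters
└── Status information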

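Vector Configuration #

python

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# ":memory:" runs an in-process instance, which is handy for experiments and tests.
client = QdrantClient(":memory:")

# Every vector stored in this collection must have exactly `size` components.
client.create_collection(
    collection_name="my_collection",
    vectors_config=VectorParams(
        size=384,
        distance=Distance.COSINE
    )
)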

Multiple Vector Support #

A single Collection can store several named vectors per point, each with its own dimensionality and distance metric:

python

from qdrant_client.models import VectorParams, Distance

# Each dictionary key is the name of one vector space within the collection.
client.create_collection(
    collection_name="multi_vector_collection",
    vectors_config={
        "text": VectorParams(size=384, distance=Distance.COSINE),
        "image": VectorParams(size=512, distance=Distance.EUCLID),
        "audio": VectorParams(size=128, distance=Distance.DOT)
    }
)

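Collection Information #

python

collection_info = client.get_collection("my_collection")

# `vectors.size` and `vectors.distance` can be read this way only when the
# collection has a single unnamed vector; with named vectors, `vectors` is a dict.
print(f"Point count: {collection_info.points_count}")
print(f"Vector dimensionality: {collection_info.config.params.vectors.size}")
print(f"Distance metric: {collection_info.config.params.vectors.distance}")
print(f"Status: {collection_info.status}")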

Point #

A Point is the basic unit of storage in Qdrant. It consists of an ID, a vector, and a Payload.

Point Structure #

text

A Point is made up of:

┌─────────────────────────────────────────────────────────────┐
│                            Point                            │
├─────────────────────────────────────────────────────────────┤
│  id: 12345                                                  │
│  ├── Unsigned 64-bit integer or UUID                        │
│  └── Unique identifier                                      │
├─────────────────────────────────────────────────────────────┤
│  vector: [0.1, 0.2, -0.3, 0.4, ...]                         │
│  ├── Array of floats                                        │
│  └── Length matches the Collection's dimensionality         │
├─────────────────────────────────────────────────────────────┤
│  payload: {                                                 │
│    "title": "Document title",                               │
│    "content": "Document content...",                        │
│    "tags": ["AI", "ML"],                                    │
│    "created_at": "2024-01-01"                               │
│  }                                                          │
│  ├── JSON metadata                                          │
│  └── Used for filtering and returned with results           │
└─────────────────────────────────────────────────────────────┘
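
The constraints in the diagram (an integer or UUID id, and a vector whose length matches the collection) can be checked before any data is sent. The sketch below is a minimal illustration in plain Python; `PointSketch` and `validate_point` are hypothetical names introduced here and are not part of `qdrant_client`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union
from uuid import UUID


@dataclass
class PointSketch:
    """Hypothetical stand-in for a Point; not a qdrant_client class."""
    id: Union[int, str]
    vector: List[float]
    payload: Dict[str, Any] = field(default_factory=dict)


def validate_point(point: PointSketch, expected_dim: int) -> None:
    """Raise ValueError if the id or the vector would not fit the collection."""
    if isinstance(point.id, bool) or not isinstance(point.id, (int, str)):
        raise ValueError("id must be an integer or a UUID string")
    if isinstance(point.id, int) and point.id < 0:
        raise ValueError("integer ids must be non-negative")
    if isinstance(point.id, str):
        UUID(point.id)  # raises ValueError for a malformed UUID string
    if len(point.vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(point.vector)}"
        )


example = PointSketch(id=12345, vector=[0.1, 0.2, -0.3, 0.4], payload={"tags": ["AI", "ML"]})
validate_point(example, expected_dim=4)  # passes silently
```

Running the same check with `expected_dim=384` raises a `ValueError`, which is the kind of mismatch a collection configured with `size=384` would reject.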

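Creating a Point #

python

from qdrant_client.models import PointStruct

# The vector length must equal the collection's configured `size`. The
# four-component vector here is shortened for readability and assumes a
# collection created with size=4.
point = PointStruct(
    id=1,
    vector=[0.1, 0.2, 0.3, 0.4],
    payload={
        "title": "Sample document",
        "content": "This is the content of a sample document",
        "tags": ["example", "demo"],
        "score": 0.95
    }
)

# upsert inserts the point, or overwrites an existing point with the same id.
client.upsert(
    collection_name="my_collection",
    points=[point]
)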

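Batch Creation #

python

# Start at 1: i = 0 would produce an all-zero vector, which has no direction
# and is therefore meaningless under cosine distance.
points = [
    PointStruct(
        id=i,
        vector=[0.1 * i, 0.2 * i, 0.3 * i, 0.4 * i],
        payload={"index": i, "category": "batch"}
    )
    for i in range(1, 101)
]

# A single upsert call sends all 100 points.
client.upsert(
    collection_name="my_collection",
    points=points
)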

Using UUIDs #

python

from uuid import uuid4

# UUIDs are passed to the client in their string form.
point = PointStruct(
    id=str(uuid4()),
    vector=[0.1, 0.2, 0.3, 0.4],
    payload={"title": "UUID example"}
)
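
A random `uuid4()` produces a new id on every run, so ingesting the same document twice creates two points. One possible alternative, sketched here with the standard library only, is to derive the id from a stable key with `uuid5`, so that repeating an upsert overwrites the existing point. The helper name `point_id_for` and the choice of `NAMESPACE_URL` are assumptions made for this illustration.

```python
from uuid import NAMESPACE_URL, UUID, uuid5


def point_id_for(document_key: str) -> str:
    """Derive a deterministic UUID string from a stable document key."""
    return str(uuid5(NAMESPACE_URL, document_key))


first = point_id_for("https://example.com/docs/intro")
second = point_id_for("https://example.com/docs/intro")
other = point_id_for("https://example.com/docs/setup")

# The same key always maps to the same id; different keys map to different ids.
print(first == second, first == other)
```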

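Vector #

The vector is the core component of a Point: it places the point in a high-dimensional space.

Vector Dimensionality #

text

Common vector dimensionalities (exact values depend on the model variant):

Text embeddings:
├── OpenAI text-embedding-ada-002: 1536 dimensions
├── Sentence Transformers: 384/768 dimensions
├── Cohere Embed: 1024/4096 dimensions
└── BGE: 768/1024 dimensions

Image embeddings:
├── ResNet: 2048 dimensions
├── CLIP: 512 dimensions
└── ViT: 768 dimensions

Audio embeddings:
├── Wav2Vec: 768 dimensions
└── VGGish: 128 dimensions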
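
Dimensionality translates directly into storage cost. As a rough sketch, assuming each component is stored as a 4-byte float32 and ignoring index, payload, and other overhead, the raw size of the vectors alone can be estimated as follows. The function name `raw_vector_bytes` is introduced here for illustration.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes needed for the raw vector data only (no index or payload overhead)."""
    return num_vectors * dim * bytes_per_component


for label, dim in [("384 dimensions", 384), ("768 dimensions", 768), ("1536 dimensions", 1536)]:
    size_gib = raw_vector_bytes(1_000_000, dim) / 1024 ** 3
    print(f"1M vectors at {label}: {size_gib:.2f} GiB")
```

Under these assumptions, doubling the dimensionality doubles the raw vector storage, which is one reason the choice of embedding model matters for capacity planning.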

Distance Metrics #

python

from qdrant_client.models import Distance

Distance.COSINE     # cosine similarity
Distance.EUCLID     # Euclidean distance
Distance.DOT        # dot product
Distance.MANHATTAN  # Manhattan (L1) distance

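Cosine Similarity #

text

cos(a, b) = (a · b) / (|a| × |b|)

Range: [-1, 1]
1: same direction
0: orthogonal
-1: opposite direction

Use cases:
├── Semantic similarity of text
├── Normalized vectors
└── Cases where vector magnitude does not matter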
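
The formula can be checked with a few lines of plain Python. This is a sketch of the arithmetic only, not of how Qdrant computes scores internally; the function name `cosine_similarity` is introduced here for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]       # same direction, twice the length
orthogonal = [3.0, 0.0, -1.0]  # a . orthogonal = 3 + 0 - 3 = 0
opposite = [-1.0, -2.0, -3.0]

for name, other in [("scaled", scaled), ("orthogonal", orthogonal), ("opposite", opposite)]:
    print(name, round(cosine_similarity(a, other), 6))
```

The `scaled` vector scores (up to floating-point rounding) the same as `a` itself would, which illustrates why cosine similarity suits cases where magnitude should be ignored.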

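Euclidean Distance #

text

euclid(a, b) = √(Σ(ai - bi)²)

Range: [0, ∞)
0: identical vectors
Smaller values mean greater similarity

Use cases:
├── Image features
├── Physical distances
└── Unnormalized vectors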
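
Unlike cosine similarity, Euclidean distance is sensitive to vector length. The following sketch, which mirrors the formula above in plain Python (the function name `euclidean_distance` is chosen here), shows that two vectors pointing in the same direction can still be far apart.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Square root of the summed squared component differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


origin = [0.0, 0.0]
corner = [3.0, 4.0]
print(euclidean_distance(origin, corner))  # the 3-4-5 right triangle

a = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction as a, but twice as long
print(euclidean_distance(a, a), euclidean_distance(a, scaled))
```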

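Dot Product #

text

dot(a, b) = Σ(ai × bi)

Range: (-∞, ∞)
Positive: the vectors point in broadly the same direction (angle below 90°)
Negative: the vectors point in broadly opposite directions (angle above 90°)

Use cases:
├── Recommender systems
├── Equivalent to cosine similarity when vectors are normalized
└── Cheapest of the three to compute (no norms or square roots)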
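
The equivalence with cosine similarity for normalized vectors can be verified numerically. In this sketch (helper names `dot` and `normalize` are introduced here), the raw dot product depends on vector length, while the dot product of the unit-length versions equals the cosine of the angle between the originals.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of the component-wise products."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

raw = dot(a, b)                                               # depends on magnitudes
cosine = raw / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))  # cosine formula
unit = dot(normalize(a), normalize(b))                        # dot product of unit vectors

print(raw, cosine, unit)
```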

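Sparse Vectors #

Qdrant supports sparse vectors, which suit keyword-search scenarios:

python

from qdrant_client.models import SparseVector

# `indices` lists the positions of the non-zero components;
# `values` holds the weight at each of those positions.
sparse_vector = SparseVector(
    indices=[1, 5, 100, 500],
    values=[0.5, 0.8, 0.3, 0.9]
)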
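
To see why the index/value representation works for keyword-style matching, consider how two sparse vectors can be scored: only positions present in both contribute. The sketch below uses plain lists rather than `SparseVector` objects, and `sparse_dot` is a name introduced here; it illustrates the idea and is not Qdrant's implementation.

```python
from typing import Sequence


def sparse_dot(
    indices_a: Sequence[int],
    values_a: Sequence[float],
    indices_b: Sequence[int],
    values_b: Sequence[float],
) -> float:
    """Dot product of two sparse vectors given as parallel index/value lists."""
    lookup = dict(zip(indices_b, values_b))
    return sum(v * lookup[i] for i, v in zip(indices_a, values_a) if i in lookup)


doc_indices, doc_values = [1, 5, 100, 500], [0.5, 0.8, 0.3, 0.9]
query_indices, query_values = [5, 500, 777], [1.0, 2.0, 3.0]

# Only indices 5 and 500 overlap: 0.8 * 1.0 + 0.9 * 2.0
score = sparse_dot(doc_indices, doc_values, query_indices, query_values)
print(score)
```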

Payload #

A Payload is the metadata attached to a Point. It is used to filter searches and to enrich the results that are returned.

Payload Data Types #

text

Supported data types:

├── Integer (integer)
│   "age": 25
│
├── Floating-point number (float)
│   "score": 0.95
│
├── Boolean (bool)
│   "is_active": true
│
├── String (keyword/text)
│   "title": "Document title"
│
├── Array
│   "tags": ["AI", "ML", "Python"]
│
├── Object
│   "metadata": {"author": "Zhang San", "year": 2024}
│
└── Geographic location
    "location": {"lat": 39.9, "lon": 116.4}

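Payload Indexes #

Create an index on a Payload field to speed up filtering on that field:

python

from qdrant_client.models import PayloadSchemaType

# Keyword index for exact matches on a string field.
client.create_payload_index(
    collection_name="my_collection",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD
)

# Float index for range filters on a numeric field.
client.create_payload_index(
    collection_name="my_collection",
    field_name="price",
    field_schema=PayloadSchemaType.FLOAT
)

# An integer index assumes `created_at` is stored as a number such as a Unix
# timestamp. For date strings like "2024-01-01", recent client versions
# provide PayloadSchemaType.DATETIME instead.
client.create_payload_index(
    collection_name="my_collection",
    field_name="created_at",
    field_schema=PayloadSchemaType.INTEGER
)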

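Payload Filtering Example #

python

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Newer qdrant-client releases deprecate `search` in favor of `query_points`.
# The query vector must have the same dimensionality as the collection.
results = client.search(
    collection_name="my_collection",
    query_vector=[0.1, 0.2, 0.3, 0.4],
    query_filter=Filter(
        # Every condition listed under `must` has to hold (logical AND).
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="technology")
            ),
            FieldCondition(
                key="price",
                range=Range(lte=100)
            )
        ]
    ),
    limit=10
)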
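
The effect of the two `must` conditions can be mimicked on plain dictionaries. This is only a sketch of the filter's meaning (category equals a value AND price is at most a bound); the function `passes_filter` and the sample payloads are made up for this illustration, and Qdrant evaluates filters very differently internally.

```python
from typing import Any, Dict


def passes_filter(payload: Dict[str, Any], category: str, max_price: float) -> bool:
    """True when the payload satisfies both conditions (logical AND)."""
    price = payload.get("price")
    return (
        payload.get("category") == category
        and isinstance(price, (int, float))
        and price <= max_price
    )


payloads = [
    {"id": 1, "category": "technology", "price": 80},
    {"id": 2, "category": "technology", "price": 150},  # too expensive
    {"id": 3, "category": "travel", "price": 40},       # wrong category
    {"id": 4, "category": "technology"},                # no price field
]

kept = [p["id"] for p in payloads if passes_filter(p, "technology", 100)]
print(kept)
```

Only the first payload satisfies both conditions; a point lacking the `price` field does not match the range condition in this sketch.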

Segment #

A Segment is Qdrant's internal unit of storage and indexing.

Segment Structure #

text

A Segment is made up of:

┌─────────────────────────────────────────────────────────────┐
│                          Segment                             │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │Vector store │  │ HNSW index  │  │Payload index│         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ ID mapping  │  │Delete flags │  │Version info │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────┘

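Segment Types #

text

Segment types:

1. Memory Segment
   ├── Optimized for writes
   ├── New data is written here first
   └── Converted to a disk segment once a threshold is reached

2. Disk Segment
   ├── Optimized for storage
   ├── Uses mmap
   └── Read-only, optimized for queries

3. Optimization flow:
   New data → memory segment → accumulation → merge → disk segment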

Segment Optimization #

yaml

# Optimizer parameters (in Qdrant's config.yaml this block sits under the `storage` key)
optimizers:
  deleted_threshold: 0.2
  vacuum_min_vector_number: 1000
  default_segment_number: 5
  max_segment_size_kb: 100000
  memmap_threshold_kb: 50000
  indexing_threshold_kb: 20000
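
The first two parameters decide when a segment with many deleted points becomes worth rewriting. The sketch below shows one plausible reading of that rule: a segment qualifies once it holds at least the minimum number of vectors and its deleted fraction exceeds the threshold. The function name, the use of a strict `>` comparison, and the choice to count total vectors are assumptions of this illustration, not a description of Qdrant's optimizer code.

```python
def needs_vacuum(
    total_vectors: int,
    deleted_vectors: int,
    deleted_threshold: float = 0.2,
    vacuum_min_vector_number: int = 1000,
) -> bool:
    """Illustrative check: is this segment a candidate for vacuuming?"""
    if total_vectors < vacuum_min_vector_number:
        return False  # segment too small to be worth optimizing
    return deleted_vectors / total_vectors > deleted_threshold


print(needs_vacuum(total_vectors=5000, deleted_vectors=1500))  # 30% deleted
print(needs_vacuum(total_vectors=5000, deleted_vectors=500))   # 10% deleted
print(needs_vacuum(total_vectors=500, deleted_vectors=400))    # below the minimum size
```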

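Index Types #

HNSW Index #

text

HNSW parameters:

m: 16
├── Maximum number of connections per graph node
├── Larger values improve accuracy but use more memory
└── Recommended: 8-64

ef_construct: 100
├── Size of the candidate list searched while building the index
├── Larger values improve accuracy but slow down index construction
└── Recommended: 50-200

full_scan_threshold_kb: 10000
├── Below this size, a full scan is used instead of the index
└── Speeds up queries on small datasets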
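
For context on the last parameter, a full scan simply scores every stored vector against the query and keeps the best results. On a small dataset this exact approach is cheap, which is why an approximate index such as HNSW pays off only beyond a certain size. The following is a minimal sketch using dot-product scoring; the function `full_scan_top_k` and the sample data are made up for illustration.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def full_scan_top_k(
    query: Sequence[float],
    vectors: Dict[int, Sequence[float]],
    k: int,
) -> List[Tuple[float, int]]:
    """Score every vector by dot product and return the k best (score, id) pairs."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), point_id)
        for point_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored)


stored = {
    1: [1.0, 0.0],
    2: [0.0, 1.0],
    3: [0.7, 0.7],
}
top = full_scan_top_k([1.0, 0.0], stored, k=2)
print([point_id for _, point_id in top])
```

The cost grows linearly with the number of stored vectors, whereas a graph index trades a small loss of accuracy for far fewer comparisons per query.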

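Payload Indexes #

text

Index types:

1. Keyword Index
   ├── Exact matching
   └── Suited to enumerated values

2. Integer Index
   ├── Range queries
   └── Suited to numeric filtering

3. Float Index
   ├── Range queries
   └── Suited to prices, ratings, and similar values

4. Geo Index
   ├── Geographic range queries
   └── Suited to location filtering

5. Text Index
   ├── Full-text search
   └── Suited to text matching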

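Vector Quantization #

Quantization reduces memory usage by trading some precision for storage efficiency.

Scalar Quantization #

python

from qdrant_client.models import (
    Distance,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client.create_collection(
    collection_name="quantized_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,  # store each component as an 8-bit integer
            quantile=0.99,         # fit the value range to 99% of values; outliers are clipped
            always_ram=True        # keep the quantized vectors in RAM
        )
    )
)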
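
The idea behind int8 scalar quantization can be sketched in a few lines: pick a value range, clip anything outside it, and map the range linearly onto 256 integer levels. The code below is a simplified illustration under the assumption that the bounds `lo` and `hi` are already known (in practice they would come from the `quantile` setting); the function names are introduced here and the exact encoding Qdrant uses may differ.

```python
from typing import List, Sequence


def quantize_int8(vector: Sequence[float], lo: float, hi: float) -> List[int]:
    """Map floats in [lo, hi] linearly onto the integers -128..127, clipping outliers."""
    scale = (hi - lo) / 255
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)
        codes.append(round((clipped - lo) / scale) - 128)
    return codes


def dequantize_int8(codes: Sequence[int], lo: float, hi: float) -> List[float]:
    """Approximate reconstruction of the original floats."""
    scale = (hi - lo) / 255
    return [(c + 128) * scale + lo for c in codes]


lo, hi = -1.0, 1.0
original = [-1.0, -0.5, 0.0, 0.25, 1.0, 5.0]  # 5.0 is an outlier and gets clipped
codes = quantize_int8(original, lo, hi)
restored = dequantize_int8(codes, lo, hi)
print(codes)
print([round(x, 3) for x in restored])
```

Each component now fits in one byte instead of four. Values inside the range are recovered to within half a quantization step, while the clipped outlier is reconstructed as the upper bound.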

Product Quantization #

python

from qdrant_client.models import (
    CompressionRatio,
    Distance,
    ProductQuantization,
    ProductQuantizationConfig,
    VectorParams,
)

client.create_collection(
    collection_name="pq_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=ProductQuantization(
        product=ProductQuantizationConfig(
            # The field is named `compression` and takes a CompressionRatio enum value.
            compression=CompressionRatio.X16
        )
    )
)
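
To put a 16x compression ratio in perspective, the arithmetic below estimates the per-vector and per-collection footprint, assuming 4-byte float32 components before compression and ignoring codebooks and index overhead. The function name `compressed_bytes` is introduced for this sketch.

```python
FLOAT32_BYTES = 4


def compressed_bytes(dim: int, ratio: float) -> float:
    """Approximate bytes per vector after compressing float32 data by `ratio`."""
    return dim * FLOAT32_BYTES / ratio


dim = 384
original = compressed_bytes(dim, 1)     # uncompressed float32
compressed = compressed_bytes(dim, 16)  # 16x compression

print(f"per vector: {original:.0f} B -> {compressed:.0f} B")
print(f"per 1M vectors: {original * 1e6 / 1e6:.0f} MB -> {compressed * 1e6 / 1e6:.0f} MB")
```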

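Binary Quantization #

python

from qdrant_client.models import (
    BinaryQuantization,
    BinaryQuantizationConfig,
    Distance,
    VectorParams,
)

# Binary quantization keeps one bit per component, so it tends to work best
# with high-dimensional embeddings; 384 dimensions is used here only to stay
# consistent with the other examples.
client.create_collection(
    collection_name="binary_collection",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(
            always_ram=True  # keep the quantized vectors in RAM
        )
    )
)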
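
The principle of binary quantization can be sketched as follows: keep only the sign of each component as one bit, then compare vectors by counting differing bits (Hamming distance). This is a simplified illustration with made-up function names; Qdrant's actual encoding and rescoring steps are not shown.

```python
from typing import Sequence


def binarize(vector: Sequence[float]) -> int:
    """Pack one bit per component into an integer: 1 if positive, else 0."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


v1 = [0.3, -0.1, 0.8, -0.5]   # sign pattern 1010
v2 = [0.2, -0.4, 0.1, 0.9]    # sign pattern 1011
v3 = [-0.3, 0.1, -0.8, 0.5]   # sign pattern 0101, the opposite of v1

print(hamming(binarize(v1), binarize(v2)))
print(hamming(binarize(v1), binarize(v3)))
```

A small Hamming distance suggests similar direction, and a distance equal to the dimensionality means every sign is flipped. At one bit per component, a 384-dimensional vector occupies 48 bytes instead of 1536.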

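Data Consistency #

Consistency Levels #

text

Consistency levels:

1. Strong consistency
   ├── Waits for acknowledgement from all replicas
   └── Highest reliability, lowest performance

2. Eventual consistency
   ├── Waits only for acknowledgement from the primary node
   └── High performance, with possible short-lived inconsistency

3. Tunable consistency
   ├── The number of acknowledging replicas is configurable
   └── Balances reliability against performance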
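
The three levels differ in how many replica acknowledgements a write waits for. The sketch below expresses that as a small function. The level names are taken from the list above, while the function `required_acks` and its `configured` parameter are hypothetical and do not correspond to a Qdrant API.

```python
def required_acks(level: str, replicas: int, configured: int = 1) -> int:
    """Number of replica acknowledgements a write waits for at a given level."""
    if replicas < 1:
        raise ValueError("a collection needs at least one replica")
    if level == "strong":
        return replicas   # wait for every replica
    if level == "eventual":
        return 1          # wait for the primary only
    if level == "tunable":
        return min(max(configured, 1), replicas)  # clamp to a valid count
    raise ValueError(f"unknown consistency level: {level}")


for level in ("strong", "eventual", "tunable"):
    print(level, required_acks(level, replicas=3, configured=2))
```

With three replicas, the tunable level configured to 2 sits between the two extremes: the write survives the loss of one replica without waiting for the slowest one.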

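Write Acknowledgement #

python

from qdrant_client.models import WriteOrdering

# `point` is a PointStruct such as the one created earlier in this chapter.
client.upsert(
    collection_name="my_collection",
    points=[point],
    wait=True,                     # return only after the change has been applied
    ordering=WriteOrdering.STRONG  # serialize writes through the permanent leader
)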

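Summary #

This chapter introduced Qdrant's core concepts:

Collection: container for a set of vectors
Point: basic unit of storage
Vector: high-dimensional vector data
Payload: metadata
Segment: unit of storage and indexing
Indexes: HNSW and Payload indexes
Quantization: memory optimization technique

Next Steps #

With the core concepts in place, continue to the Quick Start and try Qdrant hands-on!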