Elasticsearch Core Concepts

1. Data Model Concepts

1.1 Concept Comparison Table

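Elasticsearch	Relational database	Description
Index	Database	Database (now that types are removed, an index is in practice closer to a table)
Type (deprecated)	Table	Table
Document	Row	Row / record
Field	Column	Column / field
Mapping	Schema	Table structure
Shard	Partition	Partition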

1.2 Architecture Layers

text

Elasticsearch architecture
├── Cluster
│   └── Node
│       └── Index
│           └── Shard
│               └── Segment
│                   └── Document
│                       └── Field

2. Index

2.1 What Is an Index

An index is the basic unit in which Elasticsearch stores data; it plays a role similar to a database in a relational system.

text

Index structure
┌─────────────────────────────────────┐
│           Index: products           │
├─────────────────────────────────────┤
│  Document 1: {"name": "iPhone"}     │
│  Document 2: {"name": "MacBook"}    │
│  Document 3: {"name": "iPad"}       │
└─────────────────────────────────────┘

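2.2 Index Naming Rules

text

Naming rules
├── Lowercase only
├── Must not contain \, /, *, ?, ", <, >, |, space, comma, or #
├── Must not start with -, _, or +
├── Must not be . or ..
└── Must not exceed 255 bytes

Valid examples:

products
user_logs_2024
.security (system index)

Invalid examples:

Products (uppercase)
-user_logs (starts with -)
user logs (contains a space)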
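The naming rules above are mechanical enough to check in code. The following is a minimal sketch that covers only the rules listed in this section; the function name `is_valid_index_name` is made up for illustration, and Elasticsearch itself performs the authoritative validation when an index is created.

```python
# Characters that the rules above forbid anywhere in an index name.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')


def is_valid_index_name(name: str) -> bool:
    """Check a name against the naming rules listed above (illustrative only)."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():          # must be all lowercase
        return False
    if name[0] in "-_+":              # must not start with -, _ or +
        return False
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255   # the limit is in bytes, not characters


for candidate in ["products", "user_logs_2024", "Products", "user logs"]:
    print(candidate, is_valid_index_name(candidate))
```

The length check encodes to UTF-8 first because the limit is expressed in bytes, so multi-byte characters count more than once.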

2.3 Index Patterns

text

Index pattern usage
├── Time-series indices
│   ├── logs-2024-01-01
│   ├── logs-2024-01-02
│   └── logs-2024-01-03
├── Multi-tenant indices
│   ├── tenant1_orders
│   └── tenant2_orders
└── Index alias
    └── logs -> logs-2024-01-01
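As a rough sketch of how the time-series and alias patterns fit together, the helpers below build a daily index name and the request body that would repoint an alias through the `_aliases` API. The helper names are hypothetical, and sending the request (for example with an HTTP client) is left out.

```python
from datetime import date


def daily_index_name(prefix: str, day: date) -> str:
    """Build a time-series index name such as logs-2024-01-01."""
    return f"{prefix}-{day:%Y-%m-%d}"


def alias_switch_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves an alias from one index to another.

    Both actions travel in one request, so the alias never points at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


today = daily_index_name("logs", date(2024, 1, 2))
yesterday = daily_index_name("logs", date(2024, 1, 1))
print(alias_switch_actions("logs", yesterday, today))
```

Clients keep querying `logs` while the alias is moved from one daily index to the next.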

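3. Document

3.1 Document Structure

A document is the basic unit of data stored in Elasticsearch and is represented as JSON. The example shows the shape of a response to GET /products/_doc/1: the metadata fields wrap the original document, which is returned under `_source`.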

json

{
  "_index": "products",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "name": "iPhone 15",
    "price": 999,
    "brand": "Apple",
    "tags": ["phone", "smartphone"]
  }
}

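3.2 Metadata Fields

Field	Description
`_index`	Index the document belongs to
`_id`	Unique identifier of the document
`_version`	Document version number
`_seq_no`	Sequence number (used for concurrency control)
`_primary_term`	Primary term (incremented each time a new primary shard is promoted)
`_source`	Original JSON document
`_routing`	Routing value
`_score`	Relevance score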

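3.3 Document ID Generation

text

ID generation options
├── Auto-generated
│   └── POST /products/_doc
│       └── Elasticsearch generates a 20-character, URL-safe Base64 ID
├── Manually specified
│   └── POST /products/_create/1
│       └── Uses the custom ID (the request fails if the ID already exists)
└── Routing
    └── Does not create an ID; the routing value (the _id by default) decides which shard stores the document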
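To see where the 20-character length comes from, note that 15 bytes encode to exactly 20 Base64 characters with no padding. The sketch below uses random bytes purely for illustration; it is not Elasticsearch's actual generator, and `random_doc_id` is a made-up name.

```python
import base64
import os


def random_doc_id() -> str:
    """Return a 20-character, URL-safe Base64 string built from 15 random bytes.

    Assumption: random bytes stand in for whatever Elasticsearch really encodes.
    """
    # 15 bytes = 120 bits = 20 groups of 6 bits, so no '=' padding is produced.
    return base64.urlsafe_b64encode(os.urandom(15)).decode("ascii")


print(random_doc_id())
```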

4. Mapping

4.1 What Is a Mapping

A mapping defines the fields of a document and their types, much like a schema in a database.

json

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": {
        "type": "float"
      },
      "brand": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

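4.2 Dynamic Mapping

Elasticsearch infers field types automatically:

JSON type	Elasticsearch type
null	No field is added
boolean	boolean
integer	long
float	float
string	text with a keyword sub-field
array	Depends on the first non-null element in the array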
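The table can be read as a small decision procedure. The sketch below mirrors it in Python under simplifying assumptions: it ignores date and numeric detection inside strings, and it adds an `object` case for nested JSON objects, which the table does not list. `infer_es_type` is a hypothetical helper, not an Elasticsearch API.

```python
import json


def infer_es_type(value):
    """Return the field type dynamic mapping would pick, per the table above."""
    if value is None:
        return None                      # null: no field is added
    if isinstance(value, bool):          # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text + keyword"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        for item in value:
            item_type = infer_es_type(item)
            if item_type is not None:
                return item_type
        return None
    if isinstance(value, dict):
        return "object"                  # assumption: not covered by the table
    raise TypeError(f"unsupported JSON value: {value!r}")


doc = json.loads('{"name": "iPhone 15", "price": 999, "rating": 4.5, "tags": ["phone"]}')
for field, value in doc.items():
    print(field, "->", infer_es_type(value))
```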

4.3 Explicit Mapping

json

PUT /products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "status": {
        "type": "keyword"
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100
      }
    }
  }
}
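The `scaled_float` type in the mapping above stores each value as a long after multiplying it by `scaling_factor`. A minimal sketch of that arithmetic, with hypothetical helper names and Python's built-in rounding standing in for the server-side behavior:

```python
def encode_scaled_float(value: float, scaling_factor: int) -> int:
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)


def decode_scaled_float(stored: int, scaling_factor: int) -> float:
    """Recover the approximate original value from the stored integer."""
    return stored / scaling_factor


stored = encode_scaled_float(19.99, 100)
print(stored, decode_scaled_float(stored, 100))
```

With a factor of 100, prices keep two decimal places, and anything finer is rounded away. That makes the type a good fit for currency-like fields.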

5. Shard

5.1 Shard Concept

A shard is the physical storage unit of an index; each shard is a self-contained Lucene index.

text

Index shard layout
┌─────────────────────────────────────────────┐
│              Index: products                │
│             (3 primary shards)              │
├─────────────┬─────────────┬─────────────────┤
│   Shard 0   │   Shard 1   │     Shard 2     │
│  (Primary)  │  (Primary)  │    (Primary)    │
├─────────────┼─────────────┼─────────────────┤
│   Replica   │   Replica   │     Replica     │
│     0       │     1       │       2         │
└─────────────┴─────────────┴─────────────────┘

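5.2 Primary and Replica Shards

Type	Description	Purpose
Primary shard	Shard holding the original data	Data storage, read and write operations
Replica shard	Copy of a primary shard	Failure recovery, query load balancing

5.3 Planning the Number of Shards

text

Shard planning guidelines
├── Shard size
│   └── 10-50 GB per shard is recommended
├── Shard count
│   └── At most 20 shards per GB of heap on each node (a rule of thumb)
├── Primary shard count
│   └── Cannot be changed in place after creation (requires a reindex, or the split/shrink APIs)
└── Replica count
    └── Can be adjusted dynamically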
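The sizing guidelines above reduce to simple arithmetic. The sketch below assumes a 30 GB target shard size (a value picked from the middle of the recommended range) and uses hypothetical function names; treat the results as starting points, not limits enforced by Elasticsearch.

```python
import math


def primary_shard_count(total_data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Number of primaries needed so that each shard stays near the target size."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))


def max_shards_per_node(heap_gb: float, shards_per_heap_gb: int = 20) -> int:
    """Upper bound on shards per node under the 20-shards-per-GB-of-heap guideline."""
    return int(heap_gb * shards_per_heap_gb)


def total_shards(primaries: int, replicas: int) -> int:
    """Every primary has `replicas` copies, so the index holds primaries * (1 + replicas)."""
    return primaries * (1 + replicas)


primaries = primary_shard_count(300)          # 300 GB of data
print(primaries, total_shards(primaries, 1), max_shards_per_node(30))
```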

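5.4 Shard Routing

text

Document routing formula
shard_num = hash(_routing) % num_primary_shards

The document ID is used as the routing value by default
A custom routing value can be supplied to control how data is distributed
Because the result depends on num_primary_shards, changing that number would send existing documents to different shards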
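The routing formula can be tried out directly. In the sketch below, `zlib.crc32` is only a stand-in hash chosen because it ships with Python; Elasticsearch uses its own hash function, so the shard numbers printed here will not match a real cluster. `route_to_shard` is a hypothetical name.

```python
import zlib


def route_to_shard(routing: str, num_primary_shards: int) -> int:
    """Apply shard_num = hash(_routing) % num_primary_shards.

    Assumption: crc32 replaces the hash Elasticsearch actually uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# The same routing value always lands on the same shard ...
print(route_to_shard("user-42", 3), route_to_shard("user-42", 3))
# ... but the result depends on the primary count, which is why that count is fixed.
print(route_to_shard("user-42", 5))
```

Using a tenant or user ID as the routing value keeps all documents sharing it on one shard, so a query that passes the same value only has to visit that shard.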

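6. Node

6.1 Node Types

Node role	Config value	Function
Master-eligible node	master	Cluster state management, index creation and deletion
Data node	data	Data storage, search, aggregation
Coordinating-only node	[] (empty node.roles)	Request routing, merging of results
Ingest node	ingest	Pipeline processing, data transformation
Machine learning node	ml	Machine learning jobs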

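6.2 Node Configuration

yaml

# Alternatives for elasticsearch.yml; set node.roles once per node.
# Multi-role node
node.roles: [master, data, ingest]

# Dedicated master-eligible node
node.roles: [master]

# Dedicated data node
node.roles: [data]

# Dedicated ingest node
node.roles: [ingest]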

6.3 Example Node Architecture

text

Production cluster architecture
┌─────────────────────────────────────────────┐
│                   Cluster                   │
├─────────────┬─────────────┬─────────────────┤
│   Node-1    │   Node-2    │     Node-3      │
│  (Master)   │  (Master)   │    (Master)     │
│  eligible   │  eligible   │    eligible     │
├─────────────┼─────────────┼─────────────────┤
│   Node-4    │   Node-5    │     Node-6      │
│   (Data)    │   (Data)    │     (Data)      │
├─────────────┼─────────────┼─────────────────┤
│   Node-7    │   Node-8    │                 │
│  (Coord.)   │  (Coord.)   │                 │
└─────────────┴─────────────┴─────────────────┘

7. Inverted Index

7.1 How the Inverted Index Works

The inverted index is the core data structure behind Elasticsearch's fast full-text search.

text

Forward index (traditional)
Document ID -> document content
┌────┬─────────────────────────────┐
│ 1  │ "Elasticsearch is fast"     │
│ 2  │ "Search with Elasticsearch" │
│ 3  │ "Fast search engine"        │
└────┴─────────────────────────────┘

Inverted index
Term -> list of document IDs
┌───────────────┬────────────┐
│ elasticsearch │ [1, 2]     │
│ fast          │ [1, 3]     │
│ search        │ [2, 3]     │
│ is            │ [1]        │
│ with          │ [2]        │
│ engine        │ [3]        │
└───────────────┴────────────┘
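The forward-to-inverted transformation shown above can be reproduced in a few lines. This is a toy sketch: the analyzer is reduced to lowercasing and splitting on whitespace, which is an assumption made for brevity, not how Elasticsearch's analyzers work in general.

```python
from collections import defaultdict

# The three documents from the forward index above.
DOCS = {
    1: "Elasticsearch is fast",
    2: "Search with Elasticsearch",
    3: "Fast search engine",
}


def build_inverted_index(docs: dict) -> dict:
    """Map each lowercased term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():    # toy analyzer: lowercase + whitespace split
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


INVERTED = build_inverted_index(DOCS)
for term, doc_ids in INVERTED.items():
    print(f"{term:<14}{doc_ids}")
```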

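7.2 Structure of the Inverted Index

text

Components of the inverted index
├── Term Dictionary
│   └── Sorted list of all terms
├── Term Index
│   └── Index over the dictionary, used to locate terms quickly
└── Postings List
    └── List of IDs of the documents that contain the term

7.3 Search Process

text

Search for "elasticsearch"
├── 1. Look up the Term Index
│   └── Locate the position in the term dictionary
├── 2. Scan the Term Dictionary
│   └── Find "elasticsearch"
├── 3. Fetch the Postings List
│   └── Returns [1, 2]
└── 4. Return the documents
    └── Fetch the document contents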
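The four search steps can be sketched with plain Python structures. Here a binary search over the sorted term list stands in for the term index and dictionary lookup; Lucene's real term index is a far more compact structure, so this only illustrates the flow. The data repeats the example from section 7.1.

```python
import bisect

# Term dictionary (sorted) with a parallel list of postings lists.
TERMS = ["elasticsearch", "engine", "fast", "is", "search", "with"]
POSTINGS = [[1, 2], [3], [1, 3], [1], [2, 3], [2]]
DOCS = {
    1: "Elasticsearch is fast",
    2: "Search with Elasticsearch",
    3: "Fast search engine",
}


def search(term: str) -> list:
    """Return the contents of all documents containing the term."""
    term = term.lower()
    # Steps 1-2: locate the term in the sorted dictionary (binary search).
    pos = bisect.bisect_left(TERMS, term)
    if pos == len(TERMS) or TERMS[pos] != term:
        return []
    # Step 3: fetch the postings list. Step 4: load the documents themselves.
    return [DOCS[doc_id] for doc_id in POSTINGS[pos]]


print(search("elasticsearch"))
```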

8. Segment

8.1 Segment Concept

A segment is an immutable unit of index storage in Lucene.

text

Index write path
┌─────────────────────────────────────┐
│          Index Buffer               │
│        (in-memory buffer)           │
└──────────────┬──────────────────────┘
               │ refresh (every 1s by default)
               ↓
┌─────────────────────────────────────┐
│           Segment                   │
│  (segment in the filesystem cache)  │
└─────────────────────────────────────┘
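The write path above explains why a freshly indexed document is not searchable right away. The toy model below captures only that behavior: documents sit in a buffer until a refresh turns the buffer into an immutable segment. `ToyShard` is a made-up class with no relation to real Elasticsearch internals.

```python
class ToyShard:
    """Toy model of the buffer -> refresh -> segment path."""

    def __init__(self):
        self.buffer = []      # indexed documents that are not yet searchable
        self.segments = []    # immutable segments, modelled as tuples

    def index(self, doc: str) -> None:
        self.buffer.append(doc)

    def refresh(self) -> None:
        """Turn the current buffer into a new segment and clear the buffer."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word: str) -> list:
        """Search only looks at segments, never at the buffer."""
        word = word.lower()
        return [doc for seg in self.segments for doc in seg
                if word in doc.lower().split()]


shard = ToyShard()
shard.index("Fast search engine")
print(shard.search("fast"))   # nothing yet: the document is still in the buffer
shard.refresh()
print(shard.search("fast"))   # visible after the refresh
```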

8.2 Segment Merging

text

Segment merge process
┌─────┐ ┌─────┐ ┌─────┐
│Seg 1│ │Seg 2│ │Seg 3│
└──┬──┘ └──┬──┘ └──┬──┘
   │       │       │
   └───────┴───────┘
           │ merge
           ↓
   ┌───────────────┐
   │   New Segment │
   └───────────────┘
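A merge, as pictured above, combines several small segments into one larger segment. The sketch below models each segment as a term-to-postings dictionary and also drops deleted documents during the merge, which is something merges do but the diagram does not show. The function and parameter names are hypothetical.

```python
def merge_segments(segments, deleted_ids=frozenset()):
    """Merge term -> doc-ID postings from several segments into one new segment."""
    combined = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            combined.setdefault(term, set()).update(doc_ids)

    merged = {}
    for term in sorted(combined):
        live = sorted(combined[term] - set(deleted_ids))   # skip deleted documents
        if live:
            merged[term] = live
    return merged


seg1 = {"fast": [1]}
seg2 = {"fast": [3], "search": [2, 3]}
seg3 = {"search": [4]}
print(merge_segments([seg1, seg2, seg3]))
print(merge_segments([seg1, seg2, seg3], deleted_ids={3}))
```

The inputs are never modified, which matches the immutability of segments: a merge writes a new segment and the old ones are discarded afterwards.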

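8.3 Near-Real-Time Search

text

Data visibility timeline
├── 0ms: document is written to the in-memory buffer and appended to the translog
├── 1s: refresh creates a segment (searchable, not yet fsynced)
├── 5s: translog fsync (only with async durability; by default the translog is fsynced on every request)
└── 30m: flush, a Lucene commit that fsyncs segments to disk and trims the translog (periodic in older releases; also triggered by translog size)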

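9. Cluster State

9.1 Cluster Health Status

Status	Meaning	Description
Green	Healthy	All primary and replica shards are assigned
Yellow	Warning	All primary shards are assigned, but at least one replica is not
Red	Failure	At least one primary shard is unassigned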
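The three colors follow directly from which kind of shard is unassigned. A minimal sketch of that rule, using a hypothetical helper that takes the two counts as inputs:

```python
def cluster_health(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Derive the health color from the numbers of unassigned shards."""
    if unassigned_primaries > 0:
        return "red"       # some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"    # all data is available, but redundancy is reduced
    return "green"


print(cluster_health(0, 0), cluster_health(0, 2), cluster_health(1, 0))
```

A single-node cluster whose indices are configured with one replica therefore reports yellow, since a replica is never placed on the same node as its primary.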

9.2 Checking the Status

bash

GET /_cluster/health

{
  "status": "green",
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 10,
  "active_shards": 20,
  "unassigned_shards": 0
}

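10. Summary

This chapter introduced the core concepts of Elasticsearch:

An index is a logical container for data
A document is the basic unit of storage
A mapping defines field types and structure
Shards distribute data across the cluster
The inverted index is what makes search fast
A segment is an immutable unit of storage

Next, we will learn the basic syntax of Elasticsearch.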