# Elasticsearch Analyzers

## 1. Analyzer Overview

### 1.1 Analyzer Components
```text
Analyzer components
├── Character filters
│   └── Preprocess the raw text
├── Tokenizer
│   └── Split the text into terms
└── Token filters
    └── Post-process the terms after splitting
```
### 1.2 The Analysis Process
```text
Text analysis flow
Raw text
   ↓
Character filters (0 or more)
   ↓
Tokenizer (exactly 1)
   ↓
Token filters (0 or more)
   ↓
Term list
```
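The flow above can be sketched as a simple function pipeline. This is an illustrative Python sketch of the three-stage model, not Elasticsearch's actual implementation:

```python
# Illustrative sketch of the analyzer pipeline:
# char filters -> one tokenizer -> token filters.
def analyze(text, char_filters=(), tokenizer=str.split, token_filters=()):
    for cf in char_filters:        # 0 or more character filters rewrite the text
        text = cf(text)
    tokens = tokenizer(text)       # exactly 1 tokenizer splits it into terms
    for tf in token_filters:       # 0 or more token filters transform the terms
        tokens = tf(tokens)
    return tokens

tokens = analyze(
    "The Quick Brown Fox",
    token_filters=[lambda ts: [t.lower() for t in ts]],
)
print(tokens)
# → ['the', 'quick', 'brown', 'fox']
```

With no character or token filters configured, the output is simply the tokenizer's terms, which matches the "0 or more" stages in the diagram.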
## 2. Built-in Analyzers

### 2.1 The standard Analyzer (default)

```bash
POST /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox"
}
```
Result:

```json
{
  "tokens": [
    { "token": "the", "position": 0 },
    { "token": "quick", "position": 1 },
    { "token": "brown", "position": 2 },
    { "token": "fox", "position": 3 }
  ]
}
```
### 2.2 The simple Analyzer

```bash
POST /_analyze
{
  "analyzer": "simple",
  "text": "The Quick Brown Fox"
}
```

Keeps only letters: the text is split on every non-letter character, and the tokens are lowercased.
### 2.3 The whitespace Analyzer

```bash
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The Quick Brown Fox"
}
```

Splits on whitespace only; tokens are not lowercased.
### 2.4 The stop Analyzer

```bash
POST /_analyze
{
  "analyzer": "stop",
  "text": "The Quick Brown Fox"
}
```

Like simple, but additionally removes stop words (English stop words by default).
### 2.5 The keyword Analyzer

```bash
POST /_analyze
{
  "analyzer": "keyword",
  "text": "The Quick Brown Fox"
}
```

Performs no tokenization at all; the entire input is emitted as a single term.
### 2.6 The pattern Analyzer

```bash
POST /_analyze
{
  "analyzer": "pattern",
  "text": "The-Quick-Brown-Fox"
}
```

Splits on a regular expression (default `\W+`) and lowercases the tokens.
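The default behavior can be approximated with Python's `re` module. This is only a rough sketch of the pattern analyzer's split-and-lowercase behavior, not the actual Lucene implementation:

```python
import re

# Sketch: split on the default \W+ pattern, drop empties, lowercase.
def pattern_analyze(text, pattern=r"\W+"):
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze("The-Quick-Brown-Fox"))
# → ['the', 'quick', 'brown', 'fox']
```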
### 2.7 Language Analyzers

```bash
POST /_analyze
{
  "analyzer": "english",
  "text": "The Quick Brown Foxes"
}
```

Language analyzers apply language-specific stop words and stemming; here `english` reduces "Foxes" to "fox". Built-in options include english, french, german, and cjk (for Chinese, Japanese, and Korean), among many others.
### 2.8 The fingerprint Analyzer

```bash
POST /_analyze
{
  "analyzer": "fingerprint",
  "text": "The Quick Brown Fox"
}
```

Lowercases, sorts, and de-duplicates the terms, then concatenates them into a single output token.
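The fingerprinting step is easy to sketch in Python. This is an approximation of the analyzer's lowercase → dedupe → sort → join behavior, not the real implementation:

```python
# Sketch: lowercase, split on whitespace, dedupe, sort, join with one space.
def fingerprint(text):
    return " ".join(sorted(set(text.lower().split())))

print(fingerprint("The Quick Brown Fox"))
# → 'brown fox quick the'
```

Because the output is a single normalized token, the fingerprint analyzer is handy for duplicate detection and clustering.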
## 3. Chinese Analyzers

### 3.1 The IK Tokenizer

Install:

```bash
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip
```

Use:

```bash
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}
```
IK tokenization modes:

| Mode | Description |
|---|---|
| ik_smart | Coarse-grained, "smart" segmentation |
| ik_max_word | Exhaustive, fine-grained segmentation |
### 3.2 Custom Dictionaries

The example below registers a custom analyzer built on `ik_max_word`. The dictionary entries themselves are maintained in the plugin's `config/IKAnalyzer.cfg.xml` (its `ext_dict` and `ext_stopwords` entries).

```bash
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}
```
## 4. Custom Analyzers

### 4.1 Basic Configuration

```bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and", "| => or"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[\\W_]+"
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a", "an"]
        },
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "quick,fast",
            "big,large"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase", "my_stopwords", "my_synonyms"]
        }
      }
    }
  }
}
```
### 4.2 Testing the Analyzer

```bash
POST /products/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The Quick & Big"
}
```
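To build intuition for what the custom analyzer does to this input, the stages can be simulated in Python. This is an illustrative sketch only; the names `CHAR_MAP`, `STOPWORDS`, and `SYNONYMS` are stand-ins for the settings configured above, and the synonym handling is simplified to plain expansion:

```python
import re

# Sketch of my_analyzer: mapping char filter -> pattern tokenizer
# -> lowercase -> stop filter -> synonym expansion.
CHAR_MAP = {"&": "and", "|": "or"}
STOPWORDS = {"the", "a", "an"}
SYNONYMS = {"quick": ["quick", "fast"], "big": ["big", "large"]}

def my_analyzer(text):
    for old, new in CHAR_MAP.items():                     # character filter
        text = text.replace(old, new)
    tokens = [t for t in re.split(r"[\W_]+", text) if t]  # pattern tokenizer
    tokens = [t.lower() for t in tokens]                  # lowercase filter
    tokens = [t for t in tokens if t not in STOPWORDS]    # stop filter
    out = []
    for t in tokens:                                      # synonym expansion
        out.extend(SYNONYMS.get(t, [t]))
    return out

print(my_analyzer("The Quick & Big"))
# → ['quick', 'fast', 'and', 'big', 'large']
```

Note how "&" is rewritten to "and" before tokenization, "The" is dropped as a stop word, and "quick"/"big" are expanded with their synonyms.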
## 5. Character Filters

### 5.1 The mapping Character Filter

```bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            "零 => 0",
            "一 => 1",
            "二 => 2"
          ]
        }
      }
    }
  }
}
```

Here the mappings rewrite the Chinese numerals 零/一/二 to the digits 0/1/2 before tokenization.
### 5.2 The html_strip Character Filter

```bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_html_strip": {
          "type": "html_strip",
          "escaped_tags": ["script", "style"]
        }
      }
    }
  }
}
```

Removes HTML markup from the text; the tags listed in `escaped_tags` are preserved rather than stripped.
### 5.3 The pattern_replace Character Filter

```bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(\\d+)",
          "replacement": "$1$2"
        }
      }
    }
  }
}
```

With this configuration, a value such as "123-456" is rewritten to "123456" before tokenization.
## 6. Tokenizers

The fragments below belong under `settings.analysis.tokenizer` in the index settings.

### 6.1 The standard Tokenizer

```json
"my_standard": {
  "type": "standard",
  "max_token_length": 255
}
```
### 6.2 The keyword Tokenizer

```json
"my_keyword": {
  "type": "keyword",
  "buffer_size": 256
}
```
### 6.3 The pattern Tokenizer

```json
"my_pattern": {
  "type": "pattern",
  "pattern": "\\s+",
  "lowercase": true
}
```
### 6.4 The ngram Tokenizer

```json
"my_ngram": {
  "type": "ngram",
  "min_gram": 1,
  "max_gram": 2,
  "token_chars": ["letter", "digit"]
}
```
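For a single word, the grams this configuration emits can be sketched in Python. This is an approximation for illustration (it ignores `token_chars` handling), not the actual tokenizer:

```python
# Sketch: all substrings of length min_gram..max_gram, per start position.
def ngrams(text, min_gram=1, max_gram=2):
    return [text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)]

print(ngrams("fox"))
# → ['f', 'fo', 'o', 'ox', 'x']
```

Small `max_gram` values keep the number of generated terms (and index size) manageable.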
### 6.5 The edge_ngram Tokenizer

```json
"my_edge_ngram": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 5,
  "token_chars": ["letter", "digit"]
}
```
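Unlike ngram, edge_ngram anchors every gram at the start of the token, which is what makes it useful for autocomplete. A rough Python sketch of the grams it emits for one word (again ignoring `token_chars`):

```python
# Sketch: prefixes of length min_gram..max_gram, anchored at the start.
def edge_ngrams(text, min_gram=1, max_gram=5):
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

print(edge_ngrams("elastic"))
# → ['e', 'el', 'ela', 'elas', 'elast']
```

At query time a user typing "ela" then matches the indexed prefix grams directly.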
### 6.6 The uax_url_email Tokenizer

```json
"my_uax": {
  "type": "uax_url_email",
  "max_token_length": 255
}
```

Like standard, but keeps URLs and email addresses as single tokens.
### 6.7 The path_hierarchy Tokenizer

```json
"my_path": {
  "type": "path_hierarchy",
  "delimiter": "/",
  "replacement": "-",
  "skip": 0
}
```
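The tokenizer emits one token per prefix of the path, so a document can be found by any of its ancestor directories. A simplified Python sketch, assuming an absolute path, the default `skip` of 0, and no `replacement` rewriting:

```python
# Sketch: emit every path prefix as a token ("/a", "/a/b", "/a/b/c", ...).
def path_tokens(path, delimiter="/"):
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[:i + 1]) for i in range(len(parts))]

print(path_tokens("/usr/local/bin"))
# → ['/usr', '/usr/local', '/usr/local/bin']
```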
## 7. Token Filters

The fragments below belong under `settings.analysis.filter` in the index settings.

### 7.1 The lowercase Filter

```json
"my_lowercase": {
  "type": "lowercase"
}
```

Lowercases every token.
### 7.2 The stop Filter

```json
"my_stop": {
  "type": "stop",
  "stopwords": ["the", "a", "an"],
  "ignore_case": true
}
```

Removes the configured stop words from the token stream.
### 7.3 The synonym Filter

```json
"my_synonym": {
  "type": "synonym",
  "synonyms": [
    "quick,fast,speedy",
    "big,large,huge"
  ]
}
```

Or load synonyms from a file (the path is relative to the config directory):

```json
"my_synonym": {
  "type": "synonym",
  "synonyms_path": "analysis/synonyms.txt"
}
```
### 7.4 The synonym_graph Filter

```json
"my_synonym_graph": {
  "type": "synonym_graph",
  "synonyms": [
    "quick,fast",
    "usa,united states,u s a,united states of america"
  ]
}
```

Unlike synonym, it handles multi-word synonyms such as "united states" correctly; it is intended for use at search time only.
### 7.5 The stemmer Filter

```json
"my_stemmer": {
  "type": "stemmer",
  "language": "english"
}
```

Applies language-specific algorithmic stemming.
### 7.6 The porter_stem Filter

```json
"my_porter": {
  "type": "porter_stem"
}
```

The classic Porter stemming algorithm for English.
### 7.7 The snowball Filter

```json
"my_snowball": {
  "type": "snowball",
  "language": "English"
}
```

A Snowball-generated stemmer; note the capitalized language name.
### 7.8 The ngram Filter

```json
"my_ngram_filter": {
  "type": "ngram",
  "min_gram": 1,
  "max_gram": 2
}
```

Like the ngram tokenizer, but applied to each token the tokenizer has already produced.
### 7.9 The edge_ngram Filter

```json
"my_edge_ngram_filter": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 5
}
```
### 7.10 The word_delimiter Filter

```json
"my_word_delimiter": {
  "type": "word_delimiter",
  "split_on_case_change": true,
  "split_on_numerics": true,
  "stem_english_possessive": true
}
```

Splits tokens on intra-word boundaries such as case changes ("PowerShot") and letter/digit transitions ("SD500").
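A rough approximation of the case-change and letter/digit splitting can be written with a regular expression. This sketch ignores the filter's many other options (possessive stemming, catenation, etc.):

```python
import re

# Sketch: split on case changes and letter/digit boundaries.
def word_delimiter(text):
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)

print(word_delimiter("PowerShot SD500"))
# → ['Power', 'Shot', 'SD', '500']
```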
### 7.11 The unique Filter

```json
"my_unique": {
  "type": "unique",
  "only_on_same_position": true
}
```

Removes duplicate tokens from the stream.
### 7.12 The length Filter

```json
"my_length": {
  "type": "length",
  "min": 2,
  "max": 10
}
```

Drops tokens shorter than `min` or longer than `max` characters.
### 7.13 The trim Filter

```json
"my_trim": {
  "type": "trim"
}
```

Strips leading and trailing whitespace from each token.
### 7.14 The reverse Filter

```json
"my_reverse": {
  "type": "reverse"
}
```

Reverses each token, which is useful for efficient suffix (leading-wildcard) matching.
## 8. Index-time vs. Search-time Analyzers

### 8.1 Configuring Different Analyzers

```bash
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stemmer"]
        },
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_index_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
```

In most setups the same analysis should run at both index and search time so that terms line up; a separate `search_analyzer` is most useful for patterns such as edge_ngram autocomplete (grams at index time, plain tokens at search time) or search-time synonym expansion.
## 9. Best Practices

### 9.1 Choosing an Analyzer

```text
Analyzer selection guide
├── English text
│   └── english or standard
├── Chinese text
│   └── ik_smart or ik_max_word
├── Exact matching
│   └── keyword
├── Autocomplete
│   └── edge_ngram
└── Custom requirements
    └── Custom analyzer
```
### 9.2 Performance Tuning

```text
Performance tips
├── Keep the number of token filters small
├── Avoid complex regular expressions
├── Keep ngram ranges narrow
└── Test analyzer performance with the _analyze API
```
## 10. Summary

This chapter covered Elasticsearch analyzers:

- An analyzer consists of character filters, a tokenizer, and token filters
- A range of built-in analyzers covers common needs
- Chinese text requires installing a plugin tokenizer such as IK
- Custom analyzers can be assembled for specific requirements
- Index time and search time can use different analyzers
- Choosing the right analyzer improves search quality

Next, we will look at index templates.

Last updated: 2026-03-27