Elasticsearch Analyzers #

1. Analyzer Overview #

1.1 Analyzer Components #

text
Analyzer components
├── Character Filters
│   └── Preprocess the raw text
├── Tokenizer
│   └── Split the text into terms
└── Token Filters
    └── Process the emitted terms

1.2 Analysis Process #

text
Text analysis flow
Raw text
    ↓
Character Filters (0 or more)
    ↓
Tokenizer (exactly 1)
    ↓
Token Filters (0 or more)
    ↓
List of terms
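
The flow above can be sketched in plain Python. The helper names (`analyze`, `strip_tags`, and so on) are hypothetical, not part of any Elasticsearch API; the point is only the ordering of the three stages:

```python
import re

def strip_tags(text):
    # Crude tag removal, standing in for a character filter such as html_strip.
    return re.sub(r"<[^>]+>", " ", text)

def split_words(text):
    # Standing in for a tokenizer: split on runs of non-word characters.
    return [w for w in re.split(r"\W+", text) if w]

def lowercase(tokens):
    # Standing in for the lowercase token filter.
    return [t.lower() for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    for cf in char_filters:        # 0 or more character filters, in order
        text = cf(text)
    tokens = tokenizer(text)       # exactly 1 tokenizer
    for tf in token_filters:       # 0 or more token filters, in order
        tokens = tf(tokens)
    return tokens

print(analyze("<b>The Quick</b> Brown Fox",
              [strip_tags], split_words, [lowercase]))
# -> ['the', 'quick', 'brown', 'fox']
```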

2. Built-in Analyzers #

2.1 standard analyzer (default) #

bash
POST /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox"
}

Result:

json
{
  "tokens": [
    { "token": "the", "position": 0 },
    { "token": "quick", "position": 1 },
    { "token": "brown", "position": 2 },
    { "token": "fox", "position": 3 }
  ]
}

2.2 simple analyzer #

bash
POST /_analyze
{
  "analyzer": "simple",
  "text": "The Quick Brown Fox"
}

Splits on any non-letter character and lowercases the terms; only letters are kept.

2.3 whitespace analyzer #

bash
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The Quick Brown Fox"
}

Splits on whitespace; does not lowercase.

2.4 stop analyzer #

bash
POST /_analyze
{
  "analyzer": "stop",
  "text": "The Quick Brown Fox"
}

Like the simple analyzer, but also removes stop words (English by default).

2.5 keyword analyzer #

bash
POST /_analyze
{
  "analyzer": "keyword",
  "text": "The Quick Brown Fox"
}

No tokenization; the entire input is emitted as a single term.

2.6 pattern analyzer #

bash
POST /_analyze
{
  "analyzer": "pattern",
  "text": "The-Quick-Brown-Fox"
}

Splits on a regular expression (default \W+).
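
Since the default pattern is \W+ and the analyzer lowercases by default, its behavior on the request above can be approximated in Python:

```python
import re

# Approximation of the pattern analyzer with its default pattern \W+:
# split on runs of non-word characters, then lowercase.
text = "The-Quick-Brown-Fox"
tokens = [t.lower() for t in re.split(r"\W+", text) if t]
print(tokens)
# -> ['the', 'quick', 'brown', 'fox']
```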

2.7 Language analyzers #

bash
POST /_analyze
{
  "analyzer": "english",
  "text": "The Quick Brown Foxes"
}

Supported languages include english, french, german, cjk, and many others (there is no built-in chinese analyzer; see the next section).

2.8 fingerprint analyzer #

bash
POST /_analyze
{
  "analyzer": "fingerprint",
  "text": "The Quick Brown Fox"
}

Lowercases, deduplicates, and sorts the tokens, then concatenates them into a single term.
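
The core of that logic is easy to approximate. This is a sketch only; the real analyzer also applies ASCII folding and optional stop word removal:

```python
# Sketch of the fingerprint analyzer: lowercase, split on whitespace,
# deduplicate, sort, and join back into one token.
def fingerprint(text):
    words = sorted(set(text.lower().split()))
    return " ".join(words)

print(fingerprint("The Quick Brown Fox the FOX"))
# -> 'brown fox quick the'
```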

3. Chinese Analyzers #

3.1 IK Analyzer #

Installation

bash
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip

Usage

bash
POST /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}

IK tokenization modes

| Mode | Description |
| --- | --- |
| ik_smart | Smart segmentation, coarse-grained |
| ik_max_word | Maximal segmentation, fine-grained |

3.2 Custom Dictionary #

Custom dictionaries for IK are configured in the plugin's config/IKAnalyzer.cfg.xml file. The settings below define a custom analyzer that uses the ik_max_word tokenizer:

bash
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word"
        }
      }
    }
  }
}

4. Custom Analyzers #

4.1 Basic Configuration #

bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and", "| => or"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[\\W_]+"
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a", "an"]
        },
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [
            "quick,fast",
            "big,large"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase", "my_stopwords", "my_synonyms"]
        }
      }
    }
  }
}

4.2 Testing the Analyzer #

bash
POST /products/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The Quick & Big"
}
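
What this request returns can be traced by hand through the pipeline defined above. Here is a rough Python simulation of that analyzer (hypothetical helper names; a real synonym filter also tracks token positions, which this sketch skips):

```python
import re

def my_analyzer(text):
    # char_filter: mapping "& => and", "| => or"
    text = text.replace("&", "and").replace("|", "or")
    # tokenizer: pattern [\W_]+
    tokens = [t for t in re.split(r"[\W_]+", text) if t]
    # filter: lowercase
    tokens = [t.lower() for t in tokens]
    # filter: stop words "the", "a", "an"
    tokens = [t for t in tokens if t not in {"the", "a", "an"}]
    # filter: synonyms, expanded in place (simplified)
    synonyms = {"quick": ["fast"], "fast": ["quick"],
                "big": ["large"], "large": ["big"]}
    out = []
    for t in tokens:
        out.append(t)
        out.extend(synonyms.get(t, []))
    return out

print(my_analyzer("The Quick & Big"))
# -> ['quick', 'fast', 'and', 'big', 'large']
```

Note how "&" becomes the indexable token "and", "The" is dropped as a stop word, and the synonyms quick/fast and big/large are both indexed.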

5. Character Filters #

5.1 mapping character filter #

bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [
            "零 => 0",
            "一 => 1",
            "二 => 2"
          ]
        }
      }
    }
  }
}

5.2 html_strip character filter #

bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_html_strip": {
          "type": "html_strip",
          "escaped_tags": ["script", "style"]
        }
      }
    }
  }
}

5.3 pattern_replace character filter #

bash
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(\\d+)",
          "replacement": "$1$2"
        }
      }
    }
  }
}

6. Tokenizers #

6.1 standard tokenizer #

json
"my_standard": {
  "type": "standard",
  "max_token_length": 255
}

6.2 keyword tokenizer #

json
"my_keyword": {
  "type": "keyword",
  "buffer_size": 256
}

6.3 pattern tokenizer #

json
"my_pattern": {
  "type": "pattern",
  "pattern": "\\s+",
  "lowercase": true
}

6.4 ngram tokenizer #

json
"my_ngram": {
  "type": "ngram",
  "min_gram": 1,
  "max_gram": 2,
  "token_chars": ["letter", "digit"]
}
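
Conceptually, the ngram tokenizer slides windows of min_gram to max_gram characters across each word. A sketch of that generation step (it ignores token_chars, which in the real tokenizer first splits the input on characters outside the listed classes):

```python
# N-gram generation for one word, min_gram=1, max_gram=2,
# in the position order Elasticsearch emits them.
def ngrams(word, min_gram=1, max_gram=2):
    return [word[i:i + n]
            for i in range(len(word))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(word)]

print(ngrams("fox"))
# -> ['f', 'fo', 'o', 'ox', 'x']
```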

6.5 edge_ngram tokenizer #

json
"my_edge_ngram": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 5,
  "token_chars": ["letter", "digit"]
}
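
Unlike ngram, edge_ngram emits only prefixes anchored at the start of each word, which is what makes it suitable for autocomplete. A sketch:

```python
# Edge n-gram generation: prefixes of length min_gram..max_gram.
def edge_ngrams(word, min_gram=1, max_gram=5):
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(edge_ngrams("quick"))
# -> ['q', 'qu', 'qui', 'quic', 'quick']
```

Indexing these prefixes lets a query for "qui" match the document containing "quick" without any wildcard search.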

6.6 uax_url_email tokenizer #

json
"my_uax": {
  "type": "uax_url_email",
  "max_token_length": 255
}

6.7 path_hierarchy tokenizer #

json
"my_path": {
  "type": "path_hierarchy",
  "delimiter": "/",
  "replacement": "-",
  "skip": 0
}
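
With this configuration, path_hierarchy emits one token per path prefix, rewriting the delimiter "/" to the replacement "-". An approximation of that behavior (the skip parameter is omitted for brevity):

```python
# Sketch of path_hierarchy with delimiter "/" and replacement "-":
# emit every ancestor path of the input, delimiter rewritten.
def path_hierarchy(path, delimiter="/", replacement="-"):
    parts = [p for p in path.split(delimiter) if p]
    prefix = replacement if path.startswith(delimiter) else ""
    tokens, current = [], prefix
    for i, part in enumerate(parts):
        current = current + part if i == 0 else current + replacement + part
        tokens.append(current)
    return tokens

print(path_hierarchy("/usr/local/bin"))
# -> ['-usr', '-usr-local', '-usr-local-bin']
```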

7. Token Filters #

7.1 lowercase filter #

json
"my_lowercase": {
  "type": "lowercase"
}

7.2 stop filter #

json
"my_stop": {
  "type": "stop",
  "stopwords": ["the", "a", "an"],
  "ignore_case": true
}

7.3 synonym filter #

json
"my_synonym": {
  "type": "synonym",
  "synonyms": [
    "quick,fast,speedy",
    "big,large,huge"
  ]
}

Or load synonyms from a file:

json
"my_synonym": {
  "type": "synonym",
  "synonyms_path": "analysis/synonyms.txt"
}

7.4 synonym_graph filter #

json
"my_synonym_graph": {
  "type": "synonym_graph",
  "synonyms": [
    "quick,fast",
    "usa,united states,u s a,united states of america"
  ]
}

7.5 stemmer filter #

json
"my_stemmer": {
  "type": "stemmer",
  "language": "english"
}

7.6 porter_stem filter #

json
"my_porter": {
  "type": "porter_stem"
}

7.7 snowball filter #

json
"my_snowball": {
  "type": "snowball",
  "language": "English"
}

7.8 ngram filter #

json
"my_ngram_filter": {
  "type": "ngram",
  "min_gram": 1,
  "max_gram": 2
}

7.9 edge_ngram filter #

json
"my_edge_ngram_filter": {
  "type": "edge_ngram",
  "min_gram": 1,
  "max_gram": 5
}

7.10 word_delimiter filter #

json
"my_word_delimiter": {
  "type": "word_delimiter",
  "split_on_case_change": true,
  "split_on_numerics": true,
  "stem_english_possessive": true
}
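
A rough approximation of the splitting behavior configured above, splitting on case changes and letter/digit boundaries. This is a simplification; the real filter has many more options (catenate_words, generate_number_parts, and so on):

```python
import re

# Rough sketch of word_delimiter: break a token at case changes and
# at boundaries between letters and digits.
def word_delimiter(token):
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)

print(word_delimiter("PowerShot500"))
# -> ['Power', 'Shot', '500']
print(word_delimiter("XMLParser"))
# -> ['XML', 'Parser']
```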

7.11 unique filter #

json
"my_unique": {
  "type": "unique",
  "only_on_same_position": true
}

7.12 length filter #

json
"my_length": {
  "type": "length",
  "min": 2,
  "max": 10
}

7.13 trim filter #

json
"my_trim": {
  "type": "trim"
}

7.14 reverse filter #

json
"my_reverse": {
  "type": "reverse"
}

8. Index-time vs. Search-time Analyzers #

8.1 Configuring Different Analyzers #

bash
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stemmer"]
        },
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_index_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
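
A common reason to split index-time and search-time analysis is autocomplete: index edge_ngram prefixes of each word, but analyze the query as typed. A toy model of the matching (hypothetical function names; this illustrates the prefix pattern, not the exact stemmer configuration above):

```python
# A document matches when the search-time terms are found among the
# index-time terms. Indexing prefixes makes partially typed queries match.
def prefix_terms(word, min_gram=1, max_gram=10):
    return {word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)}

index_terms = prefix_terms("elasticsearch")  # index-time analyzer output
search_terms = {"elas"}                      # search-time: query as typed
print(search_terms <= index_terms)
# -> True: the prefix the user typed matches the indexed document
```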

9. Best Practices #

9.1 Choosing an Analyzer #

text
Choosing an analyzer
├── English text
│   └── english or standard
├── Chinese text
│   └── ik_smart or ik_max_word
├── Exact matching
│   └── keyword
├── Autocomplete
│   └── edge_ngram
└── Custom requirements
    └── Custom analyzer

9.2 Performance Optimization #

text
Performance tips
├── Keep the token filter chain short
├── Avoid complex regular expressions
├── Set ngram ranges conservatively
└── Test analyzer performance with _analyze

10. Summary #

This chapter introduced Elasticsearch analyzers:

  1. An analyzer is composed of character filters, a tokenizer, and token filters
  2. A range of built-in analyzers covers common needs
  3. Chinese text requires a dedicated tokenizer such as IK
  4. Custom analyzers can be built for specific requirements
  5. Indexing and search can use different analyzers
  6. Choosing the right analyzer improves search quality

Next, we will look at index templates.

Last updated: 2026-03-27