Analyzers and Tokenization
1. Analyzer Overview
1.1 Analysis Flow
text
Input text
↓
CharFilter (character filter)
↓
Tokenizer
↓
TokenFilter (token filter)
↓
Token stream
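The flow above can be sketched in plain Python. This is only an illustration of the ordering of the three stages, not Solr's implementation; `analyze`, `strip_tags`, `whitespace_tokenizer`, and `lowercase_filter` are hypothetical names introduced for this example.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run text through char filters, exactly one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def strip_tags(text: str) -> str:
    # Crude stand-in for an HTML-stripping char filter.
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


print(analyze("<p>Solr Search Engine</p>",
              [strip_tags], whitespace_tokenizer, [lowercase_filter]))
# ['solr', 'search', 'engine']
```

Char filters see the whole string, the tokenizer runs once, and each token filter receives the output of the previous one, which is why the order of the filters in a configuration matters.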
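1.2 Components
| Component | Description |
|---|---|
| CharFilter | Preprocesses the raw character stream before tokenization (zero or more per analyzer) |
| Tokenizer | Splits the character stream into tokens (exactly one per analyzer) |
| TokenFilter | Modifies, removes, or adds tokens (zero or more, applied in the order listed) |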
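2. Tokenizers
2.1 Common Tokenizers
| Tokenizer | Description |
|---|---|
| StandardTokenizer | Splits on Unicode word boundaries (UAX #29) and discards most punctuation |
| WhitespaceTokenizer | Splits on whitespace only |
| KeywordTokenizer | Emits the entire input as a single token |
| LetterTokenizer | Splits on any non-letter character |
| LowerCaseTokenizer | LetterTokenizer plus lowercasing (removed in Solr 8; use LetterTokenizer with LowerCaseFilter instead) |
| PatternTokenizer | Splits on a regular expression |
| UAX29URLEmailTokenizer | Like StandardTokenizer, but keeps URLs and email addresses as single tokens |
| PathHierarchyTokenizer | Emits every ancestor prefix of a path |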
2.2 StandardTokenizer
xml
<fieldType name="text_standard" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
Tokenization example
text
"Solr is a search engine" → ["Solr", "is", "a", "search", "engine"]
2.3 WhitespaceTokenizer
xml
<fieldType name="text_whitespace" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
Tokenization example
text
"Solr Search Engine" → ["Solr", "Search", "Engine"]
2.4 KeywordTokenizer
xml
<fieldType name="text_keyword" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
Tokenization example
text
"Solr Search Engine" → ["Solr Search Engine"]
2.5 PatternTokenizer
xml
<fieldType name="text_pattern" class="solr.TextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[,;]"/>
</analyzer>
</fieldType>
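With this configuration the pattern acts as a delimiter: the input is split wherever `[,;]` matches. A rough Python equivalent, assuming empty pieces are dropped (`pattern_tokenize` is a hypothetical helper, and Python's `re` syntax is not identical to Java's):

```python
import re
from typing import List


def pattern_tokenize(text: str, pattern: str = "[,;]") -> List[str]:
    """Split text on every match of pattern and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


print(pattern_tokenize("a,b;c"))  # ['a', 'b', 'c']
```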
Tokenization example
text
"a,b;c" → ["a", "b", "c"]
2.6 PathHierarchyTokenizer
xml
<fieldType name="text_path" class="solr.TextField">
<analyzer>
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
</analyzer>
</fieldType>
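The tokenizer emits one token per path level, each token containing the full prefix up to that level. A minimal Python sketch of that idea (`path_hierarchy` is a hypothetical helper; options such as `replace` or reverse mode are not modeled):

```python
from typing import List


def path_hierarchy(path: str, delimiter: str = "/") -> List[str]:
    """Return every ancestor prefix of path, shortest first."""
    tokens: List[str] = []
    # Skip a leading delimiter so "/a/b" does not yield an empty first token.
    start = len(delimiter) if path.startswith(delimiter) else 0
    index = path.find(delimiter, start)
    while index != -1:
        tokens.append(path[:index])
        index = path.find(delimiter, index + 1)
    if path:
        tokens.append(path)
    return tokens


print(path_hierarchy("/a/b/c"))  # ['/a', '/a/b', '/a/b/c']
```

Indexing a path this way lets a query for `/a/b` match every document stored anywhere under that directory.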
Tokenization example
text
"/a/b/c" → ["/a", "/a/b", "/a/b/c"]
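3. Token Filters
3.1 Common Filters
| Filter | Description |
|---|---|
| LowerCaseFilter | Converts tokens to lowercase |
| StopFilter | Removes stop words |
| SynonymGraphFilter | Expands or maps synonyms (replaces the deprecated SynonymFilter) |
| PorterStemFilter / SnowballPorterFilter | Reduces words to their stems |
| RemoveDuplicatesTokenFilter | Removes duplicate tokens that share the same text and position |
| ASCIIFoldingFilter | Folds accented and other non-ASCII characters to ASCII equivalents |
| TrimFilter | Trims leading and trailing whitespace from each token |
| LengthFilter | Keeps only tokens whose length is within a given range |
3.2 LowerCaseFilter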
xml
<filter class="solr.LowerCaseFilterFactory"/>
Effect
text
"Solr" → "solr"
3.3 StopFilter
xml
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
stopwords.txt
text
a
an
and
are
as
at
be
but
by
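The effect of `ignoreCase` can be illustrated with a small Python approximation. `stop_filter` is a hypothetical helper; the real filter also records position gaps for removed tokens, which this sketch ignores.

```python
from typing import Iterable, List

STOP_WORDS = ["a", "an", "and", "are", "as", "at", "be", "but", "by"]


def stop_filter(tokens: List[str],
                stop_words: Iterable[str],
                ignore_case: bool = True) -> List[str]:
    """Drop every token that appears in stop_words."""
    if ignore_case:
        stop_set = {word.lower() for word in stop_words}
        return [t for t in tokens if t.lower() not in stop_set]
    stop_set = set(stop_words)
    return [t for t in tokens if t not in stop_set]


tokens = ["Solr", "is", "A", "search", "engine"]
print(stop_filter(tokens, STOP_WORDS))
# ['Solr', 'is', 'search', 'engine']
print(stop_filter(tokens, STOP_WORDS, ignore_case=False))
# ['Solr', 'is', 'A', 'search', 'engine']
```

Note that "is" survives in both cases because it is not in the short list above.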
3.4 SynonymGraphFilter
xml
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt"
ignoreCase="true"
expand="true"/>
synonyms.txt
text
iphone,苹果手机,Apple手机
ipad,苹果平板,Apple平板
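3.5 Stemming Filters
xml
<!-- Porter stemmer (English only) -->
<filter class="solr.PorterStemFilterFactory"/>
<!-- Snowball stemmer (language set via the language attribute); use one stemmer, not both -->
<filter class="solr.SnowballPorterFilterFactory" language="English"/>
Effect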
text
"running" → "run"
"jumped" → "jump"
3.6 ASCIIFoldingFilter
xml
<filter class="solr.ASCIIFoldingFilterFactory"/>
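Folding can be approximated in Python with Unicode decomposition. This is a sketch under a simplifying assumption: the real filter uses its own lookup table and also handles characters that have no decomposition (for example "ß" or "ø"), which this version leaves unchanged. `ascii_fold` is a hypothetical helper.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Strip combining marks after decomposing accented characters."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("café"))   # cafe
print(ascii_fold("naïve"))  # naive
```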
Effect
text
"café" → "cafe"
"naïve" → "naive"
3.7 LengthFilter
xml
<filter class="solr.LengthFilterFactory" min="2" max="20"/>
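With `min="2"` and `max="20"`, tokens shorter than 2 or longer than 20 characters are dropped; both bounds are inclusive. A plain-Python illustration (`length_filter` is a hypothetical helper):

```python
from typing import List


def length_filter(tokens: List[str], min_len: int = 2, max_len: int = 20) -> List[str]:
    """Keep tokens whose length lies in [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(length_filter(["a", "is", "search", "x" * 21]))  # ['is', 'search']
```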
3.8 RemoveDuplicatesTokenFilter
xml
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
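This filter only removes a token when another token with the same text already exists at the same position, which typically happens after stemming or synonym expansion. The same word at a different position is kept. A sketch that models tokens as hypothetical `(text, position)` pairs:

```python
from typing import List, Set, Tuple

Token = Tuple[str, int]  # (text, position)


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop tokens whose (text, position) pair was already seen."""
    seen: Set[Token] = set()
    result: List[Token] = []
    for token in tokens:
        if token in seen:
            continue
        seen.add(token)
        result.append(token)
    return result


# "running" and its synonym "runs" both stem to "run" at position 1.
stream = [("run", 1), ("run", 1), ("fast", 2), ("run", 3)]
print(remove_duplicates(stream))
# [('run', 1), ('fast', 2), ('run', 3)]
```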
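4. Character Filters (CharFilter)
4.1 Common Character Filters
| Filter | Description |
|---|---|
| MappingCharFilter | Replaces characters according to a mapping file |
| PatternReplaceCharFilter | Replaces text that matches a regular expression |
| HTMLStripCharFilter | Strips HTML markup |
4.2 MappingCharFilter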
xml
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
mapping.txt
text
# Map curly quotes to straight quotes (format: "source" => "target")
"“" => "\""
"‘" => "'"
4.3 HTMLStripCharFilter
xml
<charFilter class="solr.HTMLStripCharFilterFactory"/>
Effect
text
"<p>Solr</p>" → "Solr"
4.4 PatternReplaceCharFilter
xml
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[-]" replacement=""/>
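This configuration deletes every hyphen before tokenization, so a hyphenated word reaches the tokenizer as a single word. The equivalent substitution in Python, as an illustration only (`pattern_replace` is a hypothetical helper; Java replacement strings reference groups as `$1`, whereas Python uses `\1`):

```python
import re


def pattern_replace(text: str, pattern: str = "[-]", replacement: str = "") -> str:
    """Replace every match of pattern in the raw text."""
    return re.sub(pattern, replacement, text)


print(pattern_replace("wi-fi e-mail"))  # wifi email
```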
5. Complete Analyzer Configurations
5.1 General-Purpose Text Analyzer
xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
5.2 English Text Analyzer
xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
5.3 Chinese Text Analyzer
xml
<fieldType name="text_cn" class="solr.TextField">
<analyzer type="index">
<tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
6. Analysis API
6.1 Analyzing a Field Type
bash
curl "http://localhost:8983/solr/mycore/analysis/field" \
--data-urlencode "analysis.fieldtype=text_general" \
--data-urlencode "analysis.fieldvalue=Solr is a search engine"
6.2 Analyzing a Field
bash
curl "http://localhost:8983/solr/mycore/analysis/field" \
--data-urlencode "analysis.fieldname=title" \
--data-urlencode "analysis.fieldvalue=Solr is a search engine"
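The same request can be prepared from a script. The sketch below only builds the URL and does not contact a server; it assumes the field analysis handler's `analysis.fieldtype`, `analysis.fieldname`, and `analysis.fieldvalue` parameters, and `field_analysis_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlencode


def field_analysis_url(base_url: str,
                       core: str,
                       field_value: str,
                       field_type: Optional[str] = None,
                       field_name: Optional[str] = None) -> str:
    """Build a URL for the /analysis/field handler of one core."""
    params = {"analysis.fieldvalue": field_value}
    if field_type:
        params["analysis.fieldtype"] = field_type
    if field_name:
        params["analysis.fieldname"] = field_name
    # urlencode escapes spaces and other special characters in the values.
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


print(field_analysis_url("http://localhost:8983/solr", "mycore",
                         "Solr is a search engine", field_type="text_general"))
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`.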
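6.3 Analyzing a Document
bash
# The handler expects a Solr XML document in the request body; the id and title fields are assumed to exist in the schema
curl "http://localhost:8983/solr/mycore/analysis/document" \
-H "Content-Type: text/xml" \
--data-binary '<docs><doc><field name="id">1</field><field name="title">Solr is a search engine</field></doc></docs>'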
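6.4 Response Example
The response below is simplified to show the overall shape. An actual response reports the token stream after every stage of the analysis chain and includes further attributes for each token, such as its type and position.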
json
{
"analysis": {
"field_types": {
"text_general": {
"index": [
{
"text": "Solr is a search engine",
"tokens": [
{"text": "solr", "start": 0, "end": 4},
{"text": "is", "start": 5, "end": 7},
{"text": "a", "start": 8, "end": 9},
{"text": "search", "start": 10, "end": 16},
{"text": "engine", "start": 17, "end": 23}
]
}
]
}
}
}
}
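7. Chinese Tokenization
7.1 IK Analyzer
Installation
bash
# Download the IK analyzer.
# Caution: this archive is the Elasticsearch plugin build. The IKTokenizerFactory
# class configured below ships only with Solr-specific IK builds, so substitute
# the JAR from such a build that matches your Solr version.
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.0/elasticsearch-analysis-ik-7.17.0.zip
# Extract it into Solr's lib directory
unzip elasticsearch-analysis-ik-7.17.0.zip -d server/solr-webapp/webapp/WEB-INF/lib/
Configuration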
xml
<fieldType name="text_ik" class="solr.TextField">
<analyzer type="index">
<tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true"/>
</analyzer>
</fieldType>
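7.2 SmartChinese Tokenization
xml
<fieldType name="text_smartcn" class="solr.TextField">
<analyzer>
<!-- Requires the SmartChinese analysis JARs (analysis-extras) on the classpath -->
<tokenizer class="solr.HMMChineseTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>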
</analyzer>
</fieldType>
7.3 Custom Dictionary
Custom dictionary for the IK analyzer (one term per line):
text
# custom/ext.dic
搜索引擎
分布式搜索
全文检索
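8. Synonym Configuration
8.1 Synonym Format
text
# Synonym configuration
# Format 1: equivalent synonyms (comma-separated)
iphone,苹果手机,Apple手机
# Format 2: one-way mapping (the left side is replaced by the right side)
苹果手机 => iphone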
8.2 expand=true
xml
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt"
ignoreCase="true"
expand="true"/>
Effect
text
Searching for "苹果手机" → matches "iphone", "苹果手机", and "Apple手机"
8.3 expand=false
xml
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt"
ignoreCase="true"
expand="false"/>
Effect
text
Searching for "苹果手机" → matches only "iphone" (every term in a comma-separated list is mapped to the first entry)
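The difference between the two settings can be shown with a simplified model of how rules are turned into a lookup table. This is a sketch, not Solr's parser: `build_synonym_map` is a hypothetical helper, it ignores `ignoreCase` and multi-word terms, and a later rule for the same term overwrites an earlier one, whereas Solr merges them.

```python
from typing import Dict, List


def build_synonym_map(rules: List[str], expand: bool = True) -> Dict[str, List[str]]:
    """Map each source term to the terms it is replaced with."""
    mapping: Dict[str, List[str]] = {}
    for rule in rules:
        rule = rule.strip()
        if not rule or rule.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in rule:
            # One-way mapping: expand has no effect on explicit rules.
            left, right = rule.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent synonyms: expand decides between all terms and the first.
            terms = [t.strip() for t in rule.split(",")]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            mapping[source] = targets
    return mapping


RULES = ["iphone,苹果手机,Apple手机"]
expanded = build_synonym_map(RULES, expand=True)["苹果手机"]
# expanded == ['iphone', '苹果手机', 'Apple手机']
collapsed = build_synonym_map(RULES, expand=False)["苹果手机"]
# collapsed == ['iphone']
```

With `expand="false"` the index and query analyzers must both apply the same mapping, otherwise documents containing only "苹果手机" would never match the collapsed query term.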
9. Stop Word Configuration
9.1 Stop Word File
stopwords.txt
text
# English stop words
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
9.2 Configuration
xml
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"/>
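10. Summary
Key components of an analyzer configuration:
| Component | Description |
|---|---|
| CharFilter | Character preprocessing |
| Tokenizer | Tokenization |
| TokenFilter | Token processing |
Best practices:
- Keep the index-time and query-time analyzers compatible, and let them differ only where intended (for example, applying synonyms at query time only)
- Keep synonym lists focused and choose the expand setting deliberately
- Filter out stop words
- Use a dedicated tokenizer for Chinese text
- Verify every field type with the analysis API
Next, let's learn about DataImportHandler!
Last updated: 2026-03-27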