DataImportHandler #
一、DIH概述 #
1.1 什么是DIH #
DataImportHandler(DIH)是Solr提供的数据导入工具,支持:
- 关系数据库导入
- XML文件导入
- CSV文件导入
- 增量导入
- 全量导入
1.2 配置DIH #
solrconfig.xml
xml
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
二、数据库导入 #
2.1 配置数据源 #
data-config.xml
xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/mydb"
user="root"
password="password"/>
<document>
<entity name="product" query="SELECT * FROM products">
<field column="id" name="id"/>
<field column="name" name="title"/>
<field column="description" name="content"/>
<field column="price" name="price"/>
<field column="category_id" name="category_id"/>
</entity>
</document>
</dataConfig>
2.2 导入数据 #
bash
# 全量导入
curl "http://localhost:8983/solr/mycore/dataimport?command=full-import"
# 增量导入
curl "http://localhost:8983/solr/mycore/dataimport?command=delta-import"
# 查看状态
curl "http://localhost:8983/solr/mycore/dataimport?command=status"
2.3 导入参数 #
| 参数 | 说明 |
|---|---|
| command | 导入命令 |
| clean | 是否清空索引 |
| commit | 是否提交 |
| optimize | 是否优化 |
| debug | 调试模式 |
| entity | 指定实体 |
三、多表关联 #
3.1 一对多关联 #
xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/mydb"
user="root"
password="password"/>
<document>
<entity name="product" query="SELECT * FROM products">
<field column="id" name="id"/>
<field column="name" name="title"/>
<entity name="category"
query="SELECT name FROM categories WHERE id='${product.category_id}'">
<field column="name" name="category"/>
</entity>
<entity name="tag"
query="SELECT tag_name FROM product_tags WHERE product_id='${product.id}'">
<field column="tag_name" name="tags"/>
</entity>
</entity>
</document>
</dataConfig>
3.2 多表JOIN #
xml
<entity name="product"
query="SELECT p.*, c.name as category_name
FROM products p
LEFT JOIN categories c ON p.category_id = c.id">
<field column="id" name="id"/>
<field column="name" name="title"/>
<field column="category_name" name="category"/>
</entity>
四、增量导入 #
4.1 配置增量导入 #
xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/mydb"
user="root"
password="password"/>
<document>
<entity name="product"
query="SELECT * FROM products"
deltaImportQuery="SELECT * FROM products WHERE id='${dih.delta.id}'"
deltaQuery="SELECT id FROM products WHERE updated_at > '${dih.last_index_time}'">
<field column="id" name="id"/>
<field column="name" name="title"/>
<field column="updated_at" name="updated_at"/>
</entity>
</document>
</dataConfig>
4.2 增量查询说明 #
| 查询 | 说明 |
|---|---|
| query | 全量导入查询 |
| deltaQuery | 查找变更的ID |
| deltaImportQuery | 根据ID导入 |
4.3 执行增量导入 #
bash
curl "http://localhost:8983/solr/mycore/dataimport?command=delta-import"
五、数据转换 #
5.1 内置转换器 #
| 转换器 | 说明 |
|---|---|
| RegexTransformer | 正则转换 |
| NumberFormatTransformer | 数字格式化 |
| DateFormatTransformer | 日期格式化 |
| HTMLStripTransformer | HTML去除 |
| TemplateTransformer | 模板生成 |
5.2 正则转换 #
xml
<entity name="product" query="SELECT * FROM products" transformer="RegexTransformer">
<field column="id" name="id"/>
<field column="price_info" name="price" regex="(\d+)" group="1"/>
</entity>
5.3 日期格式化 #
xml
<entity name="product" query="SELECT * FROM products" transformer="DateFormatTransformer">
<field column="created_at" name="created_at" dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
</entity>
5.4 模板生成 #
xml
<entity name="product" query="SELECT * FROM products" transformer="TemplateTransformer">
<field column="id" name="id"/>
<field column="name" name="title"/>
<field column="url" template="http://example.com/product/${product.id}"/>
</entity>
六、脚本处理 #
6.1 JavaScript脚本 #
xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/mydb"
user="root"
password="password"/>
<script><![CDATA[
function processRow(row) {
var title = row.get('title');
if (title != null) {
row.put('title', title.toUpperCase());
}
return row;
}
]]></script>
<document>
<entity name="product" query="SELECT * FROM products" transformer="script:processRow">
<field column="id" name="id"/>
<field column="name" name="title"/>
</entity>
</document>
</dataConfig>
七、文件导入 #
7.1 CSV文件导入 #
xml
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8"/>
<document>
<entity name="csv"
processor="LineEntityProcessor"
url="/path/to/data.csv"
transformer="RegexTransformer">
<field column="rawLine" regex="([^,]*),([^,]*),([^,]*)" groupNames="id,title,price"/>
</entity>
</document>
</dataConfig>
7.2 XML文件导入 #
xml
<dataConfig>
<dataSource type="FileDataSource" encoding="UTF-8"/>
<document>
<entity name="xml"
processor="XPathEntityProcessor"
url="/path/to/data.xml"
forEach="/products/product">
<field column="id" xpath="/products/product/id"/>
<field column="title" xpath="/products/product/name"/>
<field column="price" xpath="/products/product/price"/>
</entity>
</document>
</dataConfig>
八、调度导入 #
8.1 使用crontab #
bash
# 每小时增量导入
0 * * * * curl "http://localhost:8983/solr/mycore/dataimport?command=delta-import"
# 每天凌晨全量导入
0 0 * * * curl "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true"
8.2 使用Solr调度 #
xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
<lst name="defaults">
<str name="update.chain">schedule</str>
</lst>
</lst>
</requestHandler>
九、监控与调试 #
9.1 查看导入状态 #
bash
curl "http://localhost:8983/solr/mycore/dataimport?command=status"
9.2 响应示例 #
json
{
"status": "idle",
"importResponse": "",
"statusMessages": {
"Total Requests made to DataSource": "1",
"Total Rows Fetched": "1000",
"Total Documents Processed": "1000",
"Total Documents Skipped": "0",
"Full Dump Started": "2026-03-27 10:00:00",
"Time taken": "0:0:5.123"
}
}
9.3 调试模式 #
bash
curl "http://localhost:8983/solr/mycore/dataimport?command=full-import&debug=true"
十、性能优化 #
10.1 批量大小 #
xml
<entity name="product" query="SELECT * FROM products" batchSize="1000">
10.2 连接池 #
xml
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/mydb"
user="root"
password="password"
batchSize="1000"
maxActive="100"
maxIdle="30"
maxWait="10000"/>
10.3 并行导入 #
xml
<entity name="product"
query="SELECT * FROM products"
threads="4">
十一、实战示例 #
11.1 电商商品导入 #
xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/ecommerce"
user="root"
password="password"/>
<document>
<entity name="product"
query="SELECT * FROM products WHERE status = 'active'"
deltaImportQuery="SELECT * FROM products WHERE id='${dih.delta.id}'"
deltaQuery="SELECT id FROM products WHERE updated_at > '${dih.last_index_time}' AND status = 'active'">
<field column="id" name="id"/>
<field column="name" name="title"/>
<field column="description" name="content"/>
<field column="price" name="price"/>
<field column="stock" name="stock"/>
<entity name="category"
query="SELECT name FROM categories WHERE id='${product.category_id}'">
<field column="name" name="category"/>
</entity>
<entity name="brand"
query="SELECT name FROM brands WHERE id='${product.brand_id}'">
<field column="name" name="brand"/>
</entity>
<entity name="image"
query="SELECT url FROM product_images WHERE product_id='${product.id}'">
<field column="url" name="images"/>
</entity>
<entity name="tag"
query="SELECT tag FROM product_tags WHERE product_id='${product.id}'">
<field column="tag" name="tags"/>
</entity>
</entity>
</document>
</dataConfig>
11.2 文章导入 #
xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.cj.jdbc.Driver"
url="jdbc:mysql://localhost:3306/cms"
user="root"
password="password"/>
<document>
<entity name="article"
query="SELECT a.*, u.name as author_name, c.name as category_name
FROM articles a
LEFT JOIN users u ON a.author_id = u.id
LEFT JOIN categories c ON a.category_id = c.id
WHERE a.status = 'published'"
deltaImportQuery="SELECT a.*, u.name as author_name, c.name as category_name
FROM articles a
LEFT JOIN users u ON a.author_id = u.id
LEFT JOIN categories c ON a.category_id = c.id
WHERE a.id='${dih.delta.id}'"
deltaQuery="SELECT id FROM articles WHERE updated_at > '${dih.last_index_time}'">
<field column="id" name="id"/>
<field column="title" name="title"/>
<field column="content" name="content"/>
<field column="author_name" name="author"/>
<field column="category_name" name="category"/>
<field column="views" name="views"/>
<field column="likes" name="likes"/>
<field column="created_at" name="publish_date"/>
</entity>
</document>
</dataConfig>
十二、总结 #
DIH要点:
| 功能 | 说明 |
|---|---|
| 全量导入 | command=full-import |
| 增量导入 | command=delta-import |
| 数据转换 | transformer |
| 多表关联 | entity嵌套 |
| 脚本处理 | script |
最佳实践:
- 使用增量导入减少数据量
- 合理配置批量大小
- 使用连接池优化性能
- 定期监控导入状态
- 使用调度定期导入
下一步,让我们学习SolrCloud集群!
最后更新:2026-03-27