DataImportHandler #

一、DIH概述 #

1.1 什么是DIH #

DataImportHandler(DIH)是Solr提供的数据导入工具,支持:

  • 关系数据库导入
  • XML文件导入
  • CSV文件导入
  • 增量导入
  • 全量导入

1.2 配置DIH #

solrconfig.xml

xml
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

二、数据库导入 #

2.1 配置数据源 #

data-config.xml

xml
<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="root"
              password="password"/>
  
  <document>
    <entity name="product" query="SELECT * FROM products">
      <field column="id" name="id"/>
      <field column="name" name="title"/>
      <field column="description" name="content"/>
      <field column="price" name="price"/>
      <field column="category_id" name="category_id"/>
    </entity>
  </document>
</dataConfig>

2.2 导入数据 #

bash
# 全量导入
curl "http://localhost:8983/solr/mycore/dataimport?command=full-import"

# 增量导入
curl "http://localhost:8983/solr/mycore/dataimport?command=delta-import"

# 查看状态
curl "http://localhost:8983/solr/mycore/dataimport?command=status"

2.3 导入参数 #

参数 说明
command 导入命令
clean 是否清空索引
commit 是否提交
optimize 是否优化
debug 调试模式
entity 指定实体

三、多表关联 #

3.1 一对多关联 #

xml
<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="root"
              password="password"/>
  
  <document>
    <entity name="product" query="SELECT * FROM products">
      <field column="id" name="id"/>
      <field column="name" name="title"/>
      
      <entity name="category" 
              query="SELECT name FROM categories WHERE id='${product.category_id}'">
        <field column="name" name="category"/>
      </entity>
      
      <entity name="tag" 
              query="SELECT tag_name FROM product_tags WHERE product_id='${product.id}'">
        <field column="tag_name" name="tags"/>
      </entity>
    </entity>
  </document>
</dataConfig>

3.2 多表JOIN #

xml
<entity name="product" 
        query="SELECT p.*, c.name as category_name 
               FROM products p 
               LEFT JOIN categories c ON p.category_id = c.id">
  <field column="id" name="id"/>
  <field column="name" name="title"/>
  <field column="category_name" name="category"/>
</entity>

四、增量导入 #

4.1 配置增量导入 #

xml
<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="root"
              password="password"/>
  
  <document>
    <entity name="product" 
            query="SELECT * FROM products"
            deltaImportQuery="SELECT * FROM products WHERE id='${dih.delta.id}'"
            deltaQuery="SELECT id FROM products WHERE updated_at > '${dih.last_index_time}'">
      <field column="id" name="id"/>
      <field column="name" name="title"/>
      <field column="updated_at" name="updated_at"/>
    </entity>
  </document>
</dataConfig>

4.2 增量查询说明 #

查询 说明
query 全量导入查询
deltaQuery 查找变更的ID
deltaImportQuery 根据ID导入

4.3 执行增量导入 #

bash
curl "http://localhost:8983/solr/mycore/dataimport?command=delta-import"

五、数据转换 #

5.1 内置转换器 #

转换器 说明
RegexTransformer 正则转换
NumberFormatTransformer 数字格式化
DateFormatTransformer 日期格式化
HTMLStripTransformer HTML去除
TemplateTransformer 模板生成

5.2 正则转换 #

xml
<entity name="product" query="SELECT * FROM products" transformer="RegexTransformer">
  <field column="id" name="id"/>
  <field column="price_info" name="price" regex="(\d+)" group="1"/>
</entity>

5.3 日期格式化 #

xml
<entity name="product" query="SELECT * FROM products" transformer="DateFormatTransformer">
  <field column="created_at" name="created_at" dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
</entity>

5.4 模板生成 #

xml
<entity name="product" query="SELECT * FROM products" transformer="TemplateTransformer">
  <field column="id" name="id"/>
  <field column="name" name="title"/>
  <field column="url" template="http://example.com/product/${product.id}"/>
</entity>

六、脚本处理 #

6.1 JavaScript脚本 #

xml
<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="root"
              password="password"/>
  
  <script><![CDATA[
    function processRow(row) {
      var title = row.get('title');
      if (title != null) {
        row.put('title', title.toUpperCase());
      }
      return row;
    }
  ]]></script>
  
  <document>
    <entity name="product" query="SELECT * FROM products" transformer="script:processRow">
      <field column="id" name="id"/>
      <field column="name" name="title"/>
    </entity>
  </document>
</dataConfig>

七、文件导入 #

7.1 CSV文件导入 #

xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="csv" 
            processor="LineEntityProcessor"
            url="/path/to/data.csv"
            transformer="RegexTransformer">
      <field column="rawLine" regex="([^,]*),([^,]*),([^,]*)" groupNames="id,title,price"/>
    </entity>
  </document>
</dataConfig>

7.2 XML文件导入 #

xml
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="xml" 
            processor="XPathEntityProcessor"
            url="/path/to/data.xml"
            forEach="/products/product">
      <field column="id" xpath="/products/product/id"/>
      <field column="title" xpath="/products/product/name"/>
      <field column="price" xpath="/products/product/price"/>
    </entity>
  </document>
</dataConfig>

八、调度导入 #

8.1 使用crontab #

bash
# 每小时增量导入
0 * * * * curl "http://localhost:8983/solr/mycore/dataimport?command=delta-import"

# 每天凌晨全量导入
0 0 * * * curl "http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true"

8.2 使用Solr调度 #

xml
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <lst name="defaults">
      <str name="update.chain">schedule</str>
    </lst>
  </lst>
</requestHandler>

九、监控与调试 #

9.1 查看导入状态 #

bash
curl "http://localhost:8983/solr/mycore/dataimport?command=status"

9.2 响应示例 #

json
{
  "status": "idle",
  "importResponse": "",
  "statusMessages": {
    "Total Requests made to DataSource": "1",
    "Total Rows Fetched": "1000",
    "Total Documents Processed": "1000",
    "Total Documents Skipped": "0",
    "Full Dump Started": "2026-03-27 10:00:00",
    "Time taken": "0:0:5.123"
  }
}

9.3 调试模式 #

bash
curl "http://localhost:8983/solr/mycore/dataimport?command=full-import&debug=true"

十、性能优化 #

10.1 批量大小 #

xml
<entity name="product" query="SELECT * FROM products" batchSize="1000">

10.2 连接池 #

xml
<dataSource type="JdbcDataSource" 
            driver="com.mysql.cj.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb"
            user="root"
            password="password"
            batchSize="1000"
            maxActive="100"
            maxIdle="30"
            maxWait="10000"/>

10.3 并行导入 #

xml
<entity name="product" 
        query="SELECT * FROM products"
        threads="4">

十一、实战示例 #

11.1 电商商品导入 #

xml
<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/ecommerce"
              user="root"
              password="password"/>
  
  <document>
    <entity name="product" 
            query="SELECT * FROM products WHERE status = 'active'"
            deltaImportQuery="SELECT * FROM products WHERE id='${dih.delta.id}'"
            deltaQuery="SELECT id FROM products WHERE updated_at > '${dih.last_index_time}' AND status = 'active'">
      <field column="id" name="id"/>
      <field column="name" name="title"/>
      <field column="description" name="content"/>
      <field column="price" name="price"/>
      <field column="stock" name="stock"/>
      
      <entity name="category" 
              query="SELECT name FROM categories WHERE id='${product.category_id}'">
        <field column="name" name="category"/>
      </entity>
      
      <entity name="brand" 
              query="SELECT name FROM brands WHERE id='${product.brand_id}'">
        <field column="name" name="brand"/>
      </entity>
      
      <entity name="image" 
              query="SELECT url FROM product_images WHERE product_id='${product.id}'">
        <field column="url" name="images"/>
      </entity>
      
      <entity name="tag" 
              query="SELECT tag FROM product_tags WHERE product_id='${product.id}'">
        <field column="tag" name="tags"/>
      </entity>
    </entity>
  </document>
</dataConfig>

11.2 文章导入 #

xml
<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.cj.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/cms"
              user="root"
              password="password"/>
  
  <document>
    <entity name="article" 
            query="SELECT a.*, u.name as author_name, c.name as category_name 
                   FROM articles a 
                   LEFT JOIN users u ON a.author_id = u.id 
                   LEFT JOIN categories c ON a.category_id = c.id
                   WHERE a.status = 'published'"
            deltaImportQuery="SELECT a.*, u.name as author_name, c.name as category_name 
                              FROM articles a 
                              LEFT JOIN users u ON a.author_id = u.id 
                              LEFT JOIN categories c ON a.category_id = c.id
                              WHERE a.id='${dih.delta.id}'"
            deltaQuery="SELECT id FROM articles WHERE updated_at > '${dih.last_index_time}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="content" name="content"/>
      <field column="author_name" name="author"/>
      <field column="category_name" name="category"/>
      <field column="views" name="views"/>
      <field column="likes" name="likes"/>
      <field column="created_at" name="publish_date"/>
    </entity>
  </document>
</dataConfig>

十二、总结 #

DIH要点:

功能 说明
全量导入 command=full-import
增量导入 command=delta-import
数据转换 transformer
多表关联 entity嵌套
脚本处理 script

最佳实践:

  • 使用增量导入减少数据量
  • 合理配置批量大小
  • 使用连接池优化性能
  • 定期监控导入状态
  • 使用调度定期导入

下一步,让我们学习SolrCloud集群!

最后更新:2026-03-27