Using DIH Mode in Solr

Using the DIH (DataImportHandler) mode in Solr

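At first I only used the full-import feature.

Most of the demos that can be found online build the index and run queries inside the same container, i.e. one running Tomcat/Jetty instance.

The operations are triggered by adding the appropriate parameters to an HTTP request, for example: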

http://localhost:8080/dataimport?command=full-import

http://localhost:8080/dataimport?command=delta-import

http://localhost:8080/solr/select/?q=*:*&version=2.2&start=0&rows=100&indent=on&group=true&group.field=albumId&group.ngroups=true

group turns on result grouping and group.ngroups reports how many groups matched, although it appears to be fairly slow.
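To make the grouping parameters easier to read, here is a minimal sketch that assembles the same kind of select URL from a parameter map. It uses only the JDK; the class name GroupQueryUrl and the method build are made up for this illustration and are not part of Solr or SolrJ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Solr class: it only concatenates query parameters.
class GroupQueryUrl {

    /** Builds a select URL with result grouping enabled on the given field. */
    static String build(String baseUrl, String query, String groupField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("start", "0");
        params.put("rows", "100");
        params.put("group", "true");            // switch grouping on
        params.put("group.field", groupField);  // field to group by
        params.put("group.ngroups", "true");    // also report the number of groups

        StringBuilder url = new StringBuilder(baseUrl).append("/select?");
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://localhost:8080/solr", "*:*", "albumId"));
    }
}
```

Keeping the parameters in a map makes it easy to drop group.ngroups when the group count is not needed.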

Or through code:

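    private static final String DEFAULT_URL = "http://localhost:8983/solr/";

    // SolrJ clients that talk to an already running Solr instance over HTTP
    private CommonsHttpSolrServer server;
    private CommonsHttpSolrServer httpServer;

    @Before
    public void init() {
        try {
            server = new CommonsHttpSolrServer(DEFAULT_URL);
            httpServer = new CommonsHttpSolrServer(DEFAULT_URL);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    }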

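The prerequisite is that a container has to be started first, which is very inconvenient for unit tests.

Without a running container none of the operations above can be performed.

For code that does not go through Jetty or a similar container, Solr offers the following approach: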

	private void initCore() {
		// Point Solr at the home directory that holds the core configuration
		System.setProperty("solr.solr.home", "/home/admin/solr");
		CoreContainer.Initializer init = new CoreContainer.Initializer();
		try {
			core = init.initialize().getCore(CORE_NAME);
		} catch (Exception e) {
			// Pass the exception along so the cause is not lost
			logger.error("error occurred while creating core: {}", CORE_NAME, e);
		}
	}

First initialize a core. You can keep it in a global field:

	SolrCore core = null;

A full import then looks like this:

		if (core == null) {
			initCore();
		}

		try {
			// Look up the DIH handler registered in solrconfig.xml
			SolrRequestHandler requestHandler = core.getRequestHandler("/dataimport");
			NamedList<Object> params = new NamedList<Object>();
			params.add("command", DataImporter.FULL_IMPORT_CMD);
			params.add("synchronous", Boolean.TRUE);
			params.add("clean", Boolean.TRUE);
			SolrQueryRequest req = new LocalSolrQueryRequest(core, params);
			SolrQueryResponse rsp = new SolrQueryResponse();
			requestHandler.handleRequest(req, rsp);
		} catch (Exception e) {
			result = Boolean.FALSE;
			logger.error("error", e);
		}

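That completes the full import of the data. Lucky you. Before it works, however, the following must be in place.

Step 1: configure schema.xml. The main things to configure are:

1 <fields></fields>: all the fields needed to build the index

2 <uniqueKey>id</uniqueKey> (optional: it is required only if you do not plan to configure pk later in db-data-config.xml;

if pk is configured in db-data-config.xml, this element is no longer mandatory)

Step 2: configure solrconfig.xml. What needs to be done here is to

add: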

 <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
    	<str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

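Since db-data-config.xml is referenced here, the next step follows.

Step 3: configure db-data-config.xml

This file mainly configures the attributes of a large number of fields: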

<dataConfig>
	<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test"
		user="test" password="test" /><!-- The dataSource is defined here. Solr uses JdbcDataSource by default; you can also plug in your own custom XXXDataSource -->
	<document>
		<entity name="item"
			query="select * from student where studentnum > 1" pk="id"
			deltaImportQuery="select * from student where id='${dataimporter.delta.id}'"
			deltaQuery="select id from student where gmt_modified > '${dataimporter.last_index_time}'">
			<field column="id" name="id" />
			<field column="username" name="username" />
			<field column="studentnum" name="studentnum" />
			<field column="birthday" name="birthday" />
			<field column="pic" name="pic" />
			<entity name="course" transformer="com.my.dump.test.transformer.CourseTransformer" pk="id"
				query="select id,CONTENT from course where studentnum='${item.studentnum}'"
				deltaQuery="select id,studentnum from course where  gmt_modified > '${dataimporter.last_index_time}'"
				parentDeltaQuery="select id from student where studentnum='${course.studentnum}'">
				<field name="prop" column="content" />
			</entity>
			
		</entity>

	</document>
</dataConfig>

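The dataSource can be customized. The class only needs to extend DataSource, in the same way the built-in JdbcDataSource does: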

public class JdbcDataSource extends
        DataSource<Iterator<Map<String, Object>>> {

For example:

<dataSource type="com.my.test.dataimport.StudentDataSource"/>

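The document element lists the fields that are to be added to the index.

query is used for the full import.

deltaImportQuery is the statement that fetches the changed rows during a delta import. Note that the dataimporter.delta prefix is mandatory.

pk: after painstakingly tracing the code, I found that pk is only used as the key when the rows returned by deltaQuery are put into a map.

If you do not want the deltaQuery results to get mixed up in the end, make sure pk is unique.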
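Why a non-unique pk causes trouble can be seen in a small model of that map. The sketch below is not the DIH source code; it is a simplified JDK-only illustration, and the names DeltaRowsByPk, collect and row are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model (an assumption, not Solr code) of rows keyed by their pk value.
class DeltaRowsByPk {

    /** Stores every row under the value of its pk column. */
    static Map<String, Map<String, Object>> collect(List<Map<String, Object>> rows, String pk) {
        Map<String, Map<String, Object>> byPk = new LinkedHashMap<>();
        for (Map<String, Object> row : rows) {
            Object key = row.get(pk);
            if (key == null) {
                continue; // a row without a pk value cannot be tracked
            }
            // A later row with the same key silently replaces the earlier one
            byPk.put(key.toString(), row);
        }
        return byPk;
    }

    /** Builds a row with the two columns used in this example. */
    static Map<String, Object> row(Object id, String username) {
        Map<String, Object> r = new HashMap<>();
        r.put("id", id);
        r.put("username", username);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(1, "alice"));
        rows.add(row(2, "bob"));
        rows.add(row(1, "carol")); // same pk as the first row
        System.out.println(collect(rows, "id").keySet());
    }
}
```

With a duplicated pk value only one of the colliding rows survives, which is the kind of mix-up a unique pk avoids.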

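deltaQuery finds the primary keys of the objects that need to be updated, so that deltaImportQuery can use them.

transformer: the columns in the database often do not match what you need. For example, if the user's birthday is stored but you want to index the Chinese zodiac sign, the birthday needs some processing of your own.

For that you need a class such as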

public class PropTransformer extends Transformer {

implemented like this:

	@Override
	public Object transformRow(Map<String, Object> row, Context context) {
		System.out.println("--------------------------------------");
		String content = (String) row.get("CONTENT");
		// conversion logic (omitted here)
		if (content == null) {
			return null;
		}
		try {
			row.put(BABYNICK, kvMap.get(BABYNICK));
			return row;
		} catch (Exception e) {
			return null;
		}
	}

This way your own processing is applied to each row.
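For the birthday example mentioned earlier, the row-level logic might look like the sketch below. It is written against a plain Map so that it runs without Solr on the classpath. The output field name zodiac, the yyyy-MM-dd string format of birthday, and the class name ZodiacRowSketch are all assumptions made for this illustration. It also derives the sign from the Gregorian year only, which is a simplification because the zodiac year actually changes at the lunar new year.

```java
import java.util.HashMap;
import java.util.Map;

// JDK-only sketch of the logic a transformer could apply to each row.
class ZodiacRowSketch {

    private static final String[] ANIMALS = {
        "Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
        "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"
    };

    /** Adds a "zodiac" entry derived from the year of the "birthday" column. */
    static Map<String, Object> transformRow(Map<String, Object> row) {
        Object birthday = row.get("birthday");
        if (birthday == null) {
            return row; // nothing to derive, leave the row unchanged
        }
        // Assumes the value starts with a four-digit year, e.g. "1990-05-01"
        int year = Integer.parseInt(birthday.toString().substring(0, 4));
        // Year 4 was a Rat year, so the offset from it selects the animal
        row.put("zodiac", ANIMALS[Math.floorMod(year - 4, 12)]);
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("birthday", "1990-05-01");
        System.out.println(transformRow(row).get("zodiac"));
    }
}
```

In a real transformer the same body would sit inside transformRow(Map<String, Object> row, Context context) of a class that extends Transformer.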

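parentDeltaQuery: when a child entity changes, the parent entity has to be notified as well, because Solr currently cannot modify individual fields of an existing document; the usual approach is to delete the document and insert it again.

So this query also has to return the primary key id of the parent that the child belongs to.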

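Solr's DIH works as follows:

1. Load the configuration file, determine the command type, and load the datasource.

2. For full-import, run the full import based on the query configured in the configuration file or in code.

3. For delta-import, start with the entities one level below the root entity and run their deltaQuery to find the matching rows. For each row, read the pk value and store the row in a map with that value as the key and the row as the value. Then check whether a parentDeltaQuery is defined. If it is, take the parameter values that parentDeltaQuery needs from that map and look up the parent's pk, recursing upward until the pk of the root entity is reached. Finally all root pks are put into a map and deltaImportQuery is run for each of them, so that all related data gets updated.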
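The delta-import step can be modelled in a few lines. The sketch below only simulates the control flow with in-memory collections; it is not the DocBuilder implementation, and the class DeltaImportFlow, its method names and the lookup map that stands in for parentDeltaQuery are hypothetical. The table and column names follow the student/course configuration shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory simulation (an assumption, not Solr code) of collecting root pks.
class DeltaImportFlow {

    /** Returns the root-entity primary keys that have to be re-imported. */
    static Set<String> rootKeysToReimport(
            List<String> changedStudentIds,               // result of the root deltaQuery
            List<Map<String, String>> changedCourses,     // result of the child deltaQuery
            Map<String, String> studentIdByStudentnum) {  // stands in for parentDeltaQuery
        Set<String> rootKeys = new LinkedHashSet<>(changedStudentIds);
        for (Map<String, String> course : changedCourses) {
            // Resolve the parent of each changed child row
            String parentId = studentIdByStudentnum.get(course.get("studentnum"));
            if (parentId != null) {
                rootKeys.add(parentId);
            }
        }
        return rootKeys;
    }

    /** Produces one deltaImportQuery per root key, as DIH would run them. */
    static List<String> deltaImportQueries(Set<String> rootKeys) {
        List<String> sql = new ArrayList<>();
        for (String id : rootKeys) {
            sql.add("select * from student where id='" + id + "'");
        }
        return sql;
    }

    public static void main(String[] args) {
        Map<String, String> course = new HashMap<>();
        course.put("id", "10");
        course.put("studentnum", "200");

        Map<String, String> parents = new HashMap<>();
        parents.put("100", "1");
        parents.put("200", "2");

        Set<String> keys = rootKeysToReimport(
                Arrays.asList("1"), Collections.singletonList(course), parents);
        for (String q : deltaImportQueries(keys)) {
            System.out.println(q);
        }
    }
}
```

Using a set for the root keys means a student that changed directly and through a course is still re-imported only once.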

What caused a headache was the loading of dataimport.properties. Looking at the file, it only contains the following entries:

#Wed Feb 08 10:56:12 CST 2012
item.last_index_time=2012-02-08 10\:56\:11
last_index_time=2012-02-08 10\:56\:11

So I wrote the query like this:

deltaQuery="select id,user_id from user_profile where  gmt_modified > '${last_index_time}'"

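It turned out that last_index_time was empty in the SQL that was actually executed. After painstakingly tracing the code I found that the data (a map) held by the resulting resolver object has no key that starts with last_index_time. It only has the key dataimporter, whose value is another map, and that map contains last_index_time and item.last_index_time. Once the dataimporter prefix was added, the expected result appeared. In other words, some of these settings are not arbitrary; they are fixed names.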
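The behaviour can be reproduced with a toy resolver. The sketch below is not Solr's resolver; it is a simplified JDK-only model in which the properties are stored under a single dataimporter namespace, so a placeholder without that prefix resolves to an empty string. The class name NamespaceResolverSketch and its methods are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model (an assumption, not Solr code) of namespace-based placeholder lookup.
class NamespaceResolverSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Parses dataimport.properties text and files it under the "dataimporter" namespace. */
    static Map<String, Map<String, String>> namespaces(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText)); // also unescapes "\:" to ":"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> importer = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            importer.put(name, props.getProperty(name));
        }
        Map<String, Map<String, String>> ns = new HashMap<>();
        ns.put("dataimporter", importer);
        return ns;
    }

    /** Replaces ${namespace.key} placeholders; unknown names become empty strings. */
    static String resolve(String template, Map<String, Map<String, String>> ns) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            int dot = name.indexOf('.');
            String value = "";
            if (dot > 0) {
                // Text before the first dot selects the namespace, the rest is the key
                Map<String, String> scope = ns.get(name.substring(0, dot));
                if (scope != null && scope.get(name.substring(dot + 1)) != null) {
                    value = scope.get(name.substring(dot + 1));
                }
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "item.last_index_time=2012-02-08 10\\:56\\:11\n"
                + "last_index_time=2012-02-08 10\\:56\\:11\n";
        Map<String, Map<String, String>> ns = namespaces(text);
        System.out.println(resolve("gmt_modified > '${last_index_time}'", ns));
        System.out.println(resolve("gmt_modified > '${dataimporter.last_index_time}'", ns));
    }
}
```

The first call leaves the date empty and the second fills it in, which mirrors what showed up in the generated SQL.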

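The other thing I could not figure out for a long time was the purpose of pk. Tracing the code again showed that pk only serves as the key when the queried rows are put into a map, so every entity that takes part in a delta import needs a pk attribute.

For the details of the whole process, see the DocBuilder class in Solr.

A SQL-like group by in a Solr search looks like this:

group=true&group.field=sellerId&group.format=simple

where sellerId is an indexed field.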
