Index MySQL data with Solr

Reference: https://www.cnblogs.com/luxiaoxun/p/4442770.html

Version: solr-5.3.0

In Solr 5.0 and later, the schema is managed by default as managed-schema, which is not meant to be edited by hand; changes are made through the Schema REST API. If you want to maintain the schema manually instead, copy managed-schema to schema.xml, then make the following changes in solrconfig.xml: comment out the ManagedIndexSchemaFactory (and the solr.AddSchemaFieldsUpdateProcessorFactory processor) and enable ClassicIndexSchemaFactory:

<!-- <schemaFactory class="ManagedIndexSchemaFactory">
    <bool name="mutable">true</bool>
    <str name="managedSchemaResourceName">managed-schema</str>
  </schemaFactory> -->
  
<!-- <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">strings</str>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Boolean</str>
        <str name="fieldType">booleans</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.util.Date</str>
        <str name="fieldType">tdates</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Long</str>
        <str name="valueClass">java.lang.Integer</str>
        <str name="fieldType">tlongs</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Number</str>
        <str name="fieldType">tdoubles</str>
      </lst>
    </processor> -->
    
  <schemaFactory class="ClassicIndexSchemaFactory"/>
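If you keep the managed schema instead, fields are added through the Schema REST API mentioned above. A minimal sketch of building such a request follows; the field name userName and the core URL are illustrative assumptions, not taken from the article:

```python
import json

# Sketch: build an "add-field" request body for Solr's Schema REST API.
# The core URL and field definition below are illustrative assumptions.
schema_url = "http://localhost:8983/solr/collection3/schema"
payload = {
    "add-field": {
        "name": "userName",
        "type": "string",
        "stored": True,
        "indexed": True,
    }
}
body = json.dumps(payload)
print(body)
# To apply it, POST `body` to schema_url with
# Content-Type: application/json, e.g. via curl or urllib.request.
```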

Create MySQL data
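The original table screenshot is not reproduced here. Based on the columns referenced in data-config.xml below, the user table presumably looks something like the following hedged reconstruction (demonstrated with SQLite for portability; in MySQL, id would typically be INT AUTO_INCREMENT PRIMARY KEY, and the updateTime column for incremental import is added later in this article):

```python
import sqlite3

# Hedged reconstruction of the `user` table implied by the field list in
# data-config.xml; SQLite stands in for MySQL so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        userName    VARCHAR(30),
        userAge     INTEGER,
        userAddress VARCHAR(30)
    )
""")
conn.execute(
    "INSERT INTO user (userName, userAge, userAddress) VALUES (?, ?, ?)",
    ("alice", 30, "Beijing"))
row = conn.execute(
    "SELECT userName, userAge, userAddress FROM user WHERE id = 1").fetchone()
print(row)  # ('alice', 30, 'Beijing')
```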

 

Import and index data using DataImportHandler

G:\solr-5.3.0\server\solr\collection3\conf\solrconfig.xml

1) Add a dataimport request handler in front of <requestHandler name="/select" class="solr.SearchHandler">:

 

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
       <lst name="defaults">
          <str name="config">data-config.xml</str>
       </lst>
  </requestHandler>

2) Add data-config.xml in the same directory

<?xml version="1.0" encoding="UTF8" ?>
<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://127.0.0.1:3306/test" user="root" password="888" batchSize="-1" />
   <document name="testDoc">
        <entity name="user" pk="id" query="select * from user"
             deltaImportQuery="select * from user where id='${dih.delta.id}'"
                deltaQuery="select id from user where updateTime> '${dataimporter.last_index_time}'">
         <field column="id" name="id"/>
         <field column="userName" name="userName"/>
            <field column="userAge" name="userAge"/>
            <field column="userAddress" name="userAddress"/>
            <field column="updateTime" name="updateTime"/>
     </entity>
  </document>
</dataConfig>

Notes:

dataSource is the database data source.

An entity corresponds to a table; pk is its primary key, and query is the query statement.

Each field corresponds to one column: column is the column name in the database, and name is the name of the matching Solr field.
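As a quick sanity check of the column-to-field mapping, the data-config.xml above can be parsed and inspected with the standard library (a sketch using an abbreviated copy of the config):

```python
import xml.etree.ElementTree as ET

# Parse a (shortened) data-config.xml and list the column -> Solr field
# mapping for each entity.
data_config = """\
<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://127.0.0.1:3306/test"
                user="root" password="888" batchSize="-1"/>
    <document name="testDoc">
        <entity name="user" pk="id" query="select * from user">
            <field column="id" name="id"/>
            <field column="userName" name="userName"/>
        </entity>
    </document>
</dataConfig>"""
root = ET.fromstring(data_config)
for entity in root.iter("entity"):
    mapping = {f.get("column"): f.get("name") for f in entity.iter("field")}
    print(entity.get("name"), mapping)  # user {'id': 'id', 'userName': 'userName'}
```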

3) Modify schema.xml in the same directory; this is the schema Solr uses to index the database data.

1) Keep the _version_ field.

2) Add the index fields: the name of each field here must match the name of the corresponding field in the data-config.xml entity, one to one.

3) Delete the unneeded fields and remove the copyField settings; they are not used here. Note: the text field must not be deleted, or Solr will fail to start.

4) Set the unique primary key: <uniqueKey>id</uniqueKey>. Note: by default the index primary key in Solr only supports type="string", while the id in my database is an int, so a problem occurs. Solution: edit elevate.xml in the same directory and comment out the following 2 lines. This seems to be a Solr bug; the cause is unknown.

 

Copy mysql-connector-java-5.1.36.jar, solr-dataimporthandler-5.3.0.jar, and solr-dataimporthandler-extras-5.3.0.jar to G:\solr-5.3.0\server\solr-webapp\webapp\WEB-INF\lib. The first is the MySQL JDBC driver; the other two are found in the G:\solr-5.3.0\dist directory.

Restart Solr.

If the configuration is correct, Solr will start successfully.

solrconfig.xml is Solr's core configuration file; it configures the various request handlers, response processors, logging, caches, and so on.

schema.xml configures how each data type is mapped and indexed; the tokenizers and the fields contained in indexed documents are also configured here.

Go to the Solr homepage and select collection3 in the Core Selector: http://localhost:8983/solr/#/collection3

Click Dataimport, select full-import for Command (the default), click "Execute", then Refresh Status to see the result:
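The same full import can be triggered over HTTP instead of through the admin UI. A sketch of building the request URL (host and core name are the ones used above; clean=true and commit=true mirror the UI's defaults):

```python
from urllib.parse import urlencode

# Build the DataImportHandler full-import URL equivalent to clicking
# "Execute" in the admin UI: clean=true wipes the old index first,
# commit=true commits when the import finishes.
base = "http://localhost:8983/solr/collection3/dataimport"
params = {"command": "full-import", "clean": "true", "commit": "true"}
url = base + "?" + urlencode(params)
print(url)
# Fetching this URL (e.g. with urllib.request.urlopen) starts the import;
# command=status on the same handler reports progress.
```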

Query:

 

DIH incremental import of data from a MySQL database

We have seen how to import MySQL data in full. A full import is expensive when the data volume is large, so in general data is imported incrementally. The following describes how to import data from a MySQL database incrementally, and how to schedule it to run on a timer.

1) Changes to database tables

A user table was created earlier. To enable incremental import, a new field updateTime must be added, of type timestamp, with default value CURRENT_TIMESTAMP.

With such a field, Solr can determine which data is new when incrementally imported.

Solr itself keeps a last_index_time value recording the time of the last full-import or delta-import (incremental import); it is stored in the dataimport.properties file in the core's conf directory.
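dataimport.properties is a plain Java properties file, so last_index_time can be read back with a few lines. A sketch using sample content that mirrors what Solr writes (the real file lives under the core's conf directory):

```python
from datetime import datetime

# Parse last_index_time out of dataimport.properties. Java properties
# files escape colons, so "\:" must be unescaped.
sample = """#Wed Sep 09 15:00:00 CST 2015
last_index_time=2015-09-09 15\\:00\\:00
user.last_index_time=2015-09-09 15\\:00\\:00
"""
props = {}
for line in sample.splitlines():
    if "=" in line and not line.startswith("#"):
        key, _, value = line.partition("=")
        props[key.strip()] = value.replace("\\:", ":").strip()
last = datetime.strptime(props["last_index_time"], "%Y-%m-%d %H:%M:%S")
print(last)  # 2015-09-09 15:00:00
```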

2) Setting of necessary attributes in data-config.xml

transformer: format conversion; e.g. HTMLStripTransformer strips HTML tags during indexing

query: selects the records from the database table for a full import

deltaQuery: selects the primary-key IDs for an incremental import; note that it must return only the ID field

deltaImportQuery: fetches the data for the incremental import

deletedPkQuery: selects the primary-key IDs of deleted records for the incremental import; note that it must return only the ID field

Regarding "query", "deltaImportQuery", and "deltaQuery", the official documentation explains them as follows:
The query gives the data needed to populate fields of the Solr document in full-import
The deltaImportQuery gives the data needed to populate fields when running a delta-import
The deltaQuery gives the primary keys of the current entity which have changes since the last index time

If you need to query related child tables, you may need to use parentDeltaQuery.

The principle of incremental indexing is to query the ID numbers of all data that needs to be incrementally imported from the database according to the SQL statement specified by deltaQuery.

Then return the data of all these IDs according to the SQL statement specified by deltaImportQuery, that is, the data to be processed for this incremental import.

The core idea is to track the IDs to index in this pass and the time of the last index through the built-in variables "${dih.delta.id}" and "${dataimporter.last_index_time}".
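The substitution of these built-in variables can be illustrated by hand (a sketch of the idea; Solr performs this internally before running the SQL):

```python
# Sketch of how DIH substitutes its built-in variables into the SQL
# templates from data-config.xml before executing them.
delta_import_query = "select * from user where id='${dih.delta.id}'"

def substitute(template, variables):
    """Replace ${name} placeholders with their values."""
    for name, value in variables.items():
        template = template.replace("${%s}" % name, str(value))
    return template

sql = substitute(delta_import_query, {"dih.delta.id": 42})
print(sql)  # select * from user where id='42'
```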

If there is a delete operation in the business, you can add an isDeleted field to the database to indicate whether the data has been deleted. At this time, when Solr updates the index, it can update the index of the deleted records according to this field.

At this point you need to add to data-config.xml:

query="select * from user where isDeleted=0"
deltaImportQuery="select * from user where id='${dih.delta.id}'"
deltaQuery="select id from user where updateTime> '${dataimporter.last_index_time}' and isDeleted=0"
deletedPkQuery="select id from user where isDeleted=1"
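The two-step delta logic, together with the isDeleted handling above, can be simulated outside Solr. A sketch with SQLite standing in for MySQL (the data and timestamps are invented for illustration):

```python
import sqlite3

# Simulate one delta-import pass: deltaQuery finds changed IDs,
# deltaImportQuery fetches each full row, and deletedPkQuery finds
# IDs whose index entries should be removed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, userName TEXT, "
             "updateTime TEXT, isDeleted INTEGER DEFAULT 0)")
conn.executemany("INSERT INTO user VALUES (?, ?, ?, ?)", [
    (1, "alice", "2015-09-09 10:00:00", 0),   # indexed in a previous pass
    (2, "bob",   "2015-09-09 12:00:00", 0),   # changed since last index
    (3, "carol", "2015-09-09 12:30:00", 1),   # deleted since last index
])
last_index_time = "2015-09-09 11:00:00"

# deltaQuery: only the IDs of rows changed since the last index
changed_ids = [r[0] for r in conn.execute(
    "SELECT id FROM user WHERE updateTime > ? AND isDeleted = 0",
    (last_index_time,))]
# deltaImportQuery: the full row for each changed ID
updated = [conn.execute("SELECT * FROM user WHERE id = ?", (i,)).fetchone()
           for i in changed_ids]
# deletedPkQuery: IDs to drop from the index
deleted_ids = [r[0] for r in conn.execute(
    "SELECT id FROM user WHERE isDeleted = 1")]
print(changed_ids, deleted_ids)  # [2] [3]
```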

Test incremental import

If there is data in the user table, first clear out the earlier test data (since it has no updateTime value). Then add a user with the MyBatis test program, and the database will fill the updateTime field with the current time. In Solr, use Query to confirm the new row is not yet indexed, run dataimport?command=delta-import to import incrementally, then query all again to see the value just inserted into MySQL.

 

Set incremental import as a scheduled task

You can use the Windows Task Scheduler or Linux cron to hit the incremental-import URL periodically; that works and should cause no problems.

But a more convenient approach, better integrated with Solr itself, is the dataimport scheduler's own timed incremental import.

1. Copy apache-solr-dataimportscheduler-1.0.jar into the WEB-INF\lib directory of the solr webapp.

2. Modify the web.xml file under solr's WEB-INF directory:
add a child element to the <web-app> element

<listener>
        <listener-class>
    org.apache.solr.handler.dataimport.scheduler.ApplicationListener
        </listener-class>
    </listener>

3. Create a new configuration file dataimport.properties:

Create a new directory conf under G:\solr-5.3.0\server\solr (note: this is not the conf under G:\solr-5.3.0\server\solr\collection3), then open the apache-solr-dataimportscheduler-1.0.jar file with an archive tool, copy the dataimport.properties file out of it, and modify it. The following is the final content of my automatic scheduled-update configuration file:

#################################################
#                                               #
#       dataimport scheduler properties         #
#                                               #
#################################################

#  to sync or not to sync
#  1 - active; anything else - inactive
syncEnabled=1

#  which cores to schedule
#  in a multi-core environment you can decide which cores you want synchronized
#  leave empty or comment it out if using single-core deployment
syncCores=collection3

#  solr server name or IP address
#  [defaults to localhost if empty]
server=localhost

#  solr server port
#  [defaults to 80 if empty]
port=8983

#  application name/context
#  [defaults to current ServletContextListener's context (app) name]
webapp=solr

#  URL params [mandatory]
#  remainder of URL
params=/dataimport?command=delta-import&clean=false&commit=true

#  schedule interval
#  number of minutes between two runs
#  [defaults to 30 if empty]
interval=1

# Interval for rebuilding the index, in minutes; the default is 7200, i.e. 5 days
# Empty, 0, or commented out: never rebuild the index
reBuildIndexInterval=2

# redo index parameters
reBuildIndexParams=/select?qt=/dataimport&command=full-import&clean=true&commit=true

# The timing start time of the redo index interval, the time of the first real execution = reBuildIndexBeginTime+reBuildIndexInterval*60*1000;
# Two formats: 2012-04-11 03:10:00 or 03:10:00, the latter will automatically complete the date part as the date when the service is started
reBuildIndexBeginTime=03:10:00
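The comment above gives the first real execution time as reBuildIndexBeginTime + reBuildIndexInterval*60*1000 ms. With the values in this file, that works out as follows (the date part is an illustrative assumption, since the file only gives a time of day):

```python
from datetime import datetime, timedelta

# First rebuild time per the formula in the scheduler comments:
# reBuildIndexBeginTime + reBuildIndexInterval minutes.
begin = datetime.strptime("2015-09-09 03:10:00", "%Y-%m-%d %H:%M:%S")
interval_minutes = 2  # reBuildIndexInterval from this file
first_run = begin + timedelta(minutes=interval_minutes)
print(first_run)  # 2015-09-09 03:12:00
```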

 

Problems encountered:

 

Byte 1 of a 1-byte UTF-8 sequence is invalid

1. Manually change UTF-8 in <?xml version="1.0" encoding="UTF-8"?> to UTF8. (This error usually means the file's actual byte encoding, e.g. GBK, does not match the encoding declared in the XML header.)

 
