Production installation of Solr on Linux

The version of Solr installed on the server is 5.3.1. Unlike a testing or research setup, deploying Solr in production means installing it as a service. The bin/ directory of the Solr archive contains a script, **install_solr_service.sh**, which installs Solr and registers it as a service that starts automatically on boot.

 

1. Environment preparation

First, create a solr user and grant it the corresponding permissions:

groupadd zpsolr
useradd -g zpsolr zpsolr
passwd zpsolr
chown -R zpsolr:zpsolr /var/solr /usr/local/solrcloud/

Two directories are also created, to separate the Solr installation files from its dynamic files:

  • /var/solr, the dynamic file directory, holding index data, logs, and so on. Keeping it separate from the Solr installation directory makes later management and upgrades easier

  • /usr/local/solrcloud, the installation directory passed to the script. The installer automatically creates a soft link named solr pointing to the actual Solr installation directory, which simplifies future upgrades (solr -> /usr/local/solrcloud/solr-5.3.1)

At this point, the files in the /usr/local directory look like the following structure:

lrwxrwxrwx. 1 zpsolr zpsolr   31 Dec 11 18:20 solr -> /usr/local/solrcloud/solr-5.3.1
drwxr-xr-x. 9 zpsolr zpsolr 4096 Dec 11 18:20 solr-5.3.1
drwxr-xr-x. 3 zpsolr zpsolr 4096 Dec 11 01:27 tmp
drwxr-xr-x. 4 zpsolr zpsolr 4096 Dec 11 00:57 zookeeper

 

2. Installing the service with the script

Solr ships with a script for installing the Solr service on the server. Note that root privileges are required for the installation to succeed:

 sudo bash ./bin/install_solr_service.sh ../solr-5.3.1.tgz -i /usr/local/solrcloud -d /var/solr -u zpsolr -s solr -p 8983

This specifies the Solr installation directory, the data and configuration file directory, the user, the service name, and the default HTTP port. The script parameters are:

  • -d the dynamic file directory, default "/var/solr"
  • -i the directory into which the Solr archive is extracted (the installation directory), default "/opt"
  • -p the working/listening port, default "8983"
  • -s the name under which Solr is installed as a Linux service, default "solr"
  • -u the user that runs Solr; created automatically if it does not exist, default "solr"

After the service is installed, the Solr service can be started and stopped with the service command:

sudo service solr start/stop/status ...

 

3. Solr basic directory structure

├── data
│   ├── brand
│   │   ├── core.properties
│   │   └── data (Solr index data directory)
│   ├── conf
│   │   └── dataimport.properties
│   ├── lib
│   │   ├── mmseg4j-core-1.10.0.jar
│   │   ├── mmseg4j-solr-2.3.0.jar
│   │   ├── mysql-connector-java-5.1.37.jar
│   │   ├── solr-1.0.0.jar
│   │   ├── solr-dataimporthandler-5.3.1.jar
│   │   └── solr-dataimporthandler-extras-5.3.1.jar
│   └── solr.xml
├── log4j.properties
├── logs
│   ├── solr-8983-console.log
│   ├── solr_gc.log
│   └── solr.log
├── solr-8983.pid
└── solr.in.sh

 

3.1 solr.in.sh

This file configures how Solr starts: the corresponding JVM parameters, the solr home directory, and other settings:

SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs
SOLR_PORT=8983

Solr supports a cluster mode, SolrCloud. If ZK_HOST is configured in the solr.in.sh file, Solr starts in SolrCloud mode; different Solr nodes registered under the same ZooKeeper directory form a cluster.
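As a sketch, a ZK_HOST entry pointing at a ZooKeeper ensemble might look like the following (the hosts and the /solrcloud chroot are illustrative, matching the ensemble used later in this article):

ZK_HOST=192.168.1.162:2181,192.168.1.163:2181,192.168.1.165:2181/solrcloud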

 

3.2 data

The data directory stores the collections' configuration and data. Each collection lives in its own folder containing a configuration directory and a data directory, with descriptor files in the configuration directory.

In SolrCloud mode, however, collections are created centrally, and all configuration must be uploaded to ZooKeeper before a collection is created. The scripts shipped with Solr can upload and download configurations:

/usr/local/solrcloud/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 -cmd upconfig -confname my_new_config -confdir server/solr/configsets/basic_configs/conf

/usr/local/solrcloud/solr/server/scripts/cloud-scripts/zkcli.sh -cmd downconfig -zkhost 192.168.1.162:2181,192.168.1.163:2181,192.168.1.165:2181/solrcloud -confname evaluation -confdir .

In stand-alone mode, the configuration files can only be stored in the data/conf directory.

 

4. Solr overall structure

In general, the scenarios in which Solr integrates with other applications are shown in the following figure:

[Figure: solr_integration.png — Solr integrating with other applications]

Data is obtained from a data source (usually a database) and indexed into the Solr service, so that end users can query the corresponding information from the document list through Solr.

[Figure: DiagramOfTheMainComponentsOfSolr4cn.jpg — the main components of Solr]

Solr provides a REST API for common operations, such as creating a collection, adding a replica, or adding a shard. The base URL is http://<host>:<port>/solr.
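As a sketch, assuming a node listening on localhost:8983, creating a collection and adding a replica through the Collections API might look like this (the collection name and the node address are illustrative):

curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=brand&numShards=1&replicationFactor=2"
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=brand&shard=shard1&node=192.168.1.163:8983_solr"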

Each Solr instance has a solr.xml file that specifies the instance-level configuration of the Solr server. For each Solr core, there are:

  • core.properties: defines the core's name and which collection it belongs to;
  • conf/solrconfig.xml: controls higher-level behavior; for example, it can specify an alternative location for the data;
  • conf/schema.xml (renamed managed-schema in version 5.5): describes the content Solr should index. It defines a document as a collection of fields, and the field attributes tell Solr how to process and query field values;
  • the data directory, which contains the index files.

Note that in SolrCloud mode there is no conf directory; its contents are stored in ZooKeeper for use by the servers.

From Solr's point of view, the basic unit of information is the document (Document): a collection of fields that together describe one thing. Fields can contain different types of data; for a pair of shoes, for example, the name, color, and size are each a field (Field) of that shoe's document.

When a document is added to Solr, Solr extracts the field information in the document and saves it to the index; when a query executes, Solr can quickly search the index and return the matching records.
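A small sketch of this cycle with curl, assuming a core named brand on localhost:8983 (the field names are illustrative):

# index one document and commit
curl "http://localhost:8983/solr/brand/update?commit=true" -H "Content-Type: application/json" -d '[{"id":"1","name":"running shoe","color":"red","size":"42"}]'
# query it back
curl "http://localhost:8983/solr/brand/select?q=name:shoe&wt=json"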

 

4.1 Field Analysis

Field analysis is how Solr processes incoming data, for example when building an index. Suppose there is a personal-profile field containing many words, and after indexing we want to find the person by searching for any single word; then each word needs to be indexed. However, a sentence contains many function words, such as "if", "or", "and", "not", that do not need to be indexed and should be excluded when the index is built. Deciding which words to index and which to skip is the process called field analysis.
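For example, a text field type can be configured to drop such stop words during analysis. A minimal sketch (the type name and the stop-word file are illustrative):

<fieldType name="text_profile" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- drops function words such as "if", "or", "and", "not" listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>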

 

4.2 Field Type

A field type tells Solr how to handle the data in a field and how the field is processed at query time. A field type includes the following four pieces of information:

  • name: the name of the type;
  • the implementation class name, i.e. the actual type class in Solr;
  • if the field type is of type TextField, an analysis definition;
  • attributes.

Copy field (copyField): copies the contents of several fields into one field, which is convenient for unified retrieval later. Sometimes several fields of a document should be searchable together. Solr's field replication mechanism can merge multiple fields, possibly of different types, into a single field. Field replication involves two concepts: source, the field being copied, and dest, the field copied into. For example, to find blog posts that contain "Java", both the title and the content must be checked, but Solr cannot do the SQL equivalent of title like '%Java%' or content like '%Java%'. This is where copyField comes in: define a new field, copy title and content into it, and query that single field. This is the typical application scenario of copyField; a sketch follows below.
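A sketch of that blog scenario in schema.xml (the field and type names are illustrative):

<field name="title"   type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
<!-- the dest field receives copies from several sources, so it must be multiValued -->
<field name="text"    type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title"   dest="text"/>
<copyField source="content" dest="text"/>

A query such as q=text:Java then searches title and content in one pass.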

Dynamic field (dynamicField): there is no need to declare each concrete field name; it is enough to define a naming rule. For example, if the name is *_i, then every field ending in _i matches the definition. Dynamic fields let Solr index fields that are not explicitly defined in the schema, which is useful when some fields were forgotten, and makes the system more flexible and general. A dynamic field is like a regular field except that its name contains a wildcard. When a document is indexed, a field that matches no regular field is matched against the dynamic fields. Suppose a dynamic field named *_i is defined in the schema; if a field called cost_i is to be indexed but cost_i itself does not exist in the schema, then cost_i is indexed using the *_i definition. Dynamic fields are also defined in the schema.xml file and, like other fields, have a name, a field type, and attributes; see the sketch below.
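A sketch of such a definition (assuming an int field type is defined in the schema):

<!-- any undeclared field ending in _i, e.g. cost_i, matches this pattern -->
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>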

 

4.3 schema.xml file

The fundamental purpose of the schema.xml configuration file is to tell Solr how to build the index. In version 5 this file is replaced by conf/managed-schema; the file format is the same.

The data structure of Solr is as follows:
* document: a document, i.e. one record
* field: a field, i.e. one attribute of a record

Solr returns the documents that match a search over one or several fields, optionally sorted by relevance score. Compared with a database, a document corresponds to a record in a table and a field to a column; schema.xml defines the structure of that table (the name, type, constraints, etc. of each field).

The basic structure of schema.xml is as follows:

<schema>
    <types>
    <fields>
    <uniqueKey>
    <copyField>
</schema>

Common configuration instructions:

  • field: defines a field in a document
    • name: required. The name of the field. Names with a leading and trailing underscore are reserved by the system, such as "_version_"
    • type: required. The type, matching the name of a fieldType
    • default: the default value of the field
    • indexed: true/false, whether to index the field so that it can be searched and faceted
    • stored: true/false, whether this field's value can be returned to the querier
    • multiValued: true/false, whether it can hold multiple values (for example as the dest of several copyFields). If true, the field cannot be sorted on and cannot be the uniqueKey
    • required: true/false, tells Solr whether this field accepts empty values; the default is false
    • docValues: true/false, builds a document-to-value index to improve the efficiency of certain operations (sorting, statistics, highlighting)
  • copyField: copies the content of one field to another. Generally used to merge several different fields into the same field so that only one field needs to be searched
    • source: the field being copied; wildcards can match multiple fields, e.g. *_name
    • dest: the destination field
    • maxChars: the maximum number of characters to copy
  • uniqueKey: declares a field as the unique key (see the fragment after this list)
  • fieldType: defines a field type, with the following attributes
    • name: required, referenced by field definitions
    • class: required, the implementing class of the fieldType. solr.TextField is an abbreviated path, "equivalent to" org.apache.solr.schema.TextField
    • multiValued: whether fields of this type hold multiple values by default
    • positionIncrementGap: the position gap between multiValued values
    • analyzer: required if class is solr.TextField. Tells Solr how to handle certain words and how to tokenize, e.g. whether to drop "a", whether to lowercase everything, and so on
      • type: index or query
      • tokenizer: the tokenizer, e.g. StandardTokenizerFactory
      • filter: a filter, e.g. LowerCaseFilterFactory
  • dynamicField: uses a wildcard to define a field that catches anything not covered by an explicit field definition
    • name: a wildcard pattern such as "*_i", used to handle fields like "cost_i"
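Putting a few of these directives together, a minimal schema fragment might look like this (the names are illustrative):

<field name="id"   type="string"       indexed="true" stored="true" required="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>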

 

4.4 Understanding Analyzers, Tokenizers, and Filters

Field analyzers are invoked when a document is indexed and when a query runs. An analyzer examines the text of a field and produces a token stream; a tokenizer splits a piece of text into lexical units, or tokens; a filter examines a token stream and decides whether to keep, transform, discard, or further split tokens. Tokenizers and filters can be combined in a fixed order (a pipeline or chain) to form an analyzer.

For text data (solr.TextField), Solr needs to split the text and process it appropriately when indexing and searching (for English: removing prepositions, lowercasing, reducing words to their base form; for Chinese: proper word segmentation). This work is generally done by Analyzers, Tokenizers, and Filters, all of which are configured inside a fieldType.

  • analyzer: tells Solr how to process text-type content when indexing and searching, e.g. whether to drop "a", remove prepositions, strip plurals, lowercase everything, and so on. It is configured in the schema.xml file; either a class can be assigned to it directly, or it can be built from a combination of a tokenizer and filters;
  • type: optional, index or query, indicating whether this configuration applies at index time or at query time;
  • tokenizer: the tokenizer, e.g. StandardTokenizerFactory;
  • filter: a filter, e.g. LowerCaseFilterFactory, which lowercases tokens;

The Analyzer is responsible for turning a text field into a token stream, either processing it itself or delegating to a Tokenizer and Filters for further processing. The Tokenizer and Filters run as peers in sequence: each hands its output to the next.

The difference between a Tokenizer and a Filter:

  • Tokenizer: receives text (through a Reader obtained from Solr), splits it into tokens, and outputs a token stream
  • Filter: receives a token stream, processes each token (e.g. replaces, discards, or leaves it alone), and outputs a token stream

Therefore, in the configuration, the Tokenizer comes first and the Filters follow, in order, to the end.

An example of a tokenizer processing text:

Input:  "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Output: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

 

5. Importing data into Solr

The DataImportHandler (DIH) is used to pull data from a database and build an index. It involves a series of related concepts:

  • Datasource: the data source, including the information needed to fetch the data: its location (url), the database driver, and the login account and password
  • Entity: roughly a database view; it can come from a single table or a join query
  • Processor: the data processor, responsible for fetching data from the data source, processing it, and adding it to the index
  • Transformer: an optional data transformer, responsible for modifying data, creating new fields, or turning one record into several as needed

To use the data import handler, first add a requestHandler in solrconfig.xml (judging by the name, a requestHandler is a request processor that exposes a Solr service endpoint; /dataimport sits at the same level as /select).

  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

The config parameter specifies the location of the DIH configuration file. If jar dependencies need to be added, add a lib tag to the same file:

<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

The format of the DIH configuration file is as follows.

Under the root node:

  • dataSource
    • type: the data source type, e.g. JdbcDataSource (the default), which uses SqlEntityProcessor
    • driver: the database driver, e.g. com.mysql.jdbc.Driver
    • convertType:
    • url: the database url
    • user: the database username
    • password: the database password
  • document
    • entity
      • name: an arbitrary label that nested entities can reference, e.g. ${item.id} (assuming name="item")
      • query: the statement that pulls all data
      • deltaQuery: the statement that pulls incremental data, e.g. deltaQuery="select * from xxx where last_modified > '${dataimporter.last_index_time}'"
      • field: maps a dataSource column to a Solr field (see the sketch after this list)
      • entity: a nested entity defining one-to-many data; the parent entity's name can be used in the condition, e.g. where item_id='${item.id}'
      • transformer: the transformer object(s) to apply; separate multiple values with commas

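A minimal db-data-config.xml sketch combining these elements (the database, table, and column names are hypothetical):

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/shop" user="reader" password="secret"/>
  <document>
    <entity name="item" query="select id, name from item"
            deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
      <field column="id"   name="id"/>
      <field column="name" name="name"/>
      <!-- nested entity: one-to-many data keyed by the parent entity's id -->
      <entity name="tag" query="select tag from item_tag where item_id='${item.id}'">
        <field column="tag" name="tag"/>
      </entity>
    </entity>
  </document>
</dataConfig>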
Various DIH operations are executed through HTTP requests (POST or GET). The request format is http://<host>:<port>/solr/<collection>/dataimport?command=<command>, where the commands are as follows:

  • abort: stops the operation currently in progress
  • delta-import: runs deltaQuery to pull incremental data. It accepts the same extra parameters as full-import, such as &clean=true&commit=true
  • full-import: runs query to pull all data. The request returns immediately while a new background thread rebuilds the index; the status command reports progress (see the example after this list). Extra parameters:
    • clean: default true; whether to clear the old index before rebuilding
    • commit: default true; whether to commit the operation
    • debug: default false; debug mode does not commit, so add commit=true if a commit is also needed
    • entity: defaults to all entities; one or more entities can be named
    • optimize: default true; whether to optimize the index after the operation completes
  • reload-config: reloads the configuration file after it has been modified
  • status: returns statistics and the current state of the DIH
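For example, a full rebuild might be triggered like this (the host and collection name are illustrative):

curl "http://localhost:8983/solr/brand/dataimport?command=full-import&clean=true&commit=true"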

 

6. SolrCloud data migration

Migrating from SolrCloud to a stand-alone Solr instance is relatively simple: download the configuration stored in ZooKeeper to the local machine and save it into the local conf directory (note that ZooKeeper's directory structure differs somewhat from the local one), then remove or comment out the ZK_HOST line in solr.in.sh and restart.

Turning stand-alone Solr nodes into a SolrCloud cluster is different and rather more troublesome: the assembled cluster may contain servers that are clearly able to serve requests yet show up as failed (down state) in clusterstatus. In that case the replica has to be deleted first.

The replica can be deleted through the API first; note that the replica name must be supplied. In the local core.properties file:

name=brand_brand_replica1  
shard=brand    
collection=brand
coreNodeName=core_node2

The coreNodeName is the corresponding replica name; the deletion can be performed through the DELETEREPLICA call of the Collections REST API.
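A sketch of that call, using the values from the core.properties above and assuming a node on localhost:8983:

curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=brand&shard=brand&replica=core_node2"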

Solr service nodes are CPU-intensive, so it is best to deploy them on machines with 8-core CPUs. In solr.in.sh the JVM heap is set to 3096M (it should not be set too large, otherwise a full GC pause would take too long).
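In solr.in.sh this corresponds to a heap setting along the following lines (a sketch, assuming the SOLR_JAVA_MEM variable shipped in the 5.x solr.in.sh; adjust the size to your hardware):

SOLR_JAVA_MEM="-Xms3096m -Xmx3096m"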
