如何用Solr搭建自己的搜索服务

最近在学习Solr，借这个机会丰富一下自己空荡荡的博客，同时也加深一下自己记忆。还有特别提示各位读者，本人也是刚刚接触Solr，对其了解并不深入，有说的不对或错误的地方，望各位多多指点。

在搭建Solr搜索服务之前，先来了解两个问题。

什么是Solr?
Solr能做什么?

什么是Solr

Solr是一个基于Lucene实现的全文搜索服务器。底层使用易于扩展和修改的Java 来实现。服务器通信使用标准的HTTP 和XML，所以如果使用Solr 了解Java 技术会有用却不是必须的要求。

Solr 主要特性有：

强大的全文检索功能，
高亮显示检索结果，
动态集群，
数据库接口和电子文档（Word ，PDF 等）的处理。

而且Solr 具有高度的可扩展，支持分布搜索和索引的复制。

Solr能做什么

现在有站点都使用Solr作为搜索服务器，不仅仅因为它是Apache的开源项目，而是因为它

下载最新的Solr和Resin
Solr: http://lucene.apache.org/solr/
Resin: http://www.caucho.com/download/

我这里下载的Solr是4.3.0版本； Resin用的是本人以前一直用的3.0.25，这里也可以用Tomcat，有关Tomcat搭建Solr搜索服务器的可以参考：
http://www.originsoft.net/archives/32

数据用的是MySql，这里就不说MySql了

Solr 程序包目录结构
解压Solr压缩包到D:/solr-4.3.0

build ：在solr 构建过程中放置已编译文件的目录。
client ：包含了一些特定语言调用Solr 的API 客户端程序，目前只有Ruby 可供选择，Java 客户端叫SolrJ 在src/solrj 中可以找到。
dist ：存放Solr 构建完成的JAR 文件、WAR 文件和Solr 依赖的JAR 文件。
example ：是一个安装好的Jetty 中间件，其中包括一些样本数据和Solr 的配置信息。

src ：Solr 相关源码。

　　Solr 的源码没有放在同一个目录下，src/java 存放大多数文件，src/common 是服务器端与客户端公用的代码，src/test 放置solr 的测试程序，serlvet 的代码放在src/webapp/src 中。

安装Resin
这里下载的是压缩包，直接解压即可D:/resin-pro-3.0.25

开始搭建Solr搜索环境
该准备的工作已经准备完毕，下面开始详述搭建Solr搜索环境

1、打开Eclipse新建一个Web项目

这里我是用新建的Maven项目，将D:\solr-4.3.0\example\webapps\solr.war解压到新建项目的webapp下面：如下图

在webapp下新建solr_home，这个目录后面配置solr/home JNDI 即：Lucene索引目录文件

2、修改web.xml配置solr/home JNDI

  <env-entry>
    <env-entry-name>solr/home</env-entry-name>
    <env-entry-value>src/main/webapp/solr_home</env-entry-value>
    <env-entry-type>java.lang.String</env-entry-type>
  </env-entry>

这边也可以在中间件配置solr/home
如：Tomcat可以在%CATALINA_HOME%/conf/Catalina/localhost/solr.xml

<?xml version='1.0' encoding='utf-8'?>
<Context docBase="D:/eclipse-4.2/workspace/acme-search/src/main/webapp" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="D:/eclipse-4.2/workspace/acme-search/src/main/webapp/solr_home" override="true"/>
</Context>

3、配置索引服务
在solr/home配置的目录下有solr.xml，该文件配置该索引服务上所拥有的所有索引。每个<core>标签表示一个搜索服务，它的name可以任意取名，再好与instanceDir索引根目录名一致。
instanceDir索引根目录名，即：solr/home下索引实例目录。该目录下需要有conf子目录下面包含有data-config.xml、schema.xml、solrconfig.xml。这三个Xml文件在下面会具体讲到。

<?xml version="1.0" encoding="UTF-8" ?>

  <solr persistent="true">

  <cores adminPath="/admin/cores" defaultCoreName="collection1" host="${host:}" hostPort="${jetty.port:8983}" hostContext="${hostContext:solr}" zkClientTimeout="${zkClientTimeout:15000}">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>

4、配置索引创建数据源
这里要说道data-config.xml、schema.xml、solrconfig.xml和DataImportHandler

配置索引数据源data-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
	<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"  url="jdbc:mysql://192.168.1.101:3306/wms"  user="root"  password="123"/>  
	<document name="content">
		<entity name="node" pk="USER_ID" query="SELECT A.USER_ID, A.ADDRESS, DATE_FORMAT(A.BIRTHDAY,'%Y-%m-%d') BIRTHDAY, A.CARD_NO, A.EMAIL, A.NAME, A.PHONE, A.PRE_MONEY, DATE_FORMAT(A.REG_DATE,'%Y-%m-%d') REG_DATE, A.SCORE, A.SEX, A.STATUS, A.TYPE FROM USERS A">
			<field column="USER_ID" name="id" />
			<field column="ADDRESS" name="address" />
			<field column="BIRTHDAY" name="birthday" />
			<field column="CARD_NO" name="cardNo" />
			<field column="EMAIL" name="email" />
			<field column="NAME" name="name" />
			<field column="PHONE" name="phone" />
			<field column="PRE_MONEY" name="preMoney" />
			<field column="REG_DATE" name="redDate" />
			<field column="SCORE" name="score" />
			<field column="SEX" name="sex" />
			<field column="STATUS" name="status" />
			<field column="TYPE" name="type" />
		</entity>
	</document>
</dataConfig>

配置域（Field）类型及索引、查询分词器schema.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--  
 This is the Solr schema file. This file should be named "schema.xml" and
 should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default) 
 or located where the classloader for the Solr webapp can find it.

 This example schema is the recommended starting point for users.
 It should be kept correct and concise, usable out-of-the-box.

 For more information, on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->

<schema name="db" version="1.1">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       Applications should change this to reflect the nature of the search collection.
       version="1.1" is Solr's version number for the schema syntax and semantics.  It should
       not normally be changed by applications.
       1.0: multiValued attribute did not exist, all fields are multiValued by nature
       1.1: multiValued attribute introduced, false by default -->

  <types>
    <!-- field type definitions. The "name" attribute is
       just a label to be used by field definitions.  The "class"
       attribute and any other attributes determine the real
       behavior of the fieldType.
         Class names starting with "solr" refer to java classes in the
       org.apache.solr.analysis package.
    -->

    <!-- The StrField type is not analyzed, but indexed/stored verbatim.  
       - StrField and TextField support an optional compressThreshold which
       limits compression (if enabled in the derived fields) to values which
       exceed a certain size (in characters).
    -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

    <!-- The optional sortMissingLast and sortMissingFirst attributes are
         currently supported on types that are sorted internally as strings.
       - If sortMissingLast="true", then a sort on this field will cause documents
         without the field to come after documents with the field,
         regardless of the requested sort order (asc or desc).
       - If sortMissingFirst="true", then a sort on this field will cause documents
         without the field to come before documents with the field,
         regardless of the requested sort order.
       - If sortMissingLast="false" and sortMissingFirst="false" (the default),
         then default lucene sorting will be used which places docs without the
         field first in an ascending sort and last in a descending sort.
    -->    


    <!-- numeric field types that store and index the text
         value verbatim (and hence don't support range queries, since the
         lexicographic ordering isn't equal to the numeric ordering) -->
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>


    <!-- Numeric field types that manipulate the value into
         a string value that isn't human-readable in its internal form,
         but with a lexicographic ordering the same as the numeric ordering,
         so that range queries work correctly. -->
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>


    <!-- The format for this date field is of the form 1995-12-31T23:59:59Z, and
         is a more restricted form of the canonical representation of dateTime
         http://www.w3.org/TR/xmlschema-2/#dateTime    
         The trailing "Z" designates UTC time and is mandatory.
         Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
         All other components are mandatory.

         Expressions can also be used to denote calculations that should be
         performed relative to "NOW" to determine the value, ie...

               NOW/HOUR
                  ... Round to the start of the current hour
               NOW-1DAY
                  ... Exactly 1 day prior to now
               NOW/DAY+6MONTHS+3DAYS
                  ... 6 months and 3 days in the future from the start of
                      the current day
                      
         Consult the DateField javadocs for more information.
      -->
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>


    <!-- The "RandomSortField" is not used to store or search any
         data.  You can declare fields of this type it in your schema
         to generate psuedo-random orderings of your docs for sorting 
         purposes.  The ordering is generated based on the field name 
         and the version of the index, As long as the index version
         remains unchanged, and the same field name is reused,
         the ordering of the docs will be consistent.  
         If you want differend psuedo-random orderings of documents,
         for the same version of the index, use a dynamicField and
         change the name
     -->
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />

    <!-- solr.TextField allows the specification of custom text analyzers
         specified as a tokenizer and a list of token filters. Different
         analyzers may be specified for indexing and querying.

         The optional positionIncrementGap puts space between multiple fields of
         this type on the same document, with the purpose of preventing false phrase
         matching across fields.

         For more info on customizing your analyzer chain, please see
         http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
     -->

    <!-- One can also specify an existing Analyzer class that has a
         default constructor via the class attribute on the analyzer element
    <fieldType name="text_greek" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
    </fieldType>
    -->

    <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
        words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
        so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
        Synonyms and stopwords are customized by external files, and stemming is enabled.
        Duplicate tokens at the same position (which may result from Stemmed Synonyms or
        WordDelim parts) are removed.
        -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


    <!-- Less flexible matching, but less false matches.  Probably not ideal for product names,
         but may be good for SKUs.  Can insert dashes in the wrong place and still match. -->
    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- This is an example of using the KeywordTokenizer along
         With various TokenFilterFactories to produce a sortable field
         that does not include some properties of the source text
      -->
    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java Regular expression to replace any sequence of characters
             matching a pattern with an arbitrary replacement string, 
             which may include back refrences to portions of the orriginal
             string matched by the pattern.
             
             See the Java Regular Expression documentation for more
             infomation on pattern and replacement string syntax.
             
             http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
          -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-z])" replacement="" replace="all"
        />
      </analyzer>
    </fieldType>

    <!-- since fields of this type are by default not stored or indexed, any data added to 
         them will be ignored outright 
     --> 
    <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" /> 
    
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

 </types>


 <fields>
   <!-- Valid attributes for fields:
     name: mandatory - the name for the field
     type: mandatory - the name of a previously defined type from the <types> section
     indexed: true if this field should be indexed (searchable or sortable)
     stored: true if this field should be retrievable
     multiValued: true if this field may contain multiple values per document
     omitNorms: (expert) set to true to omit the norms associated with
       this field (this disables length normalization and index-time
       boosting for the field, and saves some memory).  Only full-text
       fields or fields that need an index-time boost need norms.
     termVectors: [false] set to true to store the term vector for a given field.
       When using MoreLikeThis, fields used for similarity should be stored for 
       best performance.
   -->
			
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="address" type="string" indexed="true" stored="true"/>
   <field name="cardNo" type="string" indexed="true" stored="false"/>
   <field name="name" type="string" indexed="true" stored="true"/>
   <field name="email" type="text" indexed="true" stored="true"/>

   <field name="phone" type="string" indexed="true" stored="true"/>
   <field name="preMoney"  type="sfloat" indexed="true" stored="true"/>
   <!-- "default" values can be specified for fields, indicating which
        value should be used if no value is specified when adding a document.
     -->
   <field name="birthday" type="string" indexed="true" stored="true"/>
   <field name="score" type="sint" indexed="true" stored="true" default="0"/>

   <!-- Some sample docs exists solely to demonstrate the spellchecker
        functionality, this is the only field they container.
        Typically you might build the spellchecker of "catchall" type field
        containing all of the text in each document
     -->
   <field name="sex" type="string" indexed="true" stored="true"/>
   <field name="type" type="string" indexed="true" stored="true"/>

   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="status" type="string" indexed="true" stored="false"/>
   
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

   <field name="_version_" type="long" indexed="true" stored="true"/>

   <!-- Dynamic field definitions.  If a field name is not found, dynamicFields
        will be used if the name matches any of the patterns.
        RESTRICTION: the glob-like pattern in the name attribute must have
        a "*" only at the start or the end.
        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
        Longer patterns will be matched first.  if equal size patterns
        both match, the first appearing in the schema will be used.  -->
   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>

   <dynamicField name="random*" type="random" />

   <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default --> 
   <!--dynamicField name="*" type="ignored" multiValued="true" /-->
   
 </fields>

 <!-- Field to use to determine and enforce document uniqueness. 
      Unless this field is marked with required="false", it will be a required field
   -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>name</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="OR"/>

 <!-- Similarity is the scoring routine for each document vs. a query.
      A custom similarity may be specified here, but the default is fine
      for most applications.  -->
 <!-- <similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/> -->

</schema>

配置DataImportHandler
打开solrconfig.xml，加上下面的内容

  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  	<lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>

到这里，配置完成。重启应用，打开浏览器
http://localhost:8089/solr/

如何用Solr搭建自己的搜索服务

猜你喜欢