Solr Index Updates: JSON and CSV

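In the previous section we looked at Solr's XML update format and the elements and attributes it supports. Those alone already give you good control over how Solr indexes documents.

XML lets you manipulate document content in considerable detail, but the format is verbose. JSON is the popular choice these days: it is simple and easy to read, and Solr has supported JSON index updates since version 3.1.

From Solr 4.0 onward you can use UpdateRequestHandler directly, as long as the request carries a Content-type:application/json or Content-type:text/json header to tell Solr that the body is a JSON update.

If you have questions about the JSON format itself, see this post: http://isilic.iteye.com/blog/1747660

Using the curl command: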

curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @books.json -H 'Content-type:application/json'

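The value after --data-binary starts with @, which tells curl to read the request body from a local file and submit that file's contents. This was mentioned briefly in http://isilic.iteye.com/blog/1764049 without much detail, so I am calling it out here. The other curl options work as usual.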
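
For readers who would rather script the upload than call curl, here is a minimal Python sketch of the same request. It only builds the request object; the URL is the local example server from the curl command, and the function name build_json_update is my own invention, not part of any Solr client library.

```python
import json
import urllib.request

# Assumed endpoint: the same local example server used by the curl command.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json?commit=true"


def build_json_update(docs, url=SOLR_UPDATE_URL):
    """Return a POST request that carries `docs` as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # This header is what tells Solr to treat the body as JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_update(
    [{"id": "978-0641723445", "name": "The Lightning Thief"}]
)
# Sending is left commented out because it needs a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

To upload a file such as books.json instead, read its bytes and pass them as the request body in place of the json.dumps output.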

books.json lives in the example/exampledocs directory. Part of its content looks like this:

[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "name" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
,
  {
    "id" : "978-1423103349",
    "cat" : ["book","paperback"],
    "name" : "The Sea of Monsters",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 2,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 6.49,
    "pages_i" : 304
  },
  {},
  {}
]

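There is not much to explain here: it is the same data as the XML format, expressed as JSON. Submitting it tells Solr to index these documents.

Note that some field names carry suffixes such as _i and _t. These suffixes have a special meaning, which we will come back to in the next post when we cover DIH.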
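
As a rough preview, and as far as I can tell, these suffixes line up with dynamic field rules in the example schema.xml, where a pattern such as *_i selects a field type. The sketch below hard-codes a small mapping to show the idea; the mapping and the function guess_field_type are assumptions for illustration, and the real rules are whatever your own schema defines.

```python
# Assumed mapping, modelled on the dynamic-field rules in Solr's example
# schema.xml. These rules are configuration, not built-in behaviour, so
# check your own schema before relying on them.
SUFFIX_TYPES = {
    "_i": "int",
    "_s": "string",
    "_t": "text",
}


def guess_field_type(field_name):
    """Return the type implied by the field's suffix, or None if none matches."""
    for suffix, type_name in SUFFIX_TYPES.items():
        if field_name.endswith(suffix):
            return type_name
    return None


# Field names taken from books.json above.
types = {name: guess_field_type(name) for name in ("pages_i", "genre_s", "series_t", "name")}
```

Fields without a matching suffix, such as name, have to be declared explicitly in the schema.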

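If you need to operate on several documents at once and control each one differently, you need the more elaborate JSON format below. Only the syntax is a little more involved; the content is exactly the same as in XML.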

{ 
"add": {
  "doc": {
    "id": "DOC1",
    "my_boosted_field": {        /* use a map with boost/value for a boosted field */
      "boost": 2.3,
      "value": "test"
    },
    "my_multivalued_field": [ "aaa", "bbb" ]   /* use an array for a multi-valued field */
  }
},
"add": {
  "commitWithin": 5000,          /* commit this document within 5 seconds */
  "overwrite": false,            /* don't check for existing documents with the same uniqueKey */
  "boost": 3.45,                 /* a document boost */
  "doc": {
    "f1": "v1",
    "f1": "v2"
  }
},

"commit": {},
"optimize": { "waitFlush":false, "waitSearcher":false },

"delete": { "id":"ID" },                               /* delete by ID */
"delete": { "query":"QUERY" },                         /* delete by query */
"delete": { "query":"QUERY", "commitWithin":500 }      /* delete by query, commit within 500ms */
}
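
One practical wrinkle: this format repeats keys such as "add" and "delete" inside a single JSON object, which an ordinary Python dict cannot represent. The sketch below works around that by serializing a list of (command, payload) pairs by hand. The helper build_command_body is hypothetical, and the payloads are trimmed versions of the example above.

```python
import json


def build_command_body(commands):
    """Serialize (command, payload) pairs into one JSON object, keeping repeated keys."""
    parts = [
        f"{json.dumps(name)}: {json.dumps(payload)}"
        for name, payload in commands
    ]
    return "{" + ", ".join(parts) + "}"


commands = [
    ("add", {"doc": {"id": "DOC1", "my_multivalued_field": ["aaa", "bbb"]}}),
    ("add", {"commitWithin": 5000, "overwrite": False, "doc": {"f1": ["v1", "v2"]}}),
    ("delete", {"id": "ID"}),
    ("delete", {"query": "QUERY"}),
    ("commit", {}),
]
body = build_command_body(commands)
```

Keeping the commands in a list also preserves their order, which matters when an add is followed by a delete or a commit.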

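The meaning of these fields was covered in the previous post on the XML format, so I will not repeat it here.

Solr 4.0 also introduced Atomic Updates, which let you modify individual fields of an existing document with modifiers such as inc, as in this example: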

curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
 {
  "id"        : "TestDoc1",
  "title"     : {"set":"test1"},
  "revision"  : {"inc":3},
  "publisher" : {"add":"TestPublisher"}
 }
]'

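Under the hood, Solr relies on the built-in _version_ field to support atomic updates of individual fields. The modifiers shown here are set, add and inc. The JSON syntax is the one I have seen used most often for this.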
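
To make the three modifiers concrete, here is a small Python model of what they do to a stored document. It is only an illustration of the semantics as I understand them, not Solr's implementation; apply_atomic_update and the stored document are made up for the example.

```python
def apply_atomic_update(stored, update):
    """Return a new document with set/inc/add modifiers applied to a stored one."""
    result = dict(stored)
    for field, change in update.items():
        if not isinstance(change, dict):
            result[field] = change  # plain value, e.g. the id
            continue
        for modifier, value in change.items():
            if modifier == "set":
                result[field] = value  # replace the current value
            elif modifier == "inc":
                result[field] = result.get(field, 0) + value  # numeric increment
            elif modifier == "add":
                current = result.get(field, [])
                if not isinstance(current, list):
                    current = [current]
                extra = value if isinstance(value, list) else [value]
                result[field] = current + extra  # append to a multi-valued field
            else:
                raise ValueError(f"unknown modifier: {modifier}")
    return result


stored = {"id": "TestDoc1", "title": "old", "revision": 4, "publisher": ["A"]}
update = {
    "id": "TestDoc1",
    "title": {"set": "test1"},
    "revision": {"inc": 3},
    "publisher": {"add": "TestPublisher"},
}
updated = apply_atomic_update(stored, update)
```

Fields that the update does not mention keep their stored values, which is the point of an atomic update compared with resending the whole document.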

curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
 {
  "id"        : "TestDoc1",
  "title"     : {"set":"test1"},
  "revision"  : {"inc":3},
  "publisher" : {"add":"TestPublisher"},
  "_version_" : 12345
 }
]'

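This _version_ field is one that Solr adds itself. It enables a feature called Optimistic Concurrency (http://wiki.apache.org/solr/Optimistic_Concurrency). I have read up on it but have not fully understood it, so I will not say much more; if you need this feature, the wiki page is the place to start.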
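
For what it is worth, my reading of that wiki page is that the _version_ value sent with an update is compared against the stored document according to four rules. The sketch below models those rules in plain Python; it is my interpretation rather than Solr's code, and version_ok is a hypothetical name.

```python
def version_ok(supplied, stored):
    """Check a supplied _version_ against the stored one.

    `stored` is None when the document does not exist in the index.
    """
    if supplied > 1:
        return stored == supplied      # must match the stored version exactly
    if supplied == 1:
        return stored is not None      # the document only has to exist
    if supplied < 0:
        return stored is None          # the document must not exist yet
    return True                        # 0: no concurrency check at all


# The update from the example above succeeds only if the index still holds 12345.
accepted = version_ok(12345, 12345)
rejected = version_ok(12345, 67890)
```

When the check fails, the client is expected to re-read the document, reapply its change and try again with the new version.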

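Finally, let's look at updating the index with CSV. Solr has supported CSV since version 1.2. The format is simple, which also makes it more limited in practice, so I will only cover it briefly. When would you use CSV data? When the data is simple and you do not need fine-grained control over fields. Personally I tend to see it as a historical leftover, especially now that richer formats are so widely available, but that is just my opinion.

To submit index updates in CSV format, add the following to solrconfig.xml: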

<requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy"></requestHandler>

In Solr 4.0 you can use this handler directly:

<requestHandler name="/update" class="solr.UpdateRequestHandler"/>

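In that case the request needs a Content-type:application/csv or Content-type:text/csv header, just as with XML and JSON.

CSV stands for Comma Separated Values, that is, content whose fields are separated by commas. To submit CSV data to Solr you can use curl like this: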

curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H 'Content-type:text/plain; charset=utf-8'

books.csv is in the example/exampledocs directory and contains:

id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
0553573403,book,A Game of Thrones,7.99,true,George R.R. Martin,"A Song of Ice and Fire",1,fantasy
0553579908,book,A Clash of Kings,7.99,true,George R.R. Martin,"A Song of Ice and Fire",2,fantasy
055357342X,book,A Storm of Swords,7.99,true,George R.R. Martin,"A Song of Ice and Fire",3,fantasy
0553293354,book,Foundation,7.99,true,Isaac Asimov,Foundation Novels,1,scifi
0812521390,book,The Black Company,6.99,false,Glen Cook,The Chronicles of The Black Company,1,fantasy
0812550706,book,Ender's Game,6.99,true,Orson Scott Card,Ender,1,scifi
0441385532,book,Jhereg,7.95,false,Steven Brust,Vlad Taltos,1,fantasy
0380014300,book,Nine Princes In Amber,6.99,true,Roger Zelazny,the Chronicles of Amber,1,fantasy
0805080481,book,The Book of Three,5.99,true,Lloyd Alexander,The Chronicles of Prydain,1,fantasy
080508049X,book,The Black Cauldron,5.99,true,Lloyd Alexander,The Chronicles of Prydain,2,fantasy

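A quick overview of the parameters Solr supports for this format:

separator: the character that separates the fields in each line; the default is a comma
header: whether the first line of the input contains the field names
skip: a list of fields to skip

fieldnames=id,name,category&skip=name
Or ignore the name column directly by leaving its name empty:
fieldnames=id,,category

skipLines: the number of lines to skip at the start of the CSV data; the default is skipLines=0
trim: removes leading and trailing whitespace from values
encapsulator: the character that can surround a value so that separators inside it are not treated as field boundaries; the default is the double quote "
escape: the character used to escape reserved characters. When escape is set, the encapsulator is not used unless it is also set explicitly.

keepEmpty: if true, fields with no value or a zero-length value are indexed as well; the default is false
literal: adds a fixed field value to every document
literal.datasource=products: adds datasource=products to every document
map: a mapping that replaces the left-hand value with the right-hand value
map=Absolutely:true   replaces the value Absolutely with true in every field
split: if true, a value is parsed again and split into multiple values; usually set per field, for example f.tags.split=true
id,tags
101,"movie,spiderman,action"
With splitting enabled for tags, the value is parsed further and the document gets three tags values: tags=movie, tags=spiderman and tags=action
overwrite: whether a document replaces an existing document with the same unique key; the default is true
commit: commit the changes once all records in the request have been indexed; the default is false, which avoids the performance cost of frequent commits.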
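
To see how a few of these parameters reshape the input, here is a local Python imitation built on the standard csv module. It does not talk to Solr and only approximates skip, map, split and literal; parse_csv_update and the official column are invented for the example.

```python
import csv
import io


def parse_csv_update(text, skip=(), split=(), value_map=None, literals=None):
    """Turn CSV text into document dicts, mimicking a few loader parameters."""
    value_map = value_map or {}
    literals = literals or {}
    docs = []
    # DictReader treats the first line as field names, like header=true.
    for row in csv.DictReader(io.StringIO(text)):
        doc = dict(literals)                      # literal.<field>=<value>
        for field, value in row.items():
            if field in skip:                     # skip=<field>
                continue
            value = value_map.get(value, value)   # map=<from>:<to>
            if field in split:                    # f.<field>.split=true
                value = value.split(",")
            doc[field] = value
        docs.append(doc)
    return docs


sample = 'id,tags,official\n101,"movie,spiderman,action",Absolutely\n'
docs = parse_csv_update(
    sample,
    split={"tags"},
    value_map={"Absolutely": "true"},
    literals={"datasource": "products"},
)
```

The double quotes around the tags value play the role of the encapsulator: they keep the inner commas from being read as field separators until the split step runs.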

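One drawback of CSV is that you cannot set index-time boosts. If you need that feature, CSV is probably not the right choice.

One last point: the CSV handler can also load files that use other separators, not only commas. For example, a tab-separated file produced in MySQL: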

select * into outfile '/tmp/result.txt' from tmptable

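Note: running this statement may fail with a permissions error, which I will leave to you to sort out.

This yields tab-separated data. Set separator accordingly when you submit it to Solr and it can be indexed.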
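
A tab cannot be typed into a URL as is, so it has to be percent-encoded. The sketch below builds such a URL in Python. The endpoint is the CSV handler path used earlier, the column names are hypothetical, and the escape setting is an assumption based on MySQL escaping special characters with a backslash by default. Because the MySQL export has no header line, the field names are passed explicitly.

```python
from urllib.parse import urlencode

# Assumed endpoint: the CSV handler path used earlier in this post.
CSV_UPDATE_URL = "http://localhost:8983/solr/update/csv"


def build_csv_update_url(params, base=CSV_UPDATE_URL):
    """Append percent-encoded CSV loader parameters to the update URL."""
    return base + "?" + urlencode(params)


url = build_csv_update_url({
    "separator": "\t",               # encoded as %09
    "escape": "\\",                  # assumption: match MySQL's default escape character
    "fieldnames": "id,name,price",   # hypothetical columns of tmptable
    "header": "false",               # the export has no header line
    "commit": "true",
})
```

The resulting URL can be used with curl and --data-binary @/tmp/result.txt in the same way as the books.csv example.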

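Overall, XML and JSON are the mainstream formats. If your data happens to convert cleanly to CSV, go ahead and use CSV; otherwise stick with XML or JSON, which give you better control.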
