Scrapy builds a distributed search-engine crawler - (8) Elasticsearch combined with Django to build a search engine

8. Building a search engine with Elasticsearch

What Elasticsearch is: a search server based on Lucene — a multi-user, distributed full-text search engine developed in Java with a RESTful web interface.
Adding search to your own website or application is hard, so we want a search solution that is effective, needs essentially zero configuration, and is free.
Elasticsearch lets you interact with the search engine simply through HTTP and JSON; it supports distribution and can scale from a single server to many servers.

Built-in features:
word segmentation, scoring of search results, parsing of search requests
other full-text search engines: Solr, Sphinx
many large companies use Elasticsearch: Facebook, Microsoft, Dell, and so on

  1. Elasticsearch is a wrapper around Lucene: it can both store data and analyze it, which makes it well suited for a search engine.
    Drawbacks of searching in a relational database:
    search results cannot be scored and ranked
    not distributed, so searching is cumbersome and demanding on programmers
    cannot parse search requests — no word segmentation, for example — so content search is not possible
    low efficiency once the data grows
    word segmentation is required, and the relationships between data points are not the focus of search

  2. NoSQL databases:
    A document database stores JSON. To keep the same data in a relational database you have to maintain several tables (many-to-many relationships and the like involve multiple tables), whereas the whole content can be written down as one JSON document and saved directly in a NoSQL document database.
    Example: MongoDB.

1. Elasticsearch installation and configuration

  1. Install the Java SDK.
  2. Do not use the version downloaded from the Elasticsearch official website; the original ships with few plugins.
  3. Search GitHub for elasticsearch-rtf, a Chinese release with many plugins pre-installed: https://github.com/medcl/elasticsearch-rtf
  4. To run Elasticsearch, open a command line in the bin directory and execute elasticsearch.bat.
  5. Configuration file: elasticsearch-rtf/elasticsearch-rtf-master/config/elasticsearch.yml

(screenshot: verifying the Elasticsearch installation)
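To verify the installation, a quick check against the HTTP interface (a minimal sketch; the address assumes a local install on the default port 9200):

curl http://localhost:9200

A JSON document with the node name and version information should come back.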

2. Two important Elasticsearch plugins: installing head and Kibana

The head plugin is the equivalent of Navicat: a browser-based tool for managing the database.

https://github.com/mobz/elasticsearch-head

Running with built in server

git clone git://github.com/mobz/elasticsearch-head.git
cd elasticsearch-head
npm install
npm run start
open http://localhost:9100/

Configuring Elasticsearch to work with head
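For head (served on port 9100) to talk to Elasticsearch (port 9200), cross-origin requests have to be allowed. A minimal sketch of the settings commonly appended to config/elasticsearch.yml (exact keys can vary by ES version):

http.cors.enabled: true
http.cors.allow-origin: "*"

Restart Elasticsearch afterwards so the settings take effect.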

(screenshot: head installation)

To run Kibana, execute kibana.bat in its bin directory.

(screenshot: Kibana)

2. Basic Elasticsearch concepts

  1. Cluster: one or more nodes organized together.
  2. Node: a single server in the cluster.
  3. Shard: the ability to split an index into multiple parts, allowing horizontal splitting and capacity scaling; multiple shards can answer requests.
  4. Replica: a copy of a shard; when one node fails, other nodes take over.

| Elasticsearch | Relational database |
| ------------- | ------------------- |
| index         | database            |
| type          | table               |
| document      | row                 |
| fields        | columns             |

For searching and saving collections, HTTP adds five more methods beyond GET and POST:
OPTIONS & PUT & DELETE & TRACE & CONNECT

3. Inverted index

(figures: inverted index)

Problems an inverted index has to solve:

(figure: inverted index)

Creating an index

(screenshot: viewing the index in head)
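Creating an index is a single PUT; a minimal sketch (the index name lagou and the shard/replica counts are illustrative):

PUT lagou
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}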

4. Elasticsearch commands

PUT lagou/job/1
Here 1 is the id.

POST lagou/job/
Without an explicit id (use POST instead of PUT), a UUID is generated automatically.

Modifying some of the fields:
POST lagou/job/1/_update

DELETE lagou/job/1
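Put together, a minimal sketch of these commands with an illustrative document (the field names are made up for the example):

PUT lagou/job/1
{
  "title": "python分布式爬虫开发",
  "salary_min": 15000,
  "city": "北京"
}

POST lagou/job/1/_update
{
  "doc": {
    "salary_min": 20000
  }
}

DELETE lagou/job/1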

Elasticsearch batch retrieval (_mget):

Query, within the index testdb, the document with id 1 in type job1 and the document with id 2 in type job2:

GET _mget
{
  "docs": [
    {
      "_index": "testdb",
      "_type": "job1",
      "_id": 1
    },
    {
      "_index": "testdb",
      "_type": "job2",
      "_id": 2
    }
  ]
}

If the index is already given in the URL, it does not have to be repeated in each doc entry:

GET testdb/_mget
{
  "docs": [
    {
      "_type": "job1",
      "_id": 1
    },
    {
      "_type": "job2",
      "_id": 2
    }
  ]
}

If even the type is the same and only the ids differ:

GET testdb/job1/_mget
{
  "docs": [
    {
      "_id": 1
    },
    {
      "_id": 2
    }
  ]
}

Or, abbreviating further:

GET testdb/job1/_mget
{
  "ids": [1, 2]
}

Elasticsearch's bulk API: multiple operations such as index, delete, update and create can be combined in one request, including moving documents from one index to another:

  • action_and_meta_data\n
  • optional_source\n
  • action_and_meta_data\n
  • optional_source\n
  • ….
  • action_and_meta_data\n
  • optional_source\n

Every operation consists of two lines — a metadata line and a data line — except for delete, which has only the metadata line.
Note that the data must not be pretty-printed: each operation has to stay in this two-line form. Parsed, standard JSON formatting will raise an error.

POST _bulk
{"index":...}
{"field":...}

Elasticsearch mapping

Elasticsearch mapping: when creating an index, the type of each field and its related properties can be defined in advance. Each field gets one type, and the available properties are richer than in MySQL. We did not pass a mapping earlier because Elasticsearch guesses the base type from the JSON source data. A mapping is our own definition of the fields' data types; it also tells Elasticsearch how to index the data and whether it can be searched.
Effect: the index is built in a more precise and complete way. In most cases we do not need to define it ourselves.

Configuration of the related properties:

  • String type: two kinds, text and keyword. text is analyzed and indexed (inverted index and so on); keyword is treated as a literal string, is not analyzed, and can only be found by an exact match. The old string type was deprecated in ES 5.
  • Date types: date, datetime, and so on
  • Numeric types: integer, long, double, and so on
  • boolean type
  • binary type
  • Complex types: object, nested
  • geo types: geo_point for geographic locations
  • Specialized types: ip, completion
  • object: a JSON object that itself contains nested {} objects
  • nested: data in array form
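A minimal sketch of defining a mapping when the index is created (index and field names are illustrative; ik_max_word assumes the ik analyzer plugin that ships with elasticsearch-rtf):

PUT lagou
{
  "mappings": {
    "job": {
      "properties": {
        "title": {"type": "text", "analyzer": "ik_max_word"},
        "city": {"type": "keyword"},
        "salary_min": {"type": "integer"},
        "publish_date": {"type": "date", "format": "yyyy-MM-dd"}
      }
    }
  }
}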

Elasticsearch queries:

Roughly three categories:

  • Basic queries
  • Compound queries
  • Filtering: alongside the query, filter conditions select data without affecting scoring

match query:

What follows the field name is the keyword; every document about python will be retrieved. A match query analyzes the content, automatically lower-cases the supplied keyword, and the built-in ik analyzer segments it. For 'python网站', for example, any document containing any of the segments is returned.
GET lagou/job/_search

{
  "query": {
    "match": {
      "title": "python"
    }
  }
}

term query

The difference: the supplied value is not processed in any way, as with keyword. Only documents containing the entire supplied value match; a partial match is not enough, it has to be exact.

terms query

Pass several values for title; as soon as one of them matches, the result is returned.
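A minimal sketch (the values are illustrative):

GET lagou/_search
{
  "query": {
    "terms": {
      "title": ["python", "django", "scrapy"]
    }
  }
}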

Controlling how many results a query returns

GET lagou/_search
{
  "query": {
    "match": {
      "title": "python"
    }
  },
  "from": 1,
  "size": 2
}

This is how pagination is done. Note that from is zero-based, so this returns two results starting from the second hit.

match_all returns everything

GET lagou/_search
{
  "query": {
    "match_all": {}
  }
}

**match_phrase query (phrase query)**

GET lagou/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "python系统",
        "slop": 6
      }
    }
  }
}

'python系统' is analyzed and split into terms; a document has to satisfy all of the terms to be returned. The slop parameter states how far apart the terms are allowed to be.

multi_match query

Several fields can be specified; for example, query the title and desc fields for documents containing the keyword python.

GET lagou/_search
{
  "query": {
    "multi_match": {
      "query": "python",
      "fields": ["title^3", "desc"]
    }
  }
}

query is the keyword to search for; fields lists the fields in which to search. A document is returned as soon as the keyword appears in any of them.
^3 sets a boost: a hit in title carries three times the weight of a hit in desc.

Specifying which fields to return

GET lagou/_search
{
  "stored_fields": ["title", "company_name"],
  "query": {
    "match": {
      "title": "python"
    }
  }
}

Sorting the results with sort

GET lagou/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [{
    "comments": {
      "order": "desc"
    }
  }]
}

sort is an array of dictionaries; the key is the field to sort on, and asc/desc mean ascending/descending.

Range queries: the range query

GET lagou/_search
{
  "query": {
    "range": {
      "comments": {
        "gte": 10,
        "lte": 20,
        "boost": 2.0
      }
    }
  }
}

range sits inside query; boost is a weight, and gte/lte mean greater-than-or-equal and less-than-or-equal.
For range queries over time, the dates are passed in as strings.
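A minimal sketch of a date range (the field name publish_date is illustrative; "now" is Elasticsearch date math for the current time):

GET lagou/_search
{
  "query": {
    "range": {
      "publish_date": {
        "gte": "2017-04-01",
        "lte": "now"
      }
    }
  }
}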

wildcard fuzzy query: wildcards such as * can be used in the search value.
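A minimal sketch (the value and boost are illustrative):

GET lagou/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "pyth*n",
        "boost": 2.0
      }
    }
  }
}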

Compound queries: the bool query

A bool query is composed of must, should, must_not and filter clauses.
The format is as follows:

bool: {
  "filter": [],
  "must": [],
  "should": [],
  "must_not": []
}

5. Saving the crawled data to Elasticsearch

class ElasticsearchPipeline(object):
    # the class name was missing in the source; ElasticsearchPipeline is a placeholder

    def process_item(self, item, spider):
        # convert the item into ES data and save it
        item.save_to_es()
        return item

elasticsearch-dsl-py

High level Python client for Elasticsearch

pip install elasticsearch-dsl
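The save_to_es() method below relies on an ArticleType document class defined with elasticsearch-dsl. A minimal sketch of what it could look like, assuming elasticsearch-dsl 5.x and the ik analyzer; the index name jobbole and the field list mirror the code below:

from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text, Completion
from elasticsearch_dsl.connections import connections

# register a default connection for elasticsearch-dsl
connections.create_connection(hosts=["localhost"])

class ArticleType(DocType):
    suggest = Completion()                 # search-as-you-type suggestions
    title = Text(analyzer="ik_max_word")
    create_date = Date()
    url = Keyword()
    front_image_url = Keyword()
    front_image_path = Keyword()
    praise_nums = Integer()
    comment_nums = Integer()
    fav_nums = Integer()
    tags = Text(analyzer="ik_max_word")
    content = Text(analyzer="ik_max_word")

    class Meta:
        index = "jobbole"                  # illustrative index name
        doc_type = "article"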

Saving the data to ES in items.py:

def save_to_es(self):
    article = ArticleType()
    article.title = self['title']
    article.create_date = self["create_date"]
    article.content = remove_tags(self["content"])
    article.front_image_url = self["front_image_url"]
    if "front_image_path" in self:
        article.front_image_path = self["front_image_path"]
    article.praise_nums = self["praise_nums"]
    article.fav_nums = self["fav_nums"]
    article.comment_nums = self["comment_nums"]
    article.url = self["url"]
    article.tags = self["tags"]
    article.meta.id = self["url_object_id"]

    article.suggest = gen_suggests(ArticleType._doc_type.index, ((article.title, 10), (article.tags, 7)))

    article.save()

    redis_cli.incr("jobbole_count")

    return

6. Combining Elasticsearch with Django to build the search engine

The Elasticsearch query for the search interface:

body = {
    "query": {
        "multi_match": {
            "query": key_words,
            "fields": ["tags", "title", "content"]
        }
    },
    "from": (page - 1) * 10,
    "size": 10,
    "highlight": {
        "pre_tags": ['<span class="keyWord">'],
        "post_tags": ['</span>'],
        "fields": {
            "title": {},
            "content": {},
        }
    }
}

Django then interacts with this query.
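A minimal sketch of the Django view around this body, assuming the low-level elasticsearch client and the jobbole index from above (view, parameter and template names are illustrative):

from django.shortcuts import render
from django.views.generic.base import View
from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=["127.0.0.1"])

class SearchView(View):
    def get(self, request):
        key_words = request.GET.get("q", "")
        page = int(request.GET.get("p", "1") or "1")
        body = {
            "query": {"multi_match": {"query": key_words,
                                      "fields": ["tags", "title", "content"]}},
            "from": (page - 1) * 10,
            "size": 10,
            "highlight": {"pre_tags": ['<span class="keyWord">'],
                          "post_tags": ['</span>'],
                          "fields": {"title": {}, "content": {}}},
        }
        response = client.search(index="jobbole", body=body)
        hit_list = []
        for hit in response["hits"]["hits"]:
            highlight = hit.get("highlight", {})
            hit_list.append({
                # prefer the highlighted fragment when one is returned
                "title": "".join(highlight.get("title", [hit["_source"]["title"]])),
                "url": hit["_source"]["url"],
            })
        return render(request, "result.html",
                      {"all_hits": hit_list, "key_words": key_words})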

(screenshot: search page)

(screenshot: results page)


Origin www.cnblogs.com/chinatrump/p/11584232.html