Basic Python crawler routines

What is a crawler?
Also known as a web spider or web crawler. If the Internet is pictured as a spider's web, a crawler is the spider that crawls across it collecting data. A crawler requests a URL and then handles the response according to its content:
if the response is HTML, it analyzes the DOM structure and extracts data by DOM parsing or regular-expression matching; if the response is XML/JSON data, it converts the data to objects and then parses those.
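The branching described above can be sketched in a few lines. This is a minimal illustration only; `parse_response` and the sample bodies are hypothetical, and a real crawler would use a proper DOM parser for the HTML branch:

```python
import json
import re

def parse_response(content_type, body):
    """Dispatch parsing on the response content type (a minimal sketch)."""
    if 'json' in content_type:
        # structured data: convert straight to Python objects
        return json.loads(body)
    if 'html' in content_type:
        # a real crawler would build a DOM; here a regex stands in for structure analysis
        return re.findall(r'<a href="([^"]+)"', body)
    return body

links = parse_response('text/html', '<a href="https://example.com/a">a</a>')
data = parse_response('application/json', '{"ok": true}')
```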

What is it good for?
Collecting data in bulk with an effective crawler reduces labor costs and increases the amount of usable data, which in turn supports operations and sales decisions and speeds up product development.

The state of the industry
Internet products compete fiercely today, and most companies in the industry use crawler technology for mining product and competitor data, bulk collection, and big-data analysis. It has become an essential tool, and many companies have set up dedicated crawler-engineer positions.

Legality
A crawler uses a program to fetch publicly displayed page information in bulk, i.e. the data rendered on the front end. Because that information is completely public, collecting it is legal. A crawler is in fact much like a browser: the browser parses the response and renders it as a page, while the crawler parses the response and stores the data it wants.

Anti-crawler measures
Crawlers are difficult to stop completely; it is an ongoing arms race between site operators and crawler authors.
Some anti-crawler means:

Validity checks: verifying the request (User-Agent, Referer, signed interface parameters, etc.)
Blacklisting: limiting the request frequency of an IP/user, or blocking it outright
Poisoning: a high-level counter to crawlers that cannot be blocked; blocking is only temporary, while returning false data can mislead a competitor's decisions
... ...
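From the crawler's side, the frequency limits above are usually answered by throttling one's own requests. A minimal sliding-window limiter as a sketch (the class and its names are illustrative, not from the original):

```python
import time

class RateLimiter:
    """Allow at most max_calls requests per period seconds (sliding window)."""

    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self.calls = []

    def wait(self):
        now = time.monotonic()
        # keep only timestamps inside the current window
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # sleep until the oldest call leaves the window
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RateLimiter(max_calls=2, period=0.2)
start = time.monotonic()
for _ in range(3):
    limiter.wait()   # the third call has to wait out the window
elapsed = time.monotonic() - start
```

Call `limiter.wait()` before each HTTP request to stay under a site's rate limit.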
Basic crawler routines
Basic flow
target data
source address
structure analysis
implementation idea
hands-on coding
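The steps above can be mapped onto a tiny skeleton. The URL and the `fetch` stand-in are hypothetical; a real crawler would issue an actual HTTP request:

```python
import re

def fetch(url):
    # stand-in for a real HTTP request (e.g. via urllib or requests)
    return '<html><h1>Hello</h1></html>'

def crawl(url):
    html = fetch(url)                        # source address -> response
    mt = re.search(r'<h1>(.*?)</h1>', html)  # structure analysis: title sits in <h1>
    return mt.group(1) if mt else None       # target data

title = crawl('https://example.com/article/1')
```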
Basic means
Breaking request limits
set request headers, e.g. a valid client User-Agent
control the request frequency (according to the actual situation)
use IP proxies
obtain signature/encryption parameters by analyzing the html / cookies / js
Cracking login authorization
send requests carrying the user's cookie information
Cracking captchas
simple image captchas can be recognized with a third-party OCR library
Analyzing the data
HTML DOM parsing
regular matching: use regular expressions to match the data you want to crawl, e.g. data that is not inside HTML tags but in a js variable within a script tag
third-party libraries: parse the html dom with a library, preferably a jquery-like one
data strings
regular matching (depending on the use scenario)
convert JSON / XML to objects and then parse them
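The "data in a js variable" case mentioned above is common in practice. A sketch of matching such a variable and converting it to json (the page snippet below is fabricated for illustration):

```python
import json
import re

# fabricated page: the data sits in a js variable, not in HTML tags
html = '''<html><body>
<script type="text/javascript">
var rankListData = {"dayListData": [{"id": 1, "nickname": "demo"}]};
</script>
</body></html>'''

# match the assignment and capture the JSON literal up to the semicolon
mt = re.search(r'rankListData\s*=\s*(.*?);', html, re.S)
data = json.loads(mt.group(1))
nickname = data['dayListData'][0]['nickname']
```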
Python crawlers
Advantages of writing crawlers in python
python syntax is easy to learn and easy to use
an active community, with many implementations to reference
a rich variety of feature packages
a small amount of code accomplishes powerful functionality
Modules and packages involved
requesting:
urllib
urllib2 (Python 2; merged into urllib.request in Python 3)
cookielib (Python 2; renamed http.cookiejar in Python 3)
multi-threading:
threading
regular expressions:
re
json parsing:
json
HTML dom parsing:
pyquery
Beautiful Soup
browser automation:
selenium
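If the third-party DOM libraries above are unavailable, the standard library's html.parser can handle simple extraction. A minimal sketch (the collector class is illustrative, not from the original):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

parser = LinkCollector()
parser.feed('<div><a href="/room/1">r1</a><a href="/room/2">r2</a></div>')
```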
Example analysis
Douyu anchor leaderboard

Target data
collect the anchor information on the leaderboard
Source address
[leaderboard address]
https://www.douyu.com/directory/rank_list/game
[anchor room address]
https://www.douyu.com/xxx
xxx = room number
Structure analysis
capture packets for the [leaderboard address] and [anchor room address] (Chrome devtools Network panel / charles / fiddler)
obtain the leaderboard data interface: https://www.douyu.com/directory/rank_list/game
verify the parameters (remove unnecessary parameters)
verify the cookies (remove unnecessary cookies)
simulate the request (charles / fiddler / postman)
obtain the anchor room information data
found that $ROOM holds the anchor's room information, as a js variable inside a script tag of the page, which can be matched with a regular expression
Implementation idea
request the [leaderboard interface] to get the [leaderboard data]; the data contains each anchor's room number, from which the [anchor room address] can be concatenated; request the [anchor room address] to get the [$ROOM information]; parse it to obtain the anchor's room information
Hands-on coding
Disclaimer: this example is a crawler-learning DEMO only, for no other use

A basic python crawler demo for learning

import re
import json
import requests as rq
from enum import Enum

class ERankName(Enum):
    # leaderboard types; only the member names are used in the request
    anchor = 'anchor'    # superstar anchor ranking
    fans = 'fans'        # anchor fan ranking
    haoyou = 'haoyou'    # wealthy-patron strength ranking
    user = 'user'        # anchor patron ranking

class EStatType(Enum):
    day = 'day'
    week = 'week'
    month = 'month'

def douyu_rank(rankName, statType):
    '''
    Fetch the Douyu top-anchor leaderboard.
    Data address: https://www.douyu.com/directory/rank_list/game

    * `rankName` anchor (superstar ranking), fans (fan ranking), haoyou (wealth ranking), user (patron ranking)
    * `statType` day / week / month
    '''
    if not isinstance(rankName, ERankName):
        raise Exception("rankName must be an ERankName enum member")
    if not isinstance(statType, EStatType):
        raise Exception("statType must be an EStatType enum member")

    rankName = '%sListData' % rankName.name
    statType = '%sListData' % statType.name
    # request the html source
    rs = rq.get(
        "https://www.douyu.com/directory/rank_list/game",
        headers={'User-Agent': 'Mozilla/5.0'})
    # regex-parse out the data
    mt = re.search(r'rankListData\s*=\s*(.*?);', rs.text, re.S)
    if not mt:
        print("unable to parse the rankListData data")
        return
    grps = mt.groups()
    # convert the data to json
    rankListDataStr = grps[0]
    rankListData = json.loads(rankListDataStr)
    dayList = rankListData[rankName][statType]
    # adjust the ordering
    dayList.sort(key=lambda k: k.get('id', 0), reverse=False)
    return dayList

def douyu_room(room_id):
    '''
    Fetch and parse an anchor's room information.
    Data address: https://www.douyu.com/%s

    * `room_id` the anchor's room number
    '''
    rs = rq.get(
        ("https://www.douyu.com/%s" % room_id),
        headers={'User-Agent': 'Mozilla/5.0'})
    # $ROOM is a js variable in the page; capture its JSON literal
    mt = re.search(r'\$ROOM\s*=\s*({.*?});', rs.text, re.S)
    if not mt:
        print("unable to parse the ROOM data")
        return
    grps = mt.groups()
    roomDataStr = grps[0]
    roomData = json.loads(roomDataStr)
    return roomData

def run():
    '''
    Test the crawler
    '''
    datas = douyu_rank(ERankName.anchor, EStatType.month)
    print('Anchor leaderboard:')
    for item in datas:
        room_id = item['room_id']
        roomData = douyu_room(room_id)
        roomName = None
        if roomData is not None:
            roomName = roomData['room_name']
        roomInfo = 'room (%s): %s' % (item['room_id'], roomName)
        print(item['id'], item['nickname'], roomInfo, '[' + item['catagory'] + ']')

run()

Origin blog.csdn.net/weichen090909/article/details/95244830