How to get Weibo hot searches with Python

Crawling the Weibo hot search list requires two libraries: lxml and requests.

url=https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6

1. Analyze the page source: right-click the page and view its source code. The source shows the following:

(1) The names of the hot searches are all in the <a> child nodes of <td class="td-02">

(2) The rankings of the hot searches are all in <td class="td-01 ranktop"> (note that the pinned top entry is not ranked!)

(3) The view counts of the hot searches are all in the <span> child nodes of <td class="td-02">

2. Use requests to get the webpage

(1) First set the URL, then set a User-Agent request header that simulates a browser. This step is optional, but it helps prevent the request from being recognized as a crawler.

###URL

url="https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"

###Simulate browser, this request header can be used under Windows

header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

(2) Use get() from the requests library and etree.HTML() from lxml to download and parse the page

###Get html page

html=etree.HTML(requests.get(url,headers=header).text)
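The one-line fetch above can be made a little more defensive. The following is only a sketch: the function name fetch_tree and the 10-second timeout are assumptions for illustration and are not part of the original code.

```python
import requests
from lxml import etree

URL = "https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/73.0.3683.103 Safari/537.36"
}


def fetch_tree(url=URL, headers=HEADERS, timeout=10):
    """Download the page and return it as a parsed lxml tree.

    fetch_tree and the timeout value are hypothetical choices for this sketch.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page.
    response.raise_for_status()
    return etree.HTML(response.text)
```

Without a timeout, requests.get() can wait indefinitely on an unresponsive server, so passing one is usually worthwhile in a crawler.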

3. Construct the XPath expressions

The XPath expressions for the three items identified in step 1 are:

affair=html.xpath('//td[@class="td-02"]/a/text()')

rank=html.xpath('//td[@class="td-01 ranktop"]/text()')

view=html.xpath('//td[@class="td-02"]/span/text()')

xpath() returns a list, so affair, rank, and view are all lists of strings
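To see what the three expressions return without hitting the network, they can be run against a small hand-written fragment. The SAMPLE markup below is a made-up stand-in that only mimics the structure described in step 1 (including an unranked first row); the real page contains more markup and may differ.

```python
from lxml import etree

# Hypothetical fragment, NOT copied from the real page.
SAMPLE = """
<table>
  <tr>
    <td class="td-01"></td>
    <td class="td-02"><a href="#">pinned topic</a></td>
  </tr>
  <tr>
    <td class="td-01 ranktop">1</td>
    <td class="td-02"><a href="#">first topic</a><span>1200000</span></td>
  </tr>
  <tr>
    <td class="td-01 ranktop">2</td>
    <td class="td-02"><a href="#">second topic</a><span>950000</span></td>
  </tr>
</table>
"""

html = etree.HTML(SAMPLE)

# @class="..." compares the whole attribute string, so "td-01" alone does not
# match "td-01 ranktop" and the unranked row contributes nothing to rank.
affair = html.xpath('//td[@class="td-02"]/a/text()')
rank = html.xpath('//td[@class="td-01 ranktop"]/text()')
view = html.xpath('//td[@class="td-02"]/span/text()')

print(affair)  # ['pinned topic', 'first topic', 'second topic']
print(rank)    # ['1', '2']
print(view)    # ['1200000', '950000']
```

In this fragment affair ends up one element longer than rank and view, which is the mismatch that the next step handles.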

4. Formatted output

Note that affair contains one extra entry: the unranked top item mentioned in step 1. Separate it from the list first.

top=affair[0]

affair=affair[1:]

Python slicing is used here.

print('{0:<10}\t{1:<40}'.format("top",top))

for i in range(0, len(affair)):

    print("{0:<10}\t{1:{3}<30}\t{2:{3}>20}".format(rank[i],affair[i],view[i],chr(12288)))
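The format strings above pad the topic and view columns with chr(12288), which is U+3000, the full-width (ideographic) space, so that columns of Chinese text line up better than with ASCII spaces. As a rough sketch, the same output can be built with zip(), which stops at the shortest list instead of raising IndexError if the three lists ever differ in length. The helper name format_rows is an assumption for illustration.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, used as the fill character


def format_rows(affair, rank, view):
    """Return the output lines for the pinned entry and the ranked entries.

    format_rows is a hypothetical helper; it uses the same format strings as
    the loop above but pairs the lists with zip() instead of indexing.
    """
    top, rest = affair[0], affair[1:]
    lines = ["{0:<10}\t{1:<40}".format("top", top)]
    for r, a, v in zip(rank, rest, view):
        lines.append(
            "{0:<10}\t{1:{3}<30}\t{2:{3}>20}".format(r, a, v, FULLWIDTH_SPACE)
        )
    return lines


for line in format_rows(["pinned", "topic A", "topic B"], ["1", "2"], ["100", "90"]):
    print(line)
```

Returning the lines rather than printing inside the helper keeps the formatting easy to check separately from the crawl.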

5. Complete code

###Import modules

import requests

from lxml import etree

###URL

url="https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"

###Simulate browser

header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

###Main function

def main():

    ###Get html page

    html=etree.HTML(requests.get(url,headers=header).text)

    rank=html.xpath('//td[@class="td-01 ranktop"]/text()')

    affair=html.xpath('//td[@class="td-02"]/a/text()')

    view=html.xpath('//td[@class="td-02"]/span/text()')

    ###Separate the unranked top entry from the ranked ones

    top=affair[0]

    affair=affair[1:]

    print('{0:<10}\t{1:<40}'.format("top",top))

    for i in range(0, len(affair)):

        print("{0:<10}\t{1:{3}<30}\t{2:{3}>20}".format(rank[i],affair[i],view[i],chr(12288)))

main()

Origin blog.csdn.net/tianqiIP/article/details/114132042