Crawling the Weibo hot-search list requires two libraries: lxml and requests.
Target URL: https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6
1. Analyze the page source: right-click the page and choose "View Page Source". From the HTML you can see:
(1) The hot-search topic names are all in the <a> child nodes of <td class="td-02">
(2) The hot-search ranks are all in <td class="td-01 ranktop"> (note that the sticky top item has no rank!)
(3) The hot-search view counts are all in the <span> child nodes of <td class="td-02">
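These three locations can be checked against a hand-written miniature of the page markup (an assumption for illustration; the live page is much larger and its structure may change):

```python
from lxml import etree

# A minimal, hand-written imitation of the hot-search table structure.
sample = """
<table>
  <tr><td class="td-02"><a>Sticky top item</a></td></tr>
  <tr>
    <td class="td-01 ranktop">1</td>
    <td class="td-02"><a>First topic</a><span>1234567</span></td>
  </tr>
</table>
"""

html = etree.HTML(sample)
print(html.xpath('//td[@class="td-02"]/a/text()'))        # topic names
print(html.xpath('//td[@class="td-01 ranktop"]/text()'))  # ranks (the sticky item has none)
print(html.xpath('//td[@class="td-02"]/span/text()'))     # view counts
```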
2. Fetch the page with requests
(1) First set the URL, then simulate a browser with a User-Agent header (this step is optional) to avoid being recognized as a crawler.
###URL
url = "https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"
###Simulate a browser; this request header works under Windows
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
(2) Use get() from the requests library and etree.HTML() from lxml to fetch and parse the page
###Get the HTML page
html = etree.HTML(requests.get(url, headers=header).text)
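As a more defensive variant (a sketch, not part of the original code), you can add a timeout and a status check so a blocked or failed request raises an exception instead of silently parsing an error page; fetch_html is a hypothetical helper name:

```python
import requests
from lxml import etree

def fetch_html(url, headers, timeout=10):
    """Hypothetical helper: fetch a page and parse it with lxml.

    The timeout and raise_for_status() call are additions beyond the
    original one-liner, so failures surface as exceptions.
    """
    resp = requests.get(url, headers=headers, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx responses
    return etree.HTML(resp.text)
```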
3. Construct the XPath expressions
The three XPath expressions for the locations identified in step 1 are:
affair = html.xpath('//td[@class="td-02"]/a/text()')
rank = html.xpath('//td[@class="td-01 ranktop"]/text()')
view = html.xpath('//td[@class="td-02"]/span/text()')
xpath() returns a list, so affair, rank, and view are all lists of strings.
4. Formatted output
Note that affair has one extra entry: the unranked sticky top item. Separate it out first.
top = affair[0]
affair = affair[1:]
This uses Python's slice syntax.
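A quick illustration of the slice, with hypothetical values standing in for the scraped list:

```python
# Hypothetical values standing in for the scraped list.
affair = ["sticky top item", "topic 1", "topic 2"]
top = affair[0]       # first element: the unranked sticky item
affair = affair[1:]   # everything from index 1 onward
print(top)
print(affair)
```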
print('{0:<10}\t{1:<40}'.format("top", top))
for i in range(0, len(affair)):
    print("{0:<10}\t{1:{3}<30}\t{2:{3}>20}".format(rank[i], affair[i], view[i], chr(12288)))
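Here chr(12288) is the full-width (ideographic) space U+3000; using it as the fill character in the nested format spec keeps columns of Chinese text aligned, since each CJK glyph occupies one full-width cell. A small check:

```python
pad = chr(12288)  # U+3000, the full-width space
# Left-align to a width of 6, padding with full-width spaces:
print(repr("{0:{1}<6}".format("热搜", pad)))
```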
5. Full code
###Import modules
import requests
from lxml import etree
###URL
url = "https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"
###Simulate a browser
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
###Main function
def main():
    ###Get the HTML page
    html = etree.HTML(requests.get(url, headers=header).text)
    rank = html.xpath('//td[@class="td-01 ranktop"]/text()')
    affair = html.xpath('//td[@class="td-02"]/a/text()')
    view = html.xpath('//td[@class="td-02"]/span/text()')
    ###The first entry of affair is the unranked sticky top item
    top = affair[0]
    affair = affair[1:]
    print('{0:<10}\t{1:<40}'.format("top", top))
    for i in range(0, len(affair)):
        print("{0:<10}\t{1:{3}<30}\t{2:{3}>20}".format(rank[i], affair[i], view[i], chr(12288)))

main()