Crawling the Bilibili top video ranking


I. Thematic web crawler design scheme

1. Name of the crawler: crawl the Bilibili top video ranking

2. Content to crawl: the composite score of every video on the ranking (the data is updated daily), along with each video's title, play count, danmaku (bullet comment) count, and author

3. Overview of the design: find the target URL, analyze the page source code, locate the required data, then extract, organize, and visualize it

II. Analysis of the structure and characteristics of the theme page

Find the data we need in the page source and locate it:

The content we need is located in the 'a' tags with class "title" (titles), the 'span' tags with class "data-box" (play count, danmaku count, author), and the 'div' tags with class "pts" (composite score).
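For example, once the page has been parsed with BeautifulSoup, these three kinds of tags can be located as follows (a minimal sketch; soup is assumed to be the already-parsed page):

# locate the three kinds of tags identified above
titles = soup.find_all('a', class_='title')       # video titles
stats = soup.find_all('span', class_='data-box')  # play count, danmaku count, author
points = soup.find_all('div', class_='pts')       # composite scores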

III. Web crawler programming

1. Data crawling and collection

First, fetch the raw web page:
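The fetching function, as it appears in the complete program in section IV (the User-Agent header disguises the crawler as an ordinary browser):

import requests

def get_url(url):  # fetch the web page content
    # disguise the crawler as an ordinary browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3706.400 SLBrowser/10.0.4040.400'}
    try:
        f = requests.get(url, headers=headers)
        f.raise_for_status()              # raise an exception unless the status code is 200
        f.encoding = f.apparent_encoding  # guess the correct text encoding
        return f.text
    except:
        print('An exception occurred')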

Test the crawled content:
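A quick way to test the fetch is to print the beginning of the returned page (a minimal sketch; the original screenshot of this step did not survive):

text = get_url('https://www.bilibili.com/ranking?')
print(text[:500])  # show the start of the page to confirm the fetch worked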

Then parse the content:
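Parsing uses BeautifulSoup with Python's built-in html.parser, as in the complete program:

import bs4

def bs(text):  # parse the fetched page
    soup = bs4.BeautifulSoup(text, 'html.parser')
    return soup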

Test the parsed content:
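The complete program tests this step by dumping the parsed text to a file for inspection:

soup = bs(text)
with open('try2.txt', 'w', encoding='utf-8') as f:
    f.write(soup.text)  # open the file to confirm the parse looks right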

Referring back to the earlier analysis of the page source code, try printing the data we need:
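For example, printing the video titles found in the 'a' tags with class "title" (a sketch based on the extraction logic in section IV):

for i in soup.find_all('a', class_='title'):
    print(i.text)  # print each video title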

Once that works without problems, package the steps just now into functions:
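The title extraction packaged as a function, as in the complete program:

def cont_ent(soup):  # extract the top 100 titles
    m = soup.find_all('a', class_='title')
    n = []
    for i in m:
        n.append(i.text)
    return n  # collected into a list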

The other data to be crawled is handled the same way:
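The play count / danmaku / author spans and the composite-score divs follow the same pattern (from section IV; the regex allows up to eight digits because scores never exceed ten million):

import re

def cont_ent2(soup):  # extract play count, danmaku count, author
    x = soup.find_all('span', class_='data-box')
    n = []
    for i in x:
        n.append(i.text)
    return n

def cont_ent3(soup):  # extract the composite score
    k = soup.find_all('div', class_='pts')
    n = []
    for i in k:
        c = re.search(r'\d?\d?\d?\d?\d?\d?\d?\d?', i.text)  # up to eight digits
        n.append(int(c.group()))
    return n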

After most of the extraction is finished, break the interleaved data apart:
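The data-box spans come back interleaved as play count, danmaku count, author, repeating once per video, so every third element belongs to each list (from section IV):

def bofang(cone):  # play counts: elements 0, 3, 6, ...
    m = 3
    n = []
    for i in cone:
        if m % 3 == 0:
            n.append(i)
        m += 1
    return n

# danmu() and zuozhe() are identical except that m starts at 2 and 1,
# selecting elements 1, 4, 7, ... (danmaku) and 2, 5, 8, ... (author)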

After all the data is organized, export it to an Excel file for easy viewing:
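The export builds a pandas DataFrame from the lists produced in the previous steps and writes it with to_excel (from section IV; to_excel requires an Excel writer such as openpyxl to be installed):

import pandas as pd

df = pd.DataFrame({'rank': range(1, 101), 'title': title,
                   'play count': bfl, 'danmaku count': dm,
                   'author': zz, 'composite score': score})
df.to_excel('Top 100 popular videos on Bilibili.xlsx')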

With the data finished, move on to data visualization.

Draw a bar chart and a scatter plot:
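Because 100 entries are too many to plot cleanly, only the top 20 composite scores are charted (from section IV; SimHei is set so Chinese labels render correctly):

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese characters correctly
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly

plt.bar(range(1, 21), score[:20])  # bar chart of the top 20 scores
plt.xlabel('rank')
plt.ylabel('composite score')
plt.show()

plt.scatter(range(1, 21), score[:20])  # scatter plot of the same data
plt.xlabel('rank')
plt.ylabel('composite score')
plt.show()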

(Output: a bar chart and a scatter plot of the top 20 composite scores.)

Establish a regression between the variables:
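The regression uses seaborn's lmplot on the exported spreadsheet (from section IV; the column names must match those written by to_excel):

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

dataf = pd.read_excel('Top 100 popular videos on Bilibili.xlsx')
sns.lmplot(x='rank', y='composite score', data=dataf)  # linear fit of score against rank
plt.show()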

(Output: a regression plot of composite score against rank.)

IV. Complete program code


# Import required modules
import requests
import bs4
import re
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese characters correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly


def get_url(url):  # fetch the web page content
    # disguise the crawler as an ordinary browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3706.400 SLBrowser/10.0.4040.400'}
    try:  # try to fetch the page
        f = requests.get(url, headers=headers)
        f.raise_for_status()              # raise an exception unless the status code is 200
        f.encoding = f.apparent_encoding  # set the encoding
        return f.text                     # return the fetched page
    except:  # the status code was not 200
        print('An exception occurred')


def bs(text):  # parse the fetched page
    soup = bs4.BeautifulSoup(text, 'html.parser')
    return soup


def cont_ent(soup):  # extract the top 100 titles
    m = soup.find_all('a', class_="title")
    n = []
    for i in m:
        n.append(i.text)
    return n  # collected into a list


def cont_ent2(soup):  # extract play count, danmaku count, author
    x = soup.find_all('span', class_="data-box")
    n = []
    for i in x:
        n.append(i.text)
    return n  # collected into a list


def cont_ent3(soup):  # extract the composite score
    k = soup.find_all('div', class_="pts")
    n = []
    for i in k:
        # scores never exceed ten million, so numbers of up to eight digits suffice
        c = re.search(r'\d?\d?\d?\d?\d?\d?\d?\d?', i.text)
        n.append(int(c.group()))
    return n  # collected into a list


# the extracted play count, danmaku count, and author are interleaved; split them apart
def bofang(cone):  # play count
    m = 3
    n = []
    for i in cone:
        if m % 3 == 0:
            n.append(i)
        m += 1
    return n


def danmu(cone):  # danmaku count
    m = 2
    n = []
    for i in cone:
        if m % 3 == 0:
            n.append(i)
        m += 1
    return n


def zuozhe(cone):  # author
    m = 1
    n = []
    for i in cone:
        if m % 3 == 0:
            n.append(i)
        m += 1
    return n


def main():
    # Bilibili hot video ranking link
    url = 'https://www.bilibili.com/ranking?'
    some = get_url(url)  # fetch the page
    soup = bs(some)      # parse the page
    '''
    with open('try2.txt', 'w', encoding='utf-8') as f:
        f.write(soup.text)  # test for errors while writing the code
    '''
    # data processing
    title = cont_ent(soup)   # extract the titles
    cone = cont_ent2(soup)   # extract play count, danmaku count, author
    score = cont_ent3(soup)  # extract the composite score
    '''
    with open('title.txt', 'w', encoding='utf-8') as f:
        for i in title:
            f.write(str(title.index(i) + 1) + '.')
            f.write(i)
            f.write('\n')
    with open('cone.txt', 'w', encoding='utf-8') as f:
        for i in cone:
            f.write(i)
            f.write('\n')
    with open('score.txt', 'w', encoding='utf-8') as f:
        for i in score:
            f.write(str(i))
            f.write('\n')
    '''
    # export the data to check for errors
    bfl = bofang(cone)
    dm = danmu(cone)
    zz = zuozhe(cone)
    df = pd.DataFrame({'rank': range(1, 101), 'title': title,
                       'play count': bfl, 'danmaku count': dm,
                       'author': zz, 'composite score': score})
    df.to_excel('Top 100 popular videos on Bilibili.xlsx')
    # too much data to plot everything, so chart only the top entries
    # bar chart
    plt.bar(range(1, 21), score[:20])
    plt.xlabel('rank')
    plt.ylabel('composite score')
    plt.title('Composite scores of the top videos (bar chart)')
    plt.show()
    # scatter plot
    plt.scatter(range(1, 21), score[:20])
    plt.xlabel('rank')
    plt.ylabel('composite score')
    plt.title('Composite scores of the top videos (scatter plot)')
    plt.show()
    # regression analysis
    file_path = "Top 100 popular videos on Bilibili.xlsx"
    dataf = pd.read_excel(file_path)
    sns.lmplot(x='rank', y='composite score', data=dataf)
    plt.show()


main()


V. Conclusion

1. The ranking is based on the composite score; danmaku and play counts have some influence, but an attractive title and the quality of the video are the deciding factors.

2. This project consolidated knowledge learned earlier; it is best to review it every once in a while. Web crawling is very interesting.
