Bilibili's 2020 weekly must-watch hot videos: a Python data analysis

1. Data capture

Obtaining the data set is the first step of any data analysis. The main ways to get data are: use a ready-made data set; write your own crawler; or use an existing crawler tool to scrape the required content and save it to a database or a local file. I used a crawler I wrote myself to collect this data. (If you want the crawler source code, ask in the comment section.)

Crawler design ideas

1. Determine the URL of the page that needs to be crawled.
2. Fetch the corresponding HTML page over HTTP/HTTPS.
3. Extract the useful data from the HTML page:
   a. if it is the data we need, save it;
   b. if it is another URL on the page, go back to step 2.

Basic crawler process

1. Initiate a request: send a Request to the target site through an HTTP library; the request can carry additional headers and other information, then wait for the server's response.
2. Get the response content: if the server responds normally you get a Response whose content is the page you asked for; it may be HTML, a JSON string, binary data (such as images or video), or another type.
3. Parse the content: HTML can be parsed with regular expressions or a web page parsing library; JSON can be converted directly into a JSON object for analysis; binary data can be saved or processed further.
4. Save the data: it can be stored as text, written to a database, or saved to a file in a specific format.
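A minimal sketch of this request-parse-save loop with requests and BeautifulSoup (the URL and the h2 selector are placeholders, not the actual hot-list page):

import requests
from bs4 import BeautifulSoup

# 1. Send a Request to the target site, optionally with extra headers
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com/page', headers=headers)

# 2. If the server responds normally, the Response holds the page content
html = response.text

# 3. Parse the HTML with a web page parsing library
soup = BeautifulSoup(html, 'html.parser')
titles = [tag.text for tag in soup.find_all('h2')]  # placeholder selector

# 4. Save the data, e.g. as plain text
with open('titles.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))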

Anti-crawler mechanisms and countermeasures

Websites commonly implement anti-crawling in three ways:

1. Analyzing the headers of user requests; this is the most widely used mechanism.
2. Verifying user behavior, for example checking whether the same IP visits the site unusually often within a short period.
3. Serving dynamic pages to raise the difficulty of crawling.

The corresponding countermeasures (sketched below):

1. Construct the request headers in the crawler to disguise it as a browser.
2. Use proxy servers and switch between them frequently; this generally overcomes access restrictions.
3. Use tools such as selenium + PhantomJS, which can cope with user-agent checks, proxies, captchas, dynamically loaded data, and encrypted data.
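A sketch of the first two countermeasures with the requests library (the header values and the proxy address are illustrative placeholders, not working endpoints):

import requests

# Countermeasure 1: disguise the crawler as a browser via request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://www.bilibili.com/',
}

# Countermeasure 2: route traffic through a proxy and rotate it regularly
proxies = {
    'http': 'http://127.0.0.1:8080',   # placeholder proxy address
    'https': 'http://127.0.0.1:8080',
}

response = requests.get('https://www.bilibili.com/',
                        headers=headers, proxies=proxies, timeout=10)
print(response.status_code)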

Data selection and processing

The data that can be captured:

1. Web page text, such as HTML documents or JSON-formatted text.
2. Images, which arrive as binary files and are saved in an image format.
3. Video, likewise binary data, saved in a video format.
4. Anything else that can be requested.

Ways to parse it (a quick illustration follows):

1. Direct processing
2. JSON parsing
3. Regular expressions
4. BeautifulSoup
5. PyQuery
6. XPath
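A quick illustration of several of these parsing options on toy strings (the HTML and JSON here are made up for demonstration):

import json
import re
from bs4 import BeautifulSoup

html = '<div class="ops"><span>1.2万</span></div>'  # toy HTML snippet
text = '{"play": 12000}'                            # toy JSON string

# 2. JSON parsing: convert directly to a Python object
print(json.loads(text)['play'])                              # 12000

# 3. Regular expressions
print(re.search(r'<span>(.*?)</span>', html).group(1))       # 1.2万

# 4. BeautifulSoup
print(BeautifulSoup(html, 'html.parser').find('span').text)  # 1.2万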

2. Data cleaning

Once the data is obtained, it needs to be cleaned to pave the way for the analysis; if the cleaning is not thorough, it will inevitably affect the results. Below we unify the data formats and handle null values.

Unified format

Remove whitespace from the data: process the scraped strings with strip() as they are crawled. Convert Chinese numeric units into Arabic numerals, e.g. 1.2万 becomes 12000. The code is as follows:

def get_int(s):
    # Convert a string like "1.2万" into an integer (万 = 10,000)
    if s[-1] == "万":
        s = s[0:-1]
        s = int(float(s) * 10000)
    else:
        s = int(s)
    return s

Running it gives the following result:

if __name__ == '__main__':
    s = "1.2万"
    price = get_int(s)
    print(price)  # 12000

Null value handling

When crawling, if a value does not exist, the code will raise an error. A try/except block (the try body is the code that scrapes the video information) is used to skip videos whose data is missing:

try:
    html = requests.get(Link).text
    doc = BeautifulSoup(html, 'html.parser')
    List = doc.find('div', {'class': 'ops'}).findAll('span')
    like = List[0].text.strip()          # likes
    like = self.getint(like)
    coin = List[1].text.strip()          # coins
    coin = self.getint(coin)
    collection = List[2].text.strip()    # favorites
    collection = self.getint(collection)
    print('Like', like)
    print('Coin', coin)
    print('Collection', collection)

    # Combine the data into a dictionary
    data = {
        'Title': Title,
        'link': Link,
        'Up': Up,
        'Play': Play,
        'Like': like,
        'Coin': coin,
        'Collection': collection,
    }

    # Save to a csv file
    self.write_dictionary_to_csv(data, 'blibli2.csv')
except:
    pass
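The write_dictionary_to_csv helper is not shown in the post; a minimal sketch of such a method (assuming it lives on the same crawler class) might use csv.DictWriter and append one row per video:

import csv
import os

def write_dictionary_to_csv(self, data, filename):
    # Append one row per video; write the header only when the file is new
    file_exists = os.path.isfile(filename)
    with open(filename, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=list(data.keys()))
        if not file_exists:
            writer.writeheader()
        writer.writerow(data)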

3. Data analysis and visualization

The parameters (columns) of the resulting table are shown in the figure:

[Figure: table column parameters]
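The post does not show how the CSV is loaded; presumably the analysis begins with something like the following, assuming the column names saved in the cleaning step:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('blibli2.csv')
print(data.columns.tolist())  # expected: Title, link, Up, Play, Like, Coin, Collection
print(data.head())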

Analyzing play counts

Analyze the play counts of Bilibili's popular videos in 2020, dividing them into four tiers: 10 million and above; 5 to 10 million; 1 to 5 million; and below 1 million.

l1=len(data[data['Play'] >= 10000000])
l2=len(data[(data['Play'] < 10000000) & (data['Play'] >=5000000)])
l3=len(data[(data['Play'] < 5000000) & (data['Play'] >=1000000)])
l4=len(data[data['Play'] < 1000000])

The data is then visualized with matplotlib to produce the chart below:

plt.figure(figsize=(9, 13))  # adjust the figure size
labels = ['over 10 million', '5 to 10 million',
          '1 to 5 million', 'under 1 million']  # tier labels
sizes = [l1, l2, l3, l4]                        # value of each tier
colors = ['green', 'yellow', 'blue', 'red']
explode = (0, 0, 0, 0)  # offset of each wedge; larger values widen the gap
# handle Chinese characters and the minus sign on the axes
plt.rcParams['font.sans-serif'] = ['KaiTi']
plt.rcParams['axes.unicode_minus'] = False
patches, text1, text2 = plt.pie(sizes,
                                explode=explode,
                                labels=labels,
                                colors=colors,
                                autopct='%3.2f%%',  # fixed number of decimal places
                                shadow=False,       # no shadow
                                startangle=90,      # counterclockwise start angle
                                pctdistance=0.6)    # distance of value labels from the center
# patches: the wedges; text1: labels outside the pie; text2: labels inside the pie
plt.axis('equal')  # equal x/y scales so the pie is a circle
plt.title("Distribution of Popular Views at Station B")
plt.legend()  # show the legend
plt.show()

[Figure: pie chart of the play-count distribution]

As the chart shows, most of the weekly popular videos on Bilibili fall in the 1-to-5-million tier; videos with fewer than 1 million views rarely make the weekly list, and only a handful of videos passed 10 million views in the whole year. Next, look at the top 10 videos by play count.

d = data.nlargest(10, columns='Play')

[Figure: the 10 most-played videos]

The data is then visualized with matplotlib to produce the chart below:

d.plot.bar(figsize=(10, 8), x='Title', y='Play', title='Play top 10')
plt.xticks(rotation=60)  # rotate the x labels by 60 degrees
plt.show()

[Figure: bar chart of the top 10 play counts]

The chart shows that the Bilibili New Year Festival is the most popular video, with a play count far above the others, indicating that Bilibili's 2020 New Year Festival program was quite a success.

Analyzing the authors

By counting how often each author's works appear on the popular list, we can determine the most popular author of 2020. Group the data by author and count the occurrences:

d2=data.loc[:,'Up'].value_counts()
d2=d2.head(10)

The data is then visualized with matplotlib to produce the chart below:

d2.plot.bar(figsize = (10,8),title='UP top 10')
plt.show()

[Figure: bar chart of the top 10 UPs by weekly appearances]

The author who appeared on Bilibili's weekly popular list most often is Liangfeng Kaze, with 48 appearances across the 52 weeks of the year; his videos showed up almost every week. By this measure, the most popular author of 2020 is Liangfeng Kaze.

Analyzing video parameters

Analyze the average like, coin, and favorite ratios of the popular videos:

data['Like Ratio'] = data['Like'] / data['Play']
data['Coin Ratio'] = data['Coin'] / data['Play']
data['Collection Ratio'] = data['Collection'] / data['Play']
d3 = data.iloc[:, 8:11]
d3 = d3.mean()
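Note that iloc[:, 8:11] selects the three new columns by position, which depends on the exact column order; selecting by name is a more robust equivalent:

d3 = data[['Like Ratio', 'Coin Ratio', 'Collection Ratio']].mean()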

[Output: mean values of the three ratios]

The data is then visualized with matplotlib to produce the chart below:

d3.plot.bar(figsize=(10, 8), title='Average interaction ratios')
plt.show()

[Figure: bar chart of the average like, coin, and favorite ratios]

The like ratio is the highest, at about 9% in 2020: roughly one in ten people who watch a video on Bilibili likes it, while on average only about one in twenty gives it a coin.

Analyzing the titles

Extract the high-frequency words from the titles to see which kinds of titles are more popular. First, concatenate all the titles into a single string s:

d4=data['Title']
s=''
for i in d4:
    s=s+i

Then visualize it with a word cloud
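The word cloud code itself is not included in the post; a minimal sketch with the jieba and wordcloud packages might look like this (the font path is an assumption, needed so Chinese characters render):

import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

words = ' '.join(jieba.cut(s))  # segment the concatenated titles into words
wc = WordCloud(font_path='simhei.ttf',  # assumed path to a Chinese font
               background_color='white',
               width=800, height=600).generate(words)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()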

Titles containing creator names such as Zhu Yi, Ban Fo, and Luo Xiang, or game names such as League of Legends and Yuanshen (Genshin Impact), appear more often among the popular videos.



Source: blog.csdn.net/weixin_43881394/article/details/112307724