1. Crawl the Kugou (酷狗) music "Soaring Chart" and build a table of each song's singer, title, and duration.
Idea: crawl and parse the page, then write the results into an Excel table.
Technical difficulty: parsing the page source code.
2. (1) URL: https://www.kugou.com/yy/rank/home/1-6666.html?from=rank
(2) Find the tags in the page source that hold the singer, song title, and duration, and traverse them with find_all.
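Before writing the full crawler, the find_all pattern can be tried on a small snippet. This is a sketch: the class names pc_temp_num / pc_temp_songname / pc_temp_time are the ones used on the real chart page, but the HTML below is made up to mimic its structure.

```python
from bs4 import BeautifulSoup

# Toy HTML that mimics the structure of one chart entry (not the real page).
html = '''
<div>
  <span class="pc_temp_num">1</span>
  <a class="pc_temp_songname">Singer - Song</a>
  <span class="pc_temp_time">3:46</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
ranks = soup.find_all('span', class_='pc_temp_num')    # ranking numbers
names = soup.find_all('a', class_='pc_temp_songname')  # singer and song title
times = soup.find_all('span', class_='pc_temp_time')   # play time
for r, n, t in zip(ranks, names, times):
    print(r.get_text(), n.get_text(), t.get_text())
```

zip pairs up the three result lists entry by entry, which is the same traversal the crawler below relies on.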
3. Code:

from bs4 import BeautifulSoup
import requests
import time
import xlwt  # used to create the Excel file that stores the data


class Spider:
    def __init__(self):
        self.workbook, self.worksheet = self.create_excel()
        self.nums = 1  # next worksheet row to write (row 0 holds the header)

    def create_excel(self):
        workbook = xlwt.Workbook(encoding='utf-8')
        worksheet = workbook.add_sheet('Sheet1')
        title = ['Ranking', 'Singer and Song Title', 'Playing Time']
        for index, title_data in enumerate(title):
            worksheet.write(0, index, title_data)
        return workbook, worksheet

    def get_html(self, url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/78.0.3904.108 Safari/537.36'
        }  # browser-like request header
        response = requests.get(url, headers=headers)
        if response.status_code == 200:  # request succeeded
            return response.text
        else:
            return 'abnormal'

    def get_data(self, html):
        soup = BeautifulSoup(html, 'lxml')  # parse the page with BeautifulSoup
        ranks = soup.find_all('span', class_='pc_temp_num')    # ranking
        names = soup.find_all('a', class_='pc_temp_songname')  # singer and song
        times = soup.find_all('span', class_='pc_temp_time')   # play time
        for r, n, t in zip(ranks, names, times):
            r = r.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
            n = n.get_text()
            t = t.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
            self.worksheet.write(self.nums, 0, str(r))
            self.worksheet.write(self.nums, 1, str(n))
            self.worksheet.write(self.nums, 2, str(t))
            self.nums += 1

    def main(self):
        urls = ['https://www.kugou.com/yy/rank/home/{}-6666.html?from=rank'.format(i)
                for i in range(1, 24)]  # the chart spans 23 pages
        for url in urls:
            print(url)
            html = self.get_html(url)
            self.get_data(html)
            time.sleep(1)  # pause 1 s between requests
        self.workbook.save('data.xls')  # save everything to data.xls when done


if __name__ == '__main__':  # entry point when the program is executed
    spider = Spider()
    spider.main()
Output result:
The corresponding Excel table:
2. Clean and process the data: (1) There are no invalid rows or columns in the table, so this step is skipped.
(2) Duplicate value processing
import pandas as pd
biaosheng = pd.read_excel('data.xls')
biaosheng.duplicated()
result:
Because there are duplicate values, use the drop_duplicates method to delete duplicate values
import pandas as pd
biaosheng = pd.read_excel('data.xls')
biaosheng = biaosheng.drop_duplicates()
biaosheng
result:
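The same duplicated()/drop_duplicates() steps can be demonstrated on a small in-memory table, since data.xls is produced by the crawler and may not be at hand. The toy rows below are made up; the column names match the headers the crawler writes.

```python
import pandas as pd

# Toy data with one fully duplicated row (the third repeats the second).
df = pd.DataFrame({'Ranking': ['1', '2', '2'],
                   'Singer and Song Title': ['A - x', 'B - y', 'B - y'],
                   'Playing Time': ['3:46', '4:05', '4:05']})
print(df.duplicated().tolist())  # [False, False, True]
deduped = df.drop_duplicates()   # keeps the first occurrence of each row
print(len(deduped))              # 2
```

duplicated() marks every repeat after the first occurrence as True, and drop_duplicates() removes exactly those rows.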
(3) There are no null or missing values, so this step is skipped.
(4) Whitespace processing (the format is consistent and no stray spaces affect the data, so this step is skipped).
(5) Outlier handling (the songs are all of similar length, so this step is skipped).
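The crawler stores play times as "m:ss" strings, while the analysis below works in seconds, so a conversion step is implied. A minimal sketch of that conversion (the helper name to_seconds is my own):

```python
def to_seconds(mmss):
    # Convert an "m:ss" play-time string into a total number of seconds.
    minutes, seconds = mmss.strip().split(':')
    return int(minutes) * 60 + int(seconds)

print(to_seconds('3:46'))  # 226
```

Applied to the cleaned DataFrame, this would turn the play-time column into numbers suitable for plotting and correlation.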
3. Text analysis (not yet learned; skipped).
4. Data analysis and visualization: because there is too much data, only the top five entries of the chart are analyzed here. The relationship between the durations of the top-five songs (in seconds) and their rankings is shown as a bar chart.
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
ranks = [1, 2, 3, 4, 5]
durations = [346, 235, 189, 242, 250]  # top-five song lengths in seconds
plt.bar(ranks, durations)
plt.xlabel('Ranking')
plt.ylabel('Duration (seconds)')
plt.show()
result:
5. Analyze the correlation coefficient
import scipy.stats as stats

x = [346, 235, 189, 242, 250]
y = [1, 2, 3, 4, 5]
stats.pearsonr(x, y)
Result: the first number is the correlation coefficient; the second is the p-value.
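As a cross-check, the Pearson coefficient can be computed by hand from its definition on the same five (duration, rank) pairs:

```python
import math

x = [346, 235, 189, 242, 250]  # durations in seconds
y = [1, 2, 3, 4, 5]            # rankings

def pearson(a, b):
    # Pearson r = covariance / (std_a * std_b), using sums of deviations.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    std_a = math.sqrt(sum((ai - mean_a) ** 2 for ai in a))
    std_b = math.sqrt(sum((bi - mean_b) ** 2 for bi in b))
    return cov / (std_a * std_b)

print(pearson(x, y))
```

This gives roughly -0.51, a moderate negative correlation, and should agree with the first value returned by stats.pearsonr.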
Plotting the regression:
import pandas as pd
import seaborn as sns

biaosheng = pd.read_excel('data.xls')
# The crawled columns are text, so convert the "m:ss" play times to seconds
# and the rankings to integers before fitting the regression. The column
# names match the headers written by the crawler.
biaosheng['seconds'] = biaosheng['Playing Time'].map(
    lambda s: int(s.split(':')[0]) * 60 + int(s.split(':')[1]))
biaosheng['rank'] = biaosheng['Ranking'].astype(int)
sns.regplot(x='seconds', y='rank', data=biaosheng)
result:
6. Data persistence (not learned)
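Although the persistence step was skipped, a minimal sketch of it could store the cleaned table in a SQLite database with pandas. The toy rows and the table name 'chart' below are made up for illustration.

```python
import sqlite3
import pandas as pd

# Toy rows standing in for the cleaned data.xls contents.
df = pd.DataFrame({'Ranking': [1, 2],
                   'Singer and Song Title': ['A - x', 'B - y'],
                   'Playing Time': ['3:46', '4:05']})
conn = sqlite3.connect('songs.db')  # or ':memory:' for a throwaway test
df.to_sql('chart', conn, if_exists='replace', index=False)  # write the table
back = pd.read_sql('SELECT * FROM chart', conn)             # read it back
conn.close()
print(len(back))  # 2
```

Unlike an .xls file, a database survives restarts, can be queried with SQL, and can be appended to on each crawl.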
4. Conclusions: 1. There is no particular relationship between song duration and chart ranking.
2. This assignment brought together Python data analysis, data visualization, web crawling, and more. There was a lot of material, and because I am still unskilled I had to keep consulting the book to finish the task; for the many things the book did not cover, I spent a long time searching online to work out how to implement certain features in code. Through this exercise I have come to appreciate how deep Python goes, and I will study it harder and keep improving.