Crawling the Kugou Soaring Chart

1. Crawl the Kugou soaring chart and put each singer, song title, and duration into a table

Idea: crawl and parse the page, then write the results to an Excel table

Technical difficulty: parsing the page source

2.1. URL: https://www.kugou.com/yy/rank/home/1-6666.html?from=rank

2. Find the tags that hold the singer, song title, and duration in the page source, and use find_all to collect them
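To make the find_all step concrete, here is a minimal sketch that runs on a small inline HTML fragment instead of the live page. The class names (pc_temp_num, pc_temp_songname, pc_temp_time) are the ones used in this post's code; the sample markup, the song entries, and the helper name extract_rows are made up for illustration, and html.parser is used only so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the chart markup; the real page is much larger.
SAMPLE_HTML = """
<ul>
  <li>
    <span class="pc_temp_num"> 1 </span>
    <a class="pc_temp_songname">Singer A - Song A</a>
    <span class="pc_temp_time"> 3:46 </span>
  </li>
  <li>
    <span class="pc_temp_num"> 2 </span>
    <a class="pc_temp_songname">Singer B - Song B</a>
    <span class="pc_temp_time"> 4:05 </span>
  </li>
</ul>
"""


def extract_rows(html):
    """Return (rank, title, time) tuples found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns every tag with the given name and CSS class
    ranks = soup.find_all("span", class_="pc_temp_num")
    names = soup.find_all("a", class_="pc_temp_songname")
    times = soup.find_all("span", class_="pc_temp_time")
    # zip pairs the three result lists position by position
    return [
        (r.get_text(strip=True), n.get_text(strip=True), t.get_text(strip=True))
        for r, n, t in zip(ranks, names, times)
    ]


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Passing strip=True to get_text removes the surrounding whitespace in one step, which is an alternative to chaining replace calls for newlines and tabs.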

3.1. Code:

from bs4 import BeautifulSoup
import requests
import time
import xlwt


class Spider:
    def __init__(self):
        # Create the Excel workbook that stores the data
        self.workbook, self.worksheet = self.create_excel()
        self.nums = 1  # next row to write; row 0 holds the header

    def create_excel(self):
        workbook = xlwt.Workbook(encoding='utf-8')
        worksheet = workbook.add_sheet('Sheet1')
        title = ['Ranking', 'Singer and Song Title', 'Playing Time']
        for index, title_data in enumerate(title):
            worksheet.write(0, index, title_data)
        return workbook, worksheet

    def get_html(self, url):
        # Crawler request header
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:  # return the page only if the request succeeded
            return response.text
        else:
            return 'abnormal'

    def get_data(self, html):
        soup = BeautifulSoup(html, 'lxml')  # parse the page with the BeautifulSoup library
        ranks = soup.find_all('span', class_="pc_temp_num")  # rank
        names = soup.find_all('a', class_="pc_temp_songname")  # singer and song
        times = soup.find_all('span', class_="pc_temp_time")  # play time

        # zip() walks the three lists in step, one song per iteration
        for r, n, t in zip(ranks, names, times):
            r = r.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
            n = n.get_text()
            t = t.get_text().replace('\n', '').replace('\t', '').replace('\r', '')
            data = {'rank': r, 'song-singer': n, 'play_time': t}
            self.worksheet.write(self.nums, 0, str(r))
            self.worksheet.write(self.nums, 1, str(n))
            self.worksheet.write(self.nums, 2, str(t))
            self.nums += 1

    def main(self):
        # Note: the URL has no {} placeholder, so format() returns it unchanged and
        # the same page is requested 23 times (hence the duplicate rows cleaned later)
        urls = ['https://www.kugou.com/yy/rank/home/1-6666.html?from=rank'.format(str(i)) for i in range(1, 24)]
        for url in urls:
            print(url)
            html = self.get_html(url)
            self.get_data(html)
            time.sleep(1)  # pause 1 s
        self.workbook.save('data.xls')  # save everything to data.xls once all pages are stored


if __name__ == '__main__':  # main() is called when the program is executed
    spider = Spider()
    spider.main()
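A note on the URL list built in main(): the string passed to format() contains no {} placeholder, so every generated URL is identical. The sketch below demonstrates that, and contrasts it with a paged pattern. The paged pattern, where the leading number in the file name selects the page, is an assumption for illustration and is not confirmed by this post; check it against the site before relying on it.

```python
BASE = "https://www.kugou.com/yy/rank/home/1-6666.html?from=rank"

# str.format() silently ignores extra arguments when there is no {} placeholder,
# so this produces 23 copies of the same URL.
same = [BASE.format(str(i)) for i in range(1, 24)]
print(len(same), len(set(same)))

# Hypothetical paged pattern (assumption): the page number replaces the leading "1".
PAGED = "https://www.kugou.com/yy/rank/home/{}-6666.html?from=rank"
paged = [PAGED.format(i) for i in range(1, 4)]
for url in paged:
    print(url)
```

Requesting the same page repeatedly is one plausible reason the spreadsheet ends up with repeated rows.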

Output result:

Corresponding Excel table:

2. Clean and process the data:

(1) The table has no invalid rows or columns, so this step is skipped.

(2) Handling duplicate values

import pandas as pd
biaosheng=pd.DataFrame(pd.read_excel('data.xls'))
biaosheng.duplicated()

result:

Because the table contains duplicate rows, use the drop_duplicates method to remove them

import pandas as pd
biaosheng=pd.DataFrame(pd.read_excel('data.xls'))
biaosheng=biaosheng.drop_duplicates()
biaosheng

result:
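For readers without the spreadsheet at hand, the two pandas calls used in this step can be tried on a small in-memory table. This is only a sketch: the rows are invented, and the column names are assumed to match the header row written by the crawler.

```python
import pandas as pd

# Invented sample rows; rows 2 and 3 repeat rows 0 and 1 exactly.
rows = pd.DataFrame({
    "Ranking": ["1", "2", "1", "2"],
    "Singer and Song Title": [
        "Singer A - Song A", "Singer B - Song B",
        "Singer A - Song A", "Singer B - Song B",
    ],
    "Playing Time": ["3:46", "4:05", "3:46", "4:05"],
})

# duplicated() marks a row True when an identical row appeared earlier
dup_mask = rows.duplicated()
print(dup_mask.tolist())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = rows.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Note that drop_duplicates returns a new DataFrame, which is why the result has to be assigned back to a variable.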

(3) There are no null or missing values, so this step is skipped.

(4) Whitespace handling (the format is consistent and no stray spaces affect the data, so this step is skipped)

(5) Outlier handling (the songs are of similar length, so this step is skipped)

3. Text analysis (skipped; I do not know how to do this yet)

4. Data analysis and visualization: because there is too much data, only the top five songs on the chart are analyzed here. The relationship between each song's duration (in seconds) and its ranking is shown as a bar chart.

import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
# x: song duration in seconds, bar height: ranking
plt.bar([346, 235, 189, 242, 250], [1, 2, 3, 4, 5], label='Ranking')
plt.legend()
plt.show()

result:
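The durations in the chart are given in seconds. A minimal sketch of that conversion follows, assuming the scraped play time is text in m:ss form (an assumption; the post does not show the raw values). The helper name to_seconds is hypothetical, and the sample strings are back-computed from the five second values used in this post, not copied from the spreadsheet.

```python
def to_seconds(play_time):
    """Convert a 'm:ss' string such as '5:46' to a number of seconds."""
    minutes, seconds = play_time.strip().split(":")
    return int(minutes) * 60 + int(seconds)


# Sample strings back-computed from the durations used in this post
sample_times = ["5:46", "3:55", "3:09", "4:02", "4:10"]
durations = [to_seconds(t) for t in sample_times]
print(durations)
```

The strip() call guards against the leading and trailing whitespace that scraped text often carries.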

5. Analyze the correlation coefficient

import pandas as pd
import scipy.stats as stats
x=[346,235,189,242,250]
y=[1,2,3,4,5]
stats.pearsonr(x, y)

Result: the first number returned is the correlation coefficient; the second is the p-value.
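To show what the first number from stats.pearsonr represents, the coefficient can also be computed by hand from its definition: the sum of the products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations. This is a plain-Python sketch for illustration, using the same five data points; the function name pearson_r is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


x = [346, 235, 189, 242, 250]  # durations in seconds
y = [1, 2, 3, 4, 5]            # rankings
r = pearson_r(x, y)
print(round(r, 4))
```

The value always lies between -1 and 1; with only five data points, any coefficient should be read with caution.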

Drawing

import pandas as pd
import seaborn as sns
biaosheng = pd.DataFrame(pd.read_excel('data.xls'))
# Both columns must be numeric for regplot to fit a regression line
sns.regplot(x=biaosheng['Playing Time'], y=biaosheng['Ranking'])

result:

6. Data persistence (not learned yet)

4.1. There is no special connection between the song duration and the ranking.

2. This project combines Python data analysis, Python data visualization, and web crawling, so it covers a lot of ground. Because I am still unskilled, I had to keep turning to the book to complete the task, and for the many parts the book does not cover I spent a long time searching on Baidu to find out how to implement certain functions in code. After this exercise I appreciate how deep Python is. I will work harder at learning it and try to become stronger.

 

Origin www.cnblogs.com/lsctj/p/12758158.html