Crawl the Kugou music website and get the top 500 song titles on the list!

Crawl network data

1.1 First read the website`

https://www.kugou.com/yy/rank/home/1-6666.html?from=rank`

1.2 Define the text part of the HTML code selected with beautifulsoup:

def explain_HTML(mylist, html):
    soup = BeautifulSoup(html,'html.parser')
    songs = soup.select('div.pc_temp_songlist > ul > li > a')
    ranks = soup.select('span.pc_temp_num')
    times = soup.select('span.pc_temp_time')
    for rank,song,time in zip(ranks,songs,times):
         data = [
             rank.get_text().strip(),
             song.get_text().split("-")[1],
             song.get_text().split("-")[0],
             time.get_text().strip()
         ]
         mylist.append(data)
12345678910111213

##In order to make the printed results more beautiful, the data format of the printing should be adjusted:

def print_HTML(mylist):
    for i in range(500):
        x = mylist[i]
        with open("D:\Anew\kugou.txt",'a',encoding = 'UTF-8') as f:
            f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format(x[0],x[1],x[2],x[3],chr(12288)))
12345

In order to prevent crawling failures caused by fast crawling speed, the time.sleep(1) function is set up to prevent the above risks.

1.4 The crawling results are shown as follows:

1.5 Code display

import requests
from bs4 import BeautifulSoup
import time

def get_HTML(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36",
        "referer": "https://www.kugou.com/yy/rank/home/1-6666.html?from=rank"
        }
    try:
        r = requests.get(url,headers = headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def explain_HTML(mylist, html):
    soup = BeautifulSoup(html,'html.parser')
    songs = soup.select('div.pc_temp_songlist > ul > li > a')
    ranks = soup.select('span.pc_temp_num')
    times = soup.select('span.pc_temp_time')
    for rank,song,time in zip(ranks,songs,times):
         data = [
             rank.get_text().strip(),
             song.get_text().split("-")[1],
             song.get_text().split("-")[0],
             time.get_text().strip()
         ]
         mylist.append(data)


def print_HTML(mylist):
    for i in range(500):
        x = mylist[i]
        with open("D:\Anew\kugou.json",'a',encoding = 'UTF-8') as f:
            f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format(x[0],x[1],x[2],x[3],chr(12288)))
          

if __name__ == '__main__':
    url_0 = 'http://www.kugou.com/yy/rank/home/'
    url_1 = '-8888.html'
    mylist = []
    with open("D:\Anew\kugou.json",'a',encoding = "UTF-8") as f:
        f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format("排名","歌曲","歌手","时间",chr(12288)))
    for j in range(1,24):
        url = url_0 + str(j) + url_1
        html = get_HTML(url)
        explain_HTML(mylist, html)
    print_HTML(mylist)
    time.sleep(1)
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051

1.6 The document before json transcoding should be

The display in the online parsing json tool should be:

The verification code is shown as follows through code operation:

with open("D://Anew//Kugou.json", 'r', encoding="utf-8") as rdf:
    json_data=json.load(rdf)
    print('数据展示:', json_data)
123

Programmatically generate CSV files and convert them to JSon format

Encode information in the form of an array

import csv 
test = [["Name", "Gender","Hometown","Dept"], 
["Zhang Di", "Male", "Chongqing", "Computer Department"], 
["Lambo", "Male","Jiangsu","Communication Engineering Department"], 
["Huang Fei", "Male","Sichuan","Internet of Things Department"], 
["Deng Yuchun", "Female","Shaanxi"," Department of Computer Science"], 
["Zhou Li", "女","Tianjin","Art Department"], 
["Li Yun", "女","Shanghai","Foreign Language Department"] 
] 
with open('信息
    Input.csv ','w',encoding="utf8") as file: csvwriter = csv.writer(file, lineterminator='\n') 
    csvwriter.writerows(test) 
123456789101112

2.2 csv code conversion to json format

import csv,json

csvfile = open('D:\\Anew\\test.csv','r')
jsonfile = open('D:\\Anew\\test.json','w')

fieldnames = ('姓名','性别','籍贯','系别')
reader = csv.DictReader (csvfile,fieldnames)
for row in reader:
    json.dump(row,jsonfile)
    jsonfile.write('\n')
12345678910

Since the json file has its own encryption effect, you need to find a json formatting verification tool on the Internet. The BEJSON tool is used this time, and the verification display diagram is as follows:

2.3 Query information about girls in files

import json
with open('D:\\Anew\\名单.json','r') as f:
    for line in f.readlines():
        line=json.loads(line)
        if(line['性别']=='女'):
            print(line)
123456

Conversion of XML format files and JSon

3.1(1) Read the following XML format file, the content is as follows:

<?xml version=”1.0” encoding=”gb2312”> <Book> <Book Title> A Dream of Red Mansions

Click here to get the complete project code

 

Guess you like

Origin blog.csdn.net/weixin_43881394/article/details/108996831