Crawl network data
1.1 First, open the target website:
https://www.kugou.com/yy/rank/home/1-6666.html?from=rank
1.2 Define a function that uses BeautifulSoup to extract the text of the selected HTML elements:
from bs4 import BeautifulSoup

def explain_HTML(mylist, html):
    # Parse the page and select the song links, rank labels, and durations.
    soup = BeautifulSoup(html, 'html.parser')
    songs = soup.select('div.pc_temp_songlist > ul > li > a')
    ranks = soup.select('span.pc_temp_num')
    times = soup.select('span.pc_temp_time')
    # Each link text has the form "singer - song"; store rank, song, singer, duration.
    for rank, song, duration in zip(ranks, songs, times):
        data = [
            rank.get_text().strip(),
            song.get_text().split("-")[1].strip(),
            song.get_text().split("-")[0].strip(),
            duration.get_text().strip()
        ]
        mylist.append(data)
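The function above assumes that every link text contains a hyphen and that index 1 holds the whole song title. As a minimal sketch of a more defensive split (the helper name `split_singer_song` is hypothetical and not part of the original script), `str.partition` splits on the first hyphen only, so a missing hyphen does not raise an IndexError and hyphens inside the title are preserved:

```python
def split_singer_song(text):
    """Split 'singer - song' into (song, singer).

    Hypothetical helper: splits on the first hyphen only, so a title
    that itself contains '-' stays intact. If there is no hyphen, the
    whole text is treated as the song and the singer is left empty.
    """
    singer, sep, song = text.partition("-")
    if not sep:
        return text.strip(), ""
    return song.strip(), singer.strip()


print(split_singer_song("Singer A - Song B"))
print(split_singer_song("Singer - Song - Live"))
print(split_singer_song("No separator"))
```

A singer name that itself contains a hyphen would still be split in the wrong place, so this is only a partial safeguard.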
To make the output easier to read, adjust the format in which the data is written:
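def print_HTML(mylist):
    # Append at most the first 500 records to the output file in aligned columns.
    # chr(12288) is the full-width space, used as the fill character so that
    # Chinese text lines up.
    with open(r"D:\Anew\kugou.txt", 'a', encoding='UTF-8') as f:
        for x in mylist[:500]:
            f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format(x[0], x[1], x[2], x[3], chr(12288)))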
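The format string passes chr(12288), the full-width space, as argument 4 and uses it as the fill character through the nested field `{4}`. The following sketch isolates that mechanism; the function name `format_row` and the sample values are made up for illustration:

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character


def format_row(rank, song, singer, duration):
    # {1:{4}<25} means: left-align argument 1 in 25 characters,
    # padding with argument 4 (the full-width space).
    template = "{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n"
    return template.format(rank, song, singer, duration, FULL_WIDTH_SPACE)


row = format_row("1", "歌曲", "歌手", "3:45")
fields = row.rstrip("\n").split("\t")
print([len(field) for field in fields])  # prints [10, 25, 20, 10]
```

Padding Chinese text with an ordinary half-width space would leave the columns visibly ragged, which is why the full-width space is used for the two text columns.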
To avoid failures caused by requesting pages too quickly, the script pauses with time.sleep(1).
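As a rough sketch of this throttling idea, the loop below waits between consecutive requests. The function `crawl_pages` and its `fetch` and `sleep` parameters are hypothetical; in the real script the fetcher would be the function that downloads a page.

```python
import time


def crawl_pages(urls, fetch, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    `fetch` is any callable that takes a URL and returns the page text;
    `sleep` is injectable so the pause can be replaced in tests.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Usage with a stand-in fetcher, so no network access is needed.
urls = ["http://www.kugou.com/yy/rank/home/%d-8888.html" % j for j in range(1, 4)]
pages = crawl_pages(urls, fetch=lambda url: "<html>%s</html>" % url, delay=0)
print(len(pages))
```

Placing the pause inside the loop matters: a single sleep after all requests have been sent does nothing to slow the crawl.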
1.4 The crawling results are shown as follows:
1.5 Complete code
import requests
from bs4 import BeautifulSoup
import time


def get_HTML(url):
    # Fetch a page with browser-like headers; return "" if the request fails.
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36",
        "referer": "https://www.kugou.com/yy/rank/home/1-6666.html?from=rank"
    }
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def explain_HTML(mylist, html):
    # Each link text has the form "singer - song"; store rank, song, singer, duration.
    soup = BeautifulSoup(html, 'html.parser')
    songs = soup.select('div.pc_temp_songlist > ul > li > a')
    ranks = soup.select('span.pc_temp_num')
    times = soup.select('span.pc_temp_time')
    for rank, song, duration in zip(ranks, songs, times):
        data = [
            rank.get_text().strip(),
            song.get_text().split("-")[1].strip(),
            song.get_text().split("-")[0].strip(),
            duration.get_text().strip()
        ]
        mylist.append(data)


def print_HTML(mylist):
    # Append at most the first 500 records as aligned, tab-separated rows.
    # Note: these rows are plain text, not JSON, despite the file extension.
    with open(r"D:\Anew\kugou.json", 'a', encoding='UTF-8') as f:
        for x in mylist[:500]:
            f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format(x[0], x[1], x[2], x[3], chr(12288)))


if __name__ == '__main__':
    url_0 = 'http://www.kugou.com/yy/rank/home/'
    url_1 = '-8888.html'
    mylist = []
    # Write the header row: rank, song, singer, duration.
    with open(r"D:\Anew\kugou.json", 'a', encoding="UTF-8") as f:
        f.write("{0:<10}\t{1:{4}<25}\t{2:{4}<20}\t{3:<10}\n".format("排名", "歌曲", "歌手", "时间", chr(12288)))
    for j in range(1, 24):
        url = url_0 + str(j) + url_1
        html = get_HTML(url)
        explain_HTML(mylist, html)
        time.sleep(1)  # pause between requests
    print_HTML(mylist)
1.6 The file before JSON conversion should look like this:
In an online JSON parsing tool, it should be displayed as:
The file can be verified by running the following code:
import json

# Load the whole file as JSON; json.load raises an error if it is not valid JSON.
with open("D://Anew//Kugou.json", 'r', encoding="utf-8") as rdf:
    json_data = json.load(rdf)
print('Data:', json_data)
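json.load only succeeds if the file contains valid JSON, whereas the crawler writes tab-separated text. The conversion step is not shown here, so the following is only a sketch of one way it might be done. The function name `rows_to_json` and the English key names are assumptions.

```python
import json


def rows_to_json(text, keys=("rank", "song", "singer", "time")):
    """Convert tab-separated lines into a JSON array of objects.

    Assumes each line holds one tab-separated value per key and that
    the padding consists of ordinary or full-width (U+3000) spaces.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = [value.strip(" \u3000") for value in line.split("\t")]
        records.append(dict(zip(keys, values)))
    return json.dumps(records, ensure_ascii=False, indent=2)


sample = "1         \tSong A\u3000\u3000\tSinger A\u3000\t3:45      \n"
json_text = rows_to_json(sample)
print(json.loads(json_text))
```

A header row in the input would be converted like any other row, so it would need to be skipped or removed first.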
Programmatically generate CSV files and convert them to JSON format
2.1 Encode the information as an array
import csv

# Header row followed by one row per person.
test = [
    ["Name", "Gender", "Hometown", "Dept"],
    ["Zhang Di", "Male", "Chongqing", "Computer Department"],
    ["Lambo", "Male", "Jiangsu", "Communication Engineering Department"],
    ["Huang Fei", "Male", "Sichuan", "Internet of Things Department"],
    ["Deng Yuchun", "Female", "Shaanxi", "Department of Computer Science"],
    ["Zhou Li", "Female", "Tianjin", "Art Department"],
    ["Li Yun", "Female", "Shanghai", "Foreign Language Department"],
]
# newline='' lets the csv module control the line endings itself.
with open('信息 Input.csv', 'w', encoding="utf8", newline='') as file:
    csvwriter = csv.writer(file, lineterminator='\n')
    csvwriter.writerows(test)
2.2 Convert the CSV file to JSON format
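import csv
import json

fieldnames = ('姓名', '性别', '籍贯', '系别')  # name, gender, hometown, department
with open('D:\\Anew\\test.csv', 'r', encoding='utf-8', newline='') as csvfile, \
        open('D:\\Anew\\test.json', 'w') as jsonfile:
    # With explicit fieldnames, every row (including any header row) is read as data.
    reader = csv.DictReader(csvfile, fieldnames)
    # Write one JSON object per line.
    for row in reader:
        json.dump(row, jsonfile)
        jsonfile.write('\n')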
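If the CSV file begins with a header row, csv.DictReader can take the field names from that row instead of an explicit tuple. The sketch below uses in-memory sample rows modelled on the table above. The function name `csv_to_json_lines` is hypothetical.

```python
import csv
import io
import json

csv_text = (
    "Name,Gender,Hometown,Dept\n"
    "Zhang Di,Male,Chongqing,Computer Department\n"
    "Zhou Li,Female,Tianjin,Art Department\n"
)


def csv_to_json_lines(csv_file):
    # Without a fieldnames argument, DictReader uses the first row as keys,
    # so the header row is not emitted as a data record.
    reader = csv.DictReader(csv_file)
    return [json.dumps(row) for row in reader]


for line in csv_to_json_lines(io.StringIO(csv_text)):
    print(line)
```

This avoids the header row being converted into a JSON record of its own, which is what happens when field names are supplied explicitly and the file also contains a header.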
Because json.dump escapes non-ASCII characters as \uXXXX sequences by default, the Chinese text in the JSON file is not directly readable, so an online JSON formatting and validation tool is needed to check it. The BEJSON tool is used here, and its output is shown below:
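As an alternative to an online tool, the file can be made readable directly: json.dump and json.dumps accept ensure_ascii=False, which writes non-ASCII characters unescaped. A minimal sketch with a sample record:

```python
import json

record = {"姓名": "Zhou Li", "性别": "女"}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)
# Both strings decode to the same dictionary.
print(json.loads(escaped) == json.loads(readable))
```

When writing with ensure_ascii=False, the output file should be opened with an explicit encoding such as encoding='utf-8' so that the characters are stored consistently.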
2.3 Query the female entries in the file
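import json

with open('D:\\Anew\\名单.json', 'r', encoding='utf-8') as f:
    # The file holds one JSON object per line.
    for line in f:
        record = json.loads(line)
        # '性别' is the gender field; '女' means female.
        if record['性别'] in ('女', 'Female'):
            print(record)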
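The same query can be written as a small reusable filter. This is only a sketch: the function name `filter_records` is hypothetical and the two sample records are made up, with '男' meaning male.

```python
import io
import json


def filter_records(lines, key, wanted):
    """Yield the JSON objects whose `key` field equals `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get(key) == wanted:
            yield record


sample = io.StringIO(
    '{"姓名": "Zhang Di", "性别": "男"}\n'
    '{"姓名": "Zhou Li", "性别": "女"}\n'
)
for record in filter_records(sample, "性别", "女"):
    print(record)
```

Using record.get(key) rather than record[key] means that a line lacking the field is skipped instead of raising a KeyError.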
Conversion between XML files and JSON
3.1 (1) Read the following XML file, whose content is as follows:
<?xml version="1.0" encoding="gb2312"?>
<Book>
<Book Title> A Dream of Red Mansions
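The file content above is cut off, so the following sketch uses a hypothetical, complete stand-in document (with a made-up `Title` element) to illustrate how an XML tree might be read with the standard library and converted to JSON. It is not the original conversion code.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, complete stand-in for the truncated file above.
xml_text = "<Book><Title>A Dream of Red Mansions</Title></Book>"


def element_to_dict(element):
    # Leaf elements become their text; other elements become a dict of
    # their children. Repeated child tags would overwrite one another,
    # and attributes are ignored in this simplified sketch.
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return {child.tag: element_to_dict(child) for child in children}


root = ET.fromstring(xml_text)
json_text = json.dumps({root.tag: element_to_dict(root)}, ensure_ascii=False)
print(json_text)
```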