a real python reptile, crawling cool music list and write txt file saved locally

a real python reptile, crawling cool music list and write txt file saved locally

First, the total code and run shot

#加载需要的库
import requests
from bs4 import BeautifulSoup
from lxml import etree
f = open("C:\\Users\LeeChoy\Desktop\kuwomusic.txt",'w+',encoding='utf-8')    #打开文件
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'} ##从浏览器复制张贴时注意删除agent:后面的空格
responce = requests.get("http://www.kuwo.cn/bang/index",headers=headers)   ##为使抓取数据稳定,伪装成浏览器
soup = BeautifulSoup(responce.text,'html.parser') #BeautifulSoup解析网页
html = etree.HTML(responce.text)    #XPath解析
#利用选择器获得数据
# 一、获得榜单的名字
names = soup.select( '#chartsContent > div.rightContentBox > ul > li > div.name > a')
#二、获取歌曲歌手
singers = soup.select('#chartsContent > div.rightContentBox > ul > li > div.artist > a')
#三、获取歌手的热度 xpath解析获得为一个列表
heats = html.xpath('//*[@id="chartsContent"]/div[2]/ul/li/div[4]/div/@style')
i=0    #为了写序号以及便于列表输出和储存数据
for name ,singer, heat in zip(names,singers,heats):
    print((str)(i+1)+" "+name.get_text()+"     "+singer.get_text()+"   "+heats[i].replace("width:",""))
    f.write((str)(i + 1) + " " + name.get_text() + "     "+singer.get_text() + "   "+heats[i].replace("width:","")+'\n')
    i+=1
f.close()  #关闭文件
operation result

① crawling results print

Here Insert Picture Description
② written TXT file
Here Insert Picture Description

Second, the specific process crawling

① load library needed

The required libraries can be selected according to the actual needs its own library crawling situation, there is no use of regular expressions, later updated to write a regular expression crawlers. Utilized herein library requests get () method requests a web page BeautifulSoup parsing and XPath library selection and acquisition information, beautifulsoup of select () xpath XPath library method and () method using a primary select the desired information.

import requests
from bs4 import BeautifulSoup
from lxml import etree
② disguised as a request to join up the browser, for a better data capture and maintain the stability of crawlers

Here at Google browser, for example (other browsers similar): first into cool music official website of the song charts. F12 entry checks, or right-click the mouse to enter the check table (element), to find the network (Network), to find a header request header (headers), - Agent Replication-the User.
Here Insert Picture DescriptionHere Insert Picture Description
Python added a request header requests using the get () method requests page ## Note deleting agent from the browser to copy posted: space after
:

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'} ##从浏览器复制张贴时注意删除agent:后面的空格
responce = requests.get("http://www.kuwo.cn/bang/index",headers=headers)   ##为使抓取数据稳定,伪装成浏览器
③, parses the page and get the appropriate information
soup = BeautifulSoup(responce.text,'html.parser') #BeautifulSoup解析网页
html = etree.HTML(responce.text)    #XPath解析
#利用选择器获得数据
# 一、获得榜单的名字
names = soup.select( '#chartsContent > div.rightContentBox > ul > li > div.name > a')
#二、获取歌曲歌手
singers = soup.select('#chartsContent > div.rightContentBox > ul > li > div.artist > a')
#三、获取歌手的热度 xpath解析获得为一个列表
heats = html.xpath('//*[@id="chartsContent"]/div[2]/ul/li/div[4]/div/@style')

The browser source code corresponding to the content of the page, find what we need to follow the prompts massive color pages, where select () and select xpath can be copied. Copy selected song selector name: #chartsContent> div.rightContentBox> ul> li: nth-child (1)> div.name > wherein A li: nth-child (1) representing a content of a ul tag li i.e. the first song, and we need all of the song, so deleting li: nth-child (1) can be. Similarly change the selector select the singer. For xpath () parse options, and beautifulsoup compared, xpath select attributes stronger. Example: The following is the first heat of the song HTML code we just need to value style characteristic properties in xpath @ can select attributes of html tags, and obtain a list.

<div class="heatValue" style="width:84%"></div>

// [@ ID = "chartsContent"] / div [2] / UL / Li [1] / div [. 4] / div select and remove the same principle Li [1] in [1],
i.e.: //
[@id = "chartsContent"] / div [ 2] / ul / li / div [4] / div select the style attribute
// * [@ id = "chartsContent "] / div [2] / ul / li / div [4] / div @ style
Here Insert Picture Description

④ print input and write to the file

open () function is the first attribute may be a relative address of a file or an absolute position. When the file is written using the w +, the file may not exist, I am here to set an absolute path for the desktop kuwomusic.txt file when the file does not exist, python will automatically generate a kuwomusic suffix txt file and then write data .

  #打开文件
f =open("C:\\Users\LeeChoy\Desktop\kuwomusic.txt",'w+',encoding='utf-8')   
i=0  #为了写序号以及便于列表输出和储存数据
for name ,singer, heat in zip(names,singers,heats):
    print((str)(i+1)+" "+name.get_text()+"     "+singer.get_text()+"   "+heats[i].replace("width:",""))
    f.write((str)(i + 1) + " " + name.get_text() + "     "+singer.get_text() + "   "+heats[i].replace("width:","")+'\n')
    i+=1
f.close()  #关闭文件

Caishuxueqian, if wrong, please correct me!

Guess you like

Origin blog.csdn.net/qq_41767945/article/details/90268100