python爬虫实战一、爬取酷我音乐榜单并写入txt文件保存到本地

一、总代码和运行截图

#加载需要的库
import requests
from bs4 import BeautifulSoup
from lxml import etree
f = open("C:\\Users\LeeChoy\Desktop\kuwomusic.txt",'w+',encoding='utf-8')    #打开文件
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'} ##从浏览器复制张贴时注意删除agent：后面的空格
responce = requests.get("http://www.kuwo.cn/bang/index",headers=headers)   ##为使抓取数据稳定，伪装成浏览器
soup = BeautifulSoup(responce.text,'html.parser') #BeautifulSoup解析网页
html = etree.HTML(responce.text)    #XPath解析
#利用选择器获得数据
# 一、获得榜单的名字
names = soup.select( '#chartsContent > div.rightContentBox > ul > li > div.name > a')
#二、获取歌曲歌手
singers = soup.select('#chartsContent > div.rightContentBox > ul > li > div.artist > a')
#三、获取歌手的热度 xpath解析获得为一个列表
heats = html.xpath('//*[@id="chartsContent"]/div[2]/ul/li/div[4]/div/@style')
i=0    #为了写序号以及便于列表输出和储存数据
for name ,singer, heat in zip(names,singers,heats):
    print((str)(i+1)+" "+name.get_text()+"     "+singer.get_text()+"   "+heats[i].replace("width:",""))
    f.write((str)(i + 1) + " " + name.get_text() + "     "+singer.get_text() + "   "+heats[i].replace("width:","")+'\n')
    i+=1
f.close()  #关闭文件

运行结果

①打印的爬取结果

在这里插入图片描述
②写入的TXT文件

二、具体爬取流程

①加载所需要的库

所需要的库可根据实际爬取情况选取自己需要的库，这里没有利用正则表达式，后面会更新写一个正则表达式爬虫。这里利用requests库get()方法请求网页BeautifulSoup和XPath库进行解析和信息的选择获取，主要利用的beautifulsoup的select()方法以及XPath库中的xpath()方法进行选取需要的信息。

import requests
from bs4 import BeautifulSoup
from lxml import etree

②加入请求头来伪装成浏览器，以便个更好的抓取数据和维护爬虫的稳定性

这里以谷歌浏览器为例（其他浏览器大同小异）：首先进入酷我音乐官网的新歌排行榜。F12进入检查，或者鼠表右击进入检查（元素），找到网络（network），再找到请头标头（headers），复制User-Agent.
在这里插入图片描述
python中加入请求头并利用requests的get（）方法请求网页 ##从浏览器复制张贴时注意删除agent：后面的空格
：

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'} ##从浏览器复制张贴时注意删除agent：后面的空格
responce = requests.get("http://www.kuwo.cn/bang/index",headers=headers)   ##为使抓取数据稳定，伪装成浏览器

③、解析网页以及获取相应的信息

soup = BeautifulSoup(responce.text,'html.parser') #BeautifulSoup解析网页
html = etree.HTML(responce.text)    #XPath解析
#利用选择器获得数据
# 一、获得榜单的名字
names = soup.select( '#chartsContent > div.rightContentBox > ul > li > div.name > a')
#二、获取歌曲歌手
singers = soup.select('#chartsContent > div.rightContentBox > ul > li > div.artist > a')
#三、获取歌手的热度 xpath解析获得为一个列表
heats = html.xpath('//*[@id="chartsContent"]/div[2]/ul/li/div[4]/div/@style')

浏览器中的源代码对应网页的内容，根据网页块状颜色的提示找到我们需要的内容，其中select（）和xpath选择可以复制。选择复制selector的歌名称：#chartsContent > div.rightContentBox > ul > li:nth-child(1) > div.name > a其中li:nth-child(1)代表ul标签中第一个li的内容即第一个歌名，而我们需要所有的歌名，因此删除li:nth-child(1)即可。同理改变select的歌手的selector。对于xpath()解析选择，和beautifulsoup相比，xpath选择属性的能力更强。例：以下是第一个歌曲的热度的HTML代码我们只需要style特征属性中的值xpath中@可以选择html的标签的属性，且获得为一个列表。

<div class="heatValue" style="width:84%"></div>

//[@id=“chartsContent”]/div[2]/ul/li[1]/div[4]/div和select原理一样去掉li[1]中的[1]
即：//[@id=“chartsContent”]/div[2]/ul/li/div[4]/div然后选取style属性
//*[@id=“chartsContent”]/div[2]/ul/li/div[4]/div@style
在这里插入图片描述

④打印输入以及写入文件

open（）函数的第一个属性为文件的地址可以是相对位置或者绝对位置。文件写入时利用w+，文件可以不存在，我这里设置的一个绝对路径为在桌面的kuwomusic.txt文件，当这个文件不存在时，python会自动生成一个kuwomusic后缀为txt的文件再写入数据。

  #打开文件
f =open("C:\\Users\LeeChoy\Desktop\kuwomusic.txt",'w+',encoding='utf-8')   
i=0  #为了写序号以及便于列表输出和储存数据
for name ,singer, heat in zip(names,singers,heats):
    print((str)(i+1)+" "+name.get_text()+"     "+singer.get_text()+"   "+heats[i].replace("width:",""))
    f.write((str)(i + 1) + " " + name.get_text() + "     "+singer.get_text() + "   "+heats[i].replace("width:","")+'\n')
    i+=1
f.close()  #关闭文件