Goal: scrape the song names, durations, and links from the Kugou web page.
Method 1: Using the bs4 package
1. Fetch the Kugou site content
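# coding=utf-8
import os
import urllib.request  # "import urllib" alone does not guarantee the request submodule is loaded

from bs4 import BeautifulSoup

# Download the Kugou home page
result = urllib.request.urlopen("http://www.kugou.com")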
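urlopen sends urllib's default User-Agent and applies no timeout unless one is given, and some sites reject the default agent. A minimal sketch of a more defensive fetch; the helper names build_request and fetch_html and the User-Agent value are illustrative assumptions, not part of the original code:

```python
import urllib.request

# Hypothetical header value; any browser-like User-Agent string would do.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_request(url):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download the page and return its body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()


# Usage (performs a real network request):
# html = fetch_html("http://www.kugou.com")
```

The bytes returned by fetch_html can be passed to BeautifulSoup in the same way as result.read().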
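2. Extract the target tags based on the HTML structure
soup = BeautifulSoup(result.read(), 'html.parser')
s = []  # <li> elements of the song list; stays empty if the container is not found
for i in soup.find_all("div"):
    if i.get("id") == "SongtabContent":
        s = i.find_all("li")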
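The same lookup can be written as a single CSS selector. A small sketch against a hand-written HTML fragment; the fragment only mirrors the structure the code above relies on (a div with id SongtabContent whose li > a elements hold .songName and .songTime spans) and is not copied from the live page:

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the assumed page structure.
SAMPLE_HTML = """
<div id="SongtabContent">
  <ul>
    <li><a href="https://example.com/song/1">
      <span class="songName">Song A</span><span class="songTime">03:45</span>
    </a></li>
    <li><a href="https://example.com/song/2">
      <span class="songName">Song B</span><span class="songTime">04:10</span>
    </a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

songs = []
# "#SongtabContent li" matches every <li> nested inside the container.
for li in soup.select("#SongtabContent li"):
    link = li.find("a")
    songs.append({
        "name": link.select_one(".songName").get_text(strip=True),
        "time": link.select_one(".songTime").get_text(strip=True),
        "href": link.get("href"),
    })

for song in songs:
    print(song["name"], song["time"], song["href"])
```

Collecting the fields into dictionaries first keeps parsing separate from the file writing in the next step.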
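3. Save
with open(r"d://music.txt", "w", encoding="utf-8", newline="") as f:  # open the output file
    for i in s:
        f.write("Song name: %s " % i.a.select(".songName")[0].text)
        f.write("Song link: %s " % i.a.get("href"))
        f.write("Song duration: %s" % i.a.select(".songTime")[0].text)
        # newline="" disables newline translation, so os.linesep is written exactly once
        f.write(os.linesep)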
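Writing one song per line keeps the output easy to read back. A minimal sketch of the save step as a reusable function; the name save_songs, the dictionary keys, and the tab-separated layout are assumptions for illustration, and the sketch writes to a temporary directory rather than d://music.txt:

```python
import os
import tempfile


def save_songs(songs, path):
    """Write one tab-separated line per song: name, link, duration."""
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for song in songs:
            f.write("%s\t%s\t%s\n" % (song["name"], song["href"], song["time"]))


# Made-up sample records, not scraped data.
sample = [
    {"name": "Song A", "href": "https://example.com/song/1", "time": "03:45"},
    {"name": "Song B", "href": "https://example.com/song/2", "time": "04:10"},
]

out_path = os.path.join(tempfile.mkdtemp(), "music.txt")
save_songs(sample, out_path)

# Read the file back to confirm the layout.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```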
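Method 2: Using the Scrapy framework
(1) Create the project (named groad here, to match the module paths used in the steps below; "test" collides with an existing Python module name)
scrapy startproject groad
(2) cd into groad and run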
scrapy genspider newsong www.kugou.com
(3) In settings.py, uncomment the following three lines
ITEM_PIPELINES = {
    'groad.pipelines.GroadPipeline': 300,
}
(4) Write items.py
import scrapy

class GroadItem(scrapy.Item):
    songName = scrapy.Field()   # song name
    songTime = scrapy.Field()   # song duration
    href_song = scrapy.Field()  # song playback link
(5) newsong.py
import scrapy
from groad.items import GroadItem

class NewsongSpider(scrapy.Spider):
    name = 'newsong'
    allowed_domains = ['www.kugou.com']
    start_urls = ['http://www.kugou.com/']

    def parse(self, response):
        # Visit every <li> under every <ul> of the song list container
        for li in response.xpath('//*[@id="SongtabContent"]/ul/li'):
            # Create a fresh item per song so already-yielded items are never overwritten
            item = GroadItem()
            item['songName'] = li.xpath('./a/span[1]/text()').extract_first()
            item['songTime'] = li.xpath('./a/span[@class="songTime"]/text()').extract_first()
            item['href_song'] = li.xpath('./a/@href').extract_first()
            yield item
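The path structure used in parse can be tried out without running a crawl. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath (there is no text() step, for example) and needs well-formed markup, so this illustrates the paths rather than Scrapy's API. The sample markup is hand-written to match the structure the spider assumes.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample with two <ul> groups, as the spider assumes.
SAMPLE = """<html><body>
<div id="SongtabContent">
  <ul>
    <li><a href="https://example.com/song/1"><span class="songName">Song A</span><span class="songTime">03:45</span></a></li>
  </ul>
  <ul>
    <li><a href="https://example.com/song/2"><span class="songName">Song B</span><span class="songTime">04:10</span></a></li>
  </ul>
</div>
</body></html>"""

root = ET.fromstring(SAMPLE)

items = []
# Select every li under every ul of the container in one expression,
# then read each field with a path relative to that li.
for li in root.findall(".//*[@id='SongtabContent']/ul/li"):
    items.append({
        "songName": li.find("a/span[1]").text,
        "songTime": li.find("a/span[@class='songTime']").text,
        "href_song": li.find("a").get("href"),
    })
print(items)
```

Selecting the li elements once and querying relative to each one avoids rebuilding an absolute path with ul and li indexes for every field.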
(6) pipelines.py: save the item data
import json

class GroadPipeline(object):
    def __init__(self):
        # Output file, one JSON object per line
        self.filename = open("e://downloads//newsong.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese song names readable instead of \u escapes
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
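An item pipeline is a plain Python class, so it can be exercised by hand without starting a crawl. A minimal sketch, assuming items behave like dicts; the class name JsonLinesPipeline and its path argument are hypothetical, and the file is opened in the open_spider hook here instead of __init__:

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes each item as one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII song names readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Drive the pipeline manually with a made-up item; the spider argument is unused.
out_path = os.path.join(tempfile.mkdtemp(), "newsong.txt")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"songName": "歌曲A", "songTime": "03:45", "href_song": "https://example.com/song/1"},
    spider=None,
)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    raw = f.read()
saved = [json.loads(line) for line in raw.splitlines()]
print(saved)
```

Scrapy itself instantiates pipelines without arguments unless a from_crawler class method is defined, so a real project would need a default path or that hook.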
(7) Run
scrapy crawl newsong