Hands-On Web Scraping with Python

Goal: scrape the song names, durations, and links from the Kugou web page.

Method 1: Using the bs4 package

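1. Fetch the content of the Kugou site
# coding=utf-8
import os
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded

from bs4 import BeautifulSoup

# Download the raw HTML of the home page
result = urllib.request.urlopen("http://www.kugou.com")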
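
Some sites respond differently to clients that do not identify themselves as a browser, so it can help to send an explicit User-Agent header and a timeout. The sketch below assumes that is worthwhile for this site, which the article itself does not state. The names `build_request` and `fetch_html` and the header value are hypothetical, chosen only for illustration.

```python
import urllib.request

# Assumed header value for illustration; any browser-like string would do.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"


def build_request(url):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch_html("http://www.kugou.com")` should return bytes that can be handed to BeautifulSoup in the same way as `result.read()`.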

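2. Extract the target tags based on the HTML structure
soup = BeautifulSoup(result.read(), 'html.parser')
s = []  # <li> entries of the song list; stays empty if the container is not found
for i in soup.find_all("div"):
    if i.get("id") == "SongtabContent":
        s = i.find_all("li")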
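
The loop above depends on a particular page structure. The sketch below runs the same kind of extraction on a small hand-written HTML sample, so the assumed structure is visible. The sample markup, the song values, and the helper name `extract_songs` are made up for illustration; they only mirror what the selectors in this article imply (a `div` with id `SongtabContent` whose `li > a` entries hold `songName` and `songTime` spans). It also uses `find` with an `id` argument instead of scanning every `div`.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure the selectors assume.
SAMPLE_HTML = """
<div id="SongtabContent">
  <ul>
    <li><a href="http://example.com/song/1">
      <span class="songName">Song A</span><span class="songTime">3:45</span>
    </a></li>
    <li><a href="http://example.com/song/2">
      <span class="songName">Song B</span><span class="songTime">4:10</span>
    </a></li>
  </ul>
</div>
"""


def extract_songs(html):
    """Return a list of (name, href, duration) tuples from the song list."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="SongtabContent")
    if container is None:  # layout changed or the list is not in the HTML
        return []
    songs = []
    for li in container.find_all("li"):
        name = li.a.select_one(".songName").get_text(strip=True)
        duration = li.a.select_one(".songTime").get_text(strip=True)
        songs.append((name, li.a.get("href"), duration))
    return songs


songs = extract_songs(SAMPLE_HTML)
print(songs)
```

Returning an empty list when the container is missing avoids a later NameError if the page layout differs from what is assumed here.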

3. Save the results
with open(r"d://music.txt", "w", encoding="utf-8") as f:  # create the output file object
    for i in s:
        f.write("Song name: %s " % i.a.select(".songName")[0].text)
        f.write("Song link: %s " % i.a.get("href"))
        f.write("Song duration: %s" % i.a.select(".songTime")[0].text)
        f.write("\n")  # text mode already translates "\n" to the platform line ending
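
The saving step can also be tried in isolation. The sketch below is a minimal version under stated assumptions: the helper name `save_songs`, the tuple layout, and the sample records are hypothetical, and it writes to a temporary directory rather than a fixed drive path. It ends each line with "\n" because a file opened in text mode translates that to the platform line ending on its own.

```python
import os
import tempfile


def save_songs(path, songs):
    """Write one line per (name, href, duration) tuple; return the line count."""
    with open(path, "w", encoding="utf-8") as f:
        for name, href, duration in songs:
            f.write("Song name: %s Song link: %s Song duration: %s\n"
                    % (name, href, duration))
    return len(songs)


# Made-up records, written to a temporary directory for the demo.
sample = [("Song A", "http://example.com/song/1", "3:45"),
          ("Song B", "http://example.com/song/2", "4:10")]
out_path = os.path.join(tempfile.mkdtemp(), "music_example.txt")
written = save_songs(out_path, sample)

# Read the file back to check what was stored.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(written, lines[0])
```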

Method 2: Using the Scrapy framework

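(1) Create the project (it is named groad here to match the module paths used in the code below; a project named test is rejected because a Python module with that name already exists)
scrapy startproject groad

(2) cd into groad and run
scrapy genspider newsong www.kugou.com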

(3) In settings.py, uncomment the following three lines

ITEM_PIPELINES = {
    'groad.pipelines.GroadPipeline': 300,
}

(4) Write the items.py file

import scrapy

class GroadItem(scrapy.Item):

    songName = scrapy.Field()   # song name
    songTime = scrapy.Field()   # song duration
    href_song = scrapy.Field()  # song link

(5) The newsong.py file

import scrapy
from groad.items import GroadItem

class NewsongSpider(scrapy.Spider):
    name = 'newsong'
    allowed_domains = ['www.kugou.com']
    start_urls = ['http://www.kugou.com/']

    def parse(self, response):
        # XPath positions are 1-based, hence range(1, n + 1)
        for i in range(1, len(response.xpath('//*[@id="SongtabContent"]/ul')) + 1):
            for j in range(1, len(response.xpath('//*[@id="SongtabContent"]/ul[%s]/li' % i)) + 1):
                item = GroadItem()  # a fresh item per song, so yielded items do not share state
                item['songName'] = response.xpath('//*[@id="SongtabContent"]/ul[%s]/li[%s]/a/span[1]/text()' % (i, j)).extract()[0]
                item['songTime'] = \
                    response.xpath('//*[@id="SongtabContent"]/ul[%s]/li[%s]/a/span[@class="songTime"]/text()' % (i, j)).extract()[0]
                item['href_song'] = \
                    response.xpath('//*[@id="SongtabContent"]/ul[%s]/li[%s]/a/@href' % (i, j)).extract()[0]
                yield item
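
The nested index loops above evaluate an absolute XPath for every field of every song. An alternative is to select each `li` node once and then query relative to that node; in Scrapy this would be done by calling `.xpath()` on each selector returned by `response.xpath(...)`. To keep the sketch runnable without Scrapy, the version below swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset, and runs on a hand-written, well-formed sample. The sample markup and the name `parse_songs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed sample imitating the assumed page structure.
SAMPLE = """
<html><body>
<div id="SongtabContent">
  <ul>
    <li><a href="http://example.com/song/1"><span class="songName">Song A</span><span class="songTime">3:45</span></a></li>
  </ul>
  <ul>
    <li><a href="http://example.com/song/2"><span class="songName">Song B</span><span class="songTime">4:10</span></a></li>
  </ul>
</div>
</body></html>
"""


def parse_songs(markup):
    """Yield one new dict per <li>, using paths relative to that node."""
    root = ET.fromstring(markup.strip())
    # Select every song entry once, across all <ul> groups.
    for li in root.findall(".//*[@id='SongtabContent']/ul/li"):
        link = li.find("a")
        yield {
            "songName": link.find("span[1]").text,
            "songTime": link.find("span[@class='songTime']").text,
            "href_song": link.get("href"),
        }


items = list(parse_songs(SAMPLE))
print(items)
```

Because a new dict is built on every iteration, the yielded records stay independent of one another.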

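(6) The pipelines.py file, which saves the item data

import json

class GroadPipeline(object):

    def __init__(self):
        # Output file, opened once when the pipeline is created
        self.filename = open("e://downloads//newsong.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        text = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes
        self.filename.close()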
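
The pipeline stores one JSON object per line. The sketch below shows that output format outside Scrapy, with plain dicts standing in for items and a temporary file standing in for the fixed path. The class name `JsonLinesWriter` and the sample record are hypothetical. It opens the file in `open_spider`, a hook Scrapy also calls on pipelines, rather than in `__init__`; that is a design choice of this sketch, not what the article's pipeline does.

```python
import json
import os
import tempfile


class JsonLinesWriter:
    """Pipeline-shaped class: open_spider / process_item / close_spider."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False writes non-ASCII text as-is instead of \uXXXX escapes
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


path = os.path.join(tempfile.mkdtemp(), "newsong_example.txt")
writer = JsonLinesWriter(path)
writer.open_spider(spider=None)  # no real spider is needed for this demo
writer.process_item({"songName": "歌曲A", "songTime": "3:45"}, spider=None)
writer.close_spider(spider=None)

# Read the file back and decode each line as its own JSON document.
with open(path, encoding="utf-8") as f:
    raw = f.read()
records = [json.loads(line) for line in raw.splitlines()]
print(records)
```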

(7) Run the spider
scrapy crawl newsong

白羊洞
