Python Web Scraping in Practice

Goal: scrape the song names, durations, and links from the Kugou homepage.

Method 1: Using the bs4 package

1. Fetch the Kugou page content
#coding=utf-8
import urllib.request  # Python 3: urlopen lives in urllib.request, not bare urllib

from bs4 import BeautifulSoup

result = urllib.request.urlopen("http://www.kugou.com")
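A common alternative fetch (a sketch, assuming the third-party requests library is installed; not required for the rest of the walkthrough):

import requests

resp = requests.get("http://www.kugou.com", timeout=10)
resp.raise_for_status()  # raise on 4xx/5xx instead of failing silently
html = resp.text         # decoded page text, usable in place of result.read()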


2. Extract the target tags based on the HTML structure
soup = BeautifulSoup(result.read(), 'html.parser')
# The song list lives in <div id="SongtabContent">; collect its <li> entries
song_div = soup.find("div", id="SongtabContent")
s = song_div.find_all("li")
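Before writing anything to disk, a quick sanity check (a sketch; assumes at least one <li> was found) confirms the selectors match the page structure:

print(len(s))                               # number of songs found
first = s[0]
print(first.a.select(".songName")[0].text)  # song name
print(first.a.get("href"))                  # song link
print(first.a.select(".songTime")[0].text)  # song duration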


3. Save the results
with open(r"d:/music.txt", "w", encoding="utf-8") as f:  # output file
    for i in s:
        f.write("Song name: %s    " % i.a.select(".songName")[0].text)
        f.write("Song link: %s    " % i.a.get("href"))
        f.write("Song duration: %s" % i.a.select(".songTime")[0].text)
        f.write("\n")  # os.linesep would come out as \r\r\n in text mode on Windows
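If a structured file is preferred over the plain-text lines above, a minimal CSV variant of the same loop (a sketch using only the standard library) looks like:

import csv

with open(r"d:/music.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "link", "duration"])  # header row
    for i in s:
        writer.writerow([
            i.a.select(".songName")[0].text,
            i.a.get("href"),
            i.a.select(".songTime")[0].text,
        ])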

Method 2: Using the Scrapy framework

(1) Create the project (named groad, since the files below import from the groad package):
scrapy startproject groad

(2) cd into groad and generate the spider:
scrapy genspider newsong www.kugou.com
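For reference, these two commands produce the standard Scrapy layout (exact files vary slightly by Scrapy version):

groad/
    scrapy.cfg            # deploy configuration
    groad/
        __init__.py
        items.py          # edited in step (4)
        middlewares.py
        pipelines.py      # edited in step (6)
        settings.py       # edited in step (3)
        spiders/
            __init__.py
            newsong.py    # created by genspider, edited in step (5)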

(3) In settings.py, uncomment the following three lines:

ITEM_PIPELINES = {
    'groad.pipelines.GroadPipeline': 300,
}
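One hedged note: newer Scrapy projects default to ROBOTSTXT_OBEY = True, which can silently stop the crawl before it reaches the start URL. Whether kugou.com's robots.txt actually blocks this spider is an assumption to verify; if the crawl yields nothing, try:

# settings.py -- only if robots.txt turns out to block the start URL (assumption)
ROBOTSTXT_OBEY = False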

(4) Edit items.py:

import scrapy


class GroadItem(scrapy.Item):

    songName = scrapy.Field()   # song name
    songTime = scrapy.Field()   # song duration
    href_song = scrapy.Field()  # song link
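A GroadItem behaves like a dict, which is how the spider in the next step fills it:

# Usage sketch (hypothetical value)
item = GroadItem()
item['songName'] = 'example song'
print(dict(item))  # {'songName': 'example song'}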

(5) Edit newsong.py (under groad/spiders/):

import scrapy
from groad.items import GroadItem

class NewsongSpider(scrapy.Spider):
    name = 'newsong'
    allowed_domains = ['www.kugou.com']
    start_urls = ['http://www.kugou.com/']

    def parse(self, response):
        # Walk every <ul> under the song tab, then every <li> inside it
        for i in range(1, len(response.xpath('//*[@id="SongtabContent"]/ul')) + 1):
            for j in range(1, len(response.xpath('//*[@id="SongtabContent"]/ul[%s]/li' % i)) + 1):
                item = GroadItem()  # a fresh item per song, not one shared instance
                item['songName'] = response.xpath(
                    '//*[@id="SongtabContent"]/ul[%s]/li[%s]/a/span[1]/text()' % (i, j)).extract()[0]
                item['songTime'] = response.xpath(
                    '//*[@id="SongtabContent"]/ul[%s]/li[%s]/a/span[@class="songTime"]/text()' % (i, j)).extract()[0]
                item['href_song'] = response.xpath(
                    '//*[@id="SongtabContent"]/ul[%s]/li[%s]/a/@href' % (i, j)).extract()[0]
                yield item
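The index-counting loops above work, but the same extraction reads more simply by iterating the matched <li> nodes directly with relative XPath (a sketch, equivalent under the page structure assumed above):

    def parse(self, response):
        # Iterate matched nodes instead of rebuilding indexed XPath strings
        for li in response.xpath('//*[@id="SongtabContent"]/ul/li'):
            item = GroadItem()
            item['songName'] = li.xpath('./a/span[1]/text()').extract_first()
            item['songTime'] = li.xpath('./a/span[@class="songTime"]/text()').extract_first()
            item['href_song'] = li.xpath('./a/@href').extract_first()
            yield item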

(6) Edit pipelines.py to save the item data:

import json

class GroadPipeline(object):

    def __init__(self):
        # One JSON object per line (JSON Lines format)
        self.filename = open("e:/downloads/newsong.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        text = json.dumps(dict(item),ensure_ascii=False)+'\n'
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()
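A slightly more idiomatic variant ties the file handle's lifetime to the spider by opening it in open_spider instead of __init__ (a sketch of the same pipeline):

class GroadPipeline(object):

    def open_spider(self, spider):
        # Opened when the spider starts, closed when it finishes
        self.file = open("e:/downloads/newsong.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()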

(7) Run the spider:
scrapy crawl newsong
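Scrapy can also dump items without a custom pipeline via its built-in feed exports, for example:

scrapy crawl newsong -o newsong.json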
