Crawler Basics: Crawling Douban Musicians and Popularity

The idea behind this article:
1. Decide which pages to crawl.
2. Decide which data to extract (here, each musician's name and fan count).
3. Use XPath to locate all the target data on each page (name_list and attribute_list, holding the musicians' names and fan counts respectively).
4. Iterate over the lists to extract the data.
5. Print and store the results as one name immediately followed by its fan count.
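Step 5, interleaving each name with its fan count, can be sketched with `zip` on sample data (the names and counts below are made up for illustration):

```python
names = ['Musician A', 'Musician B']  # sample data standing in for name_list
fans = ['1200 fans', '850 fans']      # sample data standing in for attribute_list

# Interleave: one name immediately followed by its fan count
paired = []
for name, count in zip(names, fans):
    paired.append(name)
    paired.append(count)
# paired == ['Musician A', '1200 fans', 'Musician B', '850 fans']
```

`zip` also silently stops at the shorter list, so a page where the two XPath queries return different lengths will not raise an IndexError.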

Note: do not let this program run continuously, or your IP will be blocked because the crawl rate is too high.
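A minimal way to honor this warning is to pause between requests with `time.sleep`. The helper below is an illustrative sketch, not part of the original script (`fetch_pages` and its parameters are hypothetical names; `fetch` stands in for any request function such as a `requests.get` wrapper):

```python
import time

def fetch_pages(urls, fetch, delay=1.0):
    """Fetch each URL with a pause in between to avoid an IP ban.

    `fetch` is any callable taking a URL and returning its content;
    `delay` is the number of seconds to wait between requests.
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # throttle: at most one request per `delay` seconds
    return results
```

For 374 pages, even a one-second delay adds only about six minutes to the run while making a ban far less likely.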

import requests
from lxml import etree

file = open('./music.txt', 'w', encoding='utf-8')  # create an output file in the current directory
all_data = []  # combined data: each musician's name followed by their fan count

request_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74',
}

for page in range(1, 375):  # pagination: there are 374 pages in total
    url = 'https://music.douban.com/artists/genre_page/6/{}'.format(page)
    response = requests.get(url=url, headers=request_header).text
    tree = etree.HTML(response)
    name_list = tree.xpath('/html/body/div[3]/div[1]/div/div[1]/div[2]/div/div/div[2]/a/text()')  # all names on this page
    attribute_list = tree.xpath('/html/body/div[3]/div[1]/div/div[1]/div[2]/div/div/div[2]/div/text()')  # all fan counts on this page

    # Store one name followed by its fan count. Pairing the two page-level
    # lists with zip (and writing inside the loop) avoids re-writing
    # earlier pages' data on every iteration.
    for name, attribute in zip(name_list, attribute_list):
        all_data.append(name)
        all_data.append(attribute)
        file.write(name + '\n' + attribute + '\n')

    print(all_data)

file.close()
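The two absolute XPath expressions above are brittle because they depend on Douban's exact page layout. The same `text()` extraction idea can be demonstrated self-contained against inline sample HTML (the markup and class name below are invented for illustration):

```python
from lxml import etree

# Sample HTML mimicking a list of artists, each with a fan-count line
html = '''
<div class="payload">
  <div><a href="#">Artist One</a><div>1234 likes</div></div>
  <div><a href="#">Artist Two</a><div>567 likes</div></div>
</div>
'''

tree = etree.HTML(html)
# text() returns the text content of every node the path matches
names = tree.xpath('//div[@class="payload"]/div/a/text()')
counts = tree.xpath('//div[@class="payload"]/div/div/text()')
# names  == ['Artist One', 'Artist Two']
# counts == ['1234 likes', '567 likes']
```

Relative paths anchored on a class attribute (`//div[@class=...]`) tend to survive small layout changes better than absolute `/html/body/div[3]/...` chains.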


Origin: blog.csdn.net/weixin_47249161/article/details/114072403