Life is short, I use Python: scraping Joey Wang posters from Douban

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reproducing it.
This link: https://blog.csdn.net/meiguanxi7878/article/details/102711943

Page analysis

Search for "Joey Wang" on Douban Movies, open her profile page, and click the link to all of her photos to reach the photo list page.
On this page, click Next and watch how the browser's URL changes:

https://movie.douban.com/celebrity/1166896/photos/?type=C&start=30&sortby=like&size=a&subtype=a

Continuing to analyze the URL with Postman, you can easily see that start is the parameter controlling pagination, stepping by 30: the first page has start=0, the second start=30, the third start=60, and so on.
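
As a quick check of that pattern, here is a minimal sketch (not from the original post) that prints the URLs of the first few pages by stepping start in increments of 30:

base = ('https://movie.douban.com/celebrity/1166896/photos/'
        '?type=C&start=%d&sortby=like&size=a&subtype=a')

# page 1 -> start=0, page 2 -> start=30, page 3 -> start=60
for page, start in enumerate(range(0, 90, 30), 1):
    print('page %d -> %s' % (page, base % start))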

Details page analysis

Use the browser's Network panel to inspect the picture information on the page:

Here we get two pieces of information:

  • the link in each photo's a tag leads to that photo's comment page;
  • the link in the img tag can be used to save the poster itself.

The following function collects and returns both kinds of URLs:

import requests
from bs4 import BeautifulSoup

def get_posters():
    comment_url_list = []
    picture_list = []
    # start steps by 30 per page; 40000 is just a generous upper bound
    for i in range(0, 40000, 30):
        url = 'https://movie.douban.com/celebrity/1166896/photos/?type=C&start=%s&sortby=like&size=a&subtype=a' % str(i)
        req = requests.get(url).text
        content = BeautifulSoup(req, "html.parser")
        next_span = content.find('span', attrs={'class': 'next'})
        # stop when the "next" button no longer links to another page
        if next_span is not None and next_span.find('a') is not None:
            data = content.find_all('div', attrs={'class': 'cover'})
            for k in data:
                ulist = k.find('a')['href']    # link to the photo's comment page
                plist = k.find('img')['src']   # link to the poster image itself
                comment_url_list.append(ulist)
                picture_list.append(plist)
        else:
            break
    return comment_url_list, picture_list

After that, you can download the posters.
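
The post does not show the download step itself; a minimal sketch, assuming the picture_list returned by get_posters() above (the download_posters name and the posters/ output folder are my own choices), might look like this:

import os
import requests

def download_posters(picture_list):
    # Save every poster URL collected by get_posters() into a local folder
    os.makedirs('posters', exist_ok=True)
    for idx, pic_url in enumerate(picture_list):
        resp = requests.get(pic_url)
        with open(os.path.join('posters', '%d.jpg' % idx), 'wb') as f:
            f.write(resp.content)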

Getting the comments

Then we manually navigate to one of the poster detail pages and continue inspecting the comment information there.

The comment information can be easily extracted with BeautifulSoup and then saved to MongoDB.

import pymongo

def get_comment(comment_l):
    # Connection string as published in the original post (credentials obscured)
    client = pymongo.MongoClient('mongodb://douban:[email protected]:49744/douban')
    db = client.douban
    mongo_collection = db.comment
    comment = []
    print("Save to MongoDB")
    for i in comment_l:
        response = requests.get(i).text
        content = BeautifulSoup(response, "html.parser")
        # each comment lives in a div with class "comment-item"
        tmp_list = content.find_all('div', attrs={'class': 'comment-item'})
        for k in tmp_list:
            tmp_comment = k.find('p').text
            mongo_collection.insert_one({'comment': tmp_comment})
            comment.append(tmp_comment)
    print("Save Finish!")
    return comment
