Xiaojie's learning notes: crawling Baidu Tieba (Post Bar) images with requests + XPath

This post describes how to crawl images from Baidu Tieba (Post Bar) using the most basic crawling tools: the requests library plus XPath for locating and extracting elements.

I wrote this crawler because Tieba threads are full of images, such as emoticon packs. I like collecting them, but saving them one by one is far too slow, so this crawler was born.

Since I want to be able to crawl images from any bar, the natural starting point is the Baidu Tieba home page:

https://tieba.baidu.com/

Then I enter an arbitrary bar name; I use the emoticon bar as an example (it has the most images, after all).

After entering the emoticon bar, its URL is: https://tieba.baidu.com/f?ie=utf-8&kw=%E8%A1%A8%E6%83%85%E5%90%A7&fr=search

I go back to the home page and enter another bar, this time the Python crawler bar.

Its URL is: https://tieba.baidu.com/f?ie=utf-8&kw=Python%E7%88%AC%E8%99%AB&fr=search

I found that these two URLs differ only in the kw parameter, and the browser automatically URL-encodes the Chinese I typed. I then tried to shorten the URL and found that the trailing &fr=search can be removed and the request still works. So the URL for requesting a bar's thread list is: https://tieba.baidu.com/f?ie=utf-8&kw= (the name of the bar you want to crawl, without the word "吧"/"bar")
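
To make the URL construction concrete, here is a minimal sketch using only the standard library (build_bar_url is just a helper name I made up for illustration):

from urllib.parse import quote

def build_bar_url(bar_name):
    # the browser URL-encodes the Chinese bar name; quote() does the same thing here
    return 'https://tieba.baidu.com/f?ie=utf-8&kw=' + quote(bar_name)

print(build_bar_url('表情吧'))
# prints: https://tieba.baidu.com/f?ie=utf-8&kw=%E8%A1%A8%E6%83%85%E5%90%A7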

Next I analyze the bar page itself. I need to enter each thread, and since I only want images, all I need from the thread-list page is the link to each thread.

Pressing F12 to open developer tools and inspecting an arbitrary thread title, I find:

<a rel="noreferrer" href="/p/5788129292" title="【表情吧官方群】欢迎大家加入!" target="_blank" class="j_th_tit ">【表情吧官方群】欢迎大家加入!</a>

Entering this thread, its URL is: https://tieba.baidu.com/p/5788129292

The second half of that URL is exactly the @href attribute above, so I extract the href value and prepend "https://tieba.baidu.com" to build the URL of each thread. Below is a quick test:

import requests
from lxml import etree  # needed for XPath
from my_fake_useragent import UserAgent  # returns a random User-Agent


def get_one_page(url):
    headers = {
        'User-Agent': UserAgent().random(),
        'Referer': 'https://tieba.baidu.com/index.html',
    }
    base = 'https://tieba.baidu.com'
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    # every thread-title link sits inside <ul id="thread_list">
    link_list = html.xpath('//ul[@id="thread_list"]//a[@rel="noreferrer" and @class="j_th_tit "]/@href')
    link_list = map(lambda link: base + link, link_list)
    for link in link_list:
        print(link)


def main():
    url = 'https://tieba.baidu.com/f?ie=utf-8&kw=%E8%A1%A8%E6%83%85%E5%90%A7'
    get_one_page(url)


if __name__ == '__main__':
    main()

 

Part of the output:

C:\Users\User\AppData\Local\Programs\Python\Python37\python.exe G:/Python/code/requeats/try.py
https://tieba.baidu.com/p/5788129292
https://tieba.baidu.com/p/4789404681
https://tieba.baidu.com/p/6478509408
https://tieba.baidu.com/p/6497831229
https://tieba.baidu.com/p/6497828481
……

Here I want to recommend a package: the my_fake_useragent package used above. It returns a random User-Agent each time, so I don't have to build a User-Agent pool by hand. To install the more common one, just pip install fake_useragent. Mine is my_fake_useragent only because that happens to be what I installed in PyCharm at the beginning, so everyone else should probably use fake_useragent.
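
For reference, a minimal sketch of how the standard fake_useragent package is typically used (as far as I recall its API, random is an attribute there rather than a method):

from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a different User-Agent string each time it is read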

As for the Referer in my headers above: in my tests the request succeeds without it, but if you crawl heavily you will sooner or later be checked, so the more header fields you fill in, the better.
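
As an illustration, a fuller header set might look like the sketch below; the extra values are placeholders I chose, not copied from a real session:

headers = {
    'User-Agent': UserAgent().random(),
    'Referer': 'https://tieba.baidu.com/index.html',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': '(fill in your own)',  # copy from your own logged-in browser session
}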

Back to the main task: I have extracted the URL of each thread, and now I need to enter each one and crawl its images.

To find the pattern of the block containing the images, I need a thread with at least two or three images plus an advertisement, because I do not want the ad images.

Then I entered: https://tieba.baidu.com/p/6299996178

This thread has plenty of images; I inspect images from different replies to make sure the pattern I find is general.

The first image:

<cc>            <div class="j_ueg_post_content p_forbidden_tip" style="display:none;">该楼层疑似违规已被系统折叠&nbsp;<a rel="noopener" href="###" class="p_forbidden_post_content_unfold" style="display:;">隐藏此楼</a><a rel="noopener" href="###" class="p_forbidden_post_content_fold" style="display:none;">查看此楼</a></div><div id="post_content_127945257402" class="d_post_content j_d_post_content " style="display:;">            长更,图源各处,自己搜集<br><img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=7015ef856381800a6ee58906813533d6/8a274b90f603738d083a1a28bc1bb051f819ec6a.jpg" size="52322" changedsize="true" width="560" height="560"><br><img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=cf7758d65be736d158138c00ab514ffc/b2ed38dbb6fd5266a76f2a1ea418972bd407361d.jpg" size="44412" changedsize="true" width="560" height="560"><br><img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=4d6f86f8721ed21b79c92eed9d6fddae/500163d0f703918f6a172ac05e3d269758eec498.jpg" size="11218" width="240" height="240"><br><img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=c3ee0de7accc7cd9fa2d34d109012104/97e47c1ed21b0ef474948e54d2c451da81cb3e65.jpg" size="28050" width="482" height="360"><br><img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=7d835c046809c93d07f20effaf3cf8bb/14599f2f07082838aba4ca1eb799a9014c08f131.jpg" size="79960" changedsize="true" width="560" height="479"></div><br>                        </cc>

The second image, from another reply:

<cc>            <div class="j_ueg_post_content p_forbidden_tip" style="display:none;">该楼层疑似违规已被系统折叠&nbsp;<a rel="noopener" href="###" class="p_forbidden_post_content_unfold" style="display:;">隐藏此楼</a><a rel="noopener" href="###" class="p_forbidden_post_content_fold" style="display:none;">查看此楼</a></div><div id="post_content_127945275558" class="d_post_content j_d_post_content " style="display:;">            <img class="BDE_Image" src="https://imgsa.baidu.com/forum/w%3D580/sign=731151b14810b912bfc1f6f6f3fdfcb5/2eb30b7b02087bf4b4d141e3fdd3572c11dfcfa5.jpg" size="73764" changedsize="true" width="560" height="560"></div><br>                        </cc>

 

An advertisement:

<div class="middle_image_content">
                    <a class="j_click_stats img_wrap" target="_blank" href="https://www.baidu.com/baidu.php?url=a00000jCx3p_KaPSJbxITB8iy2YQnH0i7MPwHjhCd4ZW9ROunXB4q7Su9mrP0PSLdibjxpor5i3bIc2IP5qPh0nU2MvXtTJEod0U44YTogcAYwcHPHbNGdATM7kr_jUPIVgfYjcKf7HG5NcuGhv1CkXAXAGu4U10zIYrxGx9O6kXpFI9h4or8i76wHQqj_kpvXFN1sVGIjqq7JXKDF7yYiv64_fk.7D_NR2Ar5Od66ux4g_3_ac2ampCPUn2XqauhZVm9kOSoQqpRLxOOqh5gOl33xOOS1ZWOqx1xjMhH7wX17ZRSqlTOsOhSZOOO_k35OyunN7enRIHeLG1lZmrW017MlZ4Eu6zxOOqbQOOgOOO_eXjOdo__o_vyUdBmSoG_A-muCyn--xuER0.U1Yk0ZDqdVj01tHJEUEHYfKY5ULnYOL73PAd0A-V5HRkn0KM5fKdpHY0TA-b5Hc30APGujYznWb0UgfqnHmdnHR0mhnqnfKopHYs0ANGujYdnj03nWbv0AFG5fKVm1Y0TgKGujYs0Z7Wpyfqn0KzuLw9u1Ys0AqvUjYznWf3PzYkPdt1nj0snjDVn-t1n1f4nadbX-t1PHTdPadbX-t1rjRkPzYkg1fknj6LQywlg1fznWnzQH7xPjR1nHRVnNtYPHbvPBYs0A7B5HKxn0K-ThTqn0KsTjYs0A4vTjYsQW0snj0snj0s0AdYTjY0uAwETjY0ThCqn0K1XWY0IZN15HTvPHbknjnYnH0dPHc3Pjc1rHR0ThNkIjYkPH6krHb1PjbkrHTd0ZPGujYsPAcYuHD1uyDzPvc3nW010ZK85H00ULnqP0KVIZ-suHY10A7bIZ-suHYkrjT0mgwspyfqn0KWTA-b5HDsnjD0TAkGujYsnj0z0APCpyfq0A7sTZu85fKYmg6q0AdYT1YkPWRkPfKEmLKW5HDWnansc10Wnansc1cYPinWc1D8nj0sc1D8nj0scznWnansc1D8nj0Wcznsc10Wnansc10Wnansc1ndnansc10WnankQW0sc1D8nj0sc1D8nj0sc108nj0sc108nj0sc108nj0sc1D8nj0Wnansc10WnankQW0snansc10WnanWnansc100TZfqn1ndn1b0TAq1pgwGUv3qn0KsULPGIA-EU-qWpA78uvNxThN9Tvq85H00ULFGgvdY5HDvPHDd0AdYgLKY5H00myPWIjY0uAPWujY0uAPzTjY0uANvIZ0q0ZP85fKsThqb5fKEmLKWgvk9m-q1IA7YIgnqn0KEmLKWgvP_pyPo5H0Wnansc10Wnansc1D8nj0sc1D8nj0s0AuYXgK-5H00myw1U-q15H00myw1U-qGT1Y0mhd_5H00Uy7WPHY0UvDq0A7bmv_qn0K_IjYs0ZPW5H00Ih-buyqGujY0mhN1UjYs0A-1uAsqn0KEUAw_5H00TZFEuv7bIWYs0A71XHYs0A7bTZTqnHDYn0K9uZKC5HmYn0KGTL0quLGCXZb0pZP_u1Ys0jDqn00z5f015f0Y5f0d5H00PWYs0jTqn0035H00rHY0TZFEudqYT1YkP1m1rHbYPHR4n1DYrjR3nWnzn0KsThqMgLK15HbznjRdnHTYPHbYn1c1PHc4rjT0TZFEudqYpHYs0ZKzUvIxTAbq0ZKWpHYs0ZPJpjYzPHR0mMNYmyTqn0K8ugfqn0KWThnqnWD4Pjb" data-locate="pb_图片">
                        <img class="BDE_Image" src="https://aod-image-material.cdn.bcebos.com/5/pic/e49b173f4f765829e251b905696dc386.jpg" ad-dom-img="true">
                        
                    </a>
                    <div class="ad_bottom_view">
                        <span class="now_date">2020-02-18 10:38</span>
                        <span class="label_text"> 广告</span>
                    </div>
                </div>

 

From the HTML snippets intercepted here, the images I need are in the @src attribute of the img tags with @class="BDE_Image" under a cc tag. The ad image also uses class BDE_Image, but it is not inside a cc tag, so restricting the search to cc excludes it.

Then I tried to crawl the images from just the first page of replies of each thread (a bit crude for now):

import requests
from lxml import etree
from my_fake_useragent import UserAgent


def get_one_page(url):
    headers = {
        'User-Agent': UserAgent().random(),
        'Referer': 'https://tieba.baidu.com/index.html',
    }
    base = 'https://tieba.baidu.com'
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    link_list = html.xpath('//ul[@id="thread_list"]//a[@rel="noreferrer" and @class="j_th_tit "]/@href')
    link_list = map(lambda link: base + link, link_list)
    return link_list


def parse_one_page(link):
    headers = {
        'User-Agent': UserAgent().random(),
        'Cookie': '(fill in your own)',
    }
    imgs = []
    url = link
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    # only images inside <cc> are post content; ad images sit outside <cc>
    img = html.xpath('//cc//img[@class="BDE_Image"]//@src')
    imgs.append(img)
    return imgs


def main():
    url = 'https://tieba.baidu.com/f?ie=utf-8&kw=%E8%A1%A8%E6%83%85%E5%90%A7'
    for page in get_one_page(url):
        for imgs in parse_one_page(page):
            for img in imgs:
                print(img)


if __name__ == '__main__':
    main()

 

Part of the output:

C:\Users\User\AppData\Local\Programs\Python\Python37\python.exe G:/Python/code/requeats/try.py
https://imgsa.baidu.com/forum/w%3D580/sign=9fa533f53cadcbef01347e0e9caf2e0e/f0dbfd039245d6889b38b895a9c27d1ed21b24ea.jpg
https://imgsa.baidu.com/forum/w%3D580/sign=cdf24ef44ba98226b8c12b2fba83b97a/facc14ce36d3d5399036c3423287e950342ab0ca.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=d106bd7a4e82b2b7a79f39cc01accb0a/7436b812c8fcc3ceb3bfec4d8545d688d43f2072.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=7cbc896bab315c6043956be7bdb0cbe6/c8cbaa64034f78f06b74bf8d6e310a55b3191c72.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=6c8ab7c8de177f3e1034fc0540ce3bb9/92ad86d6277f9e2fba4a38760830e924b899f373.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=c8b5121def1986184147ef8c7aec2e69/81379213b07eca801e96ae9b862397dda04483e5.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=fc409f2142fbb2fb342b581a7f4b2043/ae30fcfaaf51f3de11212e1883eef01f3b2979e5.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=6b8fd58ab0efce1bea2bc8c29f50f3e8/ad770eb30f2442a722536bd9c643ad4bd1130217.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=0e1dfe512d292df597c3ac1d8c305ce2/b1245bafa40f4bfb611ea52a144f78f0f63618a3.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=43e7890afcf81a4c2632ecc1e72b6029/a17ed009b3de9c829e8cc2f17b81800a18d8438c.jpg
http://tiebapic.baidu.com/forum/w%3D580/sign=7b203fe3dcef76093c0b99971edca301/cf259345d688d43ff040aa8c6a1ed21b0ff43b8c.jpg
……

So I extracted the URL of each image.

Next I want to add a few features to make the crawler easier to control:

① Which bar to crawl: the user enters the bar name. Of course, if you only ever want to crawl one particular bar, this is unnecessary.

② How many pages of the thread list to crawl. Some threads contain a huge number of images, and if you only want them for meme battles you don't need that many, so I let the user set how many pages to crawl (stopping after a fixed number of images would work just as well).

③ Crawl every page of replies in each thread. Thread quality varies: some threads are uninteresting while others have countless images, so there is no single page limit that suits them all. Instead, I extract the total number of reply pages of each thread and loop over that many pages to crawl the replies.

④ When saving, I want to name each file after the last part of the image URL, extracted with os.path.split from the os module. For example, for http://tiebapic.baidu.com/forum/w%3D580/sign=7b203fe3dcef76093c0b99971edca301/cf259345d688d43ff040aa8c6a1ed21b0ff43b8c.jpg the saved file is named cf259345d688d43ff040aa8c6a1ed21b0ff43b8c.jpg. Naming images 1, 2, 3, 4 is ugly, and the file extensions differ too; writing them by hand invites mistakes. For example, a GIF saved with a .jpg extension will not animate, which makes the image useless. So I use os.path.split to take the last path component directly, which is much more convenient (see the short sketch after this list).

⑤ To save each image, I use urlretrieve from urllib: from urllib.request import urlretrieve.
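
A minimal sketch of points ④ and ⑤ together (the 'downloads' directory is just an example of mine; note that the target folder must exist before urlretrieve writes into it):

import os
from urllib.request import urlretrieve

img_url = 'http://tiebapic.baidu.com/forum/w%3D580/sign=7b203fe3dcef76093c0b99971edca301/cf259345d688d43ff040aa8c6a1ed21b0ff43b8c.jpg'

# os.path.split splits off the last path component, keeping the original
# file name and extension (.jpg, .gif, ...) intact
filename = os.path.split(img_url)[1]
print(filename)  # cf259345d688d43ff040aa8c6a1ed21b0ff43b8c.jpg

os.makedirs('downloads', exist_ok=True)
urlretrieve(img_url, os.path.join('downloads', filename))  # download straight to disk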

The full code:

import requests
from lxml import etree
from urllib.request import urlretrieve
from my_fake_useragent import UserAgent
import os


def get_one_page(url):
    headers = {
        'User-Agent': UserAgent().random(),
        'Referer': 'https://tieba.baidu.com/index.html',
    }
    base = 'https://tieba.baidu.com'
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    link_list = html.xpath('//ul[@id="thread_list"]//a[@rel="noreferrer" and @class="j_th_tit "]/@href')
    link_list = map(lambda link: base + link, link_list)
    return link_list
    # for link in link_list:
    #     print(link)


def parse_one_page(link):
    headers = {
        'User-Agent': UserAgent().random(),
        'Cookie': '(fill in your own)',
    }
    imgs = []
    url = link + "?pn=1"
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    # total number of reply pages in this thread
    maxnumber = html.xpath('//li[@class="l_reply_num"]/span[2]/text()')[0]
    for i in range(1, int(maxnumber) + 1):
        try:
            url = link + "?pn={}".format(i)
            response = requests.get(url, headers=headers).text
            html = etree.HTML(response)
            img = html.xpath('//cc//img[@class="BDE_Image"]//@src')
            imgs.append(img)
        except:
            # if a reply page fails, stop paging through this thread
            break
    return imgs


def main():
    print('This crawler downloads images from Baidu Tieba!!!')
    kw = input('Enter the name of the bar to crawl: ')
    number = input('Enter the number of thread-list pages to crawl: ')
    j = 1
    for i in range(int(number)):
        # each thread-list page holds 50 threads, so page i starts at pn = i * 50
        url = 'https://tieba.baidu.com/f?kw=' + kw + '&ie=utf-8&pn={}'.format(i * 50)
        try:
            for page in get_one_page(url):
                print(page)
                for imgs in parse_one_page(page):
                    for img in imgs:
                        print('Saving image ' + str(j) + '.')
                        suffix = os.path.split(img)[1]
                        # the target folder must already exist
                        urlretrieve(img, 'C:\\Users\\User\\Desktop\\图片\\' + kw + '\\' + str(suffix))
                        j += 1
        except:
            break
    print('All images saved!')


if __name__ == '__main__':
    main()

 

Here I use my own Cookie; fill in your own if you want to run it. The images are saved to a folder on my desktop, and while saving it prints which image it is on. Since I only collect the images for my own use and speed is not critical, I did not write it with multithreading; feel free to add multithreading if you need it (a rough sketch follows below).
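
If you do want it faster, below is a rough, untested sketch of how the download step could be parallelized with the standard library's ThreadPoolExecutor. It reuses get_one_page and parse_one_page from the full code above, and the pool size and save directory are arbitrary choices of mine:

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def download(img, save_dir):
    # save one image under its original file name; errors are swallowed so
    # one broken link does not stop the whole batch
    try:
        urlretrieve(img, os.path.join(save_dir, os.path.split(img)[1]))
    except Exception as e:
        print('failed:', img, e)


def crawl_bar(kw, pages, save_dir):
    os.makedirs(save_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=8) as pool:
        for i in range(pages):
            url = 'https://tieba.baidu.com/f?kw=' + kw + '&ie=utf-8&pn={}'.format(i * 50)
            for page in get_one_page(url):
                for imgs in parse_one_page(page):
                    for img in imgs:
                        pool.submit(download, img, save_dir)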

Part of the output:

C:\Users\User\AppData\Local\Programs\Python\Python37\python.exe G:/Python/code/requeats/tieba.py
This crawler downloads images from Baidu Tieba!!!
Enter the name of the bar to crawl: 表情包
Enter the number of thread-list pages to crawl: 1
https://tieba.baidu.com/p/6416580758
Saving image 1.
Saving image 2.
Saving image 3.
Saving image 4.
Saving image 5.
Saving image 6.
Saving image 7.
Saving image 8.
Saving image 9.
Saving image 10.
Saving image 11.
……
……
……

 

Below is the crawl result: a single page of the thread list actually yielded more than 2,000 images, which shows just how many emoticons are out there.


