Crawler learning: a shallow crawl of website pictures

Last week I introduced web crawlers and crawled a novel site. Today, let's take a look at pictures.

The picture website I crawled today is https://wallhaven.cc/ . An earlier article recommending wallpaper websites mentioned it.

This website is fairly friendly, unlike the previous novel website, Biquge, where every tag was littered with garbled characters that made parsing harder.

So this time we'll parse with XPath; no need for the re regular-expression module.

As before, three steps:

  1. Press F12, inspect the page source, and find the URLs of the pictures
  2. Work out the approach (simple this time: no need to page through many pages, just one jump per picture)
  3. Start coding

    F12: view the source code of the page.

    Screenshot 2022-11-13 162340

    First click on spot No. 1 (the element picker), then click any picture (for example, the one circled at No. 2) to see where its link lives. No. 3 is the link address of the picture itself, and No. 4 is the address the page jumps to when you click the picture.

    Clicking the link at No. 3 reveals that it is only a thumbnail, and we want 1080p (or even 4K). Clicking the picture itself (the link at No. 4) jumps to a detail page, where we can find the link to the original picture.

    Screenshot 2022-11-13 163021

    After the jump, press F12 again and inspect the element corresponding to the picture to find the link to the original image, i.e. the src attribute underlined in red.

    Now it's simple: first collect all the No. 4 links on the homepage, follow each one to find the link to the original picture, and download it locally.

    Key functions

    def get_page_url(url):
        try:
            # headers must be passed as a keyword argument; passed positionally
            # it would be treated as the params argument
            urlhtml = requests.get(url, headers=headers)
            urlhtml.encoding = "utf-8"
            htmlcode = urlhtml.content
            html = etree.HTML(htmlcode)
            text = html.xpath('//div[@class="feat-row"]/span/a/@href')
        except requests.RequestException:
            text = []  # empty list, so the caller's loop simply does nothing
        return text

    As the function name suggests, this returns the links of the pages you jump to after clicking a picture on the homepage (the No. 4 links from the first step).

    Ahem, going by my original coding style the function would have been named def a(), but that changed after my instructor (Brother Nan, mentioned in the previous article) beat it out of me a few times.

    Let me briefly explain the XPath used in the function; the rest should be easy to follow.

    In '//div[@class="feat-row"]/span/a/@href', the part //div[@class="feat-row"] matches every div tag in the document whose class is feat-row. We then look for the span tag under it, then the a tag inside that, and finally take the href attribute of the a tag, finding the link we need step by step.
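    To see this step-by-step matching in isolation, here is a minimal, self-contained example. The HTML snippet is made up for illustration; it is not the real wallhaven markup.

```python
from lxml import etree

# A made-up snippet imitating the structure we are matching against.
snippet = '''
<html><body>
  <div class="feat-row">
    <span><a href="https://wallhaven.cc/w/abc123">pic 1</a></span>
  </div>
  <div class="other-row">
    <span><a href="https://example.com/ignored">not matched</a></span>
  </div>
</body></html>
'''

html = etree.HTML(snippet)
# div[@class="feat-row"] -> its span child -> the a tag -> the href attribute
links = html.xpath('//div[@class="feat-row"]/span/a/@href')
print(links)  # ['https://wallhaven.cc/w/abc123']
```

    The div with class other-row is skipped entirely, which is exactly the filtering we rely on when scraping the homepage.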

    def get_picture_url(picture):
        picture_html = picture.content
        picture_html1 = etree.HTML(picture_html)
        picture_url = picture_html1.xpath('//section[@id="showcase"]/div/img/@src')
        return picture_url

    The name says it all: this finds the link to the original picture.

    The syntax is basically the same as the previous function, so no extra explanation. It is worth noting that the returned picture_url is a list, and we only need the first element (there is only one in the list), so add [0] when using it, i.e. picture_url[0].
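    A quick illustration of why the [0] is needed: xpath() always returns a list, even when only one node matches. The markup below is a simplified stand-in for the real detail page.

```python
from lxml import etree

# Simplified stand-in for the wallpaper detail page.
page = etree.HTML(
    '<section id="showcase"><div>'
    '<img src="https://w.wallhaven.cc/full/example.jpg" alt="example"/>'
    '</div></section>'
)

srcs = page.xpath('//section[@id="showcase"]/div/img/@src')
print(srcs)     # a one-element list, not a bare string
print(srcs[0])  # index with [0] to get the URL itself
```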

    def get_picture_name(picture):
        picture_html = picture.content
        picture_html1 = etree.HTML(picture_html)
        picture_name = picture_html1.xpath('//section[@id="showcase"]/div/img/@alt')
        return picture_name

    This gets the name of the picture. You could skip it and name each picture yourself; I'm too lazy to.

    Similarly, what is returned is a list and only the first element is needed, so don't forget the [0].

    Full code

    import requests
    from lxml import etree

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'
    }


    def get_page_url(url):
        try:
            # headers must be passed as a keyword argument; passed positionally
            # it would be treated as the params argument
            urlhtml = requests.get(url, headers=headers)
            urlhtml.encoding = "utf-8"
            htmlcode = urlhtml.content
            html = etree.HTML(htmlcode)
            text = html.xpath('//div[@class="feat-row"]/span/a/@href')
        except requests.RequestException:
            text = []  # empty list, so the caller's loop simply does nothing
        return text


    def get_picture_url(picture):
        # extract the src of the full-size image on the detail page
        picture_html = picture.content
        picture_html1 = etree.HTML(picture_html)
        picture_url = picture_html1.xpath('//section[@id="showcase"]/div/img/@src')
        return picture_url


    def get_picture_name(picture):
        # extract the alt text, used below as the file name
        picture_html = picture.content
        picture_html1 = etree.HTML(picture_html)
        picture_name = picture_html1.xpath('//section[@id="showcase"]/div/img/@alt')
        return picture_name


    if __name__ == '__main__':
        page_urls = get_page_url('https://wallhaven.cc/')
        for i in page_urls:
            picture_page = requests.get(i, headers=headers)
            picture_page.encoding = 'utf-8'
            picture_name = get_picture_name(picture_page)
            picture_url = get_picture_url(picture_page)
            with open(picture_name[0] + '.jpg', 'wb+') as f:
                f.write(requests.get(picture_url[0], headers=headers).content)
            print('One picture downloaded')

    By the way, when writing out the picture at the end, open the file in a binary write mode ('wb+'): we are writing raw binary image data, so don't use a text append mode like 'a+'.
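    A tiny demonstration of the difference: binary write mode stores the raw bytes exactly as given, which is what image data needs. The path below is a temporary file created just for the demo.

```python
import os
import tempfile

# The first four bytes of a JPEG header: raw binary, not text.
data = bytes([0xFF, 0xD8, 0xFF, 0xE0])
path = os.path.join(tempfile.gettempdir(), 'wb_demo.jpg')

# 'wb' / 'wb+' truncates the file and writes bytes unchanged; a text mode
# like 'a+' would append to any existing file and expects str, not bytes.
with open(path, 'wb+') as f:
    f.write(data)

with open(path, 'rb') as f:
    assert f.read() == data  # the bytes come back exactly as written

os.remove(path)
print('binary round-trip ok')
```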

    Running the script

    Screenshot 2022-11-13 165755

    Every time a picture finishes downloading, a message is printed to the console.

    Since the website is hosted abroad and the pictures are high resolution, the script may run very slowly.
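    If the slowness bothers you, one common remedy (not part of the original script; the function name and parameters here are my own) is to stream each download in chunks with a timeout, so a stalled connection raises an error instead of hanging forever:

```python
import requests


def download_image(url, filename, headers=None, timeout=30):
    """Stream a large image to disk chunk by chunk.

    stream=True avoids holding the whole file in memory at once, and
    timeout makes a stalled overseas connection fail fast.
    """
    resp = requests.get(url, headers=headers, timeout=timeout, stream=True)
    resp.raise_for_status()  # fail loudly on 404/403 instead of saving junk
    with open(filename, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)
```

    This would drop into the main loop in place of the plain requests.get(...).content write.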

    Screenshot 2022-11-13 165731

    The file save path is under your current project path.

    This is for learning only. Please don't hit other people's websites with concurrent requests; if you paralyze someone's server, that has nothing to do with me.

    Link to this article: https://xiaoliu.life/465.html

    Please indicate the reprint from: Xiao Liu who loves to work overtime


Origin blog.csdn.net/weixin_46630782/article/details/128097213