Web Crawler in Practice (VI): Mom No Longer Has to Worry About Me Running Out of Wallpapers

Background

I have recently been preparing to take part in a steganalysis competition, and Unsplash is one of the sources of its training dataset. Unsplash is a completely free, copyright-free site for high-quality pictures. It offers a wide variety of images at good resolution, which also makes it a very good choice for public-account background images, so I decided to get some hands-on practice by crawling its pictures.

Crawling Method One: Requests

  • Open the photo site, press F12 to open the developer tools, and watch the Network tab. As you scroll down the page, a request such as photos?page=3&per_page=12 appears.
  • Looking at the request URL, its structure is easy to read: 12 pictures per page, currently on page 3. Scrolling further down produces photos?page=4&per_page=12, where only the page parameter differs, which confirms the guess. Inspecting the response to this link, it is not hard to see that the picture download links are contained in it.
  • This site is very friendly to crawlers! I got straight to writing the code: just change the value of page in a loop and you can crawl every picture on the site.
  • The program runs successfully! But its speed is nothing to be proud of: even one page of 12 pictures takes a long time, so wouldn't crawling more than 10 million photos take years? So I chose the Scrapy framework to crawl the pictures instead.
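
The original code for this step was only shown as a screenshot, so what follows is a minimal sketch of the loop described above, not the author's exact script. The endpoint https://unsplash.com/napi/photos, the links/download field layout of the JSON, and the helper names page_url, download_links and crawl are all assumptions made for illustration; only the photos?page=…&per_page=12 query comes from the text.

```python
from pathlib import Path
from urllib.parse import urlencode

# Assumed endpoint: the article only shows "photos?page=3&per_page=12".
API_URL = "https://unsplash.com/napi/photos"


def page_url(page, per_page=12):
    """Build the listing URL for one page of photos."""
    return f"{API_URL}?{urlencode({'page': page, 'per_page': per_page})}"


def download_links(photos):
    """Collect download links from the decoded JSON list.

    Assumes each entry looks like {"links": {"download": "..."}}.
    """
    links = []
    for photo in photos:
        link = photo.get("links", {}).get("download")
        if link:
            links.append(link)
    return links


def crawl(first_page, last_page, out_dir="images"):
    """Download every picture on the given pages, one request at a time."""
    # Third-party dependency, imported here so the helpers above
    # can be used without it.
    import requests

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for page in range(first_page, last_page + 1):
        response = requests.get(page_url(page), timeout=10)
        response.raise_for_status()
        for index, link in enumerate(download_links(response.json())):
            image = requests.get(link, timeout=30)
            image.raise_for_status()
            (out / f"page{page}_{index}.jpg").write_bytes(image.content)
```

Calling crawl(1, 3) would fetch the first three pages. Because every picture is downloaded sequentially, this version is slow, which is the limitation noted above.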

Crawling Method Two: Scrapy

  • First, create the project with the same command as yesterday; see the earlier Scrapy in Practice post if you need a reminder. Then write the code for each component:

spider

  • This is the main part of the crawler. start_urls is set to the request link for this page, and the returned page content is converted with the json library so that the image download links can be extracted from it. scrapy.Request is then used to request each link, the content returned by Unsplash is parsed a second time, and the image is handed to the pipelines for output.

pipelines

  • This part stores the output images. Each picture's file name is generated from its MD5 digest, which also deduplicates the stored files.
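
The MD5 naming idea can be sketched as follows. This is an illustrative pipeline, not the author's code: the class name Md5ImagePipeline, the helper md5_name, the images output directory, the .jpg suffix, and the assumption that each item carries the image bytes under a "content" key are all hypothetical.

```python
import hashlib
from pathlib import Path


def md5_name(content: bytes, suffix: str = ".jpg") -> str:
    """Name a file after the MD5 digest of its bytes."""
    return hashlib.md5(content).hexdigest() + suffix


class Md5ImagePipeline:
    def __init__(self, out_dir="images"):
        self.out_dir = Path(out_dir)

    def open_spider(self, spider):
        # Create the output directory once, when the crawl starts.
        self.out_dir.mkdir(parents=True, exist_ok=True)

    def process_item(self, item, spider):
        path = self.out_dir / md5_name(item["content"])
        # Identical bytes map to the same name, so duplicates are skipped.
        if not path.exists():
            path.write_bytes(item["content"])
        return item
```

Because the name depends only on the file's bytes, the same picture reached through two different links is stored once.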

settings

  • Since the pipeline has been written, its entry in settings.py must be uncommented. Also add a random User-Agent header and a certain download delay so that the crawler looks more like a browser. And of course, do not forget to define the fields in items.py.
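
A settings.py fragment along these lines might look like the sketch below. The User-Agent strings, the one-second delay, and the unsplash.pipelines.Md5ImagePipeline path are placeholder assumptions; USER_AGENT, DOWNLOAD_DELAY and ITEM_PIPELINES are standard Scrapy settings. Note that this picks one User-Agent per run, whereas rotating it per request would normally be done in a downloader middleware.

```python
import random

# Example browser User-Agent strings (placeholders; use current ones in practice).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
]

# One User-Agent is chosen at random each time the project starts.
USER_AGENT = random.choice(USER_AGENTS)

# Wait between requests (seconds) so the crawl looks less like a bot.
DOWNLOAD_DELAY = 1

# Enable the pipeline; the module path assumes a project named "unsplash".
ITEM_PIPELINES = {
    "unsplash.pipelines.Md5ImagePipeline": 300,
}
```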

Crawling results

  • With the preparation done, start the project and look at the results: plenty of high-definition pictures are already in the bag.

Origin blog.csdn.net/lyc44813418/article/details/95493246