Python crawler beginner tutorial (9): a multi-threaded crawler case study

Preface

The text and images in this article come from the Internet and are for learning and exchange only; they have no commercial use. If you have any questions, please contact us.

 

Free online video tutorials on Python crawlers, data analysis, website development, and other case studies:

https://space.bilibili.com/523606542

Previous installments

 

Python crawler beginner tutorial (1): crawling Douban movie ranking information

Python crawler beginner tutorial (2): crawling novels

Python crawler beginner tutorial (3): crawling Lianjia second-hand housing data

Python crawler beginner tutorial (4): crawling 51job.com recruitment information

Python crawler beginner tutorial (5): crawling Bilibili video bullet comments

Python crawler beginner tutorial (6): making word cloud diagrams

Python crawler beginner tutorial (7): crawling Tencent Video bullet comments

Python crawler beginner tutorial (8): crawling forum articles and saving them as PDF

Basic development environment

  • Python 3.6
  • Pycharm
  • wkhtmltopdf

Use of related modules

  • re
  • requests
  • concurrent.futures

Install Python and add it to the environment variables, then use pip to install the required modules.
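For example, of the modules listed above only requests needs to be installed; re and concurrent.futures ship with the Python standard library:

```shell
# requests is the only third-party dependency in this tutorial
pip install requests

# re and concurrent.futures are part of the standard library; no install needed
```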

1. Clarify the requirements

Who doesn't send a few emoticon images when chatting these days? Emoticons are an important tool in conversation and a good way to close the distance between friends. When a chat turns awkward, just drop in an emoticon and the embarrassment disappears.

In this article, we will use Python to crawl emoticon images in batches and save them for future use.


2. Web page data analysis



As shown in the figure, all of the image data on the Doutula site is contained in `a` tags. You can try requesting the page directly to check whether the response also contains the image addresses.

import requests


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def main(html_url):
    response = get_response(html_url)
    print(response.text)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)

Press Ctrl + F to search the output and confirm that the image URLs appear in it.



One thing to note here: when we request the page with Python, the returned result contains the image URL in two attributes:
data-original="picture url"
data-backup="picture url"

To extract the URL, you can use the parsel parsing library or the re regular-expression module. Earlier articles in this series used parsel, so this one uses regular expressions instead.

urls = re.findall('data-original="(.*?)"', response.text)
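To see how this pattern behaves, you can run it against a small HTML snippet first. The snippet below is hypothetical sample data mimicking the structure described above, not actual markup from the site:

```python
import re

# Hypothetical sample mimicking the list page's <img> attributes
html = '''
<img data-original="https://ws1.sinaimg.cn/bmiddle/abc123.jpg"
     data-backup="https://ws1.sinaimg.cn/bmiddle/abc123.jpg" alt="demo1">
<img data-original="https://ws1.sinaimg.cn/bmiddle/def456.gif"
     data-backup="https://ws1.sinaimg.cn/bmiddle/def456.gif" alt="demo2">
'''

# The non-greedy group (.*?) captures everything between the quotes
# after data-original=, so data-backup entries are not matched
urls = re.findall('data-original="(.*?)"', html)
print(urls)
```

The non-greedy `(.*?)` is important: a greedy `(.*)` would run past the closing quote and swallow the rest of the line.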

Complete code for crawling a single page

import requests
import re
import os


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    image_content = get_response(image_url).content
    os.makedirs('images', exist_ok=True)  # make sure the output directory exists
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(image_content)
        print(image_name)


def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)

Multi-threaded crawling of all the images on the site (if you have enough storage)



The site has 3,631 pages in total, covering all the emoticon images.

import requests
import re
import os
import concurrent.futures


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    image_content = get_response(image_url).content
    os.makedirs('images', exist_ok=True)  # make sure the output directory exists
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(image_content)
        print(image_name)


def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    # ThreadPoolExecutor: the thread pool object
    # max_workers: the maximum number of worker threads
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    for page in range(1, 3632):
        url = f'https://www.doutula.com/photo/list/?page={page}'
        # submit adds a task to the thread pool
        executor.submit(main, url)
    executor.shutdown()
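Before pointing the thread pool at the real site, the same pattern can be tried offline with a harmless stand-in task. The `build_url` function below is a placeholder, not part of the original script; it just builds each page URL instead of downloading it:

```python
import concurrent.futures

def build_url(page):
    # Stand-in for main(): builds the page URL instead of crawling it
    return f'https://www.doutula.com/photo/list/?page={page}'

# The "with" block replaces the explicit shutdown() call:
# it waits for all tasks to finish, then releases the threads.
# executor.map distributes pages across the 3 worker threads
# and returns results in input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(build_url, range(1, 6)))

print(len(results))   # 5
print(results[0])     # https://www.doutula.com/photo/list/?page=1
```

For a real crawl, `executor.map(main, urls)` would also collect exceptions raised inside workers, which `submit` without checking the returned future silently discards.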

Origin blog.csdn.net/m0_48405781/article/details/113356881