Python Crawler in Practice: Scraping Comic Data with the requests + tqdm Modules (Source Code Included)

Foreword

Today I'll walk you through scraping comic data with Python. The code is below for anyone who needs it, along with a few tips.

First of all, before crawling you should disguise your program as a browser as much as possible, so that it is not recognized as a crawler. The bare minimum is adding a request header, but plain-text data like this attracts many crawlers, so we should also consider rotating proxy IPs and randomly switching request headers while scraping the comic data.
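As a minimal sketch of that idea (the User-Agent strings below are just an example pool; extend it with your own), randomly switching the request header on every requests call could look like this:

import random
import requests

# An example pool of User-Agent strings; add more to make rotation less predictable.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
]

def fetch(url):
    # Pick a fresh User-Agent for each request so the traffic looks less uniform.
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)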

Every time, before writing any crawler code, our first and most important step is to analyze the target web page.

Analysis also shows that crawling proceeds relatively slowly, so we can further improve the crawler's speed by disabling images, JavaScript, and so on in the Chrome browser.
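The download code in this article uses requests directly, but if you do render pages in Chrome (for example through Selenium, which is an assumption on my part since the article does not show this step), a minimal sketch of disabling images and JavaScript looks like:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Chrome content-settings preferences: the value 2 means "block".
options.add_experimental_option('prefs', {
    'profile.managed_default_content_settings.images': 2,
    'profile.managed_default_content_settings.javascript': 2,
})
driver = webdriver.Chrome(options=options)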

Development tools

Python version: 3.6

Related modules:

requests module

re module

time module

bs4 module

tqdm module

contextlib module

os module

Environment setup

Install Python, add it to the PATH environment variable, and install the required modules with pip.
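For example, the third-party modules can be installed in one line (re, time, os, and contextlib are part of the standard library; lxml is the parser BeautifulSoup uses below):

pip install requests beautifulsoup4 lxml tqdm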

The complete code and files used in this article can be obtained by leaving a comment or message.

Approach

Open the page we want to crawl in the browser.
Press F12 to open the developer tools and locate the comic data we are after.
What we need here is the page data.

Source code structure
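The download loop later in the article assumes that save_dir, chapter_urls, and chapter_names already exist. As a hedged sketch of how they might be built with requests + BeautifulSoup (the index URL and the CSS selector are placeholders, not the site's real structure; check them in the developer tools):

import os
import requests
from bs4 import BeautifulSoup

index_url = 'https://example.com/comic/12345/'   # placeholder; use the comic's index page
r = requests.get(index_url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(r.text, 'lxml')

chapter_urls, chapter_names = [], []
# Hypothetical selector; replace it with whatever F12 shows for the chapter list.
for a in html.select('ul.chapter-list a'):
    chapter_urls.append(a.get('href'))
    chapter_names.append(a.get_text(strip=True))

save_dir = 'manga'
if not os.path.isdir(save_dir):
    os.mkdir(save_dir)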

Adding a proxy

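The screenshot that illustrated this step is not reproduced here. As a minimal sketch (the proxy addresses are placeholders; substitute working ones from your own pool), routing requests traffic through a randomly chosen proxy looks like:

import random
import requests

# Placeholder proxy addresses; replace them with working proxies of your own.
PROXIES = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]

def fetch_via_proxy(url, headers=None):
    # Rotate to a random proxy on every request to spread the traffic out.
    return requests.get(url, headers=headers,
                        proxies=random.choice(PROXIES), timeout=10)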

Manga download code implementation

import os
import re
import time

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Download the manga; save_dir, chapter_urls, and chapter_names are built earlier.
for i, url in enumerate(tqdm(chapter_urls)):
    print(i, url)
    download_header = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    name = chapter_names[i]
    # Strip '.' so the chapter name is a valid directory name.
    name = name.replace('.', '')
    chapter_save_dir = os.path.join(save_dir, name)
    if name not in os.listdir(save_dir):
        os.mkdir(chapter_save_dir)
    r = requests.get(url=url)
    html = BeautifulSoup(r.text, 'lxml')
    script_info = html.script
    # The picture IDs are embedded in the page's script as 13- or 14-digit numbers.
    pics = re.findall(r'\d{13,14}', str(script_info))
    # Pad 13-digit IDs with a trailing '0' so all IDs sort as 14-digit numbers.
    for j, pic in enumerate(pics):
        if len(pic) == 13:
            pics[j] = pic + '0'
    pics = sorted(pics, key=lambda x: int(x))
    # The two directory segments of the image URL are also embedded in the script.
    chapterpic_hou = re.findall(r'\|(\d{5})\|', str(script_info))[0]
    chapterpic_qian = re.findall(r'\|(\d{4})\|', str(script_info))[0]
    for idx, pic in enumerate(pics):
        # Drop the padding '0' again before rebuilding the image URL.
        if pic[-1] == '0':
            pic = pic[:-1]
        pic_url = ('https://images.dmzj.com/img/chapterpic/' +
                   chapterpic_qian + '/' + chapterpic_hou + '/' + pic + '.jpg')
        pic_name = '%03d.jpg' % (idx + 1)
        pic_save_path = os.path.join(chapter_save_dir, pic_name)
        print(pic_url)
        response = requests.get(pic_url, headers=download_header)
        if response.status_code == 200:
            with open(pic_save_path, 'wb') as file:
                file.write(response.content)
        else:
            print('Bad link')
    time.sleep(2)
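The requests.get call above holds each image fully in memory before writing it out. An alternative, and the reason the contextlib module appears in the list above, is a streaming download that writes the image chunk by chunk; a minimal sketch:

from contextlib import closing
import requests

def download_pic(url, save_path, headers, chunk_size=1024):
    # Stream the response so large images never sit in memory all at once.
    with closing(requests.get(url, headers=headers, stream=True)) as response:
        with open(save_path, 'wb') as file:
            for data in response.iter_content(chunk_size=chunk_size):
                file.write(data)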

Data storage

Result display

Finally

To thank my readers, I'd like to share some of my favorite recent programming resources, as a way of giving back to every reader, and I hope they help you.

There are hands-on tutorials suitable for beginners~

Come and grow together with Xiaoyu!

① More than 100 Python PDFs (covering the mainstream and classic books)

② The Python standard library reference (the most complete Chinese edition)

③ Crawler project source code (forty or fifty interesting, classic hands-on projects with their source code)

④ Videos on the basics of Python, crawlers, web development, and big data analysis (suitable for beginners)

⑤ A Python learning roadmap (say goodbye to aimless learning)

Source: blog.csdn.net/Modeler_xiaoyu/article/details/128286397