Python crawler: to keep following a comic, I actually wrote a crawler (some would say I'm "not talking martial arts")

1 Introduction

Hello everyone, I'm Latiao ("spicy strips"), the kind of spicy strips you can only look at but never eat~

I've been following the xx anime recently, but only a few chapters in, a popup appeared inviting me to become a distinguished paying customer; without paying I couldn't keep reading. So I started figuring out how the site works. Here's the point: my goal is for you to learn the technique, not to exploit loopholes. Abusing this can bring real legal consequences, as those in the know will understand~

To reiterate: this article is limited to technical sharing; using it for any other purpose is strictly prohibited!

2. Collection target

XX Manke


3. Tool preparation

Development tools: PyCharm
Development environment: Python 3.7, Windows 10
Toolkits used: requests, csv

4. Effect display


5. Analysis of project ideas

A crawler has four basic steps:

  • Get the target resource address

    • Pick the comic you want to read. Latiao chose Doupo Cangqiong, the anime I've been following recently; the special effects are really on point, but the update speed is far too slow.

      The current page is the chapter list for the whole comic, and we need to extract all of the chapter information from it. The image data we actually want, however, lives on each chapter's detail page, and on this site the later chapters require paid "star diamonds". Let's first try parsing one chapter: open its detail page and capture the data file that loads in the browser's network panel.
      That file contains the addresses of the comic images we want. Now analyze how the URL is built: https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=25934&chapter_newid=dpcq_1h&isWebp=1&quality=middle
      You can see that pagination is driven by the chapter_newid parameter: to fetch the second chapter, just change it. The chapters that would normally need star diamonds are solved just as easily; I won't dwell on the sensitive details here.
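The steps above can be sketched as a small URL builder. The `dpcq_{n}h` pattern and `comic_id` come from the captured request above and are specific to this comic; the assumption is that other comics use a different id and chapter prefix.

```python
# Only chapter_newid changes between chapters, so parameterise it.
BASE_URL = ('https://www.kanman.com/api/getchapterinfov2'
            '?product_id=1&productname=kmh&platformname=pc'
            '&comic_id=25934&chapter_newid=dpcq_{}h&isWebp=1&quality=middle')

def chapter_url(n):
    """Build the chapter-info API URL for chapter number n."""
    return BASE_URL.format(n)
```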

  • Send network requests to get data

    • Sending requests is a core crawler skill: it mostly means imitating the network requests a client sends. In Python the most commonly used request library is requests (for larger projects the scrapy framework is more appropriate). Always send a request header along with the request, to disguise the crawler as a browser.
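A minimal sketch of that disguise: the header dict below is the one used throughout this article, and its User-Agent string makes the crawler look like an ordinary Chrome browser to the server.

```python
headers = {
    'user-agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/95.0.4638.69 Safari/537.36'),
}
# Usage (a live network call, so not executed here):
#   res = requests.get(url, headers=headers)
```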
  • Extract data

    • The data we get back is JSON, so we can convert it directly into a Python dictionary and extract the fields we want through key-value pairs.
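A self-contained sketch of that extraction. The response shape below is inferred from the keys used in the source code later in the article; the values are made-up placeholders, not real API output.

```python
import json

# Stand-in for the raw JSON text the API would return.
raw = json.dumps({
    'data': {
        'current_chapter': {
            'chapter_name': 'Chapter 1',
            'chapter_img_list': ['https://example.com/1.jpg',
                                 'https://example.com/2.jpg'],
        }
    }
})

res = json.loads(raw)                     # JSON text -> Python dict
chapter = res['data']['current_chapter']  # walk the nesting via keys
name = chapter['chapter_name']
img_urls = chapter['chapter_img_list']
```

In the real crawler, `requests.get(...).json()` performs the `json.loads` step for you.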
  • save data

    • We save the data into the current folder, with each chapter saved as its own subfolder (you can organize it differently if you prefer).
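The saving step can be sketched as a small helper; `save_page` is a hypothetical name, and the bytes would come from the image request in the real crawler.

```python
import os

def save_page(img_bytes, chapter_dir, index):
    """Write one downloaded page into its chapter folder as <index>.jpg."""
    os.makedirs(chapter_dir, exist_ok=True)  # creates parent folders too
    filename = os.path.join(chapter_dir, str(index) + '.jpg')
    with open(filename, 'wb') as f:          # 'wb': images are binary data
        f.write(img_bytes)
    return filename
```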

Once again: this article is limited to technical sharing and is strictly prohibited from any other use!

6. Simple source code sharing

import requests
import json
import time
import os


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
}

def parse(url):
    print(url)
    res = requests.get(url, headers=headers).json()
    chapter_name = res['data']['current_chapter']['chapter_name']
    chapter_img_url = res['data']['current_chapter']['chapter_img_list']
    print(chapter_name, chapter_img_url)
    path = '斗破苍穹/' + chapter_name
    if not os.path.exists(path):
        os.makedirs(path)  # makedirs (not mkdir) also creates the parent folder on the first run
    download(chapter_img_url, path)  # pass the image list and target folder for each chapter


def download(img_url_list, path):
    for i, img_url in enumerate(img_url_list):
        res = requests.get(img_url, headers=headers)
        filename = path + '/' + str(i) + '.jpg'
        with open(filename, 'wb') as f:  # 'wb': image data is binary
            f.write(res.content)
        print(f'Downloading: {path}, page {i + 1}')
        time.sleep(0.5)  # throttle requests to be gentle on the server


if __name__ == '__main__':
    for i in range(1, 20):
        url = 'https://www.kanman.com/api/getchapterinfov2?product_id=1&productname=kmh&platformname=pc&comic_id=25934&chapter_newid=dpcq_{}h&isWebp=1&quality=middle'.format(i)
        parse(url)

7. Summary

Once again, the focus is on the knowledge points; don't go testing this recklessly. I don't want to watch you go from "entry level" straight to "jail".
Second, for the detailed source code, the source-code packages of my previous 30 games, and a Python learning roadmap, see my personal card below.


Origin blog.csdn.net/AI19970205/article/details/124026252