foreword
Today I will show you how to crawl comic data with Python. The code is below for anyone who needs it, along with a few tips.
First of all, before crawling you should disguise the crawler as a normal browser as much as possible so it is not recognized as a bot. The basic step is to add a request header, but plenty of other people crawl this kind of plain-text data too, so we should also rotate proxy IPs and randomize the request headers while crawling the comic data.
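Randomizing the request header can be as simple as picking a User-Agent from a pool on every request. A minimal sketch (the pool entries below are just common desktop browser strings, kept short for illustration):

```python
import random

# A small illustrative pool of User-Agent strings;
# in practice you would maintain a much longer list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.1 Safari/605.1.15',
]

def build_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```

Calling `build_headers()` before each `requests.get()` makes successive requests look like they come from different browsers.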
Every time, before writing any crawler code, the first and most important step is to analyze the target web page.
Through analysis we found that crawling was relatively slow, so we can also speed the crawler up by disabling images, JavaScript, and so on in the Chrome browser.
development tools
Python version: 3.6
Related modules:
requests module
re module
time module
bs4 module
tqdm module
contextlib module
Environment build
Install Python, add it to the PATH environment variable, and use pip to install the required modules.
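Of the modules listed above, `re`, `time`, and `contextlib` ship with the standard library; the third-party ones (plus `lxml`, which the code uses as the BeautifulSoup parser) can be installed in one command:

```shell
pip install requests beautifulsoup4 lxml tqdm
```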
The complete code and files from this article can be obtained by leaving a comment or message.
Idea analysis
Open the page we want to crawl in the browser.
Press F12 to open the developer tools and find where the comic data we want is located.
Here, what we need is the page data.
add proxy
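With `requests`, a proxy is added by passing a `proxies` dict to each request. A minimal sketch (the address below is a hypothetical placeholder; substitute a proxy you actually control):

```python
import requests

# Hypothetical proxy address -- substitute your own working proxy
proxies = {
    'http': 'http://127.0.0.1:7890',
    'https': 'http://127.0.0.1:7890',
}

def get_with_proxy(url, proxies=proxies, timeout=10):
    """Fetch a URL through the configured proxy."""
    return requests.get(url, proxies=proxies, timeout=timeout)
```

In a real crawl you would keep a pool of such dicts and pick one at random per request, retiring proxies that stop responding.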
Manga download code implementation
```python
# Download the comic. The imports are listed here for completeness;
# chapter_urls, chapter_names and save_dir come from the earlier step
# that parses the chapter list page.
import os
import re
import time

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

for i, url in enumerate(tqdm(chapter_urls)):
    print(i, url)
    download_header = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    # Strip '.' so the chapter name is a valid directory name
    name = chapter_names[i].replace('.', '')
    chapter_save_dir = os.path.join(save_dir, name)
    if name not in os.listdir(save_dir):
        os.mkdir(chapter_save_dir)
    r = requests.get(url=url)
    html = BeautifulSoup(r.text, 'lxml')
    script_info = html.script
    # Picture IDs are 13- or 14-digit numbers embedded in the page's script tag
    pics = re.findall(r'\d{13,14}', str(script_info))
    for j, pic in enumerate(pics):
        if len(pic) == 13:
            pics[j] = pic + '0'  # pad 13-digit IDs so they sort correctly
    pics = sorted(pics, key=lambda x: int(x))
    chapterpic_hou = re.findall(r'\|(\d{5})\|', str(script_info))[0]
    chapterpic_qian = re.findall(r'\|(\d{4})\|', str(script_info))[0]
    for idx, pic in enumerate(pics):
        if pic[-1] == '0':
            pic_url = ('https://images.dmzj.com/img/chapterpic/'
                       + chapterpic_qian + '/' + chapterpic_hou + '/'
                       + pic[:-1] + '.jpg')
        else:
            pic_url = ('https://images.dmzj.com/img/chapterpic/'
                       + chapterpic_qian + '/' + chapterpic_hou + '/'
                       + pic + '.jpg')
        pic_name = '%03d.jpg' % (idx + 1)
        pic_save_path = os.path.join(chapter_save_dir, pic_name)
        print(pic_url)
        # For large files, requests.get(..., stream=True) wrapped in
        # contextlib.closing and read via iter_content() would download in chunks
        response = requests.get(pic_url, headers=download_header)
        print(response)
        if response.status_code == 200:
            with open(pic_save_path, 'wb') as file:
                file.write(response.content)
        else:
            print('Bad link')
        time.sleep(2)  # throttle requests to avoid being blocked
```
data storage
Result display
Finally
To thank my readers, I would like to share some of my favorite practical programming resources and give something back to every reader. I hope they help you.
There are hands-on tutorials suitable for beginners~
Come and grow together with Xiaoyu!
① More than 100 Python PDFs (covering the mainstream classic books)
② The Python standard library reference (the most complete Chinese edition)
③ Source code for crawler projects (forty or fifty interesting, classic practice projects with source code)
④ Videos on the basics of Python, crawlers, web development, and big data analysis (suitable for beginners)
⑤ A Python learning roadmap (say goodbye to unfocused learning)