Python Crawler for Tencent Comics - Jiang Dong's blog

Brief

This is a real crawler, and its target is Tencent Comics.

The crawl collects all of the information a Tencent comic has, mainly the following three parts:

  • Basic comic information, including the author, popularity, cover image URL, number of collections, serialization status, and a short introduction
  • Chapter names
  • All the images in each chapter

My Python version is 3.7.3, and the libraries used are mainly requests and BeautifulSoup.

Comic Information

Open the Tencent Comics catalog page and you will find 12 comics per page.

What we need now is to collect each comic's name and URL. Looking at the page source, under class="ret-search-result" there are 12 <li> tags, each recording one comic's information. Under class="ret-works-info", the <a> tag's title and href attributes are the comic name and URL we need.

Python code implementation

import requests
from bs4 import BeautifulSoup

# First page of the catalog, sorted by popularity
page_url = "https://ac.qq.com/Comic/all/search/hot/vip/1/page/1"
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36'
                   ' (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36')
}
r = requests.get(page_url, headers=headers)
html = BeautifulSoup(r.text, 'lxml')
# The result container holds 12 entries, one per comic on the page
ret_search_result = html.find(attrs={'class': 'ret-search-result'})
ret_works_infos = ret_search_result.find_all(attrs={'class': 'ret-works-info'})
# Build [name, absolute URL] for each comic
page_mhs = [[info.a.get('title'),
             'https://ac.qq.com' + info.a.get('href')] for info in ret_works_infos]
page_mhs

The output is below.
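As a rough sketch of its shape, each entry of page_mhs is a [name, URL] pair; for example, the comic used in the rest of this post appears as:

[['尸兄(我叫白小飞)', 'https://ac.qq.com/Comic/comicInfo/id/17114'],
 ......]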

We select 尸兄(我叫白小飞) as the example comic for which to crawl the detailed information. From the data collected above, we know its URL is https://ac.qq.com/Comic/comicInfo/id/17114. Opening this page, you can find the comic's author, popularity, cover image URL, number of collections, serialization status, and short introduction.

From the page source, you can easily locate the elements we want to acquire.

mh_url = "https://ac.qq.com/Comic/comicInfo/id/17114"
r = requests.get(mh_url, headers=headers)
html = BeautifulSoup(r.text, 'lxml')
works_intro = html.find(attrs={'class': 'works-intro'})

tg_spic = works_intro.img.get('src')  # cover image URL
typelianzai = works_intro.find(attrs={'class': 'works-intro-status'}).text  # 连载中 (serializing)

digi = works_intro.find(attrs={'class': 'works-intro-digi'}).find_all('span')
author = ' '.join(digi[0].em.text.split())  # 七度魚 图:七度魚 文:七度魚
hots = digi[1].em.text  # 227.7亿 (popularity)
collect = digi[2].em.text  # 2922012 (collections)

short = works_intro.find(attrs={'class': 'works-intro-short'})
intro = ''.join(short.text.split())  # 即将毁灭的世界......每周更新...
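To reuse this for the other comics collected from the catalog, the steps above can be wrapped in a small helper. This is a minimal sketch that reuses the requests, BeautifulSoup, and headers from earlier, and it assumes the field order inside works-intro-digi (author, popularity, collections) matches the example above:

def get_comic_info(mh_url):
    """Fetch one comic's detail page and return its basic information (sketch)."""
    r = requests.get(mh_url, headers=headers)
    html = BeautifulSoup(r.text, 'lxml')
    works_intro = html.find(attrs={'class': 'works-intro'})
    digi = works_intro.find(attrs={'class': 'works-intro-digi'}).find_all('span')
    return {
        'cover': works_intro.img.get('src'),
        'status': works_intro.find(attrs={'class': 'works-intro-status'}).text,
        'author': ' '.join(digi[0].em.text.split()),
        'popularity': digi[1].em.text,
        'collections': digi[2].em.text,
        'intro': ''.join(works_intro.find(attrs={'class': 'works-intro-short'}).text.split()),
    }

# e.g. get_comic_info('https://ac.qq.com/Comic/comicInfo/id/17114')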

Comic chapters

The chapter list on a Tencent Comics page has two parts. One part is the full chapter list, as shown in the figure.

The other part is the latest 20 chapters (最新20话).

We only want all the chapters, so we just crawl the first part, class="chapter-page-all works-chapter-list".

mh_url = "https://ac.qq.com/Comic/comicInfo/id/17114"
r = requests.get(mh_url, headers=headers)
html = BeautifulSoup(r.text, 'lxml')
chapter_all = html.find(attrs={'class': 'chapter-page-all'})
# each link's title holds the comic name and the chapter name separated by ':';
# keep only the chapter name
chapters = [[a['href'], ''.join(a['title'].split(':')[1:])]
            for a in chapter_all.find_all('a')]
chapters

Output

[['/ComicView/index/id/17114/cid/3', '第1集'],
 ['/ComicView/index/id/17114/cid/4', '第2集'],
 ['/ComicView/index/id/17114/cid/2', '第3集'],
 ['/ComicView/index/id/17114/cid/1', '特别篇(4)'],
 ['/ComicView/index/id/17114/cid/5', '第5集'],
 ['/ComicView/index/id/17114/cid/6', '第6集'],
 ['/ComicView/index/id/17114/cid/7', '第7集'],
 ['/ComicView/index/id/17114/cid/8', '第8集'],
 ['/ComicView/index/id/17114/cid/9', '第9集'],
 ......
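The href values are site-relative paths; to request a chapter page (as done in the next section), they just need the domain prefixed, e.g.:

chapter_urls = ['https://ac.qq.com' + href for href, name in chapters]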

Chapter images

With the chapters collected, the next step is to collect the images inside each chapter. This is the most important step of the whole crawl.

It is also the most difficult step in crawling Tencent Comics.

Where does the difficulty lie? The images are not all rendered up front; only the first few are rendered, and the rest are only rendered once you scroll down to their position.

Looking at the page source, every image that has not yet been scrolled to has the placeholder address src="//ac.gtimg.com/media/images/pixel.gif".

How can we solve this? There are two directions (a rough sketch of the first one follows the list):

  • Since an image is only rendered when you scroll near it, let the program do the scrolling. Selenium can be used to drive the scrollbar.
  • Go straight to the source: the server must send the image data to the front end, so if we can intercept that data, we are done.
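As a rough sketch of the first direction (not the route taken below), Selenium could load the chapter page and scroll step by step so the lazy-loaded images receive their real src values; the scroll step and wait time here are arbitrary assumptions:

from selenium import webdriver
import time

driver = webdriver.Chrome()  # assumes a local ChromeDriver is available
driver.get("https://ac.qq.com/ComicView/index/id/17114/cid/3")
for _ in range(50):
    # scroll down in steps so each image gets rendered
    driver.execute_script("window.scrollBy(0, 1000)")
    time.sleep(0.5)
# at this point the <img> tags carry real addresses instead of pixel.gif
driver.quit()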

Considering efficiency and stability, I decided to try the second direction first.

When are these image URLs sent to the front end? This is the question we need to answer first, so that we can capture them.

Spoiler: the data is actually written right into the page source. Surprised?

Looking at the source of any chapter, there is always a DATA variable. It actually contains all the information of the page, and everything is decoded from it.

That's right, it is encoded, specifically with base64. It can be decoded, but part of the string is actually extra and has to be dropped. We can be a bit lazy here and let the program try offsets for us.

  1. Fetch the chapter page source

     chapter_url = "https://ac.qq.com/ComicView/index/id/17114/cid/3"
     r = requests.get(chapter_url, headers=headers).text
    
  2. Match DATA with a regex and base64-decode it. Part of the string is junk, so the loop simply tries successive offsets until one decodes cleanly.

     import re
     import base64

     # pull the DATA string out of the page source
     # (the exact pattern around DATA is an assumption; adjust it if the page differs)
     data = re.search(r"DATA\s*=\s*['\"]([^'\"]+)", r).group(1)

     for i, _ in enumerate(data):
         try:
             json = (base64.b64decode(data[i:].encode('utf-8'))
                           .decode('utf-8'))
             break
         except Exception:
             # not valid base64/utf-8 at this offset, try the next one
             pass
    
  3. In the decoded JSON, there is a picture field that records all of the image information.

  4. Match all the image addresses with a regex

     # remove any escape backslashes left in the matched URLs
     pics = (re.compile(r'https:\/\/manhua.qpic.cn\/manhua_detail.*?.jpg\/0')
               .findall(json))
     picurls = [picurl.replace('\\', '') for picurl in pics]
     picurls
    

    Output

     ['https://manhua.qpic.cn/manhua_detail/0/11_12_51_0e08cec74efaa44df767571795df3975.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_373bb4e249c59f413eaec750ab30ca7c_5687.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_8a6a4f42430f4f1476c626ad147b2bca_1303.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_9a139a883466412f4df13a60fca62412_1304.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_8802859e635d3e31d938a9297e06f9fd_1305.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_7d88c57cb90a0a649bb485be06ff4357_1306.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_43f221bc4de497f43c3ad52531ba77de_1307.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_772ead08c4997a829426ae2216c41bf3_1308.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_a03af075ea974eb906cf6099bdab98ff_1309.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_12906a7c160061bfec4939a7866cd00c_1310.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_2e60ff16f3f05c5f66c7cff0c6d925fe_1311.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_c6d83dc105809c4a623f7e2e9b0cc6ab_1312.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_510c76cf15c4a1ed2fba170d9d3e81ee_2805.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_92f602f9b45fee7847a00f307be7d973_2806.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_542940e57be17f7870651533ec483a75_2807.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_540c96b6b5887f6fef35a978ac9d0c0e_2804.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/11_12_51_e6cf7f08d4b01a855724fc7e2a621131.jpg/0',
      'https://manhua.qpic.cn/manhua_detail/0/15_15_58_3024bdb1e364926df3b631fdab223d95_7806.jpg/0']
    

With these four steps, all of the images in this chapter have been crawled.
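So far only the image URLs have been collected; actually saving the files is a straightforward download loop. This is a minimal sketch in which the folder name, file naming, and the Referer header are assumptions (some image hosts reject requests without a suitable Referer):

import os

os.makedirs('17114_cid3', exist_ok=True)                  # hypothetical folder for this chapter
dl_headers = dict(headers, Referer='https://ac.qq.com/')  # Referer value is an assumption
for n, picurl in enumerate(picurls, start=1):
    resp = requests.get(picurl, headers=dl_headers)
    with open(os.path.join('17114_cid3', f'{n:03d}.jpg'), 'wb') as f:
        f.write(resp.content)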

Now the comic information, the chapter list, and the chapter images can all be obtained at will. What remains is how to collect them in large batches, which I will not go into here.


Original article: 大专栏  Python爬虫之腾讯漫画 - 江冬的博客

