2019-12-14 Web Scraping 11: Downloading Images from a Comic Site with Multiple Processes (requests+lxml+fake_useragent+multiprocessing)

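I wanted to read a comic, but for some reason the images would not display on the web page.
So I had no choice but to download the comic and read it at my own pace.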

The site structure is simple: index -> chapter -> page

Index

https://www.dagumanhua.com/manhua/3883/

Chapters

The link to every chapter is in the HTML of the index page above.

<div class="cy_plist" id="play_0">
    <ul>
        <li><a href="/manhua/3883/623532.html" title="第813话 八品炼药师（上）" target="_blank"><p>第813话 八品炼药师（上）</p><i></i></a></li>
        <li><a href="/manhua/3883/623530.html" title="第814话 八品炼药师（下）" target="_blank"><p>第814话 八品炼药师（下）</p><i></i></a></li>
        <li><a href="/manhua/3883/622052.html" title="第812话 熊的宝藏（下）" target="_blank"><p>第812话 熊的宝藏（下）</p><i></i></a></li>
        <li><a href="/manhua/3883/622051.html" title="第811话 熊的宝藏（上）" target="_blank"><p>第811话 熊的宝藏（上）</p><i></i></a></li>
        <li><a href="/manhua/3883/619107.html" title="第810话 山脉之主（下）" target="_blank"><p>第810话 山脉之主（下）</p><i></i></a></li>
        <li><a href="/manhua/3883/619105.html" title="第809话 山脉之主（上）" target="_blank"><p>第809话 山脉之主（上）</p><i></i></a></li>
        <li><a href="/manhua/3883/617623.html" title="第808话 觅宝（下）" target="_blank"><p>第808话 觅宝（下）</p><i></i></a></li>

I parse the page with lxml to get all the links.

 html.xpath('//*[@id="play_0"]/ul/li/a/@href')

The hrefs are relative paths, so each one has to be joined with the site root to get an absolute URL, for example: https://www.dagumanhua.com/manhua/3883/623532.html
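
The joining step can be sketched with the standard library's urljoin. This is only an illustration in Python 3, not the code from the download at the end of the post; the helper name to_absolute and the constant INDEX_URL are made up for this example, and the two hrefs are copied from the HTML above.

```python
from urllib.parse import urljoin

# The index page the hrefs were scraped from.
INDEX_URL = 'https://www.dagumanhua.com/manhua/3883/'

def to_absolute(hrefs, base=INDEX_URL):
    """Turn the relative hrefs returned by the XPath query into absolute URLs."""
    return [urljoin(base, href) for href in hrefs]

# Two hrefs taken from the chapter list shown above.
sample = ['/manhua/3883/623532.html', '/manhua/3883/623530.html']
for url in to_absolute(sample):
    print(url)
```

urljoin handles both root-relative hrefs like these and hrefs that are already absolute, which plain string concatenation does not.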

Pages

The links to the individual pages of a chapter are in the HTML of the chapter page.

<div class="pages">&nbsp;<b>1</b>&nbsp;<a href="/manhua/3883/623532_2.html">2</a>&nbsp;<a href="/manhua/3883/623532_3.html">3</a>&nbsp;<a href="/manhua/3883/623532_2.html">下一页</a></div>

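Only 3 pages are listed here, but that is enough to see the pattern in the page URLs.
From the second page on, the URL is the first page's URL with an underscore and the page number appended before .html.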

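Since only 3 pages are listed, I need another way to get the total number of pages in each chapter. That number can also be found in the HTML above.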

function dPlayPre(){ 
var prepage = 1 -1; 
var totalpage = 33;

It sits inside a JavaScript function: totalpage is the value we want.
It can be extracted with a regular expression.

re.compile(r'function.*?prepage.*?totalpage = (.*?);', re.S)

The rest is just a loop over the pages, nothing much to say about it.
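
For reference, here is a minimal sketch of that loop in Python 3. It reuses the regular expression above and the URL rule described earlier; the function names total_pages and page_urls are made up for this example, and falling back to 1 page when the script is not found is my own assumption.

```python
import re

TOTALPAGE_RE = re.compile(r'function.*?prepage.*?totalpage = (.*?);', re.S)

def total_pages(chapter_html):
    """Return the chapter's page count, or 1 if the script block is not found."""
    match = TOTALPAGE_RE.search(chapter_html)
    return int(match.group(1)) if match else 1

def page_urls(chapter_url, total):
    """Page 1 is the chapter URL itself; page n (n >= 2) adds '_n' before '.html'."""
    stem = chapter_url[:-len('.html')]
    return [chapter_url] + ['%s_%d.html' % (stem, n) for n in range(2, total + 1)]

# The JavaScript fragment shown above.
sample_js = '''function dPlayPre(){ 
var prepage = 1 -1; 
var totalpage = 33;'''
chapter = 'https://www.dagumanhua.com/manhua/3883/623532.html'
urls = page_urls(chapter, total_pages(sample_js))
print(len(urls), urls[1], urls[-1])
```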

fake_useragent

This comic site is simple, so the only anti-blocking measure I used was a randomized User-Agent.
This time I tried the fake_useragent module.

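from fake_useragent import UserAgent

def get_headers():
    # Instantiating UserAgent needs network access, and the site it fetches from is not very stable
    ua = UserAgent()
    return {'User-Agent': ua.random}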

If you run into problems using it, see https://blog.csdn.net/qq_38251616/article/details/86751142
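
Because instantiation can fail, one option is to fall back to a fixed User-Agent instead of letting the whole download stop. The sketch below is an assumption of mine, not part of the original script: get_headers_safe and FALLBACK_UA are hypothetical names, and the fallback string is just a placeholder.

```python
# A fixed fallback value; any real browser User-Agent string would do here.
FALLBACK_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

def get_headers_safe():
    """Like get_headers(), but never raises if fake_useragent is unavailable."""
    try:
        from fake_useragent import UserAgent
        return {'User-Agent': UserAgent().random}
    except Exception:  # package missing, data fetch failed, etc.
        return {'User-Agent': FALLBACK_UA}

print(get_headers_safe())
```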

Multiprocess download

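The comic I want to read has a lot of images, and downloading them one at a time would take forever, so I use multiple processes.
For that I chose the multiprocessing module.
First, check the number of CPU cores.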

>>> from multiprocessing import cpu_count
>>> print(cpu_count())
4

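4 cores, so I start 4 processes.
Below is the function that downloads an image. Each image is saved in order into the directory of its chapter.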

import os
import requests

def download_img(img_url, sn, chapter_direction):
    print('image=', img_url)
    response = requests.get(img_url, headers=get_headers(), timeout=5)

    # Save as 1.jpg, 2.jpg, ... numbered in page order.
    # response.content is raw bytes, so no text encoding is involved.
    image_file = os.path.join(chapter_direction, str(sn) + '.jpg')
    with open(image_file, 'wb') as f:
        f.write(response.content)

The multiprocess part of the download is set up as follows.

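p = Pool(4)  # number of worker processes in the pool
try:
    # i is the 1-based page number; it is also used as the image file name
    for i, page_list_url_i in enumerate(page_list_url, 1):
        # absolute URL of this page
        page_list_url_full = 'https://www.dagumanhua.com' + str(page_list_url_i)
        print('page', i, 'of', len(page_list_url), '=', page_list_url_full)

        page_text = get_html(page_list_url_full)  # HTML of this page

        img_url1 = image_url(page_text).strip()  # image URL on this page

        # download_img(img_url1, i, chapter_name)  # sequential download

        # Download the image of this page.
        # Non-blocking: apply_async does not wait for the worker to finish, so the
        # main process keeps looping while the OS schedules the worker processes.
        p.apply_async(download_img, args=(img_url1, i, chapter_name))

        # time.sleep(0.5)
    print('Waiting for all subprocesses done...')
    p.close()  # no more tasks will be submitted to the pool
    p.join()  # the main process waits for all workers to finish
    print('All subprocesses done.')
except KeyboardInterrupt:  # Ctrl-C
    print('parent received control-c')
    p.terminate()
    p.join()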
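
One thing to be aware of with apply_async: an exception raised inside a worker (a request timeout, for example) is not shown anywhere unless get() is called on the object that apply_async returns. The sketch below, in Python 3, shows one way to find out which pages failed. It is not part of the original script: fake_download is a hypothetical stand-in that does no network access, and collect is a made-up helper name.

```python
from multiprocessing import Pool

def fake_download(sn):
    """Stand-in for download_img: fails on page 3 to simulate a timeout."""
    if sn == 3:
        raise IOError('simulated timeout on page %d' % sn)
    return '%d.jpg' % sn

def collect(pool, page_numbers):
    """Submit one task per page, then report which pages were saved or failed."""
    pending = [(sn, pool.apply_async(fake_download, args=(sn,))) for sn in page_numbers]
    saved, failed = [], []
    for sn, result in pending:
        try:
            saved.append(result.get(timeout=30))  # re-raises the worker's exception
        except Exception as exc:
            failed.append((sn, str(exc)))
    return saved, failed

if __name__ == '__main__':
    with Pool(4) as p:
        saved, failed = collect(p, range(1, 6))
    print('saved :', saved)
    print('failed:', failed)
```

The failed list can then be used to retry only the pages that did not download.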

Part of the output from a multiprocess run:

chapter_list_url_full= https://www.dagumanhua.com/manhua/3883/155457.html
chapter_direction= C:\todd\python_files\website\斗破苍穹\斗破苍穹 第2话 陨落的天才（中）
total pages= 12
page 1 of 12 = https://www.dagumanhua.com/manhua/3883/155457.html
page 2 of 12 = https://www.dagumanhua.com/manhua/3883/155457_2.html
page 3 of 12 = https://www.dagumanhua.com/manhua/3883/155457_3.html
image= http://img.baidu.com.manhuapi.com/c/20170823/rbuys5alm0v.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/dybfvnaha2o.jpg
pagei mage=4  hottp://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpgf
 12 = https://www.dagumanhua.com/manhua/3883/155457_4.html
pagei mage=5  hottp://img.baidu.com.manhuapi.com/c/20170823/ry1z3mv0yvr.jpgf
 12 = https://www.dagumanhua.com/manhua/3883/155457_5.html
page 6 of 12 = https://www.dagumanhua.com/manhua/3883/155457_6.html
page 7 of 12 = https://www.dagumanhua.com/manhua/3883/155457_7.html
page 8 of 12 = https://www.dagumanhua.com/manhua/3883/155457_8.html
page 9 of 12 = https://www.dagumanhua.com/manhua/3883/155457_9.html
page 10 of 12 = https://www.dagumanhua.com/manhua/3883/155457_10.html
image= http://img.baidu.com.manhuapi.com/c/20170823/bsgtncyxbyj.jpg
page 11 of 12 = https://www.dagumanhua.com/manhua/3883/155457_11.html
image= http://img.baidu.com.manhuapi.com/c/20170823/zjsxf1byjri.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/bmsnlb21rbe.jpg
page 12 of 12 = https://www.dagumanhua.com/manhua/3883/155457_12.html
Waiting for all subprocesses done...
image= http://img.baidu.com.manhuapi.com/c/20170823/rg4mmujk1ev.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/5nm144gvxad.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/j4mdidgwk1c.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/kbqkitmkqnx.jpg
image= http://img.baidu.com.manhuapi.com/c/20170823/fkhtyne05fb.jpg
All subprocesses done.

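A few of these lines look odd, for example the ones below. That is because different processes were printing at the same time, so their output got interleaved.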

pagei mage=4  hottp://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpgf
 12 = https://www.dagumanhua.com/manhua/3883/155457_4.html

They are really the following 2 lines mixed together. Compare them character by character and it is easy to see.

page 4  of 12 = https://www.dagumanhua.com/manhua/3883/155457_4.html
image= http://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpg
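
One way to avoid interleaved lines is to let only the parent process print: each worker returns its log line and the parent prints it. The sketch below, in Python 3, is an assumption of mine and not what the original script does; describe and run are hypothetical names, and the worker only builds a string instead of downloading anything.

```python
from multiprocessing import Pool

def describe(task):
    """Worker: build the log line instead of printing it."""
    sn, img_url = task
    return 'image %d = %s' % (sn, img_url)

def run(pool, tasks):
    """Print each line from the parent as soon as its worker finishes."""
    lines = []
    for line in pool.imap_unordered(describe, tasks):
        print(line)  # only one process writes to stdout
        lines.append(line)
    return lines

if __name__ == '__main__':
    tasks = [(4, 'http://img.baidu.com.manhuapi.com/c/20170823/mondjh1qz1y.jpg'),
             (5, 'http://img.baidu.com.manhuapi.com/c/20170823/ry1z3mv0yvr.jpg')]
    with Pool(2) as p:
        run(p, tasks)
```

The lines may still arrive in a different order from the page order, but each one is printed whole.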

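Without multiprocessing, the output looks like this. Compare it with the multiprocess output above and it becomes clear what multiprocessing is doing.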

chapter_list_url_full= https://www.dagumanhua.com/manhua/3883/155456.html
chapter_direction= C:\todd\python_files\website\斗破苍穹\斗破苍穹 第1话 陨落的天才（上）
total pages= 13
page 1 of 13 = https://www.dagumanhua.com/manhua/3883/155456.html
image= http://img.baidu.com.manhuapi.com/c/20170823/umyw23z5sr0.jpg
page 2 of 13 = https://www.dagumanhua.com/manhua/3883/155456_2.html
image= http://img.baidu.com.manhuapi.com/c/20170823/hs2qj4l4p4t.jpg
page 3 of 13 = https://www.dagumanhua.com/manhua/3883/155456_3.html
image= http://img.baidu.com.manhuapi.com/c/20170823/ty4slqkzpez.jpg
page 4 of 13 = https://www.dagumanhua.com/manhua/3883/155456_4.html
image= http://img.baidu.com.manhuapi.com/c/20170823/xq0gckj3yr2.jpg
page 5 of 13 = https://www.dagumanhua.com/manhua/3883/155456_5.html
image= http://img.baidu.com.manhuapi.com/c/20170823/i2a1ux2e5l3.jpg
page 6 of 13 = https://www.dagumanhua.com/manhua/3883/155456_6.html
image= http://img.baidu.com.manhuapi.com/c/20170823/wy1lqagtgoy.jpg
page 7 of 13 = https://www.dagumanhua.com/manhua/3883/155456_7.html
image= http://img.baidu.com.manhuapi.com/c/20170823/fcczrky2fdu.jpg
page 8 of 13 = https://www.dagumanhua.com/manhua/3883/155456_8.html
image= http://img.baidu.com.manhuapi.com/c/20170823/ej4grtbpyzb.jpg
page 9 of 13 = https://www.dagumanhua.com/manhua/3883/155456_9.html
image= http://img.baidu.com.manhuapi.com/c/20170823/gagpydhkstj.jpg
page 10 of 13 = https://www.dagumanhua.com/manhua/3883/155456_10.html
image= http://img.baidu.com.manhuapi.com/c/20170823/xzt3rktlecu.jpg
page 11 of 13 = https://www.dagumanhua.com/manhua/3883/155456_11.html
image= http://img.baidu.com.manhuapi.com/c/20170823/kgwbini0x4k.jpg
page 12 of 13 = https://www.dagumanhua.com/manhua/3883/155456_12.html
image= http://img.baidu.com.manhuapi.com/c/20170823/e5sn5wwvslv.jpg
page 13 of 13 = https://www.dagumanhua.com/manhua/3883/155456_13.html
image= http://img.baidu.com.manhuapi.com/c/20170823/fmgzeu12n1x.jpg

For the full code, see https://download.csdn.net/download/weixin_42555985/12033409

没人不认识我
