Python crawler: crawling high-resolution images from a website

Example 1. Crawling high-definition images from 千图网 (58pic.com)
```python
import urllib.request
import urllib.error
import re

for i in range(1, 10):
    # Page i of the 58pic.com (千图网) category listing
    pageurl = 'https://www.58pic.com/piccate/3-156-909-se1-p' + str(i) + '.html'
    data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")

    # Extract the detail-page URLs with a regular expression
    pat = '(//www.58pic.com/newpic/.*?.html)'
    imglist = re.compile(pat).findall(data)
    print(imglist)
```

Sample output of `print(imglist)` for one page:

```
['//www.58pic.com/newpic/34666756.html', '//www.58pic.com/newpic/34664475.html', '//www.58pic.com/newpic/34664471.html', '//www.58pic.com/newpic/34664397.html', '//www.58pic.com/newpic/34664383.html', '//www.58pic.com/newpic/34663375.html', '//www.58pic.com/newpic/34663183.html', '//www.58pic.com/newpic/34662278.html', '//www.58pic.com/newpic/34480033.html', '//www.58pic.com/newpic/34479938.html', '//www.58pic.com/newpic/34479937.html', '//www.58pic.com/newpic/34479855.html', '//www.58pic.com/newpic/34479854.html', '//www.58pic.com/newpic/34479549.html', '//www.58pic.com/newpic/34479548.html', '//www.58pic.com/newpic/34479381.html', '//www.58pic.com/newpic/34479010.html', '//www.58pic.com/newpic/34478964.html', '//www.58pic.com/newpic/34478963.html', '//www.58pic.com/newpic/34432574.html', '//www.58pic.com/newpic/34432554.html', '//www.58pic.com/newpic/34432517.html', '//www.58pic.com/newpic/34426270.html', '//www.58pic.com/newpic/34426034.html', '//www.58pic.com/newpic/34425959.html', '//www.58pic.com/newpic/34425710.html', '//www.58pic.com/newpic/34425658.html', '//www.58pic.com/newpic/34425570.html', '//www.58pic.com/newpic/34425469.html', '//www.58pic.com/newpic/34425122.html', '//www.58pic.com/newpic/34424954.html', '//www.58pic.com/newpic/34424934.html', '//www.58pic.com/newpic/34424029.html', '//www.58pic.com/newpic/34424028.html', '//www.58pic.com/newpic/34423912.html']
```
The download loop below goes inside the page loop above (same indentation level as `imglist`). The matched URLs are protocol-relative, so `https:` is prepended before passing them to `urlretrieve`:

```python
    for j in range(0, len(imglist)):
        try:
            # The suffix requests a 1024px watermarked rendering
            thisimg = "https:" + imglist[j] + "/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0"
            # A small strip is cropped off by the site; the variant below keeps it
            # thisimg = "https:" + imglist[j] + "/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024"
            file = "F:/jupyterpycodes/python_pachongfenxi/result/" + str(i) + str(j) + ".jpg"
            urllib.request.urlretrieve(thisimg, filename=file)
            print("Page " + str(i) + ", image " + str(j) + " crawled successfully")
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)
```
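Many sites reject requests carrying urllib's default `Python-urllib` User-Agent, which is a common cause of 403 errors in scripts like the one above. A minimal sketch (the User-Agent string itself is just an illustrative browser-like value) installs a global opener so that both `urlopen()` and `urlretrieve()` send the custom header:

```python
import urllib.request

# Install a global opener with a browser-like User-Agent; after this,
# urlopen() and urlretrieve() both send it automatically.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent",
                      "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")]
urllib.request.install_opener(opener)

req_headers = dict(opener.addheaders)
print(req_headers["User-Agent"])
```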




2. Packet capture analysis

Crawling is ultimately about sending and receiving network packets. The data we want is not always present in the HTML source code; it may be served from other URLs, such as asynchronously loaded API endpoints. To crawl such data, we need to capture the traffic, locate the hidden URLs that actually carry the data, analyze their patterns, and then crawl those URLs directly.
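For instance, a capture session often reveals a JSON endpoint that never appears in the page's HTML. The sketch below is hypothetical: `api_url` and the payload shape stand in for whatever your own capture uncovers, and the parsing step is demonstrated on a canned response of that shape.

```python
import json
import urllib.request

# Hypothetical endpoint discovered by inspecting captured traffic;
# the real URL and JSON layout must come from your own capture session.
api_url = "https://example.com/api/list?page=1"

def fetch_hidden_data(url):
    """Fetch a JSON endpoint that never appears in the page's HTML source."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Parsing works the same whatever the transport; demonstrate with a
# canned response shaped like a typical captured payload.
sample = '{"data": [{"img": "//cdn.example.com/a.jpg"}]}'
payload = json.loads(sample)
urls = [item["img"] for item in payload["data"]]
print(urls)
```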

3. Using Fiddler for packet capture analysis

By default, Fiddler captures only HTTP traffic; without extra setup, HTTPS data (and hence most source data) cannot be captured. To capture HTTPS traffic, Fiddler must be configured accordingly.
Reference: https://ask.hellobi.com/blog/weiwei/5159
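Fiddler works as a local HTTP proxy, listening on port 8888 by default. A sketch of routing urllib traffic through it so the crawler's own requests appear in the capture window (the opener is built but not installed here, since it only makes sense with Fiddler running):

```python
import urllib.request

# Route urllib traffic through Fiddler's local proxy (default port 8888)
# so the requests show up in its capture window.
proxy = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
})
opener = urllib.request.build_opener(proxy)
# urllib.request.install_opener(opener)  # uncomment with Fiddler running
print(type(opener).__name__)
```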



Origin blog.csdn.net/weixin_43412569/article/details/104855097