Scraping JD product images with Python 3

Preface

This post uses Python 3 to scrape product images from JD.com search results and save the image files to the local disk.


I. Matching the HTML with a regular expression

Each JD search results page is addressed by the keyword plus a page parameter (the code maps the loop index to odd page values via str(i*2-1)), and every product thumbnail in the returned HTML carries its image path in a data-lazy-img attribute, so the two building blocks are:

url="https://search.jd.com/Search?keyword="+key+"&wq="+key+"&page="+str(i*2-1)
'data-lazy-img="(.*?)"'
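To sanity-check the pattern before running the crawler, it can be tried against a small HTML fragment. The snippet below is a minimal sketch; the img tag and image path are invented for illustration rather than taken from a real JD page:

import re

sample = '<img width="220" height="220" data-lazy-img="//img14.360buyimg.com/n7/jfs/example.jpg">'
pat = 'data-lazy-img="(.*?)"'
print(re.compile(pat).findall(sample))   # ['//img14.360buyimg.com/n7/jfs/example.jpg']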

II. Code

1. Import the libraries

import urllib.request   # fetch the search pages (with a custom opener)
import re               # regular expressions for extracting the image paths
import requests         # download the images and write them to disk

2. Add a request header

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]           # addheaders takes a list of (name, value) tuples
urllib.request.install_opener(opener)   # subsequent urlopen() calls will send this User-Agent
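Note that the installed opener only affects urllib.request.urlopen(); the requests.get() calls used later for downloading do not send this User-Agent. If that turns out to matter, requests accepts the same header as a dict, for example (a sketch reusing the string above, with newurl as defined in step 4):

ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0"}
r = requests.get(newurl, headers=ua, stream=True)   # same download call, but with an explicit User-Agent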

3. Set the product keyword

keyname = "洋河"                       # product keyword to search for
key = urllib.request.quote(keyname)    # percent-encode the keyword so it can go into the URL
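urllib.request.quote() percent-encodes the keyword so that the non-ASCII characters can be embedded in the search URL; for example:

print(urllib.request.quote("洋河"))   # %E6%B4%8B%E6%B2%B3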

4. Extract the image URLs and save the images locally

for i in range(1, 2):
    url = "https://search.jd.com/Search?keyword="+key+"&wq="+key+"&page="+str(i*2-1)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    print(data)                          # optional: dump the raw page HTML to check the fetch worked
    pat = 'data-lazy-img="(.*?)"'
    imagelist = re.compile(pat).findall(data)
    for j in range(1, len(imagelist)):   # j starts at 1 so the filenames are numbered from 1 (the first match is skipped)
        b1 = imagelist[j].replace('/n7', '/n0')   # swap the /n7 thumbnail path for the larger /n0 image
        print("第"+str(i)+"页第"+str(j)+"张爬取成功")
        newurl = "http:"+b1              # the matched paths are protocol-relative, so prepend the scheme
        print(newurl)
        r = requests.get(newurl, stream=True)
        with open('C:/Users/lishu/Desktop/tensorflow/pc/yh/'+"第"+str(i)+"页第"+str(j)+"张"+".jpg", 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)

5. Complete code

import urllib.request
import re
import requests
keyname = "洋河"                       # product keyword to search for
key = urllib.request.quote(keyname)    # percent-encode the keyword for the URL
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
for i in range(1, 2):                   # number of result pages to crawl
    url = "https://search.jd.com/Search?keyword="+key+"&wq="+key+"&page="+str(i*2-1)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    pat = 'data-lazy-img="(.*?)"'
    imagelist = re.compile(pat).findall(data)
    for j in range(1, len(imagelist)):  # j starts at 1 so the filenames are numbered from 1
        b1 = imagelist[j].replace('/n7', '/n0')   # swap the /n7 thumbnail path for the larger /n0 image
        print("第"+str(i)+"页第"+str(j)+"张爬取成功")
        newurl = "http:"+b1             # the matched paths are protocol-relative, so prepend the scheme
        print(newurl)
        r = requests.get(newurl, stream=True)
        with open('C:/Users/lishu/Desktop/tensorflow/pc/yh/'+"第"+str(i)+"页第"+str(j)+"张"+".jpg", 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                f.write(chunk)
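One practical caveat: open() raises an error if the target folder does not exist, so it can help to create the output directory before the loop. A small sketch, assuming the same path as above:

import os
os.makedirs('C:/Users/lishu/Desktop/tensorflow/pc/yh/', exist_ok=True)   # create the folder (and any parents) if missing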

 


Summary

The main point here is the case where urllib.request.urlretrieve() cannot save files to a path containing Chinese characters; instead, requests.get() is used to download the images and write them to the local disk.
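For comparison, the call being replaced would look roughly like the sketch below; per the point above, this is the form that can fail when the destination path contains Chinese characters, which is why the code writes the response bytes with requests.get() and open(..., 'wb') instead:

# single-call download via urlretrieve, using the same newurl and output path as in the code above
urllib.request.urlretrieve(newurl, 'C:/Users/lishu/Desktop/tensorflow/pc/yh/'+"第"+str(i)+"页第"+str(j)+"张"+".jpg")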

Reposted from blog.csdn.net/weixin_42748604/article/details/109098616