1. Fengniao pictures - Introduction
The National Day holiday is over and work has started again, so today we continue by crawling another site: http://image.fengniao.com/, the image section of Fengniao (蜂鸟网), a gathering place for skilled photographers. This tutorial is for learning purposes only, not for commercial use. Unsurprisingly, the content on Fengniao is copyrighted.
2. Fengniao pictures - Site analysis
The first step is to work out how the site can be crawled: open the page and look for the pagination requests.
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=1&not_in_id=5352384,5352410
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=2&not_in_id=5352384,5352410
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=3&not_in_id=5352384,5352410
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=4&not_in_id=5352384,5352410
The key parameter in these URLs is page=1, which is the page number. The troublesome part is that the response does not tell us the last page number, so we cannot know in advance how many iterations are needed. When we write the code later, we can only use a while loop.
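The open-ended pagination described above can be sketched as follows. This is only an illustration: the fetch callable, the max_pages safety cap, and the rule "an empty response means there are no more pages" are assumptions, not something the site documents.

```python
from typing import Callable, Iterator, Optional

LIST_URL = ("http://image.fengniao.com/index.php"
            "?action=getList&class_id=192&sub_classid=0"
            "&page={page}&not_in_id={not_in_id}")


def build_list_url(page: int, not_in_id: str = "5352384,5352410") -> str:
    """Fill the page number into the list URL seen in the browser."""
    return LIST_URL.format(page=page, not_in_id=not_in_id)


def iter_pages(fetch: Callable[[str], Optional[str]],
               max_pages: int = 1000) -> Iterator[str]:
    """Yield response bodies page by page until fetch() returns nothing.

    fetch is a hypothetical callable that takes a URL and returns the
    response text, or None on failure. max_pages is a safety cap so the
    loop cannot run forever.
    """
    for page in range(1, max_pages + 1):
        body = fetch(build_list_url(page))
        if not body:  # assumption: empty or None means no more pages
            break
        yield body
```

Passing the download function in as a parameter keeps the loop independent of whichever HTTP helper is used.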
These URLs return JSON, which is very crawler-friendly and saves us from parsing the response with regular expressions.
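As a small sketch of why JSON is convenient here, the response can be handled with the standard json module alone. The field names status, data, title and url are the ones the article's code reads; the sample payload itself is made up for illustration.

```python
import json


def extract_items(body: str):
    """Return (title, url) pairs from one list response.

    Returns an empty list when the status field is not 1.
    """
    payload = json.loads(body)
    if payload.get("status") != 1:
        return []
    return [(item["title"], item["url"]) for item in payload.get("data", [])]


# Hypothetical sample payload, shaped like the fields used in this article.
sample = ('{"status": 1, "data": [{"title": "Example album", '
          '"url": "http://image.fengniao.com/slide/535/5352130_1.html"}]}')
print(extract_items(sample))
```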
Next, inspect the request for the first page to check for anti-crawling measures.
Apart from Host and User-Agent, there is nothing special in the request headers. Big sites can afford to be careless: there is essentially no anti-crawling protection, and they probably do not care much about it.
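For illustration, here is a minimal sketch of a request carrying such headers, built with the standard library's urllib.request instead of the http_help helper used later. The Referer and X-Requested-With values mirror the ones in the article's code; the User-Agent string is a placeholder assumption.

```python
from urllib.request import Request


def build_request(url: str) -> Request:
    """Build a request that carries the headers a browser would send."""
    headers = {
        "Host": "image.fengniao.com",
        "Referer": "http://image.fengniao.com/",
        "X-Requested-With": "XMLHttpRequest",
        # Assumption: any common browser User-Agent; this value is a placeholder.
        "User-Agent": "Mozilla/5.0",
    }
    return Request(url, headers=headers)


req = build_request(
    "http://image.fengniao.com/index.php?action=getList"
    "&class_id=192&sub_classid=0&page=1&not_in_id=5352384,5352410"
)
print(req.header_items())
```

Passing req to urllib.request.urlopen would send it; that call is left out so the sketch runs without network access.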
The second step is to analyze the image detail page. From the JSON obtained above, find the key address.
Opening that address reveals a slightly awkward twist: the URL marked in the picture above was a poor choice, since it happens to point to an article, while what we want is a photo set. Here is a new link instead: http://image.fengniao.com/slide/535/5352130_1.html#p=1
After opening the page, you might be tempted to look for a pattern right away, and you will find the series of links below. That route is somewhat complicated, though, so let's look at the source code of the page instead.
http://image.fengniao.com/slide/535/5352130_1.html#p=1
http://image.fengniao.com/slide/535/5352130_1.html#p=2
http://image.fengniao.com/slide/535/5352130_1.html#p=3
....
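One detail worth noting about these links: they differ only in the part after #, which is a URL fragment. Fragments are handled by the browser and are not sent to the server, so a single request presumably returns the same HTML for all of them. A quick sketch with the standard library shows that the links collapse to one address:

```python
from urllib.parse import urldefrag

slide_urls = [
    "http://image.fengniao.com/slide/535/5352130_1.html#p=1",
    "http://image.fengniao.com/slide/535/5352130_1.html#p=2",
    "http://image.fengniao.com/slide/535/5352130_1.html#p=3",
]

# Drop the fragment and keep only the distinct pages that need fetching.
unique_pages = {urldefrag(u).url for u in slide_urls}
print(unique_pages)
```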
In the page source, we find an area like this:
A bold guess: this is the JSON for the pictures, simply printed into the HTML. We only need a regular expression to match it, and then download the matches.
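A minimal sketch of that matching step is shown below. The pic_url_1920_b key and the backslash stripping come from the article's code; the sample HTML, including the picList variable name and the example.com addresses, is invented for illustration.

```python
import re

# Capture the value of every "pic_url_1920_b" key embedded in the HTML.
PIC_PATTERN = re.compile(r'"pic_url_1920_b":"(.*?)"')


def extract_image_urls(html: str):
    """Return the image URLs found in the page, with JSON escaping removed."""
    # Slashes are escaped as \/ in the embedded JSON, so strip the backslashes.
    return [raw.replace("\\", "") for raw in PIC_PATTERN.findall(html)]


# Hypothetical page fragment, for illustration only.
sample_html = r'''<script>var picList = [
{"pic_url_1920_b":"http:\/\/example.com\/a.jpg"},
{"pic_url_1920_b":"http:\/\/example.com\/b.jpg"}];</script>'''

print(extract_image_urls(sample_html))
```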
The third step is to start writing the code.
3. Fengniao pictures - Writing the code
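from http_help import R  # Helper from the previous post in this series (also on GitHub)
import threading
import time
import json
import re

img_list = []                 # Shared buffer of {"title": ..., "urls": [...]} items
imgs_lock = threading.Lock()  # Guards access to img_list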
# Producer: walks the list pages and collects image URLs
class Product(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.__headers = {"Referer": "http://image.fengniao.com/",
                          "Host": "image.fengniao.com",
                          "X-Requested-With": "XMLHttpRequest"
                          }
        # URL template for the list pages
        self.__start = "http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page={}&not_in_id={}"
        self.__res = R(headers=self.__headers)

    def run(self):
        # The number of pages is unknown, so use a while loop
        index = 1  # Start from page 1
        not_in = "5352384,5352410"
        while True:
            url = self.__start.format(index, not_in)
            print("Fetching: {}".format(url))
            index += 1
            content = self.__res.get_content(url, charset="gbk")
            if content is None:
                print("There is probably no more data ====")
                break  # Stop instead of requesting further pages forever
            time.sleep(3)  # Wait 3 seconds between list requests
            json_content = json.loads(content)
            if json_content["status"] == 1:
                for item in json_content["data"]:
                    title = item["title"]
                    child_url = item["url"]  # Detail page of this item
                    img_content = self.__res.get_content(child_url, charset="gbk")
                    if img_content is None:
                        continue  # Skip detail pages that failed to load
                    pattern = re.compile('"pic_url_1920_b":"(.*?)"')
                    imgs_json = pattern.findall(img_content)
                    if len(imgs_json) > 0:
                        with imgs_lock:
                            # A dict holding the title and its URL list; the title is
                            # meant for creating folders later, adapt it as you like
                            img_list.append({"title": title, "urls": imgs_json})
With the image links collected above, the next part downloads the pictures, which is very simple.
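# Consumer: downloads the collected images
class Consumer(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.__res = R()

    def run(self):
        while True:
            data = None
            with imgs_lock:
                if len(img_list) > 0:
                    data = img_list.pop(0)  # Take and remove the first item
            if data is None:
                time.sleep(0.1)  # Nothing to download yet; avoid a busy loop
                continue
            urls = [url.replace("\\", "") for url in data["urls"]]
            # Create the file directory (not implemented; everything goes into ./fengniaos/)
            for item_url in urls:
                try:
                    file = self.__res.get_file(item_url)
                    # Remember to create the fengniaos folder in the project root first
                    with open("./fengniaos/{}".format(str(time.time()) + ".jpg"), "wb+") as f:
                        f.write(file)
                except Exception as e:
                    print(e)


if __name__ == "__main__":
    # Start one producer and one consumer thread
    Product().start()
    Consumer().start()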
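The code above hands work from the producer to the consumer through a shared list and a lock. As an alternative sketch (not the approach used in this article), the standard library's queue.Queue does the locking itself and lets the consumer block until work arrives. The album data, the None sentinel and the handled list below are illustrative assumptions; no downloading takes place.

```python
import queue
import threading


def producer(albums, q):
    """Put every album on the queue, then a sentinel to signal the end."""
    for album in albums:
        q.put(album)
    q.put(None)  # sentinel: no more work


def consumer(q, handled):
    """Take albums off the queue until the sentinel arrives."""
    while True:
        album = q.get()  # blocks until an item is available
        if album is None:
            break
        # A real consumer would download album["urls"] here.
        handled.append(album["title"])


# Hypothetical data in the same {"title": ..., "urls": [...]} shape.
albums = [{"title": "A", "urls": ["u1"]}, {"title": "B", "urls": ["u2", "u3"]}]
q = queue.Queue()
handled = []

threads = [
    threading.Thread(target=producer, args=(albums, q)),
    threading.Thread(target=consumer, args=(q, handled)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(handled)
```

With several consumers, one sentinel per consumer would be needed so that each of them stops.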
Run the code and check the results.