Getting Started with Python Crawlers [6]: crawling pictures from the Fengniao network

1. Fengniao pictures - introduction

The National Day holiday is over and work has resumed. Today we continue with a new site to crawl, http://image.fengniao.com/, a gathering place for Fengniao (Hummingbird) photography experts. This tutorial is for learning only, not for commercial use; unsurprisingly, Fengniao's images are copyrighted.


2. Fengniao pictures - web analytics

The first step is to analyze whether and how the site can be crawled. Open the pages and look for the pagination:

http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=1&not_in_id=5352384,5352410
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=2&not_in_id=5352384,5352410
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=3&not_in_id=5352384,5352410
http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page=4&not_in_id=5352384,5352410

In the URLs above, the key parameter is page=1, the page number. There is one awkward issue, though: the site exposes no last-page number, so we have no way to know the loop count in advance. When writing the code later, we can only use a while loop.
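That open-ended paging can be sketched as a generator. This is a hypothetical sketch, not the article's code: the stop condition (status != 1 or an empty "data" list) is an assumption based on the JSON fields the crawler checks later, and the fetch function is injected so the sketch stays self-contained.

```python
import json

# Assumed paging template, copied from the URLs observed above.
BASE = ("http://image.fengniao.com/index.php?action=getList"
        "&class_id=192&sub_classid=0&page={}&not_in_id=5352384,5352410")

def iter_pages(fetch):
    """Yield the "data" list of each page; fetch(url) must return raw JSON text."""
    page = 1
    while True:  # no known last page, hence while instead of for
        payload = json.loads(fetch(BASE.format(page)))
        # Assumed end-of-data signal: bad status or nothing left to return.
        if payload.get("status") != 1 or not payload.get("data"):
            break
        yield payload["data"]
        page += 1
```

Any HTTP helper (the article's `R` class, `urllib`, etc.) can be plugged in as `fetch`.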

This address returns its data in JSON format, which is very crawler-friendly! It saves us from having to parse with regular expressions.


Analyze the request for this first page of data and check whether there are any anti-crawling measures:


It turns out that apart from Host and User-Agent there is nothing special. Big sites can afford to be casual: anti-crawling is minimal here, and they probably do not care much about it.

The second step: analyze the image detail page. From the JSON obtained above, find the key address:


Opening that key address reveals a rather tricky twist: the URL obtained above is a poor pick, as it happens to lead to an article page, while what we want are the photos. It does, however, provide a further link: http://image.fengniao.com/slide/535/5352130_1.html#p=1

Open that page and you may spot the pattern directly: a batch of links of the following form. Following each of them would be a bit cumbersome, so instead we go to the source code of these pages.

http://image.fengniao.com/slide/535/5352130_1.html#p=1
http://image.fengniao.com/slide/535/5352130_1.html#p=2
http://image.fengniao.com/slide/535/5352130_1.html#p=3
....
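A side note on those links: everything after # is a URL fragment, which the browser keeps to itself and never sends to the server, so #p=1, #p=2, ... all fetch the same HTML document. That is why a single request per slide page is enough. The standard library makes the split explicit:

```python
from urllib.parse import urldefrag

url = "http://image.fengniao.com/slide/535/5352130_1.html#p=2"
# urldefrag splits a URL into the part that is actually requested
# and the client-side-only fragment.
request_url, fragment = urldefrag(url)
print(request_url)  # http://image.fengniao.com/slide/535/5352130_1.html
print(fragment)     # p=2
```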

In the page source we find a region like this:


A bold guess: this should be the JSON describing the pictures, just printed inside the HTML. We only need a regular expression to match it; once matched, we can download the images.
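That guess is easy to verify with a small regular-expression sketch. The HTML fragment below is illustrative only, modeled on the pic_url_1920_b entries the page embeds, not a real response:

```python
import re

# Illustrative fragment: JSON printed inside the HTML, with escaped slashes.
html = ('... "pic_url_1920_b":"http:\\/\\/img.fengniao.com\\/a.jpg",'
        ' "pic_url_1920_b":"http:\\/\\/img.fengniao.com\\/b.jpg" ...')

# Non-greedy capture of everything between the quotes after the key.
pattern = re.compile('"pic_url_1920_b":"(.*?)"')
# The JSON escapes "/" as "\/", so unescape it to get usable URLs.
urls = [u.replace("\\/", "/") for u in pattern.findall(html)]
print(urls)  # ['http://img.fengniao.com/a.jpg', 'http://img.fengniao.com/b.jpg']
```

The crawler code below strips the backslashes with `url.replace("\\","")`, which has the same effect on these escapes.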

The third step: get the code going.


3. Fengniao pictures - writing the code

from http_help import R  # get this file from the previous blog post, or from GitHub
import threading
import time
import json
import re

img_list = []
imgs_lock = threading.Lock()  # lock guarding img_list

# producer class
class Product(threading.Thread):

    def __init__(self):
        threading.Thread.__init__(self)

        self.__headers = {"Referer":"http://image.fengniao.com/",
                          "Host": "image.fengniao.com",
                          "X-Requested-With":"XMLHttpRequest"
                          }
        # URL template
        self.__start = "http://image.fengniao.com/index.php?action=getList&class_id=192&sub_classid=0&page={}&not_in_id={}"
        self.__res = R(headers=self.__headers)

    def run(self):

        # the number of iterations is unknown, so use a while loop
        index = 1  # start from page 1
        not_in = "5352384,5352410"
        while True:
            url  = self.__start.format(index,not_in)
            print("Processing: {}".format(url))
            index += 1

            content = self.__res.get_content(url,charset="gbk")

            if content is None:
                print("The data has probably run out ====")
                break

            time.sleep(3)  # sleep for 3 seconds between pages
            json_content = json.loads(content)

            if json_content["status"] == 1:
                for item in json_content["data"]:
                    title = item["title"]
                    child_url =  item["url"]   # got the detail-page link

                    img_content = self.__res.get_content(child_url,charset="gbk")

                    pattern = re.compile('"pic_url_1920_b":"(.*?)"')
                    imgs_json = pattern.findall(img_content)
                    if len(imgs_json) > 0:

                        if imgs_lock.acquire():
                            img_list.append({"title":title,"urls":imgs_json})   # a dict-plus-list layout, so a folder per title can be created later; adapt as you like
                            imgs_lock.release()

The links are now being produced; what remains is downloading the pictures, which is very simple.

# consumer
class Consumer(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        self.__res = R()

    def run(self):

        while True:
            if len(img_list) <= 0:
                time.sleep(0.5)  # back off briefly instead of busy-waiting
                continue  # then check again

            if imgs_lock.acquire():

                data = img_list[0]
                del img_list[0]  # pop the first item

                imgs_lock.release()

            urls =[url.replace("\\","") for url in data["urls"]]

            # download every picture in this set
            for item_url in urls:
               try:
                   file =  self.__res.get_file(item_url)
                   # remember to create the fengniaos folder in the project root first
                   with open("./fengniaos/{}".format(str(time.time())+".jpg"), "wb+") as f:
                       f.write(file)
               except Exception as e:
                   print(e)
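As a design note, the hand-rolled img_list + Lock (and the consumer's polling of an empty list) could be replaced by the standard library's queue.Queue, which is thread-safe and blocks until an item arrives. A minimal self-contained sketch of that pattern, with stand-ins for the Product and Consumer classes:

```python
import queue
import threading

task_queue = queue.Queue()  # thread-safe; replaces the manual img_list + Lock
results = []

def producer():
    # stand-in for Product: push one work item per page
    for page in range(1, 4):
        task_queue.put({"title": "page-{}".format(page), "urls": ["u1", "u2"]})

def consumer():
    # stand-in for Consumer: get() blocks instead of busy-waiting
    while True:
        data = task_queue.get()
        if data is None:  # poison pill: no more work
            break
        results.append(data["title"])

prod = threading.Thread(target=producer)
cons = threading.Thread(target=consumer)
prod.start()
cons.start()
prod.join()
task_queue.put(None)  # tell the consumer to stop
cons.join()
print(results)  # ['page-1', 'page-2', 'page-3']
```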

The results of running the code:



Origin blog.51cto.com/14445003/2423289