Getting Started with Python Crawlers [4]: Crawling Images from Moko.cc (美空网) Without Logging In

Crawling Moko.cc Images Without Logging In ---- Introduction

It has been a while since the last post, so let's pick up where we left off and finish the Moko.cc crawler. This tutorial may not add many techniques you can use directly at work, because it is only part of an introductory series; experienced readers can skip it, or stick around and share some pointers.

Crawling Moko.cc Images Without Logging In ---- Crawler Analysis

First of all, we need to crawl a large number of users' personal home pages. I build each home page URL by splicing the user ID into a link template:

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html


On this page we look for a few key spots. Clicking the 平面拍摄 (studio photography) tab leads to a picture list page.
Now let's start on the code.

Get all the list pages

In the previous post I scraped data for 70,000 users (50,000+ in actual testing); now we need to read that data back in Python.

Here I use the very handy Python library pandas. If you are not familiar with it, just copy my code for now; I have commented it fully.

import pandas as pd

# Template for a user's image list page
user_list_url = "http://www.moko.cc/post/{}/list.html"
# Holds the list page URL of every user
user_profiles = []

def read_data():
    # Read the data from the CSV file with pandas
    df = pd.read_csv("./moko70000.csv")   # the file can be downloaded at the end of this article
    # Drop rows with duplicate nicknames
    df = df.drop_duplicates(["nikename"])
    # Sort by follower count, descending
    profiles = df.sort_values("follows", ascending=False)["profile"]

    for i in profiles:
        # Splice the link together
        user_profiles.append(user_list_url.format(i))

if __name__ == '__main__':
    read_data()
    print(user_profiles)
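If you don't have the CSV at hand, a tiny made-up frame (my own sample data, not the real file) exercises the same drop_duplicates and sort_values calls:

import pandas as pd

# Made-up rows in the same shape as moko70000.csv (column names taken from the code above)
df = pd.DataFrame({
    "nikename": ["a", "a", "b"],
    "follows": [10, 10, 99],
    "profile": ["aaa", "aaa", "bbb"],
})
df = df.drop_duplicates(["nikename"])                      # keeps the first "a" row
profiles = df.sort_values("follows", ascending=False)["profile"]
print(list(profiles))                                      # ['bbb', 'aaa']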

With the data in hand, we next need to fetch the picture list page. Look for a pattern in the page source shown below: find the right element and work out the regular expression for it.


Quickly write a regular expression:
<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>

Import the re and requests modules:

import requests
import re

# Common request headers (this User-Agent is reused throughout the post)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

# Fetch the image list page
def get_img_list_page():
    # Hard-code one address for easy testing
    test_url = "http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html"
    response = requests.get(test_url,headers=headers,timeout=3)
    page_text = response.text
    pattern = re.compile('<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
    # Extract the (link, image count) pairs
    page_list = pattern.findall(page_text)

Run it and we get:

[('/post/da39db43246047c79dcaef44c201492d/category/304475/1.html', '85'), ('/post/da39db43246047c79dcaef44c201492d/category/304476/1.html', '2'), ('/post/da39db43246047c79dcaef44c201492d/category/304473/1.html', '0')]

Let's keep improving the code. The results above contain an entry with a count of '0', which needs to be filtered out:

# Fetch the image list page
def get_img_list_page():
    # Hard-code one address for easy testing
    test_url = "http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/list.html"
    response = requests.get(test_url,headers=headers,timeout=3)
    page_text = response.text
    pattern = re.compile('<p class="title"><a hidefocus="ture".*?href="(.*?)" class="mwC u">.*?\((\d+?)\)</a></p>')
    # Extract the (link, image count) pairs
    page_list = pattern.findall(page_text)
    # Filter out entries with an image count of 0
    # (a comprehension avoids the classic bug of removing items
    #  from a list while iterating over it)
    page_list = [page for page in page_list if page[1] != '0']
    print(page_list)
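A quick demonstration of why calling remove() on a list you are iterating over is unsafe (and why the comprehension above is used instead):

pairs = [('a', '0'), ('b', '0'), ('c', '1')]
for p in pairs:
    if p[1] == '0':
        pairs.remove(p)
print(pairs)  # [('b', '0'), ('c', '1')] -- the second '0' entry was skipped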

That gives us the entry list pages. Next we should fetch every page of each category, and for that we need to look at a link like the one below:

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/1.html

This page is paginated (4 pages here), and each page holds 4*7 = 28 items.
So the basic page count formula is math.ceil(85/28).
Next we generate the links: the URL above needs to be turned into

http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/1.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/2.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/3.html
http://www.moko.cc/post/da39db43246047c79dcaef44c201492d/category/304475/4.html
    page_count =  math.ceil(int(totle)/28)+1
    for i in range(1,page_count):
        # Swap the page number with a regular expression
        pages = re.sub(r'\d+?\.html',str(i)+".html",start_page)
        all_pages.append(base_url.format(pages))
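As a self-contained dry run of those lines (every value below is hard-coded by me, and base_url is my assumed site prefix; nothing touches the network):

import math
import re

start_page = "/post/da39db43246047c79dcaef44c201492d/category/304475/1.html"
base_url = "http://www.moko.cc{}"   # assumed site prefix template
totle = "85"                        # the image count captured by the regex

page_count = math.ceil(int(totle) / 28) + 1   # ceil(85/28) = 4, +1 because range() excludes the end
for i in range(1, page_count):
    # Swap the trailing page number: 1.html -> i.html
    page = re.sub(r'\d+?\.html', str(i) + ".html", start_page)
    print(base_url.format(page))   # prints the four URLs listed above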

Once we have fetched enough links, as a beginner-friendly intermediate step you can store them in a CSV file, which makes the later stages easier to develop:

import math

# Accumulator for the generated page links
all_pages = []
# Site prefix template (my assumption: the captured hrefs are relative paths)
base_url = "http://www.moko.cc{}"

# Get all the pages
def get_all_list_page(start_page,totle):

    page_count =  math.ceil(int(totle)/28)+1
    for i in range(1,page_count):
        pages = re.sub(r'\d+?\.html',str(i)+".html",start_page)
        all_pages.append(base_url.format(pages))

    print("Fetched {} links so far".format(len(all_pages)))
    if(len(all_pages)>1000):
        pd.DataFrame(all_pages).to_csv("./pages.csv",mode="a+")
        all_pages.clear()
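A side note on the CSV round trip (a minimal sketch with a throwaway file name I made up): to_csv writes the index column, and by default also writes a header row on every append, which is worth knowing before reading the file back:

import pandas as pd

links = ["http://www.moko.cc/a/1.html", "http://www.moko.cc/a/2.html"]
# Same call style as above, but header=False avoids one header line per append
pd.DataFrame(links).to_csv("./pages_demo.csv", mode="a+", header=False)

# Read the (index, url) columns back the way the later code does
df = pd.read_csv("./pages_demo.csv", names=["no", "url"])
print(df["url"].tolist())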

Let the crawler fly for a while; I ended up with 80,000+ links here.


Good, we have the list data; now let's keep working with it. Does it feel a bit slow, and is the code a bit low-quality? OK, I admit it: this is written beginner-style (honestly, I was just being lazy). In a later article I will rework it into an object-oriented, multi-threaded version.


Next, let's analyze the data we are about to crawl.

Take the page http://www.moko.cc/post/nimusi/category/31793/1.html as an example. We need to grab the addresses marked with red boxes in the screenshot. Why these? Because clicking one of those thumbnails leads to the complete picture list.


We still fetch them with the crawler, in a few steps:

  1. Loop over the list data we just saved
  2. Fetch the page source
  3. Match all the links with a regular expression

import os

def read_list_data():
    # Read the saved data back in
    img_list = pd.read_csv("./pages.csv",names=["no","url"])["url"]

    # Loop over the data
    for img_list_page in img_list:
        try:
            response = requests.get(img_list_page,headers=headers,timeout=3)
        except Exception as e:
            print(e)
            continue
        # Regex to pull the image folders from the list page
        pattern = re.compile('<a hidefocus="ture" alt="(.*?)".*? href="(.*?)".*?>VIEW MORE</a>')
        img_box = pattern.findall(response.text)

        need_links = []  # image folders waiting to be crawled
        for img in img_box:
            need_links.append(img)

            # Create the directory
            file_path = "./downs/{}".format(str(img[0]).replace('/', ''))

            if not os.path.exists(file_path):
                os.mkdir(file_path)  # create the directory

        for need in need_links:
            # Fetch the image links from the detail page
            get_my_imgs(base_url.format(need[1]), need[0])

The code above has a few key points.

        pattern = re.compile('<a hidefocus="ture" alt="(.*?)".*? href="(.*?)".*?>VIEW MORE</a>')
        img_box = pattern.findall(response.text)

        need_links = []  # image folders waiting to be crawled
        for img in img_box:
            need_links.append(img)

This gets the directories to crawl. The regex captures two groups here, the title and the link; the title is mainly used to create the folders.
Creating folders needs the os module, so remember to import it.

            # Create the directory
            file_path = "./downs/{}".format(str(img[0]).replace('/', ''))

            if not os.path.exists(file_path):
                os.mkdir(file_path)  # create the directory
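One caveat worth noting: os.mkdir raises FileNotFoundError if the parent ./downs directory does not exist yet, so create it once up front (os.makedirs with exist_ok is a convenient way):

import os

# Create ./downs (and any missing parents) once; no error if it already exists
os.makedirs("./downs", exist_ok=True)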

After obtaining the detail page links, we visit each one and scrape all of its image links.

# Fetch the detail page data
def get_my_imgs(img,title):
    print(img)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
    response = requests.get(img, headers=headers, timeout=3)
    pattern = re.compile('<img src2="(.*?)".*?>')
    all_imgs = pattern.findall(response.text)
    for download_img in all_imgs:
        downs_imgs(download_img,title)
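The src2 attribute appears to be where the site keeps the real image URL (presumably for lazy loading); a quick offline check of the pattern against a made-up snippet (the URL is not a real one):

import re

# Made-up tag in the same shape as the detail page source
sample = '<img src2="http://img.example.com/photo1.jpg" class="pic">'
print(re.findall(r'<img src2="(.*?)".*?>', sample))
# ['http://img.example.com/photo1.jpg']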

Finally, write the picture download function and all the code is complete. Pictures are saved under a local path and named with a timestamp.


import time

def downs_imgs(img,title):

    headers ={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
    response = requests.get(img,headers=headers,timeout=3)
    content = response.content
    # Name the file with a timestamp
    file_name = str(int(time.time()))+".jpg"
    file = "./downs/{}/{}".format(str(title).replace('/','').strip(),file_name)
    with open(file,"wb+") as f:
        f.write(content)

    print("Done")

Run the code and wait for the pictures to come in.


While the code was running, an error appeared:


The cause is a path problem: the special character ... showed up in a path, and we need to deal with it the same way we dealt with / above. Handle it at your own discretion.
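A hedged sketch of a more general clean-up (my own helper, not the original author's code), stripping every character that is troublesome in file names rather than patching them one by one:

import re

def safe_name(title):
    # Remove characters that are illegal or awkward in file/directory names
    return re.sub(r'[\\/:*?"<>|.]', '', str(title)).strip()

print(safe_name('模特/写真...2018'))  # 模特写真2018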

The captured data looks like this:


Code that still needs improvement:

  1. The code is split into two parts and is purely procedural, which is poor style and needs refactoring
  2. The network request code is heavily duplicated; it should be abstracted out, with error handling added, since requests can currently fail (see the sketch after this list)
  3. Single-threaded code is inefficient; refer to the previous two articles for how to improve it
  4. There is no simulated login, so at most six pictures per set can be scraped; that is exactly why the data is saved to disk first, so it can be reused directly later
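As a taste of point 2, here is a minimal sketch of what a shared request helper might look like (my rough cut under assumed retry and timeout defaults, not the author's final refactor):

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

def fetch(url, retries=3, timeout=3):
    # One place for headers, timeouts, and error handling
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            print("request failed (attempt {}): {}".format(attempt + 1, e))
    return None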
