Using a Python crawler (with face detection) to capture high beauty-score pictures

1 Data source

Pictures that appear in the answers to all questions under the Zhihu topic "Beauty" (美女)

2 Crawler

Python 3, using the third-party libraries Requests, lxml, and AipFace; about 100+ lines of code in total

3 Necessary environment

Mac / Linux / Windows (Linux has not been tested but should work in theory; Windows misbehaved at first, which turned out to be because Windows restricts certain characters in local file names, so file names are now sanitized with a regular expression). No Zhihu login is required (i.e., no account or password needs to be provided); the face detection service does require a Baidu Cloud account (i.e., a Baidu Netdisk / Tieba account)
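For reference, that sanitization boils down to a whitelist regular expression; a minimal sketch using the same re.sub pattern as the code in section 8 (sanitize_filename is just an illustrative name):

import re

def sanitize_filename(name):
    # Replace anything that is not a word character, "-" or "." with "_",
    # which also removes the characters Windows forbids in file names
    return re.sub(r'(?u)[^-\w.]', '_', name)

print(sanitize_filename('88--author?--"question title"--0.jpg'))
# 88--author_--_question_title_--0.jpg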

4 Face detection library

AipFace is the face detection Python SDK provided by the Baidu Cloud AI Open Platform; the underlying service can also be called directly over HTTP, and it is free to use

http://ai.baidu.com/ai-doc/FACE/fk3co86lr
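As a quick orientation, here is a minimal sketch of calling the SDK (the full code in section 8 instead builds the V3 HTTP request by hand). It assumes the baidu-aip package's AipFace.detect(image, image_type, options) call; the credential placeholders and sample.jpg are stand-ins you need to replace:

import base64
from aip import AipFace

APP_ID = "your-app-id"          # placeholder: from the Baidu Cloud console
API_KEY = "your-api-key"        # placeholder
SECRET_KEY = "your-secret-key"  # placeholder

client = AipFace(APP_ID, API_KEY, SECRET_KEY)

with open("sample.jpg", "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

# Request the attributes that the filters in section 5 rely on
result = client.detect(image, "BASE64", {"face_field": "age,gender,beauty"})
print(result)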

5 Detection filter conditions

Filter out all pictures without faces (e.g., landscape photos, body shots that do not show a face, etc.)
Filter out all non-female faces (during crawling it turned out that the male pictures are basically celebrities, so they are not considered; note that AipFace gender recognition is sometimes inaccurate)
Filter out all non-real people, such as anime characters (AipFace confidence below 0.6; in the code this is the face_probability field)
Filter out all pictures with a low beauty score (AipFace beauty attribute below 45, mainly to save storage space; again, the AipFace score is not objective)
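As a rough sketch of how these conditions map onto the fields AipFace returns (mirroring the detective function in section 8; passes_filters is just an illustrative name):

BEAUTY_THRESHOLD = 45

def passes_filters(face):
    # face: one element of the AipFace V3 response's "face_list"
    if face["face_probability"] < 0.6:      # not confident this is a real face
        return False
    if face["gender"]["type"] != "female":  # keep only female faces
        return False
    if face["beauty"] < BEAUTY_THRESHOLD:   # beauty score below the threshold
        return False
    return True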

6 Implementation logic

Initiate an HTTP request through Requests to obtain one page of the discussion list under the topic "Beauty".
Parse the HTML content of each captured answer with lxml and collect the src attribute of every img tag.
Initiate an HTTP request through Requests to download the picture each src attribute points to (animated pictures are ignored).
Run face detection on the picture via an AipFace request, check whether a face was detected, and apply the filters from section "5 Detection filter conditions".
Persist the pictures that pass the filters to the local file system; the file name is beauty score + author + question title + sequence number.
Return to the first step and continue. (A compressed sketch of this loop follows below; the full implementation is in section 8.)
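A compressed sketch of that loop, assuming the constants and helpers defined in section 8 (BASE_URL, SOURCE, URL_QUERY, etree, fetch_activities, fetch_image, face_detective) are in scope; save_to_disk is a hypothetical stand-in for the inline file-writing code:

# Skeleton of the crawl loop (no error handling or rate limiting; see section 8)
url = BASE_URL % SOURCE + URL_QUERY
while url is not None:
    datums = fetch_activities(url)                # one page of the topic feed (JSON)
    for data in datums["data"]:                   # each answer on the page
        html = etree.HTML(data["target"]["content"])
        for src in html.xpath("//img/@src"):      # every image URL in the answer
            image = fetch_image(src)
            if face_detective(image):             # non-empty => passed the section 5 filters
                save_to_disk(image)               # hypothetical helper; section 8 inlines this
    url = None if datums["paging"]["is_end"] else datums["paging"]["next"]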

7 Crawl results

The pictures are stored directly in the folder (Angelababy makes a strong appearance). Among the pictures captured so far, the highest score apart from baby is 88. Personally I disagree with the ranking, since my wife did not get the highest score.

8 Code

  • 8.1 Version that uses the Baidu Cloud Python SDK directly (code removed)
  • 8.2 Version that constructs the HTTP request directly, without the SDK. The advantage of this version is that it does not depend on the SDK (Baidu Cloud currently has two interface versions, V2 and V3; at this stage both are supported, so using the SDK directly is fine. If Baidu drops V2 support at some point, you must either upgrade the SDK or use this hand-built HTTP version)
#coding: utf-8

import time
import os
import re

import requests
from lxml import etree

from aip import AipFace

# Baidu Cloud face detection credentials
# These three lines are the only values you must fill in
APP_ID = "xxxxxxxx"
API_KEY = "xxxxxxxxxxxxxxxxxxxxxxxx"
SECRET_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Directory for storing the pictures, relative to the current directory
DIR = "image"
# Beauty score threshold for filtering; lower it as you like if storage space is no concern
BEAUTY_THRESHOLD = 45

# Open Zhihu in a browser and copy this value from the developer tools; no login needed
# How to replace this value is explained below
AUTHORIZATION = "oauth c3cef7c66a1843f8b3a9e6a1e3160e20"

# Nothing below needs to be changed

# Page size for each request to Zhihu's discussion list; do not set it too high, be considerate
LIMIT = 5

# ID of the topic "美女" (Beauty), which is the parent topic of "颜值" (20013528)
SOURCE = "19552207"

# Pretend to be a normal browser request
USER_AGENT = "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.5 Safari/534.55.3"
# Pretend to be a normal browser request
REFERER = "https://www.zhihu.com/topic/%s/newest" % SOURCE
# Request URL for the discussion list of a topic
BASE_URL = "https://www.zhihu.com/api/v4/topics/%s/feeds/timeline_activity"
# Query parameters attached to the initial request URL
URL_QUERY = "?include=data%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Danswer%29%5D.target.is_normal%2Ccomment_count%2Cvoteup_count%2Ccontent%2Crelevant_info%2Cexcerpt.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cvoteup_count%2Ccomment_count%2Cvoting%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dtopic_sticky_module%29%5D.target.data%5B%3F%28target.type%3Dpeople%29%5D.target.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B%3F%28target.type%3Danswer%29%5D.target.author.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Darticle%29%5D.target.content%2Cauthor.badge%5B%3F%28type%3Dbest_answerer%29%5D.topics%3Bdata%5B%3F%28target.type%3Dquestion%29%5D.target.comment_count&limit=" + str(LIMIT)

# Given a URL, fetch the raw content (used to download pictures)
def fetch_image(url):
    try:
        headers = {
            "User-Agent": USER_AGENT,
            "Referer": REFERER,
            "authorization": AUTHORIZATION
        }
        s = requests.get(url, headers=headers)
    except Exception as e:
        print("fetch image fail. " + url)
        raise e

    return s.content

# Given a URL, fetch the corresponding JSON (a page of the topic's discussion list)
def fetch_activities(url):
    try:
        headers = {
            "User-Agent": USER_AGENT,
            "Referer": REFERER,
            "authorization": AUTHORIZATION
        }
        s = requests.get(url, headers=headers)
    except Exception as e:
        print("fetch last activities fail. " + url)
        raise e

    return s.json()

# Process one page of the returned discussion list
def process_activities(datums, face_detective):
    for data in datums["data"]:

        target = data["target"]
        if "content" not in target or "question" not in target or "author" not in target:
            continue

        # Parse the HTML content of each element in the list
        html = etree.HTML(target["content"])

        seq = 0

        #question_url = target["question"]["url"]
        question_title = target["question"]["title"]

        author_name = target["author"]["name"]
        #author_id = target["author"]["url_token"]

        print("current answer: " + question_title + " author: " + author_name)

        # Collect all image URLs in the answer
        images = html.xpath("//img/@src")
        for image in images:
            if not image.startswith("http"):
                continue
            s = fetch_image(image)
            
            # Call the face detection service
            scores = face_detective(s)

            for score in scores:
                filename = ("%d--" % score) + author_name + "--" + question_title + ("--%d" % seq) + ".jpg"
                filename = re.sub(r'(?u)[^-\w.]', '_', filename)
                # Note: illegal file name characters differ across platforms; only simple sanitization is done here, especially for author_name / question_title
                seq = seq + 1
                with open(os.path.join(DIR, filename), "wb") as fd:
                    fd.write(s)

            # Face detection is free, but it is rate limited (QPS limit)
            time.sleep(2)

    if not datums["paging"]["is_end"]:
        # Request URL for the next page of the discussion list
        return datums["paging"]["next"]
    else:
        return None

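# Helper for sanitizing file names (currently unused; process_activities applies the same regex inline)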
def get_valid_filename(s):
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '_', s)

import base64

# Call the Baidu face detection V3 HTTP endpoint directly (instead of going through the SDK)
def detect_face(image, token):
    try:
        URL = "https://aip.baidubce.com/rest/2.0/face/v3/detect"
        params = {
            "access_token": token
        }
        data = {
            "face_field": "age,gender,beauty,qualities",
            "image_type": "BASE64",
            "image": base64.b64encode(image)
        }
        s = requests.post(URL, params=params, data=data)
        return s.json()["result"]
    except Exception as e:
        print("detect face fail. " + URL)
        raise e

# Exchange the API key / secret key for a Baidu Cloud access token (needed by the V3 interface)
def fetch_auth_token(api_key, secret_key):
    try:
        URL = "https://aip.baidubce.com/oauth/2.0/token"
        params = {
            "grant_type": "client_credentials",
            "client_id": api_key,
            "client_secret": secret_key
        }
        s = requests.post(URL, params=params)
        return s.json()["access_token"]
    except Exception as e:
        print("fetch baidu auth token fail. " + URL)
        raise e

def init_face_detective(app_id, api_key, secret_key):
    # client = AipFace(app_id, api_key, secret_key)
    # Baidu Cloud V3 interface: an access token must be fetched first
    token = fetch_auth_token(api_key, secret_key)
    def detective(image):
        #r = client.detect(image, options)
        # Call the HTTP endpoint directly
        r = detect_face(image, token)
        # No face was detected
        if r is None or r["face_num"] == 0:
            return []

        scores = []
        for face in r["face_list"]:
            # Face confidence is too low
            if face["face_probability"] < 0.6:
                continue
            # Beauty score below the threshold
            if face["beauty"] < BEAUTY_THRESHOLD:
                continue
            # Gender is not female
            if face["gender"]["type"] != "female":
                continue
            scores.append(face["beauty"])

        return scores

    return detective

def init_env():
    if not os.path.exists(DIR):
        os.makedirs(DIR)

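# Script entry point: prepare the output directory, build the face detector, then crawl the topic feed page by page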
init_env()
face_detective = init_face_detective(APP_ID, API_KEY, SECRET_KEY)

url = BASE_URL % SOURCE + URL_QUERY
while url is not None:
    print("current url: " + url)
    datums = fetch_activities(url)
    url = process_activities(datums, face_detective)
    # Be considerate: do not decrease the crawler's sleep interval
    time.sleep(5)


# vim: set ts=4 sw=4 sts=4 tw=100 et:

9 Preparation for running

  • Install Python 3 (download it from python.org)
  • Install the requests, lxml, and baidu-aip libraries; all of them can be installed with one pip command: pip install requests lxml baidu-aip
  • Apply for the Baidu Cloud trial service (free of charge) under "Face recognition - Baidu AI", then fill the AppID, API Key, and Secret Key into the code
  • (Optional) Adjust the custom settings, such as the image storage directory, the beauty score threshold, the face confidence, etc.
  • (Optional) If the request fails with the response shown below, you need to fill in
    AUTHORIZATION, which can be copied from the browser developer tools (as shown in the figure below; the value is the same across different browsers and without logging in, since Zhihu is currently fairly open to crawlers; whether the value will change in the future is unknown)
{
    "error": {
        "message": "ZERR_NO_AUTH_TOKEN",
        "code": 100,
        "name": "AuthenticationInvalidRequest"
    }
}

Chrome browser: open any Zhihu link, open the developer tools, and look at the HTTP request headers; no login is required

- Run it ^*^

10 Conclusion

Because the filter is face detection, some "bonus" pictures may get screened out. Baidu's image recognition products also include a porn recognition API, which can rate how explicit or sexy an image is; you can use that API instead to look for such pictures.

https://cloud.baidu.com/product/imagecensoring

  • If you really don't want to apply for the Baidu Cloud service, you can simply comment out the face detection part and use the script as a plain crawler (a minimal sketch follows after this list)
  • The face detection part can be replaced with another vendor's service or a local model; Baidu Cloud is used here simply because it is free
  • I have captured thousands of photos and the results are quite good. If you are interested, just run the code and try it yourself
  • This article is just an example of a basic crawler plus data filtering to obtain higher-quality data, and I hope interested readers will run it themselves. Many parts of the code are easy to modify: changing the source topic, the captured data fields, or the picture filter conditions are the simplest tweaks. With a little more time you could switch to crawling a particular user's activity (for example "轮子哥", whose feed has high data quality), or explore which headers and query parameters in the HTTP request are actually necessary; the code only needs very local changes for that. As for face detection and other machine-learning APIs, they can provide a lot of filtering functionality, but figuring out which filters are reliable and usable is mostly a matter of experience and trial and error, which is a topic of its own. Also, I hope everyone keeps good coding habits
  • Finally, to state it again: there are bad cases in both the beauty scoring and the gender filtering, so please do not take the results too seriously
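For the first bullet above, a minimal sketch of running the script as a plain crawler, assuming the rest of the code stays unchanged: swap in a stub for init_face_detective that accepts every image (the returned score is only used to build the file name, so any placeholder number works):

# Hypothetical stub: accept every downloaded picture without calling Baidu Cloud
def init_face_detective(app_id, api_key, secret_key):
    def detective(image):
        # One dummy "score" so process_activities() saves the picture;
        # 0 simply becomes the score part of the file name
        return [0]
    return detective

With this stub the Baidu credentials are never used and every picture found in every answer is written to disk. If the baidu-aip package is not installed, also remove the "from aip import AipFace" line.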



Origin blog.csdn.net/XIe_0928/article/details/112358404