Girls, don't bother reading; this Python crawler is only for the guys. Everything inside is completely free

Author: Hourglass in the rain 

https://blog.csdn.net/qq_45906219/article/details/105889730

To be honest, I spent a few days redoing a project that others have already done, and I wasn't sure it was worth it, but once I started doing it myself I realized it definitely was. Cui Da's book is from 2018, and the web changes so fast that the interfaces covered in the book have changed since then. It took a lot of time to work all of this out, but I think it was worth it, so I reworked the code, implemented the functionality my own way, and I'd like to share it briefly.

It's already 2020 and I still hadn't crawled Toutiao (today's headlines). Does that make me out of date as a crawler writer? No matter. Even though the interface has changed, I'll walk through how to scrape Toutiao girl photos in 2020. This is an improved version of the project: I added many of my own ideas, and the parts that were hard to understand I reimplemented in a simpler way, which I think turned out well. Take a look.

If you think this girl looks nice, say so in the comments, and I'll teach you how to get the photos!

Project Introduction:

Using a simple process pool together with Ajax data crawling, we analyze Toutiao's keyword search pages, process them to get the link to each article page, collect all of the image links from those pages, and then download everything in batches.
I will present every step as a small code module; if you can't get it working, come at me with a knife! That's how confident I am.


Project technology:

Simple process pool:

I won't go deep into multiprocessing here; let me just briefly introduce the functions this project needs:

from multiprocessing import Pool  # import the Pool class
p = Pool(4)   # create a process pool with 4 worker processes
p.close()     # stop accepting new tasks into the pool
p.join()      # wait for all worker processes to finish

Calling join() on the Pool object waits for all child processes to finish. close() must be called before join(); once close() has been called, no new tasks can be submitted to the pool.
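
As a minimal sketch (not part of the project code), the usual map / close / join pattern looks like this:

from multiprocessing import Pool

def work(n):
    return n * n  # placeholder task

if __name__ == '__main__':
    p = Pool(4)                      # pool with 4 worker processes
    results = p.map(work, range(8))  # distribute the tasks over the workers
    p.close()                        # no new tasks may be submitted after this
    p.join()                         # wait for all workers to finish
    print(results)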


Ajax data crawling:

A lot of a site's content does not appear directly in the page source. For example, when you keep scrolling a page, the freshly loaded items are fetched one after another through an Ajax interface. This is a form of asynchronous loading: the original page itself does not contain much data; the data sits behind separate interfaces, and only when we request such an Ajax interface does the server return the data, which JavaScript then parses and renders onto the page. That is the model behind what we see in the browser.
More and more web pages use this asynchronous loading approach now, so crawling is not as easy as it used to be. The concept sounds a bit awkward, so let's get straight into practice!
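
As a minimal sketch (the URL below is a placeholder, not the real Toutiao endpoint), crawling an Ajax interface usually comes down to requesting the JSON endpoint directly and reading the structured response instead of parsing the page HTML:

import requests

# hypothetical Ajax endpoint, used only for illustration
url = 'https://example.com/api/search/content/'
params = {'keyword': '美女', 'offset': 0, 'count': 20}
headers = {'x-requested-with': 'XMLHttpRequest'}  # marks the request as XHR

r = requests.get(url, params=params, headers=headers)
if r.status_code == 200:
    data = r.json()     # the interface returns JSON instead of rendered HTML
    print(data.keys())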


Project implementation:

Screenshots of the final result:



Analyzing the Ajax interface to obtain the data:

The data includes:

  • Title of each page

  • URL of each page

Target URL (Toutiao search for the keyword 美女): https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3

How can you tell whether something is an Ajax interface? There are three main points:

The first point

Pay attention to my arrows: if you cannot find the article text or link anywhere in the page source (try searching for it there), it may well be an Ajax interface.

The second point

Find the URL indicated by the arrow under the XHR tab, click it and open the Preview; expand the entries freely and you will find many fields that match the article.

The third point

Still in the same screenshot, you can see in the request headers that X-Requested-With is XMLHttpRequest.
If all three points are satisfied at the same time, then it is an Ajax interface and the data is loaded asynchronously.

Getting the data

In the screenshot from the second point we can see entries 0, 1, 2, 3, 4, and so on. Open any of them and you will find everything we need inside; in the picture I marked it with a red arrow: each entry has a title and a page link. Once we have the link to each page, the rest is simple.
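
Judging from the Preview panel, each entry under data carries at least a title and an article_url; a minimal sketch of walking that structure (assuming resp_json is the parsed response) could look like this:

def iter_entries(resp_json):
    # entries 0, 1, 2, ... from the Preview panel live under the 'data' key
    for item in resp_json.get('data', []):
        title = item.get('title')
        url_page = item.get('article_url')
        if title and url_page:  # skip entries missing either field
            yield title, url_page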

Programming:

Getting the JSON file:

First, request the first page: https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3
But we cannot simply hand this page to the requests library, because the data comes from an Ajax interface; without the right parameters you are very likely to hit a captcha or a slider verification, which is a hassle either way. So we add the parameters; the concrete steps are as follows:

def get_page(offset):    # offset: each Ajax request loads a fixed number of results
                         # (20 here, as seen in the screenshot for the third point)
    global headers  # global variable, reused later when requesting the article pages
    headers = {
        'cookie': 'tt_webid=6821518909792273933; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=b4a776dd-f454-43c6-81cd-bd37cb5fd0ec; tt_webid=6821518909792273933; csrftoken=4a2a6afcc9de4484af87a2ff8cba0638; ttcid=8732e6def0484fae975c136222a44f4932; s_v_web_id=verify_k9o5qf2w_T0dyn2r8_X6CE_4egN_9OwH_CCxYltDKYSQj; __tasessionId=oxyt6axwv1588341559186; tt_scid=VF6tWUudJvebIzhQ.fYRgRk.JHpeP88S02weA943O6b6-7o36CstImgKj1M3tT3mab1b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68',
        'referer': 'https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3',
        'x-requested-with': 'XMLHttpRequest'
    }  # request headers with the necessary parameters
    params = {
        'aid': ' 24',
        'app_name': ' web_search',
        'offset': offset,
        'format': ' json',
        'keyword': ' 美女',
        'autoload': ' true',
        'count': ' 20',
        'en_qc': ' 1',
        'cur_tab': ' 1',
        'from': ' search_tab',
        'pd': ' synthesis',
        'timestamp': int(time.time())
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)  # build the url with urlencode()
    url = url.replace('=+', '=')  # urlencode turns the leading spaces into '+'; also note that
                                  # the current URL is completely different from the old one
    # print(url)
    try:
        r = requests.get(url, headers=headers)  # the parameters are already encoded into the url
        if r.status_code == 200:
            return r.json()  # return the parsed JSON, it is all dictionaries
    except requests.ConnectionError as e:
        print(e)

Note that the requested URL has changed compared with the book; I pointed this out in a comment in the code, so take a closer look.
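
A quick sanity check of get_page() (just a sketch; the field names follow what the interface returned at the time of writing) could look like this:

resp = get_page(0)  # first page, offset 0
if resp:
    print(list(resp.keys()))          # should contain 'data' among other keys
    print(len(resp.get('data', [])))  # roughly 20 entries per request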

Getting the title and URL:

def get_image(json):  # extract the title and image links from the JSON data
    if json.get('data'):  # only proceed if 'data' exists
        for item in json.get('data'):
            if item.get('title') is None:
                continue  # skip entries without a title
            title = item.get('title')  # get the title
            if item.get('article_url') is None:
                continue  # skip entries without an article link
            url_page = item.get('article_url')
            # print(url_page)
            rr = requests.get(url_page, headers=headers)
            if rr.status_code == 200:
                # roughly match the article data embedded in the page's script tag
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match is not None:
                    result = re.findall(r'img src=\\"(.*?)\\"', match.group(), re.S)
                    # the links are \u-escaped; they are decoded later in save_image()
                    yield {
                        'title': title,
                        'image': result
                    }

The links obtained here are all in Unicode-escaped format; in the download section below I give the fix. This is a hidden pitfall.
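
As a small illustration (the example link is made up; only the escaping pattern matters), this is roughly what the regex returns and how the encode/decode round trip from the download section normalizes it:

# a made-up example of a \u-escaped link, as extracted by the regex
raw = 'http:\\u002F\\u002Fp1.pstatp.com\\u002Forigin\\u002Fpgc-image\\u002Fexample.jpg'
fixed = raw.encode('utf-8').decode('unicode_escape')
print(fixed)  # http://p1.pstatp.com/origin/pgc-image/example.jpg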

Downloading the images:

def save_image(content):
    path = 'D://今日头条美女//'  # base download directory
    if not os.path.exists(path):  # create it if it does not exist yet
        os.mkdir(path)
    os.chdir(path)
    # ------------------------------------------
    title = content['title'].replace('\t', '')  # strip special characters, otherwise the folder cannot be created
    if not os.path.exists(title):  # one sub-folder per article, named after its title
        os.mkdir(title)
    os.chdir(title)
    print(title)
    for q, u in enumerate(content['image']):  # loop over the list of image links
        u = u.encode('utf-8').decode('unicode_escape')
        # encode then decode to turn the \u-escaped string into a usable URL
        # start downloading
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            # md5(r.content).hexdigest() could also be used for unique file names
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
                print(f'This series -----> downloaded image {q}')

After the variable u is encoded and then decoded, the URL looks normal again.
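
Because the script names files by their index q, re-running it will overwrite earlier downloads. A small variation (only a sketch, not part of the original code, but it uses the hashlib import that is already in the full script) is to name each file after the MD5 of its content, so identical images are written only once:

import os
from hashlib import md5

def write_unique(img_bytes):
    # name the file after the MD5 digest of its content
    file_name = md5(img_bytes).hexdigest() + '.jpg'
    if not os.path.exists(file_name):  # skip duplicates
        with open(file_name, 'wb') as fw:
            fw.write(img_bytes)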

Full project code:

# -*- coding: utf-8 -*-
# @Time     : 2020/5/1 9:34
# @Author   : 沙漏在下雨
# @Software : PyCharm
# @CSDN     : https://me.csdn.net/qq_45906219




import requests
from urllib.parse import urlencode  # 构造url
import time
import os
from hashlib import md5
from lxml import etree
from bs4 import BeautifulSoup
import re
from multiprocessing.pool import Pool




def get_page(offset):  # offset: each Ajax request loads a fixed number of results (20 here)
    global headers  # global variable, reused later when requesting the article pages
    headers = {
        'cookie': 'tt_webid=6821518909792273933; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=b4a776dd-f454-43c6-81cd-bd37cb5fd0ec; tt_webid=6821518909792273933; csrftoken=4a2a6afcc9de4484af87a2ff8cba0638; ttcid=8732e6def0484fae975c136222a44f4932; s_v_web_id=verify_k9o5qf2w_T0dyn2r8_X6CE_4egN_9OwH_CCxYltDKYSQj; __tasessionId=oxyt6axwv1588341559186; tt_scid=VF6tWUudJvebIzhQ.fYRgRk.JHpeP88S02weA943O6b6-7o36CstImgKj1M3tT3mab1b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68',
        'referer': 'https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3',
        'x-requested-with': 'XMLHttpRequest'
    }  # request headers with the necessary parameters
    params = {
        'aid': ' 24',
        'app_name': ' web_search',
        'offset': offset,
        'format': ' json',
        'keyword': ' 美女',
        'autoload': ' true',
        'count': ' 20',
        'en_qc': ' 1',
        'cur_tab': ' 1',
        'from': ' search_tab',
        'pd': ' synthesis',
        'timestamp': int(time.time())
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)  # build the url
    url = url.replace('=+', '=')  # urlencode turns the leading spaces into '+'; the URL differs from the old one
    # print(url)
    try:
        r = requests.get(url, headers=headers)  # the parameters are already encoded into the url
        if r.status_code == 200:
            return r.json()  # return the parsed JSON, it is all dictionaries
    except requests.ConnectionError as e:
        print(e)




def get_image(json):  # extract the title and image links from the JSON data
    if json.get('data'):  # only proceed if 'data' exists
        for item in json.get('data'):
            if item.get('title') is None:
                continue  # skip entries without a title
            title = item.get('title')  # get the title
            # if item.get('image_list') is None:  # old approach: check for an image list
            #     continue
            # urls = item.get('image_list')  # get the image links
            # for url in urls:  # loop over the urls
            #     url = url.get('url')
            #     # rebuild the full link
            #     url = 'http://p1.pstatp.com/origin/' + 'pgc-image/' + url.split('/')[-1]
            if item.get('article_url') is None:
                continue
            url_page = item.get('article_url')
            # print(url_page)
            rr = requests.get(url_page, headers=headers)


            if rr.status_code == 200:
                # roughly match the article data embedded in the page's script tag
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match is not None:
                    result = re.findall(r'img src=\\"(.*?)\\"', match.group(), re.S)
                    # for i in result:
                    #     print(i.encode('utf-8').decode('unicode_escape'))
                    # the links are \u-escaped; they are decoded later in save_image()
                    yield {
                        'title': title,
                        'image': result
                    }
            # the old approach yielded the \u-escaped values directly, so the links could not be fetched
            # yield {
            #     'title': title,
            #     'image': url
            # }  # return the title and link




def save_image(content):
    path = 'D://今日头条美女//'  # base download directory
    if not os.path.exists(path):  # create it if it does not exist yet
        os.mkdir(path)
    os.chdir(path)
    # ------------------------------------------
    title = content['title'].replace('\t', '')  # strip special characters, otherwise the folder cannot be created
    if not os.path.exists(title):  # one sub-folder per article, named after its title
        os.mkdir(title)
    os.chdir(title)
    print(title)
    for q, u in enumerate(content['image']):  # loop over the list of image links
        u = u.encode('utf-8').decode('unicode_escape')
        # encode then decode to turn the \u-escaped string into a usable URL
        # start downloading
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            # md5(r.content).hexdigest() could also be used for unique file names
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
                print(f'This series -----> downloaded image {q}')




def main(offset):
    json = get_page(offset)
    if not json:  # the request failed or returned nothing
        return
    for content in get_image(json):
        try:
            # print(content)
            save_image(content)
        except (FileExistsError, OSError):
            print('Failed to create the folder, the title contains special characters:')
            continue




if __name__ == '__main__':
    pool = Pool()
    groups = [j * 20 for j in range(8)]  # offsets 0, 20, 40, ..., 140
    pool.map(main, groups)  # pass each offset to main()
    pool.close()
    pool.join()


Project fixes:

  • Build the URL correctly, fixing the data = None bug that showed up in the returned JSON file

  • Add the necessary request parameters, which nicely avoids captchas and slider verification =. =

  • Encode and then decode the links to fix the mismatched \u-escaped URLs, as mentioned in the section on getting the URLs

  • Create one sub-folder per article when downloading, so the downloaded images are not a complete mess

  • Use a simple pool for the downloads to speed things up (a thread-pool variant is sketched below)
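
Since downloading images is mostly I/O-bound, a thread pool works just as well here as a process pool. The following is only a minimal sketch of that swap, using multiprocessing.pool.ThreadPool, which exposes the same map/close/join interface as the Pool used above:

from multiprocessing.pool import ThreadPool

if __name__ == '__main__':
    pool = ThreadPool(8)  # 8 worker threads instead of processes
    groups = [j * 20 for j in range(8)]  # the same offsets as before
    pool.map(main, groups)
    pool.close()
    pool.join()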

Remember to leave a free like when you finish reading

< END >

Sharing or watching is my greatest support
