Python crawler self-study series (6)


Preface

This is already the sixth installment; that went fast. If you have followed the earlier articles, this one should be relatively easy.
(Is that really true? I honestly don't know!!!)
After all, this article is about logging in past a verification code.


Method 1: Cookie login

This is a relatively simple and crude approach. It works best when no verification code is involved at all: I once used it to crawl my own profile data from CSDN, and for that it was great.


Because of how cookies and sessions work, you can get in for as long as the cookie is still valid.

I say this because it has actually worked for me (I had better not say which sites were crawled; you understand). Still, I have only tested it a few times, so I dare not guarantee 100% success.
Either way: grab the cookie right after logging in, and put it to use as soon as possible.

One more thing: before attempting any "high-end operation", why not spend twenty minutes trying this method first? What if it just works?
I once planned to log in manually with selenium. A quick test showed it can log you in, but crawling the data afterwards is not that simple (or maybe I have simply used selenium too little for data extraction).
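To make the idea concrete, here is a minimal sketch of reusing browser cookies with requests. The cookie names and values below are made up for illustration; copy the real ones from your browser's developer tools after logging in.

```python
import requests

# Reuse cookies copied from a logged-in browser session.
# NOTE: the cookie names ('UserName', 'UserToken') are hypothetical;
# use whatever the target site actually sets.
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'
session.cookies.update({
    'UserName': 'your_name',
    'UserToken': 'copied_from_browser',
})

# Every request made through this session now carries the login cookies,
# e.g.: resp = session.get('https://example.com/my/profile')
print(session.cookies.get('UserToken'))
```

As noted above, this only works while the cookie is still valid, so use it soon after logging in.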


Method 2: Optical Character Recognition

Note: skip this method if it goes over your head. Barring surprises, I will never use this method again in my life, let alone pytesseract.

The name alone sounds impressive.
It is actually just OCR.

Download verification code image

First, find a target URL. Start with simple captchas: black-and-white text.
This captcha image sample page has about sixty gray-scale captcha images, which is enough.

First, pull the images down. You can see that they are embedded in the page. What do you do with images like this?
It is not difficult. Images are files too, and as files they must be stored somewhere under the website's directory tree.
Embedded images are simply referenced by a relative path.

Join the site's root directory with the image's relative address and try it:
https://captcha.com/images/captcha/botdetect3-captcha-ancientmosaic.jpg
Can you download it now?
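Resolving a relative image path against the page it is embedded in can be done with the standard library's urljoin; this simply reproduces the address above:

```python
from urllib.parse import urljoin

# Resolve the captcha's relative src against the page it is embedded in.
base = 'https://captcha.com/captcha-examples.html'
src = 'images/captcha/botdetect3-captcha-ancientmosaic.jpg'
print(urljoin(base, src))
```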

import requests
import os
import time
from lxml import etree

def get_Page(url, headers):
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def parse_Page(html, headers):
    html_lxml = etree.HTML(html)
    datas = html_lxml.xpath('.//div[@class="captcha_images_left"]|.//div[@class="captcha_images_right"]')
    # Create the folder that will hold the captcha images
    file = 'D:/YZM'
    if not os.path.exists(file):
        os.mkdir(file)
    os.chdir(file)
    for data in datas:
        # Captcha names
        name = data.xpath('.//h3')
        # Captcha image links
        src = data.xpath('.//div/img/@src')
        count = 0
        for i in range(len(name)):
            # File name for each captcha image
            filename = name[i].text + '.jpg'
            img_url = 'https://captcha.com/' + src[i]
            response = requests.get(img_url, headers=headers)
            if response.status_code == 200:
                with open(filename, 'wb') as f:
                    f.write(response.content)
                count += 1
                print('Saved captcha image {}'.format(count))
                time.sleep(1)

def main():
    url = 'https://captcha.com/captcha-examples.html?cst=corg'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
    html = get_Page(url, headers)
    parse_Page(html, headers)

if __name__ == '__main__':
    main()

Baidu text recognition

Python does have a library for OCR, tesserocr, but its accuracy and ease of use are said to fall short of Baidu's service, so I will cover Baidu first.

Baidu OCR API documentation

I won't go over the interface's capabilities; the official documentation covers those.

I will focus on how to use it.

First, you have to register an account.


Create a Baidu AI application and obtain the AK/SK (API Key and Secret Key)


Code implementation and testing

from aip import AipOcr  # pip install baidu-aip
import os

i = 0  # images with no recognized text
j = 0  # recognized text lines

APP_ID = 'your APP_ID'
API_KEY = 'your API_KEY'
SECRET_KEY = 'your SECRET_KEY'

client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

# Read the images saved earlier
file_path = 'D:/YZM'
filenames = os.listdir(file_path)

for filename in filenames:
    # Join the folder path and file name to get each file's full path
    info = os.path.join(file_path, filename)

    with open(info, 'rb') as fp:
        image = fp.read()

    # Recognition options
    options = {
        'detect_direction': 'true',
        'language_type': 'CHN_ENG',
    }

    # Call the general text recognition endpoint
    result = client.basicGeneral(image, options)
    print(result)

    if result.get('words_result_num', 0) == 0:
        print(filename + ' : ----')
        i += 1
    else:
        for word in result['words_result']:
            print(filename + ' : ' + word['words'])
            j += 1

print('Captchas processed: {}'.format(i + j))
print('No text recognized: {}'.format(i))
print('Text recognized: {}'.format(j))

It's terrible: only two of them came out right...

It's okay, don't panic. Let's first try some digital image processing.

Digital image processing

from PIL import Image
import os

file_path = 'D:/YZM'
filenames = os.listdir(file_path)

for filename in filenames:
    info = os.path.join(file_path, filename)

    # Open the image
    image = Image.open(info)

    # Passing 'L' converts the image to gray scale
    gray = image.convert('L')

    # Binarize: pixels below the threshold become black, the rest white
    bw = gray.point(lambda x: 0 if x < 1 else 255, '1')

    bw.save(os.path.join(file_path, filename))

After this treatment, results are better overall, but only somewhat.
It is probably because I did not learn this well; after all, I scraped through digital image processing without much help from teachers or classmates. A large share of the originally black-and-white images come out washed out after processing.

But since the whole thing is automated, don't obsess over the failures; just retry more times.
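For what it's worth, the washout is easy to reproduce on a toy image: with a cutoff of 1, almost every gray pixel maps to white, while a midpoint cutoff such as 128 keeps dark strokes. A small demonstration (the pixel values are made up):

```python
from PIL import Image

# Two pixels: a dark "stroke" pixel and a light background pixel.
img = Image.new('L', (2, 1))
img.putpixel((0, 0), 60)     # dark stroke
img.putpixel((1, 0), 200)    # light background

low = img.point(lambda x: 0 if x < 1 else 255, '1')    # the threshold used above
mid = img.point(lambda x: 0 if x < 128 else 255, '1')  # a midpoint threshold

print(list(low.getdata()))  # the stroke is washed out to white
print(list(mid.getdata()))  # the stroke survives as black
```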


Method 3: Brute force

The methods above are tedious and long-winded. I don't like them, and they wasted my afternoon.
I am a pragmatist: black cat or white cat, any cat that catches mice is a good cat.


The first method needs a cookie, which can be intimidating, and grabbing the data afterwards with selenium is not easy either.
And however well the second method goes, it only does the recognition; in the end you still have to submit the data with a POST request.

Since I can already use selenium and cookies, what is stopping me? Pull the captcha image down directly and fill it in by hand. Brute force. Back then, the "Beginner to Confused" friends on our "Reptilian Pangolin" team did exactly that, and it worked.

Just get it done; never mind how I get it done.
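Whichever way the captcha text is obtained, the final submit step is an ordinary form POST. A hypothetical sketch follows; the URL and the field names ('username', 'password', 'captcha') are assumptions, so read the real form from your browser's network panel first.

```python
def build_login_payload(user, pwd, captcha_text):
    # The recognized captcha text is sent exactly like a value typed by hand.
    return {'username': user, 'password': pwd, 'captcha': captcha_text}

payload = build_login_payload('me', 'secret', 'AB3D')
# With requests, the actual submit would look like:
#   resp = requests.post('https://example.com/login', data=payload)
print(payload)
```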


Okay, to be honest: if you have to deal with captchas in large volume, the second method is the one worth digging into.


Next, let's look at something else.



A change of pace

Slider captcha: Classic crawler introduction (18) | Slider captcha recognition
We dealt with this one earlier. We are all big data students; I actually make an appearance in that article.

Cracking captchas: Classic crawler introduction (19) | Difficulty up: cracking the captcha


Let me also drop the scrapy and regular expression links again. I have to step away for a while over the next few days, so updates will slow down.

Scrapy: I want to learn Python secretly, and then stun everyone (day thirteen)
Everyone is very motivated about this one.

Regex: Today I will put the words here, and tomorrow I will have [Regular Expressions]
I personally like this article, but it got no traffic.



A couple of closing remarks: this series was originally planned as eight knowledge points plus two hands-on projects, but given my limited skills and some health issues, it wraps up with this sixth article. Now that the knowledge part is done, the two hands-on projects come next. They won't be watered down, and they won't be the same exercises everyone else does; they are the tough nuts the "Reptilian Pangolin" team actually ran into. Let's take a bite.

This article is a bit rambling, but it is not all filler, right?


Origin: blog.csdn.net/qq_43762191/article/details/113099271