Lecture 22: Basic principles of CAPTCHA anti-crawling

When browsing websites, we often run into verification codes. Most of the time they appear when logging in to an account, but they can also appear while simply visiting pages. Strictly speaking, all of these count as verification code anti-crawler measures.

In this lesson, we introduce the basic principles of verification code anti-crawler measures, the common types of verification codes, and ways to deal with them.

Verification code

The full name of the verification code is Completely Automated Public Turing test to tell Computers and Humans Apart, that is, a fully automated public Turing test for telling computers and humans apart. Taking the initial letters of the key words gives CAPTCHA. It is a public, fully automated program that determines whether the user is a computer or a human.

What is it used for? It has many uses, for example:

  • Adding a verification code at registration can, to a certain extent, prevent malicious bulk registration.
  • Adding a verification code at login can, to a certain extent, prevent brute-force password attacks.
  • Adding a verification code when comments are posted can, to a certain extent, prevent spam flooding.
  • Adding a verification code when voting can, to a certain extent, prevent vote manipulation.
  • When a website is visited too frequently or the browsing behavior looks abnormal, the visitor is likely a crawler. Showing a verification code at that point can, to a certain extent, stop the crawl.

In general, all of the above can be called verification code anti-crawler measures. Verification codes can hinder any behavior that a program can simulate: with a verification code in place, a machine runs into trouble when it tries to fully automate the process. How much trouble depends on how hard the verification code is to crack.

Verification code anti-crawler

So why does a verification code appear? In most cases it is because the site is being visited too frequently or the behavior looks abnormal, or because the site wants to directly restrict certain automated actions. The cases can be classified as follows:

  • In many cases, such as login and registration, verification codes are almost unavoidable. Their purpose is to limit malicious registration and brute-force attacks, which is also a form of anti-crawling.
  • When some websites detect excessively frequent access, they pop up a login window and require us to log in before continuing. Here the verification code is bound to the login form: once an anomaly is detected, forced login serves as the anti-crawler measure.
  • On more conventional websites, when the access frequency is slightly too high, a verification code pops up automatically for the user to recognize and submit, to confirm that the visitor is a real person. This restricts machine behavior and so acts as an anti-crawler measure.

These situations all limit the automated behavior of programs to a certain extent, so they can all be called anti-crawler measures.

The principle of verification code anti-crawler

Module 1 covered the basic concept of the Session. It lives on the server side and stores information about the current user's session. This information is essential to the verification code mechanism.

The server can store values in the Session object. Suppose we want to generate a graphic verification code, for example the four-digit code 1234.

First, the client has to display a verification code, and the information for it must come from the server. When the client requests the interface that generates verification codes, the server generates a code, here 1234, saves those four digits in the Session object, and returns 1234 to the client so that the client can render it. Alternatively, the server can return the generated verification code image directly. Either way, the client presents it and the user sees the content of the verification code.

After seeing the verification code, the user enters its content in the form. When the user clicks the submit button, the input is sent to the server, which compares it with the verification code stored in the Session. If the two match, the code was entered correctly, verification succeeds, and the request is let through and normal access is restored. If they do not match, verification fails and the user has to verify again.
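
The flow above can be sketched in a few lines. This is a simplified illustration rather than any framework's real Session API: the `SESSIONS` dictionary stands in for the server-side session store, and `issue_captcha` and `verify_captcha` are hypothetical names. Making each code single-use is also an assumption added here.

```python
import secrets

# Stand-in for the server-side session store: session id -> session data.
SESSIONS = {}


def issue_captcha(session_id):
    """Generate a four-digit code, save it in the session, and return it.

    A real service would usually render the code into an image before
    sending it to the client.
    """
    code = "".join(secrets.choice("0123456789") for _ in range(4))
    SESSIONS.setdefault(session_id, {})["captcha"] = code
    return code


def verify_captcha(session_id, submitted):
    """Compare the submitted value with the one stored in the session."""
    # pop() removes the stored code, so each code can be checked only once
    # (an assumption of this sketch) and the user must request a new one.
    expected = SESSIONS.get(session_id, {}).pop("captcha", None)
    return expected is not None and submitted == expected
```

If `verify_captcha` returns `False`, the server would issue a fresh code and ask the user to try again.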

At present, most verification codes on the market are built on this mechanism. They can be classified as follows:

  • For a graphic verification code, the server saves the content of the graphic in the Session, then either returns the verification code image or lets the client render it. After the user submits the form, the server compares the value stored in the Session with the value the user submitted.
  • For a behavior verification code, the server performs some calculations and stores a Key, Token, and similar information in the Session. The user must first complete verification on the client before the form can be submitted. Once client-side verification is complete, the client sends the key, token, code, and other values computed during verification to the server, which verifies them again. Only if the server-side check also passes is the verification considered truly successful.
  • For a mobile phone verification code, the server generates the code in advance and sends it, together with the target mobile phone number, to an SMS provider, which delivers the code to the user. The user then submits the code to the server, and the server checks whether the code stored in the Session matches the submitted one.

There are many other kinds of verification codes, and their principles are basically the same.
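
As an illustration of the mobile phone case, the sketch below stores the generated code in the session and hands it to a provider callback. Everything named here is hypothetical: `send_sms` stands in for whatever SMS provider is used, the session is a plain dictionary, and the six-digit length and the expiry time `CODE_TTL_SECONDS` are assumptions that the text above does not specify.

```python
import secrets
import time

CODE_TTL_SECONDS = 300  # hypothetical validity period


def request_sms_code(session, phone, send_sms, now=None):
    """Generate a six-digit code, keep it in the session, and send it out."""
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10**6):06d}"
    session["sms"] = {"phone": phone, "code": code, "expires": now + CODE_TTL_SECONDS}
    # Hand the number and the code to the SMS provider for delivery.
    send_sms(phone, code)


def check_sms_code(session, phone, submitted, now=None):
    """Check the submitted code against the one stored in the session."""
    now = time.time() if now is None else now
    entry = session.get("sms")
    if entry is None or entry["phone"] != phone or now > entry["expires"]:
        return False
    if submitted != entry["code"]:
        return False
    # A correct code is consumed so that it cannot be replayed.
    del session["sms"]
    return True
```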

Common verification codes

Let's take a look at some common verification codes on the market and briefly introduce some identification ideas.

Graphic verification code

The most basic verification code is the graphic verification code.
Generally speaking, there are several ways to recognize it:

  • OCR: use a library such as Tesserocr, or call an OCR API directly, such as those from Baidu and Tencent, which recognize better than Tesserocr.
  • Coding platform (a captcha-solving service): send the verification code to the platform, which either runs powerful recognition algorithms or has people behind it who specialize in recognition. This is fast and saves effort.
  • Deep learning: train a classification model such as a CNN on this type of verification code. If the codes come in many types or writing styles, however, recognition accuracy will suffer.
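
Whichever recognizer is used, graphic verification codes are often cleaned up first, because background noise hurts OCR accuracy. The sketch below shows one common preprocessing step, binarization followed by a crude noise filter, on a grayscale image held as a plain list of rows. It is an assumption-laden illustration: the function names and the threshold of 128 are made up for this example, and a real project would more likely use an image library.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (ink) or 255 (background)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def remove_isolated_pixels(binary):
    """Clear dark pixels that have no dark neighbor above, below, left, or right."""
    height = len(binary)
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(len(binary[y])):
            if binary[y][x] != 0:
                continue
            neighbors = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            has_dark_neighbor = any(
                0 <= ny < height
                and 0 <= nx < len(binary[ny])
                and binary[ny][nx] == 0
                for ny, nx in neighbors
            )
            if not has_dark_neighbor:
                cleaned[y][x] = 255  # treat a lone dark pixel as noise
    return cleaned
```

The cleaned image would then be passed to the OCR library or model of choice.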

Behavior verification code

Behavior verification codes now come in many types and are very popular. Jiyan (GeeTest), Tencent, NetEase Shield, and others all offer this kind of verification code service, with many verification methods such as sliding, dragging, clicking, and logical judgment.
The recommended identification schemes are as follows:

  • Coding platform: many behavior verification codes come down to coordinates. We can send a screenshot of the verification code to the coding platform, where someone locates the required position and returns its coordinates, and then simulate the action with those coordinates. There are two ways to simulate. One is to simulate the behavior itself, using Selenium or a similar tool; once the simulation completes, we can usually log in or lift the block on the session and obtain valid Cookies. The other is to simulate at the JavaScript level, which is more difficult; after the simulation we can directly obtain the Token values that the verification code submits.
  • Deep learning: the verification code can also be recognized with image annotation and deep learning methods. The main task is to identify the target position, after which the action can be simulated in the same way.

SMS and scan code verification

In addition, we may encounter verification by SMS or by scanning a code. These are more troublesome to handle. Some solutions are as follows:

  • If you do not need to use your own mobile phone number, you can obtain one from certain platforms. Such a platform maintains a system for sending and receiving SMS messages: you fill in the phone number it provides and retrieve the SMS verification code through its API.
  • Alternatively, you can buy dedicated SMS-receiving equipment or install software that monitors incoming SMS messages. It exports the messages to an interface, a text file, or a database, from which the code can then be extracted.
  • For scan code verification, if you do not need to use your own account, you can send the code to the coding platform and have the other party scan it with their own account. This requires a custom arrangement that you negotiate with the platform. The other approach involves reverse engineering: generally you have to reverse the scanning and parsing logic in the mobile App and then simulate it, which is not covered here.

Verification codes are broadly alike. The list here is not exhaustive, but most of them fall roughly into these basic categories.

So far we have introduced the basic principles of verification code anti-crawler measures and some ideas for recognizing verification codes. In the following lessons, I will show how to recognize verification codes using a coding platform and deep learning.

Origin blog.csdn.net/weixin_38819889/article/details/107906628