Advanced Crawlers: Anti-Crawler Techniques (5): Hidden Information on the Web

1. Pay attention to the information hidden on the web

In HTML forms, a "hidden" field makes its value available to the browser but invisible to the user (unless the user views the page source). As more and more websites came to use cookies to store state variables and manage user state, hidden fields fell out of favor for a while, until another good use was found for them: preventing crawlers from submitting forms automatically.

The figure below shows the hidden fields on the Facebook login page. Although the form has only three visible fields (username, password, and a submit button), the source shows that it sends a good deal of additional information to the server.

[Figure: Hidden fields on the Facebook login page]

Hidden fields are used to prevent web scraping in two main ways. In the first, a field on the form page is populated with a random value generated by the server. If that value is missing from the submission when it reaches the form processing page, the server has good reason to assume that the submission did not come from the original form page but was posted directly to the form processing page by a bot. The best way around this is to first fetch the page that contains the form, collect the randomly generated value, and then submit it to the form processing page along with the rest of the form.
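The "collect first, then submit" step can be sketched as follows. This is a minimal illustration using only the standard library, not a complete crawler: the sample form, the field name csrf_token, and its value are made up, and the helper names HiddenFieldCollector and build_payload are hypothetical.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # An attribute without a value is reported as None; send "" instead.
            self.fields[attr["name"]] = attr.get("value") or ""


def build_payload(form_html, user_values):
    """Merge the server-generated hidden values with the values we fill in."""
    collector = HiddenFieldCollector()
    collector.feed(form_html)
    payload = dict(collector.fields)  # start from the server's own values
    payload.update(user_values)       # then add the visible fields
    return payload


# Hypothetical form page; "csrf_token" and its value are made up for illustration.
SAMPLE_FORM = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="a8f3c1d9e0">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

payload = build_payload(SAMPLE_FORM, {"username": "alice", "password": "secret"})
print(payload)
```

In a real crawler, form_html would be the body of a GET request for the form page, and payload would be posted to the form's action URL within the same session, so that the server sees the value it has just generated.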

The second way is the "honeypot". If a form contains a hidden field with a common name such as "username" or "email address", a poorly designed bot often does not check whether the field is visible to the user; it simply fills the field in and submits it to the server, and so falls into the server's honeypot trap. The server discards any submission in which a hidden field carries a real value (or a value different from its default on the form page), and the visitor who filled in the hidden field may also be blocked from the website.
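Seen from the server side, a honeypot check might look roughly like the sketch below. It only illustrates the idea and is not taken from any particular framework; the function name changed_hidden_fields and the sample field names are assumptions.

```python
def changed_hidden_fields(defaults, submitted):
    """Return the names of hidden fields whose submitted value differs from the default.

    defaults  -- hidden field name -> value the server placed in the form
    submitted -- field name -> value received from the client
    """
    return sorted(
        name
        for name, default in defaults.items()
        # A hidden field that is absent from the submission counts as unchanged.
        if submitted.get(name, default) != default
    )


# Hypothetical form: "email" is a honeypot that is hidden and empty by default.
HONEYPOT_DEFAULTS = {"email": ""}

human = {"comment": "Nice article", "email": ""}
bot = {"comment": "Buy now", "email": "bot@example.com"}

print(changed_hidden_fields(HONEYPOT_DEFAULTS, human))  # nothing changed
print(changed_hidden_fields(HONEYPOT_DEFAULTS, bot))    # the honeypot was filled in
```

A crawler avoids this trap by leaving every hidden field exactly as the server delivered it.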

In short, it is sometimes necessary to check the page that contains the form to make sure you have not missed or mishandled anything the server expects, such as a honeypot trap. If you see hidden fields, often holding large random strings, the web server will very likely check them when the form is submitted. There may also be further checks that a form value is used only once or was generated recently, which prevents a value from simply being stored in a script and reused.
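To see why a stored value stops working, here is a minimal sketch of how a server could enforce "used only once" and "recently generated". The in-memory store, the 600-second lifetime, and the function names are assumptions chosen for illustration; real sites implement this in many different ways.

```python
import secrets
import time

TOKEN_LIFETIME_SECONDS = 600  # assumed validity window

_issued = {}  # token -> time at which it was issued


def issue_token(now=None):
    """Create a random token to embed in a hidden form field."""
    token = secrets.token_hex(16)
    _issued[token] = time.time() if now is None else now
    return token


def redeem_token(token, now=None):
    """Return True only the first time a sufficiently recent token is presented."""
    issued_at = _issued.pop(token, None)  # pop: each token can be redeemed once
    if issued_at is None:
        return False  # unknown token, or one that was already used
    now = time.time() if now is None else now
    return now - issued_at <= TOKEN_LIFETIME_SECONDS
```

Under such a scheme a crawler has to fetch a fresh form page before every submission; replaying a token captured earlier fails either the single-use check or the age check.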

2. Avoid entering the honeypot

Although CSS makes it easy to tell useful information from useless information when scraping (for example, by reading id and class attributes), it can sometimes cause problems. If a form field is hidden from users through CSS, it is reasonable to assume that an ordinary visitor cannot fill it in, because the browser does not display it. If the field does arrive filled in, a bot probably did it, and the submission is treated as invalid.

This applies not only to web forms but also to links, images, files, and any other content that a bot can read but an ordinary user cannot see in the browser. A visit to "hidden" content on a site can trigger a server-side script that blocks the visitor's IP address, logs the user out of the site, or takes some other action to deny further access. In fact, many business models depend on exactly this.

The web page used in the following example is at http://pythonscraping.com/pages/itsatrap.html. It contains two links, one hidden through CSS and the other visible. The page also contains two hidden form fields:

[Figure: HTML source of the example page]

These three elements are hidden from users in three different ways:

The first link is hidden with a simple CSS property: display: none.

The phone number field, name="phone", is a hidden input field.

The email address field, name="email", is hidden by moving the element 50,000 pixels to the right (presumably beyond the edge of any monitor) and hiding the scroll bar that would otherwise give it away.
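The three techniques can be recognized statically, at least in simple cases. The sketch below is a rough heuristic using only the standard library: the sample markup is hypothetical (it is not the source of the page above), it looks only at inline style attributes, and the 5,000-pixel threshold is an arbitrary assumption. Rules that live in a stylesheet or are applied by JavaScript are invisible to it.

```python
import re
from html.parser import HTMLParser

# Offsets at least this large are assumed to push an element off-screen.
OFFSCREEN_PIXELS = 5000
OFFSET_RE = re.compile(r"(?:left|right|margin-left|text-indent):(-?\d+)px")


def hiding_reason(tag, attr):
    """Return why an element looks hidden, or None if no rule matches."""
    style = (attr.get("style") or "").lower().replace(" ", "")
    if "display:none" in style:
        return "display:none"
    if tag == "input" and (attr.get("type") or "").lower() == "hidden":
        return "type=hidden"
    for match in OFFSET_RE.finditer(style):
        if abs(int(match.group(1))) >= OFFSCREEN_PIXELS:
            return "moved off-screen"
    return None


class HiddenElementFinder(HTMLParser):
    """Record links and inputs that appear to be hidden from the user."""

    def __init__(self):
        super().__init__()
        self.found = []  # (tag, href or name, reason)

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "input"):
            return
        attr = dict(attrs)
        reason = hiding_reason(tag, attr)
        if reason:
            label = attr.get("href") or attr.get("name") or ""
            self.found.append((tag, label, reason))


# Hypothetical markup that uses the three techniques with inline styles.
SAMPLE = """
<a href="/dont-go-here" style="display: none">A link</a>
<a href="/ok">Visible link</a>
<form>
  <input type="hidden" name="phone" value="do-not-modify">
  <input type="text" name="email" style="position:absolute; right:50000px">
  <input type="text" name="firstName">
</form>
"""

finder = HiddenElementFinder()
finder.feed(SAMPLE)
for item in finder.found:
    print(item)
```

Because a check like this cannot see what the browser actually renders, it is at best a first filter; a tool that renders the page gives a more reliable answer.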

Because Selenium actually renders the pages it visits, it can tell visible elements from hidden ones. The is_displayed() method reports whether an element is visible on the page.

For example, the following code fetches the page described above and looks for hidden links and hidden input fields:

[Figure: code listing]
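A minimal sketch of such a check, assuming Selenium 4 with Chrome and a matching driver available locally, could look like this. The helper name find_hidden_elements is hypothetical; find_elements(), get_attribute(), and is_displayed() are standard Selenium calls.

```python
TAG_NAME = "tag name"  # same string as selenium.webdriver.common.by.By.TAG_NAME


def find_hidden_elements(driver):
    """Return (hrefs of hidden links, names of hidden inputs) for the loaded page."""
    trap_links = [
        link.get_attribute("href")
        for link in driver.find_elements(TAG_NAME, "a")
        if not link.is_displayed()
    ]
    hidden_fields = [
        field.get_attribute("name")
        for field in driver.find_elements(TAG_NAME, "input")
        if not field.is_displayed()
    ]
    return trap_links, hidden_fields


def main():
    # Imported here so the helper above can be read and tested without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are installed
    try:
        driver.get("http://pythonscraping.com/pages/itsatrap.html")
        trap_links, hidden_fields = find_hidden_elements(driver)
    finally:
        driver.quit()

    for href in trap_links:
        print(f"The link {href} is a trap")
    for name in hidden_fields:
        print(f"Do not change value of {name}")
```

Calling main() opens a real browser and needs network access. Any other WebDriver, such as Firefox, can be substituted, since find_hidden_elements relies only on the generic WebDriver interface.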

Selenium catches every hidden link and field. The output is as follows:

[Figure: program output]

Although you are unlikely to want to visit the hidden links you find, before submitting a form, make sure that any hidden fields already populated in it are sent along with their original values (or let Selenium submit the form for you).

 

Origin blog.csdn.net/zhangge3663/article/details/108400489