Python application: what is a crawler?

Notes from the beginning of 2022. The introductory material on crawlers below is largely excerpted from Reference 1, which is well worth reading in full: it is comprehensive and interesting.

What is a crawler?

A crawler is a probing program. Its basic job is to imitate human behavior: it wanders around websites, clicks buttons, reads data, and copies down the information it sees, like a bug crawling tirelessly through a building.

The biggest crawlers of all belong to the search engines.

The Baidu you use every day relies on exactly this technology: every day it sends countless crawlers out to websites, captures their content, tidies it up a little, and queues it up waiting for you to retrieve it. A minimal sketch of that fetch-and-extract step is given below.
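Here is a minimal sketch of one crawl step, using only the Python standard library. The URL is just a placeholder; a real crawler would also respect robots.txt (discussed later), throttle itself, and keep a queue of pages to visit.

# A minimal sketch only: example.com is a placeholder URL, and a real crawler
# would also check robots.txt, rate-limit itself, and queue the links it finds.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_once(url):
    """Fetch one page and return the links found on it."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    for link in crawl_once("https://example.com/"):
        print(link)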

Are crawlers good by nature?

Like people, crawlers can be good or evil.

Search engine crawlers like Google's scan the whole web every few days so that everyone can search it, and most of the scanned websites are happy about it. These are defined as "good-faith crawlers".

Ticket-grabbing software, on the other hand, would happily hammer 12306 tens of thousands of times per second, which China Railway is not at all happy about. These are defined as "malicious crawlers". (Note that it doesn't matter whether the person grabbing tickets is happy; if the website being scanned is unhappy, the crawler is malicious.)

[Figure: proportion of crawler traffic by industry]

The figure above (from Reference 1) shows how heavily each industry is harassed by crawlers; note that the data covers the whole world.

Travel

In China's travel sector, 90% of crawler traffic is aimed at 12306. After all, it is the country's only railway ticketing operator.

Do you remember when 12306 rolled out "the worst image CAPTCHA in history", the one featuring Wang Luodan and Bai Baihe?

[Figure: 12306's image CAPTCHA]

These CAPTCHAs are not meant to torment honest ticket buyers; they exist to keep crawlers (that is, ticket-grabbing software) from clicking through. As mentioned above, crawlers can only click mechanically and have no idea who Bai Baihe is, so a large share of them is stopped at the door.

Of course, as the saying goes, "as virtue rises one foot, vice rises ten": not every crawler is stopped by Bai Baihe.

There is such a thing as a CAPTCHA-solving platform, which employs large numbers of people to recognize CAPTCHAs by hand. When the ticket-grabbing software meets an image CAPTCHA it has never seen before, it automatically forwards the image to the platform, a human labels it, and the answer is sent back. The whole round trip takes only a few seconds.

Such platforms also have a memory. Over time, essentially every image in 12306's pool gets labeled at least once; once that defense is breached, 12306 is left wide open to the crawlers.

So do you know what 12306 goes through every year before Chinese New Year? Public data says: "At peak times, page views reached 81.34 billion in a single day, with a one-hour maximum of 5.93 billion hits, an average of 1.648 million per second." And that is after CAPTCHA protection was added; imagine how many crawlers were blocked outside.

Likewise, airlines are not faring much better.

[Figure: distribution of crawler traffic among airlines]

The figure above shows how crawler traffic is distributed among airlines.

Take AirAsia, for example: a Malaysian low-cost airline whose routes mostly connect cities across China with tourist destinations in Southeast Asia. Even the bottled water on board costs extra, making it the budget traveler's first choice.

AirAsia often releases extremely cheap tickets. The intention is to attract tourists, but to scalpers this is a business opportunity.

It is said the scheme works like this:

Technically savvy scalpers use crawlers to refresh AirAsia's ticketing interface nonstop, and the moment a cheap ticket appears they book it, no matter what it is.

AirAsia has a rule that if a booked ticket is not paid for within half an hour (I don't remember the exact window), it returns to the pool and goes back on sale. The scalpers write that exact deadline into their crawler script: at precisely the half-hour mark, not a millisecond late, the script books the ticket again, over and over. When someone finally buys the ticket from the scalper, the program releases it in AirAsia's system and, a fraction of a second later, rebooks it under the buyer's name.

Social media

The worst-hit social platform is Weibo.

[Figure: Weibo URLs most frequently hit by crawlers]

The above are the Weibo URLs most frequented by crawlers.

The paths shown actually point to Weibo interfaces that return a user's post list, post status, various counters, and so on.
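For illustration only, here is roughly what calling such an interface looks like from Python. The endpoint, parameters, and field names below are hypothetical; Weibo's real API is different and requires authentication.

# Hypothetical sketch: the endpoint and the JSON field names are made up purely
# to show the shape of such a call. Weibo's real interface differs and needs auth.
import json
from urllib.request import urlopen

def fetch_posts(user_id):
    # hypothetical endpoint returning {"posts": [{"id": ..., "text": ..., "likes": ...}]}
    url = f"https://api.example.com/users/{user_id}/posts?count=20"
    with urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return data.get("posts", [])

for post in fetch_posts("123456"):
    print(post["id"], post["likes"], post["text"][:40])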

What can you do once you have these?

Think about it: if I can command a swarm of bots to open someone's Weibo, scroll to a particular post, and then frantically follow, like, and comment, isn't that exactly the daily routine of zombie followers?

Zombie followers are just a routine crawler operation; there are fancier plays that can even earn money while you lie in bed:

  1. I'm a nobody and no one follows my Weibo, so I use a swarm of crawlers to give myself 100,000 zombie followers, who cheerfully like and comment on my posts.

  2. I go to an advertiser, say an insurance company, and tell them: look how many followers I have, advertise with me. I'll post a registration link for your app, and you pay me 0.1 yuan for every person who registers through my link. The advertiser says: deal.

  3. I post the registration link and nobody clicks it.

  4. No need to panic: I have the 100,000 bots click the link one after another and automatically complete the registration.

  5. I lie in bed counting the 10,000 yuan I've just earned.

E-commerce

E-commerce ranks third in crawler harassment rankings.

There are services called "price comparison platforms", "e-commerce aggregators", and "rebate platforms". They all work on the same principle:

when you search for a product, the aggregator automatically lays out the listings from every major e-commerce site, such as Taobao and JD.com, side by side for you to choose from.

This is the work of crawlers: they visit each e-commerce site, grab product images and prices, and display them on the aggregator, roughly as sketched below.
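A hedged sketch of that scraping step follows. The product URL and the HTML patterns (a span with class "price", an img with class "main-pic") are hypothetical stand-ins; real product pages differ, change constantly, and often forbid scraping in their robots.txt.

# Hypothetical sketch: the URL and the HTML patterns below are stand-ins for
# whatever a real product page uses; real sites differ and change frequently.
import re
from urllib.request import urlopen

def fetch_product(url):
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    price = re.search(r'<span class="price">([\d.]+)</span>', html)
    image = re.search(r'<img class="main-pic" src="([^"]+)"', html)
    return {
        "url": url,
        "price": float(price.group(1)) if price else None,
        "image": image.group(1) if image else None,
    }

print(fetch_product("https://shop.example.com/item/12345"))  # placeholder URL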

For example, I found a price comparison platform called Buy Slowly and searched for iPhone 12 prices:

[Figure: iPhone 12 price comparison results on the aggregator]

As consumers we may like this, but the e-commerce platforms do not: JD.com and Tmall certainly have no desire to be lined up for price comparison.

But crawlers simulate human clicks, and it is hard for e-commerce sites to stop them. They can't even copy 12306's approach: if Taobao made you tell Bai Baihe from Wang Luodan every time you opened a product page, I doubt you'd still be in the mood to shop.

Of course, e-commerce sites do have another weapon against crawlers: the web application firewall, or WAF for short.
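The details of commercial WAFs are proprietary, but one of the simplest ideas they build on is per-IP rate limiting. The sketch below only illustrates that idea with made-up thresholds; it is not how any particular WAF product is implemented.

# Illustrative only: block a client IP that makes more than `limit` requests in
# `window` seconds. Real WAFs combine many signals (rate, headers, fingerprints,
# behavior); the thresholds here are arbitrary.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, limit=100, window=10.0):
        self.limit = limit                  # max requests per window
        self.window = window                # window length in seconds
        self.history = defaultdict(deque)   # ip -> timestamps of recent requests

    def allow(self, ip):
        now = time.monotonic()
        recent = self.history[ip]
        while recent and now - recent[0] > self.window:
            recent.popleft()                # forget requests outside the window
        if len(recent) >= self.limit:
            return False                    # too fast: treat as a crawler
        recent.append(now)
        return True

limiter = RateLimiter(limit=5, window=1.0)
print([limiter.allow("1.2.3.4") for _ in range(7)])  # the last two come back False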

The aggregators, for their part, spend a lot of effort and money running these crawlers. They are not doing Taobao and JD.com a favor out of kindness; they have their own ways to profit:

  1. Suppose several Taobao stores sell iPhones. When users search on my aggregator, I decide whose store ranks first and whose ranks last, depending on who pays more. Baidu has played this game to death. Note that the stores and Taobao are not on the same side here: Taobao does not want its content scraped by aggregators, but each individual store is happy to have one more channel selling its goods;
  2. advertisements placed on the aggregator's own pages.

Search engines

As you may know, when a search engine decides which page ranks on top (besides how much money was paid to it), one key signal is which results get clicked more often.

So if I send crawlers to search for a specific keyword and then click one particular result like mad, that site's weight in the search engine naturally rises. This kind of manipulation is one (black-hat) flavor of SEO (Search Engine Optimization).

No search engine will tolerate outsiders tampering with its results; otherwise its credibility is gone. They fight this kind of SEO by tweaking their ranking algorithms from time to time.

This is especially true for gambling and porn sites: if a search engine dared to take their advertising money and rank them first, it would not be far from shutting down. So porn, gambling, and drug sites can only use black-hat SEO to force themselves to the top, and once a search engine discovers them, they get "demoted" as quickly as possible.

Government departments

[Figure: government websites most targeted by crawlers]

Second place goes to Beijing's unified appointment and registration platform; that one is, again, a scalper problem.

As for the others, such as court announcements, Credit China, and Credit Anhui: why would crawlers bother with them?

Because some information is only in the hands of government departments.

For example, who has been sued, which company has received an administrative penalty, which person has been placed on the dishonesty list. Combine these and you can build a credit profile of a company or an individual.

Summary

Some say technology is guilty; others say technology is innocent. In any case, the Cybersecurity Law contains essentially no clause saying that "crawling publicly available information on the Internet is illegal", so this remains a legal gray area.
I've seen tech people post online advising that if someone comes over asking whether you know how to write crawlers, say as little as possible and ask questions first.

Prison-oriented programming

There is a long-running joke in the crawler world: "write crawlers well, and you'll eat prison food early." Some crawler tutorials are even nicknamed "A Quick Guide to Getting Yourself Jailed".

There are plenty of real cases of crawler authors going to prison. One example I heard about recently dates from 2018: the CTO and a programmer on a crawler project were both arrested. Why? Their company had a line of business that required frequently querying a government residence-permit website for property addresses, codes, and so on. Manual lookup was too slow, so the product team decided to automate the queries with a crawler. The program was deployed in March 2018, and things went wrong in April.

Between roughly 10:34 and 12:00 on April 27, 2018, the contractor running the residence-permit website found the system down and suspected a deliberate attack, but with logs missing it could not trace the source IP and had to give up. On May 2 the system was attacked again; this time the response was faster, the operations staff captured the IP, and the case was reported to the police. On May 17 the cyber police traced the server IP to the company and pulled the whole company in.

The programmer later explained that the residence-permit website had added a CAPTCHA but the company's crawler had not been updated, so the crawler ran out of control, firing as many as 183 requests per second. This effectively paralyzed the target site, knocking out residence-permit processing and other public services and affecting the systems of nearly 100 police stations and service points.
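The technical lesson is the obvious one: a crawler that caps its own request rate and backs off when the target starts failing is far less likely to take a site down. Below is a minimal sketch of that pattern; the URLs are placeholders, and this is a general technique, not the code from the case.

# General pattern only, with placeholder URLs: never exceed a minimum interval
# between requests, and slow down further when the server starts erroring.
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def polite_fetch(urls, min_interval=1.0, max_backoff=60.0, max_retries=5):
    delay = min_interval
    for url in urls:
        for _ in range(max_retries):
            try:
                body = urlopen(url, timeout=10).read()
                delay = min_interval                 # success: back to the base rate
                yield url, body
                break
            except (HTTPError, URLError):
                delay = min(delay * 2, max_backoff)  # failure: exponential backoff
            finally:
                time.sleep(delay)                    # never hammer the server

for url, body in polite_fetch(["https://example.com/a", "https://example.com/b"]):
    print(url, len(body))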

Finally, in August 2018 the CTO and the programmer were arrested. The court held that the two had violated state regulations and interfered with a computer information system, leaving a system serving more than 50,000 users unable to operate normally for over an hour. The consequences were deemed especially serious, and they were prosecuted for the crime of sabotaging a computer information system.

In the end, the CTO who directed and authorized the crawler's development was treated as the principal offender and sentenced to three years in prison, while the programmer who wrote it was treated as an accomplice and sentenced to one year and six months. See Reference 6 for details.


Opinions on this case differ. Many people say the government website's engineering was simply too poor, or that a flimsy website happened to meet a careless developer: the target site had no basic anti-crawler firewall, and the crawler on this side was equally crude and brute-force.


The crawler's gentleman's agreement

The first step of any crawler: check robots.txt.

What is the gentleman's agreement?

Search engine crawlers are well-intentioned: they fetch your pages in order to serve other users. For this reason the industry defined robots.txt as a gentleman's agreement.

robots.txt is the product of a long game between websites and search engines. Also known as the robots protocol, it is a plain text file placed in the root directory of a website that tells search engine robots (web spiders) which parts of the site they may fetch and which they may not.

How did the gentleman's agreement come about?

The robots protocol was not drawn up by any single company. It first appeared in the 1990s, before Google even existed; it was discussed and born on a public mailing list of Internet practitioners. Even today, many Internet standards questions are still hashed out on such mailing lists (mostly in the United States, of course).

On June 30, 1994, after discussion between search engine developers and the webmasters whose sites were being crawled, the robots.txt protocol was released as an informal industry standard. The draft had been circulating among those involved for some time; once published on the Internet technology mailing list, the protocol was adopted by virtually every search engine, from the early AltaVista and Infoseek to the later Google and Bing, and Chinese companies such as Baidu, Soso, and Sogou also adopted it and followed it strictly.

A robot, also called a spider, is the general term for the programs a search engine uses to fetch web pages automatically. The core idea of robots.txt is to ask crawlers not to retrieve content that the webmaster does not want exposed to search.

Since the early days of search engines, the robots protocol has been the most effective mechanism so far for maintaining, through self-discipline, the balance of interests between websites and search engines.

What does the gentleman's agreement look like?

The content format of robots.txt:

  • User-agent: names the crawler a rule applies to; for example, Twitter's crawler is Twitterbot, Baidu's is Baiduspider, and Google's is Googlebot. A User-agent of * means the rules apply to all crawlers.
  • Disallow: a path prefix the crawler is not allowed to access (most major engines also support simple wildcards such as *);
  • Allow: a path prefix the crawler is allowed to access.

For example, the following is excerpted from Baidu's own robots.txt:

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

Disallow: /baidu means Baiduspider may not access any URL whose path starts with /baidu; matching is done on the path prefix, so it can be read simply as "URLs beginning with /baidu are off limits".

How to check a website's robots.txt

In the browser's address bar, type the site's root domain followed by /robots.txt. For example, Baidu's is https://www.baidu.com/robots.txt, Bing's is https://www.bing.com/robots.txt, and so on for other sites.
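You can also check it programmatically. Python's standard library ships a robots.txt parser, so a small script can ask whether a given crawler is allowed to fetch a given URL. Based on the Baidu rules excerpted above, the first check below should come back False (the live file may of course change over time).

# Uses only the standard library; results depend on the live robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")
rp.read()                      # download and parse the file

# can_fetch(useragent, url): may this crawler fetch this URL?
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/baidu"))  # expect False (Disallow: /baidu)
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/"))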

Shopping sites like Taobao basically ban search engines:

For example, Taobao:

User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /

Baidu's crawlers are directly prohibited from crawling it, but other crawlers are not prohibited.

Tmall is more ruthless and bans everyone:

User-agent: * 
Disallow: /

So sometimes we see something like this in the search results:

[Figure: a search result whose description cannot be shown because the site's robots.txt blocks the crawler]

Many foreign websites will write like this:

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /

In other words: nobody gets to crawl me except Google.

Cases of Violation of the Gentleman's Agreement

There are many such cases; the one I personally find most interesting is the domestic 360 Search case.

In August 2012, 360 Comprehensive Search was accused of violating the robots protocol. It not only scraped large amounts of content from Baidu and Google without authorization, it also recorded back-end order pages and discount codes of well-known domestic websites, and even some users' email addresses, account names, and passwords were quietly collected by 360 through its browser. More seriously, 360 even indexed pages on corporate intranets, leaking large amounts of internal information. At the end of 2012, Baidu engineers ran a test dubbed "Ghost Hunting on Ghost Festival", which showed that the 360 browser uploaded private content such as "isolated pages" to 360 Search without permission.

Baidu later sued Qihoo 360 for violating the robots protocol by scraping and copying content from Baidu Zhidao, Baidu Encyclopedia, Baidu Tieba, and other Baidu properties. The case was heard in the Beijing No. 1 Intermediate People's Court on the morning of October 16, 2013. More interestingly, on the day of the trial 360 counter-sued Baidu over "forced redirection", a case accepted by the Beijing Higher People's Court in which 360 claimed as much as 400 million yuan. 360 argued that Baidu maliciously blocked 360 search users' access, intercepted them, and forced them back to the Baidu homepage, treating 360's users in a discriminatory way; this, it said, seriously harmed the user experience, constituted unfair competition, and caused 360 heavy losses. Baidu responded that this was a protective measure against anonymous visits and illegal scraping of Baidu content, which had been degrading the search experience for its users.

Both cases dragged on, with verdicts coming about two years later. On August 7, 2014, the Beijing No. 1 Intermediate People's Court ruled at first instance that 360 must pay Baidu 700,000 yuan in compensation, while rejecting Baidu's other claims. I could not find the outcome of the other case online.

Interestingly, 360 maintained that indexing these pages did not infringe Baidu's interests at all; on the contrary, it brought Baidu a large amount of users and traffic, for which Baidu ought to be grateful.

The robots protocol is purely a self-regulatory convention. The current situation in China is that basically only the large search engines observe it; nobody bothers to enforce it against small operators.

References

  1. In layman's terms, what exactly is a web crawler? 12306's peak of 5.93 billion hits in one hour is terrifying.
  2. Baidu Encyclopedia: the robots protocol.
  3. Baidu v. Qihoo 360 for violating the robots protocol.
  4. The first step of a crawler: check robots.txt.
  5. What should you do if your company asks you to crawl a site whose robots.txt forbids crawling?
  6. A web crawler ran out of control and the CTO and programmer were sentenced: how can technical practitioners avoid business risk and protect their legitimate rights?
