[Question] Have you ever built a web crawler? How did you do it, and how did you decide what to crawl and what not to crawl?

When I interviewed at a startup, the interviewer asked me this question, and I was a little stunned.

I thought to myself: "What is he getting at? Does he want me to write crawlers? Doing crawler work is pretty risky these days."

Me: "Yes, I have done similar work, but the crawling is all public data and there is no legal problem"

Interviewer: "Then how do you know that the data you crawled is not legally risky?"

Me: "Uh...Because we have consulted the company's legal team and done relevant research before we started this project, and the code has passed a third-party code audit, so there is no data problem"

The interviewer was silent after that.

To be honest, I didn't know what he was after at the time (maybe he was setting a trap, or maybe he genuinely wanted to know how I had solved the problem), but I suspect he didn't expect me to talk my way around it so smoothly. After all these years, I think I can finally give this question a proper answer.

First, how would I actually do it?

These days, whenever web crawlers come up, I immediately think of doing it in Python; its libraries for this really are easy to use. But since I consider myself "a Java practitioner who doesn't stick to his own trade", let me briefly talk through it in Java.

The first step is definitely to plan an IP pool for outbound proxying (big pit: you need extra tooling to keep finding usable IPs). Then comes dynamic control of multithreading (big pit: too sensitive to elaborate on). Next is headless page extraction with PhantomJS (or HtmlUnit) plus Selenium (big pit: you have to make adjustments for pages that render asynchronously), then parsing the page data with Jsoup, then data cleaning (big pit: adapting to each data structure), and finally storage... I think that covers the needs of a basic web crawler.
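To make the Jsoup step concrete, here is a minimal sketch. The target URL, proxy address, and user agent string are placeholders of my own, and a page that renders asynchronously would still need the Selenium route instead:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BasicCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical target and proxy -- replace with your own, legally vetted, values.
        String url = "https://example.com/";
        Document doc = Jsoup.connect(url)
                .userAgent("my-crawler/1.0 (contact@example.com)") // identify yourself honestly
                .proxy("127.0.0.1", 8080)                          // route through the proxy pool
                .timeout(10_000)                                   // fail fast on slow pages
                .get();

        // Extract absolute links; a real crawler would feed these into a URL frontier.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + "  ->  " + link.text());
        }
    }
}
```

The `.proxy(...)` call is where the IP pool from step one plugs in; a real crawler would rotate proxies between requests rather than hard-coding one.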

So the most critical question is: which data can be crawled, and which cannot?

To know the answer, you have to consult the "lessons of those who came before". For official rulings, I choose the China Judgment Documents Network (https://wenshu.court.gov.cn/); as long as you register an account, you can search it for free.

After extensive data collection and analysis, the following conclusions were drawn:

What counts as an offense:

  1. Crawling other people's website data for commercial purposes, such as using it to power a free novel-reading service in your own app, cracking the other site's CAPTCHA to obtain data for your own site, or obtaining data for profit. These infringe copyright and data ownership rights.
  2. Obtaining data by illegal means, such as using technology to break another website's anti-crawling mechanisms, or obtaining interface data through packet capture, decompilation, and similar techniques. These are acts of illegally obtaining computer information system data.
  3. Crawling a large volume of data or seriously interfering with another website's functioning, such as scraping large amounts of personal information or generating so much traffic that the site is paralyzed. These go far beyond reasonable use and carry corresponding legal liability.
  4. Providing programs or tools designed specifically for illegal intrusion into computer information systems, such as supplying crawlers that bypass the target site's anti-crawling measures. This endangers the security of computer information systems.
  5. Failing to take confidentiality measures, thereby leaking trade secrets, such as mining commercial data hidden in a website's code. This violates trade secret protection.

What the rules allow:

  1. Crawl only fully public information, without breaking or bypassing any anti-crawling mechanism.
  2. Use the crawled data only for legitimate academic research, never for commercial purposes.
  3. Do not put excessive load on the target server; limit your crawl frequency.
  4. Comply with the robots protocol and respect the site owner's crawling rules.
  5. Use the crawled data within reason and do not exceed the permitted scope of use.

What you must never do:

  1. Use the data for commercial purposes without authorization.
  2. Break or bypass a website's anti-crawler technical measures.
  3. Crawl sensitive data such as personal privacy or trade secrets.
  4. Provide programs or tools for intruding into or damaging computer information systems.
  5. Crawl so frequently that you put heavy load on the server.
  6. Fail to take confidentiality measures, leading to data leaks.

So from a technical point of view, we must abide by the following 7 points:

  1. Identify and abide by the robots protocol, and set the crawl frequency, paths, and so on reasonably, according to the target's requirements (see the sketch after this list);
  2. Do not bypass anti-crawling technology by cracking CAPTCHAs, simulating logins, or similar methods; that is illegal access to a computer information system;
  3. Do not obtain interface data by decompiling software, sniffing network traffic, or the like; that is illegal data acquisition;
  4. Do not provide tools or services designed specifically to defeat anti-crawling measures or illegally enter computer information systems;
  5. Absolutely never crawl personal privacy information, trade secrets, or data related to national defense, state affairs, or cutting-edge science and technology;
  6. Use distributed, asynchronous, and similar techniques to control crawl frequency so the target server is not overloaded (also covered in the sketch below);
  7. Stop immediately upon receiving notice that a site forbids crawling.
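For points 1 and 6, here is a minimal, hand-rolled sketch assuming a single-threaded crawler: a deliberately simplified robots.txt check (it only handles `User-agent: *` with `Disallow` and `Crawl-delay` lines, far less than a full parser) plus a crude request throttle with an assumed default of one request per second. The class and method names are mine, not from any library:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class PoliteFetcher {
    private final HttpClient http = HttpClient.newHttpClient();
    private final List<String> disallowed = new ArrayList<>();
    private long minIntervalMillis = 1_000;  // assumed default: at most one request per second
    private long lastRequestAt = 0;

    /** Fetch robots.txt and remember the Disallow rules for the wildcard agent (simplified). */
    public void loadRobots(String baseUrl) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/robots.txt")).build();
        String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
        boolean appliesToUs = false;
        for (String line : body.split("\n")) {
            String l = line.trim();
            if (l.toLowerCase().startsWith("user-agent:")) {
                appliesToUs = l.substring(11).trim().equals("*");
            } else if (appliesToUs && l.toLowerCase().startsWith("disallow:")) {
                String path = l.substring(9).trim();
                if (!path.isEmpty()) disallowed.add(path);
            } else if (appliesToUs && l.toLowerCase().startsWith("crawl-delay:")) {
                minIntervalMillis = Long.parseLong(l.substring(12).trim()) * 1000;
            }
        }
    }

    /** Fetch a page only if robots.txt allows it, throttled to the minimum interval. */
    public String fetch(String baseUrl, String path) throws Exception {
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {       // simplified prefix match, no wildcards
                throw new IllegalStateException("robots.txt disallows " + path);
            }
        }
        long wait = lastRequestAt + minIntervalMillis - System.currentTimeMillis();
        if (wait > 0) Thread.sleep(wait);      // crude single-threaded throttle
        lastRequestAt = System.currentTimeMillis();
        HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + path)).build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

In production you would swap in a proven robots.txt parser and a real rate limiter, but the control flow stays the same: check robots first, then wait out the interval before every request.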

That's basically it. Let me remind everyone: don't do things that cross the legal bottom line or endanger national security. I recommend not touching crawler-related work until you have truly figured it out, lest you bring disaster on yourself. And if something does happen, consult a professional legal team for advice; don't try to work it out on your own.
