Write Crawlers Well, Go to Prison Early? The Risk of a Sentence Keeps Growing

Foreword

Your anonymous comments on the Internet may have no privacy at all in front of a crawler. Some social platforms seem to offer hidden corners where you can vent freely, but if someone with ulterior motives runs a crawler over them and applies big-data analysis, posting there is tantamount to streaking. In cyberspace, both the use of crawlers and the level of defenses against them already exceed what ordinary people imagine. So today I will discuss some real attack-and-defense cases with you and ask: can writing crawlers really land you in jail?


1. Getting to know crawlers: what is a crawler?

To put it simply, a crawler is a probe. Its basic principle is to simulate human behavior: it wanders around websites, clicks buttons, looks up data, and brings back the information it sees.

In more professional terms, the simplest crawler is a program that simulates human operations, fetches large amounts of data from a server, parses out the information it needs, and then decides what to do next.


The program may extract the link in some tag on the page and follow it to the next page, or it may store the captured information locally or in a database. These are still crawlers in the traditional sense. Pointed at an anonymous community, the same technique makes digging for so-called extreme speech far more efficient.
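The fetch, parse, decide loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FAKE_SITE`, `LinkParser`, and `crawl` are hypothetical names, and the in-memory dictionary stands in for real HTTP responses so that the sketch runs without contacting any server.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML. It stands in for real
# HTTP responses so the sketch runs without touching any server.
FAKE_SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": "<p>no links here</p>",
    "/c": '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, max_pages=10):
    """Breadth-first crawl: fetch a page, parse its links, queue unseen ones."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)          # "fetch": obtain the page content
        if html is None:
            continue
        visited.append(url)        # "store": here we only remember the URL
        parser = LinkParser()
        parser.feed(html)          # "parse": pull out the links
        for link in parser.links:  # "decide": follow links not seen before
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/", FAKE_SITE.get)
print(order)
```

A real crawler would replace `FAKE_SITE.get` with a function that issues an HTTP request, and would write the parsed data to a file or database instead of keeping only the visit order.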


2. What do crawlers involve?

In today's crawlers, the contest between experts and big companies is no longer limited to the web; it also involves client-side reverse engineering, dynamic debugging and analysis, and more. The real attack-and-defense scenarios, and the huge profits crawlers bring, far exceed what people imagine. The most basic thing a crawler does is fetch. We usually call the process of sending an HTTP request, getting data back, and parsing out the data we need one "crawl".

For example, when you manually search for "Zhang San" on Baidu, that is one HTTP request from you to the Baidu server. A crawler's "fetch" simulates this human behavior, but unlike your manual request, it is issued in batches by a machine or program.

Baidu, which you use every day, relies on exactly this crawler technology. Every day it sends countless crawlers out to websites to bring back their information, then tidies it up and waits for you to search it.


When 12306 first launched a few years ago, a lot of ticket-grabbing software appeared on the market. You may not know that this software also uses crawler technology: it is like sending out countless clones of yourself that keep refreshing the remaining-ticket page on the 12306 website. Once one of them finds a ticket, it grabs it immediately and calls out to you: "Come and pay."

As we all know, technology itself is neither good nor evil; what matters is whose hands it is in.


3. Malicious crawlers

Many government websites, news aggregators, and similar services have crawlers working behind them. It can be said that without crawlers there would be no "Internet". One excellent uploader in the knowledge section uses a crawler to collect the bullet-screen comments (danmaku) his viewers leave, then analyzes and compares them to refine his scripts, which greatly improves his efficiency. But a crawler like a ticket grabber hammers 12306 with tens of thousands of requests, and that kind of crawler is defined as a "malicious crawler".


4. Anti-crawler measures

However, where there are crawlers there are anti-crawler measures; it is a process of attack and defense. The attack can be defined as using technical means, without authorization, to simulate a real person's operations and obtain the information that the target system displays to real users. The defense can be defined as intervening in, intercepting, and tracing the source of such attacks. In network security nothing is absolutely impregnable, and every offensive and defensive method follows the saying "when virtue rises one foot, vice rises ten": each side keeps outdoing the other. The same is true of the offensive and defensive game played around crawlers.

From IP bans on the web side, to verification codes that tell whether the requester is a human or a robot, and on to "poisoning" content, the game between crawlers and anti-crawlers never ends.



If we compare our protected website or app to a city, the most primitive crawler mechanically sends requests asking for the gate to open. With no defensive strategy the gate stands wide open, and a crawler facing an unguarded gate will push its luck and start rushing in and out frantically. To guard the gate and keep the server stable, the gatekeeper has to enable IP blocking: if a crawler visits too frequently, the gatekeeper records its current IP address and blocks that address outright. However, this kind of defense usually works only against web-side crawlers. The verification code that pops up when a user logs in to an account is another powerful anti-crawler tool.
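The gatekeeper's IP blocking can be sketched as a sliding-window counter. This is a minimal illustration, not how any particular site does it: the class name `IpGate`, the thresholds, and the example addresses (taken from the documentation-only IP ranges) are all assumptions.

```python
from collections import defaultdict, deque


class IpGate:
    """Blocks an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=5, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now):
        """Return True if the request may pass; `now` is a time in seconds."""
        if ip in self.blocked:
            return False
        recent = self.history[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            self.blocked.add(ip)  # remember the offender and shut the gate
            return False
        return True


gate = IpGate(max_requests=5, window_seconds=1.0)
# A crawler firing 10 requests within 0.1 s from one address.
crawler_results = [gate.allow("203.0.113.7", now=i * 0.01) for i in range(10)]
# A human clicking once every 2 seconds.
human_results = [gate.allow("198.51.100.2", now=i * 2.0) for i in range(10)]
print(crawler_results)
print(human_results)
```

Passing `now` explicitly keeps the sketch deterministic; a server would normally read a monotonic clock instead. The weakness noted in the text also shows here: a crawler that rotates IP addresses never accumulates enough requests on any single address to be blocked.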


The verification code (CAPTCHA) was proposed by the computer scientist Luis von Ahn and his colleagues, and its full name is "Completely Automated Public Turing test to tell Computers and Humans Apart".

Whether it is a jigsaw puzzle or picking out digits and characters, the system poses a question to whoever sent the request; a requester who answers correctly is treated as a human, and otherwise as a machine. Of course, the verification code was not designed for anti-crawling. It was originally meant to stop machines from sending spam in batches. As mentioned earlier, crawling is also a batch machine behavior, so when a crawler is detected, a verification code is shown to stop it.


As network attack and defense kept evolving, so-called "coding platforms" (captcha-solving services) emerged to recognize verification codes, and some even use AI to solve them and break through the defense.

So the system needs sharper insight to tell humans from machines by subtle clues. After several generations of evolution, the verification code reached its peak in the 12306 system, where the difficulty of the questions and the pass rate are daunting.

For example, you may be asked to find Pleasant Goat among the pictures on the screen. Of course, this verification code is not meant to make things hard for people who honestly buy tickets; the crawlers left the site no other choice.


5. "Honeypot" technology

Crawlers do not have to be blocked across the board. In network security there is a technique known as the "honeypot".


Picture honey placed in a jar as bait to lure insects into a trap. Applied to crawler defense, this idea can be called "content poisoning". When the site detects a suspected unauthorized crawler collecting data, it does not block the crawler outright but feeds it wrong information.

For example, when a competitor crawls the product prices on a rival platform, the platform deliberately shows wrong prices to the crawler as a countermeasure, while consumers who search normally see the correct prices.
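The price example can be sketched as follows. Everything here is hypothetical: the catalogue `TRUE_PRICES`, the detection heuristic in `looks_like_crawler`, and the 20% skew are assumptions made for illustration, and real detection is far more involved.

```python
TRUE_PRICES = {"sku-1001": 199.00, "sku-1002": 45.50}  # hypothetical catalogue


def looks_like_crawler(user_agent, requests_last_minute):
    """Toy heuristic (an assumption): scripted user agents or very high request rates."""
    scripted = any(s in user_agent.lower() for s in ("python", "curl", "scrapy"))
    return scripted or requests_last_minute > 120


def quoted_price(sku, user_agent, requests_last_minute):
    """Return the real price to normal visitors and a skewed one to suspects."""
    price = TRUE_PRICES[sku]
    if looks_like_crawler(user_agent, requests_last_minute):
        # Poisoned answer: plausible enough to be stored, but wrong.
        return round(price * 1.2, 2)
    return price


human_quote = quoted_price("sku-1001", "Mozilla/5.0", requests_last_minute=3)
crawler_quote = quoted_price("sku-1001", "python-requests/2.31", requests_last_minute=3)
print(human_quote, crawler_quote)
```

The point of the design is that the suspected crawler receives no error and no verification code, so its operator has little reason to suspect that the collected data is unusable.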


Here is a real case:

Zhang San planned to secretly crawl the content of Li Si's pages. Li Si had long been prepared and added a line of text to the search results: "Search results come from Li Si". Because Zhang San met no verification code during the crawl, he assumed all was well and that the data could be crawled and used as-is. You have probably heard what happened next. That line of text was displayed in plain view on the front-end pages of Zhang San's company, which amounted to announcing to the world that Zhang San's "self-developed" search engine had in fact crawled all of its content from Li Si. The move did little damage but was extremely humiliating.


6. What legal risks do you face when writing crawlers?

There is a joke in the industry: "Write crawlers well, and you go to prison early".


A web crawler is a technique for collecting information. It is a neutral technology, and the technology itself is neither legal nor illegal.

It is just like a kitchen knife, which can be used to commit a crime or to chop vegetables. The knife is only a tool, and the tool itself is neither right nor wrong. Crawler technology is not malicious in itself, but it may become illegal when combined with malicious behavior or with malicious programs or tools.

Specifically, from the perspective of criminal law, forcibly breaking through specific technical measures of the crawled party may constitute the crime of illegally obtaining computer information system data or the crime of illegally controlling a computer information system. If the crawler interferes with the functions and normal operation of the target website, for example by inflating its traffic, slowing its responses, and disrupting its operation, it may also constitute the crime of sabotaging a computer information system.


From the perspective of the data you target: if the data itself is citizens' personal information, trade secrets, or state secrets, then even if the technique you use is lawful, the data is protected by law, and crawling it may constitute the crime of infringing on citizens' personal information, infringing on trade secrets, or illegally obtaining state secrets.

7. If the information is public, is crawling illegal?

The answer is: if the information was disclosed illegally, or was not disclosed voluntarily by the citizens themselves, crawling it may still infringe on citizens' personal information. In recent years many Internet companies have ended up in court over their use of crawler technology.

One well-known case is "Dianping v. Baidu", brought because Baidu crawled the user reviews on Dianping.


Another example is "Sina Weibo v. Maimai". Sina Weibo sued on the grounds that Maimai obtained the information of Sina Weibo users without authorization and without the users' consent, an act of unfair competition that violated the Anti-Unfair Competition Law. In the end, Baidu paid Dianping 3.23 million yuan in compensation, and Maimai paid Sina Weibo 2 million yuan.


8. I'm just an ordinary programmer

The Civil Code, which came into effect on January 1, 2021, also contains provisions on intellectual property, and crawling may infringe copyright and other intellectual property rights as well. At this point some readers will ask: I am just an ordinary programmer at a company; am I responsible for writing crawler code that the company asked for?

That depends on the situation. If the company's crawling constitutes unfair competition, the resulting civil liability is borne entirely by the company. If it is suspected of violating criminal law, it becomes a public prosecution case that cannot be settled privately with money, and it may constitute a unit crime, that is, a crime committed by an organization. Besides a fine for the company, the directly responsible managers, such as your supervisor or team lead, the vice president in charge, or even the company's legal representative, may be sentenced to criminal punishment. Whether you yourself go to jail depends on whether you played a leading role in the process and whether you helped push it forward.

Seen this way, writing crawlers really can land you in jail, and the better you write them, the greater the risk.

The last thing I want to say is that whenever we use crawlers, we should hold on to one belief: we are using the methods and tools within our ability to make the world, little by little, a better and more peaceful place.

1. Introduction to Python

The following is the basic knowledge required by every application area of Python. Whether you want to do crawlers, data analysis, or artificial intelligence, you have to learn it first. Anything impressive is built on simple foundations, and with a solid foundation the road ahead will be steadier.

Include:

Computer Basics


Python basics


Python introductory video 600 episodes:

Watching beginner-level videos is the fastest and most effective way to learn. By following the teacher's line of thought in the videos, it is easy to go from the basics to more advanced topics.

2. Python crawler

As a popular direction, crawlers are a good choice whether as a side job or as an auxiliary skill for improving work efficiency.

With crawler technology you can collect relevant content, then analyze and filter it to get the information you really need.

This work of collecting, analyzing, and integrating information applies to a wide range of fields. Whether in life services, travel, financial investment, or market demand for manufactured products, crawler technology can be used to obtain more accurate and useful information.


Python crawler video material


3. Data analysis

According to the report "Digital Transformation of China's Economy: Talents and Employment" released by the School of Economics and Management of Tsinghua University, the gap in data analysis talents is expected to reach 2.3 million in 2025.

With such a big talent gap, data analysis is like a vast blue ocean! A starting salary of 10K is really commonplace.


4. Database and ETL data warehouse

Enterprises need to move cold data out of the business database regularly and store it in a warehouse dedicated to historical data, from which each department can be given unified data services suited to its own business. This warehouse is a data warehouse.

The traditional integration architecture for a data warehouse is ETL, built on the capabilities of an ETL platform: E = extract data from the source database; T = clean the data (remove records that do not conform to the rules) and transform it (compute tables of different dimensions and granularities under different business rules, as the business requires); L = load the processed tables into the data warehouse incrementally, in full, or on different schedules.
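Read in the usual order (extract, transform, load), the three ETL steps can be sketched with Python's built-in `sqlite3` module. The table names `orders` and `daily_sales`, the sample rows, and the cleaning rule are all made up for illustration; two in-memory databases stand in for the business database and the warehouse.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # stands in for the business database
warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse

source.execute("CREATE TABLE orders (day TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2023-01-01", 10.0), ("2023-01-01", 5.5), ("2023-01-02", -3.0),
     ("2023-01-02", 20.0), (None, 7.0)],
)

# E: extract the raw rows from the source database.
rows = source.execute("SELECT day, amount FROM orders").fetchall()

# T: clean (drop rows that break the rules) and transform (aggregate per day).
totals = {}
for day, amount in rows:
    if day is None or amount < 0:
        continue
    totals[day] = totals.get(day, 0.0) + amount

# L: load the processed table into the warehouse (a full load in this sketch).
warehouse.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total REAL)")
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?)", sorted(totals.items()))
warehouse.commit()

loaded = warehouse.execute("SELECT day, total FROM daily_sales ORDER BY day").fetchall()
print(loaded)
```

An incremental load would extract only the rows added since the previous run and update the matching `daily_sales` rows instead of rebuilding the table.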


5. Machine Learning

Machine learning means letting a computer learn from one part of the data and then predict and judge other data.

At its core, machine learning is "using algorithms to parse data, learn from it, and then make decisions or predictions about new data". In other words, the computer uses the data it has to build a model and then uses that model to make predictions. This is somewhat like human learning: once people have gained some experience, they can make predictions about new problems.
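The "obtain a model from data, then predict" idea can be shown with about the simplest possible model, a straight line fitted by ordinary least squares. The data below is invented for illustration (scores that follow 50 + 2 × hours exactly), and `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: hours studied -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

# "Learning": derive the model parameters from the known data.
slope, intercept = fit_line(hours, scores)

# "Predicting": apply the model to an input it has never seen.
prediction = slope * 6 + intercept
print(slope, intercept, prediction)
```

Real machine learning libraries fit far richer models, but the two phases stay the same: estimate parameters from existing data, then apply them to new data.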


Machine Learning Materials:


6. Advanced Python

From basic syntax, through many in-depth advanced topics, to an understanding of programming language design: once you have worked through this part, you will know basically all the knowledge points from Python beginner to advanced.


At this point you can basically meet a company's hiring requirements. If you still don't know where to find interview materials and resume templates, I have compiled a set for you as well. It really can be called a nanny-level, systematic learning route.

But learning to program does not happen overnight; it takes long-term persistence and practice. In organizing this learning route, I hope to make progress together with everyone, and I get to review some technical points myself. Whether you are a novice or an experienced programmer looking to advance, I believe you can gain something from it.



Origin blog.csdn.net/weixin_49895216/article/details/132533171