As a crawler engineer, how do you crawl tens of millions of records?

Crawlers, the soul of crawling! A carnival for the whole crawler community.

The barrier to entry for crawling is low, and yet it is high: every real-world job can push you to the brink of collapse at any moment.

In this era of big data and intelligence, data is money! More and more companies take data seriously, and they use crawlers and other means to collect public data that powers their business and projects.

In the early stage of primitive capital accumulation, it is normal to resort to some low-level methods; I believe everyone knows a little about that. Take a certain "Chacha" platform: it aggregates multiple industry chains and nearly 10,000 industry sectors of the current domestic market, its data volume is approaching 100 million records, and that data is extremely valuable!

But how did it get so much data? I was curious too.

Later, after some digging, I found that Chacha's main data source in its early days came from exactly one of those primitive-capital-accumulation methods.

The data collected by crawlers is cleaned and merged into its database, then processed with analysis and algorithms, and finally opened up to the market and its users as a lookup service.

And let me point out that Chacha's current valuation is approaching ten "small goals" (a "small goal" being the meme-ified 100 million yuan), isn't it?

No crawler, no brother; if you're a real bro, come and crawl me! Seeing Chacha hit that many small goals, countless would-be primitive-capital accumulators on the market have once again pointed their low-level methods at Chacha itself.

I have to say it again: Chacha has eaten its fill, but the brothers are still hungry! Last time a reader commented to me that some of Chacha's anti-crawler strategies are rather painful to deal with~

At this point, many friends may be thinking: if I write a crawler myself, could I build a company too? Are the seeds of primitive capital accumulation already sprouting?

Too Young Too Naive

Everyone knows that in the early years, anyone who could put together a simulated-browser crawler would be affectionately hailed as a senior by the brothers in the trade! But times have changed, and things are not what they used to be.

Today's crawler engineers are practically expected to be all-rounders. Back to the question: how do you collect tens of millions of records?

Collecting tens of millions of records first depends on who the target is. If it's a "small goal" like Chacha, there are real difficulties, and you would have to fight capital with capital.

If it's just multi-site news websites, that's easy! Sites like that are nothing more than tens of millions of URLs.

As a crawler engineer, the most important thing is to do the upfront requirements analysis: estimate the site's data volume and the data sources to be collected, and filter out useless target data, because the more data you collect, the longer it takes and the more resources you need.

We also can't put too much pressure on the target site, otherwise the crawl turns into a DDoS and you may get "invited in for tea" by the authorities.
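To make that concrete, here is a minimal, purely illustrative politeness sketch in Python; the 2-second per-domain delay is an arbitrary example, not a rule from any particular site.

```python
import time
from urllib.parse import urlparse

_last_hit = {}      # domain -> timestamp of the previous request
MIN_DELAY = 2.0     # seconds to wait between requests to the same domain

def wait_politely(url: str) -> None:
    """Sleep just long enough to respect the per-domain delay."""
    domain = urlparse(url).netloc
    elapsed = time.time() - _last_hit.get(domain, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    _last_hit[domain] = time.time()
```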
For multi-site collection, first check whether the sites share common structural features, and try to avoid writing a dedicated crawler for every single site! For news-style content, one set of extraction templates can solve more than 90% of the problem. Why not just become an "XPath engineer"?
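As a rough illustration of that template idea, here is a minimal sketch using lxml; the site names, field names, and XPath rules are hypothetical placeholders, not real extraction rules.

```python
# Template-driven extraction: one generic function plus a per-site config.
from lxml import html

TEMPLATES = {
    "site_a": {
        "title": "//h1[@class='article-title']/text()",
        "publish_time": "//span[@class='pub-time']/text()",
        "content": "//div[@class='article-body']//p/text()",
    },
    "site_b": {
        "title": "//div[@id='news']/h2/text()",
        "publish_time": "//em[@class='time']/text()",
        "content": "//div[@id='news-content']//p/text()",
    },
}

def extract(site: str, page_html: str) -> dict:
    """Apply one site's XPath template to a downloaded page."""
    tree = html.fromstring(page_html)
    rules = TEMPLATES[site]
    return {
        field: " ".join(t.strip() for t in tree.xpath(xpath) if t.strip())
        for field, xpath in rules.items()
    }
```

Adding a new site then means adding one more config entry rather than writing another crawler.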
The other point is that the code must be robust! If you haven't mastered high availability, high scalability, and high performance, that doesn't matter much, but you should at least have heard of those three articles of faith.
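For a taste of what "robust" means in practice, here is a minimal sketch of a fetch wrapper with timeouts, retries, and exponential backoff; the numbers are arbitrary examples.

```python
import time
from typing import Optional
import requests

def fetch(url: str, retries: int = 3, backoff: float = 2.0) -> Optional[str]:
    """Download a page, retrying transient failures with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 200:
                return resp.text
            if resp.status_code < 500:
                return None          # 4xx: a hard failure, do not retry
        except requests.RequestException:
            pass                     # network error: fall through and retry
        time.sleep(backoff ** attempt)  # wait longer after each failure
    return None
```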

In addition, once the site analysis is done, you can quickly iterate a crawler and let it run for a while to test the waters; after all, many anti-crawler measures can't be spotted with the naked eye.

If someone writes a crawler in one pass, pushes it straight to production, and crawls an entire site without a single bug, that person is the ancestor of the trade and must be addressed as a grandmaster with the highest courtesy.

On the storage side, once the collection reaches tens of millions of rows, a single table won't cut it; at that point you have to shard the data across multiple tables.
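One simple way to shard is sketched below, under the assumption that records are keyed by URL; the table names and shard count are made up for illustration.

```python
# Route each record to one of N tables by hashing its URL.
import hashlib

SHARD_COUNT = 16

def table_for(url: str) -> str:
    """Return the shard table name this URL's record belongs in."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % SHARD_COUNT
    return f"news_article_{shard:02d}"   # e.g. news_article_07
```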

When writing to the database, use strategies such as batch insertion so that storage throughput isn't bottlenecked by database performance.
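A minimal batch-insert sketch follows; it uses sqlite3 only so the example is self-contained, but the same executemany pattern applies to MySQL or PostgreSQL drivers.

```python
import sqlite3

BATCH_SIZE = 500
_buffer = []

conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS article (url TEXT, title TEXT)")

def save(url: str, title: str) -> None:
    """Buffer rows and flush them in batches instead of one INSERT per row."""
    _buffer.append((url, title))
    if len(_buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    """Write all buffered rows in a single executemany call."""
    if _buffer:
        conn.executemany("INSERT INTO article (url, title) VALUES (?, ?)", _buffer)
        conn.commit()
        _buffer.clear()
```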

Multi-site collection inevitably eats bandwidth, memory, CPU, and other resources, so you have to design a distributed collection system that manages and schedules those resources sensibly and gets the most out of your crawlers. Wouldn't one-click deployment of multi-node, collaborative, incremental collection be nice?
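One common way to coordinate nodes is a shared Redis queue plus a deduplication set. The sketch below assumes the redis-py client and a reachable Redis instance; the host and key names are hypothetical.

```python
# Every node runs this same worker and pulls work from a shared Redis queue.
import redis

r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)

QUEUE_KEY = "crawler:todo"   # pending URLs, shared by all nodes
SEEN_KEY = "crawler:seen"    # set of URLs already scheduled (dedup)

def schedule(url: str) -> None:
    """Push a URL onto the shared queue unless some node has already seen it."""
    if r.sadd(SEEN_KEY, url):        # returns 1 only for brand-new members
        r.lpush(QUEUE_KEY, url)

def worker() -> None:
    while True:
        _, url = r.brpop(QUEUE_KEY)  # blocks until a URL is available
        # download and parse `url` here, save the results,
        # then call schedule() on any newly discovered links
        print("crawling", url)
```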
As mentioned earlier, why are today's crawler engineers either all-rounders already or well on their way to becoming one?

Because a crawler engineer has to know at least one other discipline, if not several; learning to crawl is only learning to walk.

Don't you need to know the HTTP protocol? Which protocol version can save you bandwidth and time?

Databases? Do you understand them? How do you optimize data storage? Shouldn't you know a bit about distributed databases too?

Algorithms? Shouldn't you also understand crawler task scheduling?

Distributed systems? Redis? Kafka? You have to know a little, right? Otherwise how do your crawlers cooperate? After all, that's what the big players use.

JavaScript? You don't know it? Then how do you become a senior crawler engineer? JS reverse engineering is the only way up!

Shouldn't you understand basic encryption and decryption?

Do you know how to crack CAPTCHAs? Don't you need to understand machine learning? Machine learning is what's used to crack CAPTCHAs these days!

Shouldn't you learn iOS development? And Android development too? Otherwise how do you decompile the encryption algorithm behind an app's hidden interface?

So collecting tens of millions of records is actually not so much about writing code; it has a lot to do with how you handle problems and how you design systems. Many websites on the market today are easy to defend and hard to attack! You need, among others, the following capabilities:
  1. The site detects your crawler and bans your IP. You know you've been spotted, but not how: the User-Agent? Your request behavior? How do you evade detection effectively?
  2. A certain site returns plausible-looking junk data, with poison mixed into the results. How do you tell real data from fake?
  3. The business requirement is to crawl hundreds of millions of records a day, but a single machine's bandwidth is limited. How do you use distributed strategies to raise crawling throughput?
  4. Does the data need cleaning? How do you clean it? Do you understand end-to-end pipeline cleaning?
  5. How do you detect and monitor updates to a site's data? How do you design the rules? (A minimal sketch follows this list.)
  6. How do you design storage for massive amounts of data?
  7. How do you collect content loaded by JavaScript?
  8. How do you crack data and parameter encryption?
  9. How do you handle different kinds of CAPTCHAs? What works best for improving the recognition rate?
  10. How do you collect data from mobile apps? How do you dig out their data interfaces?
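For item 5, one simple approach is to fingerprint page content and re-collect only when the fingerprint changes. A minimal sketch, with an in-memory dict standing in for what would be Redis or a database in production:

```python
import hashlib

_fingerprints = {}   # url -> hash of the last-seen content

def has_changed(url: str, page_html: str) -> bool:
    """Return True (and record the new fingerprint) if the page content changed."""
    digest = hashlib.sha1(page_html.encode("utf-8")).hexdigest()
    if _fingerprints.get(url) == digest:
        return False          # unchanged: skip re-processing
    _fingerprints[url] = digest
    return True               # new or updated: re-collect / re-parse
```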

Ten soul-searching questions???
Well, it's time to say goodbye. Thank you, dear readers, for taking the time to read this. Creating content isn't easy; if it resonated with you, please leave a thumbs-up before you go. Your support is what keeps me writing, and I hope to bring you more quality articles.

Source: blog.csdn.net/qiulin_wu/article/details/109437483