Earn ¥6,000 a month after one month of learning web crawling? Don't be fooled: a veteran explains the real state of crawler work

A few days ago, a follower told me that someone from a certain training institution claimed he would be able to take freelance orders within a month of learning web crawling and earn the tuition back, and urged him to sign up for that institution's crawler course. I was left speechless.

Trying to stay objective, I didn't jump to conclusions even though I didn't believe it. I went and looked at their curriculum, and it was exactly as I expected: most of the courses cover introductory Python (functions and so on), requests, XPath, and the like. Isn't that all just junior-crawler knowledge? And that's supposed to earn 6,000 a month? Why not teach young people to snatch money on the street instead?

If that's all you learn, you may well starve. Crawler work can indeed earn 6,000 a month, but only once your technical level is actually there.

Today I'll walk through what technologies you should learn at the beginner, intermediate, advanced, and peak levels of crawler development, and, drawing on my years of freelance experience, tell you roughly what each level of skill can earn.


1. Beginner crawlers

Going by my years of experience with crawlers, the beginner level amounts to this:

What can someone at this level do? Crawl basic websites; the moment any anti-crawling is involved, it's game over.

For example, suppose we want to crawl articles from a website that has no anti-crawling mechanism. It's enough to fetch pages with a library like requests, parse the page source with XPath, BeautifulSoup, PyQuery, or regular expressions, write the text out to a file, and you're done.

The difficulty is low; it's little more than a few method calls and a storage loop. To go slightly further on the storage side, you can connect to MySQL, MongoDB, Elasticsearch, Kafka, and so on for persistent storage, which makes later querying and processing much more convenient.
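
As a minimal sketch of this beginner-level workflow (the URL and CSS selector below are hypothetical placeholders, not a real site):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target; a real job would use the site you're hired to crawl.
url = "https://example.com/articles"
resp = requests.get(url, timeout=10)
resp.raise_for_status()

# Parse the page source and pull out article titles (selector is assumed).
soup = BeautifulSoup(resp.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.select("h2.article-title")]

# Persist to a plain text file; swap in MySQL/MongoDB for real persistence.
with open("titles.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))
```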

One month of study gets you to roughly this beginner level. Earning 6,000 a month at this level is quite difficult; you have to keep improving your crawler skills.


2. Intermediate crawlers

The intermediate level can be considered the baseline for professional crawler work. Beyond the beginner material, you should also master the following:

1. Crawling methods

When plain requests no longer works (what you download differs from what the browser shows), you should suspect the data is loaded via Ajax, and analyzing the site will require understanding JavaScript. If you'd rather bypass the Ajax analysis and the JavaScript logic altogether, you can crawl by simulating a browser with Puppeteer, Pyppeteer, Selenium, Splash, and the like.
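
A minimal browser-simulation sketch with Selenium, assuming a local Chrome install (the URL and selector are placeholders):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes Chrome and a matching driver are available
try:
    driver.get("https://example.com/ajax-page")  # hypothetical URL
    # The DOM here reflects the page *after* JavaScript has run,
    # so Ajax-loaded content is visible to your parser.
    for item in driver.find_elements(By.CSS_SELECTOR, "div.item"):  # assumed selector
        print(item.text)
finally:
    driver.quit()
```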

2. Crawling speed

Beyond the crawling method, there's crawling speed. Here you need a working knowledge of multiprocessing, multithreading, and coroutines.
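
For a taste of the coroutine approach, here is a concurrent fetch sketch with asyncio and aiohttp (the URLs are placeholders):

```python
import asyncio
import aiohttp

# Hypothetical list of pages to fetch concurrently.
URLS = [f"https://example.com/page/{i}" for i in range(1, 11)]

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        # gather() runs all the requests concurrently on one event loop.
        pages = await asyncio.gather(*(fetch(session, u) for u in URLS))
        print(f"fetched {len(pages)} pages")

asyncio.run(main())
```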

3. Crawling apps

If you only know how to crawl web pages, you're not yet at the intermediate level. You also have to know how to crawl apps, which account for half the territory.

At this point you need to be able to capture packets with Charles or Fiddler and replay the requests afterward. If the interface is encrypted, you can use mitmproxy to intercept the interface data directly, or use hooking, for example with Xposed.
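
As a sketch of the mitmproxy route, a small addon can log every response from an app's API once the device's proxy points at mitmproxy (the host name is a hypothetical example):

```python
# Save as dump_api.py and run with: mitmdump -s dump_api.py
from mitmproxy import http

# Hypothetical API host we want to observe.
TARGET = "api.example.com"

def response(flow: http.HTTPFlow) -> None:
    # Called by mitmproxy for every completed response.
    if TARGET in flow.request.pretty_host:
        print(flow.request.pretty_url)
        print(flow.response.get_text()[:200])  # first 200 chars of the body
```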

Another key point in app crawling is automation. If you're poking through the app by hand, no amount of extra pay makes it worth it; that's not a job for one human. The better answer is the adb tool plus Appium. Worth learning, wouldn't you say?
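
A rough Appium sketch of automated app crawling; every capability value here is an assumption (hypothetical package and activity), and it presumes an Appium 2 server and an Android emulator are already running:

```python
from appium import webdriver
from appium.options.android import UiAutomator2Options

# All capability values below are illustrative assumptions.
options = UiAutomator2Options()
options.device_name = "emulator-5554"
options.app_package = "com.example.app"   # hypothetical package
options.app_activity = ".MainActivity"    # hypothetical activity

# Assumes an Appium 2 server listening on the default local port.
driver = webdriver.Remote("http://127.0.0.1:4723", options=options)
try:
    # Swipe up to load more content, then read the current screen.
    driver.swipe(500, 1500, 500, 500, 800)
    print(driver.page_source[:200])
finally:
    driver.quit()
```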


3. Advanced crawlers

Advanced crawler developers have a big advantage, whether in the workplace or in side work. At this level you should master the following technologies:

1. Enterprise-level crawlers

Anyone who has touched large-scale crawling will understand: multithreading, multiprocessing, and coroutines do speed up crawling, but a single-machine crawler is still far inferior to the more advanced distributed crawler. Distributed crawlers are what qualifies as enterprise-level.

The core of a distributed crawler is resource sharing, so you need to master RabbitMQ, Celery, Kafka, and the like, using these queues and components to distribute the work. The other essential is the famous Scrapy framework, currently the most widely used crawler framework; understanding and mastering Scrapy-Redis, Redis-BloomFilter, and Redis Cluster on top of it is indispensable.
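
A minimal scrapy-redis sketch of that resource sharing: every worker machine pulls from one Redis queue, so the crawl frontier and the dedup set are shared (spider name, key, and selector are illustrative):

```python
# Every worker running this spider blocks on a shared Redis list,
# so N machines consume one crawl frontier. Push start URLs with:
#   redis-cli lpush articles:start_urls https://example.com/
from scrapy_redis.spiders import RedisSpider

class ArticleSpider(RedisSpider):
    name = "articles"
    redis_key = "articles:start_urls"

    def parse(self, response):
        # Hypothetical extraction rule, for illustration only.
        for title in response.css("h2.title::text").getall():
            yield {"title": title, "url": response.url}

# settings.py needs the shared scheduler and dedup filter:
#   SCHEDULER = "scrapy_redis.scheduler.Scheduler"
#   DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
#   REDIS_URL = "redis://localhost:6379"
```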

Master these and your crawler reaches enterprise-grade efficiency.

2. Dealing with anti-crawling

The other focus at the advanced level is anti-crawling.

The most common web anti-crawling mechanism is the CAPTCHA: slider verification, picking out objects, arithmetic puzzles, and endless other variations. You need to know how to handle these common CAPTCHAs.

IP detection is also a staple of anti-crawling: trip it and you get banned, so countermeasures are a must. Whether you rotate through free proxies or paid ones, either will do.
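
A bare-bones proxy-rotation sketch with requests (the proxy addresses are made-up placeholders; in practice they'd come from your free list or paid provider):

```python
import random
import requests

# Hypothetical proxy pool; real entries come from a proxy provider.
PROXIES = [
    "http://111.111.111.111:8080",
    "http://122.122.122.122:3128",
]

def fetch_with_proxy(url: str) -> requests.Response:
    proxy = random.choice(PROXIES)
    # Route both schemes through the randomly chosen proxy.
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

resp = fetch_with_proxy("https://example.com/")  # placeholder URL
print(resp.status_code)
```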

There is also load-spreading to avoid account bans: build pools, whether a Cookies pool, a Token pool, or a Sign pool. With a pool in place, your odds of getting banned drop sharply. You don't want to crawl a public account only to have WeChat ban it, do you?
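
As a toy sketch of the pooling idea, a Cookies pool can be as simple as a Redis hash of per-account cookies from which each request borrows a random identity (key names are illustrative):

```python
import json
import random
import redis

# Cookies for many accounts live in one Redis hash; each request
# borrows a random identity, spreading load across accounts.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_cookies(account: str, cookies: dict) -> None:
    r.hset("cookie_pool", account, json.dumps(cookies))

def random_cookies() -> dict:
    account = random.choice(r.hkeys("cookie_pool"))
    return json.loads(r.hget("cookie_pool", account))

# Usage: requests.get(url, cookies=random_cookies())
```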


4. Higher-level crawlers (the pinnacle)

For higher-level crawler work, these four things are essential:

1. JS reverse engineering

Why learn JS reverse engineering? In the arms race between crawling and anti-crawling, you can always fall back on Selenium, but it's inefficient: it simulates the entire page-rendering process when the real data may hide behind one small interface. JS reverse engineering is therefore a higher-level crawling technique, especially for data crawling on large sites, such as a certain "Duoduo" or a certain "Bao". If you can reverse the JavaScript and pull the data down directly, it's solid proof of top-tier skill. But not everyone can pull it off; it truly burns through your hair.
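
As a toy illustration of the payoff (the sign function below is a stand-in I made up, not any real site's code): once you've extracted the signing logic from a site's JS bundle, PyExecJS can run it straight from Python, letting you hit the data interface without rendering anything:

```python
import execjs  # pip install PyExecJS; requires Node.js or another JS runtime

# Hypothetical scenario: the site signs every API request with a JS
# function lifted from its bundle. We run just that function and then
# call the small data interface directly instead of rendering the page.
JS_SOURCE = """
function sign(path, ts) {        // stand-in for the real, extracted code
    return path + "|" + ts;      // real sites use hashing/obfuscation here
}
"""

ctx = execjs.compile(JS_SOURCE)
signature = ctx.call("sign", "/api/items", 1700000000)
print(signature)  # attach this to the request's params or headers
```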

And that's before app reverse engineering: if you can reverse both web pages and apps, you've truly earned the word "awesome".

2. Intelligent crawlers

What is an intelligent crawler? Normally, to crawl novel sites, you write different extraction rules for each site to get the content you want. With intelligent parsing, whatever the site, you just hand the algorithm a page URL and it identifies the title, body, update time, and other fields on its own, with no hand-written extraction rules.

In short, intelligent crawlers combine crawling with machine learning to make extraction smarter. Otherwise, to crawl 10,000 websites, would we really write 10,000 crawler scripts?
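
For a taste of rule-free extraction, the readability-lxml library is one example of such heuristics; it guesses the title and main body of an arbitrary article page (the URL is a placeholder):

```python
import requests
from readability import Document  # pip install readability-lxml

# One generic extractor instead of per-site rules: readability
# heuristically locates the main title and body of an article page.
html = requests.get("https://example.com/some-article", timeout=10).text  # placeholder

doc = Document(html)
print(doc.title())          # best-guess article title
print(doc.summary()[:300])  # cleaned HTML of the main content
```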

3. Crawlers and ops

Since when did crawlers have anything to do with operations and maintenance? The two have always been inseparable; it's just that until your crawler's needs or your level reach a certain point, you never have to think about it.

The relationship shows up mainly in deployment and distribution, data storage, and monitoring.

For example: how do you quickly deploy a crawler to 100 hosts? How do you monitor each crawler's memory and CPU usage? How do you set up alerting to keep a crawler project safe?

Kubernetes, Prometheus, and Grafana are the technologies crawlers lean on most for operations; I regularly use them as backup when working on larger crawler projects.
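
As a small monitoring sketch with the official prometheus_client library: the crawler exposes metrics that Prometheus scrapes and Grafana charts or alerts on (the metric names and fake fetch are illustrative):

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# Two illustrative metrics; Prometheus scrapes them from :8000/metrics.
PAGES = Counter("crawler_pages_total", "Pages fetched", ["status"])
LATENCY = Histogram("crawler_fetch_seconds", "Fetch latency")

def fake_fetch() -> None:
    # Stand-in for a real page fetch, for illustration only.
    with LATENCY.time():
        time.sleep(random.uniform(0.05, 0.2))
    PAGES.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        fake_fetch()
```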

4. The pinnacle

What counts as the pinnacle? There may never be one... As long as I don't sport a true master's hairstyle (completely bald), I dare not claim to have seen it...

My vague sense is that a crawler developer taken to the limit is full-stack, capable of data analysis, maybe an algorithms expert as well, maybe able to make a mark in artificial intelligence. Perhaps that is the pinnacle of crawler work?

That's all for today's sharing, I hope everyone can become a man at the top of the pyramid!

Thank you for reading and for the likes. I've collected a lot of solid technical material to share with friends who enjoy my articles; if you're willing to settle down and study it, it will definitely help you.
