When you want to learn web crawling and actually write a crawler, how much do you need to know?

In today's era of big data, web crawlers have become an important means of obtaining data.

But learning to crawl well is not that simple. First of all, there are just too many knowledge points and directions: it touches computer networks, programming fundamentals, front-end development, back-end development, App development and reverse engineering, network security, databases, operations and maintenance, machine learning, data analysis, and more. Like a big net, it ties together most of the mainstream technology stacks. Because it covers so many directions, the things to learn are scattered and messy; many beginners aren't sure what to study, and when they hit anti-crawling measures they don't know which method to reach for. This article tries to sort all of that out.

Getting started with crawling

The most basic websites often have no anti-crawling measures at all. Take a blog site: if we want to crawl the whole thing, we follow the list pages through to the article pages, then extract each article's publish time, author, body text, and other information.

How do you write the code? Python's requests and similar libraries are enough. Write the basic logic, fetch an article's source code, then parse it with XPath, BeautifulSoup, PyQuery, or regular expressions, or even crude string matching; pull out the content you want and save it to a text file.
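
As a concrete illustration, here is a minimal sketch of that logic. The blog URL and all the CSS selectors are made up for illustration; a real site needs its own selectors.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Hypothetical blog; the URL and selectors below are placeholders.
BASE_URL = 'https://example.com/blog'

def get_article_links(list_url):
    html = requests.get(list_url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    # Assume each entry on the list page links to an article page.
    return [urljoin(list_url, a['href']) for a in soup.select('.post-list a')]

def parse_article(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': soup.select_one('h1').get_text(strip=True),
        'author': soup.select_one('.author').get_text(strip=True),
        'time': soup.select_one('.publish-time').get_text(strip=True),
        'content': soup.select_one('.content').get_text(strip=True),
    }

for link in get_article_links(BASE_URL):
    article = parse_article(link)
    # Save each article body as a plain text file named after its title.
    with open(f"{article['title']}.txt", 'w', encoding='utf-8') as f:
        f.write(article['content'])
```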

The code is very simple: a handful of method calls, a couple of loops, plus some storage, and in the end the articles are saved on your machine. Of course, some people can't write code, or are too lazy to; for them, the basic visual scraping tools (the well-known point-and-click collectors) can also pull the data down by visual selection.

If you expand the storage side a little, you can hook up MySQL, MongoDB, Elasticsearch, Kafka, and so on to persist the data, which makes later querying and processing much more convenient.
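
For example, switching from text files to MongoDB only takes a few lines. A minimal sketch, assuming a local MongoDB instance and the article dict from the sketch above:

```python
from pymongo import MongoClient

# Assumes MongoDB is running locally on the default port.
client = MongoClient('localhost', 27017)
collection = client['blog']['articles']

article = {'title': '...', 'author': '...', 'time': '...', 'content': '...'}

# Upsert by title so re-crawling the same article doesn't create duplicates.
collection.update_one({'title': article['title']}, {'$set': article}, upsert=True)
```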

In short, efficiency aside, a website with no anti-crawling measures can be handled in the most basic way.

So, can you say you know how to crawl now? No, not even close.

Ajax, dynamic rendering

With the development of the Internet, front-end technology keeps changing, and data is no longer loaded purely through server-side rendering. On many websites now, the data is delivered through an interface, or sits in the page as JSON and is then rendered by JavaScript.

At this point, crawling with requests alone doesn't help much: what requests fetches is the server-rendered source code, which is not the same as what the browser actually displays. The real data is produced by executing JavaScript; it may come from an Ajax interface, from data embedded in the page, or from some iframe, but in most cases it comes from an Ajax interface.
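
When such an interface has no protection, simulating it is straightforward: find the request in the browser's Network panel and replay it with requests. A minimal sketch, with a hypothetical endpoint, parameters, and response structure:

```python
import requests

# Hypothetical Ajax endpoint discovered in the browser's Network panel.
API_URL = 'https://example.com/api/articles'

headers = {
    # Many interfaces check these headers, so copy them from the real request.
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/',
    'X-Requested-With': 'XMLHttpRequest',
}

params = {'page': 1, 'size': 20}           # pagination parameters are assumed
resp = requests.get(API_URL, headers=headers, params=params, timeout=10)
data = resp.json()                         # the interface returns JSON directly

for item in data.get('list', []):          # the response structure is assumed
    print(item.get('title'))
```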

So in many cases you need to analyze the Ajax requests, work out how those interfaces are called, and then simulate them in your program, as above. But some interfaces carry encrypted parameters such as token, sign, and so on, which are not easy to simulate. What then?

One way is to dig into the site's JavaScript, pull the code apart, figure out how those parameters are constructed, and then reproduce or rewrite that logic in your crawler. If you crack it, simulating the interface directly is far more efficient. This requires some JavaScript skills. Of course, some sites' encryption logic is written so well that you may not crack it in a week and end up giving up.

If you can't crack it, or don't want to, what then? There is a simple, crude option: simulate a browser and crawl directly, with tools like Puppeteer, Pyppeteer, Selenium, or Splash. The source you get is then the real rendered page, so the data can be extracted naturally, and you skip the whole process of analyzing Ajax and JavaScript logic. What you can see, you can crawl, and it's not hard; and since a real browser is doing the work, a few other problems are sidestepped as well.
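
A minimal sketch of the browser route with Selenium; it assumes Chrome and a matching chromedriver are installed, and the URL and selectors are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()                # assumes chromedriver is on PATH
try:
    driver.get('https://example.com/articles')
    # Wait until the JavaScript-rendered list actually appears.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.post-list'))
    )
    html = driver.page_source              # the fully rendered page
    titles = [e.text for e in driver.find_elements(By.CSS_SELECTOR, '.post-list h2')]
    print(titles)
finally:
    driver.quit()
```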

But in practice this approach also runs into all sorts of anti-crawling. Many websites now detect webdriver: if they see you're using Selenium or similar tools, they simply block you or return no data, so when you meet that kind of site you have to deal with that problem specifically.

Multiprocessing, multithreading, coroutines

The situations above can all be handled with a single-threaded crawler, but there's one problem: it's slow.

Crawlers are IO-bound tasks, so most of the time they're just waiting for the network to respond. If the network is slow, you wait. But that idle time could be used to let the CPU do more work. So what do we do? Open more threads.

So in some scenarios we can add multiprocessing and multithreading. Although Python's multithreading comes with the GIL, its impact on an IO-bound crawler is not that big, so both multiprocessing and multithreading can multiply the crawl speed. The corresponding libraries are threading and multiprocessing.
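
A minimal sketch of the threading route using concurrent.futures; the URL list is a placeholder:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URL list; in a real crawler this would come from the list pages.
urls = [f'https://example.com/articles?page={i}' for i in range(1, 51)]

def fetch(url):
    # Each thread spends most of its time waiting on the network,
    # which is exactly where threads still help despite the GIL.
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=10) as executor:
    pages = list(executor.map(fetch, urls))

print(len(pages), 'pages downloaded')
```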

Asynchronous coroutines are even more powerful: with aiohttp, gevent, Tornado, and the like, you can run pretty much as much concurrency as you want. Still, go easy; don't take down other people's websites.
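
And a minimal sketch of the coroutine route with aiohttp; the semaphore caps concurrency precisely so you don't hammer the target site:

```python
import asyncio
import aiohttp

# Placeholder URL list, as before.
URLS = [f'https://example.com/articles?page={i}' for i in range(1, 51)]

async def fetch(session, semaphore, url):
    # The semaphore limits how many requests are in flight at once.
    async with semaphore:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    semaphore = asyncio.Semaphore(10)
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [fetch(session, semaphore, url) for url in URLS]
        return await asyncio.gather(*tasks)

pages = asyncio.run(main())
print(len(pages), 'pages downloaded')
```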

In short, with these few tools in hand, the crawl speed goes way up.

But speed is not necessarily a good thing. Anti-crawling will definitely follow: blocking your IP, blocking your account, throwing captchas at you, returning fake data. So sometimes crawling at a turtle's pace looks like part of the solution, doesn't it?

Distributed crawling

Multithreading, multiprocessing, and coroutines can all speed things up, but in the end you're still running on a single machine. To really scale up, you have to rely on distributed crawlers.

What is the core of distribution? Resource sharing: sharing the crawl queue, sharing deduplication state, and so on.

We can use basic queues or components such as RabbitMQ, Celery, Kafka, and Redis to implement distribution, but many people who try to roll their own distributed crawler end up with performance and scalability problems. The exceptionally capable are, of course, the exception; many companies do maintain in-house distributed crawlers that sit closer to their business, which is naturally the best option.

At present, mainstream distributed crawling in Python is still built on Scrapy, hooked up to Scrapy-Redis, Scrapy-Redis-BloomFilter, or Scrapy-Cluster. They all share the crawl queue through Redis, and sooner or later they run into memory problems. So some people also consider wiring in other message queues, such as RabbitMQ or Kafka, to solve some of those problems; the efficiency is not bad either.
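
For reference, hooking Scrapy up to Scrapy-Redis is mostly a matter of configuration. A minimal settings.py sketch, assuming scrapy-redis is installed and a Redis instance runs on localhost:

```python
# settings.py (excerpt): share the scheduler queue and dupefilter via Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so crawls can be paused and resumed.
SCHEDULER_PERSIST = True

# Every worker machine points at the same Redis instance.
REDIS_URL = "redis://localhost:6379"

# Optionally push scraped items into Redis as well.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
```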

In short, to improve crawling efficiency, distribution is still something you have to master.

Captcha

Crawlers inevitably run into anti-crawling, and captchas are one form of it. To beat a captcha, you first have to understand it.

Many websites now show all kinds of captchas. The simplest are graphic text captchas: if the characters are regular enough, OCR or a basic model library can recognize them. If you don't want to bother, you can plug into a captcha-solving service, and the accuracy is acceptable.
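
For the simple, regular graphic captchas, OCR is often enough. A minimal sketch with pytesseract; it assumes the Tesseract binary is installed and the image is clean enough to read after a bit of preprocessing:

```python
import pytesseract
from PIL import Image

# Load the captcha, convert to grayscale, and binarize to strip noise.
image = Image.open('captcha.png').convert('L')
image = image.point(lambda x: 255 if x > 140 else 0)   # threshold is a guess; tune per site

text = pytesseract.image_to_string(image)
print('recognized:', text.strip())
```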

However, plain image captchas are becoming rare; behavior-based captchas are everywhere now, from the major domestic verification vendors, and there are plenty abroad as well, such as reCAPTCHA. Some are relatively simple. Take slider captchas: you can identify the gap by image comparison or with deep learning, then generate a drag trajectory that imitates normal human behavior, with a little jitter added. Once you have the trajectory, how do you replay it? If you're good, you can analyze the captcha's JavaScript directly, feed in the trajectory data, obtain the encrypted parameters it produces, and submit those parameters straight to the form or the interface. Or you can do the drag in a simulated browser, obtain the encryption parameters that way, or simply handle the whole login in the simulated browser and keep crawling with the resulting cookies.

Of course, dragging is just one kind of captcha; there are also text selection, logical reasoning, and others. If you really don't want to deal with them, you can hand them to a captcha-solving platform and simulate with the result, but some experts prefer to train deep learning models themselves: collect data, label it, and train a different model for each business case. With that core capability there's no need to pay a solving platform, and once you've studied the captcha's logic you can work out the encrypted parameters as well. Still, some captchas are so obscure that nobody has cracked them yet.

Also, some captchas may only pop up because you are requesting too frequently; if that's the case, simply switching IPs can make them go away.

IP bans

IP bans are another headache, and the effective remedy is switching proxies.

There are many kinds of proxies on the market: free ones, and plenty of paid ones.

First, you can use the free proxies floating around the web: build your own proxy pool, collect free proxies from all over the Internet, and add a tester that keeps checking them. The test URL can be the site you actually want to crawl, so proxies that pass can generally be used directly against your target. I've built such a proxy pool myself; it aggregates a number of free proxy sources, crawls and tests them on a schedule, and exposes an API for fetching a proxy. It's on GitHub: https://github.com/Python3WebSpider/ProxyPool, with a Docker image and Kubernetes scripts you can use directly.
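
Using a pool like that from a crawler is just two requests. A minimal sketch, assuming the pool is running locally and exposes a plain-text random-proxy endpoint (the port and path below follow the project's README, but treat them as assumptions):

```python
import requests

PROXY_POOL_URL = 'http://localhost:5555/random'   # assumed endpoint of the local pool

def get_proxy():
    resp = requests.get(PROXY_POOL_URL, timeout=10)
    resp.raise_for_status()
    return resp.text.strip()                      # e.g. "123.45.67.89:8888"

proxy = get_proxy()
proxies = {
    'http': f'http://{proxy}',
    'https': f'http://{proxy}',
}

# Route the actual crawl request through the proxy.
resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(resp.text)
```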

Paid proxies work the same way. Many vendors provide an extraction API: one request can return dozens or hundreds of proxies, and we can feed those into the proxy pool as well. But paid proxies come in different packages; shared proxies, dedicated proxies, and the like differ in quality and in how likely they already are to be banned.

Some vendors also use tunneling, so you never learn the proxy's address and port; the proxy pool is maintained on their side. Services like that are less hassle to use, but you have less control.

There are also more stable proxies, such as dial-up proxies and cellular proxies. They cost more to access, but to some extent they solve the IP-ban problem better.

But none of this is as simple as it looks: why a supposedly good high-anonymity proxy inexplicably fails to crawl a site is a story we won't go into here.

Account bans

Some information can only be crawled after simulating a login. If you crawl too fast, the site will simply ban your account, and there's nothing you can say. For example, if you crawl WeChat official accounts and they ban your WeChat account, it's game over.

One solution, of course, is to slow down the frequency and control the pace.

Another is to look at other endpoints, such as the mobile page, the App, or the WAP page, and see whether there's a way to bypass the login.

An even better approach, if you have enough accounts, is to spread the load: build a pool, such as a Cookie pool, Token pool, or Sign pool. Whatever you call it, the pool holds the cookies or tokens of many accounts, and each request randomly takes one. If you want to keep the overall crawl rate the same, then with 100 accounts instead of 20, each account's cookie or token is used at one fifth of the frequency, and the probability of being banned drops accordingly.
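
A minimal sketch of the idea; the cookies here are hard-coded placeholders, whereas in practice they would live in Redis or a database and be refreshed by a separate login component:

```python
import random
import requests

# Cookies of several logged-in accounts (placeholders for illustration).
COOKIE_POOL = [
    {'sessionid': 'cookie-of-account-1'},
    {'sessionid': 'cookie-of-account-2'},
    {'sessionid': 'cookie-of-account-3'},
]

def crawl(url):
    cookies = random.choice(COOKIE_POOL)   # pick a random account per request
    return requests.get(url, cookies=cookies, timeout=10)

resp = crawl('https://example.com/protected-page')
print(resp.status_code)
```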

Unusual anti-crawling

The measures above are the more mainstream ones; there are of course plenty of strange anti-crawling tricks, such as returning fake data, returning images instead of text, returning shuffled data, returning data that curses at you, or returning data that begs for mercy. How to handle those depends entirely on the situation.

Be careful with these too. I've seen an anti-crawling response that simply returned rm -rf /. It's not unheard of, and if you happen to have a script that executes whatever is returned, you can imagine the consequences.

JavaScript reverse engineering

Now we come to the hard part. As front-end technology advances and websites become more anti-crawling conscious, many sites choose to put their effort into the front end, encrypting or obfuscating logic and code there. That's partly to keep the front-end code from being copied, but more importantly it's anti-crawling. For example, many Ajax interfaces carry parameters such as sign and token, as mentioned earlier. This data can still be crawled with Selenium and the other methods above, but in general that's too inefficient: you simulate the entire page-rendering process when the real data may be hiding behind one small interface.

If we can actually work out the logic behind those interface parameters and simulate it in code, efficiency multiplies, and the anti-crawling measures above can be avoided to some extent.
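
If you do manage to locate and extract the signing code, one common trick is to execute that JavaScript from Python rather than re-implementing it. A minimal sketch with PyExecJS; the sign.js file and the makeSign function are hypothetical stand-ins for whatever you lifted from the site, and a Node.js runtime is assumed:

```python
import time
import execjs
import requests

# sign.js is assumed to contain the signing logic extracted from the site,
# exposing a function like makeSign(path, timestamp).
with open('sign.js', encoding='utf-8') as f:
    ctx = execjs.compile(f.read())

path = '/api/articles'
ts = int(time.time() * 1000)
sign = ctx.call('makeSign', path, ts)      # run the site's own JS to get the signature

resp = requests.get(
    'https://example.com' + path,
    params={'t': ts, 'sign': sign},
    timeout=10,
)
print(resp.json())
```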

But what's the catch? Getting to that point is hard.

Webpack is one side of it: the front-end code is compressed and bundled, variable names lose their meaning, and restoring it is not easy. Then some sites add obfuscation on top, turning the front-end code into something you can't understand at all: string splitting and scrambling, hexadecimal variable names, control-flow flattening, endless debugger traps, disabled consoles, and so on, until the code and logic are unrecognizable. Some even compile their core front-end logic with WebAssembly and similar technologies, and then you can only grind through it slowly. There are tricks, but it still takes a lot of time. Once it's cracked, though, everything is fine. How to put it? It's like an olympiad problem: can't solve it, GG.

Many companies hiring crawler engineers will ask whether you have a JavaScript reverse-engineering background and which sites you've cracked, such as the big e-commerce or news platforms; if one of them happens to be what they need, they may hire you on the spot. Every site's logic is different, and so is the difficulty.

App

Of course, crawling is not just about the web. As the Internet has developed, more and more companies choose to put their data in an App; some companies even have only an App and no website, so the data can only be crawled through the App.

How do you crawl it? The basic tool is packet capture: Charles or Fiddler will do the job. Once you've captured the interface, you can simulate it directly.

What if the interface has encrypted parameters? One approach is to process the traffic on the fly while crawling, for example by having mitmproxy listen directly on the interface data. Another is hooking, for example intercepting the results on the device with Xposed.
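
A minimal sketch of the mitmproxy route: an addon script that dumps the responses of one interface as the App is used. It assumes the phone is configured to route its traffic through mitmproxy, and the interface path is hypothetical.

```python
# capture.py -- run with: mitmdump -s capture.py
import json
from mitmproxy import http

TARGET = '/api/feed'   # hypothetical interface path to watch for

def response(flow: http.HTTPFlow) -> None:
    # Only keep responses from the interface we care about.
    if TARGET in flow.request.pretty_url:
        data = json.loads(flow.response.get_text())
        with open('feed.jsonl', 'a', encoding='utf-8') as f:
            f.write(json.dumps(data, ensure_ascii=False) + '\n')
```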

And how do you automate the crawling? You can't poke at the phone by hand. There are actually plenty of tools: Android's native adb works, Appium is the mainstream option nowadays, and various other automation frameworks can do it too.

Finally, sometimes you really don't want to go through the automation route at all; you just want to extract the interface logic from inside the App. Then you have to reverse-engineer it, and tools like IDA Pro, jadx, and Frida come in handy. This process is every bit as painful as JavaScript reverse engineering, and may even involve reading assembly; losing a head of hair over a single case is not out of the question.

Intelligent parsing

If you're familiar with everything above, congratulations: you've already surpassed 80 or 90 percent of crawler developers. Of course, the people who specialize in JavaScript reverse engineering and App reverse engineering stand at the top of the food chain; strictly speaking that goes beyond crawling itself, and we don't count ourselves among those gods. I certainly don't.

Beyond the skills above, in some situations we may also want to bring in machine learning techniques to make our crawler smarter.

For example, many blogs and news articles today have very similar page structures and similar information to extract.

For example: how do you tell whether a page is an index page or a detail page? How do you extract the article links to the detail pages? How do you parse the body of an article page? All of these can actually be computed with algorithms.

So intelligent parsing techniques have come into play, such as automatic extraction of detail pages; the GeneralNewsExtractor written by a friend of mine performs very well.

Suppose I need to crawl data from 10,000 news sites. Do I have to write XPath for each one, site by site? That would kill me. With intelligent parsing, and a tolerance for some error, finishing that becomes a matter of minutes.
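
A minimal sketch of that kind of extraction with GeneralNewsExtractor (the gne package); the article URL is a placeholder, and the result field names follow that project's documentation, so treat them as assumptions:

```python
import requests
from gne import GeneralNewsExtractor

# Fetch some news article page (URL is a placeholder).
html = requests.get('https://example-news-site.com/some-article', timeout=10).text

extractor = GeneralNewsExtractor()
result = extractor.extract(html)

# No per-site XPath needed: the extractor infers the main fields itself.
print(result['title'])
print(result['publish_time'])
print(result['content'][:200])
```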

In short, if we can master this area as well, our crawling toolkit becomes that much stronger.

Operations and maintenance

This area is also a big one: crawlers and operations are closely related.

For example, after writing a crawler, how do you quickly deploy it to 100 hosts and run it?

For example, how do you flexibly monitor the running status of each crawler?

For example, when the crawler code changes, how do you roll out updates quickly?

For example, how do you monitor the memory and CPU consumption of each crawler?

For example, how do you schedule crawlers to run at the right times?

For example, when a crawler breaks, how do you get notified in time, and how do you set up a sound alerting mechanism?

Everyone has their own approach to deployment here; Ansible is one option. If you use Scrapy, there's Scrapyd, and with some management tools on top you can also handle monitoring and scheduled tasks. But I mostly use Docker + Kubernetes, plus a DevOps pipeline such as GitHub Actions, Azure Pipelines, or Jenkins, to get fast distribution and deployment.

For scheduled tasks, some people use crontab, some use apscheduler, some use management tools, and some use Kubernetes. In my case it's mostly Kubernetes, where scheduled tasks are also very easy to set up.
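
A minimal sketch of the apscheduler route, with a made-up run_spider entry point standing in for your actual crawl:

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def run_spider():
    # Stand-in for kicking off a real crawl (a Scrapy crawl, a script, etc.).
    print('crawl started')

scheduler = BlockingScheduler()
# Run every day at 02:00, like a crontab entry "0 2 * * *".
scheduler.add_job(run_spider, 'cron', hour=2, minute=0)
scheduler.start()
```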

As for monitoring, there are many options. Some dedicated crawler-management tools come with monitoring and alerting built in, and some cloud services provide monitoring as well. I use Kubernetes + Prometheus + Grafana: CPU, memory, and running status are clear at a glance, and alerting is easy to configure in Grafana, with support for webhooks, email, and even DingTalk.
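
If you go the Prometheus route, the crawler can also expose its own business metrics for Prometheus to scrape and Grafana to chart. A minimal sketch with prometheus_client; the metric names and the fake work loop are made up for illustration, and Prometheus is assumed to scrape this process on port 8000:

```python
import time
import random
from prometheus_client import Counter, Gauge, start_http_server

PAGES_CRAWLED = Counter('crawler_pages_total', 'Total pages crawled')
QUEUE_SIZE = Gauge('crawler_queue_size', 'Pending URLs in the crawl queue')

def crawl_one():
    time.sleep(random.random())                  # stand-in for a real request
    PAGES_CRAWLED.inc()

if __name__ == '__main__':
    start_http_server(8000)                      # metrics exposed at /metrics
    while True:
        QUEUE_SIZE.set(random.randint(0, 100))   # stand-in for real queue depth
        crawl_one()
```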

For data storage and monitoring, Kafka and Elasticsearch are both very convenient in my experience; I mainly use the latter, paired with Grafana, so metrics like crawl volume and crawl speed are also clear at a glance.

Conclusion

At this point we've covered most of the knowledge areas a crawler touches. Looking back: computer networks, programming fundamentals, front-end development, back-end development, App development and reverse engineering, network security, databases, operations and maintenance, machine learning — all covered, right? The above can be taken as the path from crawler beginner to crawler master. There is a great deal worth studying in every one of those directions, and refining any single point will take you surprisingly far.

Crawler engineers often learn their way into being full-stack engineers, or do-everything engineers, because you may really end up doing all of it. But there's no way around it: the crawler forces you to. If it weren't for the pressures of life, who would want to carry this much talent?

And once you have the talent? You reach up to touch the top of your head: wait, where did my hair go?

Well, you get the idea.

Last but not least: cherish life, and cherish every strand of hair.

