10 recommended Python crawler frameworks, which one are you using?

There are many programming environments in which crawlers can be built: Java, Python, C++, and others can all do the job. Yet most people choose Python. Why? Because Python is genuinely well suited to crawling: its rich third-party libraries are powerful, and a few lines of code can achieve the functionality you need. More importantly, Python is also excellent at data mining and analysis. So which frameworks are the better choices for Python crawlers?

Generally speaking, a Python crawler framework is only worth bringing in when the job is relatively large; the main point of doing so is easier management and extension. In this article I recommend ten Python crawler frameworks.

1. Scrapy: Scrapy is an application framework written to crawl websites and extract structured data; it can be used for data mining, information processing, or archiving historical data. It is a very powerful crawler framework and handles straightforward page crawling well, for example when the URL pattern is clearly known; with it you can easily pull down data such as Amazon product listings. For slightly more complicated pages, such as Weibo's, the framework alone may not be enough. Its features include built-in support for selecting and extracting data from HTML and XML sources, reusable components (Item Loaders) shared between spiders, and built-in support for intelligently processing the crawled data.
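
To give a feel for the framework, here is a minimal Scrapy spider sketch; quotes.toscrape.com is just a public practice site, and the CSS selectors below are specific to that page.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider: yields one item per quote on each page."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Selectors match the structure of quotes.toscrape.com
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination, if a "next" link is present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run without a full project via `scrapy runspider quotes_spider.py -o quotes.json`, which writes the scraped items to a JSON file.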

2. Crawley: crawls the content of target websites at high speed, supports both relational and non-relational databases, and can export the data as JSON, XML, and more.

3. Portia: an open source visual crawling tool that lets users scrape websites without any programming knowledge. Simply annotate the pages you are interested in, and Portia will create a spider to extract data from similar pages. Put simply: it is built on the Scrapy engine, crawls content visually with no development expertise required, and dynamically matches content across pages that share the same template.

4. Newspaper: used to extract news and articles and to perform content analysis. It is multithreaded and supports more than 10 languages, with all output in Unicode. The author drew inspiration from the simplicity and power of the requests library and built a comparable Python tool for extracting article content.
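
A minimal sketch of the newspaper (newspaper3k) workflow; the URL is a placeholder, and the `nlp()` step assumes the NLTK data it depends on has already been downloaded.

```python
from newspaper import Article

# Placeholder URL: substitute a real news article
url = "https://example.com/some-news-story"

article = Article(url, language="en")
article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, body text, top image, etc.

print(article.title)
print(article.authors)
print(article.text[:300])

article.nlp()        # optional: keyword and summary extraction
print(article.keywords)
print(article.summary)
```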

5. Python-goose: a Python rewrite of Goose, an article extraction tool originally written in Java. Python-goose can extract the main body of an article, its main image, any embedded YouTube/Vimeo videos, the meta description, and meta tags.
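
A sketch of typical goose usage; the import path shown is for the older python-goose package (the maintained Python 3 fork is goose3, with the same calls), and the URL is a placeholder.

```python
from goose import Goose  # for the Python 3 fork use: from goose3 import Goose

g = Goose()
article = g.extract(url="https://example.com/some-article")  # placeholder URL

print(article.title)               # article headline
print(article.meta_description)    # meta description tag
print(article.cleaned_text[:300])  # main body text, stripped of boilerplate
if article.top_image:
    print(article.top_image.src)   # URL of the article's main image
```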

6. Beautiful Soup: well known, and covers many common crawling needs. It is a Python library for pulling data out of HTML and XML files, providing idiomatic ways to navigate, search, and modify the parse tree through your favorite parser. Beautiful Soup can save you hours or even days of work. Its drawback is that it cannot execute JavaScript, so content rendered dynamically in the browser is out of its reach on its own.
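
A minimal Beautiful Soup sketch, paired with requests to fetch the page; example.com is just a stand-in URL.

```python
import requests
from bs4 import BeautifulSoup

# Stand-in URL; replace with the page you actually want to scrape
html = requests.get("https://example.com/").text

soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)         # contents of the <title> tag
for link in soup.find_all("a"):  # every <a> tag on the page
    print(link.get("href"))
```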

7. mechanize: a stateful, programmatic browsing library; its strength is simulating a browser's behavior, such as handling cookies, filling in forms, and following links (note that it does not execute JavaScript). Its shortcoming is a serious lack of documentation, but between the official examples and some hands-on trial and error it is still usable.
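
A rough mechanize sketch for submitting a login form; the URL and the form field names (`username`, `password`) are assumptions for illustration only.

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # ignore robots.txt (use responsibly)
br.addheaders = [("User-agent", "Mozilla/5.0")]  # present a normal browser user agent

br.open("https://example.com/login")   # placeholder URL
br.select_form(nr=0)                   # assume the login form is the first form on the page
br["username"] = "my_user"             # field names are assumptions
br["password"] = "my_pass"
response = br.submit()

print(response.geturl())        # where the submission landed
print(response.read()[:200])    # first bytes of the resulting page
```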

8. Selenium: a tool that drives a real browser. Through this library you can control the browser directly to perform operations such as entering a verification code. Selenium is an automated testing tool that supports mainstream browsers including Chrome, Safari, and Firefox, driving them through the corresponding browser drivers so you can exercise a web interface just as a user would. It offers bindings for multiple languages, such as Java, C#, Ruby, and Python. A common crawling setup pairs a headless browser (historically PhantomJS) to render and execute the JavaScript, Selenium to drive it, and Python to do the post-processing.
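
A minimal Selenium 4 sketch driving Chrome; it assumes a matching chromedriver is available (recent Selenium releases can fetch one automatically), and the URL and selector are placeholders. Older Selenium 3 code uses the find_element_by_* methods instead of By.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()              # needs Chrome plus a matching chromedriver
try:
    driver.get("https://example.com/")   # placeholder URL
    print(driver.title)

    # Grab the first <h1>; the selector is just for illustration
    heading = driver.find_element(By.CSS_SELECTOR, "h1")
    print(heading.text)
finally:
    driver.quit()                        # always close the browser
```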

9. cola: a distributed crawler framework. Users only need to write a handful of specific functions and never have to deal with the details of distributed execution; tasks are automatically distributed across multiple machines, and the whole process is transparent to the user. The project's overall design is somewhat rough, though, with tight coupling between modules.

10. PySpider: a powerful web crawler system developed by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, and supports multiple database backends; the WebUI provides a script editor, task monitor, project manager, and result viewer. Crawls are controlled by Python scripts, and you can use whatever HTML parsing package you like.
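
PySpider scripts are normally written as handler classes inside its WebUI; the sketch below follows the shape of the project's default template, with example.com as a placeholder start URL.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Placeholder start URL
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc("title").text(),
        }
```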
