- Must reading list -
You don’t need piles of books and tutorials; for Python crawlers, these eight books are enough.
- Website Blog -
This project collects login methods for major websites along with crawler programs for a number of sites, in order to research and share simulated-login techniques and site-specific crawlers.
URL: https://awesome-python
On this blog, the author of "Python3 Web Crawler and Development Practice" shares his own crawler cases and experience; the content is very rich.
Website: https://cuiqingcai.com
Scraping.pro
Scraping.pro is a professional review site for scraping software, featuring in-depth evaluations of top tools such as Scrapy and Octoparse.
Website: http://scraping.pro/
KDnuggets
Compared with Scraping.pro, KDnuggets covers a much wider scope, including business analytics, big data, data mining, and data science.
Website: https://www.kdnuggets.com/
Octoparse
Octoparse is a powerful free scraping tool. Its blog covers a wide range of topics in accessible language, making it a good fit for users who are new to web scraping.
Website: https://www.octoparse.com
Big Data News
Big Data News is similar to KDnuggets, focusing mainly on the big data industry, with web scraping as one of its sub-columns.
Website: https://www.bigdatanews
Analytics Vidhya
Similar to Big Data News, Analytics Vidhya is a more specialized data site, covering data science, machine learning, web scraping, and more.
Website: https://www.analyticsvidhya
- Crawler framework -
Scrapy
Scrapy is an application framework written to crawl websites and extract structured data. It can be used for a wide range of tasks, including data mining, information processing, and archiving historical data.
Website: https://scrapy.org
PySpider
PySpider is a powerful web crawler system implemented in Python. You can write scripts, schedule tasks, and view crawl results in real time from its browser-based interface.
The backend stores crawl results in any of several common databases, and it also supports scheduled tasks and task priorities.
URL: https://pyspider
Crawley
Crawley crawls a site's content at high speed, supports both relational and non-relational databases, and can export data to JSON, XML, and other formats.
Website: http://crawley-cloud.com/
Portia
Portia is an open-source visual crawling tool that lets you scrape websites without any programming knowledge.
Website: https://portia
Newspaper
Newspaper can be used to extract news articles and perform content analysis. It uses multithreading and supports more than ten languages.
Website: https://newspaper
Beautiful Soup
Beautiful Soup is a Python library that can extract data from HTML or XML files.
It provides idiomatic ways of navigating, searching, and modifying the document through the parser of your choice.
URL: https://BeautifulSoup/bs4/doc/
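A minimal sketch of the core workflow (parse a document, then search it), using Python's built-in html.parser as the backend; the HTML snippet is illustrative:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>News</h1>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

# Parse with the stdlib parser; "lxml" or "html5lib" also work if installed.
soup = BeautifulSoup(html, "html.parser")

# find_all() searches the tree; get_text() and ["href"] extract data.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)  # [('First', '/a'), ('Second', '/b')]
```

Swapping "html.parser" for "lxml" or "html5lib" changes only the parser argument; the navigation and search API stays the same.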
Grab
Grab is a Python framework for building web scrapers.
You can build web scrapers of varying complexity, from simple 5-line scripts to complex asynchronous website scrapers that handle millions of web pages.
URL: http://grab-spider-user-manual
Cola
Cola is a distributed crawler framework: users only need to write a few specific functions, without worrying about the details of distributed execution.
Project address: https://github.com/chineking/cola
- Tools -
(1)Fiddler
Fiddler is the best visual packet capture tool on the Windows platform and also the best-known HTTP proxy tool.
It is very powerful: beyond giving a clear view of every request and response, it lets you set breakpoints, modify request data, and intercept response content.
Link: https://www.telerik.com/fiddler
(2)Charles
Charles is one of the best packet capture and analysis tools on the macOS platform.
It provides a GUI that is simple and clean. Its core features include capturing HTTP and HTTPS requests and modifying request parameters; the latest Charles 4 also supports HTTP/2.
Link: https://www.charlesproxy.com/
(3)AnyProxy
AnyProxy is Alibaba's open-source HTTP packet capture tool, implemented on Node.js.
Its advantage is that it supports secondary development: request-handling logic can be customized. If you can write JavaScript and need tailored processing, AnyProxy is a good fit.
GitHub address: https://alibaba/anyproxy
(4)mitmproxy
mitmproxy is a Python-based packet capture tool with SSL support. It is cross-platform and offers an interactive command-line mode.
GitHub address: https://mitmproxy/
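mitmproxy can also run Python addon scripts that hook into traffic as it passes through the proxy. A minimal sketch follows; the `request` hook and the `addons` list are mitmproxy's scripting convention, while the header name here is purely illustrative:

```python
# Save as add_header.py and run with: mitmproxy -s add_header.py
class AddHeader:
    def request(self, flow):
        # Called once per client request passing through the proxy;
        # tag the outgoing request with an illustrative debug header.
        flow.request.headers["x-debug"] = "1"


# mitmproxy loads every object listed in the module-level `addons` list.
addons = [AddHeader()]
```

Because addon hooks are duck-typed, the script itself has no mitmproxy import; the proxy supplies the `flow` objects at runtime.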
This is a summary of tools for Python crawlers; almost anything you can think of can be found there.
URL: https://lartpang/spyder_tool
This website can be used to test crawlers over both HTTP and HTTPS: it echoes back information about the requesting client, which makes it handy for online testing.
Website: httpbin.org
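For example, a GET request to httpbin's /get endpoint echoes back the query arguments and headers it received. The sketch below only builds the request URL with the standard library (no network call is made here); the parameter names are illustrative:

```python
from urllib.parse import urlencode

# httpbin.org/get echoes back the query string, headers, and origin IP
# of whoever requests it, which makes it ideal for testing crawlers.
params = {"q": "python crawler", "page": 1}
url = "https://httpbin.org/get?" + urlencode(params)
print(url)  # https://httpbin.org/get?q=python+crawler&page=1
```

Fetching that URL returns a JSON body whose "args" field contains the same parameters, so a crawler can verify exactly what it sent.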
This website quickly converts curl commands into Python requests code (other languages are available too); the curl commands themselves can be copied straight from the browser's developer tools.
Website: https://curl.trillworks.com
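The converter emits code for the third-party requests library. As a standard-library-only sketch of the same idea (the header value here is illustrative), the equivalent of `curl -H "User-Agent: my-bot/1.0" https://httpbin.org/get` is roughly:

```python
import urllib.request

# Build (but do not send) a request mirroring the curl command above.
req = urllib.request.Request(
    "https://httpbin.org/get",
    headers={"User-Agent": "my-bot/1.0"},
)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen()` would actually send it; the point here is only that the URL, method, and headers map one-to-one from the curl flags.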
Sometimes a page displays Chinese in the browser, but its source shows literal \uXXXX Unicode escape sequences instead. In that case we need a tool to convert the escapes back to Chinese.
URL: https://unicode_chinese/
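The same conversion can be done locally in Python. A small sketch, assuming the source contains literal \uXXXX escape sequences (note that the unicode_escape codec is only safe when the input is pure ASCII):

```python
# A page source that shows literal \uXXXX escapes instead of Chinese text.
raw = r"\u4e2d\u6587\u722c\u866b"

# Round-trip through ASCII bytes so unicode_escape can decode the escapes.
text = raw.encode("ascii").decode("unicode_escape")
print(text)  # 中文爬虫
```

If the input may already contain non-ASCII characters, decode only the escaped fragments this way, since unicode_escape would mangle raw multibyte text.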
This tool is a Chrome extension that assists in analyzing and debugging XPath expressions.
Link: https://xpath-helper/
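XPath expressions can also be prototyped offline before being dropped into the extension. Python's built-in xml.etree.ElementTree supports a small but useful XPath subset; the markup below is illustrative:

```python
import xml.etree.ElementTree as ET

# ElementTree handles a limited XPath subset (paths, //, [@attr=...]),
# which is enough to prototype simple expressions without a browser.
doc = ET.fromstring(
    "<ul><li class='hit'>a</li><li>b</li><li class='hit'>c</li></ul>"
)
hits = [li.text for li in doc.findall(".//li[@class='hit']")]
print(hits)  # ['a', 'c']
```

For full XPath 1.0 (functions, axes, etc.), the third-party lxml library's `xpath()` method is the usual upgrade path.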