Python crawler resource roundup: books, blogs, frameworks, tools, and projects (with resources)


  • Must-read book list -

You don’t need piles of books and tutorials; for Python crawlers, these eight books are enough.

  • Websites & blogs -

This project collects the login flows of major websites along with crawler programs for some of them, for the purpose of researching and sharing simulated-login techniques and site crawlers.

URL: https://awesome-python

The author of "Python3 Web Crawler and Development Practice" shares his own crawler cases and experience on this blog; the content is very rich.

Website: https://cuiqingcai.com

Scraping.pro

Scraping.pro is a site specializing in reviews of web-scraping software, carrying in-depth review articles on leading tools such as Scrapy and Octoparse.

Website: http://scraping.pro/

KDnuggets

Compared with Scraping.pro, KDnuggets covers a much wider scope, including business analytics, big data, data mining, and data science.

Website: https://www.kdnuggets.com/

Octoparse
Octoparse is a powerful free scraping tool. Its blog covers a wide range of topics in plain language, making it a good fit for users new to web scraping.

Website: https://www.octoparse.com

Big Data News
Big Data News is similar to KDnuggets, covering mainly the big-data industry, with web scraping as one of its sub-columns.

Website: https://www.bigdatanews

Analytics Vidhya

Similar to Big Data News, Analytics Vidhya is a more specialized data site, covering data science, machine learning, web scraping, and more.

Website: https://www.analyticsvidhya

  • Crawler frameworks -

Scrapy


Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and historical archiving.

Website: https://scrapy.org

PySpider

PySpider is a powerful web-crawler system written in Python. Its browser-based interface lets you write scripts, schedule jobs, and watch crawl results in real time.

The backend stores crawl results in any of several common databases, and tasks can be scheduled on a timer with per-task priorities.

URL: https://pyspider

Crawley
Crawley crawls site content at high speed, supports both relational and non-relational databases, and can export data to JSON, XML, and other formats.

Website: http://crawley-cloud.com/

Portia
Portia is an open-source visual crawling tool that lets you scrape websites without any programming knowledge.

Website: https://portia

Newspaper
Newspaper extracts news articles and supports content analysis. It uses multi-threading and supports more than ten languages.

Website: https://newspaper

Beautiful Soup
Beautiful Soup is a Python library that can extract data from HTML or XML files.

Working with your favorite parser, it provides idiomatic ways of navigating, searching, and modifying the parse tree.

URL: https://BeautifulSoup/bs4/doc/
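To illustrate those three operations (navigate, search, modify), here is a short sketch, assuming `bs4` is installed; the HTML snippet and class names are made up for the example, and the stdlib `html.parser` backend is used so no extra parser is needed:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>News</h1>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

# html.parser is the stdlib parser; lxml can be swapped in if installed
soup = BeautifulSoup(html, "html.parser")

# Navigate: jump straight to the first <h1>
print(soup.h1.get_text())

# Search: find all list-item links and pull out text and href
links = [(a.get_text(), a["href"]) for a in soup.select("li.item a")]
print(links)

# Modify: rewrite the first link in place
soup.select_one("a")["href"] = "/new"
```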

Grab
Grab is a Python framework for building web scrapers.

You can build web scrapers of varying complexity, from simple 5-line scripts to complex asynchronous website scrapers that handle millions of web pages.

URL: http://grab-spider-user-manual

Cola
Cola is a distributed crawler framework. Users only need to write a few specific functions, without worrying about the details of distributed execution.

Project address: https://github.com/chineking/cola

  • Tools -
Four HTTP proxy tools

(1)Fiddler

Fiddler is the best visual packet capture tool on the Windows platform, and it is also the most well-known HTTP proxy tool.

It is extremely powerful: besides giving a clear view of every request and response, it lets you set breakpoints, modify request data, and intercept response content.

Link: https://www.telerik.com/fiddler
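To have a Python crawler's traffic show up in a local debugging proxy like Fiddler, point the HTTP client at the proxy's listening address. A minimal sketch with `requests`, assuming Fiddler's default port 8888 (adjust if you changed it); nothing is actually sent here:

```python
import requests

# Fiddler listens on 127.0.0.1:8888 by default (assumed here)
PROXY = "http://127.0.0.1:8888"

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}

# For HTTPS interception you must trust the proxy's root certificate;
# during local debugging it is common to disable verification instead:
session.verify = False

# session.get("https://example.com")  # traffic now flows through the proxy
print(session.proxies)
```

The same `proxies` setting works for Charles or any other local HTTP proxy.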

(2)Charles

Charles is one of the best packet capture and analysis tools on the macOS platform.

It provides a simple, intuitive GUI. Its basic features include capturing HTTP and HTTPS requests and modifying request parameters; Charles 4 also supports HTTP/2.

Link: https://www.charlesproxy.com/

(3)AnyProxy

AnyProxy is Alibaba's open-source HTTP packet capture tool, built on Node.js.

Its advantage is that it supports secondary development, so you can customize the request-handling logic. If you can write JavaScript and need custom processing, AnyProxy is a good fit.

GitHub address: https://github.com/alibaba/anyproxy

(4)mitmproxy

mitmproxy is a Python-based packet capture tool with SSL support; it is cross-platform and offers an interactive command-line mode.

GitHub address: https://github.com/mitmproxy/mitmproxy
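Beyond interactive use, mitmproxy is scriptable: an addon is a plain Python module whose functions are named after mitmproxy's event hooks. A minimal sketch (the `X-Debug` header is made up for illustration; run it with `mitmproxy -s addon.py`):

```python
# Minimal mitmproxy addon sketch. mitmproxy calls these functions by
# name as events fire for each flow passing through the proxy.

def request(flow):
    # Tag every outgoing request so it is easy to spot on the server side
    flow.request.headers["X-Debug"] = "1"


def response(flow):
    # Log the URL and status code of every response passing through
    print(flow.request.url, flow.response.status_code)
```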

Summary of Python crawler tools


This is a roundup of tools for Python crawlers; almost anything you can think of can be found here.

URL: https://lartpang/spyder_tool

httpbin

This site is handy for testing crawlers (over both HTTP and HTTPS): it echoes back information about the requesting client and can be used for quick online experiments.

Website: httpbin.org

curl to python


This site quickly converts curl commands into Python requests code (other languages are supported too); curl commands can be copied straight from the browser's developer tools.

Website: https://curl.trillworks.com
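A converted command typically looks like the sketch below (the curl command and header values are made up for the example). Building a `PreparedRequest` lets you inspect the result without sending anything:

```python
import requests

# curl 'https://httpbin.org/get?q=python' -H 'User-Agent: my-crawler/1.0'
# converts (roughly) to the requests call below.
req = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"q": "python"},
    headers={"User-Agent": "my-crawler/1.0"},
)
prepared = req.prepare()  # build the request without sending it

print(prepared.url)
print(prepared.headers["User-Agent"])
# requests.Session().send(prepared)  # uncomment to actually send it
```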

Online conversion


Sometimes a page displays Chinese, but its source shows Unicode escape sequences instead. This site converts those escapes back into Chinese online.

URL: https://unicode_chinese/
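The same conversion can be done locally with the standard library's `unicode_escape` codec, which decodes literal `\uXXXX` sequences back into real characters:

```python
import codecs

# The page source shows escape sequences instead of Chinese characters:
raw = r"\u7f51\u7edc\u722c\u866b"

# unicode_escape decodes the \uXXXX sequences back into real characters
text = codecs.decode(raw, "unicode_escape")
print(text)  # 网络爬虫 ("web crawler")
```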

XPath Helper


This tool is a Chrome extension that helps you write and debug XPath expressions.

Link: https://xpath-helper/
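An expression verified in XPath Helper can then be dropped into Python code. As a sketch using only the standard library (note that `xml.etree.ElementTree` supports just a limited XPath subset; `lxml` implements much more; the HTML snippet is made up for the example):

```python
import xml.etree.ElementTree as ET

html = """<html><body>
  <div id="news">
    <a href="/a">First</a>
    <a href="/b">Second</a>
  </div>
</body></html>"""

root = ET.fromstring(html)

# ElementTree's limited XPath subset still handles common patterns
# like attribute predicates and descendant axes:
titles = [a.text for a in root.findall(".//div[@id='news']/a")]
print(titles)
```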



Origin blog.csdn.net/Z987421/article/details/133323552