A summary of getting started with web crawlers

What are web crawlers?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the Web according to certain rules.

Prerequisite knowledge

In my experience, to learn Python web crawling, you need to learn the following:

  • Python basics
  • The Python requests library
  • Python regular expressions
  • The Python crawler framework Scrapy
  • More advanced crawler features

1. Python basics

Lists (list), dictionaries (dict), loops, and conditionals.
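
As a quick refresher, here is a minimal sketch of those constructs; the page names and status codes are made up purely for illustration:

```python
# Lists, dicts, loops, and conditionals in a crawler-flavored toy example.
pages = ["index.html", "about.html", "contact.html"]  # a list
status = {"index.html": 200, "about.html": 404}       # a dict

for page in pages:                  # a for loop
    code = status.get(page, 0)      # dict lookup with a default value
    if code == 200:                 # a conditional
        print(page, "is reachable")
    else:
        print(page, "returned", code)
```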

2. The Python requests library

Using this library we can fetch the content of a web page, then extract and analyze that content with regular expressions to get the results we want.
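
Here is a minimal sketch of that workflow, assuming requests is installed; the URL is only a placeholder, and the regex extraction mirrors the approach described above:

```python
# Fetch a page with requests and extract its title with a regular expression.
import re

import requests

response = requests.get("https://example.com")  # placeholder URL
response.raise_for_status()                     # raise an error on bad HTTP status

# Pull the page title out of the HTML with a simple regex.
match = re.search(r"<title>(.*?)</title>", response.text, re.IGNORECASE | re.DOTALL)
if match:
    print("Page title:", match.group(1).strip())
```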

3. Python regular expressions

Python regular expressions are a powerful tool for matching strings. The design idea is to use a descriptive language to define a rule: any string that complies with the rule is considered a "match"; otherwise, the string is not legitimate. This will be covered in later posts on this blog.
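
As a small illustration of that "define a rule, then test strings against it" idea, here is a minimal sketch; the pattern and sample strings are made up for the example:

```python
# Define a rule with a regex, then check which strings "match" it.
import re

# Rule: a "legitimate" string here is an email-like token such as user@example.com
pattern = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

for candidate in ["alice@example.com", "not-an-email", "bob@test.org"]:
    if pattern.match(candidate):
        print(candidate, "-> match")
    else:
        print(candidate, "-> not legitimate")
```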

4. The Python crawler framework Scrapy

Once you have mastered the basics of Python and of crawling, it is time to look at a Python crawler framework. The framework I chose is Scrapy. How powerful is it? Here is the description from its official documentation (a minimal spider sketch follows the feature list below):

  • Built-in support for selecting and extracting data from HTML and XML sources.
  • A set of reusable components shared between spiders (i.e. Item Loaders), with built-in support for intelligently processing the crawled data.
  • Built-in support for exporting data in multiple formats (JSON, CSV, XML) via feed exports, and for multiple storage backends (FTP, S3, the local file system).
  • A media pipeline that can automatically download images (or other resources) associated with the crawled data.
  • High extensibility: you can plug in custom functionality using signals and a well-defined API (middlewares, extensions, pipelines).
  • Built-in middlewares and extensions that provide support for:
      • cookies and session handling
      • HTTP compression
      • HTTP authentication
      • HTTP caching
      • user-agent spoofing
      • robots.txt
      • crawl depth restriction
  • Automatic detection and robust encoding support for non-English, non-standard, or broken encoding declarations.
  • Support for generating spiders from templates, which speeds up spider creation and keeps the code of large projects more consistent. See the genspider command for details.
  • An extensible stats collection facility for evaluating performance and detecting failures across multiple spiders.
  • An interactive shell console for testing XPath expressions, which is very convenient for writing and debugging spiders.
  • A system service to simplify deployment and operation in production environments.
  • A built-in web service for monitoring and controlling your bot.
  • A built-in Telnet console for hooking into a Python console running inside the Scrapy process, allowing you to inspect and debug your crawler.
  • A logging facility that makes it convenient to catch errors during the crawling process.
  • Support for crawling from Sitemaps.
  • A DNS resolver with caching.
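
To make this concrete, here is a minimal spider sketch, assuming Scrapy is installed (pip install scrapy); the target site and CSS selectors follow the quotes example from the official tutorial and are illustrative only:

```python
# A minimal Scrapy spider: extract quote text and follow pagination links.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one field per quote block on the page.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

        # Follow the "next page" link, if present.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as a file such as quotes_spider.py, it could be run with `scrapy runspider quotes_spider.py -o quotes.json`, which also exercises the JSON feed export mentioned in the list above.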

Source: www.cnblogs.com/zhifeiji822/p/11981162.html