Getting Started with Python Crawler: An Overview

Hello everyone. I have recently been learning Python, and along the way I ran into some problems and gained some experience. Here I will organize my learning notes systematically. If you are interested in learning web crawlers, you can use these articles as a reference, and you are also welcome to share your own learning experience.

Python version: 2.7. If you are using Python 3, please look for another blog post.

First of all, what is a crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web page chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules.

Based on my experience, to learn Python web crawling we need to cover the following points:

·  Python basics

·  Usage of the urllib and urllib2 libraries in Python

·  Python regular expressions

·  The Python crawler framework Scrapy

·  More advanced crawler features

1. Python basic learning

First of all, since we are going to use Python to write crawlers, we must understand the basics of Python. You can't build anything without a foundation, haha. Below I will share some Python tutorials I have read before; friends can use them as a reference.

1) MOOC Python Tutorial

I first read through the basic syntax on the MOOC site. It comes with some exercises attached, which you can use for practice after each lesson, and I felt the effect was quite good. Unfortunately the content is mostly just the basics, but if that is what you need, this is it.

Learning URL: MOOC Python Tutorial

2) Liao Xuefeng Python Tutorial

Later, I found Mr. Liao's Python tutorial, which is very easy to understand and feels very good. If you want to know more about Python, read this one.

Learning URL: Liao Xuefeng Python Tutorial

3) Concise Python Tutorial

There is also the Concise Python Tutorial, which I have read and which also feels good.

Learning URL: Concise Python Tutorial

2. Usage of Python urllib and urllib2 libraries

The urllib and urllib2 libraries are the most basic libraries for learning Python crawlers. With these libraries, we can fetch the content of web pages, and then extract and analyze that content with regular expressions to get the results we want. I will share this with you as I learn.
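To give a first taste of what this looks like, here is a minimal sketch of fetching a page with urllib2, assuming Python 2.7; the URL is just a placeholder for illustration:

```python
# -*- coding: utf-8 -*-
# Minimal page fetch with urllib2 (Python 2.7).
# The URL below is only a placeholder for illustration.
import urllib2

url = 'http://www.example.com/'
request = urllib2.Request(url)       # build a Request object
response = urllib2.urlopen(request)  # send the HTTP request
html = response.read()               # read the raw HTML as a string
print html[:200]                     # show the first 200 characters
```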

3. Python Regular Expressions

Python regular expressions are a powerful weapon for matching strings. The design idea is to use a descriptive language to define a rule for strings: any string that conforms to the rule is considered a "match"; otherwise, the string does not match. This will be covered in a later blog post.
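As a quick illustration, here is a small sketch with Python's re module; the HTML snippet and the pattern are made-up examples, not something from a real site:

```python
# -*- coding: utf-8 -*-
# A small matching sketch with Python's re module (Python 2.7).
# The HTML snippet and the pattern are made-up examples.
import re

html = '<a href="http://www.example.com/">Example Site</a>'
# Capture the link target and the link text of an <a> tag.
pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
match = pattern.search(html)
if match:
    print 'URL :', match.group(1)
    print 'Text:', match.group(2)
else:
    print 'The string does not conform to the rule'
```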

4. Crawler framework Scrapy

If you have a solid grasp of Python and have mastered the basic crawling knowledge, then it is time to look for a Python framework. The framework I chose is Scrapy. What is so powerful about this framework? Here is its official introduction:

·  Built-in support for selecting and extracting data from HTML and XML sources

·  Provides a series of reusable filters (i.e. Item Loaders) shared between spiders, with built-in support for intelligently processing crawled data

·  Provides feed exports in multiple formats (JSON, CSV, XML), with built-in support for multiple storage backends (FTP, S3, local filesystem)

·  Provides a media pipeline that can automatically download images (or other resources) found in the crawled data

·  Highly extensible: you can customize functionality using signals and well-designed APIs (middleware, extensions, pipelines)

·  Built-in middleware and extensions support cookies and session handling, HTTP compression, HTTP authentication, HTTP caching, user-agent simulation, robots.txt, and crawl depth restriction

·  Robust encoding detection and support

·  Supports generating spiders from templates, keeping code more consistent across large projects while speeding up spider creation (see the genspider command for details)

·  Provides an extensible stats collection tool for performance evaluation and failure detection across multiple spiders

·  Provides an interactive shell terminal, which is very convenient for testing XPath expressions and for writing and debugging spiders

·  Provides a System service, simplifying deployment and operation in production environments

·  Built-in Web service that lets you monitor and control your bot

·  Built-in Telnet console: by hooking a Python console into the Scrapy process, you can inspect and debug your crawler

·  Logging makes it convenient to catch errors during crawling

·  Supports crawling Sitemaps

·  A caching DNS resolver
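To give a flavor of what a Scrapy spider looks like, here is a minimal sketch; the spider name, start URL, and XPath expression are invented placeholders:

```python
# -*- coding: utf-8 -*-
# A minimal Scrapy spider sketch.
# The spider name, start URL, and XPath are placeholders for illustration only.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Extract the page title with an XPath expression.
        title = response.xpath('//title/text()').extract_first()
        yield {'url': response.url, 'title': title}
```

Saved to a file, a spider like this can be run with the scrapy runspider command; Scrapy takes care of scheduling, downloading, and output.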

 

For proxy selection, I am currently using http://zhimaruanjian.com/, and it works reasonably well.
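If you do need to send requests through a proxy, a minimal urllib2 sketch looks roughly like this; the proxy address below is a made-up placeholder, not a real endpoint:

```python
# -*- coding: utf-8 -*-
# Routing urllib2 requests through an HTTP proxy (Python 2.7).
# '127.0.0.1:8888' is only a placeholder; use the address your proxy provider gives you.
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8888'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # subsequent urllib2.urlopen calls go through the proxy

response = urllib2.urlopen('http://www.example.com/')
print response.read()[:200]
```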

Let's master the basic knowledge first, and then come back to the Scrapy framework!

I have rambled on quite a bit without saying much that is useful, so I will stop rambling here!

Now let's officially begin our crawler journey!

 
