Getting Started with Python Crawlers: An Inventory of Common Crawler Skills

Programming is not easy for any novice, but Python really is a boon for anyone who wants to learn to program. Reading Python code feels like reading an article, because the language provides a very elegant syntax; it is known as one of the most elegant languages.

When I was first getting started with Python, what I used most were all kinds of crawler scripts:

I wrote a script that grabs proxies and verifies them on the local machine, scripts that automatically log in to forums and post automatically, a script that automatically fetches e-mail, and a simple CAPTCHA-recognition script.

These scripts all have one thing in common: they are web-related and always rely on some method of fetching links.

Along the way I accumulated a fair amount of experience in crawling sites, which I am summing up here so that future work will not have to repeat it.

1. Basic web page fetching

GET method

POST method
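The original code samples did not survive extraction, so here is a minimal sketch of both methods using Python 2's urllib/urllib2 (the modules this article relies on later); the URLs and form fields are placeholder assumptions.

```python
# -*- coding: utf-8 -*-
# Minimal GET/POST sketch with urllib/urllib2 (Python 2). URLs and the form
# fields are placeholders, not values from the original article.
import urllib
import urllib2

# GET: simply open the URL and read the response body
get_url = "http://www.example.com/page?id=1"
print urllib2.urlopen(get_url).read()

# POST: url-encode the form data and pass it as the data argument;
# supplying data makes urllib2 send a POST instead of a GET
post_url = "http://www.example.com/login"
form_data = urllib.urlencode({"username": "user", "password": "pass"})
print urllib2.urlopen(post_url, form_data).read()
```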

2. Using a proxy server

This is useful in certain situations, for example when your IP has been blocked, or when the number of requests per IP is limited, and so on.
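A minimal sketch with urllib2's ProxyHandler; the proxy address and URL are placeholder assumptions.

```python
# Route all urllib2 requests through an HTTP proxy (Python 2).
import urllib2

proxy_support = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087"})  # placeholder proxy
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)  # make plain urlopen() use this opener globally

content = urllib2.urlopen("http://www.example.com").read()
```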

3. Cookie handling

Yes, that's right: if you want to use a proxy and cookies at the same time, just add proxy_support and change the opener as follows:
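A minimal sketch of that combined opener, assuming urllib2 and cookielib; the proxy address is again a placeholder.

```python
# Use a proxy and cookies at the same time by chaining both handlers (Python 2).
import cookielib
import urllib2

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
proxy_support = urllib2.ProxyHandler({"http": "http://127.0.0.1:8087"})  # placeholder proxy

# build_opener accepts any number of handlers; install the result globally
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)

content = urllib2.urlopen("http://www.example.com").read()
```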

4. Disguising as a browser

Some sites are averse to being visited by crawlers, so they reject every request that comes from one.

At that point we need to masquerade as a browser, which can be done by modifying the headers of the HTTP request:
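A minimal sketch, assuming urllib2 and an arbitrary desktop-browser User-Agent string.

```python
# Pretend to be a regular browser by sending a browser-like User-Agent (Python 2).
import urllib2

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
}
request = urllib2.Request("http://www.example.com", headers=headers)
content = urllib2.urlopen(request).read()
```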

5. Page parsing

For page parsing, the most powerful tool is of course the regular expression. Regexes differ between users and between sites, so there is no need for much explanation here.

The second option is a parsing library; the two commonly used ones are lxml and BeautifulSoup.

My assessment of these two libraries: both are HTML/XML processing libraries. BeautifulSoup is a pure-Python implementation and therefore less efficient, but it is functional and practical; for example, you can retrieve the HTML source of a node from a search result. lxml is implemented in C, is efficient, and supports XPath.
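For illustration, a small sketch that parses the same made-up HTML snippet with both libraries (the snippet and tag names are assumptions, not from the original article).

```python
# Extract the same link text and href with BeautifulSoup and with lxml/XPath.
from bs4 import BeautifulSoup
from lxml import etree

html = "<html><body><div class='title'><a href='/a'>First</a></div></body></html>"

# BeautifulSoup: pure Python, convenient node searching
soup = BeautifulSoup(html, "html.parser")
link = soup.find("div", class_="title").find("a")
print link.get_text(), link["href"]

# lxml: C implementation, fast, supports XPath
tree = etree.HTML(html)
print tree.xpath("//div[@class='title']/a/text()")
print tree.xpath("//div[@class='title']/a/@href")
```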

6. Handling CAPTCHAs

What do you do when you run into a CAPTCHA? There are two situations here:

Google-style CAPTCHAs: nothing can be done.

Simple CAPTCHAs: the character set is limited and only simple translation or rotation plus noise is applied, with no distortion. These can still be handled. The general idea is to rotate the image back, remove the noise, then segment the individual characters; after segmentation, use a feature-extraction method (e.g. PCA) to reduce dimensionality and build a feature library, and finally compare the CAPTCHA's features against that library.

This gets fairly complex and will not be expanded on here; for the concrete details, get hold of a relevant textbook and study it properly.
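For illustration only, here is a very rough sketch of the pipeline just described (segment, extract features with PCA, compare against a feature library). It assumes numpy and scikit-learn and uses placeholder training data; a real recognizer needs far more care.

```python
# Toy CAPTCHA-matching pipeline: PCA feature library + nearest-neighbour lookup.
import numpy as np
from sklearn.decomposition import PCA

def segment(binary_img, n_chars=4):
    """Naive segmentation step: split a binarized CAPTCHA into equal-width slices."""
    width = binary_img.shape[1] // n_chars
    return [binary_img[:, i * width:(i + 1) * width] for i in range(n_chars)]

# Placeholder "training" data: flattened 20x10 character images with labels.
# In practice these come from manually labelled, already-segmented characters.
samples = np.random.rand(100, 20 * 10)
labels = np.array(list("0123456789") * 10)

pca = PCA(n_components=16)
feature_db = pca.fit_transform(samples)   # the dimensionality-reduced feature library

def recognise(char_img):
    """Project one segmented 20x10 character and return the closest sample's label."""
    feat = pca.transform(char_img.reshape(1, -1))
    distances = np.linalg.norm(feature_db - feat, axis=1)
    return labels[np.argmin(distances)]
```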

7. gzip/deflate support

Web pages nowadays generally support gzip compression, which can often cut transfer time dramatically.

Take the VeryCD homepage as an example: the uncompressed version is 247K, the compressed one 45K, one fifth of the original. That means the page can be fetched roughly five times faster.

However, Python's urllib/urllib2 do not support compression by default. To receive a compressed response you must set 'Accept-Encoding' in the request headers, and after reading the response you also have to check whether the headers contain a 'Content-Encoding' entry to decide whether decoding is needed, which is tedious and fiddly.

So how do we make urllib2 support gzip and deflate automatically? You can subclass the BaseHandler class and handle it via build_opener:
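The exact code from the original article is missing here, so the following is an approximate sketch of such a handler for Python 2, subclassing urllib2.BaseHandler as described.

```python
# Transparently request and decode gzip/deflate responses with urllib2 (Python 2).
import gzip
import zlib
import urllib2
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """Adds Accept-Encoding to requests and decompresses compressed responses."""

    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    def http_response(self, req, resp):
        encoding = resp.headers.get("Content-Encoding")
        if encoding == "gzip":
            data = gzip.GzipFile(fileobj=StringIO(resp.read())).read()
        elif encoding == "deflate":
            raw = resp.read()
            try:
                data = zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
            except zlib.error:
                data = zlib.decompress(raw)                   # zlib-wrapped deflate
        else:
            return resp
        # Wrap the decoded bytes back into a response-like object
        new_resp = urllib2.addinfourl(StringIO(data), resp.headers, resp.url, resp.code)
        new_resp.msg = resp.msg
        return new_resp

opener = urllib2.build_opener(ContentEncodingProcessor)
content = opener.open("http://www.example.com").read()
```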

8. Multithreaded concurrent fetching

If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template; the program only prints the numbers 1-10, but you can see that it runs concurrently.
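A sketch of such a thread pool using Python 2's Queue and threading modules; the worker here only prints its task number.

```python
# Simple worker-pool template: NUM threads consume JOBS tasks from a Queue (Python 2).
from threading import Thread
from Queue import Queue
from time import sleep

q = Queue()   # task queue
NUM = 2       # number of concurrent worker threads
JOBS = 10     # number of tasks

def do_something_using(arguments):
    # In a real crawler this would fetch and process one URL
    print arguments

def working():
    # Worker loop: keep pulling tasks off the queue and processing them
    while True:
        arguments = q.get()
        do_something_using(arguments)
        sleep(1)
        q.task_done()

# Start NUM daemon worker threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Enqueue the tasks (the numbers 1 to 10) and wait for them all to finish
for i in range(1, JOBS + 1):
    q.put(i)
q.join()
```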

Although Python's multithreading is often said to be of limited use, for a network-bound workload like crawling it can still improve efficiency to a certain extent.

9. Summary

Reading code written in Python feels like reading English, which lets you focus on solving the problem rather than on figuring out the language itself.

Although Python is implemented in C, it does away with C's complicated pointers, which makes it concise and easy to learn.

And as open-source software, Python lets you read, copy and even improve its code.

These qualities account for Python's high productivity; as the saying goes, 'Life is short, I use Python.' It is a wonderfully expressive and powerful language.

In short, when you start learning Python, be sure to pay attention to these four points:

1. Code style: this is a very good habit in itself; if you don't build good coding discipline from the start, you will suffer for it later.

2. Write more, read less: many people learn Python only by reading books. This isn't maths or physics, where reading worked examples may be enough; learning Python is mainly about learning how to think as a programmer.

3. Practise diligently: after learning a new concept, make sure you remember how to apply it, otherwise you will forget it as soon as you have learned it. In this line of work, it is mostly about hands-on practice.

4. Learn efficiently: if even you feel your progress is very slow, pause, look for the reason, and ask someone who has been through it why that is.

Recommended reading:

The most detailed source-code tutorial for getting started with Python from scratch

The complete 2019 Python crawler learning roadmap

Why Python holds the top spot among AI languages

Python on the rise: a new high in the TIOBE programming language index!
