Python crawlers: the re module

Disclaimer: This is an original article by the blogger, distributed under the CC 4.0 BY-SA license. When reposting, please include the original source link and this statement.
Original link: https://blog.csdn.net/LXJRQJ/article/details/100651321

Python crawlers (1)

"1" What is the web crawler reptile?
Web crawler (also known as web spider, web robot), is a kind of follow certain rules, automatically grab information on the World Wide Web program or script.

"2" is the basic principle of reptiles:
We use the Internet compared to a large network, our web crawler imagine a connection between the spider web, web page and we understand nodes, reptile is equivalent to access the web page to obtain information about the page, they can crawl through a node on another web site, and then stop by a node that is accessing a page, the data on this site may be we get down.
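
A minimal sketch of this idea, assuming a placeholder start URL and a deliberately simplified link-extraction regex (a real crawler would use an HTML parser):

```python
import re
from urllib import request

def crawl(url):
    """Fetch one page (a node) and return its HTML and the links it contains."""
    html = request.urlopen(url, timeout=10).read().decode('utf-8')
    # A deliberately simple href regex, for illustration only.
    links = re.findall(r'href="(https?://[^"]+)"', html)
    return html, links

# Start at one node; each returned link is a further node to visit.
html, links = crawl('http://example.com')   # placeholder start URL
print(len(html), links[:5])
```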

Classification "3" crawler: crawlers can be divided into general crawler and focused crawler two kinds of

1 General reptiles : common web crawler is an important part Dissatisfied with search engine crawling system (Baidu, Google, Yahoo, etc.). The main purpose of the web page on the Internet is downloaded to the local, mirroring a form of Internet content.
2 focused crawler : focused crawler, a web crawler program "for a specific topic needs", it differs from general search engine crawlers are: focused crawler content on a screening process in the implementation of web crawling, try to ensure that only the catch take pages of information related to requirements.

urllib library

The library covers three aspects:

request: the most basic HTTP request module. It can be used to simulate sending a request, just like entering a URL in the browser and pressing Enter; when calling the library's methods, you only need to pass in the URL and the related parameters.
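
A minimal sketch of sending a request with urllib.request (httpbin.org is used here only as a convenient test endpoint):

```python
from urllib import request

# Send a GET request, just like typing the URL into a browser and pressing Enter.
response = request.urlopen('http://httpbin.org/get', timeout=10)

print(response.status)                    # HTTP status code, e.g. 200
print(response.read().decode('utf-8'))    # the response body as text
```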

error: the exception-handling module. If a request fails with an error, we can use this module to catch the exception and then retry or take other action, ensuring the program does not terminate unexpectedly.
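
A sketch of catching request errors (note that HTTPError is a subclass of URLError, so it must be caught first):

```python
from urllib import request, error

try:
    request.urlopen('http://httpbin.org/status/404', timeout=10)
except error.HTTPError as e:   # the server answered with an error status code
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:    # network failure, bad hostname, timeout, etc.
    print('URLError:', e.reason)
```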

parse: a utility module that provides many methods for handling URLs, such as splitting, parsing, joining, and so on.

Parameter serialization functions (demonstrated in the sketch after this list):

1. parse_qs(): deserializes a URL-encoded query string into a dictionary
2. quote(): converts Chinese (non-ASCII) text into URL-encoded format
3. unquote(): decodes a URL-encoded string
4. urljoin(): given a base link, splices an incomplete link onto it to form a complete link
5. urlparse(): identifies and splits a URL into its components
6. urlunparse(): constructs a URL from its components
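
A quick demonstration of these six functions:

```python
from urllib import parse

# parse_qs(): URL-encoded query string -> dictionary
print(parse.parse_qs('name=tom&age=18'))    # {'name': ['tom'], 'age': ['18']}

# quote() / unquote(): encode and decode non-ASCII characters
encoded = parse.quote('你好')
print(encoded)                              # %E4%BD%A0%E5%A5%BD
print(parse.unquote(encoded))               # 你好

# urljoin(): splice an incomplete link onto a base link
print(parse.urljoin('https://example.com/a/', 'b.html'))
# -> https://example.com/a/b.html

# urlparse() / urlunparse(): split a URL into parts and rebuild it
parts = parse.urlparse('https://example.com/path;params?x=1#frag')
print(parts)
print(parse.urlunparse(parts))              # the original URL again
```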

Regular Expressions

A crawler, in fact, involves four main steps overall:

* Identify the target (know which scope or site you are going to search)
* Crawl (fetch all of the site's content)
* Extract (discard the data that is of no use to us)
* Process the data (store and use it in the way we want)

Common regex matching rules:
Single-character matches (see the demo after this list):

* \d : matches a digit, 0-9
* \D : matches a non-digit, i.e. [^\d] or [^0-9]
* \w : matches a word character, [a-zA-Z0-9_]
* \W : matches a non-word character, [^\w]
* \s : matches a whitespace character (space, \t, ...)
* \S : matches a non-whitespace character, [^\s]
* . : matches any character except the newline '\n'
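
A short demo of these character classes:

```python
import re

text = 'Order 42 shipped_on 2019-09-08'
print(re.findall(r'\d', text))    # each digit: ['4', '2', '2', '0', '1', ...]
print(re.findall(r'\w+', text))   # ['Order', '42', 'shipped_on', '2019', '09', '08']
print(re.findall(r'\s', text))    # the three spaces
print(re.findall(r'.', 'a\nb'))   # ['a', 'b'] -- '.' skips the newline
```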

Multi-character matches (greedy: match as many characters as possible):
* * : matches the preceding expression any number of times (including zero)
* ? : matches the preceding expression 0 or 1 times
* + : matches the preceding expression at least once
* {n,m} : matches the preceding expression n to m times
* {n} : matches the preceding expression exactly n times

Non-greedy matches (match as few characters as possible; contrasted with greedy matching in the demo below):
* *? : zero or more, as few as possible
* ?? : zero or one, preferring zero
* +? : one or more, as few as possible
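
A demo of the difference between greedy and non-greedy matching:

```python
import re

html = '<div>first</div><div>second</div>'
# Greedy: '.*' matches as much as possible, swallowing both divs.
print(re.findall(r'<div>(.*)</div>', html))   # ['first</div><div>second']
# Non-greedy: '.*?' matches as little as possible, one div at a time.
print(re.findall(r'<div>(.*?)</div>', html))  # ['first', 'second']
```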

Other (see the demo after this list):
* | : alternation; matches either the expression on its left or the one on its right
* ( ) : grouping
* ^ : matches the beginning of the string
* $ : matches the end of the string
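
A short demo of alternation, grouping, and anchors:

```python
import re

print(re.findall(r'cat|dog', 'cat dog bird'))          # ['cat', 'dog']
print(re.search(r'(\d+)-(\d+)', 'ip 12-34').groups())  # ('12', '34')
print(re.match(r'^hello', 'hello world'))              # matches at the start
print(re.search(r'world$', 'hello world'))             # matches at the end
```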

re module

1. compile method: compiles the string form of a regular expression into a Pattern object
2. match method: matches from the start of the string; a single match; returns a Match object on success, or None if no match
3. search method: searches the whole string for a match; a single match; returns a Match object on success, or None if no match
4. findall method: matches all occurrences that fit the rule and returns the matched strings in a list; returns an empty list if nothing matches
5. finditer method: matches all occurrences that fit the rule, but returns an iterator of Match objects rather than a list
6. split method: splits the string according to the regex rule and returns the resulting list
7. sub method: replaces the matched substrings at the specified positions

All seven methods are demonstrated in the sketch below.
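
```python
import re

pattern = re.compile(r'\d+')            # compile: string -> Pattern object

print(pattern.match('123abc456'))       # match: only at the start -> matches '123'
print(pattern.search('abc123def'))      # search: anywhere in the string -> matches '123'
print(pattern.findall('a1b22c333'))     # findall: ['1', '22', '333']
for m in pattern.finditer('a1b22'):     # finditer: iterator of Match objects
    print(m.group(), m.span())
print(pattern.split('a1b22c'))          # split: ['a', 'b', 'c']
print(pattern.sub('#', 'a1b22c'))       # sub: 'a#b#c'
```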
