Wu Yuxiong -- Python Study Notes: Web Crawler Basics

1. What is a web crawler
A web crawler (spider) is a program that automatically fetches pages from the Internet and extracts the information we consider valuable.
2. The Python crawler architecture
A Python crawler is typically organized into five parts: a scheduler, a URL manager, a web downloader, a web parser, and an application (which uses the crawled data).

Scheduler: plays the role of the CPU; it coordinates the URL manager, the downloader, and the parser. 
URL manager: tracks the URLs still to be crawled and the URLs already crawled, so the same URL is never fetched repeatedly or in a loop. It is usually implemented in one of three ways: in memory, in a database, or in a cache database. 
Web downloader: given a URL, downloads the page and turns it into a string. Choices include urllib.request (in the Python 3 standard library; urllib2 in Python 2), which supports logins, proxies, and cookies, and requests (a third-party package); a requests sketch appears after the urllib examples below. 
Web parser: parses the page string and extracts the useful information we need; it can also walk the page as a DOM tree. Options include regular expressions (intuitive: treat the page as a string and pull out values by fuzzy matching, but very hard to maintain once the document gets complex), html.parser (ships with Python), Beautiful Soup (a third-party package that can use either the built-in html.parser or lxml as its backend and is somewhat more capable than the others), and lxml (a third-party package that parses both XML and HTML). html.parser, Beautiful Soup, and lxml all parse the page into a DOM tree. 
Application: the program that makes use of the data extracted from the pages.
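To make the division of labor concrete, here is a minimal in-memory sketch that wires the five parts together. The names UrlManager, download, parse, and crawl are illustrative only and do not come from any library.

import urllib.request
from bs4 import BeautifulSoup

class UrlManager:
    # URL manager: tracks URLs to crawl and URLs already crawled (in memory).
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()
    def add(self, url):
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)
    def has_new(self):
        return bool(self.new_urls)
    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

def download(url):
    # Downloader: fetch the page and return its content.
    return urllib.request.urlopen(url).read()

def parse(page):
    # Parser: extract the page title and the absolute links it contains.
    soup = BeautifulSoup(page, "html.parser")
    title = soup.title.get_text() if soup.title else ""
    links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].startswith("http")]
    return title, links

def crawl(seed_url, max_pages=5):
    # Scheduler: drives the loop; the "application" here just collects page titles.
    manager = UrlManager()
    manager.add(seed_url)
    results = []
    while manager.has_new() and len(results) < max_pages:
        url = manager.get()
        title, links = parse(download(url))
        results.append((url, title))
        for link in links:
            manager.add(link)
    return results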
3. Downloading a web page with urllib: three methods

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"
response1 = urllib.request.urlopen(url)
print("第一种方法")
#获取状态码,200表示成功
print(response1.getcode())
#获取网页内容的长度
print(len(response1.read()))
第一种方法
200
156265
print("第二种方法")
request = urllib.request.Request(url)
#模拟Mozilla浏览器进行爬虫
request.add_header("user-agent","Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))
Method 2
200
156328
print("第三种方法")
cookie = http.cookiejar.CookieJar()
#加入urllib.request处理cookie的能力
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
print(len(response3.read()))
print(cookie)
Method 3
200
156488
<CookieJar[<Cookie BAIDUID=CA8C47A224EE898DC34E66D0182C70C3:FG=1 for .baidu.com/>, <Cookie BIDUPSID=CA8C47A224EE898D968EF5993499742B for .baidu.com/>, <Cookie H_PS_PSSID=1446_21123 for .baidu.com/>, <Cookie PSTM=1575029972 for .baidu.com/>, <Cookie delPer=0 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>
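The same page can also be fetched with the third-party requests package mentioned in the architecture section. A minimal sketch, assuming requests has been installed (pip install requests):

import requests

response = requests.get("http://www.baidu.com", headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)   # 200 on success
print(len(response.content))  # length of the raw page body
print(response.cookies)       # cookies set by the server

requests handles cookies and redirects for you, which is why it is often preferred over urllib for everyday crawling.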
4. Installing the third-party library Beautiful Soup
Beautiful Soup is a third-party Python library for extracting data from XML and HTML. Official site: https://www.crummy.com/software/BeautifulSoup/

1. Install Beautiful Soup
pip install bs4
2. Test that the installation succeeded

Write a Python file containing:
import bs4
print(bs4)
<module 'bs4' from 'e:\\python\\lib\\site-packages\\bs4\\__init__.py'>
5. Parsing an HTML document with Beautiful Soup
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Create a BeautifulSoup parse object (html_doc is already a str, so from_encoding is not needed)
soup = BeautifulSoup(html_doc,"html.parser")
# Get all the links
links = soup.find_all('a')
print("All links")
for link in links:
    print(link.name,link['href'],link.get_text())
All links
a http://example.com/elsie Elsie
a http://example.com/lacie Lacie
a http://example.com/tillie Tillie
print("获取特定的URL地址")
link_node = soup.find('a',href="http://example.com/elsie")
print(link_node.name,link_node['href'],link_node['class'],link_node.get_text())
 
print("正则表达式匹配")
link_node = soup.find('a',href=re.compile(r"ti"))
print(link_node.name,link_node['href'],link_node['class'],link_node.get_text())
 
print("获取P段落的文字")
p_node = soup.find('p',class_='story')
print(p_node.name,p_node['class'],p_node.get_text())
Find a link with a specific URL
a http://example.com/elsie ['sister'] Elsie
Regular-expression match
a http://example.com/tillie ['sister'] Tillie
Get the text of the p paragraph
p ['story'] Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
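As mentioned in the architecture section, the same links could also be extracted with a plain regular expression instead of a DOM parser. A rough sketch; the pattern below is ad hoc and only works on simple, well-formed markup like html_doc:

# Pull out href values by string matching; brittle compared with Beautiful Soup.
hrefs = re.findall(r'<a\s+href="([^"]+)"', html_doc)
print(hrefs)  # ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']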

Origin www.cnblogs.com/tszr/p/11960175.html