New to web crawlers and still not sure about the basics?

1. What is a crawler?

Crawler: a program that automatically fetches pages from the Internet and extracts the information we consider valuable.

2. Python crawler architecture

A Python crawler architecture mainly consists of five parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (the valuable data that has been crawled).

Scheduler: equivalent to the CPU of a computer; it is mainly responsible for coordinating the URL manager, the downloader, and the parser.

URL manager: keeps track of the URLs still waiting to be crawled and the URLs that have already been crawled, which prevents crawling the same URL twice and crawling in loops. It is typically implemented in one of three ways: in memory, in a database, or in a cache database (a minimal in-memory sketch follows below).
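As a rough illustration of the in-memory option, here is a minimal sketch; the class and method names (UrlManager, add_new_url, and so on) are hypothetical and not taken from the original article.

class UrlManager(object):
    """In-memory URL manager: one set for pending URLs, one for crawled URLs."""
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # Only accept a URL we have never seen, to avoid duplicates and loops
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move a URL from the pending set to the crawled set and return it
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url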

Web page downloader: given a URL, it downloads the page and turns it into a string. Two common choices are urllib2 (the basic module that ships with Python 2, which also supports logins, proxies, and cookies) and requests (a third-party package).
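For comparison with the urllib2 code in section 3, here is a minimal sketch of the same download step using the requests package mentioned above (assuming requests is installed; this example is not from the original article):

import requests

url = "http://www.baidu.com"
# Pretend to be a normal browser via the User-Agent header
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)   # 200 means success
print(len(response.text))     # length of the page content as a string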

Web page parser: parses the web page string and extracts the information we need, either by string matching or by walking the DOM tree. Common parsers include regular expressions (intuitively, you treat the page as one big string and pull values out by fuzzy matching; once the document gets complex, extracting data this way becomes very difficult), html.parser (which comes with Python), beautifulsoup (a third-party package that can use either the built-in html.parser or lxml as its underlying parser), and lxml (a third-party package that can parse both XML and HTML and is more powerful than html.parser). html.parser, beautifulsoup, and lxml all parse the page as a DOM tree.
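To make the regular-expression approach concrete, here is a tiny sketch; the page snippet and pattern are made up for illustration, and the fragility becomes obvious as soon as the markup gets more complex:

import re

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
# Pull out every href value with a fuzzy string match; brittle on complex pages
links = re.findall(r'href="(.*?)"', html)
print(links)   # ['http://example.com/elsie']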

Application: the program built on top of the useful data extracted from the web pages.

The following diagram explains how the scheduler coordinates the work of the other components.
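In code, the coordination can be pictured roughly as the loop below. This is only a sketch: url_manager, downloader, and parser stand for the hypothetical components described above and are not code from the original article.

def crawl(root_url, url_manager, downloader, parser, max_pages=10):
    """Scheduler loop: repeatedly ask the URL manager for a pending URL,
    download it, parse it, and feed the newly found URLs back in."""
    url_manager.add_new_url(root_url)
    collected = []
    while url_manager.has_new_url() and len(collected) < max_pages:
        url = url_manager.get_new_url()
        html = downloader.download(url)           # fetch the page as a string
        new_urls, data = parser.parse(url, html)  # extract links and data
        for u in new_urls:
            url_manager.add_new_url(u)
        collected.append(data)                    # data handed to the application
    return collected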

3. Three ways to download web pages using urllib2

# _*_ coding: utf-8 _*_
import cookielib
import urllib2

url = "http://www.baidu.com"
response1 = urllib2.urlopen(url)
print "Method 1"
# Get the status code; 200 means success
print response1.getcode()
# Get the length of the page content
print len(response1.read())

print "Method 2"
request = urllib2.Request(url)
# Pretend to be a Mozilla browser while crawling
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

print "Method 3"
cookie = cookielib.CookieJar()
# Give urllib2 the ability to handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print len(response3.read())
print cookie
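Note that urllib2 and cookielib only exist in Python 2. On Python 3 the same functionality lives in urllib.request; a rough equivalent of the first method, for reference only, looks like this:

from urllib import request

url = "http://www.baidu.com"
response = request.urlopen(url)
print(response.getcode())     # 200 means success
print(len(response.read()))   # length of the page content in bytes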

4. Installation of the third-party library Beautiful Soup

Beautiful Soup: a third-party Python package for extracting data from XML and HTML. The official website is https://www.crummy.com/software/BeautifulSoup/

1. Install Beautiful Soup

Open cmd (the command prompt), go to the Scripts folder in the Python installation directory (Python 2.7 here), and type dir to check whether pip.exe is present. If it is, you can use the pip that ships with Python: run pip install beautifulsoup4 to install it.

2. Test whether the installation is successful
Write a Python file and enter

import bs4
print bs4

Run the file. If it outputs normally, the installation is successful.

5. Use Beautiful Soup to parse HTML files

# _*_ coding:utf-8 _*_
import re

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Create a BeautifulSoup parsing object
soup = BeautifulSoup(html_doc, "html.parser", from_encoding="utf-8")
# Get all the links
links = soup.find_all('a')
print "All links"
for link in links:
    print link.name, link['href'], link.get_text()

print "Get the link with a specific URL"
link_node = soup.find('a', href="http://example.com/elsie")
print link_node.name, link_node['href'], link_node['class'], link_node.get_text()

print "Match a link with a regular expression"
link_node = soup.find('a', href=re.compile(r"ti"))
print link_node.name, link_node['href'], link_node['class'], link_node.get_text()

print "Get the text of the p paragraph"
p_node = soup.find('p', class_='story')
print p_node.name, p_node['class'], p_node.get_text()
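As mentioned in section 2, Beautiful Soup can also use lxml as its underlying parser instead of html.parser. A small sketch, assuming lxml is installed:

# Same document, but parsed with the lxml parser instead of html.parser
soup_lxml = BeautifulSoup(html_doc, "lxml")
print soup_lxml.title.get_text()   # The Dormouse's story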
