Python crawler basics Ⅰ - requests, BeautifulSoup: Book Information



# Crawler Basics, Part Ⅰ

## What is a crawler?

A crawler is, essentially, a program that fetches data valuable to us from the Internet.

Crawlers can do a lot of things: business analysis, everyday-life helpers, and so on ...... for the details, look right here → Web crawler (- v -)

Put simply: when we search with a browser, the browser sends a request to the server, and the server returns a response to the browser. A crawler simulates this human operation and sends requests to the target server ...... enough talk. The work of a crawler breaks down into these steps: acquire data -> parse data -> extract data -> store data.

Step 0: Acquire the data. The crawler sends a request to the server based on a URL, and the server returns data.

Step 1: Parse the data. The crawler parses the data the server returned into a format we can read.

Step 2: Extract the data. The crawler then extracts the data we need.

Step 3: Store the data. The crawler saves the useful data for later use and analysis.

Really, enough talk now, here we go~!
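As a quick preview of those four steps, here is a minimal sketch (assuming the book-catalogue site used later in this post, http://books.toscrape.com/, is reachable; the output file name is just an illustration):

```python
import requests
from bs4 import BeautifulSoup

# Step 0: acquire - send a request and receive the response
res = requests.get('http://books.toscrape.com/')

# Step 1: parse - turn the HTML string into a searchable object
soup = BeautifulSoup(res.text, 'html.parser')

# Step 2: extract - pull out the piece of data we care about (here, the page title)
title = soup.find('title').text.strip()

# Step 3: store - save it for later use
with open('title.txt', 'w', encoding='utf-8') as f:
    f.write(title)
```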


requests


1. Install

On a Mac, open the terminal and enter pip3 install requests; on Windows, open the command prompt (cmd) and enter pip install requests.


2. requests.get()

```python
import requests
# Import the requests library
res = requests.get('URL')
# requests.get calls the get() method of the requests library; it sends a request to the server.
# The argument 'URL' is the address where the data you want lives; the server then responds to the request.
# The result of that response is assigned to the variable res.
```

Print type(res) and you will find that res is an object of the requests.models.Response class. Since it is an object, let's look at what attributes and methods it offers for us to work with.
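A quick check (a minimal sketch; any page that responds will do):

```python
import requests

res = requests.get('https://www.baidu.com/')
print(type(res))
# <class 'requests.models.Response'>
```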


3. Common attributes of the Response object

| Attribute | What it does |
| --- | --- |
| response.status_code | The HTTP status code of the response to the request |
| response.content | Returns the content of the Response object as binary data |
| response.text | Returns the content of the Response object as a string |
| response.encoding | The encoding of the Response object |

Let me explain each of them~! If this is all simple for you, feel free to skip ahead.

(1) response.status_code

```python
>>> import requests
>>> res = requests.get('https://www.baidu.com/')
>>> print(res.status_code)
200  # 200 means the request succeeded
>>> print(res.text)
# (output omitted - far too long)
```
Common status codes, explained:

| Status code | Meaning | Example | Explanation |
| --- | --- | --- | --- |
| 1xx | Request received | 100 | Continue with the request |
| 2xx | Request successful | 200 | Success |
| 3xx | Redirect | 305 | Use proxy access |
| 4xx | Client error | 403 | Access forbidden |
| 5xx | Server error | 503 | Service unavailable |

For more detail, see this write-up (compiled by someone else) of the various HTTP status codes for requests and responses: HTTP status codes.
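In practice you often just want to fail fast on a bad status code. requests provides raise_for_status() for exactly that; a minimal sketch:

```python
import requests

res = requests.get('https://www.baidu.com/')
if res.status_code == 200:
    print('Request succeeded')

# Or let requests raise an HTTPError automatically for any 4xx/5xx response:
res.raise_for_status()
```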


(2) response.content

The next attribute is response.content, which returns the content of the Response object as binary data. It is what you want for downloading images, audio, and video; one example and you will get it.

Suppose we want to download an image whose URL is: https://res.pandateacher.com/2018-12-18-10-43-07.png

Then the code can be written as:

```python
import requests

res = requests.get('https://res.pandateacher.com/2018-12-18-10-43-07.png')
# Send the request and store the returned result in the variable res
pic = res.content
# Get the content of the Response object as binary data
photo = open('ppt.jpg', 'wb')
# Create a new file, ppt.jpg. No path is given, so it is saved in the current working directory.
# Image content has to be read and written in binary mode, 'wb'. (open() is plain file handling, nice and simple.)
photo.write(pic)
# Write the binary content of pic into the file
photo.close()
# Close the file
```

And just like that, our image is downloaded successfully~
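A small aside (my suggestion, not part of the original code): a with block closes the file automatically, even if an error occurs mid-write:

```python
import requests

res = requests.get('https://res.pandateacher.com/2018-12-18-10-43-07.png')
with open('ppt.jpg', 'wb') as photo:  # the file is closed automatically when the block exits
    photo.write(res.content)
```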


(3) response.text

Next, look at response.text. This attribute returns the content of the Response object as a string, which is what you want for text, such as downloading a web page's source code.

For example, let's download the first chapter of the novel Romance of the Three Kingdoms:

```python
import requests
# Import the requests library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/sanguo.md')
# Download the first chapter of Romance of the Three Kingdoms; we get an object back and name it res
novel = res.text
# Get the content of the Response object as a string
print(novel[:800])
# Now we can print the novel, but the whole chapter is long, so printing the first 800 characters is enough.
# ([:800] is slicing; it was covered earlier in the notes on lists~!)
```

Eh, why is some of the output garbled?

This is because this super-simple web page declares that its data is encoded as 'utf-8' (the page was designed to tell people that!). When we send a request with requests.get(), we get back a Response object, and the requests module makes its own guess about the encoding of the data. Here, however, it guessed the Response object's encoding as 'gbk'. That does not match the data's actual 'utf-8' encoding, so what gets printed is a pile of gibberish. (More on this later; it mainly concerns encoding types such as utf-8, gbk, gb2312 and the like. Ask Baidu if you want to know more~~ Python also has matching methods for encoding and decoding~)


(4) response.encoding

So let's use encoding to set the encoding to utf-8; then there is no more gibberish:

```python
import requests

res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/sanguo.md')
res.encoding = 'utf-8'
# Define the encoding of the Response object as utf-8
novel = res.text
print(novel[:800])
```

First, the encoding of the target data itself is unknown in advance. After we send a request with requests.get(), we get a Response object, and the requests library makes its own guess about the data's encoding type. But! That guess may or may not be accurate.

If the guess is correct, the content we print with response.text is normal, with no gibberish, and res.encoding is not needed;

If the guess is wrong, there will be a pile of gibberish. In that case we can look up the actual encoding of the target data (usually the charset attribute in the page's source) and then use res.encoding to set an encoding consistent with the target data.

In practice, only consider using res.encoding when the text comes out garbled.
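requests also exposes an apparent_encoding attribute, which guesses the encoding from the response body itself; as a sketch, it can serve as a fallback when the text looks garbled:

```python
import requests

res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/sanguo.md')
print(res.encoding)           # the encoding requests guessed from the headers
print(res.apparent_encoding)  # the encoding guessed from the body content itself
res.encoding = res.apparent_encoding  # adopt the body-based guess before reading res.text
print(res.text[:100])
```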


Robots protocol

Typically, a server won't care much about small crawlers, but it will reject high-frequency, large-scale crawlers and malicious crawlers, because those put great pressure on the server or even harm it.

However, under normal circumstances servers take a welcoming attitude toward search engines (as mentioned, crawling is one of the core technologies of both Google and Baidu). Of course, the welcome is conditional, and the conditions are written down in the Robots protocol.

The Robots protocol is the Internet crawler community's accepted code of ethics. Its full name is the "Robots exclusion protocol", and it is used to tell crawlers which pages may be crawled and which may not.

To view a website's robots protocol, just append /robots.txt to the site's domain name.

For example, Taobao's: http://www.taobao.com/robots.txt

The words that appear most often in the protocol are Allow and Disallow: Allow means access is permitted, Disallow means access is forbidden. Interestingly, Taobao restricts Baidu's crawler on product pages but allows Google's crawler to access them.

When you crawl a site's data, don't forget to check whether the site's Robots protocol allows you to crawl it.

At the same time, limiting your crawler's speed is only right: be grateful to the server that provides the data, avoid putting too much pressure on it, and help keep good order on the Internet. That is what we should do.
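Python's standard library can check a robots.txt for you. A minimal sketch with urllib.robotparser (the user agent string here is just an illustrative assumption):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.taobao.com/robots.txt')
rp.read()  # download and parse the robots.txt file

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch('MyLittleCrawler', 'http://www.taobao.com/'))
```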


The HTML you need to know

1. Viewing a web page's HTML code

Open a web page (any page will do), press F12 (or Fn+F12), and the browser pops up a panel: that is the page's HTML source code. (Ignore the empty red box in the screenshot; I was too lazy to redo the image.)

[Screenshot: the browser's developer tools showing the HTML source]

The letters enclosed in angle brackets <> are called [tags]. Tags usually come in pairs: the one in front is the opening tag, the one behind is the closing tag. Some tags appear alone, though, for example <meta charset="utf-8"> (which defines the page's encoding as utf-8). Opening tag + closing tag + everything in between together form an [element].

2. The simplest HTML document

Open Notepad, enter the following code, and save it with the .html extension: that is one of the simplest possible HTML documents.

```html
<html>
    <head>
        <meta charset="utf-8">
    </head>
    <body>
        <h1>I am a level-1 heading</h1>
        <h2>I am a level-2 heading</h2>
        <h3>I am a level-3 heading</h3>
        <p>I am a paragraph. The level-1 heading, the level-2 heading, and I,
           the three of us together make up the body.</p>
    </body>
</html>
```

3. HTML attributes

Inside a tag, attributes are written as name="value", i.e. attribute name = attribute value. A few common attributes as examples:

Links are generally defined with the <a> tag; its href attribute specifies the URL the link points to.

The class attribute marks a series of elements, while the id attribute marks a unique element.

emm, that seems like enough; I won't go over more of it here la.


BeautifulSoup

1. Acquiring and parsing data

Since BeautifulSoup is not part of the Python standard library, it has to be installed separately. Enter this line in the terminal and run it: pip install BeautifulSoup4. (Mac users enter pip3 install BeautifulSoup4.) Installation won't be covered again after this; the procedure is always the same!!

Obviously, what we use BeautifulSoup for is parsing web pages and extracting their data.

I think it's better to go straight to the code. Come on, come on, you're smart, you'll understand it from the comments~

Take the site http://books.toscrape.com/ as our example; first, let's crawl the book catalogue:

```python
import requests
from bs4 import BeautifulSoup  # import the modules

res = requests.get('http://books.toscrape.com/')  # send the request

print(res.status_code)  # check that the request got a proper response

html = res.text  # get the content of the Response object as a string

soup = BeautifulSoup(html, 'html.parser')  # parse the page into a BeautifulSoup object (BS object for short)
# The parentheses take two arguments. Argument 0 is the text to be parsed; careful, it must, must, must be a string.
# Argument 1 identifies the parser; we use a built-in Python one: html.parser.
# (It is not the only parser, but it is a simple one.)

# At this point, jump down and read point 2 below first
items = soup.find('ul', class_='nav nav-list')
cate = items.find('ul')
cates = cate.find_all('li')
for i in cates:
    print(i.text.strip())
```

2. The find() and find_all() methods

After opening that page, right-click - Inspect (or press F12), then follow the steps shown in the image below:

[Screenshot: opening the developer tools and picking the element]

The developer tools panel on the right then jumps to the corresponding position in the HTML source, and you can see it is the <div class="side_categories"> on the right. Open the <ul> tag below it and you will see many <li> tags; open one of those <li> tags and you will find its text is one entry of the book catalogue:

[Screenshot: the <li> tags inside the side_categories <div>]

The BeautifulSoup object has two methods that match HTML tags and attributes and extract every piece of matching data from the BS object. Their usage is basically the same; the difference is that find() extracts only the first match, while find_all() extracts all matches and returns them as a list. Usage:

find('tag name', class_='attribute value', id='attribute value', ...) (other attributes, such as style, can be used too); find_all() works the same way. Note: you can locate by tag alone or by attributes alone; one or several are both fine. Also remember the underscore after class (class_), since class by itself is a reserved word in Python.

Then we can start locating with them! A rough sketch of the rules first: after locating an outer tag you can keep locating tags inside it, but attribute matching only matches the attributes of the current tag. A picture says it best!

[Screenshot: locating an outer tag, then an inner tag]

But note: whichever tag you want to locate, match it using that tag's own attributes.

(If that's unclear, try it a few times yourself and you'll get it. Keep going ヾ(◍°∇°◍)ノ゙)

```python
items = soup.find('ul', class_='nav nav-list')
# Since we want the whole book catalogue, we can first locate 'ul', the outer tag of all the
# catalogue 'li' tags, and extract the data we want via the 'ul' tag plus its attribute.
# (The class_ attribute is added because there are other 'ul' tags earlier in the page.
# find() is used here, so only the first match gets located; adding the attribute makes
# the location more precise and avoids matching an earlier 'ul' tag.)
# With this many tags you can't flip through them one by one: press Ctrl+F in the developer
# tools and search for what you want; it gets highlighted, as in the image I posted below.
# Oh, and you can print the type here: it is a 'Tag' object.

cate = items.find('ul')
# Next, keep going and locate the inner 'ul' tag.

cates = cate.find_all('li')
# find_all() is used here; it returns a list of all the 'Tag' objects matching the 'li' tag.

# OK, at this point jump down and read point 3 below.
for i in cates:
    print(i.text.strip())
```

[Screenshot: Ctrl+F search highlighting in the developer tools]

3. Common methods of Tag objects

| Method / attribute | What it does |
| --- | --- |
| find() and find_all() | Tag objects have these methods too, just like BS objects. That is why, after locating an outer tag from the BS object (getting a Tag object), you can keep locating inner tags (extracting a Tag from within a Tag still gives a Tag object). |
| text | Extracts the text inside a Tag; the return type is str. E.g. Tag.text. |
| ['attribute name'] | Given an attribute name, extracts that attribute's value from the Tag; the return type is str. E.g. Tag['attribute name'] gives the attribute's value. (Note: only attributes of the current tag can be extracted.) |

Extra: str.strip()

This method removes the specified characters or character sequence (whitespace and newlines by default) from both ends of a string.

Note: it can only remove characters at the beginning or end, not in the middle.

```python
# Iterate over the cates list (each element is a Tag object, i.e. one li element)
for i in cates:
    print(i.text.strip())
    # Print the text inside the li; there is plenty of whitespace around it, so strip() removes it.
```
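The ['attribute name'] access didn't get an example above, so here is a hedged sketch: each li in this catalogue contains an <a> tag, and (assuming the page structure described earlier) we can pull the link target out of it:

```python
for i in cates:
    link = i.find('a')        # each li contains an <a> tag
    print(link.text.strip())  # the category name
    print(link['href'])       # the value of the href attribute, as a str
```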

Part of the output:

[Screenshot: part of the output - the list of book categories]

Simple or what (/≧▽≦)/

I feel this write-up is somehow both too detailed and not detailed enough. Wherever it's unclear, try things out in your local interpreter; if you still don't get it, make good use of search, or ask me and I'll do my best to answer~! You can also ask me for a clear, detailed flowchart. Why not just post it here? Ahem, I don't know 0.0

From here on I won't walk through things step by step like this; there may be a few brief comments, or none at all~!

Here's an exercise:

Task: crawl the book titles, prices, and ratings from the online bookstore Books to Scrape, and print the extracted information.
Watch out for: how to get the complete book title, and how to get the rating.
Part of the output looks roughly like this:

[Screenshot: sample output with book titles, prices, and ratings]
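If you get stuck, here is one possible sketch (my own attempt, not the author's official answer; it assumes the usual structure of books.toscrape.com, where the full title sits in the <a> tag's title attribute and the rating is encoded in the class of <p class="star-rating ...">):

```python
import requests
from bs4 import BeautifulSoup

res = requests.get('http://books.toscrape.com/')
soup = BeautifulSoup(res.text, 'html.parser')

books = soup.find_all('article', class_='product_pod')  # one <article> per book
for book in books:
    title = book.find('h3').find('a')['title']  # the full title lives in the title attribute
    price = book.find('p', class_='price_color').text
    rating = book.find('p', class_='star-rating')['class'][1]  # class is a list, e.g. ['star-rating', 'Three']
    print(title, price, rating)
```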



————Everyone complains that life is hard, yet everyone quietly keeps working hard for it————
