Python web crawler notes (Part 4) - requests and BeautifulSoup

I. requests

requests is an HTTP request library, used to fetch pages and their contents.

First, remember to import the library. This is a third-party library that doesn't ship with Python; if you don't have it yet, you can check my previous post for a tutorial on installing third-party libraries.
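
(For reference, the usual install commands would be pip install requests and, for the parser used later, pip install beautifulsoup4.)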

import requests

Let's go over a few of the most common functions.
1> Sending a request

import requests

url = 'https://www.163.com'
resp = requests.get(url)

In fact, get() takes many parameters, much like a constructor; here we mainly use two of them: url and headers.
url : if you've studied computer networks you'll know exactly what this is; if not, it's (roughly speaking) the address of the site you want to crawl, i.e. what you see in the browser's address bar.
headers : the request headers. Some sites have anti-scraping measures, and adding headers makes the request look more like a real browser. More on this later.
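
For example, here is a minimal sketch of passing custom headers; the User-Agent string is just an illustration, not a value the site requires:

import requests

url = 'https://www.163.com'
# Pretend to be a regular browser; the exact string is only an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
resp = requests.get(url, headers=headers)
print(resp.status_code)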

2> Status code

import requests

url = 'https://www.163.com'
resp = requests.get(url)
print(resp.status_code)
# >>200

This retrieves the response status code; 200 means the request succeeded.
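
A small sketch of acting on the status code before using the response; raise_for_status() is the built-in shortcut that raises an exception on 4xx/5xx responses:

import requests

url = 'https://www.163.com'
resp = requests.get(url)

# Only proceed when the request succeeded.
if resp.status_code == 200:
    print('OK, got the page')
else:
    print('Request failed with status', resp.status_code)

# Equivalent guard: raises requests.HTTPError for 4xx/5xx codes.
resp.raise_for_status()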

3> Displaying the page as text

import requests

url = 'https://www.163.com'
resp = requests.get(url)
print(resp.status_code)
print(resp.text)

As you can see, this prints out the contents of the NetEase homepage. It is actually the page's HTML source; I use Chrome, where you can view it in the browser by pressing F12.
4> Getting the raw source

import requests

url = 'https://www.163.com'
resp = requests.get(url)
print(resp.status_code)
print(resp.text)
html = resp.content

The previous example used text, which returns the page as decoded text; here we use content, which returns the raw bytes, and store the result in the variable html.
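
A quick sketch of the difference between the two attributes; the explicit decode at the end is just one option if the auto-detected encoding turns out wrong:

import requests

url = 'https://www.163.com'
resp = requests.get(url)

print(type(resp.text))     # <class 'str'>: body decoded with the detected encoding
print(type(resp.content))  # <class 'bytes'>: the raw response body

# Decode the bytes yourself if the detected encoding is wrong.
html = resp.content.decode('utf-8', errors='replace')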

Those are the basic operations of requests. Next, let's start parsing the source code.

II. BeautifulSoup

As before, import the library first:

from bs4 import BeautifulSoup as bes

1. First, we need to parse html into a format that bs4 can handle

from bs4 import BeautifulSoup as bes
soup = bes(html, 'html.parser')  # convert into a tree that bs4 can navigate
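
Putting the two libraries together, a minimal end-to-end sketch looks like this:

import requests
from bs4 import BeautifulSoup as bes

url = 'https://www.163.com'
resp = requests.get(url)
html = resp.content                  # raw bytes, as in the requests section above

soup = bes(html, 'html.parser')      # parse into a navigable tree
print(soup.title)                    # quick sanity check: the <title> tag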

2. Locating elements

There are many ways to locate elements; here I'll introduce a few that are simple, easy to use, and common.

Method 1: soup.find(tag name)
This method returns the first Tag (i.e., HTML tag) that matches.


title = soup.find('title')
print(title)


Method 2: soup.find_all(tag name)
This one returns all matching tags, collected in a list.

For example, say I want to find all tags named em. Then:

em = soup.find_all('em')

for one in em:
    print(one)

Result: each matching em tag is printed in turn.
Method 3: CSS selectors
bs4 also supports CSS selectors (I won't go into what CSS is here); simply put, they locate tags by a pattern.

1> Searching by tag name (no special marker)
The simplest case first: select() can also search by tag name. Nothing needs to be prefixed to the name.

t = soup.select('title')
print(t)


2> Locating elements by class (marked by a dot (.))
Now say we want to extract the "网易新闻" (NetEase News) Tag.

em1 = soup.select('.ntes-nav-app-newsapp')
print('em1:', em1)
print(type(em1))

As you can see, select() returns a list-like ResultSet, which means it finds all the Tags that satisfy the condition.
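
A short sketch of pulling data out of that ResultSet, assuming the matched element is the NetEase News link mentioned above:

em1 = soup.select('.ntes-nav-app-newsapp')
if em1:                        # the ResultSet behaves like a list
    first = em1[0]
    print(first.get_text())    # the text inside the tag
    print(first.get('href'))   # an attribute value, or None if absent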

3> Locating by the id attribute (marked by a hash (#))
Likewise, we can locate elements by their id.


em2 = soup.select('#js_N_nav')
print('em2:', em2)

4> Searching by attribute (marked by [ ])
You can also search by an attribute inside a Tag.

em3 = soup.select('a[href="https://news.163.com"]')
print('em3:', em3)

This finds the <a> tags whose href attribute equals the given value.
5> Nested selectors

An HTML page's source is really a tree structure, so we can drill down and locate elements level by level.
By analyzing the hierarchy, we can extract the Beijing top-news (要闻) headlines:

a = soup.select('div.yaowen_news > div.news_bj_yw > ul > li > a')
for one in a:
    print(one, '\n')

Emmm, everyone please take care of your health; this pneumonia outbreak is pretty serious.

Carefully analyze the source first and find the patterns, then combine the methods above, and you'll get exactly the tags you want.

3. Extracting the content

As you can see, the results still contain a lot of clutter; all we really want is the text inside. This is where get_text() comes in: it retrieves a tag's text content.

a = soup.select('div.yaowen_news > div.news_bj_yw > ul > li > a')
c_list = []
for one in a:
    c_list.append(one.get_text())
for one in c_list:
    print(one)

Now the news headlines are stored in c_list.
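
A related sketch: besides the text, you often want the link itself, which lives in the tag's href attribute:

a = soup.select('div.yaowen_news > div.news_bj_yw > ul > li > a')
for one in a:
    title = one.get_text()
    link = one.get('href')   # attribute access; None if the tag has no href
    print(title, '->', link)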

Done!

Honestly, I think crawling is 30% technique and 70% observation: finding the common patterns among the differences and accurately grabbing the information you want is the real difficulty.
