I. requests
requests is an HTTP library, used to fetch web pages.
First, remember to import the library. It's a third-party library that Python doesn't ship with; if you don't have it yet, see my earlier tutorial on installing third-party libraries.
import requests
Let's go over a few common functions.
1> Sending a request
import requests
url = 'https://www.163.com'
resp = requests.get(url)
In fact, get() takes quite a few arguments; here we mainly use two of them: url and headers.
url : if you've studied computer networks you'll know exactly what this means; if you haven't... it's, roughly speaking, the address of the site you want to crawl (not precise, but it's what goes in the browser's address bar).
headers : the request headers. Sometimes a site has anti-crawling measures, and adding headers makes the request look more like a real browser. More on this later.
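For example, many sites check the User-Agent header. A minimal sketch (the UA string below is just an example, and the request is only prepared rather than actually sent, so it runs offline):

```python
import requests

url = 'https://www.163.com'
# A browser-like User-Agent string; the exact value is just an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/79.0.3945.130 Safari/537.36'
}

# In a real crawl you would simply send it:
#   resp = requests.get(url, headers=headers)
# Here we only prepare the request to inspect what would be sent:
req = requests.Request('GET', url, headers=headers).prepare()
print(req.headers['User-Agent'])
```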
2> Status code
import requests
url = 'https://www.163.com'
resp = requests.get(url)
print(resp.status_code)
# >>200
This retrieves the response status code; 200 means the request succeeded.
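The common code ranges are worth knowing: 2xx means success, 3xx redirection, 4xx a client error (e.g. 404 not found), 5xx a server error. requests also exposes named codes and a raise_for_status() helper; a small sketch:

```python
import requests

# requests names the common status codes, so comparisons read nicely:
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404

# Typical pattern after resp = requests.get(url):
#   if resp.status_code == requests.codes.ok:
#       ...  # parse resp.text
#   resp.raise_for_status()  # raises HTTPError on 4xx/5xx responses
```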
3> Reading the response as text
import requests
url = 'https://www.163.com'
resp = requests.get(url)
print(resp.status_code)
print(resp.text)
As you can see, this crawls out the content of the NetEase homepage. It's actually the page source; I use Chrome, where you can view it in the browser by pressing F12.
4> Reading the response as bytes
import requests
url = 'https://www.163.com'
resp = requests.get(url)
print(resp.status_code)
print(resp.text)
html = resp.content
The previous example used text, which gives the page as a decoded string; here we use the content attribute instead, which gives the raw bytes, and store the result in the variable html.
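To be precise about the difference: text is a decoded str (requests guesses the encoding), while content is the raw bytes. The relationship, sketched without any network access:

```python
# What resp.content would hold: raw bytes from the wire.
raw = '网易新闻'.encode('utf-8')
# Roughly what resp.text gives you: the bytes decoded into a str
# (requests picks the encoding for you).
decoded = raw.decode('utf-8')
print(type(raw), type(decoded))
```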
Those are the basic operations of requests. Now let's start parsing the source.
II. BeautifulSoup
As before, import the library first:
from bs4 import BeautifulSoup as bes
1, First we need to parse html into a format bs4 can work with
from bs4 import BeautifulSoup as bes
soup = bes(html, 'html.parser')  # convert to a format bs4 can parse
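By the way, if you want to experiment without fetching a real page, you can hand the parser a literal HTML string; the snippet below is made up purely for demonstration:

```python
from bs4 import BeautifulSoup as bes

# A tiny hand-written HTML snippet, purely for demonstration.
html = ('<html><head><title>demo</title></head>'
        '<body><em>one</em><em>two</em></body></html>')
soup = bes(html, 'html.parser')
print(soup.title)  # the soup can be searched just like a real page
```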
2, Locating elements
There are many ways to locate elements; here I'll introduce a few that are simple, easy to use, and common.
Mode 1: soup.find(tag name)
This method returns the first Tag that matches (a Tag is just a tag object).
title = soup.find('title')
print(title)
Mode 2: soup.find_all(tag name)
This one returns all the matching tags, as a list.
For example, suppose I want to find every tag named em.
Then:
em = soup.find_all('em')
for one in em:
    print(one)
Mode 3: CSS selectors
bs4 also supports CSS selectors (I won't go into what CSS is here); simply put, they locate tags by a pattern rule.
1> Find by tag name (no marker character)
The simplest case first: select can also search by tag name, with nothing added in front.
t = soup.select('title')
print(t)
2> Locating elements by class (the marker is a dot (.))
Now suppose we want to extract the "NetEase News" Tag:
em1 = soup.select('.ntes-nav-app-newsapp')
print('em1:', em1)
print(type(em1))
As you can see, select returns a list (a ResultSet), meaning it finds all the Tags that satisfy the condition.
3> Locating by the id attribute (the marker is a hash (#))
Similarly, we can also locate elements by id.
em2 = soup.select('#js_N_nav')
print('em2:', em2)
4> Finding by attribute (the marker is [ ])
We can also search by an attribute inside the Tag:
em3 = soup.select('a[href="https://news.163.com"]')
print('em3:', em3)
This finds the <a> tags whose href attribute equals the given value.
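Attribute selectors also support partial matches; these are standard CSS operators, shown here on a made-up snippet:

```python
from bs4 import BeautifulSoup as bes

# Made-up links, just to demonstrate attribute selectors.
html = ('<a href="https://news.163.com">news</a>'
        '<a href="https://music.163.com">music</a>')
soup = bes(html, 'html.parser')
print(soup.select('a[href="https://news.163.com"]'))  # exact value
print(soup.select('a[href^="https://"]'))             # value starts with
print(soup.select('a[href$=".163.com"]'))             # value ends with
```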
5> Nested selectors
HTML source is really a tree structure, so we can drill down and locate elements level by level.
By analyzing the hierarchy, we extract the Beijing headline news:
a = soup.select('div.yaowen_news > div.news_bj_yw > ul > li > a')
for one in a:
    print(one, '\n')
Emmm, everyone please take care of yourselves; this pneumonia outbreak is quite serious.
First analyze the source carefully and find the pattern, then combine the methods above, and you'll get exactly the tags you want.
3, Extracting content
As you can see, the results we got contain a lot of clutter, and all we actually want is the text inside. This is where get_text() comes in: it extracts a tag's text content.
a = soup.select('div.yaowen_news > div.news_bj_yw > ul > li > a')
c_list = []
for one in a:
    c_list.append(one.get_text())
for one in c_list:
    print(one)
We've now stored the news headlines in c_list.
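One more handy detail: get_text() takes optional arguments, for example strip=True to trim surrounding whitespace. A small demonstration on a made-up tag:

```python
from bs4 import BeautifulSoup as bes

# Made-up snippet, just to show the get_text options.
html = '<li><a>  Headline one  </a></li>'
soup = bes(html, 'html.parser')
a = soup.find('a')
print(repr(a.get_text()))            # whitespace kept
print(repr(a.get_text(strip=True)))  # whitespace stripped
```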
Done!
Honestly, I think web scraping is three parts technique and seven parts observation: finding what's common among the differences, and accurately grabbing exactly the information you want, is the hard part.