Web crawlers - how to crawl static websites

Crawling a static website is divided into two parts:

  1. Crawling text from a static website
  2. Crawling images from a static website

[TOC]

Crawling text

Approach

  1. Use the requests module to fetch the website's HTML
  2. Use the BeautifulSoup module to parse the HTML text
  3. Use the find or find_all function to extract what you want from the parsed text
  4. Use replace to remove unwanted characters

Source


# This script can handle Chinese text
# target site: http://www.hbrchina.org/

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    req = requests.get('http://www.hbrchina.org/2019-02-18/7150.html')
    # fetch the page's HTML
    req.encoding = req.apparent_encoding
    # adopt the page's detected encoding so the Chinese text decodes correctly
    html = req.text
    bf = BeautifulSoup(html, 'html.parser')
    # parse the HTML into a navigable document (I think of it as a structured text file)
    body = bf.body
    texts = body.find_all('div', {'class': 'article-content'})
    # use find_all to find every div whose class is article-content
    print(texts[0].text.replace('\xa0' * 8, '\n\n'))
    # extract the text, replacing each run of non-breaking spaces with a paragraph break
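The trickiest line is the final replace: '\xa0' is the non-breaking space that HTML renders for &nbsp;, and the script assumes this site pads paragraph breaks with runs of eight of them, so swapping each run for two newlines restores the paragraphs. A minimal sketch of the idea (the sample string is made up):

sample = 'First paragraph.' + '\xa0' * 8 + 'Second paragraph.'
print(sample.replace('\xa0' * 8, '\n\n'))
# prints the two sentences separated by a blank line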

Running the full script, we can see the extracted article text.

Crawling images

Approach

  1. Get the website's HTML with the requests module
  2. Parse the HTML text with BeautifulSoup
  3. Extract what you want from the parsed text with a find function, e.g. the img keyword
  4. Use urllib to download the images
  5. Use a for statement to download each image in turn

Source



import requests
import urllib.request
# urllib.request must be imported explicitly for urlretrieve
from bs4 import BeautifulSoup

if __name__ == '__main__':
    rep = requests.get('https://darerd.github.io/2019/03/21/%E9%9A%8F%E6%83%B3-%E6%96%B0%E9%9B%B6%E5%94%AE%E4%BC%81%E4%B8%9A%E2%80%9C%E2%80%9C%E6%99%BA%E8%83%9C%E2%80%9D%E6%9C%AA%E6%9D%A5/')
    rep.encoding = rep.apparent_encoding
    html = rep.text
    bs = BeautifulSoup(html, 'html.parser')
    img = bs.find_all('img')
    # find every result carrying an img tag in the parsed document
    x = 1
    for i in img:
        # use a for statement to visit each image
        imgsrc = i.get('src')
        # src holds the image's download link
        print('Downloading: %d' % x)
        urllib.request.urlretrieve(imgsrc, './%s.jpg' % x)
        # use the urllib module to download the image
        x = x + 1
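One caveat: the script above assumes every src attribute is an absolute URL. Many pages use relative paths, which would make urlretrieve fail, so a guard with urllib.parse.urljoin can help. A minimal sketch (the base URL and paths here are made up for illustration):

from urllib.parse import urljoin

base = 'https://darerd.github.io/'  # assumed base: the page being crawled
for src in ['/images/a.jpg', 'https://upload.hbrchina.org/b.jpg']:  # made-up example paths
    print(urljoin(base, src))
    # relative paths are resolved against the base; absolute URLs pass through unchanged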

To write a crawler, you must be able to read a page's source code.

Take crawling images as an example:

This is the site we want to crawl: [https://darerd.github.io/2019/03/21/随想-新零售企业““智胜”未来/]

After opening the site (I use the Chrome browser), press the F12 keyboard shortcut to open the browser's developer tools. It looks like this:

The panel on the right shows the site's source code, which is what we crawl.

If you need to quickly locate the code for a particular part of the page, right-click it and choose Inspect, as shown below:

For example, to quickly locate the code for a particular image:

Inspecting the source of each image this way, we find they are all written very similarly, like this:

<img src="http://upload.hbrchina.org/2019/0213/1550028457604.jpg" alt="1550028961(1)">

src is the image's download URL and alt is its label; every image sits inside an img tag.

So as long as we collect all the img tags and then extract every src link from them, we can download the images.
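As a quick illustration, feeding just that one tag to BeautifulSoup shows how the two attributes are read (a minimal sketch):

from bs4 import BeautifulSoup

tag = '<img src="http://upload.hbrchina.org/2019/0213/1550028457604.jpg" alt="1550028961(1)">'
img = BeautifulSoup(tag, 'html.parser').find('img')
print(img.get('src'))  # the download URL
print(img.get('alt'))  # the label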

Every crawler works in a similar way: find the distinguishing features of the part you want to crawl, then call the appropriate modules.

For beginners, the difficulty lies in figuring out how to find those distinguishing features.


That's all.
