Python simple crawler, installment three!

Lecture three:

  Today we introduce a new module, Beautiful Soup!

  This module can do some almost magical things. But before any of that, it has to be installed properly; only then can it show off what it can do.

  Install as follows:

  Open cmd (the Command Prompt) on Windows 10

  Type pip install beautifulsoup4

  That takes care of installation. We won't dwell on the details here; let's focus on its amazing features!
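A quick way to confirm the installation worked is simply to import the package (this assumes the pip command above finished without errors):

```python
# If this import succeeds, beautifulsoup4 is installed correctly
import bs4

print(bs4.__version__)
```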

 

''' Add new module '''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    print(getHTML_Text(url)[:5000])

 

 

  This is essentially the code from the previous lecture, with a small modification. Run it:

 

 

 

  We get back a long string which, as mentioned before, contains the information we need (refer to Example 1 in the second lecture). What we have to do today is extract it, and that is exactly where the magic of the BeautifulSoup module comes in.

  Without further ado, let's first go over some basic uses of BeautifulSoup.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

 

  Here '<p>data</p>' stands for the string to be parsed; that is, as we mentioned in the last two lectures, the text content of the crawled web page.

  'html.parser' names the parser used to parse HTML documents.

  The code above builds a tag tree corresponding to the entire content of the HTML document; once the tree is built, finding the content we want becomes much more convenient.
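A quick way to see the tag tree in action, assuming only that beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

# Build a tag tree from a one-tag document
soup = BeautifulSoup('<p>data</p>', 'html.parser')

print(soup.p)         # the whole <p> tag
print(soup.p.string)  # just the text inside it
```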

  Basic elements of the BeautifulSoup class:

Tag: a tag, the most basic unit of information organization; <> and </> mark its beginning and end
Name: the name of the tag; accessed as <tag>.name
Attributes: the attributes of the tag, organized as a dictionary; accessed as <tag>.attrs
NavigableString: the non-attribute string inside a tag, i.e. the text between <>...</>; accessed as <tag>.string
Comment: the comment portion of a string inside a tag
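These elements can be tried out on a small inline document, with no network access needed (the HTML string here is made up purely for illustration):

```python
from bs4 import BeautifulSoup, Comment

html = '<a href="https://example.com" class="link">text</a><b><!--hidden--></b>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.a
print(tag.name)    # the tag's name: a
print(tag.attrs)   # the attributes as a dict; note class is multi-valued
print(tag.string)  # the NavigableString inside the tag

# When a tag's only content is a comment, .string is a Comment object
print(type(soup.b.string) is Comment)
print(soup.b.string)
```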

  Let's show them one by one:

''' BeautifulSoup demo '''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    title = soup.title
    tag = soup.a
    print(title.string)
    print(str(tag.name))
    print(str(tag.attrs))
    print(str(tag.string))
    print(str(tag.comment))  # note: .comment is not a real Tag attribute, so this prints None

 

  The output: first the content of the page's title tag is printed; then, after grabbing the first <a> tag, its name, attributes, string content, and comment part are printed in turn.

 

  Those are the basic uses. There are more interesting functions you can discover on your own; they are only touched on briefly here.

  The prettify() method is also worth introducing here, as it makes the output much friendlier to read.

  For example:

'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    print(soup.prettify()[:5000])

 

 

  The result is a much friendlier text layout; printing the raw string directly gives a dense wall of characters that is genuinely painful to read.
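To see the effect without relying on the original screenshot, here is a minimal offline sketch (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p>data</p></body></html>', 'html.parser')
print(soup.prettify())
```

With 'html.parser', prettify() puts each tag on its own line and indents each nesting level by one space, printing something like <html> at column zero, then <body>, <p>, and the text data each one step further in.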

  Finally, a small demonstration of one of BeautifulSoup's most useful methods; the details will be covered in the next lecture. Look at the code!

'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))

 

 

  We have now grabbed the URL links of all the <a> tags. Pretty powerful, right?
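The same find_all pattern works on any HTML string; here is a self-contained sketch with made-up URLs:

```python
from bs4 import BeautifulSoup

# A made-up page with three links, one of them missing an href
html = ('<a href="https://a.example/">one</a>'
        '<a href="https://b.example/">two</a>'
        '<a>no link</a>')
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every <a> tag; .get('href') returns None when absent
for link in soup.find_all('a'):
    print(link.get('href'))
```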

  What is this good for, you ask? Aha, what indeed; we'll continue with that in the next lecture.

  That's all for today. Thank you.
