Getting Started with Python Web Crawling - Information Organization and Extraction Methods (2)

1. General methods of information extraction

  Information extraction means pulling the content we care about out of marked-up information. As mentioned in the previous chapter, information markup comes in three forms: XML, JSON, and YAML.
In general there are several approaches:
Method One: fully parse the markup, then extract the key information. Formats such as XML, JSON, and YAML each require a matching parser; for example, the bs4 library traverses the tag tree, and whatever information you need can be found by walking that tree.
Advantages: accurate extraction. Disadvantages: the extraction process is cumbersome and slow.

Method Two: ignore the markup entirely and search directly for the key information, just like searching for a keyword in a Word document: we do not care about the form or format of the headings, we only need a text-search function.
Advantages: the extraction process is simple and fast. Disadvantages: the results are less precise.

Which method is better? In practice, we use a fusion of the two.
Fusion method: combine formal parsing with searching to extract key information, as the sketch below shows.
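
A minimal sketch of the contrast, using the demo page from the next section (the URL and the hrefs it contains are the ones used below):

import re
import requests
from bs4 import BeautifulSoup

html = requests.get("http://python123.io/ws/demo.html").text

# Pure search (Method Two): a regex over the raw HTML, ignoring the markup
print(re.findall(r'href="(.*?)"', html))   # fast, but fragile if the markup varies

# Fusion: search for <a> tags with the parser, then read out each href
soup = BeautifulSoup(html, "html.parser")
print([a.get('href') for a in soup.find_all('a')])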

2. A small example

The example page: http://python123.io/ws/demo.html
Task: extract all URL links from the HTML.
Approach: (1) observe the page source; all URL links live in <a> tags.
   (2) search for all <a> tags.
   (3) parse the format of each <a> tag and extract the link from its href attribute.

import requests
from bs4 import BeautifulSoup      # BeautifulSoup is a class
r = requests.get("http://python123.io/ws/demo.html")
r.encoding = r.apparent_encoding
demo = r.text
soup = BeautifulSoup(demo, "html.parser")    # two arguments: the text to parse, and the parser to use ("html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))      # the find_all method is explained in the next section
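
Running this prints the two links found on the demo page (they match the find_all output shown in the next section):

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001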

3. The bs4 library's methods for searching HTML content

The method used above is <>.find_all(name, attrs, recursive, string, **kwargs).
It returns a list type that stores the search results. The parameters are described as follows:

  • name: a string to match against tag names
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):  # when the argument is True, all tags are matched
		print(tag.name)

	
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')): # tags whose names contain 'b' (here, body and b)
		print(tag.name)

	
body
b
  • attrs: a string to match against tag attribute values; tags can be retrieved by attribute
>>> soup.find_all('p','course')   # search for p tags whose class attribute value is 'course'
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')   # search all tags for the attribute id='link1'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> soup.find_all(id=re.compile('link'))    # all ids beginning with 'link'
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
  • recursive: whether to search all descendants; the default is True
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)   # [] because no <a> tag is a direct child of the document root
[]
  • string: a string to match against the text in the <>...</> string regions
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> soup.find_all(string=re.compile('python'))   # search the string regions containing the text 'python'
['This is a python demo page', 'The demo python introduces several python courses.']

Because the find_all() function is used so often, there is a shorthand form:

<Tag>(...) is equivalent to <Tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)
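
A quick REPL check of the shorthand on the same demo soup (a sketch; both calls return the same results from the same parse tree):

>>> soup('a') == soup.find_all('a')
True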

Extension methods:

method	Explanation
<>.find()	searches and returns only one result; same parameters as .find_all()
<>.find_parents()	searches ancestor nodes; returns a list; same parameters as .find_all()
<>.find_parent()	returns one result among ancestor nodes; same parameters as .find_all()
<>.find_next_sibling()	returns one result among subsequent sibling nodes; same parameters as .find_all()
<>.find_next_siblings()	searches subsequent sibling nodes; returns a list; same parameters as .find_all()
<>.find_previous_siblings()	searches preceding sibling nodes; returns a list; same parameters as .find_all()
<>.find_previous_sibling()	returns one result among preceding sibling nodes; same parameters as .find_all()
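
For example, on the demo soup (the two <a> tags sit in the same <p>, so they are siblings):

>>> soup.find('a')
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.find('a').find_next_sibling('a')
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>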

4. Hands-on: a Chinese university rankings crawler

Technical route: requests + bs4.
A directed crawler: it crawls only the given URL and does not extend the crawl to other pages.

 4.1 Analysis

The page to crawl: http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html
First open the web page and view its source code.

Step one: determine feasibility. We need to confirm that the target data can actually be found in the HTML source, because some data is generated dynamically by JavaScript when the page is visited; in that case the requests library and bs4 cannot extract it. Methods for crawling dynamic pages will be covered later; here we first crawl a static page, as the quick check below shows.
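
A quick feasibility check (a sketch: if a value we expect in the table, such as 清华大学 from the tr sample shown later in this section, appears in the raw HTML, the data is static):

import requests

r = requests.get("http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html", timeout=30)
r.encoding = r.apparent_encoding
print("清华大学" in r.text)   # True means the ranking data is present in the static HTML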

Step two: check the site's robots protocol; it places no restrictions on crawlers.

Step three: define the output: print the ranking, university name, and score. We need to find where these targets sit in the HTML document.

Step four: design the program structure (a main() driver sketch follows this list).

  • Fetch the page content from the web: implemented by the getHTMLText() function
  • Extract the information from the page content into a suitable data structure (the key step): implemented by the fillUnivList() function
  • Use the data structure to display and output the results: implemented by the printUnivList() function
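
These three steps map directly onto a main() driver; the same function appears in the complete code in section 4.3:

def main():
    uinfo = []                   # the data structure shared by the three functions
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    html = getHTMLText(url)      # step 1: fetch the page
    fillUnivList(uinfo, html)    # step 2: extract rows into uinfo
    printUnivList(uinfo, 20)     # step 3: display the top 20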

 4.2 Implementation

(1) Fetch the page content from the web: the getHTMLText() function. The code is fairly simple, so it is not analyzed in detail.

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an HTTPError if the status code is not 200
        r.encoding = r.apparent_encoding  # guess the encoding from the content itself
        return r.text
    except:
        return ""                         # on any failure, return an empty string

(2) Extract the needed information from the page content into a suitable data structure: the fillUnivList() function.

  • First, read the HTML source to learn which tags hold our targets: the ranking, university name, and score. Then find those tags.
  • On inspection, everything we need sits under a tag called <tbody>.
  • Analyzing the tbody tag further, soup.tbody.prettify() prints the tree structure under that tag. The analysis shows that each university's information sits in a tr tag inside the tbody tag. One tr tag, printed with soup.tbody.tr.prettify(), looks like this:
<tr class="alt">
 <td>
  1
 </td>
 <td>
  <div align="left">
   清华大学
  </div>
 </td>
 <td>
  北京市
 </td>
 <td>
  95.9
 </td>
 <td class="hidden-xs need-hidden indicator5">
  100.0
 </td>
 <td class="hidden-xs need-hidden indicator6" style="display:none;">
  97.90%
 </td>
 <td class="hidden-xs need-hidden indicator7" style="display:none;">
  37342
 </td>
 <td class="hidden-xs need-hidden indicator8" style="display:none;">
  1.298
 </td>
 <td class="hidden-xs need-hidden indicator9" style="display:none;">
  1177
 </td>
 <td class="hidden-xs need-hidden indicator10" style="display:none;">
  109
 </td>
 <td class="hidden-xs need-hidden indicator11" style="display:none;">
  1137711
 </td>
 <td class="hidden-xs need-hidden indicator12" style="display:none;">
  1187
 </td>
 <td class="hidden-xs need-hidden indicator13" style="display:none;">
  593522
 </td>
</tr>

  • At a glance: the ranking is the string in the first td tag, the university name in the second, the province in the third, and the total score in the fourth. Now extract them. The specific code follows (note the comments!):
def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:    # each tr is a child node of tbody; .children returns an iterator
        if isinstance(tr, bs4.element.Tag):   # isinstance() checks whether an object has a given type (similar to type()); here, whether tr is a Tag
            tds = tr("td")   # equivalent to tr.find_all("td"); returns a list of all td tags under this tr
            ulist.append([tds[0].string, tds[1].string, tds[3].string])   # tds[3] is the total score (tds[2] is the province, per the tr sample above)
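
With the sample tr above, the first row appended to ulist would be (a quick check, assuming the page layout shown):

>>> ulist[0]
['1', '清华大学', '95.9']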

(3) Use the data structure to display and output the results: the printUnivList() function.

def printUnivList(ulist, num):              # \t is a tab character
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"   # {3} means: when printing the school name, use format()'s fourth argument as the fill character, i.e. the full-width CJK space
    print(tplt.format("排名", "学校", "分数", chr(12288)))    # pad with the full-width space so the Chinese columns align
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

This relies on Python's format() function; it is not explained in detail here, so look it up yourself if needed.

Option	Meaning
'<'	Forces the field to be left-aligned within the available space (the default for most objects).
'>'	Forces the field to be right-aligned within the available space (the default for numbers).
'^'	Forces the field to be centered within the available space.
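
A quick REPL illustration of the fill and alignment syntax (a sketch of the same trick used in tplt):

>>> "{0:^10}".format("abc")       # centered, padded with ordinary spaces
'   abc    '
>>> "{0:{1}^10}".format("学校", chr(12288))   # centered, padded with the full-width space
'\u3000\u3000\u3000\u3000学校\u3000\u3000\u3000\u3000'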

 4.3 The complete code

import requests
import bs4
from bs4 import BeautifulSoup      # BeautifulSoup is a class

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:    # each tr is a child node of tbody; .children is an iterator
        if isinstance(tr, bs4.element.Tag):   # isinstance() checks whether an object has a given type, similar to type()
            tds = tr("td")   # equivalent to tr.find_all("td"); returns a list
            ulist.append([tds[0].string, tds[1].string, tds[3].string])   # ranking, name, total score

def printUnivList(ulist, num):           # \t is a tab character
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校", "分数", chr(12288)))    # pad with the full-width space so Chinese columns align
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)   # print only the top 20 schools

main()


5. Summary

  This article explained general methods of information extraction. Using the requests and BeautifulSoup libraries, the general procedure can be summarized as: find the specific location of the information (search), then use the parser to traverse that local region (parse the form) to find and print the target.
  A few examples in this article used the re library (regular expressions), though only lightly; regular expressions are very important and will be explained in a later article. Stay tuned!
