Python web crawler and information extraction (2): extracting information from crawled pages

This series of notes is based on the Python course series taught by Song Tian
of Beijing Institute of Technology on China University MOOC

Reprinted from: http://www.jianshu.com/p/7b950b8a5966


4. Getting Started with Beautiful Soup Library

The Beautiful Soup library parses HTML and XML documents and extracts information from them.

  • Installation: open CMD as administrator and run pip install beautifulsoup4
    A quick test:

    Get the HTML source of the link

    Parse it with BeautifulSoup
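The quick test above can be sketched as follows. The original screenshots are missing, so this is only an illustration: demo_html is a hypothetical stand-in for the text that requests.get(url).text would return, which keeps the example runnable without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for HTML that would normally come from requests.get(url).text
demo_html = """
<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>A short demo paragraph</b></p>
<p class="course">Links: <a href="http://example.com/a" id="link1">First</a> and
<a href="http://example.com/b" id="link2">Second</a></p>
</body></html>
"""

# Build the tag tree with the parser that ships with Python
soup = BeautifulSoup(demo_html, "html.parser")

page_title = soup.title.string   # text inside <title>
first_link = soup.a["href"]      # href of the first <a> in the document
print(page_title)
print(first_link)
```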


  • Basic elements of the Beautiful Soup library
    Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree" of a document. Import it with:
    from bs4 import BeautifulSoup
    import bs4
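    Four parsers can be used with Beautiful Soup (markup stands for the HTML/XML string):
    • bs4's HTML parser: BeautifulSoup(markup, 'html.parser'), needs only bs4
    • lxml's HTML parser: BeautifulSoup(markup, 'lxml'), needs pip install lxml
    • lxml's XML parser: BeautifulSoup(markup, 'xml'), needs pip install lxml
    • html5lib's parser: BeautifulSoup(markup, 'html5lib'), needs pip install html5lib

    Basic elements of the BeautifulSoup class: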


    • Tag

      Any tag in the HTML document can be accessed as soup.<tag>; if several tags share that name, the first one is returned
    • Tag's name

      Each <tag> has a name, available as <tag>.name; it is a string
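    • Tag's attrs

      The attributes of a tag, available as <tag>.attrs; it is a dictionary

    • Tag's NavigableString

      The non-attribute string inside a tag, available as <tag>.string

    • Tag's Comment

      A comment inside a tag; it is a special kind of NavigableString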
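The five basic elements can be inspected on a small document. This is a minimal sketch; the HTML string is made up for illustration. Note that .string returns a Comment for the <b> tag without any visible marker, so the type check is the only way to tell a comment from ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Made-up markup: one tag with attributes and text, one tag holding only a comment
html = '<p class="title" id="p1">Visible text</p><b><!--hidden remark--></b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p               # Tag: the first <p> in the document
name = tag.name            # Name: a str
attrs = tag.attrs          # Attributes: a dict (class is stored as a list)
text = tag.string          # NavigableString: the text inside <p>...</p>
remark = soup.b.string     # Comment: the content of <!-- ... -->

print(name, attrs, text, remark)
print(type(text).__name__, type(remark).__name__)
```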



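  • HTML content traversal with the bs4 library

    The tag tree can be traversed in three directions: downward, upward, and in parallel (between siblings).

    • Downward traversal

      Attributes: .contents (list of child nodes), .children (iterator over child nodes), .descendants (iterator over all descendant nodes)

    • Upward traversal

      Attributes: .parent (the parent tag), .parents (iterator over all ancestors)

    • Parallel traversal

      Attributes: .next_sibling and .previous_sibling (the adjacent node in document order), .next_siblings and .previous_siblings (iterators over all following or preceding siblings). Parallel traversal only covers nodes that share the same parent.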
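A minimal sketch of the three traversal directions on a made-up fragment. The markup is written without whitespace between tags so that every sibling is a tag; in real pages the siblings and children often include newline strings as well.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div
second = div.contents[1]          # the middle <p>

# Downward: direct children, then every descendant (tags and strings)
child_names = [child.name for child in div.children]
descendant_count = len(list(div.descendants))

# Upward: the parent, then every ancestor up to the document itself
parent_name = second.parent.name
ancestor_names = [ancestor.name for ancestor in second.parents]

# Parallel: neighbours under the same parent
next_text = second.next_sibling.string
prev_text = second.previous_sibling.string
following = [sibling.string for sibling in second.next_siblings]

print(child_names, descendant_count)
print(parent_name, ancestor_names)
print(prev_text, next_text, following)
```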


  • Formatted HTML output with the bs4 library
    The prettify() method adds a newline ("\n") after each tag and its content; it can be called on the whole soup or on a single tag
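A short illustration of prettify() on a made-up one-line fragment, called once on the whole soup and once on a single tag:

```python
from bs4 import BeautifulSoup

compact = "<p><a href='x'>link</a></p>"
soup = BeautifulSoup(compact, "html.parser")

pretty_doc = soup.prettify()      # whole document, one tag or string per line
pretty_tag = soup.a.prettify()    # the same method on a single tag

print(pretty_doc)
print(pretty_tag)
```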


5. Information organization and extraction methods


  • Three forms of information markup and their comparison
    XML (eXtensible Markup Language) is an early general-purpose information markup language; it is extensible but verbose. Tags are built from names and attributes, in the following forms:
    <name>...</name>
    <name />
    <!--   -->

    JSON (JavaScript Object Notation) suits processing by programs and is more concise than XML. It uses typed key-value pairs, in the following forms:

    "key":"value"
    "key":["value1","value2"]
    "key":{"subkey":"subvalue"}

    YAML (YAML Ain't Markup Language) has the highest proportion of actual text content and reads well. It uses untyped key-value pairs, in the following forms:

    key: value
    key: # comment
      - value1
      - value2
    key:
      subkey: subvalue
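To make the comparison concrete, the sketch below stores the same made-up record as XML and as JSON and reads both with the standard library. YAML is left out because parsing it needs a third-party package such as PyYAML.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two markup forms
xml_text = ("<person><name>Ada</name>"
            "<langs><lang>Python</lang><lang>C</lang></langs></person>")
json_text = '{"name": "Ada", "langs": ["Python", "C"]}'

# XML: walk the element tree and pull out the text of each tag
root = ET.fromstring(xml_text)
xml_record = {
    "name": root.findtext("name"),
    "langs": [element.text for element in root.iter("lang")],
}

# JSON: the typed key-value pairs map directly onto dict, list, and str
json_record = json.loads(json_text)

print(xml_record)
print(json_record)
print(len(xml_text), len(json_text))   # the XML form is the longer one here
```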

  • General approaches to information extraction
    • Parse the markup completely and then extract the key information. This needs a markup parser; the advantage is accurate parsing, the disadvantage is that extraction is cumbersome and slow.
    • Ignore the markup and search the text directly for the key information. The advantage is fast extraction; the disadvantage is that accuracy depends on the content being searched.
    • Combine the two approaches, which needs both a markup parser and a text search function.
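The three approaches can be compared on one task, collecting link targets. This is a sketch on made-up markup; the stray href="..." inside the paragraph text is there to show how a plain text search can pick up a false match.

```python
import re
from bs4 import BeautifulSoup

html = ('<p>See <a href="http://example.com/a">A</a> and '
        '<a href="http://example.com/b">B</a>. '
        'Plain text mentioning href="not-a-link"</p>')

# Approach 1: parse the markup fully, then read the attribute from each <a> tag
soup = BeautifulSoup(html, "html.parser")
parsed = [a.get("href") for a in soup.find_all("a")]

# Approach 2: ignore the markup and search the raw text
searched = re.findall(r'href="([^"]+)"', html)

# Approach 3: let the parser locate the tags and a regex filter their attributes
combined = [a["href"] for a in soup.find_all("a", href=re.compile(r"^http://"))]

print(parsed)
print(searched)
print(combined)
```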
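  • HTML content search with the bs4 library
    <>.find_all(name,attrs,recursive,string,**kwargs)
    # Returns a list holding the search results
    # name: string to match against tag names
    # attrs: string to match against tag attribute values; a specific attribute can be named
    # recursive: whether to search all descendants, default True
    # string: string to match against the text between <>...</>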
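Each find_all() parameter can be tried on a small made-up fragment. This is a sketch, not an exhaustive tour; the ids and class name are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = ('<div id="main"><p class="course">intro</p>'
        '<a id="link1" href="/a">Basic Python</a>'
        '<span><a id="link2" href="/b">Advanced Python</a></span>'
        '<b>bold</b></div>')
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                            # name: one tag name
by_names = soup.find_all(["a", "b"])                    # name: a list of tag names
by_attr = soup.find_all("p", "course")                  # attrs: a bare string matches class
by_kw = soup.find_all(id=re.compile("link"))            # **kwargs: a named attribute, regex allowed
direct_only = soup.div.find_all("a", recursive=False)   # recursive: direct children only
by_string = soup.find_all(string=re.compile("Python"))  # string: match the text itself
shorthand = soup("a")                                   # calling a tag is the same as find_all

print([t["id"] for t in by_name])
print([t["id"] for t in direct_only])
print(by_string)
```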

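    Seven methods extend find_all() and take the same parameters:
    • <>.find(): searches like find_all() but returns only the first result
    • <>.find_parents(): searches the ancestors and returns a list
    • <>.find_parent(): searches the ancestors and returns one result
    • <>.find_next_siblings(): searches the following siblings and returns a list
    • <>.find_next_sibling(): searches the following siblings and returns one result
    • <>.find_previous_siblings(): searches the preceding siblings and returns a list
    • <>.find_previous_sibling(): searches the preceding siblings and returns one result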

6. Example 1: Chinese University Ranking Crawler

Step 1: Fetch the content of the university ranking page from the Internet: getHTMLText()
Step 2: Extract the information from the page content into a suitable data structure: fillUnivList()
Step 3: Use the data structure to format and print the result: printUnivList()

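import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    """Fetch a page and return its text, or an empty string on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    """Append [rank, name, total score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # download failed or the page layout changed
        return
    for tr in tbody.children:
        # Children also include whitespace strings; keep only real tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            if len(tds) > 3:
                ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # {3} is the fill character: chr(12288) is a full-width space, so the
    # Chinese university names stay aligned in the middle column
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # Header row: rank, university name, total score
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError on short lists
        print(tplt.format(str(u[0]), str(u[1]), str(u[2]), chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

if __name__ == "__main__":
    main()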
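The format template in printUnivList() uses a nested field, {1:{3}^10}, to take the fill character from a fourth argument. The sketch below isolates that technique with a single sample name; it shows why chr(12288), the full-width space, is passed in: a column padded with ordinary half-width spaces drifts when the content is Chinese text, while full-width padding has the same width as the characters.

```python
FULL_WIDTH_SPACE = chr(12288)   # U+3000, as wide as one CJK character
name = "清华大学"

# Default fill: ordinary (half-width) spaces around the name
ascii_padded = "{0:^10}".format(name)

# Nested field: argument 1 supplies the fill character for argument 0
cjk_padded = "{0:{1}^10}".format(name, FULL_WIDTH_SPACE)

# Both results are 10 characters long; only the padding character differs
print("[" + ascii_padded + "]")
print("[" + cjk_padded + "]")
```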


