"GIS DICTIONARY A-Z" query page development (1) - building the dictionary and processing the data with bs4

Day one of the project: find the data source, download the data, process the data.

Data source: "http://webhelp.esri.com/arcgisserver/9.3/java/geodatabases/definition_frame.htm".

Data download: right-click the page and choose "Save page as".

Data processing: bs4 + comparing the output against the page with Chrome's Inspect Element + writing a function to clean the text

First, the bs4 part

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('GIS_dictionary.html','r',encoding='UTF-8'),features="lxml")
# tags
GlossaryTerm_list = soup.find_all(attrs={'class':'GlossaryTerm'})  # complete: 1729 items
Definition_list = soup.find_all(attrs={'class':'Definition'})  # missing the definitions stored in <ol>

Second, comparing counts and inspecting elements

While preparing the data I found that GlossaryTerm and Definition do not match one-to-one: there are fewer definitions than terms. After inspecting and analyzing the page, I determined that the site's front-end code marks up definitions in two different ways. A word with several meanings uses an <ol> ordered list, and such a list cannot be selected directly by the class attribute that bs4 is searching for.
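The mismatch can be reproduced on a tiny fragment. This is only a sketch: the HTML below is a made-up approximation of the glossary page's structure (the real markup may differ), and "html.parser" is used here instead of lxml only so that the snippet runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one term with a single definition, and one term
# whose meanings are wrapped in an <ol> that carries no class attribute.
html = """
<p class="GlossaryTerm">abscissa</p>
<p class="Definition"><a name="abscissa"></a>The x-coordinate of a point.</p>
<p class="GlossaryTerm">accuracy</p>
<ol>
  <li><a name="accuracy"></a>First meaning.</li>
  <li>Second meaning.</li>
</ol>
"""

soup_demo = BeautifulSoup(html, features="html.parser")
terms = soup_demo.find_all(attrs={"class": "GlossaryTerm"})
definitions = soup_demo.find_all(attrs={"class": "Definition"})
ordered = soup_demo.find_all("ol")

# Two terms, but only one element matches class="Definition";
# the other definition is only reachable through the <ol> tag.
print(len(terms), len(definitions), len(ordered))  # 2 1 1
```

Searching for the <ol> tag by name recovers the missing entries, which is why the two kinds of definition are collected in two separate passes.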

Third, writing the processing code

The first part: when the definition and its word sit in the same piece of front-end code, ".text" gives the definition and ".a.attrs['name']" gives the word, so the two can be paired directly. This completes the first portion of the data.

'''
    Handle the 1610 entries already in Definition_list:
    get each definition's text and pair it with its word
'''
defList = []
for i in Definition_list:
    defi = i.text.strip('\n')  # cleaned definition
    word = i.a.attrs['name'].replace('_', ' ')  # cleaned glossary term
    defList.append([defi, word])  # store each definition and word in a small list, then collect those in one big list
    if i.text == '':  # make sure that no definition is empty
        print(i.a.attrs['name'])
# defList example: [["defi", 'word'], ["", ''], ["", ''], ["", ''], ...]
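A quick check on the pair list can catch problems before the second pass. The sketch below assumes the [definition, word] layout shown above; check_pairs and the sample data are hypothetical names and values, not part of the original script.

```python
from collections import Counter


def check_pairs(def_list):
    """Return (words with empty definitions, words that occur more than once)."""
    # A definition counts as empty if it contains only whitespace.
    empty = [word for defi, word in def_list if not defi.strip()]
    counts = Counter(word for _, word in def_list)
    duplicated = sorted(word for word, c in counts.items() if c > 1)
    return empty, duplicated


# Invented sample data in the same [definition, word] layout.
sample = [
    ["The x-coordinate of a point.", "abscissa"],
    ["", "accuracy"],
    ["A second entry for the same word.", "abscissa"],
]
print(check_pairs(sample))  # (['accuracy'], ['abscissa'])
```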

The second part: define a function func_n() to clean the data from the <ol> tags. Because a string cannot be modified in place, the function converts it to a list as an intermediate form and filters the characters with if conditions. Finally, each cleaned definition is paired with its word, which completes the data preparation for the project. Tomorrow's plan: the database.

'''
    <ol> tags: complete defList
    Ctrl + F shows a total of 119 <ol> tags
    "1610 + 119 = 1729", success! 1729 == len(GlossaryTerm_list)
'''
# define function func_n
# format an <ol> definition: add "1." at the start; collapse several consecutive "\n" into one; add "2." and the following numbers after each "\n"
def func_n(txt):
    lstTxt = list(txt)  # strings cannot be modified directly, so break the text up into a list
    n = len(lstTxt)
    newlstTxt = ["1."]  # add the first "1."
    count = 2
    for i in range(n - 1):
        # keep only the last "\n" of a run and add the number after it; skip the '\n' + ' ' combination
        if lstTxt[i] == '\n' and lstTxt[i] != lstTxt[i + 1] and lstTxt[i + 1] != ' ':
            newlstTxt.append('\n')
            newlstTxt.append(str(count))
            newlstTxt.append('.')
            count += 1
        # copy every other character; "\n" is handled above and all '\t' are dropped
        if lstTxt[i] != '\n' and lstTxt[i] != '\t':
            newlstTxt.append(lstTxt[i])
    newlstTxt.append(lstTxt[-1])  # add the last character, which the loop does not reach
    strTxt = ''.join(newlstTxt)  # ''.join() turns the list back into a string
    return strTxt

# actual run
ol_list = soup.find_all('ol')
for j in ol_list:
    defi_ol = j.text.strip('\n')
    defi_ol = func_n(defi_ol)
    word_ol = j.a.attrs['name'].replace('_', ' ')
    defList.append([defi_ol, word_ol])
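The same kind of numbering can also be sketched by working on whole lines instead of single characters. This is an alternative, not the code used above: number_lines is a hypothetical name and the sample text is invented. It assumes each list item sits on its own line, and unlike func_n it treats every non-empty line as a new item, including lines that begin with a space.

```python
def number_lines(txt):
    """Number each non-empty line of txt as "1.", "2.", ... and drop tabs."""
    # Split on newlines, remove tabs and surrounding whitespace from each line.
    items = [line.replace("\t", "").strip() for line in txt.split("\n")]
    # Runs of consecutive "\n" produce empty strings; discard them.
    items = [item for item in items if item]
    return "\n".join(f"{k}.{item}" for k, item in enumerate(items, start=1))


sample = "First meaning.\n\n\n\tSecond meaning.\nThird meaning."
print(number_lines(sample))
# 1.First meaning.
# 2.Second meaning.
# 3.Third meaning.
```

Splitting on lines avoids index arithmetic such as looking at position i + 1, at the cost of handling continuation lines differently from the character-based version.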

 Data dictionary results:

 

Origin www.cnblogs.com/hsh17/p/11701686.html