Day one on the job: finding the data source, downloading the data, and processing it.
Data source: "http://webhelp.esri.com/arcgisserver/9.3/java/geodatabases/definition_frame.htm".
Data download: right-click the page and choose "Save page as".
Data processing: bs4, comparing and inspecting elements in Chrome, and writing the extraction functions.
First, the bs4 part
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('GIS_dictionary.html', 'r', encoding='UTF-8'), features="lxml")
# tag elements
GlossaryTerm_list = soup.find_all(attrs={'class': 'GlossaryTerm'})  # complete: 1729 items
Definition_list = soup.find_all(attrs={'class': 'Definition'})  # missing the <ol> definitions
Second, comparing and inspecting elements
While preparing the data, I found that GlossaryTerm and Definition do not correspond one-to-one in number. After inspecting and analysing the page, I determined that the site's front-end code treats different forms of definition differently: terms with several senses use an <ol> ordered list, which cannot be selected directly by the bs4 attribute search used for Definition.
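The mismatch can be reproduced on a small scale. The following is a minimal sketch, assuming a hypothetical HTML fragment (its structure is invented for illustration and not copied from the Esri page) in which a single-sense definition carries the class Definition while a multi-sense definition sits in a bare <ol>. It uses Python's built-in "html.parser" so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the structure is assumed for illustration only.
SAMPLE_HTML = """
<div class="GlossaryTerm">address</div>
<div class="Definition">A location description.</div>
<div class="GlossaryTerm">layer</div>
<ol>
  <li>First sense.</li>
  <li>Second sense.</li>
</ol>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
terms = soup.find_all(attrs={"class": "GlossaryTerm"})
definitions = soup.find_all(attrs={"class": "Definition"})
ordered_lists = soup.find_all("ol")

# The class-based search misses the <ol> definition, so the counts differ.
missing = len(terms) - len(definitions)
print(len(terms), len(definitions), len(ordered_lists), missing)
```

If the number of <ol> tags equals the number of missing definitions, as it does here, the two searches together should cover every term.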
Third, writing the extraction code
Part one: where a term and its definition sit in the same block of front-end code, ".text" and ".a.attrs['name']" pair them up, which completes the first part.
'''
Definition_list already covers 1610 entries:
get each definition's text and pair it with its term
'''
defList = []
for i in Definition_list:
    defi = i.text.strip('\n')  # cleaned definition
    word = i.a.attrs['name'].replace('_', ' ')  # cleaned glossary term
    defList.append([defi, word])  # store each definition and term in a small list inside the big list
    if i.text == '':  # make sure no definition is empty
        print(i.a.attrs['name'])
# defList example: [["defi", 'word'], ["", ''], ["", ''], ["", ''], ...]
Part two: define the function func_n() to clean the data in the <ol> tags. It relies on the trick of modifying a string through an intermediate list, combined with if-based filtering. Finally, each term is paired with its definition, which completes the data preparation for the project. Tomorrow's plan: the database.
'''
<ol> tags: complete defList.
Ctrl+F finds 119 <ol> tags in total; 1610 + 119 = 1729 == len(GlossaryTerm_list), success!
'''
# Define func_n to format an <ol> definition: prepend "1."; collapse runs of "\n" into one;
# after each kept "\n" add "2." and the following numbers
def func_n(txt):
    lstTxt = list(txt)  # strings cannot be modified in place, so split into a list of characters
    n = len(lstTxt)
    newlstTxt = ["1. "]  # add the first "1."
    count = 2
    for i in range(n - 1):
        # keep only the last "\n" of a run and add a number after it; skip the '\n' + ' ' combination
        if lstTxt[i] == '\n' and lstTxt[i] != lstTxt[i + 1] and lstTxt[i + 1] != ' ':
            newlstTxt.append('\n')
            newlstTxt.append(str(count))
            newlstTxt.append('.')
            count += 1
        # copy every other character and drop all '\t'; there is no comparison with the next
        # character here, because that would also drop the first of any doubled letter
        if lstTxt[i] != '\n' and lstTxt[i] != '\t':
            newlstTxt.append(lstTxt[i])
    newlstTxt.append(lstTxt[-1])  # add the last character, which the loop does not reach
    strTxt = ''.join(newlstTxt)  # ''.join() turns the list back into a string
    return strTxt

# Apply it
ol_list = soup.find_all('ol')
for j in ol_list:
    defi_ol = j.text.strip('\n')
    defi_ol = func_n(defi_ol)
    word_ol = j.a.attrs['name'].replace('_', ' ')
    defList.append([defi_ol, word_ol])
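As a cross-check on the numbering, the same idea can be sketched with plain string methods instead of a character-by-character loop. This is an alternative, not the func_n approach itself: the name number_senses and the sample text are hypothetical, and unlike func_n it starts a new numbered item at every non-empty line, including lines that begin with a space.

```python
def number_senses(text):
    """Number each non-empty line of text as '1. ...', '2. ...', and so on."""
    # Drop tabs, split on newlines, and trim surrounding whitespace on each line.
    lines = [line.strip() for line in text.replace("\t", "").split("\n")]
    # Blank lines come from runs of consecutive "\n"; discard them.
    senses = [line for line in lines if line]
    return "\n".join(f"{k}. {sense}" for k, sense in enumerate(senses, start=1))


# Hypothetical sample text standing in for the .text of an <ol> element.
sample = "First sense of the term.\n\n\n\tSecond sense of the term.\nThird sense."
print(number_senses(sample))
```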
Data dictionary results: