First, analysis of the website content
The website to crawl is OP.GG, at: http://www.op.gg/champion/statistics
As the web interface shows, the hero details are listed on the right side. Taking Garen as an example: the win rate is 53.84%, the pick rate is 16.99%, and the position is top lane.
Next, analyze the page source (right-click the page and choose "View page source" from the menu). Searching for "53.84%" quickly locates Garen in the source.
The source shows that the hero name, win rate, and pick rate are all in td tags, and each hero's information is in one tr tag. The tr tag is the parent of the td tags, and the tbody tag is the parent of the tr tags.
Searching for the tbody tag
There are five tbody tags (each has an opening and a closing tag, so a search finds "tbody" 10 times). Looking at their content, they hold the top, jungle, mid, ADC, and support information, in that order.
Taking only the top lane heroes as an example, we first need to find the tbody tag, then find its tr tags (each tr tag holds one hero's information), and then get the details from the td child tags of each tr.
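The tbody > tr > td nesting described above can be tried offline on a small sample. The sketch below is only an illustration: the HTML is a hand-written, simplified stand-in for the page source (four columns instead of the real layout), the RowCollector class is a hypothetical helper, and it uses the standard library's html.parser rather than the BeautifulSoup approach used in the rest of this article, so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hand-written, simplified stand-in for the page source (not copied from the site).
SAMPLE = """
<table>
  <tbody class="tabItem champion-trend-tier-TOP">
    <tr><td>1</td><td>Garen</td><td>53.84%</td><td>16.99%</td></tr>
  </tbody>
  <tbody class="tabItem champion-trend-tier-JUNGLE"></tbody>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the td texts of every tr, grouped by the class of the enclosing tbody."""

    def __init__(self):
        super().__init__()
        self.rows = {}      # tbody class -> list of rows (each row is a list of cell texts)
        self.tbody = None   # class of the tbody currently open, if any
        self.row = None     # cells of the tr currently open, if any
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.tbody = dict(attrs).get("class", "")
            self.rows[self.tbody] = []
        elif tag == "tr" and self.tbody is not None:
            self.row = []
        elif tag == "td" and self.row is not None:
            self.in_td = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
        elif tag == "tr" and self.row is not None:
            self.rows[self.tbody].append(self.row)
            self.row = None
        elif tag == "tbody":
            self.tbody = None

    def handle_data(self, data):
        if self.in_td:
            self.row[-1] += data.strip()


collector = RowCollector()
collector.feed(SAMPLE)
top_rows = collector.rows["tabItem champion-trend-tier-TOP"]
print(top_rows)                 # one row per hero, one string per td
print(SAMPLE.count("tbody"))    # each tbody tag is counted twice: opening and closing
```

With two tbody tags in the sample, the text search reports four hits, which mirrors the 5 tags / 10 hits observation on the real page.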
Second, the crawling steps
Crawl the web content -> extract the required information -> output the hero data
getHTMLText(url)->fillHeroInformation(hlist,html)->printHeroInformation(hlist)
getHTMLText(url) returns the HTML content of the page at url
fillHeroInformation(hlist, html) extracts the required information from html and stores it in the list hlist
printHeroInformation(hlist) prints the hero information stored in the list hlist
Third, code implementation
1, getHTMLText(url) function
def getHTMLText(url):  # return the HTML document text
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on an HTTP error status
        r.encoding = r.apparent_encoding
        return r.text  # return the HTML content
    except requests.RequestException:
        return ""
2, fillHeroInformation(hlist, html) function
Taking one tr tag as an example: a tr tag contains 7 td tags. In the 4th td tag, the div tag with the attribute value "champion-index-table__name" holds the hero name; the 5th td tag holds the win rate; the 6th td tag holds the pick rate. This information is stored in the list hlist.
def fillHeroInformation(hlist, html):  # store the hero information in hlist
    soup = BeautifulSoup(html, "html.parser")
    # iterate over the child tags of the top lane tbody tag
    for tr in soup.find(name="tbody", attrs="tabItem champion-trend-tier-TOP").children:
        if isinstance(tr, bs4.element.Tag):  # keep only tags, skipping blank lines
            tds = tr('td')  # find the td tags inside the tr tag
            heroName = tds[3].find(attrs="champion-index-table__name").string  # hero name
            winRate = tds[4].string  # win rate
            pickRate = tds[5].string  # pick rate
            hlist.append([heroName, winRate, pickRate])  # add the hero information to hlist
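The class value passed to soup.find above is fixed to the top lane table. Since the tables for the other positions use the same prefix with a different suffix (JUNGLE, MID, ADC, SUPPORT), the value could be built from a position name instead. The following is a minimal sketch of that idea; the POSITIONS mapping and the tbodyClass helper are hypothetical names introduced here, not part of the original code.

```python
# Position name -> suffix of the tbody class value (suffixes as seen in the page source).
POSITIONS = {
    "top": "TOP",
    "jungle": "JUNGLE",
    "mid": "MID",
    "adc": "ADC",
    "support": "SUPPORT",
}


def tbodyClass(position):
    """Build the class value of the tbody tag for the given position (hypothetical helper)."""
    return "tabItem champion-trend-tier-" + POSITIONS[position]


print(tbodyClass("top"))
```

Inside fillHeroInformation, the call would then read soup.find(name="tbody", attrs=tbodyClass("jungle")) to select the jungle table, assuming the function is given the position as an extra argument.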
3, printHeroInformation(hlist) function
def printHeroInformation(hlist):  # print the information in the list hlist
    print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format("hero name", "win rate", "pick rate", "position"))
    for hero in hlist:
        print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format(hero[0], hero[1], hero[2], "top"))
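The "{:^20}" fields in the print calls centre each value in a column 20 characters wide, and the "\t" between them separates the columns. A small sketch of what a single field and a two-column row produce, using the Garen values mentioned earlier:

```python
# "^" centres the value, "20" is the minimum field width.
cell = "{:^20}".format("Garen")
row = "{:^20}\t{:^20}".format("Garen", "53.84%")

print(repr(cell))  # the name padded with spaces on both sides
print(len(cell))   # the padded field is 20 characters wide
print(repr(row))
```

Values longer than 20 characters are not truncated; the field simply grows, so very long names would push the following columns to the right.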
4, main() function
Assign the site address to url, create a new list hlist, call getHTMLText(url) to get the HTML document, use fillHeroInformation(hlist, html) to store the hero information in hlist, and then use printHeroInformation(hlist) to print it
def main():
    url = "http://www.op.gg/champion/statistics"
    hlist = []
    html = getHTMLText(url)  # get the HTML document text
    fillHeroInformation(hlist, html)  # write the hero information into hlist
    printHeroInformation(hlist)  # print the result
Fourth, results
1, web interface information
2, crawling results
Fifth, complete code
import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):  # return the HTML document text
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on an HTTP error status
        r.encoding = r.apparent_encoding
        return r.text  # return the HTML content
    except requests.RequestException:
        return ""


def fillHeroInformation(hlist, html):  # store the hero information in hlist
    soup = BeautifulSoup(html, "html.parser")
    # iterate over the child tags of the top lane tbody tag
    for tr in soup.find(name="tbody", attrs="tabItem champion-trend-tier-TOP").children:
        if isinstance(tr, bs4.element.Tag):  # keep only tags, skipping blank lines
            tds = tr('td')  # find the td tags inside the tr tag
            heroName = tds[3].find(attrs="champion-index-table__name").string  # hero name
            winRate = tds[4].string  # win rate
            pickRate = tds[5].string  # pick rate
            hlist.append([heroName, winRate, pickRate])  # add the hero information to hlist


def printHeroInformation(hlist):  # print the information in the list hlist
    print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format("hero name", "win rate", "pick rate", "position"))
    for hero in hlist:
        print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format(hero[0], hero[1], hero[2], "top"))


def main():
    url = "http://www.op.gg/champion/statistics"
    hlist = []
    html = getHTMLText(url)  # get the HTML document text
    fillHeroInformation(hlist, html)  # write the hero information into hlist
    printHeroInformation(hlist)  # print the result


main()
To crawl the jungle, mid, ADC, or support information instead, you only need to modify the statement
for tr in soup.find(name="tbody", attrs="tabItem champion-trend-tier-TOP").children
in the fillHeroInformation(hlist, html) function, changing the attrs value to
"tabItem champion-trend-tier-JUNGLE", "tabItem champion-trend-tier-MID", "tabItem champion-trend-tier-ADC", "tabItem champion-trend-tier-SUPPORT", and so on. The position label hard-coded in printHeroInformation(hlist) should be changed to match.