Using Python to crawl League of Legends champion win rates and pick rates from OP.GG

First, analyzing the website content

The site to crawl is OP.GG, at: "http://www.op.gg/champion/statistics"

As the web interface shows, champion details are listed on the right side. Taking Garen as an example: his win rate is 53.84%, his pick rate is 16.99%, and he is played in the top lane.

Next, analyze the page source (right-click and choose "View page source" from the menu). Searching for "53.84%" quickly locates Garen's entry.

The source shows that the champion name, win rate, and pick rate are each in a td tag, and each champion's information is in one tr tag. The tr tag is the parent of the td tags, and the tbody tag is the parent of the tr tags.

Search for the tbody tag

There are five tbody tags (each has an opening and a closing tag containing "tbody", so a search finds 10 occurrences of "tbody"). Judging by their content, they hold the top lane, jungle, mid lane, ADC, and support information respectively.

Taking only the top-lane champions as an example, we first find the tbody tag, then find its tr tags (each tr tag holds one champion's information), and then read the details from the td child tags of each tr.
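
The tbody -> tr -> td traversal described above can be tried on a small inline snippet before touching the live page. The following is a minimal sketch, assuming a heavily simplified table: the HTML sample is made up for illustration (only the Garen figures come from the text, and the "Example" row is a placeholder), and the real page has more columns per row.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the real page source.
SAMPLE_HTML = """
<table>
  <tbody class="tabItem champion-trend-tier-TOP">
    <tr><td>Garen</td><td>53.84%</td><td>16.99%</td></tr>
  </tbody>
  <tbody class="tabItem champion-trend-tier-JUNGLE">
    <tr><td>Example</td><td>50.00%</td><td>10.00%</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: find the first tbody tag (the top-lane table in this sample).
tbody = soup.find("tbody")

# Step 2: each tr inside it describes one champion.
rows = tbody.find_all("tr")

# Step 3: the td children of a tr hold the individual fields.
cells = [td.get_text(strip=True) for td in rows[0].find_all("td")]
print(cells)
```

The same three steps are what the fillHeroInformation function performs on the real page, with extra indexing because the real rows carry more cells.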

Second, the crawling steps

Crawl the page content -> extract the required information -> output the champion data

getHTMLText(url) -> fillHeroInformation(hlist, html) -> printHeroInformation(hlist)

getHTMLText(url) returns the HTML content found at the given URL

fillHeroInformation(hlist, html) extracts the required information from the HTML and stores it in the list hlist

printHeroInformation(hlist) prints the champion information held in the list hlist

Third, code implementation

1. The getHTMLText(url) function

import requests

def getHTMLText(url):  # return the HTML document
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text  # return the HTML content
    except requests.RequestException:
        return ""
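
Because getHTMLText swallows request errors and returns an empty string, the caller has to check for that case itself. A minimal sketch of such a check follows. The snake_case name get_html_text and the deliberately malformed URL are assumptions made for this example and are not part of the original code.

```python
import requests


def get_html_text(url):
    """Return the page's HTML, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise on 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the body
        return r.text
    except requests.RequestException:
        return ""


# A URL without a scheme is rejected by requests before any network
# traffic happens, so this exercises the failure path offline.
html = get_html_text("not-a-url")
if not html:
    print("download failed, nothing to parse")
```

Checking for the empty string before parsing avoids handing an empty document to the extraction step, where the tbody lookup would find nothing.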

2. The fillHeroInformation(hlist, html) function

Take one tr tag as an example. Each tr contains 7 td tags. In the 4th td, the div whose class attribute is "champion-index-table__name" holds the champion name; the 5th td holds the win rate and the 6th td holds the pick rate. This information is stored in the list hlist.

import bs4
from bs4 import BeautifulSoup

def fillHeroInformation(hlist, html):  # store the champion information in hlist
    soup = BeautifulSoup(html, "html.parser")
    # iterate over the children of the top-lane tbody tag
    for tr in soup.find(name="tbody", attrs="tabItem champion-trend-tier-TOP").children:
        if isinstance(tr, bs4.element.Tag):  # keep only tags, skipping blank-line strings
            tds = tr('td')  # find the td tags inside the tr tag
            heroName = tds[3].find(attrs="champion-index-table__name").string  # champion name
            winRate = tds[4].string  # win rate
            pickRate = tds[5].string  # pick rate
            hlist.append([heroName, winRate, pickRate])  # add the champion to hlist
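
Two details of this extraction step are easy to trip over: a multi-word class string passed to find matches only the exact attribute value, and .string returns None when a tag has more than one child. The sketch below illustrates both on a made-up row. The sample HTML is hypothetical and much simpler than the real page, and using class_ and get_text here is one possible defensive variant, not the original author's code.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified row; the real page has seven td cells per tr.
SAMPLE_HTML = """
<tbody class="tabItem champion-trend-tier-TOP">
  <tr>
    <td><div class="champion-index-table__name">Garen</div></td>
    <td> 53.84% <span></span></td>
  </tr>
</tbody>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The full class string matches; the same classes in another order do not.
exact = soup.find("tbody", class_="tabItem champion-trend-tier-TOP")
reordered = soup.find("tbody", class_="champion-trend-tier-TOP tabItem")

tds = exact.find_all("td")
name = tds[0].find(class_="champion-index-table__name").string

# The second td has two children (text plus an empty span), so .string
# is None there; get_text(strip=True) still returns the visible text.
raw = tds[1].string
win_rate = tds[1].get_text(strip=True)
print(name, raw, win_rate)
```

If the live page ever wraps a rate in extra markup, the get_text form keeps working where .string would silently yield None.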

3. The printHeroInformation(hlist) function

def printHeroInformation(hlist):  # print the contents of hlist
    # header row; each column is centered in a 20-character field
    print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format("Champion", "Win rate", "Pick rate", "Position"))
    for hero in hlist:
        print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format(hero[0], hero[1], hero[2], "Top"))
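
The format specification {:^20} used for the header and the rows centers each value in a field 20 characters wide. A small standalone illustration, using only the Garen figures quoted earlier:

```python
# "^" centers the value and 20 is the field width; shorter strings are
# padded with spaces on both sides.
cell = "{:^20}".format("Garen")
print(repr(cell))

# Three centered fields separated by tabs, as in printHeroInformation.
row = "{:^20}\t{:^20}\t{:^20}".format("Garen", "53.84%", "16.99%")
print(row)
```

Values longer than 20 characters are not truncated; the field simply grows, which shifts the columns that follow.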

4. The main() function

Assign the site address to url, create a new list hlist, call getHTMLText(url) to get the HTML document, use fillHeroInformation(hlist, html) to store the champion information in hlist, and then print it with printHeroInformation(hlist).

def main():
    url = "http://www.op.gg/champion/statistics"
    hlist = []
    html = getHTMLText(url)  # get the HTML document
    fillHeroInformation(hlist, html)  # write the champion information into hlist
    printHeroInformation(hlist)  # print the result

Fourth, the results

1. The information shown on the web interface

2. The crawl results

Fifth, the complete code

import requests
import bs4
from bs4 import BeautifulSoup

def getHTMLText(url):  # return the HTML document
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return r.text  # return the HTML content
    except requests.RequestException:
        return ""

def fillHeroInformation(hlist, html):  # store the champion information in hlist
    soup = BeautifulSoup(html, "html.parser")
    # iterate over the children of the top-lane tbody tag
    for tr in soup.find(name="tbody", attrs="tabItem champion-trend-tier-TOP").children:
        if isinstance(tr, bs4.element.Tag):  # keep only tags, skipping blank-line strings
            tds = tr('td')  # find the td tags inside the tr tag
            heroName = tds[3].find(attrs="champion-index-table__name").string  # champion name
            winRate = tds[4].string  # win rate
            pickRate = tds[5].string  # pick rate
            hlist.append([heroName, winRate, pickRate])  # add the champion to hlist

def printHeroInformation(hlist):  # print the contents of hlist
    print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format("Champion", "Win rate", "Pick rate", "Position"))
    for hero in hlist:
        print("{:^20}\t{:^20}\t{:^20}\t{:^20}".format(hero[0], hero[1], hero[2], "Top"))

def main():
    url = "http://www.op.gg/champion/statistics"
    hlist = []
    html = getHTMLText(url)  # get the HTML document
    fillHeroInformation(hlist, html)  # write the champion information into hlist
    printHeroInformation(hlist)  # print the result

main()

To crawl the jungle, mid lane, ADC, or support information instead, only one change is needed. In the fillHeroInformation(hlist, html) function, in the statement

for tr in soup.find(name="tbody", attrs="tabItem champion-trend-tier-TOP").children:

change the attrs value to "tabItem champion-trend-tier-JUNGLE", "tabItem champion-trend-tier-MID", "tabItem champion-trend-tier-ADC", or "tabItem champion-trend-tier-SUPPORT".
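
Rather than editing the class string by hand for each position, the position can be passed in as a parameter. The sketch below is one way to do that, assuming the five class suffixes listed above. The helper name fill_by_position and the inline sample HTML are hypothetical, and the sample uses shortened rows (name, win rate, pick rate) instead of the real seven-cell layout; the "Example" row is a placeholder, not real data.

```python
import bs4
from bs4 import BeautifulSoup

POSITIONS = ["TOP", "JUNGLE", "MID", "ADC", "SUPPORT"]

# Hypothetical miniature page with only two of the five tables.
SAMPLE_HTML = """
<table>
<tbody class="tabItem champion-trend-tier-TOP">
<tr><td>Garen</td><td>53.84%</td><td>16.99%</td></tr>
</tbody>
<tbody class="tabItem champion-trend-tier-MID">
<tr><td>Example</td><td>50.00%</td><td>10.00%</td></tr>
</tbody>
</table>
"""


def fill_by_position(hlist, html, position):
    """Append [name, win rate, pick rate, position] for one position."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody", class_="tabItem champion-trend-tier-" + position)
    if tbody is None:  # table missing from the page (or layout changed)
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip whitespace strings
            tds = tr.find_all("td")
            fields = [td.get_text(strip=True) for td in tds[:3]]
            hlist.append(fields + [position])


hlist = []
for position in POSITIONS:
    fill_by_position(hlist, SAMPLE_HTML, position)
print(hlist)
```

Storing the position alongside each row also removes the need to hard-code the position label when printing.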


Origin www.cnblogs.com/7758520lzy/p/12499626.html