Crawling Lottery Information

(1) Open http://www.zhcw.com/ssq/kaijiangshuju/index.shtml?type=0 in your browser and use the "Inspect" option to find where the data on this page comes from;

 

 (2) The information turns out to be stored in <tr> tags;
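To illustrate step (2), here is a minimal sketch of how rows held in <tr> tags can be pulled out with BeautifulSoup. The markup and values in SAMPLE_HTML are made up for illustration and are much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has more columns and rows.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Issue</th></tr>
  <tr><td>2019-01-01</td><td>2019001</td></tr>
  <tr><td>2019-01-03</td><td>2019002</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr")   # every <tr> element, in document order
data_rows = rows[1:]         # skip the header row
cells = [[td.get_text(strip=True) for td in row.find_all("td")]
         for row in data_rows]
print(cells)
```

Each inner list holds the cell texts of one table row, which is the shape the crawler needs before writing to CSV.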

 

 

 (3) The code is as follows:

Crawl pages 1-5 and store the draw date, issue, winning numbers, sales, first prize, and second prize of each draw in a CSV file.

import requests
from bs4 import BeautifulSoup

# Store the crawled information in a list
form = []
for i in range(1, 6):  # pages 1-5 (the upper bound of range is exclusive)
    url1 = "http://kaijiang.zhcw.com/zhcw/html/ssq/list_%s.html" % i
    html1 = requests.get(url1).text
    soup = BeautifulSoup(html1, 'html.parser')
    tag = soup.find_all('tr')
    # print(tag)
    # Skip the first two rows and the last row, which hold no draw data
    for a in tag[2:len(tag) - 1]:
        temp = []
        for b in a.contents[0:12]:
            # a.contents also includes the '\n' strings between cells
            if b != '\n':
                temp += [b.text.strip().replace('\r\n', '').replace(' ', '').replace('\n', ' ')]
        form.append(temp)
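The row-cleaning loop is easier to check when parsing is separated from downloading. The sketch below is one possible variant, not the original author's code: parse_rows, skip_head, and skip_tail are hypothetical names, the sample markup is made up, and whitespace is collapsed with split() and join() instead of the chained replace() calls, so the output may differ slightly from the code above.

```python
from bs4 import BeautifulSoup

def parse_rows(html, skip_head=2, skip_tail=1):
    """Return one list of cleaned cell strings per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    result = []
    for row in rows[skip_head:len(rows) - skip_tail]:
        # find_all("td") returns only element nodes, so the "\n" strings
        # that appear in row.contents never need to be filtered out.
        result.append([" ".join(td.get_text().split())
                       for td in row.find_all("td")])
    return result

# Made-up markup: two header rows, one data row, one trailing row.
SAMPLE = """<table>
<tr><th colspan="2">title</th></tr>
<tr><th>Date</th><th>Numbers</th></tr>
<tr>
<td>2019-01-01</td>
<td><em>01</em>
<em>02</em></td>
</tr>
<tr><td colspan="2">pager</td></tr>
</table>"""

parsed = parse_rows(SAMPLE)
print(parsed)
```

Because parse_rows takes a string, it can be tried on a saved copy of a page without sending any requests.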

Save to a CSV file:

import csv

with open('Two color winning information.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Header row
    writer.writerow(['lottery date', 'issue', 'winning numbers', 'sales (yuan)', 'first prize', 'second prize'])
    for a in form:
        print(a)
        writer.writerow(a)
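As a quick way to confirm that the file was written as expected, the rows can be read back with csv.reader. This is a small sketch using a temporary file and a made-up row, not the crawler's real output.

```python
import csv
import os
import tempfile

# Made-up row for illustration only.
rows = [["2019-01-01", "2019001", "01 02 03 04 05 06 07"]]
header = ["lottery date", "issue", "winning numbers"]

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# newline='' lets the csv module control line endings itself, which
# avoids blank lines between rows on Windows.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

with open(path, newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f))

print(loaded)
```

Every value comes back as a string, so numeric columns such as sales would need converting before any arithmetic.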

Result:

 

 Summary:

The lxml parser is recommended; fall back to html.parser when necessary.
Selecting nodes by tag name is fast, but its filtering ability is weak.
Use find() to match a single result and find_all() to match multiple results.
If you are familiar with CSS selectors, use select().
Remember the common methods for getting attribute values and text.
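The points above can be tried on a small snippet. The markup here is made up for illustration, and html.parser is used so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
HTML = ('<div id="box">'
        '<a class="link" href="/a">First</a>'
        '<a class="link" href="/b">Second</a>'
        '</div>')
soup = BeautifulSoup(HTML, "html.parser")

by_tag = soup.a                          # tag-name access: first <a> only
first = soup.find("a", class_="link")    # first match, or None
all_links = soup.find_all("a")           # list of every match
by_css = soup.select("div#box a.link")   # CSS selector, returns a list

href = first["href"]         # attribute value; KeyError if it is missing
title = first.get("title")   # attribute value; None if it is missing
text = first.get_text()      # text inside the tag

print(href, title, text, [a["href"] for a in by_css])
```

Tag-name access such as soup.a is the shortest form but cannot filter by attributes, which is where find(), find_all(), and select() become useful.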

 

Origin www.cnblogs.com/wt714/p/12003239.html