(1) Open http://www.zhcw.com/ssq/kaijiangshuju/index.shtml?type=0 and use the browser's "Inspect" option to find where the page's data comes from;
(2) The information turns out to live inside <tr> tags;
(3) The code is as follows:
Crawl pages 1-5 and store the <lottery date>, <issue>, <winning numbers>, <sales>, <first prize>, and <second prize> information in a CSV file.
import requests
from bs4 import BeautifulSoup

# Collect the scraped rows in a list
form = []
for i in range(1, 6):  # pages 1 through 5 (the upper bound of range is exclusive)
    url1 = "http://kaijiang.zhcw.com/zhcw/html/ssq/list_%s.html" % i
    html1 = requests.get(url1).text
    soup = BeautifulSoup(html1, 'html.parser')
    tag = soup.find_all('tr')
    # print(tag)
    # Skip the first two rows and the last row, which are assumed not to be data rows
    for a in tag[2:len(tag) - 1]:
        temp = []
        for b in a.contents[0:12]:
            # Ignore the bare newline nodes between cells
            if b != '\n':
                temp += [b.text.strip().replace('\r\n', '').replace(' ', '').replace('\n', ' ')]
        form.append(temp)
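To check the row-extraction idea without touching the network, it can be tried on a small inline table first. This is only a minimal sketch: the HTML is made-up sample data and not the real page markup, and extract_rows with its skip_head / skip_tail parameters is a hypothetical helper, not part of the original code.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption; the real page layout may differ)
SAMPLE_HTML = """
<table>
  <tr><th>date</th><th>issue</th><th>numbers</th></tr>
  <tr><td>2019-01-01</td><td>2019001</td><td><em>01</em> <em>02</em> <em>03</em></td></tr>
  <tr><td>2019-01-03</td><td>2019002</td><td><em>04</em> <em>05</em> <em>06</em></td></tr>
</table>
"""


def extract_rows(html, skip_head=1, skip_tail=0):
    """Return the cleaned cell text of each <tr>, one list per row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    # Drop leading header rows and trailing footer rows
    data_rows = rows[skip_head:len(rows) - skip_tail]
    result = []
    for row in data_rows:
        cells = row.find_all(["td", "th"])
        # get_text(" ") keeps nested values apart; split/join collapses whitespace
        result.append([" ".join(cell.get_text(" ").split()) for cell in cells])
    return result


rows = extract_rows(SAMPLE_HTML)
for r in rows:
    print(r)
```

Selecting the cells with find_all(["td", "th"]) avoids having to filter out the newline text nodes that show up in .contents.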
Save to a CSV file:
import csv

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('Two color winning information.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['lottery date', 'issue', 'winning numbers', 'sales (yuan)', 'first prize', 'second prize'])
    for a in form:
        print(a)
        writer.writerow(a)
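The CSV step can be checked on its own by writing a few rows and reading them back. This is a rough sketch using only the standard library; the rows are made-up stand-ins for the scraped list, and save_rows / load_rows are hypothetical helper names.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the scraped list (not real lottery data)
sample_form = [
    ["2019-01-01", "2019001", "01 02 03", "1,000", "1", "2"],
    ["2019-01-03", "2019002", "04 05 06", "2,000", "3", "4"],
]
header = ["lottery date", "issue", "winning numbers", "sales", "first prize", "second prize"]


def save_rows(path, header, rows):
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "sample.csv")
save_rows(path, header, sample_form)
loaded = load_rows(path)
print(loaded[0])
print(len(loaded) - 1, "data rows")
```

Values that contain a comma, such as "1,000", are quoted automatically by csv.writer, so they come back as a single field when read.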
Output:
Summary:
The lxml parser is recommended; fall back to html.parser when necessary.
Selecting nodes by tag name is fast, but its filtering ability is weak.
Use find() to match a single result and find_all() to match multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways of getting attribute values and text values.
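The points above can be illustrated side by side on a tiny document. This is a minimal sketch with made-up markup; it uses html.parser so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<ul id="draws">
  <li class="draw"><a href="/ssq/1.html">first</a></li>
  <li class="draw"><a href="/ssq/2.html">second</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

tag_li = soup.li                             # tag-name selection: first <li> only, no filtering
first_li = soup.find("li")                   # first match, or None if nothing matches
all_li = soup.find_all("li", class_="draw")  # list of every match
links = soup.select("ul#draws li.draw a")    # CSS selector, also returns a list

first_text = first_li.get_text(strip=True)   # text value
hrefs = [a["href"] for a in links]           # attribute value via indexing
missing = links[0].get("title")              # .get() returns None when the attribute is absent

print(first_text, len(all_li), hrefs, missing)
```

Indexing with a["href"] raises KeyError when the attribute is missing, so .get() is the safer choice for optional attributes.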