A Beginner's Python Tutorial: Crawling Lianjia (链家网)

Foreword

The text and images in this article come from the internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author; if there are any issues, please contact us so we can address them.

Author: TinaLY

PS: If you need Python learning materials, you can get them from the link below.

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

There are many crawling tutorials online, but they generally have two problems:

First, you run into many bugs, and the code usually cannot be used as-is; debugging it is painfully difficult.

Second, because the page data is not formatted entirely consistently, element lookups can fail. You need an exception-handling mechanism so that data crawled earlier is not lost before it has been saved, which would waste time and effort.
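One way to deal with the second problem is to handle failures per item and write each row to disk as soon as it is scraped, so that a late error does not discard earlier results. The sketch below is a minimal illustration under assumptions: `crawl_items`, `scrape_one`, and the tab-separated output are hypothetical names and choices, not part of the original code.

```python
import csv


def crawl_items(items, scrape_one, out_path):
    """Scrape each item, saving rows immediately and skipping failures.

    `scrape_one` is a hypothetical callable that returns a list of fields
    for one item, or raises when the page layout is not what was expected.
    Returns the items that failed so they can be retried later.
    """
    failed = []
    # Append mode: rows written by an earlier, interrupted run are kept.
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for item in items:
            try:
                row = scrape_one(item)
            except Exception as exc:  # broad on purpose: one bad page must not stop the run
                print("failed:", item, exc)
                failed.append(item)
                continue
            writer.writerow(row)
            f.flush()  # push the row to disk before moving on
    return failed
```

The returned list makes it possible to rerun only the failed items, perhaps with a different selector, instead of restarting the whole crawl.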

Based on my own experience, this post shows how to crawl a small amount of data and shares as much of the code as possible.

The crawler is built mainly on selenium, so browser pages are opened one after another while it runs.
The example uses Jinan. To keep the test small, data is collected for a single administrative district; once you are familiar with the code, you can change it to loop over districts.
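Once the single-district run works, the change to a district loop might start from something like the sketch below. It only builds the list-page URLs. The URL pattern follows the one used in this post's code; the function name `district_page_urls` and the idea of passing a page count per district are assumptions made for illustration.

```python
BASE_URL = "https://jn.lianjia.com/xiaoqu/{district}/pg{page}"


def district_page_urls(pages_per_district):
    """Return list-page URLs for every district in the mapping.

    `pages_per_district` maps a district slug (as it appears in the URL)
    to the number of result pages to visit for that district.
    """
    urls = []
    for district, page_count in pages_per_district.items():
        for page in range(1, page_count + 1):
            urls.append(BASE_URL.format(district=district, page=page))
    return urls


# "pingyin" is the district used in this post; the page count is a placeholder.
for url in district_page_urls({"pingyin": 2}):
    print(url)
```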

The code is as follows:

Key packages:

import time
from selenium import webdriver
from urllib import request, parse
from selenium.common.exceptions import NoSuchElementException

Parameters (the first three lines are used to obtain coordinates through the Amap (高德) API; the fourth line is the city to crawl, which usually also appears in the page URLs):

amap_web_key = 'your key'
poi_search_url = "http://restapi.amap.com/v3/place/text"
poi_boundary_url = "https://ditu.amap.com/detail/get/detail"
city = 'jinan'
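The post refers to a coordinate lookup function, `getpoi_page`, without showing it. The following is a rough sketch of what such a function could look like, not the author's implementation. The query parameters (`key`, `keywords`, `city`, `citylimit`, `output`) and the response fields (`pois`, `location` as a "lng,lat" string) reflect my understanding of the Amap place-text API and should be checked against its documentation; `build_poi_url` and `parse_location` are hypothetical helper names.

```python
import json
from urllib import parse, request

amap_web_key = "your key"  # placeholder, replace with your own key
poi_search_url = "http://restapi.amap.com/v3/place/text"
city = "jinan"


def build_poi_url(keywords):
    """Build the search URL for one residential-complex name."""
    query = parse.urlencode({
        "key": amap_web_key,
        "keywords": keywords,
        "city": city,
        "citylimit": "true",  # assumed parameter: restrict results to the city
        "output": "json",
    })
    return poi_search_url + "?" + query


def parse_location(payload):
    """Return [lng, lat] as strings from the first POI, or '' if none."""
    pois = payload.get("pois") or []
    if not pois:
        return ""
    location = pois[0].get("location")
    if not isinstance(location, str) or "," not in location:
        return ""
    return location.split(",")[:2]


def getpoi_page(keywords):
    """Query the API once and return [lng, lat] or '' (makes a network call)."""
    with request.urlopen(build_poi_url(keywords), timeout=10) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return parse_location(payload)
```

Returning an empty string when nothing is found matches the `location == ''` check in the commented-out part of the key code. Splitting URL building and response parsing from the network call keeps both easy to test without spending any of the key's query quota.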

Key code:

# Note: the find_element(s)_by_* helpers used below were removed in Selenium 4.3.
# On newer versions, use driver.find_element(By.CLASS_NAME, ...) with `from selenium.webdriver.common.by import By`.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36'}
driver1 = webdriver.Chrome()
pageid = 1
page_count = 1  # placeholder: set this to the number of pages shown after entering a district
while pageid <= page_count:
    allarray = []
    print('pageid =', pageid)
    url = 'https://jn.lianjia.com/xiaoqu/pingyin/pg' + str(pageid)
    driver1.get(url)
    driver1.implicitly_wait(5)
    house_list = driver1.find_elements_by_class_name('img')
    for i in range(len(house_list)):
        time.sleep(2)
        temparray = []
        detailurl = house_list[i].get_attribute('href')
        print(i, 'detailurl', detailurl)
        driver = webdriver.Chrome()
        driver.get(detailurl)
        try:
            housename = driver.find_element_by_class_name('detailTitle').text
            price = driver.find_element_by_class_name('xiaoquUnitPrice').text
            xiaoquinfo = driver.find_elements_by_class_name('xiaoquInfoContent')
            # [site area, floor area, plot ratio, greening rate, parking spaces, total buildings, total households, property company, property fee, property description, floor condition]
            xiaoquage = xiaoquinfo[0].text  # building age
            jianzhuleixing = xiaoquinfo[1].text  # building type
            wuyefei = xiaoquinfo[2].text  # property fee
            dongshu = xiaoquinfo[5].text  # total number of buildings
            hushu = xiaoquinfo[6].text  # total number of households
            temparray.append(housename)
            temparray.append(price)
            temparray.append(jianzhuleixing)  # building type
            temparray.append(wuyefei)  # property fee
            temparray.append(dongshu)  # total number of buildings
            temparray.append(hushu)  # total number of households
            # location = getpoi_page(temparray[0])  # call the Amap API function to get coordinates
            # Look up coordinates through Amap.
            # The end goal is to plot the complexes on a map, so point coordinates are needed.
            # The Amap open API can provide them, but each key has a limited number of queries.
            # To keep an error from interrupting the run midway, it is better to collect all the
            # housing data first and look up the coordinates in one batch afterwards.
            # For beginners, the priority is keeping everything easy to implement!
            temparray.append('0')
            temparray.append('0')
            # if (location == ''):
            #     temparray.append('0')
            #     temparray.append('0')
            # else:
            #     temparray.append(location[0])
            #     temparray.append(location[1])
            # break
            # print(temparray)
        except NoSuchElementException as msg:
            # Exception handling is very, very important. Sites such as Taobao and Alibaba use a
            # uniform HTML tag format, but anyone with experience knows there are always one or two
            # pages that do not follow the pattern. If exception handling is not written well,
            # the whole run can easily come to nothing.
            # print("complex", i, "element lookup failed")
            try:
                housename = driver.find_element_by_class_name('detailTitle').text
                price = driver.find_element_by_css_selector("[class='Clear xiaoquPrice']").text
                # Compare with the price lookup above: the exception occurs because the
                # class attribute of the price tag has two variants.
                xiaoquinfo = driver.find_elements_by_class_name('xiaoquInfoContent')
                # [site area, floor area, plot ratio, greening rate, parking spaces, total buildings, total households, property company, property fee, property description, floor condition]
                xiaoquage = xiaoquinfo[0].text  # building age
                jianzhuleixing = xiaoquinfo[1].text  # building type
                wuyefei = xiaoquinfo[2].text  # property fee
                dongshu = xiaoquinfo[5].text  # total number of buildings
                hushu = xiaoquinfo[6].text  # total number of households
                temparray.append(housename)
                temparray.append(price)
                temparray.append(jianzhuleixing)  # building type
                temparray.append(wuyefei)  # property fee
                temparray.append(dongshu)  # total number of buildings
                temparray.append(hushu)  # total number of households
                temparray.append('0')
                temparray.append('0')
            except NoSuchElementException as msg:
                print('not found in either case')
        allarray.append(temparray)
        driver.close()
    text_save(allarray, 'lianjia_fangwu.txt')
    pageid += 1
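The loop calls `text_save`, which is not defined in this post. A minimal sketch of such a helper follows, under two assumptions: rows are written as tab-separated lines, and the file is opened in append mode because the function is called once per page. The original helper may differ.

```python
def text_save(rows, filename, mode="a"):
    """Write each row to `filename` as one tab-separated line.

    Append mode is the default so that calling this once per page
    keeps the rows saved for earlier pages.
    """
    with open(filename, mode, encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(str(field) for field in row) + "\n")
```

With this definition, `text_save(allarray, 'lianjia_fangwu.txt')` adds one line per complex to the file after each page. Delete the file before a fresh run to avoid duplicate rows.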