The text and images in this article come from the internet and are shared only for learning and exchange, not for any commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us so we can handle it.
Author: TinaLY
There are many crawling tutorials online, but they generally have two problems:
First: you run into a lot of bugs when trying them, the code usually cannot be used as-is, and debugging it is difficult and rather maddening.
This post is based on my own experience: it crawls a small amount of data and keeps the code as open as possible.
The crawler is built mainly on selenium, so browser pages will be opened continuously while it runs.
The example uses Jinan. As a small-scale test, data is collected for a single administrative district; once you are familiar with the code, you can change it to loop over districts.
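As a rough sketch of that later step (it is not part of the original code), the listing URLs for several districts could be generated as below. The URL pattern follows the one used later in this post; the function name `listing_urls`, the page counts, and every district slug other than `pingyin` are assumptions for illustration only.

```python
# Sketch only: build the listing-page URLs for several districts.
# BASE_URL and 'pingyin' come from the post; other slugs are hypothetical.
BASE_URL = 'https://jn.lianjia.com/xiaoqu'


def listing_urls(districts, pages_per_district):
    """Yield (district, page, url) for every listing page to visit."""
    for district in districts:
        # Page numbers start at 1, matching the pg1, pg2, ... pattern.
        for page in range(1, pages_per_district[district] + 1):
            yield district, page, '{}/{}/pg{}'.format(BASE_URL, district, page)


if __name__ == '__main__':
    districts = ['pingyin', 'lixia']        # 'lixia' is a hypothetical slug
    pages = {'pingyin': 2, 'lixia': 1}      # page counts must be read off the site
    for district, page, url in listing_urls(districts, pages):
        print(district, page, url)
```

Keeping URL generation separate from the browser logic makes it easy to test the loop without opening any pages.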
The code is as follows:
Key packages:
from selenium import webdriver
from urllib import request, parse
from selenium.common.exceptions import NoSuchElementException
import time  # needed for time.sleep() in the crawl loop
Parameter definitions (the first three lines are for the Amap (Gaode) API, which is used to obtain coordinates; the fourth line is the city to crawl, which appears later in the page links):
amap_web_key = 'your key'  # replace with your own Amap Web API key
poi_search_url = "http://restapi.amap.com/v3/place/text"
poi_boundary_url = "https://ditu.amap.com/detail/get/detail"
city = 'jinan'
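The post refers to a `getpoi_page` function for the coordinate lookup but does not show it. The following is a minimal sketch of what such a lookup might look like, not the author's implementation. It assumes the place-text endpoint accepts `key`, `keywords`, `city`, and `output` query parameters and returns JSON with a `status` field and a `pois` list whose entries carry a `location` string of the form `"lng,lat"`. The helper names `build_poi_url` and `parse_location`, and the extra `city` and `key` arguments, are hypothetical.

```python
import json
from urllib import parse, request

poi_search_url = "http://restapi.amap.com/v3/place/text"


def build_poi_url(keywords, city, key):
    """Build the search URL for one community name (assumed parameter names)."""
    query = parse.urlencode(
        {'key': key, 'keywords': keywords, 'city': city, 'output': 'json'})
    return poi_search_url + '?' + query


def parse_location(payload):
    """Return (lng, lat) as strings, or '' when the response has no usable hit."""
    data = json.loads(payload)
    pois = data.get('pois') or []
    if data.get('status') != '1' or not pois:
        return ''
    location = pois[0].get('location')
    # Guard against a missing or non-string location field.
    if not isinstance(location, str) or ',' not in location:
        return ''
    lng, lat = location.split(',', 1)
    return lng, lat


def getpoi_page(keywords, city, key):
    """Query the API over the network; requires a valid key."""
    url = build_poi_url(keywords, city, key)
    with request.urlopen(url, timeout=10) as resp:
        return parse_location(resp.read().decode('utf-8'))
```

Returning `''` on failure mirrors the `location == ''` check in the commented-out lines of the crawl loop, so the caller can fall back to writing `'0'`, `'0'`.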
Key Code:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36'}
# Note: the find_element(s)_by_* helpers used below were removed in Selenium 4.3;
# on newer versions use find_element(By.CLASS_NAME, ...) and friends instead.
total_pages = 1  # set this to the number of pages shown after entering a district
driver1 = webdriver.Chrome()
pageid = 1
while pageid <= total_pages:
    allarray = []
    print('pageid=', pageid)
    url = 'https://jn.lianjia.com/xiaoqu/pingyin/pg' + str(pageid)
    driver1.get(url)
    driver1.implicitly_wait(5)
    house_list = driver1.find_elements_by_class_name('img')
    for i in range(len(house_list)):
        time.sleep(2)
        temparray = []
        detailurl = house_list[i].get_attribute('href')
        print(i, 'detailurl', detailurl)
        driver = webdriver.Chrome()
        driver.get(detailurl)
        try:
            housename = driver.find_element_by_class_name('detailTitle').text
            price = driver.find_element_by_class_name('xiaoquUnitPrice').text
            xiaoquinfo = driver.find_elements_by_class_name('xiaoquInfoContent')
            # [land area, building area, floor area ratio, greening rate, parking spaces, total buildings, total households, property company, property fee, property description, floor condition]
            xiaoquage = xiaoquinfo[0].text  # building age
            jianzhuleixing = xiaoquinfo[1].text  # building type
            wuyefei = xiaoquinfo[2].text  # property fee
            dongshu = xiaoquinfo[5].text  # total number of buildings
            Hushu = xiaoquinfo[6].text  # total number of households
            temparray.append(housename)
            temparray.append(price)
            temparray.append(jianzhuleixing)  # building type
            temparray.append(wuyefei)  # property fee
            temparray.append(dongshu)  # total number of buildings
            temparray.append(Hushu)  # total number of households
            # location = getpoi_page(temparray[0])  # call the Amap API function to get coordinates
            # Query coordinates via Amap.
            # The end goal is to plot the communities on a map, so point coordinates are needed.
            # The Amap open API can provide them, but each key has a limited number of queries.
            # To avoid errors midway, collect all the housing data first and look up the
            # coordinates in one pass afterwards. For beginners, ease of implementation comes first!
            temparray.append('0')
            temparray.append('0')
            # if (location == ''):
            #     temparray.append('0')
            #     temparray.append('0')
            # else:
            #     temparray.append(location[0])
            #     temparray.append(location[1])
            # break
            # print(temparray)
        except NoSuchElementException as msg:
            # Exception handling is very, very important. Sites such as Taobao and Alibaba
            # use a uniform HTML format across pages, but experienced readers know there are
            # always one or two pages that break the pattern. If the exception handling is
            # not written well, the whole run can easily come to nothing.
            # print("element lookup failed for community", i)
            try:
                housename = driver.find_element_by_class_name('detailTitle').text
                # Check the exact class string against the live page before relying on it.
                price = driver.find_element_by_css_selector("[class='Clear xiaoquPrice']").text
                # Compare with the price lookup above: the exception occurs because the
                # price tag has two different class attributes.
                xiaoquinfo = driver.find_elements_by_class_name('xiaoquInfoContent')
                # [land area, building area, floor area ratio, greening rate, parking spaces, total buildings, total households, property company, property fee, property description, floor condition]
                xiaoquage = xiaoquinfo[0].text  # building age
                jianzhuleixing = xiaoquinfo[1].text  # building type
                wuyefei = xiaoquinfo[2].text  # property fee
                dongshu = xiaoquinfo[5].text  # total number of buildings
                Hushu = xiaoquinfo[6].text  # total number of households
                temparray.append(housename)
                temparray.append(price)
                temparray.append(jianzhuleixing)  # building type
                temparray.append(wuyefei)  # property fee
                temparray.append(dongshu)  # total number of buildings
                temparray.append(Hushu)  # total number of households
                temparray.append('0')
                temparray.append('0')
            except NoSuchElementException as msg:
                print("not found in either case")
        allarray.append(temparray)
        driver.close()
    # text_save is a helper that is not shown in this post; it writes the rows to a file.
    text_save(allarray, 'lianjia_fangwu.txt')
    pageid += 1
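The loop above calls `text_save(allarray, 'lianjia_fangwu.txt')`, but the post does not show that function. A minimal sketch of one possible version follows; the body is an assumption, not the author's code. It appends rather than overwrites, because `allarray` is reset on every page, and it uses the standard `csv` module so that commas inside a field do not break the row.

```python
import csv


def text_save(rows, filename, mode='a'):
    """Append each row of values to filename as one CSV line (sketch only)."""
    # newline='' lets the csv module control line endings itself.
    with open(filename, mode, encoding='utf-8', newline='') as f:
        csv.writer(f).writerows(rows)
```

With this version, each crawled community becomes one line of the form name, price, building type, property fee, buildings, households, followed by the two coordinate placeholders.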