Analysis of Lianjia (Homelink) second-hand housing data in Guangzhou - crawling the data



An earlier post covered the basics of web scraping with R and the rvest package. Now let's put it into practice by crawling the 40,000+ second-hand housing listings for Guangzhou on Lianjia (Homelink).
(Screenshot: Lianjia homepage)

The earlier post, Web Scraping with R, already explained the scraping method itself, so it is not repeated here. This post shows how to crawl data that is spread across multiple pages of a website.


>> Web Scraping across Multiple Pages

First, observe how the URL changes as you page through the results. For the Guangzhou Lianjia second-hand housing listings:

The first page: https://gz.lianjia.com/ershoufang/

The second page: https://gz.lianjia.com/ershoufang/pg2/

The third page: https://gz.lianjia.com/ershoufang/pg3/

......

From this we can infer that the URL for any page is "https://gz.lianjia.com/ershoufang/pg" + the page number.
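As a minimal sketch of that pattern, the page URLs can be built with paste0(). The variable names base_url and page_urls are only illustrative, and the trailing slash is added to match the URLs listed above.

```r
# Base URL taken from the pattern observed above
base_url <- "https://gz.lianjia.com/ershoufang/pg"

# paste0() is vectorised, so a range of page numbers yields one URL per page
page_urls <- paste0(base_url, 1:3, "/")
print(page_urls)
```

Because paste0() is vectorised, extending the range (for example 1:100) produces the URL for every page in one call.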

1) Suppose we need to scrape the housing prices from page 1 through page 100. We can start by writing the code that scrapes a single page and wrap it in a function.

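library(rvest)  # provides read_html(), html_nodes(), html_text()

# Scrape the total prices listed on one results page.
getHouseInfo <- function(pageNum, urlWithoutPageNum) {
  url <- paste0(urlWithoutPageNum, pageNum)  # e.g. ".../ershoufang/pg2"
  webpage <- read_html(url, encoding = "UTF-8")
  # Select the <span> elements inside elements with class "totalPrice"
  total_price_data_html <- html_nodes(webpage, ".totalPrice span")
  total_price_data <- html_text(total_price_data_html)
  data.frame(totalprice = total_price_data, stringsAsFactors = FALSE)
}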
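To see what the ".totalPrice span" selector extracts without hitting the live site, here is a small offline sketch. The HTML snippet is a hypothetical, heavily simplified stand-in for the real listing markup, and the numeric conversion is an extra step that the function above does not perform.

```r
library(rvest)

# Hypothetical, simplified markup; the real listing page is more complex.
sample_html <- '
<ul>
  <li><div class="totalPrice"><span>350</span></div></li>
  <li><div class="totalPrice"><span>128</span></div></li>
</ul>'

page <- read_html(sample_html)

# Same selector as in getHouseInfo(): <span> inside class "totalPrice"
price_nodes <- html_nodes(page, ".totalPrice span")
prices <- data.frame(totalprice = html_text(price_nodes),
                     stringsAsFactors = FALSE)

# html_text() returns character data; convert before doing arithmetic
prices$totalprice_num <- as.numeric(prices$totalprice)
print(prices)
```

Testing a selector against a saved or hand-written snippet like this is a quick way to check it before starting a long crawl.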

2) Then call this function in a loop over the pages to crawl (page 1 to page 100 in this example) and merge the per-page results into a single data frame.

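url <- "https://gz.lianjia.com/ershoufang/pg"
houseInfo <- data.frame()
# 1553 is the last page crawled here; use 1:100 for pages 1 to 100 only
for (ii in 1:1553) {
  # Append this page's rows to the accumulated result
  houseInfo <- rbind(houseInfo, getHouseInfo(ii, url))
}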
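A long crawl can fail partway through if a single request errors out. The following is one possible variant of the loop, not the method used in this post: it skips pages that fail, pauses between requests, and binds the results once at the end. The names crawlPages, fetchPage, delaySeconds and fakeFetch are hypothetical, and fakeFetch is an offline stand-in so the sketch runs without network access.

```r
# Crawl several pages with an injected page-fetching function.
# fetchPage: function(pageNum) returning a data frame, for example a
#            wrapper around getHouseInfo(); hypothetical name for this sketch.
crawlPages <- function(pages, fetchPage, delaySeconds = 1) {
  results <- vector("list", length(pages))
  for (i in seq_along(pages)) {
    results[[i]] <- tryCatch(
      fetchPage(pages[i]),
      error = function(e) {
        # Report the failure and record NULL for this page
        message("Skipping page ", pages[i], ": ", conditionMessage(e))
        NULL
      }
    )
    Sys.sleep(delaySeconds)  # pause between requests
  }
  # Drop failed pages, then bind the remaining data frames once
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Offline stand-in for a real fetcher; page 2 simulates a failed request.
fakeFetch <- function(pageNum) {
  if (pageNum == 2) stop("simulated network error")
  data.frame(page = pageNum, totalprice = as.character(100 * pageNum),
             stringsAsFactors = FALSE)
}

houseInfo <- crawlPages(1:3, fakeFetch, delaySeconds = 0)
print(houseInfo)
```

With the real scraper, the call would look something like crawlPages(1:100, function(p) getHouseInfo(p, url)). Collecting the pages in a list and calling rbind once also avoids re-copying the growing data frame on every iteration.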


>> Sample Code

Now that we know how to scrape across pages, we can attempt a complete crawl of the detailed information for the 40,000+ second-hand listings for Guangzhou on Lianjia (including area, neighborhood, number of bedrooms and living rooms, whether there is an elevator, and so on).

Download the sample code here.

Because the data set is large, the crawl takes some time. If you want to save the crawled data, take care to choose an appropriate encoding, otherwise the text is likely to come out garbled. A CSV file in a format that opens correctly in Excel on Mac is provided.
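As a rough sketch of saving with an explicit encoding, base R's write.csv() accepts a fileEncoding argument. The data frame, column names and output path below are made up for illustration, and UTF-8 is an assumption; the post does not say which encoding its CSV file uses.

```r
# Stand-in for the crawled data; "\u5929\u6cb3" is the district name Tianhe
# written in Chinese characters.
houseInfo <- data.frame(
  district = c("\u5929\u6cb3", "\u5929\u6cb3"),
  totalprice = c("350", "128"),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "houseInfo.csv")  # hypothetical output path

# Write with an explicit encoding instead of relying on the system default
write.csv(houseInfo, out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back with the same encoding to confirm the round trip
check <- read.csv(out_file, fileEncoding = "UTF-8",
                  colClasses = "character", stringsAsFactors = FALSE)
print(check)
```

If Excel still shows garbled characters, the file may need a UTF-8 byte-order mark or a locale-specific encoding such as GB18030 instead; which one works depends on the Excel version and system settings.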

(Screenshot: the crawled data)


Origin www.cnblogs.com/yukiwu/p/10975337.html