See how I grab the latest house price data

After the round of housing price hikes of the past few years, strong government regulation and other factors have cooled the market; prices are gradually stabilizing and have come down from their peak. So what are prices now, and where is the market heading next? We can use Python to capture recent house price data and analyze it.

Module installation

The following modules need to be installed; if they are already installed, there is no need to install them again:

# Install the required modules
pip3 install bs4
pip3 install requests
pip3 install lxml
pip3 install numpy
pip3 install pandas

Configure request headers

Generally, when crawling a website, we wrap the request with header information to cope with the site's anti-crawling mechanisms. The simplest approach is to randomly pick a client User-Agent string for each request. The code is as follows:

# Import module
import random

# List of client User-Agent strings
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

# Build the request headers
def create_headers():
    headers = dict()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    headers["Referer"] = "http://www.ke.com"
    return headers
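As a quick sanity check, here is a minimal sketch of how the generated headers plug into a request; the call below is just an illustration, not part of the crawler:

import requests

# Build a header dict with a randomly chosen User-Agent
headers = create_headers()
print(headers)

# Use the headers on an ordinary GET request (illustrative only)
response = requests.get("http://www.ke.com", headers=headers, timeout=10)
print(response.status_code)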

Configure proxy IP

Besides the request headers configured above, if you send a large number of requests from a single IP, that IP is very likely to be blocked, and later requests from it will simply time out. To avoid being blocked, it is best to crawl through proxy IPs. So where do we find usable proxy IPs? The code below collects them from a free proxy list site:

# Import modules
from bs4 import BeautifulSoup
import requests
from lib.request.headers import create_headers

# Define variables
proxys_src = []
proxys = []

# Request the proxy list and collect addresses
def spider_proxyip(num=10):
    try:
        url = 'http://www.xicidaili.com/nt/1'
        # Fetch the proxy IP list page
        req = requests.get(url, headers=create_headers())
        source_code = req.content
        # Parse the returned html
        soup = BeautifulSoup(source_code, 'lxml')
        # Get the table rows
        ips = soup.findAll('tr')

        # Iterate over the rows
        for x in range(1, len(ips)):
            ip = ips[x]
            tds = ip.findAll("td")
            proxy_host = "{0}://".format(tds[5].contents[0]) + tds[1].contents[0] + ":" + tds[2].contents[0]
            proxy_temp = {tds[5].contents[0]: proxy_host}
            # Add to the proxy pool
            proxys_src.append(proxy_temp)
            if x >= num:
                break
    except Exception as e:
        print("Failed to fetch proxy addresses:")
        print(e)
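The code above collects proxies into proxys_src but does not show them in use. As a rough, illustrative sketch (assuming spider_proxyip() has run successfully and the proxy site's table layout is unchanged), an entry from proxys_src can be passed to requests through its proxies parameter:

import random
import requests

# Illustration only: pick one collected proxy and use it for a request
spider_proxyip(num=10)
if proxys_src:
    proxy = random.choice(proxys_src)  # e.g. {'HTTP': 'HTTP://1.2.3.4:8080'}
    # requests expects lowercase scheme keys such as 'http'/'https', so normalize the dict
    proxies = {k.lower(): v.lower() for k, v in proxy.items()}
    response = requests.get('http://bj.fang.ke.com/loupan/',
                            headers=create_headers(),
                            proxies=proxies,
                            timeout=10)
    print(response.status_code)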

House price data object

Here we model the price information of a new house as an object; once the scraped data is stored as objects, later processing becomes much more convenient. The NewHouse class looks like this:

# New house object
class NewHouse(object):
    def __init__(self, xiaoqu, price, total):
        self.xiaoqu = xiaoqu
        self.price = price
        self.total = total

    def text(self):
        return self.xiaoqu + "," + \
                self.price + "," + \
                self.total
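A quick usage illustration with made-up values, just to show the interface:

# Hypothetical sample values, purely for illustration
house = NewHouse("Example Estate", "56000", "3000000")
print(house.text())  # prints: Example Estate,56000,3000000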

Get price information and save

With the above preparations done, let's take Beike (ke.com) as an example and crawl its new house data for Beijing in batches, then save it locally. As long as the data can be captured, it can be saved in any format, and of course it could also be written to a database. Since the focus here is on how to capture the data, we save it in the simplest form, a txt text file.

# Import the modules this script needs
import re
import math
import requests
from bs4 import BeautifulSoup
# create_headers and NewHouse are defined in the sections above

# Open the file we will write to
with open("newhouse.txt", "w", encoding='utf-8') as f:
    # Variables for the new house data we need
    total_page = 1
    loupan_list = list()
    page = 'http://bj.fang.ke.com/loupan/'
    # Build the request headers
    headers = create_headers()
    # Request the url and get the response
    response = requests.get(page, timeout=10, headers=headers)
    html = response.content
    # Parse the returned html
    soup = BeautifulSoup(html, "lxml")

    # Get the total number of pages
    try:
        page_box = soup.find_all('div', class_='page-box')[0]
        matches = re.search('.*data-total-count="(\d+)".*', str(page_box))
        total_page = int(math.ceil(int(matches.group(1)) / 10))
    except Exception as e:
        print(e)

    print('Total pages: ' + str(total_page))
    # Build the request headers
    headers = create_headers()
    # Iterate from the first page
    for i in range(1, total_page + 1):
        page = 'http://bj.fang.ke.com/loupan/pg{0}'.format(i)
        print(page)
        response = requests.get(page, timeout=10, headers=headers)
        html = response.content
        # Parse the returned result
        soup = BeautifulSoup(html, "lxml")

        # Get the listing elements
        house_elements = soup.find_all('li', class_="resblock-list")
        # Loop through and pick out the elements we want
        for house_elem in house_elements:
            price = house_elem.find('span', class_="number")
            desc = house_elem.find('span', class_="desc")
            total = house_elem.find('div', class_="second")
            loupan = house_elem.find('a', class_='name')

            # Start cleaning the data
            try:
                price = price.text.strip() + desc.text.strip()
            except Exception as e:
                price = '0'

            loupan = loupan.text.replace("\n", "")
            # Continue cleaning the data
            try:
                total = total.text.strip().replace(u'总价', '')
                total = total.replace(u'/套起', '')
            except Exception as e:
                total = '0'

            # Save as a NewHouse object
            loupan = NewHouse(loupan, price, total)
            print(loupan.text())
            # Add the new house info to the list
            loupan_list.append(loupan)

    # Loop over the collected data and write it to the file
    for loupan in loupan_list:
        f.write(loupan.text() + "\n")

The code is written; now we can run it with the command python newhouse.py to capture the data. Each listing is printed to the console as it is scraped, and all of them are written to newhouse.txt at the end.
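Since each line of newhouse.txt is a simple comma-separated record, loading it for analysis is straightforward. Below is a minimal sketch using the pandas installed earlier; the column names are my own choice, and the numeric conversion assumes the cleaned price field is a plain number, which may not always hold for real listings:

import pandas as pd

# Read the comma-separated lines written by the crawler
df = pd.read_csv("newhouse.txt", names=["xiaoqu", "price", "total"])

# Treat price as a number where possible; non-numeric values become NaN
df["price_num"] = pd.to_numeric(df["price"], errors="coerce")

print("Number of listings:", len(df))
print("Average listed price:", df["price_num"].mean())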

Summary

This article introduced how to use Python to capture new house data from a real estate site in batches; by comparing each day's capture against historical data, you can judge the general trend of the property market. The code mainly involves parsing the returned html with BeautifulSoup and is not hard to follow. I hope this walkthrough provides some help.

