Crawler - Crawling Guiyang housing prices (Python implementation)

Contents

1 Preface

1.1 Philosophy brought about by the pressure of survival

1.2 Buying a house & house slaves

2 Crawlers

2.1 Basic Concepts

2.2 The basic process of a crawler

3 Crawling Guiyang house prices and writing them to a table

3.1 Display of results

3.2 Code Implementation (Python) 

 


1 Preface

1.1 Philosophy brought about by the pressure of survival

Malthus was the first to observe that the innate capacity of living things to multiply in geometric progression always exceeds their actual means of survival, so the surviving population falls short of the potential one. From this he inferred that intraspecific competition must be extremely cruel and inevitable. Setting aside whether Malthus was right to issue a corresponding warning to mankind, the basic questions implicit in this observation are enough to make any thinking person ponder deeply (and shudder): what is the natural limit on an organism's capacity to over-reproduce? What advantages do the survivors of intraspecific competition rely on to win? And where do these so-called advantaged groups lead themselves?

Later, in the introduction to his epoch-making "On the Origin of Species", Darwin made a point of acknowledging the scientific contribution and inspiration of Malthus's theory. Evidently, becoming the kindred spirit of the old clergyman is not something ordinary people are qualified for!

1.2 Buying a house & house slaves

These days, when a couple marries, the woman's family generally expects the man to have a house and a car. You can't really blame the girl: in today's highly developed and turbulent society, this requirement is not actually high. But since the reform and opening up, social classes have hardened, and our generation is caught in the middle! Let's take a look at Guiyang housing prices (Lianjia new homes: https://gy.fang.lianjia.com/ )


We can't let ourselves be discarded by the times, and we can't sigh forever. Big capitalists who started from nothing are very rare; Liu Qiangdong is one of them. But idols are idols; back in reality, a child from the countryside may manage to buy a house, or may be a house slave for life. When they return to the village looking bright and successful, only they know the pain and grievance in their hearts. For my part, I don't want to be a slave to a house or a car. My happiness is my own, my life is my own, and I live it for myself, not for others to see. What I can do is improve my own ability; I refuse to be a house slave!

Enough hot-blooded sighing; time to return to today's theme. Why not pull these prices into a spreadsheet for analysis? Say it and do it: use a crawler to fetch the data, then write it to a file.

2 Crawlers

2.1 Basic Concepts

A web crawler (also called a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to certain rules. In other words, given the link address of a web page, it can obtain that page's content automatically. If the Internet is compared to a large spider web whose nodes are web pages, a web spider is what travels the web retrieving their content.
In short, a crawler is a program or automated script that imitates a human visiting a website and downloads the site's resources in batches.

Crawler: obtaining a website's information in batches by any technical means. The key word is batch.
Anti-crawler: preventing others from obtaining your website's information in batches by any technical means. The key word is again batch.
Accidental injury: mistakenly identifying an ordinary user as a crawler during anti-crawling. An anti-crawler strategy with a high accidental-injury rate cannot be used, no matter how effective it is.
Block: successfully preventing crawler access. This introduces an interception rate; generally, the higher a strategy's interception rate, the higher the chance of accidental injury, so a trade-off has to be made.
Resources: the sum of machine cost and labor cost.
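
Since both crawling and anti-crawling hinge on batch size, the simplest courtesy a crawler can offer is to throttle itself. The snippet below is a minimal sketch of that idea (it is not part of the original code; the polite_get helper, the delay value, and the User-Agent string are all illustrative):

import time
import requests

# Browser-like User-Agent, so the request is not trivially flagged as a script.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def polite_get(url, delay=1.0):
    # Fetch one URL, then pause, so requests are spread out rather than
    # fired off in the tight batches that anti-crawler systems key on.
    response = requests.get(url, headers=HEADERS, timeout=10)
    time.sleep(delay)
    return response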

2.2 The basic process of a crawler

(1) Request a web page: initiate a request to the target site through an HTTP library, i.e. send a Request; the request can carry additional headers and other information. Then wait for the server to respond.
(2) Get the response content: if the server responds normally, a Response is returned whose body is the content of the requested page; it may be HTML, a JSON string, binary data (such as images or videos), or another type.
(3) Parse the content: HTML can be parsed with regular expressions or a web-page parsing library; JSON can be converted directly into a JSON object and read; binary data can be saved or processed further.
(4) Store and analyze the data: save it in any convenient form, e.g. as plain text, in a database, or in a specific file format.

Test case: crawling the page data of Guiyang house prices.

#========== Imports =============
import requests

#===== step 1: specify the URL =========
url = 'https://gy.fang.lianjia.com/'

#===== step 2: send the request ======
# requests.get() sends a GET request and returns a response object;
# the url parameter is the address to request.
response = requests.get(url=url)

#===== step 3: get the response data ===
# The text attribute of the response object holds the response body
# as a string (the page's source code).
page_text = response.text

#==== step 4: persist the data =======
with open('贵阳房价.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print('Crawling complete!!!')


Crawling complete!!!

Process finished with exit code 0
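
The test case above writes out whatever the server sends back, even an error page. A slightly more defensive variant (a sketch, not the original code) checks the HTTP status and sets the encoding before saving:

import requests

url = 'https://gy.fang.lianjia.com/'
response = requests.get(url=url)
response.raise_for_status()   # raise requests.HTTPError on a 4xx/5xx response
response.encoding = 'utf-8'   # decode the body as UTF-8
with open('贵阳房价.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)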

3 Crawling Guiyang house prices and writing them to a table

3.1 Display of results

[Screenshot: the crawled Guiyang house-price table]

3.2 Code Implementation (Python) 

#================== Imports ==================================
from bs4 import BeautifulSoup
import numpy as np
import requests
from requests.exceptions import RequestException
import pandas as pd


#============= Fetch a web page =========================================
def craw(url, page):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36"}
        html1 = requests.request("GET", url, headers=headers, timeout=10)
        html1.encoding = 'utf-8'  # set the encoding: decode the byte response as UTF-8 text
        html = html1.text
        return html
    except RequestException:  # network or HTTP problems
        print('Failed to fetch page {0}'.format(page))
        return None
#========== Parse the page and save the data to a table ======================
def parse_page(url, page):
    html = craw(url, page)
    if html is not None:  # check before converting: str(None) would be the string 'None'
        soup = BeautifulSoup(str(html), 'lxml')
        # -- First locate the house entries, i.e. the list of li tags --
        houses = soup.select('.resblock-list-wrapper li')  # list of houses
        # -- Then extract each house's details --
        for j in range(len(houses)):  # iterate over every house
            house = houses[j]
            # Name
            recommend_project = house.select('.resblock-name a.name')
            recommend_project = [i.get_text() for i in recommend_project]  # e.g. 英华天元, 斌鑫江南御府...
            recommend_project = ' '.join(recommend_project)
            # Type
            house_type = house.select('.resblock-name span.resblock-type')
            house_type = [i.get_text() for i in house_type]  # e.g. office building, street-front shop...
            house_type = ' '.join(house_type)
            # Sale status
            sale_status = house.select('.resblock-name span.sale-status')
            sale_status = [i.get_text() for i in sale_status]  # e.g. on sale, sold out...
            sale_status = ' '.join(sale_status)
            # District
            big_address = house.select('.resblock-location span')
            big_address = [i.get_text() for i in big_address]
            big_address = ''.join(big_address)
            # Detailed address
            small_address = house.select('.resblock-location a')
            small_address = [i.get_text() for i in small_address]
            small_address = ' '.join(small_address)
            # Selling points
            advantage = house.select('.resblock-tag span')
            advantage = [i.get_text() for i in advantage]
            advantage = ' '.join(advantage)
            # Average price (per square meter)
            average_price = house.select('.resblock-price .main-price .number')
            average_price = [i.get_text() for i in average_price]  # e.g. 16000, 25000, price TBD...
            average_price = ' '.join(average_price)
            # Total price (unit: 10,000 yuan)
            total_price = house.select('.resblock-price .second')
            total_price = [i.get_text() for i in total_price]  # e.g. 总价400万/套, 总价100万/套...
            total_price = ' '.join(total_price)

            #===================== Write to the table =================================================
            information = [recommend_project, house_type, sale_status, big_address, small_address, advantage, average_price, total_price]
            information = np.array(information)
            information = information.reshape(-1, 8)
            information = pd.DataFrame(information, columns=['名称', '类型', '销售状态', '大地址', '具体地址', '优势', '均价', '总价'])

            information.to_csv('贵阳房价.csv', mode='a+', index=False, header=False)  # mode='a+' appends
        print('Page {0} stored successfully'.format(page))
    else:
        print('Parsing failed')
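
One detail worth noting: to_csv is called with header=False, so the appended rows never carry column names. A small sketch (not in the original code) that writes the header row once, before the crawl starts, so the appended data lands under proper columns:

import pandas as pd

columns = ['名称', '类型', '销售状态', '大地址', '具体地址', '优势', '均价', '总价']
# Overwrite any old file with just the header; later mode='a+' appends follow it.
pd.DataFrame(columns=columns).to_csv('贵阳房价.csv', mode='w', index=False)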


#================== Two threads =====================================
import threading

for i in range(1, 100, 2):  # walk pages 1-100 two at a time
    url1 = "https://gy.fang.lianjia.com/loupan/pg" + str(i) + "/"
    url2 = "https://gy.fang.lianjia.com/loupan/pg" + str(i + 1) + "/"

    t1 = threading.Thread(target=parse_page, args=(url1, i))      # thread 1
    t2 = threading.Thread(target=parse_page, args=(url2, i + 1))  # thread 2
    t1.start()
    t2.start()
    t1.join()  # wait for both threads, so only two pages are fetched at a time
    t2.join()
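
The loop above joins both threads each round, so at most two pages are fetched concurrently. An equivalent and arguably tidier way to get the same two-at-a-time behavior (a sketch using only the standard library, not the original code) is a thread pool:

from concurrent.futures import ThreadPoolExecutor

# Submit all 100 pages to a pool of two workers; the pool schedules them
# two at a time, matching the dual-thread setup above.
with ThreadPoolExecutor(max_workers=2) as pool:
    for page in range(1, 101):
        url = "https://gy.fang.lianjia.com/loupan/pg" + str(page) + "/"
        pool.submit(parse_page, url, page)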
