[Python crawler] Crawling 400,000 house price records from an entire site in parallel, with a swappable target city



Foreword

This crawler captures housing price information; its purpose is to practice whole-site crawling and data processing at a scale of more than 100,000 records.

The most immediate effect of a larger data volume is stricter demands on the program logic, and the data structures have to be chosen carefully with Python's characteristics in mind. When capturing small amounts of data, redundant function logic, dense I/O requests, and overly deep loop nesting only cost an extra 1~2 s; as the data scale grows, that 1~2 s difference can expand into hours.

Therefore, for sites where a large amount of data has to be captured, we can cut the time cost of crawling in two ways.

1) Optimize the function logic, pick appropriate data structures, and follow Pythonic habits. For example, when merging strings, join() saves memory compared with "+" (see the sketch after this list).

2) Depending on whether the task is I/O-bound or CPU-bound, choose multi-threading or multi-processing to execute in parallel and improve efficiency.
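To illustrate point 1), here is a minimal timing sketch comparing the two ways of merging strings (the numbers vary by machine, and CPython partially optimizes the "+" loop, but join() avoids building the intermediate strings):

import timeit

parts = [str(i) for i in range(10000)]

def concat_plus():
    s = ''
    for p in parts:
        s = s + p   # builds a new string on (almost) every iteration
    return s

def concat_join():
    return ''.join(parts)   # builds the result in one pass

print(timeit.timeit(concat_plus, number=100))
print(timeit.timeit(concat_join, number=100))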

1. Get the index
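The code blocks below are shown without their import statements. For reference, this is the set of modules they rely on, inferred from the calls that appear in the code:

import copy
import re
import socket
from urllib import request, parse

import xlsxwriter
from bs4 import BeautifulSoup as BS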

Wrap the request and set a timeout.

# Get a listing page
def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://bj.fangjia.com/ershoufang/',
        'Host': r'bj.fangjia.com',
        'Connection': 'keep-alive'
    }
    timeout = 60
    socket.setdefaulttimeout(timeout)  # set the global socket timeout
    req = request.Request(url, headers=headers)
    response = request.urlopen(req).read()
    page = response.decode('utf-8')
    return page
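For example, the top-level index page can be fetched and handed to BeautifulSoup like this (a trivial usage sketch; the url is the Referer from the headers above):

index_page = get_page(r'http://bj.fangjia.com/ershoufang/')  # fetch the top-level index page
soup = BS(index_page, 'lxml')  # parsed in the later steps to build the area/plate/subway dict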

First-level location: area information


Second-level location: plate information

Storing the hierarchy in a dict makes it quick to look up the target. -> {'Chaoyang': {'Gongti', 'Anzhen', 'Jianxiang Bridge', ...}}

Third-level location: subway information (to search for listings around a subway line)

The subway information for each location is added to the dict. -> {'Chaoyang': {'Gongti': {'Line 5', 'Line 10', 'Line 13'}, 'Anzhen', 'Jianxiang Bridge', ...}}

Corresponding url: http://bj.fangjia.com/ershoufang/–r-%E6%9C%9D%E9%98%B3%7Cw-5%E5%8F%B7%E7%BA%BF%7Cb-%E6%83%A0%E6%96%B0%E8%A5%BF%E8%A1%97

Decoded url: http://bj.fangjia.com/ershoufang/–r-Chaoyang|w-Line 5|b-Huixin West Street
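The encoded and decoded forms are related by URL percent-encoding, which is exactly what parse.quote() produces in the helper shown later:

from urllib import parse

print(parse.quote('朝阳'))                  # %E6%9C%9D%E9%98%B3
print(parse.unquote('%E6%9C%9D%E9%98%B3'))  # 朝阳 (Chaoyang)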

Given the url's parameter pattern, there are two ways to obtain the target url (a usage sketch for both follows the second approach):

  1. Obtain the target url from the index path


# Get the listing url list (nested dict traversal)
def get_info_list(search_dict, layer, tmp_list, search_list):
    layer += 1  # track the current dict depth
    for i in range(len(search_dict)):
        tmp_key = list(search_dict.keys())[i]  # key at the current level
        tmp_list.append(tmp_key)   # append the key to the index list tmp_list
        tmp_value = search_dict[tmp_key]
        if tmp_value == '':   # skip empty values (checked before the str branch, since '' is also a str)
            layer -= 2           # step back out of this level
            tmp_list = tmp_list[:layer]   # trim the index list to the current depth
        elif isinstance(tmp_value, str):   # the value is a url
            tmp_list.append(tmp_value)   # append the url to tmp_list
            search_list.append(copy.deepcopy(tmp_list))   # store a copy of [index..., url] in search_list
            tmp_list = tmp_list[:layer]  # trim the index list to the current depth
        else:
            get_info_list(tmp_value, layer, tmp_list, search_list)  # the value is a nested dict: recurse
            tmp_list = tmp_list[:layer]
    return search_list
  2. Assemble the url from the dict information

{'Chaoyang': {'Gongti': {'Line 5'}}}

Parameters:

—— r-Chaoyang

—— b-Gongti

—— w-Line 5

Assembled url: http://bj.fangjia.com/ershoufang/–r-Chaoyang|w-Line 5|b-Gongti

# Assemble a combined url from the parameters
def get_compose_url(compose_tmp_url, tag_args,  key_args):
    compose_tmp_url_list = [compose_tmp_url, '|' if tag_args != 'r-' else '', tag_args, parse.quote(key_args), ]
    compose_url = ''.join(compose_tmp_url_list)
    return compose_url
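A usage sketch for both helpers (the '--' separator in the base url and the initial arguments to get_info_list are assumptions, since the post does not show the driver code):

# Approach 2: assemble the url for {'Chaoyang': {'Gongti': {'Line 5'}}} parameter by parameter
base = 'http://bj.fangjia.com/ershoufang/--'   # assumed prefix (rendered as a dash above)
url = get_compose_url(base, 'r-', '朝阳')       # r-Chaoyang
url = get_compose_url(url, 'w-', '5号线')       # w-Line 5
url = get_compose_url(url, 'b-', '工体')        # b-Gongti
# -> http://bj.fangjia.com/ershoufang/--r-%E6%9C%9D%E9%98%B3|w-5%E5%8F%B7%E7%BA%BF|b-%E5%B7%A5%E4%BD%93

# Approach 1: flatten a dict whose leaves are urls into [area, plate, subway, url] entries
search_dict = {'朝阳': {'工体': {'5号线': url}}}
search_list = get_info_list(search_dict, 0, [], [])
# -> [['朝阳', '工体', '5号线', 'http://bj.fangjia.com/ershoufang/--r-...']]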

2. Get the maximum number of pages in the index page

# Get the url for every page number of each index page
def get_info_pn_list(search_list):
    fin_search_list = []
    for i in range(len(search_list)):
        print('>>>正在抓取%s' % search_list[i][:3])
        search_url = search_list[i][3]
        try:
            page = get_page(search_url)
        except:
            print('获取页面超时')
            continue
        soup = BS(page, 'lxml')
        # get the maximum page number
        pn_num = soup.select('span[class="mr5"]')[0].get_text()
        rule = re.compile(r'\d+')
        max_pn = int(rule.findall(pn_num)[1])
        # assemble the per-page urls
        for pn in range(1, max_pn+1):
            print('************************正在抓取%s页************************' % pn)
            pn_rule = re.compile('[|]')
            fin_url = pn_rule.sub(r'|e-%s|' % pn, search_url, 1)
            tmp_url_list = copy.deepcopy(search_list[i][:3])
            tmp_url_list.append(fin_url)
            fin_search_list.append(tmp_url_list)
    return fin_search_list
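The page number is injected by replacing the first '|' in the url with '|e-N|'. A quick check of what that substitution produces (the sample url is the decoded form of the url shown earlier, with a '--' prefix assumed):

import re

search_url = 'http://bj.fangjia.com/ershoufang/--r-朝阳|w-5号线|b-惠新西街'
pn_rule = re.compile('[|]')
print(pn_rule.sub(r'|e-2|', search_url, 1))
# -> http://bj.fangjia.com/ershoufang/--r-朝阳|e-2|w-5号线|b-惠新西街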

3. Grab the listing information tags

These are the tags we want to grab:

['area', 'plate', 'subway', 'title', 'location', 'square meters', 'layout', 'floor', 'total price', 'price per square meter']

# Extract the tag information for each listing
def get_info(fin_search_list, process_i):
    print('进程%s开始' % process_i)
    fin_info_list = []
    for i in range(len(fin_search_list)):
        url = fin_search_list[i][3]
        try:
            page = get_page(url)
        except:
            print('获取tag超时')
            continue
        soup = BS(page, 'lxml')
        title_list = soup.select('a[class="h_name"]')
        address_list = soup.select('span[class="address"]')
        attr_list = soup.select('span[class="attribute"]')
        price_list = soup.find_all(attrs={"class": "xq_aprice xq_esf_width"})  # select() cannot match some attribute values (those containing spaces), so find_all(attrs={}) is used instead
        for num in range(20):
            tag_tmp_list = []
            try:
                title = title_list[num].attrs["title"]
                print(r'************************正在获取%s************************' % title)
                address = re.sub('\n', '', address_list[num].get_text())
                area = re.search('\d+[\u4E00-\u9FA5]{2}', attr_list[num].get_text()).group(0)
                layout = re.search('\d[^0-9]\d.', attr_list[num].get_text()).group(0)
                floor = re.search('\d/\d', attr_list[num].get_text()).group(0)
                price = re.search('\d+[\u4E00-\u9FA5]', price_list[num].get_text()).group(0)
                unit_price = re.search('\d+[\u4E00-\u9FA5]/.', price_list[num].get_text()).group(0)
                tag_tmp_list = copy.deepcopy(fin_search_list[i][:3])
                for tag in [title, address, area, layout, floor, price, unit_price]:
                    tag_tmp_list.append(tag)
                fin_info_list.append(tag_tmp_list)
            except:
                print('【抓取失败】')
                continue
    print('进程%s结束' % process_i)
    return fin_info_list
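To make the regular expressions above easier to follow, here is what they extract from a made-up attribute string and price string (the real page markup is not shown in the post, so the sample text is only illustrative):

import re

attr_text = '89平米 | 2室1厅 | 6/9层'   # made-up sample of attr_list[num].get_text()
price_text = '460万 30000元/平米'        # made-up sample of price_list[num].get_text()

print(re.search(r'\d+[\u4E00-\u9FA5]{2}', attr_text).group(0))   # 89平米      (square meters)
print(re.search(r'\d[^0-9]\d.', attr_text).group(0))             # 2室1厅      (layout)
print(re.search(r'\d/\d', attr_text).group(0))                   # 6/9         (floor)
print(re.search(r'\d+[\u4E00-\u9FA5]', price_text).group(0))     # 460万       (total price)
print(re.search(r'\d+[\u4E00-\u9FA5]/.', price_text).group(0))   # 30000元/平  (unit price)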

4. Assign tasks and fetch in parallel

Shard the task list, set up a process pool, and fetch in parallel.

# Assign the tasks
def assignment_search_list(fin_search_list, project_num):  # project_num is the number of tasks per process; the smaller it is, the more processes are created
    assignment_list = []
    fin_search_list_len = len(fin_search_list)
    for i in range(0, fin_search_list_len, project_num):
        start = i
        end = i+project_num
        assignment_list.append(fin_search_list[start: end])  # take a slice of the task list
    return assignment_list

Crawling in parallel with a process pool cut the time to 1/3 of the single-process crawl, for a total of about 3 h.

The computer has 4 cores; after testing, the current machine runs most efficiently when the number of tasks is set to 3.
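The post does not show the process-pool driver itself. Below is a minimal sketch of how the pieces might be wired together with multiprocessing (the function names come from the post; search_dict, the 4-worker pool, and the output file name are assumptions, and save_excel is defined in the next step):

import multiprocessing

if __name__ == '__main__':
    search_list = get_info_list(search_dict, 0, [], [])           # search_dict: the area/plate/subway url dict from step 1
    fin_search_list = get_info_pn_list(search_list)                # expand each index url into per-page urls
    assignment_list = assignment_search_list(fin_search_list, 3)   # 3 tasks per process, per the note above
    pool = multiprocessing.Pool(4)                                 # one worker per core on a 4-core machine
    results = [pool.apply_async(get_info, (assignment_list[i], i))
               for i in range(len(assignment_list))]
    pool.close()
    pool.join()
    fin_info_list = []
    for res in results:
        fin_info_list.extend(res.get())                            # merge the per-process result lists
    save_excel(fin_info_list, '房价')                               # write everything to Excel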

5. Store the crawl results in Excel for later visualization and processing

# Store the crawl results
def save_excel(fin_info_list, file_name):
    tag_name = ['区域', '板块', '地铁', '标题', '位置', '平米', '户型', '楼层', '总价', '单位平米价格']
    book = xlsxwriter.Workbook(r'C:\Users\Administrator\Desktop\%s.xls' % file_name)  # saved to the desktop by default (xlsxwriter writes xlsx-format content even with a .xls extension)
    tmp = book.add_worksheet()
    tmp.write_row('A1', tag_name)  # header row
    for i, content in enumerate(fin_info_list):
        tmp.write_row('A%s' % (i + 2), content)  # data rows start at row 2, below the header
    book.close()


Summary:

The larger the scale of the captured data, the more rigorous the program logic has to be and the more fluent the Python has to be. Writing more Pythonic code is something that takes continuous learning and practice.

