Python crawler 2 ------ crawling housing information with multi-page acquisition

1. Crawling task

        Obtain the listing information and picture paths from the Nanchang second-hand housing listing pages on Lianjia (nc.lianjia.com), crawl multiple pages, and download the results into the corresponding folder. Multithreading is used to speed up the crawl, and a progress bar displays its progress. The fields "listing title", "listing location", "listing description", "listing followers", "listing link", "listing total price" and "listing unit price" are stored in a csv file, which makes the data easy to read and convenient for data analysis or other processing.

2. Technologies used

        During the crawl, BeautifulSoup is used to parse the HTML text, ThreadPoolExecutor provides multithreaded acceleration, tqdm displays the download progress, and the csv module writes the results to a csv file.
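As a minimal sketch of how these pieces fit together (the URL and the fetch_title function here are placeholders for illustration, not part of the project below):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

def fetch_title(url):
    # One I/O-bound task: download a page and parse it with BeautifulSoup
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, 'lxml').title.text

urls = [f'https://example.com/?page={i}' for i in range(1, 11)]
with ThreadPoolExecutor(max_workers=12) as pool:
    # submit takes the callable and its arguments separately
    futures = [pool.submit(fetch_title, u) for u in urls]
    # tqdm advances the bar as each future completes
    for future in tqdm(as_completed(futures), total=len(futures)):
        print(future.result())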

3. Required libraries

import csv
import os
import requests
from bs4 import BeautifulSoup
import time
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
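Of these, only requests, beautifulsoup4 (imported as bs4) and tqdm are third-party packages; csv, os, time and concurrent.futures ship with the standard library. Since the code below passes 'lxml' to BeautifulSoup, the lxml parser must be installed as well, e.g. with `pip install requests beautifulsoup4 lxml tqdm`.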

4. Source code

import csv
import os
import requests
from bs4 import BeautifulSoup
import time
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor


# Write each listing's fields as one row of the csv file
def writeIntoCsv(houseList, csv_writer):
    for house in tqdm(houseList):
        # house is one 'div.info.clear' block; its wrapping <li> holds
        # exactly one listing, so searching from the parent is safe
        parent = house.find_parent()
        title = house.div.a.text
        position = parent.find('div', class_='positionInfo').text
        desc = parent.find('div', class_='houseInfo').text
        concern = parent.find('div', class_='followInfo').text
        houseHref = house.div.a.attrs['href']
        housePrice = parent.find('div', class_='totalPrice totalPrice2').text
        unitPrice = parent.find('div', class_='unitPrice').text
        csv_writer.writerow([title, position, desc, concern, houseHref, housePrice, unitPrice])
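
# For reference, the selectors above assume each listing block looks roughly
# like the simplified markup sketched below (class names as observed on the
# Lianjia listing page at the time of writing; they may change):
#
#   <div class="info clear">
#     <div class="title"><a href="...">listing title</a></div>
#     <div class="flood"><div class="positionInfo">...</div></div>
#     <div class="address"><div class="houseInfo">...</div></div>
#     <div class="followInfo">...</div>
#     <div class="priceInfo">
#       <div class="totalPrice totalPrice2">...</div>
#       <div class="unitPrice">...</div>
#     </div>
#   </div>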


# Fetch one listing page and append its rows to the csv file
def main(url, headers, page):
    startTime = time.time()
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        houseList = soup.find_all('div', class_='info clear')
        houseDirectory = os.path.join('csvfiles', 'house')
        if not os.path.exists(houseDirectory):
            os.makedirs(houseDirectory)
        houseFilePath = os.path.join(houseDirectory, '链家南昌房源信息.csv')
        # Write the header row only once, when the file is first created
        # (note: with several worker threads this check can race; a lock
        # around the file writes would make it airtight)
        writeHeader = not os.path.exists(houseFilePath)
        with open(houseFilePath, 'a', encoding='utf-8', newline="") as f:
            # Build a csv writer on top of the file object
            csv_writer = csv.writer(f)
            if writeHeader:
                csv_writer.writerow(["房源标题", "房源位置", "房源描述", "房源关注", "房源链接", "房源价格", "房源单价"])
            writeIntoCsv(houseList, csv_writer)
            endTime = time.time()
            print(f"Page {page} fetched successfully in {endTime - startTime}s")


if __name__ == '__main__':
    mainStartTime = time.time()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.58'}
    with ThreadPoolExecutor(12) as pool:
        for i in tqdm(range(1, 11), desc="Submitting pages"):
            url = 'https://nc.lianjia.com/ershoufang/pg' + str(i) + 'rs南昌/'
            # Pass the callable and its arguments separately so main runs
            # on a worker thread instead of being called right here
            pool.submit(main, url, headers, i)
    mainEndTime = time.time()
    print(f"All pages downloaded in {mainEndTime - mainStartTime}s")

#  Timing: crawling 50 pages with multithreading took 44.71712136268616s, 40.10211801528931s and 40.3564076423645s
#  Timing: crawling 50 pages without multithreading took 45.724690437316895s
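
One pitfall worth calling out: ThreadPoolExecutor.submit must receive the callable and its arguments separately. Compare:

pool.submit(main(url, headers))     # wrong: calls main() immediately on the submitting thread and hands the pool its return value (None)
pool.submit(main, url, headers, i)  # right: the pool invokes main(url, headers, i) on a worker thread

With the first form the crawl is effectively sequential, which would explain multithreaded timings that barely beat the single-threaded run, as in the measurements above.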

5. Experiment summary

        BeautifulSoup makes it clear and simple to locate the elements we need, extract their information, and write it to a file. Since all of the information we want lives on a single page, each page only needs to be requested once, and the thread pool speeds up downloading the pages. Overall it is a worthwhile exercise and worth learning from. Friends are welcome to communicate and discuss, and remember to follow and like!!!

Origin blog.csdn.net/m0_64238843/article/details/131491902