Renrendai Scattered-Bid Crawler Example

A worked example of crawling the scattered-bid (散标) records on Renrendai.

1. Goal

Crawl the scattered-bid records on the Renrendai official website (https://www.renrendai.com/loan.html) and extract the borrower information from each record. The output looks like this:

[Screenshot: operation result]

2. Preparations

The Python libraries used are as follows (Python 3):

    # Standard crawler libraries
    import requests
    from bs4 import BeautifulSoup
    import re
    import json
    import csv
    # Selenium, for logging in
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    # Multiprocessing
    from multiprocessing import Process, Queue

    import time

3. Crawling implementation

3.1 main()

Open any scattered-bid order from the list on the official website; its URL has the form https://www.renrendai.com/loan-6782341.html, where 6782341 is the order id. Changing this number switches to a different scattered-bid record.
The crawler uses multiple processes to grab roughly 100,000 (10W) records per run, and a list comprehension to construct the URLs.

3.1.1 Construct the URLs

start_id is the starting order id. If you run multiple writer processes, create one url_list per process as needed, or use a generator to construct the URLs and reduce memory usage (a sketch follows the snippet below).

    start_id = 1  # start_id is the starting order id
    init_url = "https://www.renrendai.com/loan-{}.html"
    # Each writer process gets its own slice of 25,000 order ids; shift start_id for additional lists
    url_list1 = [init_url.format(start_id + i) for i in range(25000)]
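
If memory is a concern, the same URLs can come from a generator instead of a pre-built list. A minimal sketch, reusing the start_id and init_url conventions above:

    # Generator version: yields one URL at a time instead of materializing a 25,000-element list
    def url_generator(start_id, count):
        init_url = "https://www.renrendai.com/loan-{}.html"
        for i in range(count):
            yield init_url.format(start_id + i)

    # Usage: iterate lazily, e.g.  for url in url_generator(1, 25000): ...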

3.1.2 Start the read/write process

When the run finishes, the program prints "everything is ok, please terminate" to signal that all reading work is done.
PS: With several writer child processes, a process pool (Pool) is tidier than spawning each Process by hand; a sketch is given after the snippet below.

    # 2. Get the parent and child processes ready
    # 2.1 The parent process creates the Queue and passes it to each child process
    q = Queue()
    pw1 = Process(target=getHtmlText, args=(q, url_list1))

    pr = Process(target=parseAndSave, args=(q,))
    # 2.2 Start the child processes pw* and pr
    pw1.start()
    pr.start()
    # 2.3 Wait for pw to finish (all reading work done); only then should the pr process be forcibly terminated
    pw1.join()
    print("******************everything is ok,please terminate ******************")

3.2 get_new_cookie()

After repeated testing, the session cookie still expires after roughly 20 minutes. In the end the only workable approach was to log in to the account with Selenium and copy the post-login cookies back into the requests session.
PS: Shut down the browser driver process with quit(); using close() leaks the driver process and its memory.
The key code is as follows:

    def get_new_cookie(session):
        driver = webdriver.Chrome()
        # (The actual login steps -- opening the login page and submitting credentials -- are omitted here.)
        cookies = driver.get_cookies()
        c = requests.cookies.RequestsCookieJar()
        for item in cookies:
            c.set(item["name"], item["value"])
        session.cookies.update(c)  # refresh the session cookies with the post-login cookies
        driver.quit()
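
The Selenium login itself is not shown in the original. Purely as an illustration, a hedged sketch is below; the login URL and element locators are placeholders rather than the real page structure, and quit() sits in a finally block so the driver process is always shut down:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def selenium_login(username, password):
        """Hypothetical login helper; the URL and locators below are placeholders."""
        driver = webdriver.Chrome()
        try:
            driver.get("https://www.renrendai.com/")                       # placeholder login page
            driver.find_element(By.NAME, "username").send_keys(username)   # placeholder locator
            driver.find_element(By.NAME, "password").send_keys(password)   # placeholder locator
            driver.find_element(By.NAME, "submit").click()                 # placeholder locator
            return driver.get_cookies()
        finally:
            driver.quit()  # quit(), not close(), so the chromedriver process does not leak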

3.3 parseAndSave()

Parse the returned pages and extract the relevant information. The pattern regex used below is compiled elsewhere; a possible definition is sketched after the snippet.

    while True:
        html_text_list = q.get(True)
        for index, html_text in enumerate(html_text_list):
            try:
                # Routine BeautifulSoup parsing of the page
                bs = BeautifulSoup(html_text, "html.parser")
                info = str(bs.find("script", {"src": "/ps/static/common/page/layout_c0258d7.js"}).next_sibling.string).replace("\n", "")
                # Use the regex to pull out the fragment holding the data, then undo the escapes manually
                infoProcess = pattern.findall(info)[0].replace('\\u0022', '"').replace("\\u002D", "-").replace("'", "").replace("\\u005C", "\\").replace(";", "")
                info_dict = json.loads(infoProcess)

                # Skip records that failed to parse
                if "gender" not in info_dict["borrower"]:
                    print("gender not in borrower keys, index:", index)
                    continue

                with open("all.csv", "a") as csvfile:
                    writer = csv.writer(csvfile)
                    # Which fields to write can be chosen freely from the json
                    writer.writerow([info_dict["loan"]["loanId"]])
                print("id:{} has done".format(info_dict["loan"]["loanId"]))

            except Exception as e:
                print("Exception in parser, index:", index)
                continue
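
The compiled regex pattern is defined outside this snippet and is not shown in the original. Assuming the target script tag embeds the record as a JavaScript string of the form var info = "...";, one plausible definition would be:

    import re

    # Assumption: the data is embedded as  var info = "...";  -- adjust the pattern if the page differs
    pattern = re.compile(r'var info = "(.*?)";')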

3.4 getHtmlText()

getHtmlText() is the reading process: it requests each URL supplied by the main function and periodically re-logs in to refresh the cookies. The returned pages are passed through the multiprocessing Queue to the parsing function parseAndSave().
The main code snippet is as follows:

    htmlTextList = []
    for index, url in enumerate(url_list):
        try:
            res = session.get(url, timeout=10, headers=my_header)
            res.raise_for_status()
            res.encoding = res.apparent_encoding
            htmlTextList.append(res.text)
            print("request:" + str(index))
            # Re-login every 250 requests to refresh the session cookies
            if (index + 1) % 250 == 0:
                get_new_cookie(session)
            # Once ten page texts have accumulated, send them to the parsing process
            if (index + 1) % 10 == 0:
                q.put(htmlTextList)
                htmlTextList = []
        except Exception as e:
            print("Exception in request:", index)

4. Finally

1. This article only shows the key code; the complete code may be put on GitHub later.
2. The requests calls used in this article are blocking; with the asynchronous aiohttp library instead, about 1200 scattered-bid records can be crawled per minute (a sketch follows).
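
A minimal sketch of such an aiohttp version, assuming the same url_list1 and my_header as above; it only gathers the page texts, and the parsing from section 3.3 stays unchanged:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # One GET per scattered-bid page; failures return None so a bad page does not stop the batch
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return await resp.text()
        except Exception:
            return None

    async def crawl(url_list, headers):
        async with aiohttp.ClientSession(headers=headers) as session:
            return await asyncio.gather(*(fetch(session, url) for url in url_list))

    # Usage: html_texts = asyncio.run(crawl(url_list1, my_header))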
