Crawler technology: crawling the Sunshine Q&A platform with coroutines

Crawling the Sunshine Q&A posts with coroutines, with simple data storage: 145,226 records in total, crawl time 3.65 hours, and it felt even longer than that.

The code is as follows:

from gevent import monkey
monkey.patch_all()  # patch blocking IO first, so requests yields to other coroutines instead of blocking the whole thread

import time
import re

import gevent
import requests
import chardet
import lxml.etree


def get_url_list(url):
    resp = requests.get(url)
    # Detect the page encoding: chardet reports gb2312, but that codec cannot decode
    # every byte on the page, so decode with its superset GBK instead
    code_style = chardet.detect(resp.content)
    html_str = resp.content.decode("GBK")
    e = lxml.etree.HTML(html_str)
    count_str = e.xpath("//div[@class='pagination']/text()")[-1]
    # Get the total number of posts
    count = re.findall(r"(\d+)", count_str)[0]
    # Work out how many pages there are (30 posts per page)
    page_count = int(count) // 30
    url_list = list()
    for i in range(0, page_count + 1):
        url_list.append(url.format(i * 30))
    return url_list


def get_per_page_info(url_list, f):
    for url in url_list:
        try:
            resp = requests.get(url=url, timeout=10)  # long waits are common here, so set a timeout
            html_str = resp.content.decode("GBK", errors="ignore")  # decode errors happen easily here, so add errors="ignore"
            e = lxml.etree.HTML(html_str)
            tr_list = e.xpath("//div[@class='greyframe']//table[2]//table//tr")
            info_str = ""
            for tr in tr_list:
                serial_num = tr.xpath(".//td[1]/text()")[0]
                info_str += serial_num
                info_str += " "
                request_type = tr.xpath(".//td[2]//a[1]/text()")[0]
                info_str += request_type
                info_str += " "
                request_reason = tr.xpath(".//td[2]//a[2]/text()")[0]
                info_str += request_reason
                info_str += " "
                duty_department = tr.xpath(".//td[2]//a[3]/text()")[0]
                info_str += duty_department
                info_str += " "
                status = tr.xpath(".//td[3]//span/text()")[0]
                info_str += status
                info_str += " "
                person = tr.xpath(".//td[4]/text()")  # during testing this index easily went out of range, so check it first
                if len(person) == 0:
                    person = "MISS"
                    info_str += person
                else:
                    person = tr.xpath(".//td[4]/text()")[0]
                    info_str += person
                info_str += " "
                post_time = tr.xpath(".//td[5]/text()")[0]  # the post's date; named post_time so it does not shadow the time module
                info_str += post_time
                info_str += "\r\n"
            print(info_str)
            f.write(info_str)

        except Exception as e:
            print(e)


if __name__ == '__main__':
    t = time.time()
    url = "http://wz.sun0769.com/index.php/question/report?page={}"
    url_list = get_url_list(url)
    f = open("sun_info.txt", "w", encoding="utf-8")
    # Split the work so that 10 coroutines run the task, each receiving its own share of url_list
    xclist = [[], [], [], [], [], [], [], [], [], []]
    N = len(xclist)
    task_list = list()
    for i in range(len(url_list)):
        xclist[i % N].append(url_list[i])
    for i in range(N):
        task_list.append(gevent.spawn(get_per_page_info, xclist[i], f))

    gevent.joinall(task_list)
    f.close()
    print(time.time() - t)


# 13162.275838851929  (execution time in seconds)
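A side note on the encoding comment in get_url_list: GBK is a superset of GB2312, so bytes that chardet labels as gb2312 may still contain characters only the wider GBK table can decode. A minimal sketch of the difference (the sample character is just an illustration):

raw = "镕".encode("gbk")  # "镕" exists in GBK but is missing from the GB2312 table
print(raw.decode("gbk"))  # decodes fine
try:
    raw.decode("gb2312")
except UnicodeDecodeError as exc:
    print(exc)  # the narrower codec rejects the byte pair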

Review coroutine knowledge:

Multiple threads compete for shared resources, which makes those resources unsafe and has to be resolved with thread locks. Why, then, don't multiple coroutines run into the same situation?
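As a quick sketch of the problem that locks solve (this counter example is illustrative, not taken from the crawler above):

import threading

counter = 0
lock = threading.Lock()

def add_with_lock():
    global counter
    for _ in range(100000):
        with lock:  # only one thread at a time may do the read-modify-write
            counter += 1

threads = [threading.Thread(target=add_with_lock) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; without it, interleaved updates can be lost and the total comes up short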

Because multiple coroutines all run inside a single thread. When one coroutine hits a blocking IO operation, execution automatically switches to another coroutine so that one can do useful work. In this crawler, writing the file is an IO operation: when one coroutine blocks on the write, the program immediately switches to another coroutine that requests pages and filters data, and at the right moment switches back (a context presumably records where the blocked write left off, so execution can resume there; this part is still a vague concept to me # TODO keep researching) to the coroutine doing the read/write, and the data gets written.
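A minimal sketch of that switching (the URLs and names are only illustrative):

import gevent
from gevent import monkey
monkey.patch_all()  # make socket IO yield to the gevent event loop

import requests

def fetch(name, url):
    print(name, "start")
    requests.get(url, timeout=10)  # blocking IO: gevent parks this coroutine here
    print(name, "done")            # resumes once the response arrives

# Both coroutines live in one thread; while one waits on the network the other
# runs, so both "start" lines print before either "done".
gevent.joinall([
    gevent.spawn(fetch, "a", "http://example.com"),
    gevent.spawn(fetch, "b", "http://example.org"),
])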

