Crawling the Sunshine Hotline (sun0769) Q&A posts with coroutines: the data is simply stored to a text file, 145,226 records in total, crawl time: 3.65 hours, and it felt like even more than that.
The code is as follows:
from gevent import monkey
# Patch the blocking calls in the standard library so gevent can switch
# coroutines on IO; without this, the requests calls below would block the
# whole thread and the greenlets would effectively run one after another.
monkey.patch_all()

import re
import time

import chardet
import gevent
import lxml.etree
import requests


def get_url_list(url):
    resp = requests.get(url)
    # chardet detects the page encoding, but the bytes cannot all be decoded
    # with the gb2312 codec, so GBK is used instead
    code_style = chardet.detect(resp.content)
    html_str = resp.content.decode("GBK")
    e = lxml.etree.HTML(html_str)
    count_str = e.xpath("//div[@class='pagination']/text()")[-1]
    # get the total number of posts
    count = re.findall(r"(\d+)", count_str)[0]
    # work out how many pages there are (30 posts per page)
    page_count = int(count) // 30
    url_list = list()
    for i in range(0, page_count + 1):
        url_list.append(url.format(i * 30))
    return url_list


def get_per_page_info(url_list, f):
    for url in url_list:
        try:
            # long waits are common here, so set a timeout
            resp = requests.get(url=url, timeout=10)
            # decoding errors happen easily here, so add errors="ignore"
            html_str = resp.content.decode("GBK", errors="ignore")
            e = lxml.etree.HTML(html_str)
            tr_list = e.xpath("//div[@class='greyframe']//table[2]//table//tr")
            for tr in tr_list:
                # build one line per row (reset info_str for every row so
                # earlier rows are not written out again)
                info_str = ""
                serial_num = tr.xpath(".//td[1]/text()")[0]
                info_str += serial_num + " "
                request_type = tr.xpath(".//td[2]//a[1]/text()")[0]
                info_str += request_type + " "
                request_reason = tr.xpath(".//td[2]//a[2]/text()")[0]
                info_str += request_reason + " "
                duty_department = tr.xpath(".//td[2]//a[3]/text()")[0]
                info_str += duty_department + " "
                status = tr.xpath(".//td[3]//span/text()")[0]
                info_str += status + " "
                # during testing this index easily went out of range,
                # so a length check is added
                person = tr.xpath(".//td[4]/text()")
                if len(person) == 0:
                    info_str += "MISS"
                else:
                    info_str += person[0]
                info_str += " "
                post_time = tr.xpath(".//td[5]/text()")[0]
                info_str += post_time + "\r\n"
                print(info_str)
                f.write(info_str)
        except Exception as e:
            print(e)


if __name__ == '__main__':
    t = time.time()
    url = "http://wz.sun0769.com/index.php/question/report?page={}"
    url_list = get_url_list(url)
    f = open("sun_info.txt", "w", encoding="utf-8")
    # split the work: spawn 10 coroutines and hand each one a slice of url_list
    xclist = [[], [], [], [], [], [], [], [], [], []]
    N = len(xclist)
    task_list = list()
    for i in range(len(url_list)):
        xclist[i % N].append(url_list[i])
    for i in range(N):
        task_list.append(gevent.spawn(get_per_page_info, xclist[i], f))
    gevent.joinall(task_list)
    f.close()
    print(time.time() - t)  # 13162.275838851929  execution time
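A quick way to see why the monkey patching at the top matters: with it, blocking requests calls overlap instead of running back to back. A minimal sketch (the URL is just a placeholder, assuming the host is reachable):

from gevent import monkey
monkey.patch_all()

import time
import gevent
import requests

def fetch(url):
    # each greenlet blocks on the network here and yields to the others
    requests.get(url, timeout=10)

t = time.time()
tasks = [gevent.spawn(fetch, "http://example.com") for _ in range(5)]
gevent.joinall(tasks)
# roughly the latency of a single request, not five, because the waits overlap
print(time.time() - t)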
A review of coroutine knowledge:
Multiple threads compete for shared resources, which makes those resources unsafe and has to be solved with a thread lock. So why doesn't this situation arise with multiple coroutines?
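To make the first half of that question concrete, here is a minimal sketch (a toy counter, not part of the crawler) of threads racing on a shared value and the lock that fixes it:

import threading

counter = 0
lock = threading.Lock()

def add():
    global counter
    for _ in range(100000):
        # without the lock, the read-modify-write of counter can interleave
        # across threads and updates get lost
        with lock:
            counter += 1

threads = [threading.Thread(target=add) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(counter)  # 400000 with the lock; often less if the lock is removed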
Because multiple coroutines all run inside a single thread, and only when one of them hits a blocking IO operation does execution automatically switch to another coroutine so that it can work. In this program the file write is an IO operation: when one coroutine blocks on the write, execution immediately switches to another coroutine, which sends a page request and filters its data, and at the right moment switches back (presumably a context is saved here, recording where the write was blocked, so the coroutine can resume writing from that point; this concept is still rather vague to me # TODO keep researching) to the coroutine doing the read/write, and the data gets written.
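A minimal sketch of that switching behavior, using gevent.sleep as a stand-in for a blocking IO call:

import gevent

def worker(name):
    for i in range(3):
        print(name, i)
        gevent.sleep(0)  # simulated blocking IO: control yields to the other coroutine

# both coroutines live in one thread and take turns at each IO point,
# so the output interleaves: a 0, b 0, a 1, b 1, ...
gevent.joinall([gevent.spawn(worker, "a"), gevent.spawn(worker, "b")])

Because only one coroutine ever runs at a time inside the thread, there is no simultaneous access to shared state, which is why no lock is needed.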