Approach for a hand-written (native) crawler
Simulate a browser request for the web page (fetch the HTML)
Extract data (pull the target data out of the HTML)
Refine data (normalize the format, e.g. strip whitespace)
Business processing (whatever processing is needed after refining, such as sorting)
Program entry point
For example:
from urllib import request
import re


# Scrapes job postings from the Guizhou Talent recruitment site
class Spider():
    # Pattern for the target string; the capture group is the job title.
    # The quantifiers are non-greedy so each part stops at the next quote.
    root_pattern = r'name="thiszw" href="[\s\S]*?" target="[\s\S]*?" title="([\s\S]*?)"'
    # Page to request
    url = "http://www.gzrc.com.cn/SearchResult.php"

    # Simulate a browser request and return the page as text
    def __fetch_content(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
        # Pass the Request object so the User-Agent header is actually sent
        page1 = request.Request(Spider.url, headers=headers)
        htmls = request.urlopen(page1).read()
        htmls = str(htmls, encoding="gbk")
        return htmls

    # Data extraction
    def __analyse(self, page):
        job_name = re.findall(Spider.root_pattern, page)
        return job_name

    # Business processing
    def __show(self, job_list):
        for rank in range(0, len(job_list)):
            print('no.' + str(rank + 1) + ' : ' + job_list[rank])

    def go(self):
        page = self.__fetch_content()
        result = self.__analyse(page)
        self.__show(result)


if __name__ == '__main__':
    spider = Spider()
    spider.go()
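The example above prints the titles exactly as extracted, so the refining and sorting steps from the outline are not shown. A minimal sketch of what they might look like follows; the function names refine and sort_titles and the sample titles are made up for illustration and are not part of the original spider.

```python
import re


def refine(titles):
    """Normalize raw titles: collapse runs of whitespace, trim the ends, drop empties."""
    cleaned = []
    for title in titles:
        text = re.sub(r'\s+', ' ', title).strip()
        if text:
            cleaned.append(text)
    return cleaned


def sort_titles(titles):
    """Business step: order the titles alphabetically, ignoring case."""
    return sorted(titles, key=str.lower)


# Made-up sample data standing in for the output of the extraction step
raw = ['  Web  Developer\n', '', 'accountant ', '\tSales Manager']
result = sort_titles(refine(raw))
print(result)  # ['accountant', 'Sales Manager', 'Web Developer']
```

In the spider, these two calls would sit between the extraction step and the display step inside go().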
Crawler libraries (for writing larger crawlers, e.g. distributed or multi-threaded ones)
Beautiful Soup
Scrapy
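To give a rough idea of the multi-threading mentioned above, here is a small sketch that uses only the standard library's ThreadPoolExecutor. It does not show how any of the listed libraries work internally. The fetch function is a hypothetical stand-in that performs no network request, and the example.com URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for a real HTTP request; returns fake HTML
    return '<html>' + url + '</html>'


def crawl_all(urls, workers=4):
    """Fetch every URL on a pool of worker threads, keeping the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


# Placeholder URLs, for illustration only
urls = ['http://example.com/page/%d' % n for n in range(1, 4)]
pages = crawl_all(urls)
print(len(pages))  # 3
```

In a real crawler, fetch would be replaced by an actual request such as the __fetch_content method shown earlier, and the pool size would be kept small to avoid overloading the target site.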