01 Introduction to web crawlers

How a bare-bones crawler works

Simulate a browser request for the page data (fetch the HTML)

Extract the data (pull the target data out of the HTML)

Refine the data (normalize the format, e.g. strip whitespace)

Business processing (process the refined data as required, e.g. sorting)

Program entry point

For example:

from urllib import request
import re

# Scrape job postings from the Guizhou Talent Network
class Spider():
    # Feature string that surrounds the target; the capture group holds the target: the job title
    root_pattern = r'name="thiszw" href="[\s\S]*?" target="[\s\S]*?" title="([\s\S]*?)"'

    # Simulate a browser request and return the page data
    url = "http://www.gzrc.com.cn/SearchResult.php"
    def __fetch_content(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
        page1 = request.Request(Spider.url, headers=headers)
        # Pass the Request object so the User-Agent header is actually sent
        htmls = request.urlopen(page1).read()
        htmls = str(htmls, encoding="gbk")
        return htmls

    # Extract the data
    def __analyse(self, page):
        job_name = re.findall(Spider.root_pattern, page)
        return job_name

    # Business processing
    def __show(self,job_list):
        for rank in range(0,len(job_list)):
            print('no.'+str(rank+1)+' : '+job_list[rank])

    def go(self):
        page = self.__fetch_content()
        result = self.__analyse(page)
        self.__show(result)

if __name__ == '__main__':

    spider = Spider()
    spider.go()
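
The example above skips the refining step and does no real business processing. A minimal sketch of how those two stages might look, run against a made-up HTML fragment so that it needs no network access. The sample markup and the names `extract`, `refine`, and `sort_jobs` are hypothetical and not part of the original crawler:

```python
import re

# Hypothetical HTML fragment shaped like the markup the pattern targets.
SAMPLE_HTML = (
    '<a name="thiszw" href="/job/1" target="_blank" title=" Python Engineer ">x</a>\n'
    '<a name="thiszw" href="/job/2" target="_blank" title="Accountant\n">y</a>\n'
    '<a name="thiszw" href="/job/3" target="_blank" title="  Data Analyst">z</a>\n'
)

# Non-greedy quantifiers keep each match inside a single tag.
PATTERN = r'name="thiszw" href="[\s\S]*?" target="[\s\S]*?" title="([\s\S]*?)"'


def extract(page):
    """Extraction: return the raw title strings captured by the pattern."""
    return re.findall(PATTERN, page)


def refine(raw_titles):
    """Refining: strip surrounding whitespace and drop empty entries."""
    cleaned = (title.strip() for title in raw_titles)
    return [title for title in cleaned if title]


def sort_jobs(titles):
    """Business processing: sort titles alphabetically, ignoring case."""
    return sorted(titles, key=str.lower)


jobs = sort_jobs(refine(extract(SAMPLE_HTML)))
for rank, job in enumerate(jobs, start=1):
    print('no.' + str(rank) + ' : ' + job)
```

Keeping each stage in its own function mirrors the steps listed earlier, so a stage can be swapped out (for instance, a different sort key) without touching the others.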

  

Crawler libraries (for writing large crawlers, e.g. distributed or multi-threaded ones)

Beautiful Soup

Scrapy
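
To make the multi-threading idea concrete, here is a minimal sketch using only the standard library's `concurrent.futures`. It is not how the libraries listed above work internally. The `fetch` function is a hypothetical stand-in that fabricates a page instead of downloading one, and the `example.com` URLs are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real download; returns a fake page for the URL."""
    return '<html>' + url + '</html>'


def crawl(urls, max_workers=4):
    """Fetch all URLs on a thread pool, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Hypothetical page URLs; example.com is a placeholder domain.
urls = ['http://example.com/page/' + str(n) for n in range(1, 4)]
pages = crawl(urls)
print(len(pages))
```

In a real crawler, `fetch` would wrap something like the `urlopen` call from the earlier example, and threads help because most of the time is spent waiting on the network.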

 

 

 


Origin www.cnblogs.com/scopicat/p/11788298.html