Python: Crawling Information from Multi-Page URLs

  The previous post covered crawling data from a single page, but for URLs whose content spans multiple pages, that post's code does not capture everything. This post explains how to crawl the information on the subsequent pages.

First, inspect the elements

  Right-click on the page and select Inspect Element.

  The corresponding HTML will then appear at the bottom of the screen.

 Second, analyze the HTML and the project requirements

  The project needs to crawl all of the information. Based on the HTML examined in the first step, there are two ways to crawl the subsequent pages:

  Method One: follow each page's "next" link until that tag no longer exists

   That is:

def next_page(url):
    soup = get_requests(url)    # fetch and parse the page (from the previous post)
    draw_base_list(soup)        # extract this page's data (from the previous post)
    # Locate the pagination nav and look for the "next" link.
    pcxt = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    pcxt1 = pcxt.find('div', {'class': 'nav-links'}).find('a', {'class': 'next page-numbers'})
    if pcxt1 is not None:
        link = pcxt1.get('href')
        next_page(link)         # recurse into the next page
    else:
        print("Crawling complete")
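The helpers `get_requests` and `draw_base_list` come from the previous post, so the function above is not runnable on its own. To see the pagination logic in isolation, here is a minimal, self-contained sketch that runs the same "next"-link lookup against a static HTML snippet; the snippet and class names mirror the markup shown above, and the `bs4` package is required.

```python
from bs4 import BeautifulSoup

# Static stand-in for a fetched page's pagination markup.
html = '''
<div class="babynames-term-articles">
  <nav>
    <div class="nav-links">
      <a class="page-numbers" href="/page/2/">2</a>
      <a class="next page-numbers" href="/page/2/">Next</a>
    </div>
  </nav>
</div>
'''

def find_next_link(soup):
    """Return the href of the 'next' link, or None on the last page."""
    nav = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    nxt = nav.find('div', {'class': 'nav-links'}).find('a', {'class': 'next page-numbers'})
    return nxt.get('href') if nxt is not None else None

soup = BeautifulSoup(html, 'html.parser')
print(find_next_link(soup))  # /page/2/
```

On the last page the "next" anchor is absent, `find` returns `None`, and the crawl stops, which is exactly the terminating condition of the recursive function above.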

 

  Method Two: get the total number of pages and crawl the rest by modifying the URL

 

  From the HTML we can see that the URLs of the different pages differ only in the final number, and that number indicates which page of information the URL holds.

 

   The page's HTML gives the total page count; we only need to locate that tag and read the number.

  That is:

import re  # needed for the re.sub calls below

def get_page_size(soup):
    # Locate the pagination nav and collect all of its links.
    pcxt = soup.find('div', {'class': 'babynames-term-articles'}).find('nav')
    pcxt1 = pcxt.find('div', {'class': 'nav-links'}).find_all('a')
    # The last <a> is the "next" link, so walk the numbered links before it;
    # after the loop, s and link hold the last numbered page's tag and href.
    for i in pcxt1[:-1]:
        link = i.get('href')
        s = str(i)
    # Strip the surrounding markup, leaving only the page number.
    page = re.sub('<a class="page-numbers" href="', '', s)
    page1 = re.sub(link, '', page)
    page2 = re.sub('">', '', page1)
    page3 = re.sub('</a>', '', page2)
    pagesize = int(page3)
    print(pagesize)
    return pagesize
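The chain of `re.sub` calls can be avoided: BeautifulSoup can read a link's text directly with `get_text()`. Here is a self-contained alternative sketch under the same assumed markup (the static snippet stands in for a fetched page; `bs4` is required):

```python
from bs4 import BeautifulSoup

# Static stand-in for the site's pagination markup.
html = '''
<div class="nav-links">
  <a class="page-numbers current">1</a>
  <a class="page-numbers" href="/page/2/">2</a>
  <a class="page-numbers" href="/page/3/">3</a>
  <a class="next page-numbers" href="/page/2/">Next</a>
</div>
'''

def get_page_size(soup):
    """Total page count = text of the last numbered link (the one before 'Next')."""
    links = soup.find('div', {'class': 'nav-links'}).find_all('a')
    return int(links[-2].get_text())

soup = BeautifulSoup(html, 'html.parser')
print(get_page_size(soup))  # 3
```

This reads the number out of the tag's text instead of reconstructing it by deleting pieces of the tag's string form, so it keeps working even if attributes are added or reordered.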

  Getting the total page count does not finish this module; we still need to modify the URL to visit each page, which is done in the main function:

if __name__ == '__main__':
    url = "http://www.sheknows.com/baby-names/browse/a/"
    soup = get_requests(url)
    page = get_page_size(soup)          # total number of pages
    for i in range(1, page + 1):
        url1 = url + "page/" + str(i) + "/"   # pages differ only in this number
        soup1 = get_requests(url1)
        draw_base_list(soup1)
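The URL construction in that loop is plain string concatenation; a quick sketch of the pattern, using the base URL from the main function and assuming 3 pages for illustration:

```python
base = "http://www.sheknows.com/baby-names/browse/a/"

# Each page's URL appends "page/<n>/" to the base; assume 3 pages here.
urls = [base + "page/" + str(i) + "/" for i in range(1, 3 + 1)]
for u in urls:
    print(u)
```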

 

  With either of these two methods you can crawl the information from all of the pages. Go ahead and try it yourself.


Origin www.cnblogs.com/zhangmingfeng/p/12041702.html