Crawling Cnblogs (博客园) browsing data with Python

1. Get the Cnblogs home page link:

       https://www.cnblogs.com/

Looking at the home page, more blog posts are loaded by going to the next page. When we click through the following pages, the Cnblogs URL changes in a regular pattern:

      https://www.cnblogs.com/#p3

This is the link for page 3: the number after #p is the page number, so we can use it to build the links for 200 pages. The code is:

url="https://www.cnblogs.com/"

def get_html(url):         #获取博客园的200个主页面的链接并一一爬取存入列表中
    html_list=[]
    for i in range(1,201):
    #for i in range(1,2):

        r=requests.get(url+"/#p"+str(i))

        r.encoding=r.apparent_encoding
        html_list.append(BeautifulSoup(r.text,"html.parser"))
    return html_list
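
As a side note, when fetching 200 pages back to back it usually helps to send a browser-like User-Agent and to pause briefly between requests. Below is a minimal sketch of such a variant; the function name get_html_politely, the header value, and the 0.5-second delay are my own illustrative choices, not part of the original script:

import time

import requests
from bs4 import BeautifulSoup

def get_html_politely(url, pages=200, delay=0.5):
    headers = {"User-Agent": "Mozilla/5.0"}   # example header value, adjust as needed
    html_list = []
    for i in range(1, pages + 1):
        r = requests.get(url + "#p" + str(i), headers=headers)
        r.encoding = r.apparent_encoding
        html_list.append(BeautifulSoup(r.text, "html.parser"))
        time.sleep(delay)                     # short pause so we do not hammer the server
    return html_list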

2. Extract blog information

Next, we need to extract the information for each blog post. Inspect the page source to see where the required fields live, then use the BeautifulSoup library to parse the source and pull them out. Here is the extraction code:

def get_text(html):      # extract each blog's info into a dictionary, then append all the dictionaries to a list
    info={
        "名字":" ",
        "标题":" ",
        "阅读量":" ",
        "评论数":" "
    }
    text_list = html.find_all("div", class_=re.compile('post_item'))
    for i in range(len(text_list)):
        try:
            text1=text_list[i]
            info["标题"] = text1.h3.a.string
            info["名字"] = text1.div.a.string
            info["阅读量"] = text1.div.contents[4].a.string[3:-1]    # slice off the surrounding label text, keep only the number
            info["评论数"] = text1.div.span.a.string[13:-1]          # same idea for the comment count
            need_list.append(info.copy())      # see my previous post for why copy() is needed here
        except AttributeError:
            continue
    return need_list
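
The copy() on the append line matters: the dictionary is a single object that gets overwritten on every iteration, so appending it directly would leave need_list holding many references to the same final record. A tiny standalone illustration of the difference:

record = {"标题": ""}
rows_wrong, rows_right = [], []
for title in ["post A", "post B"]:
    record["标题"] = title
    rows_wrong.append(record)         # every entry points at the same dict object
    rows_right.append(record.copy())  # each entry is an independent snapshot

print(rows_wrong)   # [{'标题': 'post B'}, {'标题': 'post B'}]
print(rows_right)   # [{'标题': 'post A'}, {'标题': 'post B'}]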

Each blog's fields are written into a dictionary, and all the dictionaries end up in the need_list list.

3. Write the crawled information into an Excel sheet

I use the xlsxwriter library for this:

def write_xlsx(need_list):  # write the crawled information into an Excel file
    workbook = xlsxwriter.Workbook('excel.xlsx')
    worksheet = workbook.add_worksheet('Sheet1')
    for i in range(len(need_list)):            # start at 0 so the first record is not skipped
        row = str(i + 1)                       # Excel rows are 1-based
        worksheet.write('A'+row, need_list[i]["标题"])
        worksheet.write('B'+row, need_list[i]["名字"])
        worksheet.write('C'+row, need_list[i]["阅读量"])
        worksheet.write('D'+row, need_list[i]["评论数"])
        print("yes")                           # simple progress indicator
    workbook.close()
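
xlsxwriter also accepts numeric (row, column) indices via worksheet.write(row, col, value), which avoids building cell names like 'A'+str(i) by hand. A small sketch of that style with an added header row; the function name write_xlsx_rc and the header labels are my own choices, not from the original script:

import xlsxwriter

def write_xlsx_rc(need_list):
    workbook = xlsxwriter.Workbook('excel.xlsx')
    worksheet = workbook.add_worksheet('Sheet1')
    worksheet.write_row(0, 0, ["标题", "名字", "阅读量", "评论数"])   # header row
    for i, item in enumerate(need_list, start=1):                     # data starts on the second spreadsheet row
        worksheet.write(i, 0, item["标题"])
        worksheet.write(i, 1, item["名字"])
        worksheet.write(i, 2, item["阅读量"])
        worksheet.write(i, 3, item["评论数"])
    workbook.close()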

That's it!

Finally, here is the full source code:

import requests
from bs4 import BeautifulSoup
import re

import xlsxwriter



need_list=[]
url="https://www.cnblogs.com/"

def get_html(url):         # build the links for 200 Cnblogs home pages, fetch each one, and store the parsed pages in a list
    html_list=[]
    for i in range(1,201):
    #for i in range(1,2):   # uncomment (and comment out the line above) to test with a single page
        r=requests.get(url+"#p"+str(i))      # url already ends with "/", so append "#p<page>" directly
        r.encoding=r.apparent_encoding
        html_list.append(BeautifulSoup(r.text,"html.parser"))
    return html_list

def get_text(html):      # extract each blog's info into a dictionary, then append all the dictionaries to a list
    info={
        "名字":" ",
        "标题":" ",
        "阅读量":" ",
        "评论数":" "
    }
    text_list = html.find_all("div", class_=re.compile('post_item'))
    for i in range(len(text_list)):
        try:
            text1=text_list[i]
            info["标题"] = text1.h3.a.string
            info["名字"] = text1.div.a.string
            info["阅读量"] = text1.div.contents[4].a.string[3:-1]    # slice off the surrounding label text, keep only the number
            info["评论数"] = text1.div.span.a.string[13:-1]          # same idea for the comment count
            need_list.append(info.copy())      # see my previous post for why copy() is needed here
        except AttributeError:
            continue
    return need_list


def get(html_list):       # collect the blog info from all 200 pages
    for i in range(len(html_list)):
    #for i in range(1):   # uncomment to test with a single page
        html=html_list[i]
        get_text(html)


def write_xlsx(need_list):  # write the crawled information into an Excel file
    workbook = xlsxwriter.Workbook('excel.xlsx')
    worksheet = workbook.add_worksheet('Sheet1')
    for i in range(len(need_list)):            # start at 0 so the first record is not skipped
        row = str(i + 1)                       # Excel rows are 1-based
        worksheet.write('A'+row, need_list[i]["标题"])
        worksheet.write('B'+row, need_list[i]["名字"])
        worksheet.write('C'+row, need_list[i]["阅读量"])
        worksheet.write('D'+row, need_list[i]["评论数"])
        print("yes")                           # simple progress indicator
    workbook.close()
def main():
    html_list=get_html(url)
    get(html_list)
    write_xlsx(need_list)


main()

Original post: blog.csdn.net/xinzhilinger/article/details/102808484