Python crawler: crawling and analyzing data from Acta Automatica Sinica


Foreword

This article documents the process and code for crawling data from Acta Automatica Sinica using Python's urllib, bs4, and re (regular expression) libraries, performing word cloud analysis on the crawled data, and drawing the corresponding plots with matplotlib.
Note: the code is provided in two parts, a web crawler and a data analysis part. The crawler collects two kinds of data, the authors and the keywords of the papers; once the required libraries are installed it can be run directly and saves the data to Excel. The data analysis part only provides code for the word cloud analysis.
This article shows the basic use of a web crawler and of word cloud analysis; if you have any questions, feel free to discuss them.


1. Code

Code 1: the web crawler only. It crawls the authors of the papers published in Acta Automatica Sinica and saves them to Excel.
Note: the code has been verified and can be run directly. If it does not work, the cause is most likely a network issue or a problem with the request headers, since a crawler cannot run without network access (most ordinary code runs fine offline, but a crawler needs the network).

import urllib.request, urllib.error
from bs4 import BeautifulSoup
import re
import pandas as pd

findlink2 = re.compile(r'authorNameCn.,.\t?(.*?)..;.>')  # regular expression for the paper authors

def askURL(url, a):    # fetch the page source code
    if a == 1:
        head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
    else:
        head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}
    request = urllib.request.Request(url, headers=head)   # head: browser (User-Agent) information sent with the request
    html = ""
    try:
        response = urllib.request.urlopen(request)   # open the URL
        html = response.read().decode("utf-8")       # decode the raw bytes as UTF-8
        # print(html)
    except urllib.error.URLError as e:
        print("error:", e)
    return html     # return the page source code

def main():   # main function
    base_url2 = 'http://www.aas.net.cn/article/latest_all'    # listing page of Acta Automatica Sinica
    # fetch the page source code
    html = askURL(base_url2, 1)
    # print(html)   # inspect the crawled source code
    soup = BeautifulSoup(html, "html.parser")   # parse the source code into a BeautifulSoup object
    # print(soup)    # inspect the parsed source code
    temp1 = 0
    all_auther = []   # authors of each paper
    ################################## parse the data: extract the paper authors
    for item in soup.find_all('div', class_='article-list-author'):
        item = str(item)
        link2 = re.findall(findlink2, item)
        link2 = ','.join(link2)  # turn the list of authors into a comma-separated string
        all_auther.append(link2)
        temp1 += 1
        if temp1 >= 200:
            break
    print(all_auther)
    # save the data (.xlsx files are Unicode, so no encoding argument is needed)
    pd.DataFrame(data=all_auther).to_excel('论文作者《自动化学报》.xlsx')

if __name__ == "__main__":
    main()

Code 2: it contains both parts, the web crawler and the word cloud analysis. It crawls the keywords of the most recent papers of Acta Automatica Sinica (the first 100 article pages in the code below) and performs a word cloud analysis on the keywords.
Note: the word cloud analysis needs a mask image and a font file, so you need to download these two files before the code can be run directly (a programmatic fallback for the mask is sketched after the code).

Baidu Netdisk link (contains the mask image and the font file):
https://pan.baidu.com/s/1kJo1DQW0eis-G3MwYk5RHw
Extraction code: x9jq

import urllib.request, urllib.error
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pylab as plb


findlink3 = re.compile(r'keywordCn.,.(.*?)..">')    # regular expression for the paper keywords
findlink4 = re.compile(r'href="(.*?)"')             # regular expression for the URL of each paper

def askURL(url, a):    # fetch the page source code
    if a == 1:
        head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
    else:
        head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}
    request = urllib.request.Request(url, headers=head)   # head: browser (User-Agent) information sent with the request
    html = ""
    try:
        response = urllib.request.urlopen(request)   # open the URL
        html = response.read().decode("utf-8")       # decode the raw bytes as UTF-8
        # print(html)
    except urllib.error.URLError as e:
        print("error:", e)
    return html     # return the page source code

# helper needed by the word cloud analysis: binarize the mask image
def transform_format(val):
    # pixel transformation: white pixels become 255 (masked out), everything else 0
    if val >= 1:
        return 255
    else:
        return 0

def main():   # main function
    base_url2 = 'http://www.aas.net.cn/article/latest_all'    # listing page of Acta Automatica Sinica
    # fetch the page source code
    html = askURL(base_url2, 1)
    soup = BeautifulSoup(html, "html.parser")   # parse the source code into a BeautifulSoup object
    # print(soup)    # inspect the parsed source code
    temp3 = 0          # counter for the article pages collected
    data_web = []      # URL of each paper
    data_keyword = []  # keywords of the papers
    ##################################################################################### crawl the keywords
    ## (1) first pass: collect the URL of each paper
    for item in soup.find_all('div', class_='article-list-title'):
        item = str(item)
        # print(item)
        link4 = re.findall(findlink4, item)
        data_web.append(link4[0])  # store the URL string, not the whole list
        # print(link4)
        temp3 += 1
        if temp3 >= 100:  # number of article pages to collect
            break
    # print(data_web)   # the URLs are stored in data_web
    ## (2) second pass: crawl the keywords of each paper
    for web in data_web:
        html = askURL(web, 1)
        # print(html)
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('ul', class_='article-keyword'):
            item = str(item)
            # print(item)
            link5 = re.findall(findlink3, item)
            print(link5)
            for j in link5:  # collect all keywords in a single list
                if j != '':
                    data_keyword.append(j)  # each element is one keyword
            break  # only the first <ul> is needed; the second one holds the English keywords
    # print(data_keyword)
    ###################################################################################### crawler finished
    ############################################################ word cloud analysis
    ## count the frequency of each keyword
    wordDict = {}  # dictionary: keyword -> frequency
    wordSet = set(data_keyword)  # the set of distinct keywords
    for w in wordSet:
        if len(w) > 1:
            wordDict[w] = data_keyword.count(w)  # frequency of the keyword
    ## sort by frequency, descending
    wordList = list(wordDict.items())
    wordList.sort(key=lambda x: x[1], reverse=True)
    ## save the frequencies to an Excel file
    pd.DataFrame(data=wordList).to_excel('词频.xlsx')

    ## take the 40 most frequent keywords and build the word cloud
    df = pd.read_excel('词频.xlsx', index_col=None)
    former_five = df.values[0:40, 1]   # column 1 holds the keyword (column 0 is the saved index)
    s = " ".join(i for i in former_five)
    x = np.array(plb.imread('big_thinker.png'))   # mask image
    x = x[:, :, 1]                                # keep one color channel
    transformed_x = np.array([list(map(transform_format, x[i])) for i in range(len(x))])
    wordcloud = WordCloud(font_path='STKAITI.TTF', background_color='white',
                          mask=transformed_x, contour_color='firebrick', contour_width=30).generate(s)
    wordcloud.to_file(r'词云7.jpg')
    ################################################################################################ word cloud analysis finished

if __name__ == "__main__":
    main()
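
If you cannot download the mask image, the sketch below shows one way to generate a simple substitute mask. It is only an illustration, assuming numpy and Pillow are installed; the file name circle_mask.png is made up, and you would pass it to imread in place of big_thinker.png. A Chinese-capable font file is still required for Chinese keywords.

import numpy as np
from PIL import Image

# Minimal sketch (not part of the original code): draw a black circle on a white
# background and save it as an RGB PNG. Used as a mask, words are then drawn only
# inside the circle, because transform_format() above masks out white pixels.
size = 600                                    # width/height of the mask in pixels
yy, xx = np.ogrid[:size, :size]
center = size // 2
inside = (xx - center) ** 2 + (yy - center) ** 2 <= (size // 2 - 10) ** 2
gray = np.where(inside, 0, 255).astype(np.uint8)   # 0 = black circle, 255 = white background
rgb = np.stack([gray, gray, gray], axis=-1)        # 3 channels so x[:, :, 1] in the code above works
Image.fromarray(rgb).save('circle_mask.png')       # hypothetical file name; use it in place of big_thinker.png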

2. Results display

Word cloud generated from the crawled keywords:
[word cloud image omitted]

3. Implementation of the crawler

1. Preparation

Install the libraries the crawler needs. urllib and re are part of Python's standard library; the third-party packages are bs4 (beautifulsoup4), pandas (plus openpyxl for writing .xlsx files), numpy, wordcloud and matplotlib.
The crawled website is Acta Automatica Sinica; the URL is:
http://www.aas.net.cn/article/latest_all
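
A quick way to check that the third-party packages are available before running the crawler is a small sketch like the one below; the list of module names is an assumption based on the imports used in the code above.

# Minimal sketch: report which of the modules used in this article are importable.
# Install any missing ones with pip (the PyPI package for bs4 is beautifulsoup4).
for module in ("bs4", "pandas", "openpyxl", "numpy", "wordcloud", "matplotlib"):
    try:
        __import__(module)
        print(module, "is available")
    except ImportError:
        print(module, "is missing")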

2. Obtain the source code of the webpage

The first task is to obtain the source code of the web page. The main steps are: call the urllib library to request the page, decode the response as UTF-8, and convert the source code into a bs4 object.
Getting the source code: urllib's Request and urlopen functions imitate a browser asking the site for the page; the returned source code is stored in a variable.
Note that when requesting the page we set the head variable, i.e. the User-Agent request header, which imitates a browser by sending browser information along with the request so that the site returns the page's source code. Two points are worth emphasizing. First, this listing page is static, and a static page can usually be fetched without such a header (it is dynamic pages that tend to require extra request information); it is kept here anyway so the request looks like an ordinary browser visit. Second, everyone's User-Agent string is different; to find your own, open the browser's developer tools and look at the request headers of any network request.
Decoding as UTF-8: the response body is raw bytes, which look garbled until decoded, so it is decoded into a UTF-8 string.

def askURL(url, a):    # fetch the page source code
    if a == 1:
        head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
    else:
        head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}
    request = urllib.request.Request(url, headers=head)   # head: browser (User-Agent) information sent with the request
    html = ""
    try:
        response = urllib.request.urlopen(request)   # open the URL
        html = response.read().decode("utf-8")       # decode the raw bytes as UTF-8
        # print(html)
    except urllib.error.URLError as e:
        print("error:", e)
    return html     # return the page source code

Converting the source code to a bs4 type: in a later step we use the bs4 library to parse the data, so the source string has to be converted to the data type the bs4 library works on. After the step below, the type of the page source changes from a plain string to a BeautifulSoup object, which the bs4 library's functions can then process.

    html = askURL(base_url2, 1)
    # print(html)   # inspect the crawled source code
    soup = BeautifulSoup(html, "html.parser")   # parse the source code into a BeautifulSoup object

3. Parse the data

Parsing the data means extracting the information we need from the source code we just obtained, using the bs4 library and regular expressions.
First call the bs4 library to narrow down the text that the regular expression has to search. Using the browser's developer tools, locate the block that holds the paper's authors; in the HTML it corresponds to a div element with the class article-list-author.
Call the find_all function of the bs4 library to extract these blocks.

for item in soup.find_all('div', class_='article-list-author'):

Then write the corresponding regular expression to extract the information precisely, i.e. to do the final filtering. The (.*?) group in the regular expression (findlink2, defined at the top of the code) marks the information to be extracted.

link2 = re.findall(findlink2, item)

Call the re library (regular expressions) to extract the precise information. Note that the bs4 library operates on bs4 data types, while the re library operates on str, Python's built-in string type; this is why each item is converted with str(item) before re.findall is applied.
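
To see how find_all and the regular expression work together, here is a small self-contained sketch. The HTML fragment is invented purely for illustration and only imitates the kind of markup the pattern targets; the real page may differ.

from bs4 import BeautifulSoup
import re

findlink2 = re.compile(r'authorNameCn.,.\t?(.*?)..;.>')   # same author pattern as in Code 1

# Invented fragment for illustration only.
html = """
<div class="article-list-author">
  <a onclick="showAuthor('authorNameCn','Zhang San');">Zhang San</a>
  <a onclick="showAuthor('authorNameCn','Li Si');">Li Si</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all('div', class_='article-list-author'):
    item = str(item)                         # bs4 Tag -> plain string for the re library
    authors = re.findall(findlink2, item)    # ['Zhang San', 'Li Si']
    print(','.join(authors))                 # Zhang San,Li Si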

4. Save data

Saving the data: the last step of the crawler is to store the data. Here it is saved to Excel, but it could equally well be saved to CSV, TXT, and so on (a CSV sketch is given after the snippet below).
Call pandas' to_excel function to write the author list collected above into an Excel file.

pd.DataFrame(data=all_auther).to_excel('论文作者《自动化学报》.xlsx')
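
For the CSV alternative mentioned above, a minimal sketch looks like this; the file name authors.csv and the sample list are just examples standing in for the crawled data.

import pandas as pd

all_auther = ['Zhang San,Li Si', 'Wang Wu']   # sample data standing in for the crawled author strings
# utf-8-sig adds a BOM so Excel displays the Chinese text correctly when the file is double-clicked
pd.DataFrame(data=all_auther, columns=['authors']).to_csv('authors.csv', index=False, encoding='utf-8-sig')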


Original post: blog.csdn.net/ychpython/article/details/122479760