Retrieving papers from CNKI by author and institution and summarizing them in Excel (2021.6.9)

1. Searching CNKI by author and author affiliation

insert image description here

        On the CNKI home page, click Advanced Search to the right of the search box to open the advanced search page, then select Author Publication Search. There are only two fields, Author and Author Affiliation. Enter the author's name and the author's affiliation in the text boxes to the right of the two fields, then click the Search button below them to get the search results.

insert image description here
insert image description here
insert image description here

1.1 Retrieval example (29 results)

        As an example, let's search for the papers published by Professor Zhong Yanfei of Wuhan University. The search returns 29 Chinese papers. With 20 papers displayed per page, they are split across 2 pages.
insert image description here

1.1.1 2 pages, 20 search results per page

        The following two pictures are the search results displayed on page 1 (1-20) and page 2 (21-29) respectively.
insert image description here
insert image description here

1.1.2 1 page, up to 50 results per page

insert image description here

        Of course, as shown in the figure above, the number of results displayed on each page can be set. At most, 50 results can be displayed on each page, so 29 results will only be displayed on one page, as shown in the figure below.
        Page 1: 1~29
insert image description here
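
        The page counts above follow from simple ceiling division. A minimal sketch (the function name page_ranges is only illustrative and not part of CNKI) that reproduces the two cases, 20 and 50 results per page:

```python
import math

def page_ranges(total, per_page):
    """Return the (first, last) result numbers shown on each page."""
    pages = math.ceil(total / per_page)
    return [(p * per_page + 1, min((p + 1) * per_page, total))
            for p in range(pages)]

print(page_ranges(29, 20))  # [(1, 20), (21, 29)]
print(page_ranges(29, 50))  # [(1, 29)]
```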

2. Summarize the search results to Excel

        During study and research, you often need to read many papers by a particular expert or scholar for inspiration and understanding. The first step is to retrieve all of that scholar's papers. It is best to collect them in your own Excel sheet so you can mark them up and then read them selectively, which may be more efficient (each new search costs a certain amount of time and energy).

2.1 Manual copy and paste (feasible for few search results and few pages)

        In this era of the Internet of Everything and rapidly developing information technology, copying and pasting by hand is cumbersome, but it is fine to use until you find a better way. The steps below summarize the results on page 1 into Excel. If there are several pages, the procedure for each page is similar.
insert image description here
        On this page in the browser, press Ctrl+A to select all page elements, then press Ctrl+C to copy the page text.
insert image description here
        Then create a new txt file in the folder and press Ctrl+V to paste the text into it. The content of the file is shown in the figure below.
insert image description here
        The text in the dotted box is the target content, so delete everything before and after it in the txt file. The result after deletion is shown in the figure below.
insert image description here
        Next, use the editor's replace function to replace the whitespace marked in the figure below with an English (half-width) comma. The result after the replacement is shown below.
insert image description here
insert image description here
        At this point, select and copy the comma before "download", the word "download" itself, and the line break after it, and replace that text with nothing.
insert image description here
insert image description here
        Finally, save the new text document as a .txt file with ANSI encoding, change the extension to .csv, and open it in Excel to check the result. Add the column names, starting with the number column, as the first line.
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
insert image description here
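
        The manual clean-up in this section amounts to turning the pasted text into comma-separated lines with a header row. A minimal sketch of the same idea in Python, assuming the pasted columns are separated by tab characters (this is an assumption; check what the whitespace in your own txt file actually is). The function name tsv_to_csv, the sample rows and the English column names are all hypothetical.

```python
import csv
import io

def tsv_to_csv(text, header):
    """Convert tab-separated lines to CSV text, prepending a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines left over from the copy
        writer.writerow(cell.strip() for cell in line.split("\t"))
    return out.getvalue()

# Hypothetical sample standing in for the cleaned-up txt content
sample = (
    "1\tTitle A\tAuthor X\tJournal J\t2020-01-01\n"
    "2\tTitle B\tAuthor Y\tJournal K\t2021-02-02\n"
)
result = tsv_to_csv(sample, ["No.", "Title", "Author", "Source", "Date"])
print(result)
```

        Using csv.writer rather than a plain text replacement means a field that itself contains a comma is quoted instead of being split into two columns.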

2.2 Parsing the HTML page of search results with Python (more practical)

2.2.1 Get the HTML page code corresponding to a page

insert image description here
        First, open CNKI's Author Publication Search page, enter the author's name in the Author box and the affiliation in the Author Affiliation box, then press F12 to open the developer tools, select Network at the top and click XHR in the middle. Now click the Search button. When the search is triggered, the browser sends a GetGridTableHtml request via POST in the background, and the search results then appear at the bottom of the page. The HTML code for the 29 search results is in the Response tab, and a rendered preview is available in the Preview tab.
        Then open Response, where you can see the HTML code corresponding to the page on the left. Click inside the HTML code under Response, press Ctrl+A to select all, and then press Ctrl+C to copy it.
insert image description here

2.2.2 HTML page code formatting

insert image description here
        Using an online code formatting tool, paste the HTML copied to the clipboard into the text box under "HTML to be formatted" and click Format. When formatting is complete, click Copy Formatted Code so that the result can be pasted into a new Notepad txt file.
insert image description here
        Create a new Notepad file 1.txt in the folder (note: save it with ANSI encoding) and paste the copied formatted code into it, as shown in the figure below.
insert image description here
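
        The encoding chosen when saving matters because the page contains Chinese text and the file is later read by a script. A minimal sketch of reading such a file with an explicit encoding, assuming that "ANSI" in Notepad on a Simplified Chinese Windows system corresponds to Python's gbk codec (on other systems the ANSI code page differs). The helper name read_lines and the sample content are hypothetical.

```python
import os
import tempfile

# Assumption: Notepad's "ANSI" on Simplified Chinese Windows is code page 936 (GBK)
ANSI_ENCODING = "gbk"

def read_lines(path, encoding=ANSI_ENCODING):
    """Read a text file with an explicit encoding, dropping trailing newlines."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]

# Demonstration on a temporary file standing in for 1.txt
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1.txt")
    with open(path, "w", encoding=ANSI_ENCODING) as f:
        f.write('<td class="seq">1</td>\n<td>题名</td>\n')
    lines = read_lines(path)

print(lines)
```

        Passing the encoding explicitly avoids depending on the system default, which is what open() falls back to when no encoding is given.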

2.2.3 Python parses each formatted HTML page to obtain retrieval results

        Python parsing code: ParseHTMLCNKI.py

# CSV header: No., title, authors, source, publication date, database, citations, downloads
header = '序号,题名,作者,来源,发表时间,数据库,被引次数,下载次数'
print(header)
f = open('D:\\搜狗高速下载\\CNKIGet\\1.txt', 'r')  # formatted HTML, saved as ANSI
wf = open('D:\\搜狗高速下载\\CNKIGet\\1_parseCNKIHtml.csv', 'w')
wf.write(header + '\n')
line = f.readline()
while line:
    # A record starts at its sequence-number cell; the next seven lines hold the other fields
    if line.find('<td class="seq">') >= 0:
        sequence = line.strip('\n')  # strip the trailing newline
        sequence = sequence[sequence.find('filenameClick()" />') + 19:sequence.find('</td>')]

        line = f.readline()  # title
        name = line.strip('\n')
        name = name[name.find('"_blank">') + 9:name.find('</a>')]

        line = f.readline()  # author
        author = line.strip('\n')
        author = author[author.find('"Mark">') + 7:author.find('</font>')]

        line = f.readline()  # source (journal)
        source = line.strip('\n')
        source = source[source.find('BaseID=') + 13:source.find('</a>')]

        line = f.readline()  # publication date
        publishdate = line.strip('\n')
        publishdate = publishdate[publishdate.find('"date">') + 7:publishdate.find('</td>')]

        line = f.readline()  # database
        db = line.strip('\n')
        db = db[db.find('"data">') + 7:db.find('</td>')]

        line = f.readline()  # citation count: a link when present, otherwise a plain cell
        citied = line.strip('\n')
        if citied.find('"_blank">') >= 0:
            citied = citied[citied.find('"_blank">') + 9:citied.find('</a>')]
        else:
            citied = citied[citied.find('"quote">') + 8:citied.find('</td> ')]

        line = f.readline()  # download count
        download = line.strip('\n')
        download = download[download.find('void(0);"') + 10:download.find('</a>')]
        record = sequence+','+name+','+author+','+source+','+publishdate+','+db+','+citied+','+download
        print(record)
        wf.write(record + '\n')
    line = f.readline()
f.close()
wf.close()

insert image description here
insert image description here

        With Python installed on the computer, open a Python IDE (PyCharm is used here), create a new project, set the path of the Python interpreter, copy the ParseHTMLCNKI.py code above into the project and run it. The parsed literature records are printed to the console and also saved to the file 1_parseCNKIHtml.csv. The results are as follows.
insert image description here
        Open 1_parseCNKIHtml.csv in Notepad to view the results
insert image description here
        Open 1_parseCNKIHtml.csv in Excel to view the results
insert image description here
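
        The script in this section relies on each table cell sitting on its own line after formatting, and it joins fields with plain commas, so a title that itself contains a half-width comma would shift the columns. As a hedged alternative sketch (not the method used above), the standard library's html.parser can collect the cells of each table row regardless of line breaks, and csv.writer can quote fields where needed. The class name RowCollector and the sample HTML are hypothetical. The class names seq and date are taken from the script above, while the overall row layout is an assumption.

```python
import csv
import io
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the visible text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text pieces of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the pieces and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

# Hypothetical sample; real CNKI markup has more columns and attributes
sample = """
<table>
<tr><td class="seq">1</td><td><a target="_blank">A title, with a comma</a></td><td class="date">2021-01-01</td></tr>
<tr><td class="seq">2</td><td><a target="_blank">Another title</a></td><td class="date">2020-05-06</td></tr>
</table>
"""

parser = RowCollector()
parser.feed(sample)
parser.close()
rows = parser.rows

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # quotes any field that contains a comma
csv_text = buf.getvalue()
print(csv_text)
```

        Because this approach does not depend on line positions, the online formatting step would not be needed for it, although the real page should be checked before relying on the assumed row layout.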

Origin blog.csdn.net/jing_zhong/article/details/114823640