[Crawler Example] The general approach to writing Python crawlers, illustrated with Bilibili (station B) and a paper website (for someone)

Problem background

I hadn't written a crawler in a long time. Two days ago a friend came to me and asked whether Python could crawl papers, and whether I could explain the basic working principle of a crawler and show her what it looks like when it runs.

Paper scraping: one of the most practical scenarios for a crawler, isn't it?


So I tried to demonstrate it live, but I had talked a bigger game than I could back up: out of practice and misremembering details, I produced fragmented code that would not run, which was rather embarrassing for a while. Today I took half a day to go over it again, checked some references, and worked the crawler out properly. I'm recording it here.

All right, enough chatter; let's get to the point.

Task analysis

Requirements:
Given a keyword, first use Bilibili as an example to demonstrate the general crawling approach by scraping its popular list; then crawl the matching paper records collected by a certain paper website, including title, authors, source, abstract, publication date, citation count and so on, and organize them into an Excel sheet.

Notes:

  • Crawling static web pages is generally straightforward: one request plus BeautifulSoup is usually enough, and you rarely need to dig through the Network tab (a short sketch of this static case follows this list). Most pages today, however, are dynamic and often obfuscated, so the request URLs have to be worked out. Unless you use Selenium, the "whatever you can see you can crawl" library, finding the real URL that serves the target information is an essential part of the job.

  • We start with crawling the Bilibili popular-list information as an introduction, beginning with a relatively simple URL-parsing workflow, and then move on to crawling papers from a certain paper website, a request that carries parameters. That part involves dynamic pages, URL obfuscation, and finding and modifying request parameters.

  • Both examples require digging through the Network tab; parsing the rendered HTML alone is not enough and will only return empty [ ] or null values.
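
For reference, here is a minimal sketch of that static case; it is only a sketch, assuming nothing more than a plain public page (example.com is just a stand-in URL):

import requests
from bs4 import BeautifulSoup

res = requests.get('https://example.com', timeout=30)   # one GET request is enough for a static page
soup = BeautifulSoup(res.text, 'html.parser')           # parse the returned HTML directly
print(soup.title.text)                                   # the page <title>
for a in soup.find_all('a'):                             # every link on the page
    print(a.get_text(strip=True), a.get('href'))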

Steps for crawling the Bilibili popular list

  1. Idea: [Get Data] -> [Analyze Data] -> [Extract Data] -> [Store Data]
  • Open the Bilibili popular page, press F12 to open the developer tools, click Network, press Ctrl+R to refresh, and check Fetch/XHR; the following page appears:
    (screenshot: the DevTools Network panel after refreshing)
  • Then click the files under Name one by one and look for the one containing the title, uploader, view count and other information we want. We finally locate popular?ps=20&pn=1, the file hiding our data, and [Get Data] is done, as shown below:
    (screenshot: the popular?ps=20&pn=1 response containing the target data)
  • It is clear that this is a JSON response, so we parse it accordingly, that is, convert it into a dictionary and then peel off the information we want layer by layer, like an onion.

Then we click Headers; the red area in the screenshot below is the URL of this JSON file:
(screenshot: the request URL highlighted in red under Headers)

  • Finally, organize the extracted data into an Excel sheet to complete [Store Data].
  • Mapping the idea to code: use the requests library to fetch the URL, convert the response into a dictionary with .json(), extract the information layer by layer, and finally write it into an Excel file for storage.
  2. Practice: [write the code]
    The complete code first, then a line-by-line analysis:
import requests
import pandas as pd

url = 'https://api.bilibili.com/x/web-interface/popular?ps=20&pn=1'  # the URL where the data hides
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48"}   # add a request header to avoid getting banned
res = requests.get(url, headers=headers)  # send the request and fetch the response
res_json = res.json()   # parse the JSON response into a dictionary
target = res_json['data']['list']   # pull out the video list with dictionary indexing
titles = []   # empty list for title
owners = []   # empty list for owner (the uploader)
descs = []    # empty list for desc (the description)
views = []    # empty list for view (the view count)

for i in target:
    # pull each video's fields out and append them to the corresponding lists
    titles.append(i['title'])
    owners.append(i['owner']['name'])
    descs.append(i['desc'])
    views.append(i['stat']['view'])
    print([i['title'], i['owner']['name'], i['desc'], i['stat']['view']])  # print to check the output

dic = {"标题": titles, "up主": owners, "简介": descs, "播放量": views}  # build a dict of columns
df = pd.DataFrame(dic)    # convert the dict to a DataFrame so it can be written to Excel
df.to_excel('./Bilibili_2.xlsx', sheet_name="Sheet1", index=False)  # save to this path

Code analysis:

  • The first two lines import the libraries we need;
import requests
import pandas as pd
  • This part is mainly the JSON handling, because the dictionary structure hiding the target information looks like this ↓↓↓
    (screenshot: the JSON structure of the response)
    So res_json['data']['list'] first pulls out the full record of every video, and further indexing then extracts each field;
url = 'https://api.bilibili.com/x/web-interface/popular?ps=20&pn=1'  # the URL where the data hides
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48"}   # add a request header to avoid getting banned
res = requests.get(url, headers=headers)  # send the request and fetch the response
res_json = res.json()   # parse the JSON response into a dictionary
target = res_json['data']['list']   # pull out the video list with dictionary indexing
titles = []   # empty list for title
owners = []   # empty list for owner (the uploader)
descs = []    # empty list for desc (the description)
views = []    # empty list for view (the view count)

for i in target:
    # pull each video's fields out and append them to the corresponding lists
    titles.append(i['title'])
    owners.append(i['owner']['name'])
    descs.append(i['desc'])
    views.append(i['stat']['view'])
    print([i['title'], i['owner']['name'], i['desc'], i['stat']['view']])  # print to check the output
  • The next block converts the extracted data into a dictionary and then into a DataFrame, which makes writing with pandas.to_excel convenient;
dic = {"标题": titles, "up主": owners, "简介": descs, "播放量": views}  # build a dict of columns
df = pd.DataFrame(dic)    # convert the dict to a DataFrame so it can be written to Excel
df.to_excel('./Bilibili_2.xlsx', sheet_name="Sheet1", index=False)  # save to this path
  • The final result is as follows:
    (screenshot: the resulting Bilibili_2.xlsx)
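
As a small extension, the request URL above carries two query parameters, ps and pn, which appear to be the page size and the page number. Below is a hedged sketch of fetching a few more pages by varying pn, reusing the same header and parsing logic as the full script above; the parameter meaning is an assumption, not something documented by the site:

import requests

base = 'https://api.bilibili.com/x/web-interface/popular'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"}
rows = []
for pn in range(1, 4):   # assumed: pn is the page number, ps the page size
    res = requests.get(base, params={"ps": 20, "pn": pn}, headers=headers, timeout=30)
    for item in res.json()['data']['list']:
        rows.append([item['title'], item['owner']['name'], item['stat']['view']])
print(len(rows))         # roughly 20 rows per page if the assumption holds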

Crawling a paper website

The anti-crawling measures of this website are fairly strong. I read articles by many experienced people, thought about it for a long time, and spent a whole evening writing before I figured out how it works.

  1. Idea: [Get Data] -> [Analyze Data] -> [Extract Data] -> [Store Data]
  • Open the website the same way. After searching for a keyword (here I use 碳中和, carbon neutrality, as the example), press F12 to open the developer tools, click Network, press Ctrl+R to refresh, and check Fetch/XHR; the following page appears:
    (screenshot: the DevTools Network panel after searching)
  • Clicking through the entries one by one, we find the information hidden in the GetGridTableHtml request. Opening it shows that the response is HTML rather than JSON, so no json parsing is needed this time; ordinary BeautifulSoup should be enough. Just in case, we "disguise" ourselves properly this time and send the request with the full request headers and cookies.
    (screenshot: the GetGridTableHtml entry in the Network panel)
    (The source HTML of this response can also be viewed under the Response tab.)
    (screenshot: the response HTML under the Response tab)
    Without further ado, let's start!
  2. Practice: [write the code]
    The complete code first, then a line-by-line analysis:
import requests
from bs4 import BeautifulSoup

url = 'https://kns.cnki.net/KNS8/Brief/GetGridTableHtml'
head = '''
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Connection: keep-alive
Content-Length: 5134
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: Ecp_ClientId=b230309162401492276; Ecp_loginuserbk=gz0332; knsLeftGroupSelectItem=1%3B2%3B; Ecp_ClientIp=219.223.233.83; cnkiUserKey=3c1cc8af-46ae-8fb2-d5f7-1014b5e3d034; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22186c579a2ef11ea-0e8fd66c04c69e8-74525476-1327104-186c579a2f01680%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%2C%22%24device_id%22%3A%22186c579a2ef11ea-0e8fd66c04c69e8-74525476-1327104-186c579a2f01680%22%7D; Hm_lvt_6e967eb120601ea41b9d312166416aa6=1678350324,1680238706; Ecp_session=1; SID_kns_new=kns15128006; ASP.NET_SessionId=4dobe0vxy4f4m2fnmchugzbg; SID_kns8=25124104; CurrSortFieldType=desc; LID=WEEvREcwSlJHSldSdmVqMDh6c3VFeXBleUNRb1BiK00rb0k3YXk5YllBQT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!; Ecp_LoginStuts={"IsAutoLogin":false,"UserName":"gz0332","ShowName":"%E6%B7%B1%E5%9C%B3%E5%A4%A7%E5%AD%A6%E5%9F%8E%E5%9B%BE%E4%B9%A6%E9%A6%86","UserType":"bk","BUserName":"","BShowName":"","BUserType":"","r":"77fOa5","Members":[]}; dsorder=pubdate; CurrSortField=%e5%8f%91%e8%a1%a8%e6%97%b6%e9%97%b4%2f(%e5%8f%91%e8%a1%a8%e6%97%b6%e9%97%b4%2c%27time%27); _pk_ref=%5B%22%22%2C%22%22%2C1681733499%2C%22https%3A%2F%2Fcn.bing.com%2F%22%5D; _pk_ses=*; _pk_id=b73a87af-2c79-4c0e-866e-1b6b2c1787d5.1678350286.7.1681733507.1681733499.; c_m_LinID=LinID=WEEvREcwSlJHSldSdmVqMDh6c3VFeXBleUNRb1BiK00rb0k3YXk5YllBQT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&ot=04%2F17%2F2023%2020%3A16%3A32; c_m_expire=2023-04-17%2020%3A16%3A32; dblang=ch
Host: kns.cnki.net
Origin: https://kns.cnki.net
Referer: https://kns.cnki.net/kns8/defaultresult/index
sec-ch-ua: "Chromium";v="112", "Microsoft Edge";v="112", "Not:A-Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48
X-Requested-With: XMLHttpRequest
'''
headers = dict([[y.strip() for y in x.strip().split(':', 1)] for x in head.strip().split('\n') if x.strip()])

# I set the IsSearch parameter in the form below directly to false; see the comments further down for the reasoning
head = '''
IsSearch: false
QueryJson: {"Platform":"","DBCode":"CFLS","KuaKuCode":"CJFQ,CDMD,CIPD,CCND,CISD,SNAD,BDZK,CCJD,CCVD,CJFN","QNode":{"QGroup":[{"Key":"Subject","Title":"","Logic":1,"Items":[{"Title":"主题","Name":"SU","Value":"碳中和","Operate":"%=","BlurType":""}],"ChildItems":[]}]},"CodeLang":"ch"}
SearchSql: $s
PageName: defaultresult
DBCode: CFLS
KuaKuCodes: CJFQ,CDMD,CIPD,CCND,CISD,SNAD,BDZK,CCJD,CCVD,CJFN
CurPage: $page
RecordsCntPerPage: 20
CurDisplayMode: listmode
CurrSortField: PT
CurrSortFieldType: desc
IsSentenceSearch: false
Subject: 
'''
s = '0645419CC2F0B23BC604FFC82ADF67C6E920108EDAD48468E8156BA693E89F481391D6F5096D7FFF3585B29E8209A884EFDF8EF1B43B4C7232E120D4832CCC8979F171B4C268EE675FFB969E7C6AF23B4B63CE6436EE93F3973DCB2E4950C92CBCE188BEB6A4E9E17C3978AE8787ED6BB56445D70910E6E32D9A03F3928F9AD8AADE2A90A8F00E2B29BD6E5A0BE025E88D8E778EC97D42EF1CF47C35B8A9D5473493D11B406E77A4FF28F5B34B8028FE85F57606D7A3FED75B27901EEF587583EBD4B63AC0E07735BE77F216B50090DEE5ABB766456B996D37EB8BDACA3A67E8126F111CF9D15B351A094210DB6B4638A21065F03B6F0B73BB4625BBECE66F8197909739D8FB4EB756DEF71864177DFA3CB468CFA6E8ABF7924234DED6B0DFD49D9269CBA4A2BF4075D517A61D094225D70C1B4C137DB9614758A5E097376F5F3E55A7063A4B7E437436D13FF3CC8FB435E131FFCD16FC30DD997098B4FC997D995E767E2712175BC05B960D3FEB5CAF12A13BD1CE3530AD72FC4DB93206996E216BC5DC294960A0CA05E986848E1E64FFC5A52BFCB41A97840A708E397F11EFF261E08F3A34094061AE8E8F819AF6A17A9E2176C3893C6DD3E3C06864C91989BDEF9790A38FAF2524B17743B30EBA4ADD550BF985F9C3097A608C697283CE37F8CB78BDC9EAA4874C3485E6F931B016EC41BFBC0EF91B2AD7E1B424E1DFB8FC8771DEA2458C5A7A4C9BF0192C101FD8EDDEE1BACB44C3E478361EF0D1B70FAD56BCF6870A6044D3A226611B9C1A43C6F9F7C021C98E0D5F778D72C87183F026071A730B8BB4FABF9F68FEC783AB1E6E79218B5D87FD1BB541817FB4F3C21DC849A803CB8A620A2EE00475BAF2CE6556638B7A949B446F39A1076DA15764A777BA6239447CB91F4CF513325366E167D268DDB75F288B5C13415CE62F5C431181C044A28CA502FF14439E5C6F63D419CB6DE1360DB01593FE765459299E442EE24917C199AB5178F38461F8C4EBBC95344C5F2AB60F379813A87E2E3AFE3021198B8222CEAB870D9A353786079961184D63977917C7DF8FE6AFBBC795A832BDD454D6E3CD22C3FF7A58808923DD6F464C12A9A88FBFD0C71458AB0E4C1D566315181A9578ECE93670E5CAF13CF2553F68E64726C131F4A48B42A9E7F09EFEBA51D1FDA6BA0ECA0B02B951ECC04548F1D4D08DB69D0EFDCE6793537BB8E59DC442631A9CDBA13878D7493AADA0CD868C1C1C3A6A6FA17C4109205A83F9C0C43E0D2551D0A8592EA99D20D4B78B4EEEE2D53A543701F620C7D6FF46E800B0CEF9B3D23ACD62C7CBEA25FC8BD74D5A0E5C86B9CF3FCACBDBE585AFF85F9689CBF5BCBD267C580361D5B93AA9BD5A1BB6122BB87C04AE227211FE675A4650814F2285261E5641683D65E0454E2597F6025BB4AA1A044D7B97F57394EA5EC878B80FC0A82F12E2D3D9E1BCB062A7ABB290F02116BDED95761A67CE2FDDE42BDDDF34F22E49A0406D724FEBC86B93F80BB52A8D34B8D2B24288ECBC3F90CCB1EF36085E77F1E2AE0AB411FE60A033E704A21469EA5CE4BB8AF6B1C1F1C5F1F084472D57F458CC39F0B2FE583D0795159E9E38BD1102F5D96DB0F828B66F41A702BB0AE59E40CF53BE7F6342EF208434CFFABF845AFD771B288D484BB79952159E6EA27658A6B6230557AF16E86C4AFDF973DBD5A3A2B979AD9037441409D22A954DC50CBCEA8EA5AC500C4BC8282DCE2626BD2B2CB4B1E33B2E1F92533F7F04C48D061907DBCE3E21FF0A77F09C1AE33E769962CD1EDE6B688590D569409EE9EEC4DF1074DCC97C43B0EDAF1C38B5B2784ABC803D9B3B4FC35F46CB1E275E7F83036FC6AFF2E624D4D2E6AE1C2D4CE3FF219FA90A935957E0DE1A386E4AAB5C9F9D1CECA909F5698BFA86C57B6A73D3C0F9FDB94128B7BB9FDD19D57E4C2C2F4127A1F127A96ABF248B26D8B6EF12A1EA97D064564D33D46E5CA71F53FA121A7E5C91ED2B08BE64A0E3D22BE26FC251C0BF4CF21674DE19AF410E3EDFBD9A4BBEA6C709A1E42B5C17E1EE7AA33EFB0F375BF0858D49210A71662313FA5B8E04E508A5E9425D49C3C5D12CB8DCADBF8A148BFA042BBB0218AAC403AAB9CECC45FD33CEC6797FC984BF91FF638AF6E1F09546F595CFC779D2D867282C63B78DC6A6ED3C1C3887462C84AC07C756C5A8D8A8B2EFD39C28A68D47091A3312461BC20085636F4B41F22D5B46861F3E557777CDCFDFE6CAB8ABECFECA3634D779C0F21185772C426BE383BD26E1715DAB5EC4AE4CAC877ED6899CBCB31546F9C6144399C7C0257BBCCBE0EBD2E90EA901840211FCBB1655CB66FD9C51E90432B273CC4CFF3F8DBEB24CDB0ED6017FE68A7F3E9156E1BA276526DE9599A66921F0E2C3ED466FDE0076DCDA6745F29D28E406BEE5FFB0D4C5FB5D72029BAE56BC22496567FF64341F89469703987DD9D700C08346782F57BA62479812820C862D3D5C604C5C26A76F1A7EC503EB4892BAEB25BB44E783
FFB3F1B2F5BECB16A5B48F4D769C3DD5713D2B00AEF4870248A1D561623C9418C285CEE86E1C8DF4A73ED729D7789577456281B4F4D1EE3447E6F8391341BB8F15CF9712E73DBE149164B95748F35D6A4CB4492B8D082AB372E96FC29D1578B41D85F0B7A04EFCBE928642D5D2825F978805C43062C3DF3F0915B33F58A8D82BCC523F3C36B9BAEB9226A39549408AFD3119A8B39B8887038107EB5A623D59186BFFF562E624E1BC25A0C8AA6DD298AEC09B06802A77DFD11799D29506307693DAB2962B98EF9F25E785619D05BDE7073474E187D59D41F6A2E06CC292AD406C2991D9C5E58812A1431B46AB634D548433B1E437D745A013EF5950818CC2A2860B4A7D93BE2DDF0AAAE0627B587A6FAD3FBFFCE90BEFB8ED5F71EAA17F6A0841841EC096D2F47AEA2203CD44B0D68A7B02FE7BEA80BA3CC925155610F00B7D631D150EBA6FFE62E756004A946F9E0F8728DC9E87768D1AC6560DA6A15382DFFA62B129957D4B23F4921C46631D24125DD5CACA73481F2CBDE5'
# A note on this site's little trick: when the page number is 1, IsSearch is true and no SearchSql field is shown,
# but page forward and you'll find that for page > 1 IsSearch is false and SearchSql appears. I tried setting
# IsSearch to false for page 1 as well and it made no difference, so only CurPage needs to change between requests.
# s is the search SQL; it can be found in the payloads where IsSearch is false (both GetGridTableHtml and ValidRight carry it).
page = 0
head = head.replace('$s', s)
head = head.replace('$page', str(page))
data = dict([[y.strip() for y in x.strip().split(':', 1)] for x in head.strip().split('\n') if x.strip()])  # convert to the dict format used as form data; print it to inspect

def cal(url, headers, data, Curpage, database):
    data['CurPage'] = Curpage
    res = requests.post(url, timeout=30, data=data,
                        headers=headers).content.decode('utf-8')
    bs = BeautifulSoup(res, 'html.parser')
    target = bs.find('table', class_="result-table-list")
    names = target.find_all(class_='name')
    authors = target.find_all(class_="author")
    sources = target.find_all(class_="source")
    dates = target.find_all(class_="date")
    datas = target.find_all(class_="data")

    for k in range(len(names)):
        name = names[k].find('a').text.strip()
        author = authors[k].text.strip()
        source = sources[k].text.strip()
        date = dates[k].text.strip()
        data = datas[k].text.strip()
        href = "https://kns.cnki.net/" + names[k].find('a')['href']  # 提取文章链接
        database.append([name, author, source, date, data, href])


database = []

for m in range(1, 3, 1):
    cal(url, headers, data, str(m), database)

import pandas as pd
df = pd.DataFrame(database, columns=["Title", "Author", "Source", "Publish_date", "Database", "Link"])
df.to_excel('./paper.xlsx', sheet_name="Sheet1", index=False)

Code analysis:

  • The first two lines import the relevant libraries. After experimenting, I found that the two most common ones, requests and BeautifulSoup, are enough to parse this page; lxml.html could also be used (a short sketch of that alternative follows the screenshot below).

  • Then assign the target URL we found to url (as before, it is found under Headers).

import requests
from bs4 import BeautifulSoup

url = 'https://kns.cnki.net/KNS8/Brief/GetGridTableHtml'

(screenshot: the request URL shown under the Headers tab)
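
For completeness, here is a rough sketch of the lxml.html alternative mentioned above; the class names are assumed to be the same ones the BeautifulSoup code below relies on, and the HTML string is only a tiny stand-in for the real response:

from lxml import html

res_text = '<table class="result-table-list"><tr><td class="name"><a href="/paper/1">Some paper title</a></td></tr></table>'   # stand-in for the decoded HTML of the POST response
tree = html.fromstring(res_text)
titles = tree.xpath('//table[contains(@class, "result-table-list")]//*[contains(@class, "name")]/a/text()')
print(titles)   # ['Some paper title'], provided the class names match the real page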

  • This part is straightforward: copy the request headers from the Headers tab into a string, then convert that string into a dictionary.
head = '''
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6
Connection: keep-alive
Content-Length: 5134
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: Ecp_ClientId=b230309162401492276; Ecp_loginuserbk=gz0332; knsLeftGroupSelectItem=1%3B2%3B; Ecp_ClientIp=219.223.233.83; cnkiUserKey=3c1cc8af-46ae-8fb2-d5f7-1014b5e3d034; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22186c579a2ef11ea-0e8fd66c04c69e8-74525476-1327104-186c579a2f01680%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%2C%22%24device_id%22%3A%22186c579a2ef11ea-0e8fd66c04c69e8-74525476-1327104-186c579a2f01680%22%7D; Hm_lvt_6e967eb120601ea41b9d312166416aa6=1678350324,1680238706; Ecp_session=1; SID_kns_new=kns15128006; ASP.NET_SessionId=4dobe0vxy4f4m2fnmchugzbg; SID_kns8=25124104; CurrSortFieldType=desc; LID=WEEvREcwSlJHSldSdmVqMDh6c3VFeXBleUNRb1BiK00rb0k3YXk5YllBQT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!; Ecp_LoginStuts={"IsAutoLogin":false,"UserName":"gz0332","ShowName":"%E6%B7%B1%E5%9C%B3%E5%A4%A7%E5%AD%A6%E5%9F%8E%E5%9B%BE%E4%B9%A6%E9%A6%86","UserType":"bk","BUserName":"","BShowName":"","BUserType":"","r":"77fOa5","Members":[]}; dsorder=pubdate; CurrSortField=%e5%8f%91%e8%a1%a8%e6%97%b6%e9%97%b4%2f(%e5%8f%91%e8%a1%a8%e6%97%b6%e9%97%b4%2c%27time%27); _pk_ref=%5B%22%22%2C%22%22%2C1681733499%2C%22https%3A%2F%2Fcn.bing.com%2F%22%5D; _pk_ses=*; _pk_id=b73a87af-2c79-4c0e-866e-1b6b2c1787d5.1678350286.7.1681733507.1681733499.; c_m_LinID=LinID=WEEvREcwSlJHSldSdmVqMDh6c3VFeXBleUNRb1BiK00rb0k3YXk5YllBQT0=$9A4hF_YAuvQ5obgVAqNKPCYcEjKensW4IQMovwHtwkF4VYPoHbKxJw!!&ot=04%2F17%2F2023%2020%3A16%3A32; c_m_expire=2023-04-17%2020%3A16%3A32; dblang=ch
Host: kns.cnki.net
Origin: https://kns.cnki.net
Referer: https://kns.cnki.net/kns8/defaultresult/index
sec-ch-ua: "Chromium";v="112", "Microsoft Edge";v="112", "Not:A-Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.48
X-Requested-With: XMLHttpRequest
'''
headers = dict([[y.strip() for y in x.strip().split(':', 1)] for x in head.strip().split('\n') if x.strip()])

Find it here ↓↓↓
(screenshot: the Request Headers section in DevTools)
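
As a quick toy illustration of that string-to-dictionary one-liner (the two header lines are made up purely for demonstration):

head_demo = '''
Host: kns.cnki.net
Accept: text/html, */*; q=0.01
'''
demo = dict([[y.strip() for y in x.strip().split(':', 1)]
             for x in head_demo.strip().split('\n') if x.strip()])
print(demo)   # {'Host': 'kns.cnki.net', 'Accept': 'text/html, */*; q=0.01'}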

  • This block assigns the other parameters (the form data); the method is the same as above: copy, assign, and convert to a dictionary.
    *(Both the code comments and the text below explain what is peculiar about one of these parameters; if you only care about the overall approach, you can skip the details of how it is encrypted.)*
# I set the IsSearch parameter in the form below directly to false; see the comments further down for the reasoning
head = '''
IsSearch: false
QueryJson: {"Platform":"","DBCode":"CFLS","KuaKuCode":"CJFQ,CDMD,CIPD,CCND,CISD,SNAD,BDZK,CCJD,CCVD,CJFN","QNode":{"QGroup":[{"Key":"Subject","Title":"","Logic":1,"Items":[{"Title":"主题","Name":"SU","Value":"碳中和","Operate":"%=","BlurType":""}],"ChildItems":[]}]},"CodeLang":"ch"}
SearchSql: $s
PageName: defaultresult
DBCode: CFLS
KuaKuCodes: CJFQ,CDMD,CIPD,CCND,CISD,SNAD,BDZK,CCJD,CCVD,CJFN
CurPage: $page
RecordsCntPerPage: 20
CurDisplayMode: listmode
CurrSortField: PT
CurrSortFieldType: desc
IsSentenceSearch: false
Subject: 
'''
s = '0645419CC2F0B23BC604FFC82ADF67C6E920108EDAD48468E8156BA693E89F481391D6F5096D7FFF3585B29E8209A884EFDF8EF1B43B4C7232E120D4832CCC8979F171B4C268EE675FFB969E7C6AF23B4B63CE6436EE93F3973DCB2E4950C92CBCE188BEB6A4E9E17C3978AE8787ED6BB56445D70910E6E32D9A03F3928F9AD8AADE2A90A8F00E2B29BD6E5A0BE025E88D8E778EC97D42EF1CF47C35B8A9D5473493D11B406E77A4FF28F5B34B8028FE85F57606D7A3FED75B27901EEF587583EBD4B63AC0E07735BE77F216B50090DEE5ABB766456B996D37EB8BDACA3A67E8126F111CF9D15B351A094210DB6B4638A21065F03B6F0B73BB4625BBECE66F8197909739D8FB4EB756DEF71864177DFA3CB468CFA6E8ABF7924234DED6B0DFD49D9269CBA4A2BF4075D517A61D094225D70C1B4C137DB9614758A5E097376F5F3E55A7063A4B7E437436D13FF3CC8FB435E131FFCD16FC30DD997098B4FC997D995E767E2712175BC05B960D3FEB5CAF12A13BD1CE3530AD72FC4DB93206996E216BC5DC294960A0CA05E986848E1E64FFC5A52BFCB41A97840A708E397F11EFF261E08F3A34094061AE8E8F819AF6A17A9E2176C3893C6DD3E3C06864C91989BDEF9790A38FAF2524B17743B30EBA4ADD550BF985F9C3097A608C697283CE37F8CB78BDC9EAA4874C3485E6F931B016EC41BFBC0EF91B2AD7E1B424E1DFB8FC8771DEA2458C5A7A4C9BF0192C101FD8EDDEE1BACB44C3E478361EF0D1B70FAD56BCF6870A6044D3A226611B9C1A43C6F9F7C021C98E0D5F778D72C87183F026071A730B8BB4FABF9F68FEC783AB1E6E79218B5D87FD1BB541817FB4F3C21DC849A803CB8A620A2EE00475BAF2CE6556638B7A949B446F39A1076DA15764A777BA6239447CB91F4CF513325366E167D268DDB75F288B5C13415CE62F5C431181C044A28CA502FF14439E5C6F63D419CB6DE1360DB01593FE765459299E442EE24917C199AB5178F38461F8C4EBBC95344C5F2AB60F379813A87E2E3AFE3021198B8222CEAB870D9A353786079961184D63977917C7DF8FE6AFBBC795A832BDD454D6E3CD22C3FF7A58808923DD6F464C12A9A88FBFD0C71458AB0E4C1D566315181A9578ECE93670E5CAF13CF2553F68E64726C131F4A48B42A9E7F09EFEBA51D1FDA6BA0ECA0B02B951ECC04548F1D4D08DB69D0EFDCE6793537BB8E59DC442631A9CDBA13878D7493AADA0CD868C1C1C3A6A6FA17C4109205A83F9C0C43E0D2551D0A8592EA99D20D4B78B4EEEE2D53A543701F620C7D6FF46E800B0CEF9B3D23ACD62C7CBEA25FC8BD74D5A0E5C86B9CF3FCACBDBE585AFF85F9689CBF5BCBD267C580361D5B93AA9BD5A1BB6122BB87C04AE227211FE675A4650814F2285261E5641683D65E0454E2597F6025BB4AA1A044D7B97F57394EA5EC878B80FC0A82F12E2D3D9E1BCB062A7ABB290F02116BDED95761A67CE2FDDE42BDDDF34F22E49A0406D724FEBC86B93F80BB52A8D34B8D2B24288ECBC3F90CCB1EF36085E77F1E2AE0AB411FE60A033E704A21469EA5CE4BB8AF6B1C1F1C5F1F084472D57F458CC39F0B2FE583D0795159E9E38BD1102F5D96DB0F828B66F41A702BB0AE59E40CF53BE7F6342EF208434CFFABF845AFD771B288D484BB79952159E6EA27658A6B6230557AF16E86C4AFDF973DBD5A3A2B979AD9037441409D22A954DC50CBCEA8EA5AC500C4BC8282DCE2626BD2B2CB4B1E33B2E1F92533F7F04C48D061907DBCE3E21FF0A77F09C1AE33E769962CD1EDE6B688590D569409EE9EEC4DF1074DCC97C43B0EDAF1C38B5B2784ABC803D9B3B4FC35F46CB1E275E7F83036FC6AFF2E624D4D2E6AE1C2D4CE3FF219FA90A935957E0DE1A386E4AAB5C9F9D1CECA909F5698BFA86C57B6A73D3C0F9FDB94128B7BB9FDD19D57E4C2C2F4127A1F127A96ABF248B26D8B6EF12A1EA97D064564D33D46E5CA71F53FA121A7E5C91ED2B08BE64A0E3D22BE26FC251C0BF4CF21674DE19AF410E3EDFBD9A4BBEA6C709A1E42B5C17E1EE7AA33EFB0F375BF0858D49210A71662313FA5B8E04E508A5E9425D49C3C5D12CB8DCADBF8A148BFA042BBB0218AAC403AAB9CECC45FD33CEC6797FC984BF91FF638AF6E1F09546F595CFC779D2D867282C63B78DC6A6ED3C1C3887462C84AC07C756C5A8D8A8B2EFD39C28A68D47091A3312461BC20085636F4B41F22D5B46861F3E557777CDCFDFE6CAB8ABECFECA3634D779C0F21185772C426BE383BD26E1715DAB5EC4AE4CAC877ED6899CBCB31546F9C6144399C7C0257BBCCBE0EBD2E90EA901840211FCBB1655CB66FD9C51E90432B273CC4CFF3F8DBEB24CDB0ED6017FE68A7F3E9156E1BA276526DE9599A66921F0E2C3ED466FDE0076DCDA6745F29D28E406BEE5FFB0D4C5FB5D72029BAE56BC22496567FF64341F89469703987DD9D700C08346782F57BA62479812820C862D3D5C604C5C26A76F1A7EC503EB4892BAEB25BB44E783
FFB3F1B2F5BECB16A5B48F4D769C3DD5713D2B00AEF4870248A1D561623C9418C285CEE86E1C8DF4A73ED729D7789577456281B4F4D1EE3447E6F8391341BB8F15CF9712E73DBE149164B95748F35D6A4CB4492B8D082AB372E96FC29D1578B41D85F0B7A04EFCBE928642D5D2825F978805C43062C3DF3F0915B33F58A8D82BCC523F3C36B9BAEB9226A39549408AFD3119A8B39B8887038107EB5A623D59186BFFF562E624E1BC25A0C8AA6DD298AEC09B06802A77DFD11799D29506307693DAB2962B98EF9F25E785619D05BDE7073474E187D59D41F6A2E06CC292AD406C2991D9C5E58812A1431B46AB634D548433B1E437D745A013EF5950818CC2A2860B4A7D93BE2DDF0AAAE0627B587A6FAD3FBFFCE90BEFB8ED5F71EAA17F6A0841841EC096D2F47AEA2203CD44B0D68A7B02FE7BEA80BA3CC925155610F00B7D631D150EBA6FFE62E756004A946F9E0F8728DC9E87768D1AC6560DA6A15382DFFA62B129957D4B23F4921C46631D24125DD5CACA73481F2CBDE5'
# A note on this site's little trick: when the page number is 1, IsSearch is true and no SearchSql field is shown,
# but page forward and you'll find that for page > 1 IsSearch is false and SearchSql appears. I tried setting
# IsSearch to false for page 1 as well and it made no difference, so only CurPage needs to change between requests.
# s is the search SQL; it can be found in the payloads where IsSearch is false (both GetGridTableHtml and ValidRight carry it).
page = 0
head = head.replace('$s', s)
head = head.replace('$page', str(page))
data = dict([[y.strip() for y in x.strip().split(':', 1)] for x in head.strip().split('\n') if x.strip()])  # convert to the dict format used as form data; print it to inspect

(Note that the SearchSql parameter can only be found in the GetGridTableHtml request from the second results page onward, while the IsSearch parameter can simply be set to false.)

The parameter copied here only appears after paging to the second page of results ↓↓↓
(screenshot: the SearchSql field in the page-2 request payload)

  • Next, write a small function to repeat the crawl for the different result pages. The logic is the same as the json-based Bilibili crawl earlier, except that BeautifulSoup parses the page and the find and find_all methods locate the elements.
def cal(url, headers, data, Curpage, database):
    # crawl the paper records on the specified results page
    data['CurPage'] = Curpage
    res = requests.post(url, timeout=30, data=data,
                        headers=headers).content.decode('utf-8')
    bs = BeautifulSoup(res, 'html.parser')
    target = bs.find('table', class_="result-table-list")
    names = target.find_all(class_='name')
    authors = target.find_all(class_="author")
    sources = target.find_all(class_="source")
    dates = target.find_all(class_="date")
    datas = target.find_all(class_="data")

    for k in range(len(names)):
        # use .text to pull the string out and strip() to remove whitespace and escape characters
        name = names[k].find('a').text.strip()
        author = authors[k].text.strip()
        source = sources[k].text.strip()
        date = dates[k].text.strip()
        data = datas[k].text.strip()
        href = "https://kns.cnki.net/" + names[k].find('a')['href']  # 提取文章链接
        database.append([name, author, source, date, data, href])
  • The last step calls this function to crawl the paper data page by page, and remember to save it to an Excel file (^ ▽ ^)
database = []   # an empty list to collect the rows

for m in range(1, 3, 1):  # crawl the first two pages of results
    cal(url, headers, data, str(m), database)

import pandas as pd
#  write out with pandas, setting the column names
df = pd.DataFrame(database, columns=["Title", "Author", "Source", "Publish_date", "Database", "Link"])
df.to_excel('./paper.xlsx', sheet_name="Sheet1", index=False)
  • The result is as follows↓↓↓
    (screenshot: the resulting paper.xlsx)

Questions and Reviews

  • Questions:

Although the finished crawler did work reasonably well, the following three problems I ran into while writing the code are still unsolved. I hope knowledgeable readers can leave answers in the comments:

① The data crawled from Bilibili does not quite match the live page. For example, the first title I crawled was [This is too unreasonable], whereas the first entry on the page at that moment should have been [AI glance I have seen through my essence]. I have two suspicions: either the popular list simply keeps changing and refreshes every so often, or the URL I requested is not actually the one behind the page I was looking at. I found no evidence for either while inspecting the page, though, so the exact cause is unknown.

② When crawling the papers I did not originally define a function at all; I only did so because the plain loop below kept raising the error
Traceback (most recent call last): File "D:\PyTest\爬虫\COVID-19\spider.py", line 91, in <module> data['CurPage'] = str(j) TypeError: 'str' object does not support item assignment
and I hope someone can explain why the dictionary suddenly could not be modified.

# this block sat right after the line data = dict([[y.strip() for y in x.strip().split(':', 1)] for x in head.strip().split('\n') if x.strip()])
for j in range(2, 5):    
    data['CurPage'] = str(j)
    res = requests.post(url, timeout=30, data=data,
                        headers=headers).content.decode('utf-8')
    bs = BeautifulSoup(res, 'html.parser')
    target = bs.find('table', class_="result-table-list")
    names = target.find_all(class_='name')
    authors = target.find_all(class_="author")
    sources = target.find_all(class_="source")
    dates = target.find_all(class_="date")
    datas = target.find_all(class_="data")

    for k in range(len(names)):
        name = names[k].find('a').text.strip()
        author = authors[k].text.strip()
        source = sources[k].text.strip()
        date = dates[k].text.strip()
        data = datas[k].text.strip()  # note: this rebinds data (previously the form-data dict) to a string, which is what makes data['CurPage'] = str(j) fail on the next pass
        href = "https://kns.cnki.net/" + names[k].find('a')['href']  # 提取文章链接

③ The important SearchSql parameter is hidden on the first results page and only shows up from the second page onward; I did not page forward at first, which cost me a lot of time. How is this trick implemented? It is not easy to guess just by looking.

  • Review:

This article tries to map the four crawler steps onto a real need. My technical level is limited, so please just treat it as a review of crawler basics and a chance to exchange ideas with friends, haha~

As you can see, there are many ways web pages are put together, and the ways websites obfuscate themselves keep improving as well.

I played with ChatGPT in class today and asked it to write a Bilibili popular-list crawler. Because of the cutoff of its training data, the code it produced probably would have worked on the Bilibili of around 2020, but the site's protections have since changed and the code no longer works; the same is true of many other sites.

Still, for plain text scraping, the ideas and steps covered in this article should be hard to avoid for most crawlers.

Finally, I would like to thank everyone who shares their experience. I am just a little crawler myself, and only by cruising through your knowledge sites can I gain any wisdom.


Origin: blog.csdn.net/weixin_50593821/article/details/130212425