"Offer arrives, dig friends to pick up! I am participating in the 2022 Spring Recruitment Check-in Event, click to view the details of the event ."
The target this time is the "Digital Museum of China's Intangible Cultural Heritage", collected for technical study only. One more thing: this intangible-cultural-heritage site is beautifully designed.
Target data source analysis
Target site: http://www.ihchina.cn/
The data sits in the positions shown in the figure below. In principle every category could be captured, but to keep the request frequency against the site low, only a single category is collected here: http://www.ihchina.cn/project#target1
The page data is loaded asynchronously; clicking through the pagination produces requests such as:
http://www.ihchina.cn/Article/Index/getProject.html?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p=1
http://www.ihchina.cn/Article/Index/getProject.html?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p=2
The parameters are:
- province: affiliated region;
- rx_time: announcement time;
- type: category;
- cate: type;
- keywords: keywords;
- category_id: category ID;
- limit: number of items per page;
- p: page number.
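As a small helper (the function and variable names here are my own, not from the original code), the query string above can be assembled with `urllib.parse.urlencode` rather than concatenated by hand, which makes it easy to vary `limit` and `p` later:

```python
from urllib.parse import urlencode

BASE = "http://www.ihchina.cn/Article/Index/getProject.html"

def build_url(page, limit=10, category_id=16):
    # Only category_id, limit and p matter in this case; the rest stay empty,
    # matching the captured requests above.
    params = {
        "province": "", "rx_time": "", "type": "", "cate": "",
        "keywords": "", "category_id": category_id, "limit": limit, "p": page,
    }
    return f"{BASE}?{urlencode(params)}"

print(build_url(1))
```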
The overall code for this case is similar to the previous one; the focus this time is the shape of the returned data. The figure below shows the server response: the core data lives in the list field, but the developer also returns the pagination metadata and ready-made pagination HTML tags. Data in that form can be a useful reference for some projects.
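Assuming a response shaped like the figure (a top-level `list` field carrying the records, plus pagination info; the exact sibling key names below are illustrative, not confirmed from the site), extracting the core data is a one-liner once the JSON is parsed:

```python
import json

# Illustrative sample shaped like the response in the figure above:
# the records live under "list", other keys carry pagination info.
sample = json.loads("""
{
  "list": [
    {"id": 1, "title": "item A"},
    {"id": 2, "title": "item B"}
  ],
  "total": 3610,
  "pagestr": "<ul class=\\"pagination\\">...</ul>"
}
""")

records = sample["list"]   # the core data
total = sample["total"]    # total record count, handy for computing page counts
print(len(records), total)
```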
Coding time
This is the last case built on the threading module. At the beginner crawler stage, mastering basic multi-threaded usage like this is enough.
During testing, a classic collection trick applies: probe the maximum amount of data the server will return from a single interface call. In this case, manually increase the limit parameter; set it to 100, for example, and the server returns 100 records.
Since the server evidently places no restriction on the size of a single response, you can in principle set limit to 3610 (all the data in the target category) and fetch everything through one interface call. Note, however, that the larger the response, the slower the interface responds, so adjust according to the actual situation.
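The trade-off is easy to quantify: at the default limit of 10, the 3610 records in this category need 361 requests, while a single request with limit=3610 fetches everything. A quick sketch:

```python
import math

TOTAL = 3610  # total records in the target category, per the article

def pages_needed(limit):
    # Number of requests required at a given per-page limit
    return math.ceil(TOTAL / limit)

print(pages_needed(10))    # → 361 (default limit)
print(pages_needed(100))   # → 37  (probed larger limit)
print(pages_needed(3610))  # → 1   (everything in one request)
```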
The complete code is as follows
import threading
import requests
import random


class Common:
    def __init__(self):
        pass

    def get_headers(self):
        # Pick a random User-Agent for each request
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Your own UA value, or one collected from an earlier blog post"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers


def run(index, url, semaphore, headers):
    semaphore.acquire()  # acquire the semaphore (caps concurrency)
    res = requests.get(url, headers=headers, timeout=5)
    res.encoding = 'utf-8'
    text = res.text
    save(index, text)
    semaphore.release()  # release the semaphore


# Chinese text in the stored data is Unicode-escaped; convert it when analyzing
def save(index, text):
    with open(f"./非遗数据/{index}.json", "w", encoding="utf-8") as f:
        f.write(f"{text}")
    print("Data for this URL written to disk")


if __name__ == '__main__':
    c = Common()
    url_format = 'http://www.ihchina.cn/Article/Index/getProject.html?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p={}'
    # Build the URL list as a shared variable; pages 1–361 are hard-coded
    # rather than discovered dynamically
    urls = [url_format.format(i) for i in range(1, 362)]
    # Allow at most 5 threads to run at the same time
    semaphore = threading.BoundedSemaphore(5)
    for i, url in enumerate(urls):
        t = threading.Thread(target=run, args=(i, url, semaphore, c.get_headers()))
        t.start()
    while threading.active_count() != 1:
        pass
    else:
        print('All threads finished')
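One design note: the `while threading.active_count() != 1: pass` loop busy-waits, burning CPU until the workers finish. Collecting the threads and calling `join()` achieves the same thing without spinning. A minimal sketch, with a dummy worker standing in for the real download:

```python
import threading

results = []
lock = threading.Lock()

def worker(index, semaphore):
    # Stand-in for the real download; the semaphore still caps concurrency at 5
    with semaphore:
        with lock:
            results.append(index)

semaphore = threading.BoundedSemaphore(5)
threads = [threading.Thread(target=worker, args=(i, semaphore)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # block until every thread finishes; no busy-waiting
print(len(results))  # → 20
```

Using the semaphore as a context manager (`with semaphore:`) also guarantees it is released even if the worker raises, which the manual acquire/release pair in the main code does not.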
Collection time
Code repository address: codechina.csdn.net/hihell/pyth… — a follow or a Star is appreciated.
The JSON data generated during the collection process is shown below.
Today is day 206/365 of continuous writing.