Our pride! Collecting intangible cultural heritage data from an official source with a Python crawler

"Offer arrives, dig friends to pick up! I am participating in the 2022 Spring Recruitment Check-in Event, click to view the details of the event ."

The data captured this time comes from the "Digital Museum of China's Intangible Cultural Heritage" and is collected only for technical study. It has to be said that the intangible cultural heritage website is very well designed.

Target data source analysis

Target site: http://www.ihchina.cn/. The data sits in the position shown in the figure below. In principle all categories can be captured, but to reduce the number of visits to the website only a single category is collected, i.e. http://www.ihchina.cn/project#target1.

The page data is loaded asynchronously; clicking through the pagination produces requests like the following:

http://www.ihchina.cn/Article/Index/getProject.html?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p=1
http://www.ihchina.cn/Article/Index/getProject.html?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p=2

The parameters are as follows (a request sketch using them appears after the list):

  • province: affiliated region;
  • rx_time: announcement time;
  • type: category;
  • cate: type;
  • keywords: keywords;
  • category_id: category ID;
  • limit: number of records per page;
  • p: page number.
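
As a quick sanity check of these parameters, the same paginated request can be built with a params dictionary instead of a hand-written query string. This is only a minimal sketch and not part of the original crawler:

import requests

# Minimal sketch: reproduce the pagination request with a params dict.
# The values mirror the query string shown above; only category_id, limit and p are filled in.
api = "http://www.ihchina.cn/Article/Index/getProject.html"
params = {
    "province": "", "rx_time": "", "type": "", "cate": "", "keywords": "",
    "category_id": 16,   # the single category collected in this post
    "limit": 10,         # records per page
    "p": 1,              # page number
}
res = requests.get(api, params=params, timeout=5)
print(res.status_code)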

The overall code of this case is similar to the previous one; the focus this time is on the data that comes back. The figure below shows the server response: the core data sits in the list field, but the developer also returns the paging format and ready-made paging HTML tags, which can be a useful reference in some projects.
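
A minimal sketch of pulling the core records out of such a response; the sample payload only mirrors the "list"-plus-paging shape described above, and its field names are hypothetical:

import json

# Hypothetical sample shaped like the response described above:
# records under "list", plus paging information returned alongside them.
sample = '{"list": [{"title": "\\u793a\\u4f8b"}], "limit": 10, "p": 1}'
data = json.loads(sample)
for record in data["list"]:
    print(record["title"])  # json.loads turns the \uXXXX escapes back into Chinese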

Coding time

This is the last case for the threading module. At the beginner crawler stage, it is enough to master basic multi-threaded applications like this one.

During actual testing there is a classic collection technique you can apply: probe the maximum amount of data the server will return for a single request. In this case you can manually change the limit parameter, for example set it to 100, and the server will return 100 records.

If that works, it means the server does not restrict how much data a single request may return, so in principle you can set limit directly to 3610 (all the data in the target category).

In this way all the data can be obtained with a single call to the interface (although when a large amount of data is returned, the interface responds more slowly, so adjust according to the actual situation).
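
A sketch of that single-call idea, assuming the interface keeps honouring large limit values (3610 is the category total quoted above and may change over time):

import json
import requests

api = "http://www.ihchina.cn/Article/Index/getProject.html"
params = {"province": "", "rx_time": "", "type": "", "cate": "", "keywords": "",
          "category_id": 16,
          "limit": 3610,  # probe with 100 first; switch to the full count if the server allows it
          "p": 1}
res = requests.get(api, params=params, timeout=30)
records = res.json().get("list", [])
print(f"fetched {len(records)} records in one call")
with open("all_projects.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)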

The complete code is as follows

import os
import random
import threading

import requests

class Common:
    def __init__(self):
        pass

    def get_headers(self):
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Your own UA value, or one taken from an earlier post"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers


def run(index, url, semaphore, headers):
    semaphore.acquire()  # acquire the semaphore (caps concurrent threads)
    res = requests.get(url, headers=headers, timeout=5)
    res.encoding = 'utf-8'
    text = res.text
    save(index, text)
    semaphore.release()  # release the semaphore


# The Chinese text in the stored data is Unicode-escaped; convert it when analyzing
def save(index, text):
    with open(f"./非遗数据/{index}.json", "w", encoding="utf-8") as f:
        f.write(f"{text}")
    print("Data for this URL has been written")


if __name__ == '__main__':
    c = Common()
    url_format = 'http://www.ihchina.cn/Article/Index/getProject.html?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p={}'
    # Build the URL list (shared globally); the 361 pages are hard-coded, not fetched dynamically
    urls = [url_format.format(i) for i in range(1, 362)]
    # Make sure the output directory exists before the threads start writing
    os.makedirs("./非遗数据", exist_ok=True)
    # Allow at most 5 threads to run at the same time
    semaphore = threading.BoundedSemaphore(5)
    for i, url in enumerate(urls):
        t = threading.Thread(target=run, args=(i, url, semaphore, c.get_headers()))
        t.start()
    while threading.active_count() != 1:
        pass
    else:
        print('All threads have finished')
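
The busy-wait loop at the end mirrors the original code. As a design note, the standard library's thread pool can replace both the semaphore and the busy-wait, because the pool itself caps concurrency and its with-block waits for every task to finish; a minimal sketch of that alternative (not the author's code) looks like this:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(index, url):
    # Same job as run() above, minus the semaphore; the pool limits concurrency.
    res = requests.get(url, timeout=5)
    res.encoding = "utf-8"
    return index, res.text

url_format = ('http://www.ihchina.cn/Article/Index/getProject.html'
              '?province=&rx_time=&type=&cate=&keywords=&category_id=16&limit=10&p={}')
urls = [url_format.format(i) for i in range(1, 362)]

# The with-block only exits once every submitted task has completed.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(fetch, i, url) for i, url in enumerate(urls)]
print("all pages fetched:", len(futures))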

Collection time

Code repository address: codechina.csdn.net/hihell/pyth… , feel free to follow it or give it a Star.

The JSON data generated during the collection process stores the Chinese text as \uXXXX Unicode escapes (see the comment in the code above), so remember to convert it when analyzing.
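
A minimal sketch of reading one of the saved files back; the file name 0.json and the "title" field are assumptions based on the response shape described earlier:

import json

# Read one saved page back; json.load converts the \uXXXX escapes into readable Chinese.
with open("./非遗数据/0.json", "r", encoding="utf-8") as f:
    data = json.load(f)

for item in data.get("list", []):
    print(item.get("title"))  # "title" is a hypothetical field name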

== Since you're already here, why not leave a comment, give a like, and bookmark the post? ==

Today is day 206/365 of continuous writing. Feel free to follow, like, comment, and bookmark.


Origin juejin.im/post/7080329581902692365