Data life cycle management

Data storage

The big data era, in the spirit of the saying "the sea is vast because it accepts all rivers," means handling massive and diverse structured, semi-structured, and unstructured data, and storing and computing on both batch data and streaming data. Data structures, data forms, timeliness and performance requirements, and storage and computing costs all differ, so each case calls for an appropriate storage form and computing engine. The rapid growth of data volume, however, puts heavy pressure on costs. Data of different heat levels should therefore use different storage and computing resources to optimize storage and processing costs and improve availability.

Data Storage System Division

In terms of timeliness or data form, data is divided into batch data and real-time streaming data; in terms of structure, it is divided into structured, semi-structured, and unstructured data. Requirements for storage capacity, timeliness, and read/write and query performance vary with the heat of the data, so the storage technology should be chosen accordingly.
Storage technologies are categorized as follows:

  • Traditional relational databases: Oracle, DB2, MySQL, SQL Server, etc., store structured data.
  • Distributed relational databases: Hive, Greenplum, Teradata, Vertica, etc., store structured data.
  • NoSQL storage: HBase, Redis, Elasticsearch, MongoDB, Neo4j, etc., store semi-structured and unstructured data.
  • Message systems: Kafka, RocketMQ, and similar systems provide short-term storage for unstructured and semi-structured data.
  • File systems: HDFS, S3, OSS, etc., store structured, semi-structured, and unstructured data.
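
The categorization above can be sketched as a simple lookup. This is only an illustration: the categories and example systems come from the list, while the dictionary layout and the function name `candidate_categories` are assumptions made for this sketch.

```python
# Categories and example systems are taken from the list above.
# The dictionary layout and helper name are hypothetical.
STORAGE_CATEGORIES = {
    "traditional relational database": {
        "examples": ["Oracle", "DB2", "MySQL", "SQL Server"],
        "structures": {"structured"},
    },
    "distributed relational database": {
        "examples": ["Hive", "Greenplum", "Teradata", "Vertica"],
        "structures": {"structured"},
    },
    "NoSQL storage": {
        "examples": ["HBase", "Redis", "Elasticsearch", "MongoDB", "Neo4j"],
        "structures": {"semi-structured", "unstructured"},
    },
    "message system": {
        "examples": ["Kafka", "RocketMQ"],
        "structures": {"semi-structured", "unstructured"},
    },
    "file system": {
        "examples": ["HDFS", "S3", "OSS"],
        "structures": {"structured", "semi-structured", "unstructured"},
    },
}

VALID_STRUCTURES = {"structured", "semi-structured", "unstructured"}


def candidate_categories(structure: str) -> list[str]:
    """Return the storage categories the list associates with a data structure."""
    if structure not in VALID_STRUCTURES:
        raise ValueError(f"unknown structure: {structure!r}")
    return [
        name
        for name, info in STORAGE_CATEGORIES.items()
        if structure in info["structures"]
    ]


if __name__ == "__main__":
    for structure in sorted(VALID_STRUCTURES):
        print(structure, "->", candidate_categories(structure))
```

Structure is only the first filter. Timeliness, capacity, and query performance requirements still decide which system within a category fits.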

Data heat

Data heat classifies data as hot, warm, cold, or ice data according to value density, access frequency, usage pattern, and timeliness. Because the value of data changes over time, its heat changes too. The heat level should be updated dynamically to drive data life cycle management from data generation through destruction.

  • Hot data: data with high value density and high frequency of use that must support real-time query and display.
  • Warm data: data between hot and cold, mainly used for data analysis.
  • Cold data: data with low value density and low frequency of use, used for data screening and retrieval.
  • Ice data: data with extremely low value and zero frequency of use that is temporarily archived.

The recommended storage for each heat level is as follows:

  • Hot data serves decision makers. It is recommended to use storage technology with low capacity but high timeliness, stability, and availability.
  • Warm data serves data analysts. It is recommended to use a storage and computing engine with somewhat larger capacity and high computing performance that can support data analysis tools effectively.
  • Cold data serves data scientists. Large-capacity, scalable storage technology is recommended.
  • Ice data is archived on ultra-large-capacity, ultra-low-cost storage technology.

Destruction rules for archived ice data can be formulated under the enterprise data strategy according to requirements such as data age and legally mandated retention periods. Once the value of the data has been fully mined, it can be destroyed to reduce unnecessary storage costs. This is the point of data life cycle management.
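
As a minimal sketch of how a heat level might be assigned and mapped to a storage profile: the text names access frequency and timeliness as criteria but gives no numbers, so the thresholds, the `DatasetStats` fields, and the function name `classify_heat` below are all hypothetical. A real policy would come from the enterprise data strategy.

```python
from dataclasses import dataclass
from enum import Enum


class Heat(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    ICE = "ice"


@dataclass
class DatasetStats:
    name: str
    reads_last_30_days: int    # recent access frequency
    days_since_last_read: int  # how long the data has gone unused


def classify_heat(stats: DatasetStats) -> Heat:
    """Assign a heat level from access statistics (thresholds are assumptions)."""
    if stats.reads_last_30_days == 0 and stats.days_since_last_read > 365:
        return Heat.ICE
    if stats.reads_last_30_days >= 1000:
        return Heat.HOT
    if stats.reads_last_30_days >= 50:
        return Heat.WARM
    return Heat.COLD


# Storage profiles paraphrase the recommendations in the text above.
STORAGE_PROFILE = {
    Heat.HOT: "low capacity; high timeliness, stability, and availability",
    Heat.WARM: "somewhat larger capacity; high computing performance for analysis",
    Heat.COLD: "large, scalable capacity",
    Heat.ICE: "ultra-large capacity, ultra-low cost archive",
}


if __name__ == "__main__":
    samples = [
        DatasetStats("orders_today", reads_last_30_days=5000, days_since_last_read=0),
        DatasetStats("audit_logs_old", reads_last_30_days=0, days_since_last_read=900),
    ]
    for s in samples:
        heat = classify_heat(s)
        print(f"{s.name}: {heat.value} -> {STORAGE_PROFILE[heat]}")
```

Re-running the classification on a schedule is one way to keep heat levels updated as usage changes.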

Data archiving

Data archiving means moving data that has reached the end of its life cycle to low-performance, inexpensive storage. It is an essential step in data life cycle management. During normal operation, the transition of data heat from hot to warm, cold, and ice can itself be regarded as an archiving process.
Based on regulatory requirements and corporate strategy, define the boundaries between hot, warm, cold, and ice data, formulate a corporate data archiving strategy, and archive data according to that strategy.
Which data needs to be archived depends mainly on regulatory requirements and the enterprise's data strategy. Some key indicators for reference:

  • Aging data
  • Low-usage, high-volume data
  • Ice data with no data value
  • Data whose retention is mandated by corporate regulations
  • Data retained for its critical value, regardless of how likely it is to be used

Data archiving should also consider data structure reconstruction, changes in compression format, changes in accessibility, data recoverability, data comprehensibility, and metadata management.
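
The archiving indicators listed above could be checked with a small rule function like the following. This is a sketch only: the `Dataset` fields, the thresholds, and the name `archive_reasons` are hypothetical, since the text does not specify concrete limits.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dataset:
    name: str
    created: date
    size_gb: float
    reads_last_90_days: int
    retention_required: bool  # retention mandated by regulation or policy


# Hypothetical thresholds, for illustration only.
MAX_AGE_DAYS = 3 * 365
LOW_USAGE_READS = 10
LARGE_SIZE_GB = 500.0


def archive_reasons(ds: Dataset, today: date) -> list[str]:
    """Return the indicators that mark a dataset as an archive candidate."""
    reasons = []
    if (today - ds.created).days > MAX_AGE_DAYS:
        reasons.append("aging data")
    if ds.reads_last_90_days < LOW_USAGE_READS and ds.size_gb >= LARGE_SIZE_GB:
        reasons.append("low usage, high volume")
    if ds.retention_required:
        reasons.append("mandatory retention")
    return reasons


if __name__ == "__main__":
    logs = Dataset("logs_2019", date(2019, 1, 1), 800.0, 0, True)
    print(logs.name, archive_reasons(logs, today=date(2024, 1, 1)))
```

An empty result means none of the indicators apply, so the dataset stays on its current storage tier.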

Data destruction

As storage costs continue to fall, more and more enterprises have adopted a "keep all data" strategy, because from a business, management, and data-value perspective no one can know which data will be needed in the future. However, as data volume grows rapidly, storing data beyond business needs may not be a good choice in terms of value and cost. Some historical data can also expose an enterprise to legal risk, so data destruction remains an option that many enterprises should consider.

For data destruction, enterprises should have a strict management system, establish an approval process for data destruction, and create a strict data destruction checklist. Only data that has passed the checklist and been approved by the process can be destroyed.
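
The rule that data may be destroyed only after passing the checklist and the approval process can be sketched as a simple gate. The checklist items, approver roles, and names such as `DestructionRequest` and `can_destroy` are hypothetical; the text does not define them.

```python
from dataclasses import dataclass, field


@dataclass
class DestructionRequest:
    dataset: str
    checklist: dict[str, bool] = field(default_factory=dict)
    approved_by: list[str] = field(default_factory=list)


# Hypothetical checklist items and approver roles, for illustration only.
REQUIRED_CHECKS = (
    "retention_period_expired",
    "no_legal_hold",
    "business_owner_confirmed",
)
REQUIRED_APPROVERS = {"data_owner", "compliance"}


def can_destroy(req: DestructionRequest) -> bool:
    """Allow destruction only if every check passed and every role approved."""
    checklist_passed = all(req.checklist.get(item, False) for item in REQUIRED_CHECKS)
    approved = REQUIRED_APPROVERS.issubset(req.approved_by)
    return checklist_passed and approved


if __name__ == "__main__":
    req = DestructionRequest(
        dataset="logs_2015",
        checklist={item: True for item in REQUIRED_CHECKS},
        approved_by=["data_owner"],
    )
    print(req.dataset, "can be destroyed:", can_destroy(req))
```

A missing checklist item counts as failed, so a request cannot pass by omission.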

Learning record; source from: WeChat account biggata53o

Origin blog.csdn.net/qq_37432174/article/details/130777690