Data asset management: How to build a data asset catalog?

         After a top-down sorting of data from the business perspective and a bottom-up data inventory from the IT perspective, a freshly compiled list of data assets is finally available.

         Through a data asset inventory, a business finally knows what data it has, how the data is used, whether it is secure, and where it is stored. According to the author's observation, however, most data asset inventory work in the industry is still done manually and recorded in Excel. Do not underestimate this method. Manual inventory in Excel is mainstream precisely because it is simple, easy to use, convenient, flexible, and well suited to agile collaboration. A reminder to data product managers: if you can build a data inventory tool that is more flexible and easier to use than Excel, it will be very popular. Some people may ask: no matter how well the data is sorted and how clear the inventory is, the output is just a pile of Excel files, so doesn't it have little value to the enterprise?

       Yes. This is where the "data asset catalog" comes into play!

01 What is the data asset catalog?

        I often compare the data asset catalog to the table of contents of a book. The table of contents tells you what the book is about, how its content is structured, and how the author has organized the ideas. If you are interested in a certain chapter, the table of contents lets you find it quickly. In short, it provides an outline of the whole book.

        The same goes for the data asset catalog. It also serves as a "dictionary" that helps the enterprise's business and technical personnel quickly locate data, interpret it, and extract business value from it.

1. The essence of the data asset catalog

      A data asset catalog is essentially a repository of metadata that provides an inventory of all data assets within a specific scope, regardless of location or source. The data catalog includes key attribute information about the data assets, such as: name, business meaning, type, size, schema, and other relevant attributes.

      The data asset catalog supports data governance, including data classification and grading, data permission management, and identification of redundant and inconsistent data, and it lays the foundation for data lineage analysis and impact analysis.

2. Data asset catalog and data catalog

       Data asset catalogs and data catalogs are essentially the same: both are forms of metadata management.

       In project practice, the data catalog is also called the data resource catalog. It generally refers to the catalog formed by collecting metadata from the relevant data sources (business system databases, data warehouses, data lakes, etc.) through metadata management tools. Because the directly collected metadata is mostly technical, such as database table structures, data flows, ETL scripts, and database operation logs, a certain technical background is needed to understand the data catalog, and it is aimed at technical personnel.

       The data asset catalog is a subset of the data catalog. It takes more of a business perspective: driven by the data needs of stakeholders, it classifies and grades the data that is expected to bring value to the enterprise, and defines and creates business metadata, labels, authorizations, and so on. Please refer to: "Data Asset Management: How to manage enterprise data assets?"

02 Why is the data asset catalog so important?

        Being data-driven is an important means of enterprise digital transformation, and it requires business personnel to be able to quickly locate, fully understand, and effectively use data. As enterprise data volumes keep growing and data structures become more complex, data asset catalogs will play an increasingly important role in digital transformation.

1. The data asset catalog is crucial to business personnel

        Normally, managing, preparing, and analyzing data is seen as IT's business, and business users are confused by IT's technical language and tools. However, data can be transformed into useful information and valuable business insights that guide business improvement only if business personnel can find and understand it at any time. If key business decision-makers across departments cannot trust, understand, or find the data, they cannot use it to discover business problems and optimize the business.

       The data asset catalog is an organized list of data assets. It contains not only the technical metadata that IT personnel are familiar with, such as database tables, data structures, and data flows, but also key business attributes such as data definitions, synonyms, usage methods, storage locations, data owner, data manager, and data listing time. The data asset catalog gives business personnel a portal for understanding data, locating it in one place, and quickly accessing and evaluating it, which leads to faster and more effective data insight and analysis.

        A data asset catalog enables cross-department collaboration by identifying data owners, stewards, and subject matter experts, so business people know whom to turn to when they encounter urgent data issues. The data asset catalog shields the underlying technical complexity and provides the ability to query data lineage, so business users can understand where their data comes from and the full chain of data flow and processing without having to understand the underlying data collection, processing algorithms, and processes. With a data asset catalog, business users can communicate easily and make sure the right data is used correctly at the right time for maximum results.

2. The data asset catalog not only serves business personnel

       In addition to business personnel, users of the data asset catalog also include data analysts, data engineers, data scientists, data managers and CDOs, all of whom hope to have easy access to reliable data.

       Data analysts can understand and analyze existing data through the data asset catalog, such as data structure, data security and data quality, which greatly enhances data analysis and modeling capabilities.

        Data scientists can explore relevant data through the data asset catalog and gain more insights from the data by leveraging different data sets and building and evaluating more complex data models and algorithms.

        Data engineers can use the data asset catalog to check for issues along the data link, determine the impact of a given data change on the entire system, analyze the data structures of different data sets, establish mappings between business metadata and physical database table fields, and so on.

        Data administrators can view data status in real time through the data asset catalog, monitor data quality, control data access rights, define data standards for key data, and monitor compliance with standards, etc.

        For roles such as data owners and CDOs, a data asset catalog can help improve operational efficiency and reduce costs.

       Finally, a data asset catalog provides authorization and access control mechanisms for each user, making it easier for everyone to find and discover data across the enterprise at the level they have access to.

03 What are the functions of the data asset catalog?

       The data asset catalog is not a stand-alone system. It is an important component of data asset management, and it needs to be used together with other data management tools to deliver its value. According to the author's practice and observation, an excellent data asset catalog typically involves the following data management components.

1. Metadata collection

        The data asset catalog supports connecting to multiple data sources and extracting metadata from data sources with different structures, including locally deployed data sources, data sources in the cloud, IoT data sources, and unstructured data sources. Automated metadata collection helps users understand the data structures and relationships of the entire enterprise, and lets the enterprise automatically analyze and discover data that is hard to find but valuable.

2. Metadata management

       The data asset catalog should manage the collected metadata through classification and grading, association mapping, labeling, user-defined annotations, sensitive field identification, and similar functions, so that users can understand and find data more easily. The metadata here includes technical metadata and business metadata. Technical metadata describes the detailed storage location and structure of data, such as database, table, and column information, allowing IT staff to understand how the data is physically stored. Business metadata gives users a clear business context, including data definitions, synonyms, and business attributes, helping users understand how data relates to other data sets and discover data flows and dependencies.

3. Data lineage

        Data lineage refers to the end-to-end flow of data throughout the enterprise. As part of the data asset catalog, it tracks and traces data across its entire life cycle, showing where the data originated, how it was transformed, and who is using it. Data lineage is generally one of the important functions of metadata management. It records the relationships between systems, tables, views, fields, and so on, and presents them visually as a DAG (directed acyclic graph). Simply put, it shows visually where the data came from and which processes and stages it went through.
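        As a rough illustration of the DAG idea, the sketch below stores lineage as an adjacency list and walks it in both directions. The dataset names and the helper functions `reachable` and `invert` are hypothetical and only serve the example; a real catalog would load the edges from collected metadata.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed as its value.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["dw.customer_summary"],
    "staging.orders": ["dw.customer_summary"],
    "dw.customer_summary": ["bi.sales_dashboard"],
}


def reachable(graph, start):
    """Breadth-first walk returning every node reachable from start."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


def invert(graph):
    """Reverse every edge so the graph can be walked from target to source."""
    reversed_graph = {}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph.setdefault(target, []).append(source)
    return reversed_graph


# Impact analysis: what is affected if erp.orders changes?
downstream = reachable(LINEAGE, "erp.orders")
# Origin tracing: where does dw.customer_summary come from?
upstream = reachable(invert(LINEAGE), "dw.customer_summary")
print(downstream)
print(upstream)
```

        Walking the graph forward answers the impact question, and walking the inverted graph answers the origin question, so one edge list can serve both uses.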

4. Data standards

        For data to be transformed from a data resource into a data asset, it must be standardized and defined. A typical practice is a "business glossary". Through the data asset catalog, establishing the correlation mapping between data standards and technical metadata is an important means to achieve the implementation of data standards.
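        One possible way to picture the mapping between data standards and technical metadata is sketched below. The glossary term, the column names, and the `find_violations` helper are all assumptions made for illustration, not part of any particular product.

```python
# Hypothetical business glossary: term -> agreed standard.
GLOSSARY = {
    "Customer ID": {
        "definition": "Unique identifier of a customer",
        "data_type": "VARCHAR(20)",
    },
}

# Hypothetical mapping from glossary terms to physical columns.
TERM_TO_COLUMNS = {
    "Customer ID": ["crm.customers.cust_id", "erp.orders.customer_no"],
}

# Technical metadata as it might have been collected from the source systems.
COLUMN_TYPES = {
    "crm.customers.cust_id": "VARCHAR(20)",
    "erp.orders.customer_no": "INTEGER",
}


def find_violations(glossary, term_to_columns, column_types):
    """List the mapped columns whose type differs from the standard."""
    violations = []
    for term, columns in term_to_columns.items():
        expected = glossary[term]["data_type"]
        for column in columns:
            actual = column_types.get(column)
            if actual != expected:
                violations.append((term, column, expected, actual))
    return violations


violations = find_violations(GLOSSARY, TERM_TO_COLUMNS, COLUMN_TYPES)
for term, column, expected, actual in violations:
    print(f"{column} should be {expected} for '{term}', found {actual}")
```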

5. Data discovery

        Data asset catalogs enable self-service, allowing users to easily access and understand their data without relying on IT for support. Through automated data tagging, classification, and relationship mapping, users can use keywords, filters, query conditions, etc. to conduct data searches to locate, access, and query data. Data discovery also provides real-time visibility into the current state of the data, such as how the data is collected, integrated and used, and whether it is the latest data or out of date.
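        A minimal sketch of keyword and filter search over catalog entries, assuming each entry carries a name, a description, and a set of tags. The sample assets and the `search` function are hypothetical.

```python
# Hypothetical catalog entries with business descriptions and tags.
ASSETS = [
    {"name": "crm.customers", "description": "Customer master data",
     "tags": {"customer", "master-data"}},
    {"name": "erp.orders", "description": "Sales orders by customer",
     "tags": {"sales"}},
    {"name": "hr.employees", "description": "Employee records",
     "tags": {"hr", "sensitive"}},
]


def search(assets, keyword="", required_tags=frozenset()):
    """Return names of assets matching the keyword and carrying all tags."""
    keyword = keyword.lower()
    results = []
    for asset in assets:
        text = (asset["name"] + " " + asset["description"]).lower()
        if keyword in text and set(required_tags) <= asset["tags"]:
            results.append(asset["name"])
    return results


print(search(ASSETS, "customer"))
print(search(ASSETS, "customer", {"sales"}))
```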

6. Data application/approval

        The data asset catalog provides users with a metadata-based list of data assets, but not all users have global permissions on this list. Each data asset needs to be included in the data asset catalog after confirmation of rights and responsibilities. Only users within the scope of authority can access relevant data. The data asset catalog supports application/approval functions, providing users with an opportunity to access more data to improve the utilization of data assets.
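        The application/approval flow can be pictured with a small sketch like the one below. The `AccessControl` class and the user and asset names are hypothetical; a real catalog would add approver roles, audit logs, and expiry.

```python
class AccessControl:
    """Tracks which users may read which assets, plus pending requests."""

    def __init__(self):
        self.grants = set()   # (user, asset) pairs that are allowed
        self.pending = []     # outstanding (user, asset) requests

    def can_access(self, user, asset):
        return (user, asset) in self.grants

    def request(self, user, asset):
        """File an access request unless it is already granted or pending."""
        pair = (user, asset)
        if not self.can_access(user, asset) and pair not in self.pending:
            self.pending.append(pair)

    def approve(self, user, asset):
        """Grant a pending request; raises ValueError if none was filed."""
        self.pending.remove((user, asset))
        self.grants.add((user, asset))


acl = AccessControl()
acl.request("analyst_a", "hr.employees")
before = acl.can_access("analyst_a", "hr.employees")
acl.approve("analyst_a", "hr.employees")
after = acl.can_access("analyst_a", "hr.employees")
print(before, after)
```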

7. Data API service

       Users can find the data they need through the data asset catalog. The catalog not only tells you what the data is (definition), where it is (location), and how to access it (owner); it also generally provides a function for generating data service APIs from the catalog, helping users integrate and share data.

8. Data asset monitoring

       The data asset catalog provides a data asset monitoring function. It uses heat maps to show which data applications are of high value, and evaluates how data assets are used through indicators such as number of uses, who uses them, and effect evaluation. The catalog can then be reorganized according to actual usage to maximize the value of data assets.
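        As a sketch of how usage indicators might be derived, the example below counts uses and distinct users per asset from an access log. The log entries and the `usage_report` function are hypothetical, and the "effect evaluation" indicator is left out because it depends on business judgment.

```python
from collections import Counter

# Hypothetical access log: (user, asset) for every query against the catalog.
ACCESS_LOG = [
    ("analyst_a", "dw.customer_summary"),
    ("analyst_b", "dw.customer_summary"),
    ("analyst_a", "erp.orders"),
    ("analyst_a", "dw.customer_summary"),
]


def usage_report(log):
    """Return (asset, number of uses, distinct users), hottest asset first."""
    uses = Counter(asset for _, asset in log)
    users = {}
    for user, asset in log:
        users.setdefault(asset, set()).add(user)
    # Sort by descending use count; ties are broken by name for stability.
    ranked = sorted(uses, key=lambda asset: (-uses[asset], asset))
    return [(asset, uses[asset], len(users[asset])) for asset in ranked]


report = usage_report(ACCESS_LOG)
for asset, count, distinct_users in report:
    print(asset, count, distinct_users)
```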

04 Steps to build a data asset catalog

Step 1: Data asset inventory

        A data asset inventory uses systematic inventory methods to plan and comb through the company's data resources as a whole, with the goal of taking stock of what the enterprise owns. On the one hand, data resources are sorted out and planned from a business perspective, including interpreting policy documents, sorting out processes and forms, and identifying key data, in order to define the data classification system and the business attributes of data assets. On the other hand, system data is inventoried from a technical perspective, including data relationships, data structures, existing data volume, data growth, and storage methods, in order to sort out the technical attributes of data assets.

       The methods and steps of data asset inventory are described in detail in "Data Asset Management: How to inventory an enterprise's data assets?" and will not be repeated here.

Step 2: Data asset registration

        Based on the data inventory results, register the summary information of each data asset in the data asset catalog. Data asset registration mainly covers three aspects. The first is the business aspect: data asset name, data domain, data classification, data asset description, etc. The second is the technical aspect: data asset location (which system, which table), data asset type (structured/unstructured data), data asset access method (database/file/API interface), etc. The third is the management aspect: data asset owner (responsible department), data asset administrator, data asset listing time, data asset sharing conditions, etc. Data asset registration can be done manually (according to the author's observation, this is the most common approach at present) or through AI-based data asset identification.
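        The three aspects of registration could be captured in a record such as the following. This is only a sketch: the field names, the allowed values, and the `register` helper are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class DataAssetRegistration:
    # Business aspect
    name: str
    data_domain: str
    classification: str
    description: str
    # Technical aspect
    location: str            # which system, which table
    asset_type: str          # "structured" or "unstructured"
    access_method: str       # "database", "file" or "api"
    # Management aspect
    owner: str               # responsible department
    administrator: str
    listed_on: str           # ISO date the asset was listed in the catalog
    sharing_conditions: str


def register(catalog, entry):
    """Add an entry to the catalog after basic validation."""
    if entry.asset_type not in {"structured", "unstructured"}:
        raise ValueError(f"unknown asset type: {entry.asset_type}")
    if entry.name in catalog:
        raise ValueError(f"asset already registered: {entry.name}")
    catalog[entry.name] = entry
    return entry


catalog = {}
register(catalog, DataAssetRegistration(
    name="Customer master", data_domain="Customer", classification="Internal",
    description="One record per customer", location="CRM / customers",
    asset_type="structured", access_method="database", owner="Sales",
    administrator="data_admin_1", listed_on="2023-09-01",
    sharing_conditions="Approval by owner required",
))
print(sorted(catalog))
```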

Step 3: Collect metadata

        After the basic information of the data assets has been registered, the next key step is to collect their metadata. The data asset catalog uses metadata to identify data tables, files, and databases. Metadata collection crawls the company's databases and brings metadata (not the actual data) into the data asset catalog. Since data assets are distributed in different locations, the scope of metadata collection includes:

  • Relational databases - Oracle, SQL Server, MySQL, DB2, etc.

  • Data warehouse - Teradata, Greenplum, etc.

  • Object storage - object metadata.

  • Cloud platforms - Alibaba Cloud, Microsoft Azure Data Lake, AWS Athena and Redshift.

  • Non-relational/NoSQL databases - Cassandra, MongoDB.

  • Big data platforms - relevant metadata from the Hadoop platform.

  • BI platforms - Tableau, Power BI, domestic BI software, etc.

  • ETL tools - Kettle, DataStage, Informatica, etc.
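        To make "metadata, not actual data" concrete, here is a small sketch that reads table and column definitions from a database without touching the rows. It uses Python's built-in `sqlite3` module and an in-memory database purely as a stand-in; the `collect_metadata` function and the sample table are hypothetical, and each of the sources listed above exposes its own system catalog or API.

```python
import sqlite3


def collect_metadata(conn):
    """Return {table: [column descriptions]} without reading any rows."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    metadata = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            {
                "column": col[1],
                "type": col[2],
                "nullable": not col[3],
                "primary_key": bool(col[5]),
            }
            for col in columns
        ]
    return metadata


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT NOT NULL, region TEXT)"
)
conn.execute("INSERT INTO accounts (name, region) VALUES ('Acme', 'EU')")

meta = collect_metadata(conn)
for column in meta["accounts"]:
    print(column)
```

        The inserted row never appears in the result: only the structure of the table is brought into the catalog.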

Step 4: Mark data relationships

       Tagging relationships is an important step in managing data assets, because it lets users discover related data across multiple databases. For example, an analyst may need consolidated customer information and find, through the data asset catalog, that customer data exists in five different systems. With the help of the catalog, it is possible to build an experimentation area where all of this data can be connected and cleaned, and the merged customer data can then be used to achieve business goals.
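        One simple way to propose relationships is to link tables that share a known key column. The sketch below assumes that column names are consistent across systems, which is often not true in practice; the table names and the `tag_relationships` function are hypothetical.

```python
# Hypothetical technical metadata: table -> set of column names.
TABLES = {
    "crm.customers": {"customer_id", "name", "email"},
    "erp.orders": {"order_id", "customer_id", "amount"},
    "support.tickets": {"ticket_id", "customer_id", "status"},
    "hr.employees": {"employee_id", "name"},
}


def tag_relationships(tables, key_columns):
    """Link every pair of tables that share one of the given key columns."""
    relationships = []
    names = sorted(tables)
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            shared = tables[left] & tables[right] & set(key_columns)
            for column in sorted(shared):
                relationships.append((left, right, column))
    return relationships


links = tag_relationships(TABLES, {"customer_id"})
for left, right, column in links:
    print(f"{left} <-> {right} via {column}")
```

        Restricting the match to declared key columns avoids false links such as two unrelated tables that both happen to have a "name" column.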

[Figure: example of tagged relationships for the table "Accounts"]

Step 5: Establish data lineage

      After relationships have been marked, the data catalog builds lineage. A visual representation of data lineage helps trace data from source to destination and explains the different processes involved in the data flow. Based on data lineage, data analysts can trace the root cause of errors in an analysis. Typically, ETL (Extract, Transform, Load) tools are used to extract data from the source database, transform and clean it, and load it into the target database.

Some tools and techniques that can parse lineage include: SQL parsing, Alteryx, Informatica, Talend, etc.
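        To give a feel for lineage extraction by SQL parsing, the sketch below pulls source and target tables out of a simple INSERT ... SELECT statement with regular expressions. This is a deliberately naive illustration: the `parse_lineage` function is hypothetical, and real SQL (subqueries, CTEs, comments, quoted identifiers) needs a proper parser.

```python
import re


def parse_lineage(sql):
    """Return (source, target) edges for one simple INSERT ... SELECT."""
    target = re.search(r"\binsert\s+into\s+([\w.]+)", sql, re.IGNORECASE)
    if target is None:
        return []
    sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [(source, target.group(1)) for source in dict.fromkeys(sources)]


SQL = """
INSERT INTO dw.customer_summary
SELECT c.id, SUM(o.total)
FROM crm.customers c
JOIN erp.orders o ON c.id = o.customer_id
GROUP BY c.id
"""

edges = parse_lineage(SQL)
for source, target in edges:
    print(f"{source} -> {target}")
```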

Step 6: Data Asset Organization

        The collected metadata is arranged in a technical format and lacks Chinese annotations for tables and columns, which is not conducive to business personnel understanding the data. At this time, it is necessary to build a semantic layer based on these technical metadata and mark relevant data tables and columns in Chinese so that business personnel can discover, access and understand them.

  • Markup - create a data semantic layer

  • Organized by usage - data asset heat map

  • Organized by specific user usage - push to the user's data portal

  • Automated organization - organize data using advanced algorithms
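        The semantic layer mentioned above can be thought of as a mapping from technical names to business labels. A minimal sketch follows, with hypothetical table and column names and a hypothetical `apply_semantic_layer` function; it also shows how columns that still lack a label can be flagged for annotation.

```python
# Hypothetical business labels keyed by technical name.
LABELS = {
    "t_cust": "Customer",
    "t_cust.c_nm": "Customer name",
    "t_cust.c_rgn": "Sales region",
}


def apply_semantic_layer(technical, labels):
    """Attach business labels to {table: [columns]} technical metadata."""
    view = []
    for table, columns in technical.items():
        for column in columns:
            key = f"{table}.{column}"
            view.append({
                "technical_name": key,
                "table_label": labels.get(table, table),
                "column_label": labels.get(key),  # None means not yet labeled
            })
    return view


view = apply_semantic_layer({"t_cust": ["c_nm", "c_rgn", "c_upd_ts"]}, LABELS)
unlabeled = [row["technical_name"] for row in view if row["column_label"] is None]
print(unlabeled)
```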

Closing thoughts: Data governance and the data asset catalog

        Data governance defines the overall strategy of data management, stipulates the organization, system and process of data management, clarifies the ownership of data, defines data standards, and points out the direction for data asset management. The data asset catalog is the specific implementation of the data governance strategy, which displays the enterprise's data assets and locations in a business-friendly way, helping users better find, understand and use their data.

        The construction of a data asset catalog is an important part of data governance. Creating an accessible data asset catalog allows non-technical personnel to locate and utilize data throughout the enterprise, and automatically discover data sources in the enterprise system, including business, technology and process. Data lineage provides complete data transparency so users can understand the origins, processes and dependencies of data, as well as the flow of data from source to completion and consumption. As a result, users can quickly discover the impact of data, adapt it to enterprise business processes and make more informed data decisions.

       Building a data asset catalog is the prerequisite for self-service data preparation and self-service data analysis. With the data asset catalog, business data analysts can know what data resources or updated data assets the enterprise has available, who owns them, where they are located, and how to process them. Most importantly, the data asset catalog improves the speed and efficiency of locating and querying data, which promotes the use of data, helps obtain insights from it, and enhances the competitiveness of the enterprise.

Origin blog.csdn.net/iamonlyme/article/details/132744922