How to start big data governance from 0 to 1 (Part 2)

About the author

@Super超

Spatial computing and urban big data

A sci-fi fan shaping the future

Continuously updating the big data and data science series

The previous article covered the background, goals, and core ideas of big data governance. This article moves on to practice: how to implement big data governance, the steps to follow, and how to verify the results.

04 Implementation of Data Governance

1. Storage optimization

Data bloat is the first problem big data governance has to solve, because it is directly tied to cost. The solution is storage optimization: design standardized storage strategies and raise the degree of data sharing.

Thinking in terms of space:

The first keyword is merge, that is, merge redundant tables. On one hand, scan table dependencies: tables with similar upstream sources and similar fields are likely redundant, and only one needs to be kept. On the other hand, merge highly overlapping tables, folding small tables into large ones.
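As a rough illustration, here is a minimal Hive/MaxCompute-style SQL sketch of merging two highly overlapping tables into one; all table and field names (dws_order_stats_a, dws_order_stats_b, and so on) are hypothetical:

```sql
-- Hypothetical example: two highly overlapping summary tables built from
-- the same upstream source are merged into a single table.
CREATE TABLE IF NOT EXISTS dws_order_stats (
    order_id  STRING,
    shop_id   STRING,
    order_amt DOUBLE,
    pay_amt   DOUBLE,  -- field that previously existed only in table b
    ds        STRING
);

INSERT OVERWRITE TABLE dws_order_stats
SELECT a.order_id, a.shop_id, a.order_amt, b.pay_amt, a.ds
FROM dws_order_stats_a a
LEFT JOIN dws_order_stats_b b
  ON a.order_id = b.order_id AND a.ds = b.ds;

-- During the transition, a retired table can be exposed as a view over the
-- merged table so that downstream jobs keep working while they migrate.
CREATE VIEW dws_order_stats_a_compat AS
SELECT order_id, shop_id, order_amt, ds FROM dws_order_stats;
```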

The second keyword is discard, that is, discard redundant fields. Some fields have little storage value, or can be derived from other sources, and can be removed from the table.

The third keyword is split, that is, content compression. For example, a large JSON field can be split by a data-compression node into several content fields, discarding the format-related parts; when the original is needed, a data-decompression node reverses the process. On average, this frees about 30% of storage space.
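A minimal sketch of the split idea, again with hypothetical names: the JSON payload is flattened into plain columns with the standard get_json_object function, so the formatting overhead of the raw string is no longer stored.

```sql
-- Hypothetical example: ods_raw_event carries a large JSON blob; only the
-- fields that matter are extracted into plain columns.
INSERT OVERWRITE TABLE dwd_event_flat
SELECT event_id,
       get_json_object(payload, '$.user_id')  AS user_id,
       get_json_object(payload, '$.lng')      AS lng,
       get_json_object(payload, '$.lat')      AS lat,
       get_json_object(payload, '$.event_ts') AS event_ts
FROM ods_raw_event;
-- A "decompression" node would do the reverse: reassemble the JSON string
-- from these columns when the original format is required.
```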

Thinking in terms of time:

The first keyword is life cycle. Plan the data life cycle sensibly: different layers have different retention periods, and only some data needs to be stored permanently.

The second keyword is hot and cold. Cold data that no business currently calls should be compressed and archived.
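Both time-based keywords map to simple DDL; the statements below are a MaxCompute-flavored sketch with hypothetical table names (Hive's archive command uses a slightly different word order):

```sql
-- Set a life cycle so the platform reclaims partitions untouched for N days.
ALTER TABLE dwd_device_track SET LIFECYCLE 90;  -- e.g. 90 days for a DWD table

-- Archive-compress a cold partition that no business currently reads.
ALTER TABLE dwd_device_track PARTITION (ds = '20200101') ARCHIVE;
```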


Beyond these general strategies, different industries and data types have their own domain-specific governance strategies. For example, when a device stays in one position for too long, it reports a large number of repeated coordinates, which can be deduplicated before storage.

2. Computation optimization

The purpose of computation optimization is to save computing resources, speed up data processing, and shorten the data production cycle.

The first optimization point is to avoid wasting compute on abnormal data. Some data is well-formed but, by the definition of the business scenario, abnormal and ignorable. For example, once a faulty device is identified, the data it generates no longer participates in computation.
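A minimal sketch of the idea in the same hypothetical SQL style: records from devices on a maintained fault blocklist are filtered out before any expensive processing runs.

```sql
-- Hypothetical example: dim_faulty_device is a maintained blocklist of
-- devices identified as malfunctioning.
INSERT OVERWRITE TABLE dwd_device_track_clean
SELECT t.*
FROM dwd_device_track t
LEFT JOIN dim_faulty_device f
  ON t.device_id = f.device_id
WHERE f.device_id IS NULL;  -- keep only devices not on the blocklist
```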

The second optimization point is to identify and handle data skew. Data skew comes in two forms: the data of one region is far larger than other regions', or a few records are much larger than the average. Further segmenting the skewed portion speeds up the computation.
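One standard way to "further segment" a skewed key is two-stage aggregation with a salt; a sketch under hypothetical names:

```sql
-- Hypothetical example: city_id is heavily skewed toward a few hot cities.
-- Stage 1 spreads each key over 16 salt buckets; stage 2 merges the partials.
SELECT city_id,
       SUM(partial_cnt) AS record_cnt
FROM (
    SELECT city_id, salt, COUNT(1) AS partial_cnt
    FROM (
        SELECT city_id,
               CAST(RAND() * 16 AS INT) AS salt
        FROM dwd_device_track
    ) salted
    GROUP BY city_id, salt
) pre_agg
GROUP BY city_id;
-- Note: if task retries are a concern, prefer a deterministic salt such as
-- pmod(hash(device_id), 16) over RAND().
```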

The third optimization point is to improve the performance of core UDFs. UDF performance largely determines the length of the processing pipeline. Use code review to find the nodes worth optimizing; in addition, rewriting Python UDFs as Java UDFs can also win back some performance.

The fourth optimization point is engine configuration tuning, such as enabling data compression for transmission, setting a reasonable number of map/reduce tasks, and applying the Hash/Range Clustering index mechanisms appropriately.
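As an illustration only (the exact knobs depend on the engine in use), Hive-style session settings and a hash-clustered table might look like this:

```sql
-- Hypothetical tuning sketch; names and values vary by engine and workload.
SET hive.exec.compress.intermediate = true;  -- compress shuffle data
SET hive.exec.compress.output = true;        -- compress final output
SET mapreduce.job.reduces = 256;             -- size the reduce stage explicitly

-- Hash clustering: bucketing on the join/aggregation key lets the engine
-- avoid a shuffle for queries on that key.
CREATE TABLE dws_device_stats (
    device_id STRING,
    stat_cnt  BIGINT
)
CLUSTERED BY (device_id) SORTED BY (device_id) INTO 1024 BUCKETS;
```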

The fifth optimization point is rewriting MR streaming nodes with SELECT TRANSFORM. SELECT TRANSFORM performs well and is more flexible, which improves the scalability of computing nodes.

[Extension] Introduction to SELECT TRANSFORM

We often face scenarios where SQL built-in functions cannot turn data A into data B, so we implement the logic in a script and want it executed in a distributed fashion. SELECT TRANSFORM covers exactly this scenario.

SELECT TRANSFORM lets a SQL user start a child process, feed input data to it through stdin in a given format, and obtain output data by parsing the child process's stdout. It is very flexible: not only Java and Python, but also shell, Perl, and other scripts and tools are supported.
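A minimal Hive-style sketch (the script name parse_log.py and all columns are hypothetical): each input row is written to the script's stdin, and each line the script prints to stdout is parsed back into the declared output columns.

```sql
ADD FILE parse_log.py;  -- hypothetical script shipped alongside the job

SELECT TRANSFORM (raw_line)
USING 'python parse_log.py'
AS (device_id STRING, event_ts STRING, lng DOUBLE, lat DOUBLE)
FROM ods_raw_log;
```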

3. Tools to improve efficiency

Big data governance involves bringing large numbers of tables and nodes online, taking them offline, testing them, adding monitoring, and so on. Doing every step by hand would consume a great deal of manpower, so automated and semi-automated tools significantly improve efficiency and reduce labor costs.

The main tools involved are data comparison tools, batch node-offlining tools, automated testing tools, and the like.
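The heart of a data comparison tool is usually a check like the following sketch (hypothetical tables dws_order_stats_old/new): row counts and key metrics must agree between the old table and its governed replacement.

```sql
SELECT COALESCE(o.ds, n.ds) AS ds,
       COUNT(o.order_id)    AS old_rows,
       COUNT(n.order_id)    AS new_rows,
       SUM(CASE WHEN o.order_amt <> n.order_amt THEN 1 ELSE 0 END) AS amt_mismatches
FROM dws_order_stats_old o
FULL OUTER JOIN dws_order_stats_new n
  ON o.order_id = n.order_id AND o.ds = n.ds
GROUP BY COALESCE(o.ds, n.ds);
```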

05 Steps of Data Governance

Big data governance proceeds in parallel with normal business development, so a smooth transition process is required.

1. Gray-scale migration of incremental data

This step verifies that the governed data can be used normally by downstream data applications and meets the business side's needs. The main problems to solve are mapping fields between the old and new tables and backfilling data after field expansion.

Business is migrated on a gray-release basis: light, low-volume business first, more important business later. After each batch is migrated, keep tracking and analyzing data fluctuations and fix problems promptly to keep data quality reliable.
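One common way to handle the old-to-new field mapping during gray-scale migration is a compatibility view, sketched here with hypothetical names: consumers that still expect the old schema read the view until their migration batch completes.

```sql
CREATE VIEW dws_order_stats_legacy AS
SELECT order_id,
       order_amt AS amt,                      -- field renamed during governance
       CAST(NULL AS STRING) AS channel_code,  -- field dropped; placeholder only
       ds
FROM dws_order_stats;
```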

2. Migration of existing (stock) data

After the incremental data passes verification, the next step is migrating the existing data. Watch the storage footprint here: adding too much new data at once, before the old data can be released, sharply increases storage pressure.
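To keep the storage footprint flat, stock data can be moved partition by partition, releasing old partitions as each one is verified; a sketch with hypothetical names and a hypothetical date:

```sql
-- Copy one day of stock data into the governed table...
INSERT OVERWRITE TABLE dwd_event_new PARTITION (ds = '20201201')
SELECT event_id, user_id, lng, lat, event_ts
FROM dwd_event_old
WHERE ds = '20201201';

-- ...and once the comparison check passes for that day, release the old copy.
ALTER TABLE dwd_event_old DROP IF EXISTS PARTITION (ds = '20201201');
```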


06 Effect Verification of Data Governance

The effect of big data governance shows up in whether storage costs drop, whether the data production cycle shortens, whether data quality improves, and whether the growth of data volume slows.

Summary

The process of big data governance is a good opportunity to sort out existing business. A successful round of data governance not only brings cost and efficiency gains to the enterprise, but also trains the data team and lays the foundation for building a data value system.

