Data Governance Methodology: The Simple Way

How do you deal with all the kinds of "dirty data" in your hands?

Imagine you are a chef. You have vividly described to your eager guests the color, aroma, and flavor of the dish, explained every cooking technique in detail, and prepared the sauce for each delicacy, only to discover, just as you are about to cook, that the main ingredient has a problem.

A data analyst's role is much like a chef's. If the raw ingredients are bad, the chef cannot produce a dish with good aroma and flavor; if the data has problems, the analyst's conclusions naturally cannot be trusted. Even the best data analysis methodology, applied to distorted data, is built on sand, and the painstakingly constructed data system goes to waste.

On past projects I have often run into this situation: a customer built polished, professional data reports with Wing Hung's products, but because the underlying data was inaccurate, the value of the reports was undermined.

The first two articles in this series discussed how to analyze data metrics and how to build a systematic data architecture. This is the third article in the "Data Operations Methodology" series, and its core topic is data governance.

Part 1: "Data Analysis Methodology: The Simple Way"

Part 2: "Data System Construction Methodology: The Simple Way"

In many people's eyes, data governance is foundational grunt work, thankless and hard. But that is precisely why it cannot be ignored: lay a solid foundation, and the superstructure on top of it will be that much more stable.

Next, let's start with the types of dirty data and how to handle each.

I. Types of Dirty Data and How to Handle Them

First, let's look at the kinds of dirty data, to understand what problems we may be facing.

1. Missing data: some records are missing entirely, or some values within a record are missing (nulls), or both. There are many possible causes, human or system-induced. For null values, to avoid distorting the analysis, either exclude the nulls from the analysis or fill them in. The former shrinks the analyzed sample; the latter requires choosing a fill strategy based on the analysis logic, such as the mean, zero, a proportional value, or a random number. If entire records are missing and the business system still has them, re-import them through the system; if the business system does not have them either, the only options are manual backfill or giving up.
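The two strategies for null values can be sketched with pandas (a common choice here, not prescribed by the text; the column name is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [100.0, np.nan, 80.0, np.nan]})

# Option 1: leave nulls out of the analysis (shrinks the sample)
kept = df.dropna(subset=["price"])

# Option 2: fill nulls, here with the column mean (zero, a ratio, or a
# random number are other choices, depending on the analysis logic)
filled = df["price"].fillna(df["price"].mean())

print(len(kept))        # 2 rows survive
print(filled.tolist())  # [100.0, 90.0, 80.0, 90.0]
```

Which option is right depends on whether a smaller sample or an estimated value hurts the analysis less.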

2. Duplicate data: the same record appears multiple times. This case is relatively easy to handle: remove the duplicate records. The troublesome case is imperfect duplicates, for example two records for the same member that are identical in every field except the address. If there is a time attribute, keep the newer value; if there is no time attribute, there is no automatic way in, and only human judgment can resolve it.
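In pandas terms, the "keep the newer value" rule might look like this (table and field names are illustrative):

```python
import pandas as pd

members = pd.DataFrame({
    "member_id":  [1, 1, 2],
    "address":    ["old street", "new street", "elm road"],
    "updated_at": pd.to_datetime(["2023-01-01", "2023-06-01", "2023-03-01"]),
})

# Perfect duplicates would be simply: members.drop_duplicates()
# Imperfect duplicates: sort by the time attribute, keep the newest per key
deduped = (members
           .sort_values("updated_at")
           .drop_duplicates(subset="member_id", keep="last"))

print(set(deduped["address"]))  # member 1 keeps "new street"
```

Without an `updated_at`-style column, no such rule exists and the conflict has to go to a human.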

3. Erroneous data: data not recorded strictly according to the rules. For example, outliers: the price range is clearly below 100, yet a record says price = 200. Or wrong formats: a date field stored as a free-form string. Or inconsistent values: some records say "Beijing", some say "BJ", some say "beijing". Outliers can be identified and excluded with range checks. Format errors need their cause found and fixed at the system level. Inconsistent values are something the system cannot fix on its own, because they are not truly "errors": the system has no way to know that BJ and beijing are the same thing. Manual intervention is required. Build a cleaning table that states the mapping rules, with the raw value in the first column and the cleaned value in the second, join this rule table to the original table, and run analyses on the cleaned value. At best, approximation algorithms can automatically flag possibly inconsistent data.
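The two automatable fixes, a range check for outliers and a hand-maintained cleaning table joined onto the raw data, can be sketched like this (all values invented):

```python
import pandas as pd

# Range check: the legal price range is known to be 0-100
prices = pd.Series([50, 80, 200])
outliers = prices[(prices < 0) | (prices > 100)]
print(outliers.tolist())  # [200]

# Cleaning table: first column raw value, second column cleaned value
raw = pd.DataFrame({"city": ["Beijing", "BJ", "beijing", "Shanghai"]})
rules = pd.DataFrame({
    "city":       ["Beijing", "BJ", "beijing", "Shanghai"],
    "city_clean": ["Beijing", "Beijing", "Beijing", "Shanghai"],
})
# Join the rule table to the original table; analyze city_clean, not city
cleaned = raw.merge(rules, on="city", how="left")
print(cleaned["city_clean"].tolist())
```

The rule table itself is the manual part: someone has to decide, once, that BJ and beijing both mean Beijing.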

4. Unusable data: the data is correct, but cannot be used as-is. For example, an address stored as "Beijing Haidian Zhongguancun" when the analysis needs the district level, so "Haidian" should have been split out. This case is best solved at the source, which is exactly what data governance is for. As a remedy, keyword matching can help, but it is not guaranteed to resolve every value.
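The keyword-matching remedy might look like the sketch below. The district list here is a made-up sample, which is precisely why this method is not guaranteed to resolve every address:

```python
# Hypothetical, incomplete district list
DISTRICTS = ["Haidian", "Chaoyang", "Dongcheng"]

def extract_district(address):
    """Return the first known district found in the address, else None."""
    for district in DISTRICTS:
        if district in address:
            return district
    return None  # unmatched addresses stay unresolved

print(extract_district("Beijing Haidian Zhongguancun"))  # Haidian
print(extract_district("Beijing Tongzhou somewhere"))    # None
```

Splitting the address into separate fields at entry time, as discussed under input constraints below, avoids needing this remedy at all.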

II. What BI Requires of Data

Next, let's look at what BI requires of data. Bridging the gap between the dirty-data types above and these requirements is exactly the job of data governance.

1. Structured: the data must be structured. This may sound obvious, but if the data is large blocks of text, such as microblog posts, BI cannot do quantitative analysis on it; that calls for semantic analysis with word segmentation, as in the often-mentioned public opinion analysis. Unlike BI's quantitative calculations, which are one hundred percent exact, semantic analysis is probabilistic: human language is ever-changing, people themselves cannot guarantee full understanding of it, and a system even less so. All it can do is push accuracy as high as possible.

2. Standardized: the data must be sufficiently standardized. That sounds vague; put simply, it means solving the dirty-data problems listed above and washing all the dirty data into "clean data".

3. Associable: if you want to run a correlation analysis on two dimensions or metrics, those two dimensions/metrics must either sit in the same table, or sit in two tables that can be joined through related fields.
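For example, with two tables that share a key (all names and numbers below are invented), a join makes two metrics from different tables associable:

```python
import pandas as pd

orders  = pd.DataFrame({"member_id": [1, 2, 3], "amount": [120, 80, 200]})
members = pd.DataFrame({"member_id": [1, 2, 3], "age":    [25, 40, 31]})

# The shared member_id field lets the two tables be joined...
joined = orders.merge(members, on="member_id")

# ...so amount and age can now be analyzed together
corr = joined["amount"].corr(joined["age"])
print(joined.shape)  # (3, 3)
```

Without a shared or related field, no join is possible and the correlation simply cannot be computed.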

III. Principles of Data Governance

The dirty-data handling methods described earlier are only stopgap remedies, and applying them is long-term, painful work that consumes a great deal of time and manpower. To fix the dirty-data problem at the root, you still need the standardization work of data governance.

Simply put, data governance means constraining input and standardizing output.

1. Constrain input: you never know what a user will type, so do not leave users too much room for improvisation; do the work of constraining them. Fields the user must fill in should be marked "required" in the system. The system should validate entries at submission time: wrong formats and out-of-range values must force the user to re-enter. Values with a fixed set of options must be chosen from a list, never typed by hand. Design entry forms with fields as atomic as possible: for the address example above, split it into country, province, city, district, street address, and so on, to avoid having to split it apart later. Also try to unify the tables that entered data is saved into, rather than producing many tables holding the same data, which leads to duplication problems.
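A minimal submission-time validation sketch, pulling the checks above together (field names, option list, and ranges are all illustrative):

```python
import re

CITY_OPTIONS = {"Beijing", "Shanghai", "Shenzhen"}  # fixed-option list

def validate_entry(entry):
    """Return a list of problems; an empty list means the entry may be saved."""
    errors = []
    # Required fields must be present and non-empty
    for field in ("name", "city", "signup_date", "price"):
        if not entry.get(field):
            errors.append(f"{field} is required")
    # Fixed options: must come from the list, never free text
    if entry.get("city") and entry["city"] not in CITY_OPTIONS:
        errors.append("city must be chosen from the option list")
    # Format check: date must be YYYY-MM-DD
    if entry.get("signup_date") and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", entry["signup_date"]):
        errors.append("signup_date has a wrong format")
    # Range check: price must stay inside its legal range
    if entry.get("price") is not None and not (0 < entry["price"] < 100):
        errors.append("price out of range")
    return errors

bad = {"name": "Li", "city": "BJ", "signup_date": "2023/01/01", "price": 200}
print(validate_entry(bad))  # three problems: option, format, range
```

Rejecting bad values at the door is far cheaper than cleaning them downstream.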

2. Standardize output: when the boss looks at reports made by different people that all carry a "return rate" metric, yet the value differs from report to report, the boss's heart collapses; not knowing whom to believe, he can only curse everyone. Calculation errors aside, this is usually caused by inconsistent statistical definitions. So unify the semantics: build a company-level semantic dictionary (not the database's data dictionary). Every metric name that appears in any report must be registered in the semantic dictionary, and the dictionary must clearly define its statistical meaning. Metrics with different statistical definitions must take different names. Any term not yet in the semantic dictionary must go through a process of applying to register the new term before it can be used.
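A toy version of the registration rule, with invented metric names, just to make the "one name, one meaning" constraint concrete:

```python
# Company-level semantic dictionary: metric name -> statistical definition
semantic_dict = {}

def register_metric(name, definition):
    """Register a metric; reject a name that is already taken."""
    if name in semantic_dict:
        return False  # same name with a different meaning is not allowed
    semantic_dict[name] = definition
    return True

print(register_metric("return_rate", "returned orders / shipped orders"))  # True
print(register_metric("return_rate", "refund amount / sales amount"))      # False
print(register_metric("refund_rate", "refund amount / sales amount"))      # True
```

In practice the "dictionary" can be a shared document and an approval process; the point is the uniqueness rule, not the tooling.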

IV. Putting Data Governance into Practice

Handling dirty data calls for ETL tools; the semantic dictionary does not necessarily need a dedicated system. In fact, because such systems tend to be overly complex, successful domestic implementations are rare; Excel plus a set of rules can achieve good results.

As for the rollout strategy, put simply: the boss must sign off and mandate enforcement; then pick a willing department as a pilot, and then scale up. Whichever department lands first gets to name metrics in the way that best fits its own habits, effectively claiming the spots. Departments that follow must comply with the earlier standards; a metric with the same name but a different meaning has to find another name. Latecomers, naturally, will not be as enthusiastic.

The above is the distilled version of the data governance methodology. We all know this is dirty work, but let me also remind you: the later you start, the more it hurts. With this experience in hand, when designing a new business system you can fully take data governance standards into account from the outset.


Origin www.cnblogs.com/zwt20120701/p/11408834.html