On cleaning enterprise legal person master data

      In early December, with epidemic pressure building around a certain airport in the capital, my colleagues and I were urgently called to the group headquarters. The task: "clean the organizational master data across the group's four unified construction systems and identify the differences."

       When I first saw this assignment, my head started buzzing. Department-level organizational data cannot strictly be called "master data": departments with the same or similar functions may well be named differently in each unit, and as the business changes, departments are split and merged far more often than legal entities are closed or relocated. Worst of all, the data volume is large: for a unit of our size, tens of thousands of organizational records is easily reached.

       At this point, friends swimming in the "big data ocean" may well laugh at us: tens of thousands of records, is that even a problem? True, against big data measured in petabytes, tens of thousands of rows are nothing. But the leader's requirement was "precise data cleaning": these records were to serve as primary keys in SQL calculations, not merely as join keys, let alone as modeling features. Bluntly, the cleaned data had to be absolutely accurate and usable. Simply removing duplicates and nulls would not do, the fields are all strings, and you cannot just average everything out. The difficulty can be imagined.

       Fortunately, the first-level requirement turned out to be only the cleaning of the "legal person master data", and the weight on my heart finally lifted (my original plan to escape within a week might still come true). When our unit ran the master data project, it clearly defined legal person master data: every company within the group's legal consolidation scope, that is, the top level of the organization, the company itself. Although some equity participations, joint ventures, and even branches and offices were also required to be brought under the system's jurisdiction, there are fortunately only a few thousand records in the current master data management system, so "precise data cleaning" could still be completed in a short time.

----------------------------------------------------------------------------------------------------------------------------------------------

      Next, I will describe how my colleagues and I carried out this "purely manual data cleaning". Take it as a spark of inspiration to share with friends, in the hope that we all make progress together:

1. Preparation

1. Data source acquisition

This part was simple: just ask the system owners for exports and save yourself the ETL. Everything we received was in Excel.

2. Data cleaning tool selection

Because the data volume was small and the source format was fixed, we chose Excel this time.

3. Version management

For a small project like this, SVN is overkill. Still, since two people were working on the same data source, we named each file version with a date + time suffix to keep version management under control.

In addition, we created three folders for the data to be compared from the three unified construction systems (finance, investment, and personnel) and saved the intermediate data separately; a consolidation folder held the integrated data; and finally a "for the leader" folder held the material for reporting.

4. Data dictionary

This cleaning pass kept only the legal person's code, name, management level, name of the superior management unit, legal person (equity) level, name of the superior holding (shareholding) unit, and remarks.

2. Data cleaning process

1. Data preprocessing

1.1 Leading and trailing spaces

Not much to say here; this step is trivial.
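For readers who would rather script it than use Excel's TRIM, a minimal pandas sketch follows; the file name and the column name `legal_person_name` are placeholders, not the actual fields we used.

```python
import pandas as pd

# Read the exported spreadsheet as text so codes keep their leading zeros.
# "finance_legal_persons.xlsx" and "legal_person_name" are illustrative names.
df = pd.read_excel("finance_legal_persons.xlsx", dtype=str)

# Strip leading/trailing whitespace (Python's strip also removes full-width spaces).
df["legal_person_name"] = df["legal_person_name"].str.strip()
```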

1.2 Special symbol check

To be clear, domestic companies rarely put special symbols in their registered names (thanks to the industry and commerce authorities!), but after the data has been through information systems and ETL, special symbols inevitably creep into the source data and have to be handled. Foreign companies were left alone for now; a separate article on naming conventions for foreign companies will follow later.
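If you prefer to flag such rows in code rather than eyeball them, a sketch like the following works; the character whitelist is an assumption and should be adjusted to your own naming rules.

```python
import re

# Anything outside CJK characters, ASCII letters/digits, brackets and a few
# punctuation marks that legitimately appear in company names is treated as "special".
SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9()（）&.\- ]")

df["has_special_symbol"] = df["legal_person_name"].str.contains(SPECIAL, na=False)
```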

1.3 Bracket check

Anyone who has dealt with this kind of data will know the feeling: full-width versus half-width brackets are very unfriendly to many functions, which treat them as different characters unless you define a function to normalize them. Here all brackets were converted to full-width in batch. Rows changed in this way were labeled: if two names match once the brackets are normalized, the difference is only a bracket problem, and the record moves on to the next step.
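In Excel this was a batch find-and-replace; the same normalization in pandas might look like this, converting to full-width as described above and labeling the rows that changed:

```python
# Normalise half-width brackets to full-width so "公司(北京)" and "公司（北京）" compare equal.
normalized = (df["legal_person_name"]
              .str.replace("(", "（", regex=False)
              .str.replace(")", "）", regex=False))

df["bracket_only_difference"] = normalized != df["legal_person_name"]
df["legal_person_name"] = normalized
```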

1.4 Hidden formula check

This, too, is a result of all the tossing around: check whether any Excel cells contain hidden formulas, and convert them to plain values if they do.
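One way to locate such cells programmatically is to load the workbook with openpyxl without evaluating formulas; any cell whose stored value starts with "=" is a formula. The file name is again a placeholder.

```python
from openpyxl import load_workbook

# data_only=False keeps formulas as strings instead of their cached results.
wb = load_workbook("finance_legal_persons.xlsx", data_only=False)
ws = wb.active

formula_cells = [cell.coordinate
                 for row in ws.iter_rows()
                 for cell in row
                 if isinstance(cell.value, str) and cell.value.startswith("=")]

print(formula_cells)  # cells to convert to plain values before cleaning
```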

1.5 Data consistency check

Define the fields and data types in advance, and reformat any source data whose types do not match.
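As a sketch of the same idea in pandas: agree on the column list and types up front, read everything as text, and then coerce, so values that do not fit the expected type surface immediately. The column names below are illustrative stand-ins for the data dictionary above.

```python
EXPECTED_COLUMNS = ["code", "legal_person_name", "management_level",
                    "parent_unit_name", "equity_level",
                    "holding_unit_name", "remarks"]

# Keep only the agreed fields, in the agreed order.
df = df.reindex(columns=EXPECTED_COLUMNS)

# Coerce the one numeric field; anything non-numeric becomes NaN and gets reviewed.
df["equity_level"] = pd.to_numeric(df["equity_level"], errors="coerce")
```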

2. Null value processing

We originally thought this step could be skipped: these are systems built by the group itself, so how could primary-key-level fields contain nulls? But no! They really do! Whatever the reason, rows with a null legal person name were extracted and excluded from the scope of data integration.
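The equivalent step in pandas, setting the offending rows aside for reporting rather than silently dropping them (the output file name is illustrative):

```python
# Treat genuine NaN and empty/whitespace-only strings as null names.
is_null_name = df["legal_person_name"].isna() | (df["legal_person_name"].str.strip() == "")

excluded_nulls = df[is_null_name]
df = df[~is_null_name]

excluded_nulls.to_excel("excluded_null_names.xlsx", index=False)  # evidence for the report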

3. Data deduplication

Having not skipped the previous step, we did this one too. Sure enough, after sorting and some function work, there really were duplicate legal person records. Really! After consulting the system administrators and the relevant business departments, these were also excluded from the scope of data integration.
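A sketch of the same deduplication in pandas, keeping every duplicated row in a separate file so the business side can decide what to do with them:

```python
# keep=False marks every occurrence of a duplicated name, not just the later ones.
duplicated_rows = df[df.duplicated(subset=["legal_person_name"], keep=False)]
duplicated_rows.to_excel("duplicated_legal_persons.xlsx", index=False)

# Keep only the first occurrence in the working set.
df = df.drop_duplicates(subset=["legal_person_name"], keep="first")
```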

4. Data comparison

Use functions to compare the source data against the master data, pick out the objects that differ, that is, the legal persons whose names do not match, and save them for data integration.
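In Excel this was done with lookup functions; a pandas outer join with an indicator column performs the same comparison. File and column names are placeholders.

```python
master = pd.read_excel("master_data_legal_persons.xlsx", dtype=str)

compared = df.merge(master, on="legal_person_name", how="outer",
                    indicator=True, suffixes=("_source", "_master"))

# "left_only" = only in the source system, "right_only" = only in the master data.
differences = compared[compared["_merge"] != "both"]
differences.to_excel("name_differences.xlsx", index=False)
```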

3. Data integration

The next step was the more painstaking (read: tedious) part: label every type of difference in the data. Don't ask me why; the labels are what the business staff need in order to handle the records.

1. Inconsistent names

The first type of difference we found was "the enterprise exists, but the name in the source system does not match the name in the master data". There are several reasons for this:

1.1 Inconsistent Chinese and English

Companies overseas and in China's Hong Kong, Macao, and Taiwan regions are usually registered under English names, and the business staff maintaining the system very likely translated those English names into Chinese. Yes, translated, and quite possibly word for word!

1.2 Abbreviations or errors in English names

Kong Yiji asked how many ways there are to write the character for "fennel" in fennel beans: four, it seems. Can you imagine how many ways there are to write the word "Limited"? And what about its synonyms?
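Before comparing English names it helps to collapse these spellings into one canonical token. The sketch below does that for a few common variants of the "Limited" suffix; the variant list is illustrative and far from exhaustive.

```python
import re

def normalize_english_name(name: str) -> str:
    """Lower-case, collapse common spellings of the 'Limited' suffix, drop punctuation."""
    name = name.lower()
    name = re.sub(r"\bco\.?,?\s*ltd\.?|\bltd\.?|\blimited\b", " limited ", name)
    name = re.sub(r"[.,]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

print(normalize_english_name("ABC Trading Co., Ltd."))  # abc trading limited
print(normalize_english_name("ABC Trading LIMITED"))    # abc trading limited
```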

1.3 Old names retained

The company has changed its name, but the old name is still in the system. Resolving this really takes skill: without a unified social credit code, a Dun & Bradstreet (D-U-N-S) number, or a tax number, it takes real craft to link two records whose names are completely different.

2. Non-existent corporate legal person

2.1 Records that are not legal persons

Believe me, Finance really will throw an account name into the legal person master data, and Personnel will throw in something like @#¥%……&*().

2.2 Incomplete names

For a regional branch of a company with a very long name, there can be a helpless situation: the name field in a certain AP system is only 40 characters wide, which is not enough, so the name gets truncated. Unable to see the rest of the name, the business staff simply kept quiet about it for months!
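A crude but useful programmatic check: any name whose length exactly equals the field limit is suspicious. The 40-character limit below is taken from the example above and is an assumption about that particular system.

```python
FIELD_LIMIT = 40  # assumed width of the source system's name field

# Names that exactly fill the field were very likely cut off mid-way.
df["maybe_truncated"] = df["legal_person_name"].str.len() == FIELD_LIMIT
print(df.loc[df["maybe_truncated"], "legal_person_name"])
```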

2.3 Deregistered companies

This one is easy: just check the national organization registration website to confirm whether the company still exists.

4. Data interpretation

After labeling the integrated data, we verified it with the business side item by item, produced simple tables and visualizations, and drafted suggestions for the next round of handling of each type of difference, for reporting to company leadership.

------------------------------------------------------------------------------------------------------------------------------------

       Rumor has it that before the leader summoned us, he had scolded us, saying our work had not been done properly and had led to a whole series of problems. Still, after three days of hard fighting, when the cleaned and consolidated data was finally presented to him, we received some praise, which at least gave an account of the earlier work. We are not professional data engineers; what carried us through this job was accumulated work experience and a sense of responsibility. Two brief reflections below, to sum it up:

Understand the business, do more with less

The colleague who came with me this time is the brother on the project team responsible for customer, supplier, and legal person master data. He has worked on master data for more than five years, has seen every kind of bizarre and tangled situation, and has trained a near-magical skill: from a few English keywords he can tell you a unit's Chinese name, its surviving status and management relationship, and its legal person level! This saved us a lot of time when cleaning the data of overseas companies. He can also pull together information for data integration from all kinds of tools and websites based on a company's operating situation. That is really not something you can do by grabbing just any programmer. As the saying goes, a programmer who knows the business is unstoppable!

Eighteen martial arts, proficient in every one

Many data analysts get a headache when facing data cleaning, especially string-level processing, and they avoid the generalization of continuous data whenever they can. Friends of mine working abroad even boast that the data they collect is already well integrated and can go straight into analysis and modeling without any cleaning. But be clear: all of those situations have preconditions. Either you have a fairly good system and the collected data is of high quality, or you have a large team with someone dedicated to doing this work for you. In China, neither condition is usually met. So those of us who aspire to be data analysts must be proficient in every data-related technique, keep enriching ourselves, and keep expanding our technology stack; otherwise we can only stay at the stage of simple "data exploration", churning out reports from unintegrated data to cope with business requests. To uncover the real value of data, and to improve our own abilities, we must keep pushing ourselves forward: the so-called eighteen martial arts, proficient in every one.

The above views represent only my own shallow understanding, and I welcome corrections for anything improper. Shared with everyone; please credit the source and author when reprinting.

Wishing everyone all the best in 2021!




