ETL: processing a bank credit card application project with Kettle

1. Project overview

Check the information of everyone who applies for a credit card on a given day, add risk notes for applicants who do not meet the requirements, and classify the risk-free applicants by location for delivery.

Information sources:

1. Web: application through the bank's web page

2. Mobile: online banking, mobile banking

3. Third parties: various portal websites, mobile apps

4. Counter: staffed counter, ATM, CRS

5. Salesperson: on-site promotion

Main table preview:

Information cleaning process:

Information acquisition → information entry → deduplication → code replacement → add/supplement the corresponding fields → risk notes → rebuild the information table → output as table/Excel/SQL/CSV according to actual needs

4. Purpose of data cleaning:

(1) Salespeople cannot interpret some of the data, so useless data is removed

(2) The analysis department/risk control department analyzes the data, carries out risk analysis and risk control, and confirms whether each applicant meets the card issuance requirements

5. Risk Remarks:

Carry out a risk assessment for applicants whose details do not match the information they filled in, and add a risk note. The note should state which risk item(s) the applicant triggered.

*Age risk: verify the age against the ID card information, and add a risk note when they do not match

*Household registration risk: check the household registration against the ID card information, and add a risk note when they do not match

*Address risk: based on the region information, add a risk note for addresses that cannot be verified

*Education risk: add a risk note for applicants who claim a master's degree but are under 22, or a doctoral degree but are under 24

*Salary risk: add a risk note for applicants with an annual salary above 200,000 (20W)

*Gender risk: check the gender against the ID card information, and add a risk note when they do not match
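The education and salary rules above can be sketched in a few lines of Python. This is only an illustration of the rule logic, not the Kettle transformation itself; the field names (`education`, `age`, `annual_salary`) and the degree labels are assumptions made for the example.

```python
# Thresholds taken from the rules listed above; field names are hypothetical.
MIN_AGE_BY_DEGREE = {"master": 22, "doctorate": 24}
MAX_ANNUAL_SALARY = 200_000  # "20W" = 200,000


def risk_notes(applicant):
    """Return the list of risk items triggered by one applicant record."""
    notes = []
    min_age = MIN_AGE_BY_DEGREE.get(applicant["education"])
    # Education risk: the applicant is younger than the minimum age for the degree.
    if min_age is not None and applicant["age"] < min_age:
        notes.append("education risk")
    # Salary risk: the stated annual salary is above the threshold.
    if applicant["annual_salary"] > MAX_ANNUAL_SALARY:
        notes.append("salary risk")
    return notes
```

Returning a list keeps every triggered item, which matches the requirement that the note explains which risk items appear for the applicant.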

6. Sub-tables:

1. Analysis department/risk control department: both the records without risk labels and the records with risk labels

2. Salesperson: split the table by region → salesperson (for that region) → issue cards
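Splitting the risk-free records by region is a plain group-by. A minimal sketch, assuming each record is a dict with a hypothetical `region` field:

```python
from collections import defaultdict


def group_by_region(records):
    """Group applicant records into one list per region."""
    tables = defaultdict(list)
    for record in records:
        tables[record["region"]].append(record)
    return dict(tables)
```

Each resulting list could then be written out as its own table for the salesperson responsible for that region.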

7. Data export:

Team leader or department manager → analysis department/risk control department → salesperson

Interpretation of ID card information:

1. The first and second digits represent the province level: provinces, autonomous regions, municipalities directly under the central government, and special administrative regions

2. The third and fourth digits represent the city level: prefecture-level cities, autonomous prefectures, leagues, and the summary codes for the districts and counties of municipalities directly under the central government. Of these, 01-20 and 51-70 represent cities under provincial jurisdiction; 21-50 represent prefectures (autonomous prefectures, leagues).

3. The fifth and sixth digits represent the county level: municipal districts, county-level cities, counties, and banners. 01-18 represent municipal districts, or county-level cities under a prefecture (autonomous prefecture, league); 21-80 represent counties (banners); 81-99 represent county-level cities directly under the province.

4. The seventh to fourteenth digits are the birth date code, giving the year, month, and day of the holder's birth

5. The fifteenth to seventeenth digits are the sequence code: the sequence number assigned to people born on the same year, month, and day within the area identified by the address code. An odd seventeenth digit is assigned to men and an even one to women.

6. The last digit is a check code, which is calculated by the numbering unit according to a unified formula.

7. Ⅹ is the Roman numeral for 10. When the check code works out to 10, X is written in its place so that the ID number keeps the 18-character length set by the national standard
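The field layout above can be checked with a short Python sketch. It slices an 18-character ID number into region code, birth date, and gender, and recomputes the check character with the standard weighted sum modulo 11. The function name and the returned keys are made up for this example, and the number used in the usage note is purely illustrative.

```python
from datetime import date

# Weights for the first 17 digits and the check characters indexed by (sum mod 11).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def parse_id(id_number):
    """Split an 18-character ID number into its fields and verify the check code."""
    if len(id_number) != 18 or not id_number[:17].isdigit():
        raise ValueError("expected 17 digits followed by a check character")
    total = sum(int(d) * w for d, w in zip(id_number[:17], WEIGHTS))
    return {
        "region_code": id_number[:6],  # digits 1-6: province, city, county
        "birth_date": date(
            int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14])
        ),  # digits 7-14
        "gender": "male" if int(id_number[16]) % 2 == 1 else "female",  # digit 17
        "check_ok": CHECK_CHARS[total % 11] == id_number[17].upper(),  # digit 18
    }
```

For the illustrative number `11010519491231002X`, the weighted sum is 167, 167 mod 11 is 2, and the character at index 2 is `X`, so `check_ok` is true.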

2. Project preparation

Review the project documents and take notes as you go, so that you can add to them at any time:

1. Analyze the relationship between the main table and the sub-tables, and write SQL statements to gather statistics

2. Choose an appropriate extraction and transformation method according to the data volume

For example, replace the code stored in a main-table field with the corresponding data from the secondary table, and output the main table

Choose a method to suit the project:

1. Excel: select the column >>> Ctrl+F >>> Replace (P) >>> set "Find what" to the code >>> set "Replace with" to the replacement data >>> Search (S): By Columns >>> Replace All

2. SQL query statement: select b.type from main_table a join mapping_table b on a.type = b.code; (main_table and mapping_table stand for the main table and the data matching table)

3. SQL table creation statement: export the table as SQL statements, do a batch replace, and finally run the SQL

4. Kettle: Value Mapper + change the field type

5. Database connection: 2 Table input steps + a Sort rows step for each + Merge join + Select values + Excel output

7. Database lookup + Mapping (sub-transformation)

...
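Whichever tool is used, the code replacement itself is a lookup from code to value, as in the join in option 2 or the value mapping in option 4. A minimal Python sketch of the same idea follows. The function and argument names are hypothetical, and leaving unmatched codes unchanged is an assumption of this example.

```python
def replace_codes(rows, field, code_table):
    """Return copies of rows with rows[field] replaced through code_table.

    Codes that have no entry in code_table are left unchanged (an assumption).
    """
    return [
        {**row, field: code_table.get(row[field], row[field])}
        for row in rows
    ]
```

Because the rows are copied rather than modified in place, the original main table stays available for comparing the output against the input.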

Following the project requirements, run and verify the transformation step by step, and set up mappings where appropriate

The figure below shows the mapping set up for the company location; where it is placed is shown further down for comparison. Note: get it working first, then improve it

Use Switch/case to match the data coming from the main table

A database query (Database lookup) links the lookup data to the main table

Configure the Mapping input specification: list the fields passed in from the main processing table

Main processing table of the project:

After the replacement, use a Select values step to pick out the fields that need changing, for example to rename or remove them, then write the result to a newly created Excel output for comparison and verification. If everything checks out, move on to the next step

The data to extract and convert from the ID card number is as follows:

Add an Add constants step and rename it "add year". Its main purpose is to verify that the age obtained by subtracting the birth year on the ID card from the set year matches the age the applicant filled in

Perform the subtraction

Use a Value Mapper step to compare the 17th (second-to-last) digit of the ID card number with the stated gender to check for gender risk.
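The age and gender checks described above amount to one subtraction and one parity test. A minimal sketch under stated assumptions: the function name, the argument names, and the `"male"`/`"female"` labels are invented for the example, and the age is computed as a plain year difference to mirror the constant-year subtraction.

```python
def age_gender_risks(id_number, filled_age, filled_gender, reference_year):
    """Compare the age and gender implied by the ID number with the stated values."""
    notes = []
    # Age check: reference year (the added constant) minus the birth year (digits 7-10).
    birth_year = int(id_number[6:10])
    if reference_year - birth_year != filled_age:
        notes.append("age risk")
    # Gender check: an odd 17th digit means male, an even one means female.
    id_gender = "male" if int(id_number[16]) % 2 == 1 else "female"
    if id_gender != filled_gender:
        notes.append("gender risk")
    return notes
```

A plain year difference ignores whether the birthday has already passed in the reference year, so a stricter check would compare full dates.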

In the next step, a database query (Database lookup) compares the first 6 digits of the ID card number against the second table to verify whether the resulting region is consistent with the region the applicant filled in

Use a Filter rows step to apply the condition: records that satisfy it are output to the true table, and records that do not are output to the false table
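The lookup-then-filter pattern can be sketched as follows. The region table here is a plain dict from 6-digit code to region name, standing in for the second table; the field names `id_number` and `region` are assumptions for the example.

```python
def split_by_region_match(records, region_table):
    """Split records into (matched, mismatched) by comparing the region
    looked up from the first 6 ID digits with the region the applicant stated."""
    matched, mismatched = [], []
    for record in records:
        looked_up = region_table.get(record["id_number"][:6])
        if looked_up == record["region"]:
            matched.append(record)      # the "true" output
        else:
            mismatched.append(record)   # the "false" output, to be risk-labeled
    return matched, mismatched
```

A code that is missing from the region table is treated as a mismatch here, so records that cannot be verified also land in the false output.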

Check for age risk: is the age consistent?

 

Use an Add constants step to attach the risk label

Check for household registration risk: is the household registration consistent?

 

Check for address risk at the area level

 

Check for address risk at the street level

 

Check for education risk: is the stated qualification implausible? Cases such as false reports need further handling

Check for salary risk: using a threshold based on the local average level, label and verify salaries that exceed it

Confirm the gender: has a false gender been reported?

 

Records that pass all the checks are output to the risk-free table

 

Records classified as risky are output to the risk table

 END

Origin blog.csdn.net/qq_53521409/article/details/126689503