File data deduplication example

In data processing work, it is sometimes necessary to remove duplicate data from a file, or to keep only the duplicate data. This article introduces several methods for deduplicating entire rows and deduplicating by key columns, for both small and large files, and provides code examples written in esProc SPL. esProc is a professional data calculation engine, and SPL has a complete library of set-operation functions, which makes it well suited to file deduplication and keeps the code concise.

 

1. Small files

1.1 Deduplication of the entire line

Suppose a text file in which each line is a string, and only one copy of each repeated line should be kept. To do this, read each line of the file as a string to form a set, and then apply a set deduplication operation to get the result.

Example: paint.txt records the student ID and name of each student who signed up for the painting interest class. Some students may have registered several times. Delete the duplicate registrations and save the result in paint1.txt. Part of the original file is as follows:
    20121102-Joan
    20121107-Jack
    20121113-Mike
    20121107-Jack

The esProc SPL script is as follows:


    A                                     Comment
1   =file("e:/txt/paint.txt").read@n()    Read each line of paint.txt to form a set
2   =A1.id()                              Remove duplicate members from the set in A1
3   =file("e:/txt/paint1.txt").write(A2)  Write the deduplicated lines in A2 to the file paint1.txt
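
As a cross-check of the logic outside esProc, the same set deduplication can be sketched in Python. This is an illustrative sketch only, not part of the SPL script: the function name `dedupe_lines` is hypothetical, and it keeps lines in first-seen order, which may differ from the order that `id()` returns in esProc.

```python
def dedupe_lines(lines):
    """Return the distinct lines, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))


# The four sample lines shown for paint.txt above
sample = ["20121102-Joan", "20121107-Jack", "20121113-Mike", "20121107-Jack"]
print(dedupe_lines(sample))
# ['20121102-Joan', '20121107-Jack', '20121113-Mike']
```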

 

1.2 Comparison of key columns

Suppose a file with multiple columns of data, where the first row holds the column names and the data records start from the second row. The key columns are compared, and either the rows with duplicate key column values are removed or only the duplicate rows are kept.

The examples use the 2018 sales order table order_2018.xlsx. Part of the data is as follows:

    ..

 

1.2.1 Remove duplicates

Example 1: Find all the distinct IDs of customers who purchased products in 2018 and save them in the file 2018c.xlsx.

The esProc SPL script is as follows:


    A                                              Comment
1   =file("e:/txt/order_2018.xlsx").xlsimport@t()  Read the 2018 order data
2   =A1.id(CustomerId)                             Get all distinct customer IDs
3   =file("e:/txt/2018c.xlsx").xlsexport(A2)       Save the customer IDs to the file 2018c.xlsx

 

Example 2: Find which distinct products each customer purchased in 2018, and save CustomerId and ProductId in the file 2018c_p.xlsx.

The esProc SPL script is as follows:


    A                                                                  Comment
1   =file("e:/txt/order_2018.xlsx").xlsimport@t(CustomerId,ProductId)  Read the key columns of the 2018 order table
2   =A1.group@1(CustomerId,ProductId)                                  Group by the key columns; @1 takes only the first record of each group
3   =file("e:/txt/2018c_p.xlsx").xlsexport@t(A2)                       Save the result in A2 to the file 2018c_p.xlsx
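
The two key-column examples above can be sketched in Python as follows, for readers who want to check the logic outside esProc. The sketch assumes the rows are already loaded as dictionaries. The `OrderId` column, the sample values, and the function names are hypothetical, since the article elides the sample data; only `CustomerId` and `ProductId` come from the text.

```python
def distinct_values(rows, key):
    """Distinct values of one column, sorted."""
    return sorted({row[key] for row in rows})


def first_per_key(rows, keys):
    """Keep the first row seen for each combination of key column values."""
    first = {}
    for row in rows:
        k = tuple(row[c] for c in keys)
        first.setdefault(k, row)  # later rows with the same key are ignored
    return list(first.values())


# Hypothetical sample rows standing in for order_2018.xlsx
orders = [
    {"OrderId": 1, "CustomerId": "C01", "ProductId": "P1"},
    {"OrderId": 2, "CustomerId": "C02", "ProductId": "P1"},
    {"OrderId": 3, "CustomerId": "C01", "ProductId": "P1"},
    {"OrderId": 4, "CustomerId": "C01", "ProductId": "P2"},
]

print(distinct_values(orders, "CustomerId"))
# ['C01', 'C02']
print([r["OrderId"] for r in first_per_key(orders, ["CustomerId", "ProductId"])])
# [1, 2, 4]
```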

 

1.2.2 Only keep duplicates

Example: Find the orders of repeat customers in 2018 (that is, customers who bought the same product more than once), and save the results in the file 2018c_rebuy.xlsx.

The esProc SPL script is as follows:


    A                                                 Comment
1   =file("e:/txt/order_2018.xlsx").xlsimport@t()     Read the 2018 order data
2   =A1.group(CustomerId,ProductId)                   Group together the orders of the same customer for the same product
3   =A2.select(~.count()>1).conj()                    Select the groups with more than one order and concatenate their orders into one table
4   =file("e:/txt/2018c_rebuy.xlsx").xlsexport@t(A3)  Save the result in A3 to the file 2018c_rebuy.xlsx
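
The group, filter, and concatenate steps of the script above can be sketched in Python like this. It is a minimal illustration under the assumption that all rows fit in memory. The function name, the `OrderId` column, and the sample values are hypothetical.

```python
from collections import defaultdict


def repeated_groups(rows, keys):
    """Return every row whose key column values occur more than once."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in keys)].append(row)
    # Keep only groups with more than one member, then flatten them
    return [row for members in groups.values() if len(members) > 1 for row in members]


# Hypothetical sample rows standing in for order_2018.xlsx
orders = [
    {"OrderId": 1, "CustomerId": "C01", "ProductId": "P1"},
    {"OrderId": 2, "CustomerId": "C02", "ProductId": "P1"},
    {"OrderId": 3, "CustomerId": "C01", "ProductId": "P1"},
    {"OrderId": 4, "CustomerId": "C01", "ProductId": "P2"},
]

rebuy = repeated_groups(orders, ["CustomerId", "ProductId"])
print([r["OrderId"] for r in rebuy])
# [1, 3]
```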

 

 

2. Large files

The data in a large file cannot be loaded into memory all at once, so it cannot be read in full and compared for duplicates the way a small file can. Instead, the data must be read and compared in batches. esProc SPL provides cursors for operating on large files, which makes large-file deduplication convenient.

2.1 Deduplication of the entire line

Suppose a large text file in which each line is a string, and only one copy of each repeated line should be kept. To do this, read each line of the file as a string that becomes one record in a cursor, and then apply the cursor's deduplication operation to get the result.

Example: The large file all.txt holds a national registration form of real estate owners, recording the identity card number and name of each owner. Part of the data is as follows:
    510121198802213364-Joan
    110113199203259852-Jack
    201264197206271113-Mike

Since some people own real estate in multiple states, the file contains duplicate registrations. Keep only one copy of each duplicate registration and save the result in all2.txt. The esProc SPL script is as follows:


    A                                    Comment
1   =file("e:/txt/all.txt").cursor@s()   Create a cursor; @s reads each entire line as a single string field
2   =A1.groupx(_1)                       Group by the single field of the cursor to remove duplicate lines
3   =file("e:/txt/all2.txt").export(A2)  Write the deduplicated result to the file all2.txt
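
One general way to deduplicate a file that does not fit in memory is an external sort: sort fixed-size batches and spill them to temporary files, then merge the sorted runs and drop adjacent duplicates. The Python sketch below illustrates that idea under stated assumptions. It is not a description of how esProc implements `groupx`, and `dedupe_large_file` and `batch_size` are hypothetical names.

```python
import heapq
import itertools
import os
import tempfile


def dedupe_large_file(src_path, dst_path, batch_size=100_000):
    """Write the distinct lines of src_path to dst_path, in sorted order."""
    run_paths = []
    try:
        # Phase 1: read one batch at a time, sort and dedupe it, spill it to disk
        with open(src_path, encoding="utf-8") as src:
            while True:
                batch = [line.rstrip("\n") for line in itertools.islice(src, batch_size)]
                if not batch:
                    break
                fd, path = tempfile.mkstemp(suffix=".run", text=True)
                with os.fdopen(fd, "w", encoding="utf-8") as run:
                    run.writelines(line + "\n" for line in sorted(set(batch)))
                run_paths.append(path)

        # Phase 2: merge the sorted runs lazily and skip adjacent duplicates
        runs = [open(path, encoding="utf-8") for path in run_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                previous = None
                for line in heapq.merge(*runs):
                    if line != previous:
                        dst.write(line)
                        previous = line
        finally:
            for run in runs:
                run.close()
    finally:
        # Remove the temporary run files whether or not the merge succeeded
        for path in run_paths:
            os.remove(path)
```

A call such as `dedupe_large_file("e:/txt/all.txt", "e:/txt/all2.txt")` would mirror the script above. Only one batch plus one line per run is held in memory at a time, although a very large input produces many run files that are all open during the merge.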

 

2.2 Key column comparison

This section again uses the sales order table as an example. Here it is orders.xlsx, the consolidated sales order table for all years, which is a large file.

2.2.1 Remove duplicates

Example 1: Find all the distinct IDs of customers who purchased products and save them in the file customers.xlsx.

The esProc SPL script is as follows:


    A                                               Comment
1   =file("e:/txt/orders.xlsx").xlsimport@tc()      Create a cursor over the order table data
2   =A1.groupx(CustomerId)                          Group by CustomerId to get the distinct customer IDs
3   =file("e:/txt/customers.xlsx").xlsexport@t(A2)  Save the customer IDs to the file customers.xlsx

 

Example 2: Find which distinct products each customer has purchased, and save CustomerId and ProductId in the file c_p.xlsx.

The esProc SPL script is as follows:


    A                                           Comment
1   =file("e:/txt/orders.xlsx").xlsimport@tc()  Create a cursor over the order table data
2   =A1.groupx(CustomerId,ProductId)            Group by the key columns to get the distinct CustomerId and ProductId pairs
3   =file("e:/txt/c_p.xlsx").xlsexport@t(A2)    Save the result in A2 to the file c_p.xlsx
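
When the file is too large for memory but its distinct key values are not, batch-wise deduplication can be sketched in Python with a set of keys already seen. This is an illustrative sketch, not the esProc implementation: the function name and the sample batches are hypothetical, and the assumption that the distinct keys fit in memory is not stated in the article.

```python
def distinct_keys(batches, keys):
    """Yield each distinct combination of key values once, in first-seen order."""
    seen = set()
    for batch in batches:  # each batch is a list of row dictionaries
        for row in batch:
            k = tuple(row[c] for c in keys)
            if k not in seen:
                seen.add(k)
                yield k


# Hypothetical batches standing in for successive reads from orders.xlsx
batches = [
    [{"CustomerId": "C01", "ProductId": "P1"}, {"CustomerId": "C02", "ProductId": "P1"}],
    [{"CustomerId": "C01", "ProductId": "P1"}, {"CustomerId": "C01", "ProductId": "P2"}],
]

print(list(distinct_keys(batches, ["CustomerId", "ProductId"])))
# [('C01', 'P1'), ('C02', 'P1'), ('C01', 'P2')]
```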

 

2.2.2 Only keep duplicates

Example: Find the orders of repeat customers (that is, customers who have purchased the same product more than once), and save the results in the file c_rebuy.xlsx.

The esProc SPL script is as follows:


    A                                                                       Comment
1   =file("e:/txt/orders.xlsx").xlsimport@tc().sortx(CustomerId,ProductId)  Create a cursor over the order table data and sort it by the key columns
2   =A1.group(CustomerId,ProductId)                                         Group together the orders of the same customer for the same product
3   =A2.select(~.count()>1).conj()                                          Select the groups with more than one order and concatenate their orders into one table
4   =file("e:/txt/c_rebuy.xlsx").xlsexport@t(A3)                            Save the result in A3 to the file c_rebuy.xlsx
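
Sorting by the key columns first means that orders with the same key are adjacent, so each group can be handled on its own without holding the whole file. A minimal Python sketch of that idea follows. The function name, the `OrderId` column, and the sample values are hypothetical, and the in-memory `sorted` call only stands in for an external sort of a large file.

```python
from itertools import groupby
from operator import itemgetter


def repeated_orders(sorted_rows, keys):
    """Yield the rows of every key group that has more than one member.

    sorted_rows must already be ordered by the key columns.
    """
    key_of = itemgetter(*keys)
    for _, members in groupby(sorted_rows, key=key_of):
        members = list(members)  # only one group is held in memory at a time
        if len(members) > 1:
            yield from members


# Hypothetical sample rows standing in for orders.xlsx
orders = [
    {"OrderId": 1, "CustomerId": "C01", "ProductId": "P1"},
    {"OrderId": 2, "CustomerId": "C02", "ProductId": "P1"},
    {"OrderId": 3, "CustomerId": "C01", "ProductId": "P1"},
    {"OrderId": 4, "CustomerId": "C01", "ProductId": "P2"},
]

by_key = sorted(orders, key=itemgetter("CustomerId", "ProductId"))
print([r["OrderId"] for r in repeated_orders(by_key, ["CustomerId", "ProductId"])])
# [1, 3]
```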

 There are more examples of related calculations in the SPL CookBook.





Origin blog.51cto.com/12749034/2552108