Data Import and Preprocessing, Chapter 7: Data Cleaning Tool OpenRefine

Data cleaning tool OpenRefine

Introduction to OpenRefine

OpenRefine is a free, open-source, and powerful tool for cleaning data. It helps users clean data before using it and shows each operation on the data directly in a browser-based interface, which makes it a good choice for users with limited programming skills.

OpenRefine is a visual tool written in Java. Users can clean and format data directly in its interface. It runs on Windows, Linux, and macOS and offers its interface in multiple languages, including English, Chinese, and Japanese, so users without professional programming skills can clean data quickly.

Download and install

Download the installation package openrefine-3.7.2.zip and decompress it.

Double-click "openrefine.exe" to start OpenRefine. If no Java environment is configured on the computer, the "Download Java for Windows" page opens in the default browser. If Java is already installed, the OpenRefine interface opens.

Configuration

To make OpenRefine smooth and convenient to use later, two basic settings should be configured before first use: the interface language and the memory allocation. Increasing the memory avoids import problems caused by very large data sets in later operations.

Language settings

Increase memory
On Windows, OpenRefine allocates 1 GB of memory by default. If the data being processed needs more, you can increase the memory available to OpenRefine by modifying the configuration items in the openrefine.l4j.ini file.

If you allocate 2 GB of memory or more, the configured Java environment must be a 64-bit version; otherwise OpenRefine will fail to start after you edit the openrefine.l4j.ini file.
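The memory change described above can be sketched in Python. This is only a minimal sketch: it assumes that openrefine.l4j.ini contains a JVM heap option line of the form `-Xmx1024M`, which the article does not show, and both the helper name `set_max_heap` and the sample file contents are hypothetical.

```python
import re


def set_max_heap(ini_text: str, megabytes: int) -> str:
    """Return ini_text with the JVM max-heap option set to the given size.

    Assumes the file holds a line such as "-Xmx1024M" (an assumption about
    the layout of openrefine.l4j.ini, not something shown in the article).
    """
    # Match the whole option at the start of a line, leaving any trailing
    # whitespace and the line ending untouched.
    pattern = re.compile(r"^-Xmx\d+[MmGg](?=[ \t\r]*$)", flags=re.MULTILINE)
    if not pattern.search(ini_text):
        raise ValueError("no -Xmx line found; edit the file by hand instead")
    return pattern.sub(f"-Xmx{megabytes}M", ini_text)


# Hypothetical file contents, used only for illustration.
sample = "# max memory heap size\n-Xmx1024M\n"
updated = set_max_heap(sample, 2048)
print(updated)
```

In practice the same change can be made by opening the file in a text editor; the sketch only makes explicit which value is being changed.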

Create project


Note that OpenRefine displays the first 10 rows of data by default. To change the number of rows shown, click one of the numbers (5, 10, 25, 50) after the "Display" option at the top of the page.

Column operations

Common operations include:
collapsing a column,
moving and rearranging columns,
removing a single column or several columns,
renaming a column.

Collapse column


After a column is collapsed, a blank column appears in its place. Click the blank column to restore the original column (name2 in this example).

Moving and rearranging columns

OpenRefine can move a single column or several columns at a time to rearrange the data columns. A single column can be moved in four ways: "move column to the beginning", "move column to the end", "move column left", and "move column right".

Selecting [Rearrange/Remove Columns] opens a window for rearranging several columns at once.

The left side of the window displays the headers of all columns in order. You can rearrange them by dragging the column headers to the corresponding position.

Click OK to apply the new order. If a column such as name2 does not appear afterwards, it may still be collapsed.

Removing a single column and removing several columns

OpenRefine provides two removal operations:
"Remove this column" removes the currently selected single column;
"Remove columns" removes unneeded columns in a batch.

After "Remove this column" is applied to the "gender" column, a column titled "gender" no longer exists in the current project.

To remove columns in a batch, choose "Remove columns", select the columns to remove, and click OK. After the "name2" and "nation" columns are removed this way, columns with those titles no longer exist in the current project.

Redefine column headers

If the column heading does not clearly convey the meaning of the column data, you can redefine the column heading by renaming the column.

Undo and redo


Export data

Although an OpenRefine project supports moving, removing, and renaming columns, these operations do not modify the original data, because OpenRefine works on a copy of it. For the column operations to take effect outside the project, you need to export the modified data.
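The copy-then-export behaviour described above can be illustrated with a small Python sketch that uses only the standard library. It is not how OpenRefine is implemented. The column names come from the Athletes_info project in this article, while the rows and the new header `athlete_name` are made up for illustration.

```python
import csv
import io

# Made-up sample standing in for the original file; only the column names
# (name, name2, gender, nation, event) come from the article.
original_csv = (
    "name,name2,gender,nation,event\n"
    "Athlete A,A,F,Country X,Swimming\n"
    "Athlete B,B,M,Country Y,Fencing\n"
)

# Importing copies the data into the project; the source text is untouched.
rows = list(csv.DictReader(io.StringIO(original_csv)))

# Column operations on the project copy: remove two columns, rename one
# (hypothetical new header), and move "event" to the beginning.
new_order = ["event", "name", "gender"]
renamed = {"name": "athlete_name"}
project = [
    {renamed.get(col, col): row[col] for col in new_order}
    for row in rows
]

# Exporting writes the modified copy out as new data.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=list(project[0].keys()), lineterminator="\n"
)
writer.writeheader()
writer.writerows(project)
exported_csv = buffer.getvalue()

print(exported_csv)
```

The original text still has all five columns after the operations; only the exported text reflects the changes.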


OpenRefine can export data as a project, an HTML table, an Excel file, an ODF spreadsheet, and other formats. Note that the "Export Project" option exports the project as a compressed archive in the openrefine.tar.gz format.


Note that later chapters continue to use the Athletes_info project to demonstrate the operation steps. To keep the project's data intact, all operations performed on the Athletes_info project are undone here.

Advanced operation

Data sorting

Data sorting is a common data cleaning operation. It arranges the data in a specified order, which makes it easier to check and correct the data and to spot characteristics or trends that offer clues for solving the problem at hand.
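As a rough illustration of why the sort type matters, the following Python sketch sorts the same column as text and as numbers. The rows, the `score` column, and the helper `sort_rows` are hypothetical, and placing blank cells last is a choice made for this sketch.

```python
# Made-up rows; "score" is a hypothetical numeric column stored as text.
rows = [
    {"name": "Athlete C", "score": "9"},
    {"name": "Athlete A", "score": "10"},
    {"name": "Athlete B", "score": ""},
]


def sort_rows(rows, column, as_number=False, reverse=False):
    """Sort rows by one column, always keeping blank cells at the end."""
    filled = [r for r in rows if r[column] != ""]
    blank = [r for r in rows if r[column] == ""]
    convert = float if as_number else str
    filled.sort(key=lambda r: convert(r[column]), reverse=reverse)
    return filled + blank


by_text = [r["name"] for r in sort_rows(rows, "score")]
by_number = [r["name"] for r in sort_rows(rows, "score", as_number=True)]

print(by_text)    # "10" sorts before "9" when compared as text
print(by_number)  # 9 sorts before 10 when compared as numbers
```

Sorting a numeric column as text is a common source of confusing results, so it is worth checking the column type before sorting.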

Data classification

Data classification (called faceting in the English interface) is one of OpenRefine's most commonly used functions. It selects a subset of the data, allowing users to view the data from multiple angles without changing the data itself. OpenRefine supports several kinds of classification: text, numerical, timeline, scatter plot, and custom.

Text classification groups rows by the text values in a column. Open the drop-down menu of the event column in the Athletes_info project and select [Classification] → [Text Classification]. The "Category/Filter" panel on the left side of the page then shows the classification results.
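Conceptually, text classification counts how often each distinct value occurs in a column and lets you select the rows for one value. A minimal Python sketch of that idea follows; the event values are made up, and the code does not reflect OpenRefine's internals.

```python
from collections import Counter

# Made-up values for the "event" column.
events = ["Swimming", "Fencing", "Swimming", "Diving", "Fencing", "Swimming"]

# The panel lists each distinct value together with its row count.
facet = Counter(events)
for value, count in sorted(facet.items()):
    print(value, count)

# Clicking a value in the panel narrows the view to the matching rows;
# the data itself is not changed.
selected_rows = [i for i, value in enumerate(events) if value == "Swimming"]
print(selected_rows)
```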

Numerical classification

Custom classification

Duplicate detection

To delete duplicate values in the name column, first sort the data that contains the duplicates, then apply the duplicates classification and delete the rows for which it returns true.


The duplicate detection function in the OpenRefine tool only works with text type data.
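The idea behind duplicate detection can be sketched in Python: flag every value that occurs more than once. The names below are made up and both helpers are hypothetical. `drop_repeats` keeps the first occurrence of each value, which is one common goal of deduplication and not necessarily the exact procedure shown in this article.

```python
from collections import Counter


def duplicates_flags(values):
    """Return True for every value that occurs more than once."""
    counts = Counter(values)
    return [counts[v] > 1 for v in values]


def drop_repeats(values):
    """Keep only the first occurrence of each value, preserving order."""
    seen = set()
    kept = []
    for v in values:
        if v not in seen:
            seen.add(v)
            kept.append(v)
    return kept


# Made-up text values for the "name" column.
names = ["Ann", "Bob", "Ann", "Cho"]

flags = duplicates_flags(names)
unique_names = drop_repeats(names)
print(flags)
print(unique_names)
```

Note that the flags mark every copy of a repeated value, including the first one, so deleting all flagged rows would remove the value entirely.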

Data filling

Data filling fills empty cells with specified characters or numbers so that the data is complete.
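A minimal Python sketch of filling empty cells with a specified value is shown here. The helper `fill_blanks`, the sample values, and the filler "Unknown" are all hypothetical.

```python
def fill_blanks(values, filler):
    """Replace None and empty or whitespace-only cells with filler."""
    return [
        filler if v is None or str(v).strip() == "" else v
        for v in values
    ]


# Made-up values for the "nation" column with two empty cells.
nation = ["Country X", "", None, "Country Y"]
filled = fill_blanks(nation, "Unknown")
print(filled)
```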


Text filtering

Text filtering quickly finds the rows that contain a specific string.
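The matching behind a text filter can be sketched in Python as a substring or regular-expression search. The helper `text_filter`, its `case_sensitive` and `regex` parameters, and the sample values are assumptions made for this sketch.

```python
import re


def text_filter(values, query, case_sensitive=False, regex=False):
    """Return the values that contain query (or match it as a regex)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(query if regex else re.escape(query), flags)
    return [v for v in values if pattern.search(v)]


# Made-up values for the "event" column.
events = ["Swimming", "Synchronized swimming", "Fencing"]

print(text_filter(events, "swim"))
print(text_filter(events, "swim", case_sensitive=True))
print(text_filter(events, "^F", regex=True))
```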

Data conversion

The data conversion function converts a column of data into a specified type as needed.

Commonly used conversions include trimming leading and trailing whitespace, collapsing consecutive whitespace, converting to title case, converting to uppercase, converting to lowercase, and converting to text.
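For readers who want to see what these conversions do to a value, here are rough Python equivalents. The function names and the sample string are hypothetical, and the results may differ from OpenRefine's in edge cases.

```python
import re


def trim(value):
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value):
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


def to_titlecase(value):
    """Capitalize the first letter of each word."""
    return value.title()


raw = "  li   NA "  # made-up cell value
cleaned = collapse_whitespace(trim(raw))

print(repr(cleaned))
print(to_titlecase(cleaned))
print(cleaned.upper())
print(cleaned.lower())
print(repr(str(42)))  # converting a number to text
```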


Note that when you write an expression in Python, it must contain a return statement.
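One way to see why the return statement matters is to treat the expression as the body of a function that receives the cell value. The Python sketch below simulates that. It rests on an assumption about how the expression is evaluated, and `run_expression` is a hypothetical helper, not part of OpenRefine.

```python
def run_expression(body, value):
    """Run body as if it were the body of a function taking the cell value."""
    indented = "\n".join("    " + line for line in body.splitlines())
    source = "def _expr(value):\n" + indented
    namespace = {}
    exec(source, namespace)
    return namespace["_expr"](value)


# With a return statement, the converted value comes back.
with_return = run_expression("return value.strip().upper()", "  swimming ")

# Without one, the function body runs but yields None.
without_return = run_expression("value.strip().upper()", "  swimming ")

print(with_return)
print(without_return)
```

A Python function that lacks a return statement returns None, so an expression without one produces no usable result.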

Summary

This article introduces the data cleaning tool OpenRefine, covering its installation, configuration, project creation, and other basic operations, followed by column operations and advanced operations.

Origin blog.csdn.net/m0_38139250/article/details/134664186