Data cleaning tool OpenRefine
Introduction to OpenRefine
OpenRefine is a free, open source, and powerful tool for cleaning data. It helps users clean data before using it and shows each operation on the data through an interface that runs in the browser, which makes it a good choice for users with limited programming skills.
OpenRefine is a visual tool written in Java. Users can clean and format data directly in the interface. It supports Windows, Linux and macOS, and its interface is available in multiple languages, including English, Chinese and Japanese, so users without professional programming skills can clean data quickly.
Download and install
Download the installation package
openrefine-3.7.2.zip
After decompression, it is as follows:
Double-click "openrefine.exe" to start the OpenRefine tool. If no Java environment is configured on the current computer, the "Download Java for Windows" page opens in the default browser. If Java is already installed, the OpenRefine interface opens.
Configuration
To make later work with OpenRefine smooth, two basic settings should be configured before first use: the interface language and the memory limit. Increasing the memory avoids import problems with very large data sets in later operations.
Language settings
Increase memory
On Windows, OpenRefine allocates 1 GB of memory by default. If the data to be processed needs more, you can raise the limit through the configuration file.
To do so, modify the memory setting in the openrefine.l4j.ini file.
If you allocate 2 GB or more, the configured Java environment must be a 64-bit version; otherwise OpenRefine will fail to start after openrefine.l4j.ini has been edited.
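As a rough illustration of the edit, the sketch below rewrites the maximum-heap option in the text of an ini file. It assumes the limit appears on its own line in the form `-Xmx1024M`; check your own copy of openrefine.l4j.ini before relying on that. The function name `set_max_heap` and the sample text are hypothetical.

```python
import re

def set_max_heap(ini_text: str, megabytes: int) -> str:
    """Return ini_text with its -Xmx line set to the given size in MB."""
    # Assumption: the heap limit sits on its own line as "-Xmx<number>M" (or G).
    pattern = re.compile(r"^-Xmx\d+[MmGg][ \t]*$", flags=re.MULTILINE)
    new_line = f"-Xmx{megabytes}M"
    if pattern.search(ini_text):
        return pattern.sub(new_line, ini_text)
    # No existing limit found: append one rather than silently doing nothing.
    return ini_text.rstrip("\n") + "\n" + new_line + "\n"

# Hypothetical file content, used only to demonstrate the rewrite.
sample = "# runtime options\n-Xmx1024M\n"
updated = set_max_heap(sample, 2048)
print(updated)
```

In practice you would read the real file, pass its text through a function like this, and write the result back after making a backup copy.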
Create project
Note that OpenRefine displays the first 10 rows of data by default. You can change the number of rows shown by clicking one of the numbers (5, 10, 25, 50) after the "Display" option at the top of the page.
Operation column
Common operations include:
collapsing a column,
moving and rearranging columns,
removing a single column or several columns,
renaming columns.
Collapse column
After collapsing the column, a blank column will appear. Click the blank column to restore the name2 column.
Moving columns and rearranging
The OpenRefine tool supports both moving a single column at a time and moving multiple columns at a time to achieve the purpose of rearranging data columns. The OpenRefine tool supports 4 ways to move columns, namely "move column to the beginning", "move column to the end", "move column left" and "move column right".
After moving right
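The four move operations described above can be pictured as changes to an ordered list of column names. The following is only a sketch of that idea, not OpenRefine's implementation; the function name `move_column` and the short option names are made up for illustration.

```python
def move_column(columns, name, where):
    """Return a new column order with `name` moved.

    `where` is one of 'beginning', 'end', 'left', 'right'.
    """
    cols = list(columns)          # work on a copy, leave the input untouched
    i = cols.index(name)
    if where == "beginning":
        target = 0
    elif where == "end":
        target = len(cols) - 1
    elif where == "left":
        target = max(i - 1, 0)    # already leftmost: stays in place
    elif where == "right":
        target = min(i + 1, len(cols) - 1)
    else:
        raise ValueError(f"unknown move: {where}")
    cols.insert(target, cols.pop(i))
    return cols

# Illustrative column order only.
order = ["name", "name2", "nation", "event"]
print(move_column(order, "name2", "right"))
```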
Rearrange/move columns
After selecting the rearrange/remove columns option, the following dialog appears:
The left side of the window displays the headers of all columns in order. You can rearrange them by dragging the column headers to the corresponding position.
After clicking OK, the result is as follows (if the name2 column does not appear, it may be in a collapsed state).
Remove this column and remove columns
In the OpenRefine tool:
"Remove this column" removes the currently selected single column;
"Remove columns" removes unneeded columns in batches.
After the removal, a column titled "gender" no longer exists in the current project.
Remove column
Select columns to remove
Click OK, as follows
After the removal, columns titled "name2" and "nation" no longer exist in the current project.
Redefine column headers
If the column heading does not clearly convey the meaning of the column data, you can redefine the column heading by renaming the column.
Undo and redo
Export data
Although an OpenRefine project supports moving, removing, and renaming columns, these operations do not modify the original data, because OpenRefine works on a copy of it. For the column operations to take effect outside the project, you need to export the modified data.
The OpenRefine tool supports exporting data as projects, HTML tables, Excel files, ODF spreadsheets, etc. It should be noted that the "Export Project" option will export the project into a compressed package in the openrefine.tar.gz format.
Note that subsequent chapters continue to use the Athletes_info project to demonstrate the steps. To keep the project data intact, all operations performed on the Athletes_info project so far are undone here.
Advanced operation
Data sorting
Data sorting is a common data cleaning operation. It arranges data in a specified order, which makes the data easier to check and correct. Browsing the sorted data also reveals its characteristics or trends and can provide clues for solving a problem.
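One detail worth keeping in mind when sorting is that the same column can order differently depending on whether its values are compared as text or as numbers. The sketch below shows this with made-up rows; the column names and values are illustrative and are not taken from the Athletes_info project.

```python
# Made-up sample rows; "age" is stored as text, as it often is after import.
rows = [
    {"name": "Chen", "age": "23"},
    {"name": "Adams", "age": "9"},
    {"name": "Baker", "age": "31"},
]

# Sorting as text compares character by character, so "23" < "31" < "9".
by_text = sorted(rows, key=lambda r: r["age"])

# Sorting as numbers converts each value first, so 9 < 23 < 31.
by_number = sorted(rows, key=lambda r: int(r["age"]))

print([r["name"] for r in by_text])
print([r["name"] for r in by_number])
```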
Data classification
Data classification (called "facets" in the English interface) is one of the most commonly used functions in OpenRefine. It is mainly used to obtain a subset of the data, allowing users to view the data from multiple angles without changing the data itself. OpenRefine supports a variety of classification operations, including text classification, numerical classification, timeline classification, scatter plot classification and custom classification.
Text classification groups rows by the specific text values in a column. Open the drop-down menu of the event column in the Athletes_info project and select [Classification] → [Text Classification] ("Facet" → "Text facet" in the English interface). The "Category/Filter" panel ("Facet / Filter" in English), which displays the classification results, opens on the left side of the page.
Numerical classification
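Conceptually, a text classification counts how often each distinct value occurs in a column and lets you narrow the view to the rows with a chosen value. The sketch below mimics that idea outside OpenRefine; the helper names `text_facet` and `select` and the sample values are assumptions made for illustration.

```python
from collections import Counter

def text_facet(rows, column):
    """Count how many rows carry each distinct value in `column`."""
    return Counter(row[column] for row in rows)

def select(rows, column, choice):
    """Return only the rows whose `column` equals the chosen value."""
    return [row for row in rows if row[column] == choice]

# Made-up rows standing in for an "event" column.
sample_rows = [
    {"name": "A", "event": "Swimming"},
    {"name": "B", "event": "Diving"},
    {"name": "C", "event": "Swimming"},
]

facet = text_facet(sample_rows, "event")
swimmers = select(sample_rows, "event", "Swimming")
print(facet.most_common())
```

Neither helper modifies `sample_rows`, which mirrors the point above that classification changes the view, not the data.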
Custom classification
Duplicate detection
To delete duplicate values in the name column, first sort the data that contains the duplicates, then apply the duplicates classification and delete the rows for which the result is true.
The duplicate detection function in the OpenRefine tool only works with text type data.
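The duplicate check can be thought of as marking every value that occurs more than once in the column. The following sketch shows that marking on a plain list, plus one possible cleanup that keeps the first occurrence of each value. Both helper names are hypothetical, and keeping the first occurrence is a design choice of this sketch rather than something OpenRefine does automatically.

```python
from collections import Counter

def duplicate_flags(values):
    """Return True for every value that appears more than once."""
    counts = Counter(values)
    return [counts[v] > 1 for v in values]

def keep_first(values):
    """Drop repeated values, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for v in values:
        if v not in seen:
            seen.add(v)
            kept.append(v)
    return kept

# Made-up names for illustration.
names = ["Li", "Wang", "Li", "Zhao"]
flags = duplicate_flags(names)
unique_names = keep_first(names)
print(flags, unique_names)
```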
Data filling
Data filling replaces empty cells with specified characters or numbers in order to keep the data complete.
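As a minimal sketch of the idea, the function below replaces empty cells in a column with a chosen filler. It assumes that `None`, the empty string, and whitespace-only strings all count as empty; the function name `fill_blanks` and the sample values are illustrative.

```python
def fill_blanks(values, filler):
    """Replace empty cells (None, "" or whitespace only) with `filler`."""
    filled = []
    for v in values:
        if v is None or str(v).strip() == "":
            filled.append(filler)
        else:
            filled.append(v)
    return filled

# Made-up column with three kinds of empty cell.
nation = ["CHN", "", None, "USA", "  "]
print(fill_blanks(nation, "unknown"))
```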
Text filtering
Text filtering is used to quickly match a specific string.
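The effect of a text filter can be sketched as a substring match over one column. The version below ignores case unless asked otherwise; the function name `text_filter`, its parameters, and the sample rows are assumptions made for this example.

```python
def text_filter(rows, column, query, case_sensitive=False):
    """Return the rows whose `column` contains `query` as a substring."""
    if not case_sensitive:
        query = query.lower()
    matches = []
    for row in rows:
        cell = str(row[column])
        if not case_sensitive:
            cell = cell.lower()
        if query in cell:
            matches.append(row)
    return matches

# Made-up rows for illustration.
people = [{"name": "Wang"}, {"name": "ANDY"}, {"name": "Li"}]
print(text_filter(people, "name", "an"))
print(text_filter(people, "name", "an", case_sensitive=True))
```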
Data conversion
The data conversion function converts a column of data into a specified type or format as needed.
Commonly used conversions include trimming leading and trailing whitespace, collapsing consecutive whitespace, capitalizing the first letter of each word, converting to all uppercase, converting to all lowercase, and converting to text.
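To make the common conversions concrete, here are rough Python equivalents applied to a single cell value. They are sketches of the behavior described above, not OpenRefine's own code, and the function names are chosen only for this example.

```python
import re

def trim(s):
    """Remove leading and trailing whitespace."""
    return s.strip()

def collapse_whitespace(s):
    """Replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", s)

def to_titlecase(s):
    """Capitalize the first letter of each word, lowercase the rest."""
    return s.title()

def to_text(v):
    """Convert any value to its text form."""
    return str(v)

raw = "  li   na "
cleaned = to_titlecase(collapse_whitespace(trim(raw)))
print(repr(cleaned))
```

Uppercase and lowercase conversion correspond to `str.upper()` and `str.lower()` in the same way.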
Note that an expression written in the Python language must contain a return statement.
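The reason is that the expression behaves like the body of a function: without `return`, nothing is handed back for the cell. The sketch below imitates this with ordinary Python functions, using `value` for the current cell's content as in OpenRefine's Python expressions. The wrapper functions themselves are only for illustration; in the expression editor you would type just the function body.

```python
def with_return(value):
    # Body as it would be typed into the expression box.
    return value.strip().upper()

def without_return(value):
    # Same computation, but the result is discarded because nothing is returned.
    value.strip().upper()

print(with_return("  swimming "))
print(without_return("  swimming "))
```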
Summary
This article introduced the data cleaning tool OpenRefine, covering its installation, configuration, and project creation, and then walked through column operations and advanced operations such as sorting, classification, duplicate detection, filling, filtering, and conversion.