The Easiest Data Cleaning Method Using Python and Pandas

Original link: https://www.marsja.se/easiest-data-cleaning-method-using-python-pandas-pyjanitor/

In this article, we will learn how to use the Python package Pyjanitor to simplify data preprocessing. Specifically, we will learn how to:

  • Add a column to a Pandas dataframe

  • Remove missing values

  • Remove an empty column

  • Clean column names

That is, we will learn how to clean Pandas dataframes using Pyjanitor. In each Python data manipulation example, we will also see how to carry out the same operation using only Pandas functionality.

What is Pyjanitor?

Before we continue learning how to use Pandas and Pyjanitor to clean our datasets, we will take a closer look at the package. The Python package Pyjanitor extends Pandas with a verb-based API. This easy-to-use API provides us with convenient data cleaning techniques. It started out as a port of the R package janitor. In addition, it is inspired by the expressiveness and ease of use of the R package dplyr. Note that there are several different ways to work with these methods, and this article will not cover all of them (see the documentation: https://pyjanitor.readthedocs.io/).

How to install Pyjanitor

There are two relatively easy ways to install Pyjanitor:

1. Install Pyjanitor using pip:
    pip install pyjanitor

2. Install Pyjanitor using conda:
    conda install -c conda-forge pyjanitor

Now that we know what Pyjanitor is and how to install the package, we can continue this Python data cleaning tutorial and learn how to remove missing values from a Pandas dataframe. Note that this tutorial walks through how to do each step with both Pandas and Pyjanitor. At the end, we will have a complete data cleaning example using only Pyjanitor, and a link to a Jupyter Notebook containing all the code.

Data manipulation with Pandas: a brief tutorial (https://www.marsja.se/data-manipulation-pandas-tutorial/)

Fake data

In the first Python data manipulation example, we will work with a fake dataset. More specifically, we will create a dataframe that has an empty column and some missing values. In this part of the article, we will also use the Python packages SciPy and NumPy, so these packages need to be installed as well.

In this example, we will create three columns: Subject, RT (response time), and Deg. To create the response time column, we will use norm from SciPy to generate normally distributed data.

Creating a normal distribution using SciPy

In the next code block, we create a normally distributed variable for the response times.
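A minimal sketch of this step might look like the following. The sample size, the mean and standard deviation (0.4 and 0.1 seconds), and the seed are illustrative assumptions, not values taken from the original article.

```python
from scipy.stats import norm

# Assumed values for illustration: 30 trials, mean RT 0.4 s, SD 0.1 s
n = 30
distribution = norm(loc=0.4, scale=0.1)

# Draw n response times; random_state makes the draw reproducible
rt = distribution.rvs(size=n, random_state=42)

print(rt[:5])
```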

Shuffling the list and adding missing values

Next, we add some missing values and shuffle the list of normally distributed data:
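One way to do this is sketched below. The number of missing values and the seeds are assumptions chosen for the example; missing values are represented with NumPy's NaN.

```python
import numpy as np
from random import Random
from scipy.stats import norm

# Assumed values for illustration
n = 30
n_missing = 5

# Normally distributed response times for the non-missing trials
rt = list(norm(loc=0.4, scale=0.1).rvs(size=n - n_missing, random_state=42))

# Append the missing values (NaN), then shuffle so they are spread through the list
rt.extend([np.nan] * n_missing)
Random(42).shuffle(rt)

print(rt[:10])
```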

Creating a dataframe from a dictionary

Finally, we create two more variables and a dictionary, and then use the dictionary to create a Pandas dataframe:
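The sketch below uses a handful of made-up values so that it stays short. Leaving Deg entirely empty is an assumption, made to match the description of a dataframe with one empty column.

```python
import numpy as np
import pandas as pd

# Small stand-in values (assumed for illustration)
rt = [0.41, np.nan, 0.38, 0.52, np.nan, 0.45]
subject = [1, 2, 3, 4, 5, 6]   # subject identifiers
deg = [np.nan] * len(rt)       # an entirely empty column

# The dictionary keys become the column names
data = {"Subject": subject, "RT": rt, "Deg": deg}
df = pd.DataFrame(data)

print(df)
```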
Figure: Dataframe created from the dictionary

Data cleaning in Python with Pandas and Pyjanitor

How to add a column to a Pandas dataframe

Now that we have created our dataframe from a dictionary, we are ready to add a column to it. In the following examples, we will use both the Pandas and the Pyjanitor methods.

1. Adding a column to a Pandas dataframe

Adding a column to a dataframe with Pandas is very easy. In the following example, we append an empty column to the Pandas dataframe:
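A minimal sketch, using a small made-up dataframe and a hypothetical column name (NewColumn):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Subject": [1, 2, 3], "RT": [0.41, np.nan, 0.38]})

# Assigning a scalar NaN broadcasts it to every row, giving an empty column
df["NewColumn"] = np.nan

print(df)
```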
Figure: Column added to the dataframe

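2. Adding a column to a Pandas dataframe using Pyjanitor

Now, we will use the add_column method to append a column to the dataframe. Adding an empty column is not quite as easy as with the method above. However, as you will see towards the end of this article, we can chain all of the methods when creating our dataframe: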
Figure: Column appended to the dataframe

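How to remove missing values from a Pandas dataframe

It is very common for our datasets to be far from complete. This may be due to errors in the measuring instruments, people forgetting or refusing to answer certain questions, and many other things. Whatever the reason behind the missing information, these entries are called missing values. In a Pandas dataframe, missing values are typically represented as NaN (or NA), much like NA in the R statistical environment. Pandas has the isna() function to help us identify missing values in a dataset. If we want to remove missing values, Pandas has the dropna() function.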

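1. Removing missing values using the Pandas dropna method

In the following code example, we remove all rows that have missing values. Note that if we want to modify the dataframe in place, we should add the inplace parameter and set it to True.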
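A short sketch with made-up data, showing both the default behaviour (a new dataframe is returned) and the inplace variant:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Subject": [1, 2, 3, 4], "RT": [0.41, np.nan, 0.38, np.nan]})

# dropna() returns a new dataframe without the rows that contain NaN
df_clean = df.dropna()

# With inplace=True the dataframe itself is modified and None is returned
df_inplace = df.copy()
df_inplace.dropna(inplace=True)

print(df_clean)
```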

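2. Removing missing values from a Pandas dataframe when using Pyjanitor

Removing missing values when working with Pyjanitor is done the same way as above. That is, we use the dropna method, which is a Pandas method that chains naturally with the methods Pyjanitor adds. However, when we remove missing data here, we also use the subset parameter to select which columns to consider: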
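The sketch below uses made-up data with an entirely empty Deg column (an assumption for the example). Since dropna is a Pandas method, no Pyjanitor import is needed for this step on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Subject": [1, 2, 3, 4],
    "RT": [0.41, np.nan, 0.38, 0.52],
    "Deg": [np.nan, np.nan, np.nan, np.nan],  # entirely empty column
})

# Only RT is checked for NaN, so the empty Deg column does not remove every row
df_clean = df.dropna(subset=["RT"])

print(df_clean)
```

Without subset, every row would be dropped here, because each row has a missing value in Deg.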

How to remove an empty column from a Pandas dataframe

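In the next Pandas data manipulation example, we will remove the empty column from the dataframe. First, we will remove the empty column using Pandas, and then we will use Pyjanitor. Remember, towards the end of this article we will have a complete example in which we carry out all of the data cleaning while actually creating the Pandas dataframe.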

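1. Removing an empty column from a Pandas dataframe

When we want to remove an empty column (i.e., one containing only missing values), we again use the Pandas dropna method. However, we also use the axis parameter and set it to 1 (for columns). In addition, we have to use the how parameter and set it to 'all'. If we don't, dropna will remove any column that contains a missing value.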
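The difference between how='all' and the default can be sketched as follows, using made-up data and a hypothetical empty column called NewColumn:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Subject": [1, 2, 3],
    "RT": [0.41, np.nan, 0.38],
    "NewColumn": [np.nan, np.nan, np.nan],
})

# axis=1 targets columns; how="all" drops only columns where every value is NaN
df_no_empty = df.dropna(axis=1, how="all")

# The default how="any" would also drop RT, which has a single missing value
df_any = df.dropna(axis=1)

print(list(df_no_empty.columns))
print(list(df_any.columns))
```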
Figure: Empty column removed

2. Removing an empty column from a Pandas dataframe using Pyjanitor

Removing an empty column is a bit easier using Pyjanitor:

How to rename columns in a Pandas dataframe

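Now that we know how to remove missing values, add a column to a Pandas dataframe, and remove a column, we will continue this data cleaning tutorial by learning how to rename columns.

For example, in the post where we learned how to load data from a JSON file into a Pandas dataframe, we renamed the columns to make the dataframe easier to work with later. In the following examples, we will read a JSON file and rename the columns using the Pandas dataframe method rename as well as Pyjanitor.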

More articles about loading data into dataframes:

How to read and write JSON files using Python and Pandas
https://www.marsja.se/how-to-read-and-write-json-files-using-python-and-pandas/

Pandas read CSV tutorial https://www.marsja.se/pandas-read-csv-tutorial-to-csv/

Pandas Excel tutorial: how to read and write Excel files
https://www.marsja.se/pandas-excel-tutorial-how-to-read-and-write-excel-files/

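1. Renaming columns in a Pandas dataframe

The column names of the loaded data contain some spaces and special characters that we want to remove. In the first renaming example, we will use the Pandas rename method together with a regular expression to rename the columns (i.e., we will replace the spaces and special characters with underscores).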
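As a sketch of this approach: the column names and the regular expression below are assumptions made for illustration, not the ones used in the original article.

```python
import re

import pandas as pd

# Hypothetical column names containing spaces and special characters
df = pd.DataFrame({
    "Rank": [1, 2],
    "Company Name": ["A", "B"],
    "Revenue (USD)": [10.0, 8.5],
})

# rename accepts a function that is applied to every column name; here each run
# of non-alphanumeric characters becomes one underscore, and underscores left
# at either end of the name are stripped
df_renamed = df.rename(
    columns=lambda name: re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
)

print(list(df_renamed.columns))
```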

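2. How to rename columns using Pyjanitor and clean_names

Renaming a column (or several columns) is much easier using Pyjanitor. In fact, once we have imported this Python package, we can simply use the clean_names method, and it will give us the same result as using the Pandas rename method. With clean_names, all letters in the column names are also converted to lowercase: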

How to clean data when loading it from disk

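One cool thing about cleaning our data with Pyjanitor is that we can use all of the above methods while loading the data. For example, in the final data cleaning example, we will add a column to the dataframe, remove the empty column, remove missing data, and clean the column names. This is what makes working with Pyjanitor make our lives easier.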

Aggregating data using Pyjanitor

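In the last example, we will use the Pandas methods agg, groupby, and reset_index together with the Pyjanitor method collapse_levels to calculate the mean and standard deviation for each sector: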

More articles about grouping and aggregating data with Python and Pandas:

Python Pandas groupby tutorial
https://www.marsja.se/python-pandas-groupby-tutorial-examples/

Descriptive statistics using Python
https://www.marsja.se/pandas-python-descriptive-statistics/

Conclusion

In this article, we learned about some data cleaning methods. Specifically, we learned how to append a column to a Pandas dataframe, remove empty columns, handle missing values, and rename columns (i.e., get better column names). Of course, there are many more data cleaning methods available when working with Pandas and Pyjanitor.

In summary, the methods this Python package adds are similar to those of the R packages janitor and dplyr. These methods will make our lives easier when preprocessing data.

What is your favorite data cleaning method and/or package? It doesn't matter whether you use R, Python, or any other programming language. Please leave a comment below!

English text: https://www.marsja.se/easiest-data-cleaning-method-using-python-pandas-pyjanitor/
Translator: wild pandas

Origin blog.csdn.net/qdPython/article/details/102744374