Pandas data processing basics

Introduction

Pandas is a well-known open-source data processing library. With it, we can quickly read, transform, filter, and analyze data sets. In addition, Pandas has powerful facilities for handling missing data and pivoting data, making it an indispensable tool in data preprocessing.

Knowledge points

  • Data types
  • Data reading
  • Data reduction
  • Data filling

Pandas is a very well-known open-source data processing library developed on top of NumPy and designed for data analysis tasks in the SciPy ecosystem. It incorporates a large number of libraries and some standard data models, and provides the functions and methods needed to manipulate large data sets efficiently.

Its unique data structures are the strength and the core of Pandas. Simply put, we can convert data in almost any format into a Pandas data type, apply the series of methods Pandas provides to transform and operate on it, and finally obtain the result we expect.

Therefore, we first need to understand and be familiar with the data types supported by Pandas.

Data types

The main data types in Pandas are: Series (one-dimensional), DataFrame (two-dimensional), Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (higher-dimensional). Among them, Series and DataFrame are the most widely used, accounting for almost 90% of usage. (Note that Panel and its higher-dimensional variants have been deprecated and removed in recent versions of Pandas.)

Series

Series is the most basic one-dimensional array form in Pandas. It can store integers, floating-point numbers, strings, and other types of data. The basic structure of a Series is as follows:

pandas.Series(data=None, index=None)

Here, data can be a dictionary, a NumPy ndarray object, etc., and index is the data index. The index is a major feature of Pandas data structures; its main function is to help us locate data more quickly.

Next, we create an example Series from a Python dictionary.

%matplotlib inline
import pandas as pd

s = pd.Series({'a': 10, 'b': 20, 'c': 30})
s

a    10
b    20
c    30
dtype: int64

As shown above, the data values of this Series are 10, 20, 30, the index is a, b, c, and the type of the data values is recognized as int64 by default. You can confirm the type of s with type():

type(s)

pandas.core.series.Series

Because Pandas is developed on top of NumPy, NumPy's ndarray multidimensional arrays can naturally be converted into Pandas data, and a Series can be built from a one-dimensional NumPy array.

import numpy as np

s = pd.Series(np.random.randn(5))
s


As shown above, we passed in a one-dimensional random array generated by NumPy. The resulting Series index starts from 0 by default, and the value type is float64.

DataFrame

DataFrame is the most common, most important, and most frequently used data structure in Pandas. Its structure is similar to an ordinary spreadsheet or SQL table. You can think of a DataFrame as an extension of Series: it looks as if it were composed of multiple Series. The intuitive difference from a Series is that the data has not only a row index but also a column index.

The basic structure of DataFrame is as follows:

pandas.DataFrame(data=None, index=None, columns=None)

Unlike Series, it adds columns, the column index. A DataFrame can be constructed from several types of data:

  • A dictionary of one-dimensional arrays, lists, dictionaries, or Series.
  • Two-dimensional or structured numpy.ndarray.
  • A Series or another DataFrame.

For example, we first use a dictionary composed of Series to build a DataFrame.

df = pd.DataFrame({'one': pd.Series([1, 2, 3]),
                   'two': pd.Series([4, 5, 6])})
df

   one  two
0    1    4
1    2    5
2    3    6

When no index is specified, the DataFrame's index also starts from 0. We can also generate a DataFrame directly from a dictionary of lists.
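For example, using the same values as above:

df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [4, 5, 6]})
df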

Or, conversely, generate a DataFrame from a list of dictionaries.
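For example, each dictionary in the list becomes one row:

df = pd.DataFrame([{'one': 1, 'two': 4},
                   {'one': 2, 'two': 5},
                   {'one': 3, 'two': 6}])
df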

NumPy's multidimensional arrays are very commonly used, and a DataFrame can also be constructed from a two-dimensional array.
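For example (the shape and value range here are arbitrary):

pd.DataFrame(np.random.randint(5, size=(2, 4)))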

At this point, you should have a feel for Series and DataFrame, the two commonly used Pandas data types. At first glance, a Series can be seen as a DataFrame with only one column of data. Of course, this statement is not rigorous; the core difference between the two is that a Series has no column index. You can compare a Series and a DataFrame generated from the same one-dimensional NumPy random array below.
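For example, building both from the same one-dimensional array:

arr = np.random.randn(5)
pd.Series(arr)     # no column index
pd.DataFrame(arr)  # gains a default column index, 0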

We will not introduce data types such as Panel here. First, these types are rarely used; second, even if you do use them, the techniques learned on DataFrame can be transferred to them.

Data reading

To analyze data with Pandas, we first need to read the data. In most cases, data comes from external data files or databases. Pandas provides a comprehensive series of methods for reading external data. Below, we take the most common case, CSV data files, as an example.

The method for reading CSV files is pandas.read_csv(). You can pass in a relative path or a network URL directly.
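A minimal example; the file name los_census.csv is hypothetical and stands for the Los Angeles census data used below, which you could also replace with a URL:

df = pd.read_csv('los_census.csv')
df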

You can also download the CSV file from its link and read it from a local path.

Since a CSV file is stored as a two-dimensional table, Pandas automatically reads it as a DataFrame.

Now you should understand that the DataFrame is the core of Pandas. All data, whether read from external sources or generated by ourselves, needs to be converted into a Pandas DataFrame or Series. In fact, in most cases this all happens by design, and no extra conversion work is required.

The methods starting with the pd.read_ prefix can read all kinds of data files and also support connecting to databases. We will not go through them one by one here; you can read the corresponding chapters of the official documentation to familiarize yourself with these methods and the parameters they take.

You may have another question: Why convert the data to Series or DataFrame structure?

In fact, the answer is simple. All of Pandas' data manipulation methods are designed around the data structures Pandas supports. In other words, only a Series or DataFrame can use the methods and functions Pandas provides. Therefore, before learning the actual data processing methods, we need to convert data into a Series or DataFrame.

Basic operations

From the content above, we know that a DataFrame structure roughly consists of three parts: the column names, the index, and the data.

Next, we will learn the basic operations on a DataFrame. In this course we will not deliberately single out Series, because most of the methods and techniques you learn on DataFrame also apply to Series; the two share the same roots.

Above, we read an external data set: census data for Los Angeles. Sometimes the files we read are very large, and printing and previewing all of the output would be both unwieldy and time-consuming. Fortunately, Pandas provides the head() and tail() methods, which let us preview just a small slice of the data.
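For example (both methods show 5 rows by default and accept a row count):

df.head()   # preview the first 5 rows
df.tail(7)  # preview the last 7 rows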

Pandas also provides statistical and descriptive methods to help you understand the data set from a macro perspective. describe() gives an overview of the data set, outputting the count, mean, maximum, minimum, and so on for each numeric column.
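For example:

df.describe()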

Pandas is developed on top of NumPy, so you can convert a DataFrame to a NumPy array at any time through .values.
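For example:

df.values  # returns the underlying data as a NumPy array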

This also means you can operate on the same data with both the Pandas and NumPy APIs and convert between the two at will; it is a very flexible tool ecosystem.

In addition to .values, the common attributes supported by DataFrame are listed in the corresponding chapter of the official documentation. The most commonly used are:
df.index    # view the index
df.columns  # view the column names
df.shape    # view the shape

Data selection

During data preprocessing, we often split a data set, keeping only the rows, columns, or blocks we need and passing them on to the next step. This is so-called data selection, or data indexing.

Because Pandas data structures have indexes and labels, we can select data through multi-axis indexing.

Selection based on index number

When we create a DataFrame without specifying the row index or column labels, Pandas uses numbers starting from 0 as the row index by default (and, when reading a file, takes its first row as the column labels). In fact, the columns also have numeric positions, starting from 0 by default; they are just not displayed.

Therefore, we can start by selecting data by numeric position, using Pandas' .iloc method. This method accepts the following types:

  • An integer, e.g.: 5
  • A list or array of integers, e.g.: [1, 2, 3]
  • A boolean array.
  • A callable function that returns index values.

Below, we use the sample data above for demonstration.

First, we can select the first 3 rows of data. This is similar to slicing in Python or NumPy.
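For example:

df.iloc[:3]  # rows at positions 0, 1, 2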

We can also select a specific row.
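For example, the row at position 5:

df.iloc[5]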

Then, to select multiple rows, would it be df.iloc[1, 3, 5]?

The answer is no. df.iloc[] accepts row and column positions at the same time, in the form df.iloc[rows, columns], so typing df.iloc[1, 3, 5] directly raises an error.

The fix is simple: to select the rows at positions 1, 3, and 5, pass them as a list.
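df.iloc[[1, 3, 5]]  # a list of row positions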

Having learned to select rows, you can probably guess how to select columns. For example, we want to select columns 2 to 4.
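For example:

df.iloc[:, 1:4]  # all rows, columns at positions 1, 2, 3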

Here we select columns 2-4, but the input is 1:4, because positions count from 0 and, as in Python or NumPy slicing, the end position is excluded. Now that we can locate both rows and columns, we only need to combine the two to select any block of the data set.

Selection based on label name

In addition to selecting by numeric position, we can also select directly by label name. The method used here is very similar to iloc above: df.loc[], that is, iloc without the i.

Next, we demonstrate the usage of df.loc[]. Note that unlike iloc, a loc slice includes both endpoints. First, select the first 3 rows:

df.loc[0:2]

Then select the 1st, 3rd, and 5th rows:

df.loc[[0, 2, 4]]

Then, select columns 2-4:

df.loc[:, 'Total Population':'Total Males']

Finally, select rows 1 and 3, and the columns from Median Age onward:

df.loc[[0, 2], 'Median Age':]

Data reduction

Although we can extract the data we need from a complete data set with the data selection methods, sometimes it is simpler and more direct to delete the unneeded data. In Pandas, the methods starting with .drop are all related to data deletion.

DataFrame.drop can remove specified columns or rows from a data set. Typically, we pass the labels parameter and then use axis to specify whether to delete by column or by row. You can also delete data via the index parameter; see the official documentation for details.
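For example, with the census data above (a sketch; axis=1 means we delete by column):

df.drop(labels=['Total Males', 'Median Age'], axis=1)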

DataFrame.drop_duplicates is usually used for deduplication, that is, removing duplicate values from a data set. Its usage is very simple: specify the rule for identifying duplicates (the subset of columns to compare) and which occurrence to keep.

df.drop_duplicates()

In addition, another method for data reduction, DataFrame.dropna, is also very commonly used. Its main purpose is to delete missing values, that is, the rows or columns of the data set that contain missing data.

df.dropna()

For the three commonly used data deletion methods above, be sure to read the official documentation via the links given. These methods have few pitfalls; understanding their usage from the documentation is enough, so we will not complicate a simple introduction.

Data filling

Having covered data deletion, we may encounter the opposite situation: data filling. For a given data set, we generally do not fill in data at random; instead, we fill in missing values.

In a real production environment, the data files we need to process are often not as clean as we would like, and the most common issue is missing values. A missing value mainly refers to data loss, that is, a certain piece of data in the data set simply does not exist. In addition, data that exists but is obviously incorrect can also be classified as missing. For example, if a segment of a time series data set suddenly has its time order scrambled, that small segment is meaningless and can be treated as missing.

Detect missing values

To make missing values easier to detect, Pandas uses NaN to mark missing data of different types. NaN stands for Not a Number and serves only as a marker. The exception is time series, where a missing timestamp is marked as NaT.

Pandas has two main methods for detecting missing values: isna() and notna(). As the names imply, they test "is a missing value" and "is not a missing value" respectively, and return boolean values by default.

Next, we artificially generate a set of sample data containing missing values.

df = pd.DataFrame(np.random.rand(9, 5), columns=list('ABCDE'))
# Insert a Time column and fill it with a fixed timestamp
df.insert(value=pd.Timestamp('2017-10-1'), loc=0, column='Time')
# Set columns 0, 2, 4 of rows 1, 3, 5, 7 to missing values
df.iloc[[1, 3, 5, 7], [0, 2, 4]] = np.nan
# Set columns 1, 3, 5 of rows 2, 4, 6, 8 to missing values
df.iloc[[2, 4, 6, 8], [1, 3, 5]] = np.nan
df


Then, either isna() or notna() can detect the missing values in the data set.
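For example:

df.isna()  # True marks a missing value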

Above, we covered generating and detecting missing values. Missing values are generally handled with one of two opposite operations: filling or removal. If you feel it is necessary to keep the column or row containing a missing value, fill it in; if there is no need to keep it, remove it.

The method for removing missing values, dropna(), was introduced above. Now let's look at fillna(), the method for filling missing values.

First, we can replace NaN with the same scalar value, such as 0.
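For example, applied to the numeric columns (the NaT values in the Time column would need a timestamp rather than a number as the fill value):

df[list('ABCDE')].fillna(0)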

In addition to filling in a fixed value directly, we can also fill each missing value with its preceding or following value via the method parameter. For example, fill with the preceding value:

df.fillna(method='pad')

Or with the following value:

df.fillna(method='bfill')

Since the last row has no following value, its missing values naturally remain. (In recent Pandas versions, df.ffill() and df.bfill() are the preferred spellings of these two operations, and the method= parameter is deprecated.)

In the example above, the missing values appear at intervals. So what if there are consecutive missing values? Let's give it a try: first, set the 3rd and 5th rows of columns 2, 4, and 6 to missing values, as sketched below.

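A sketch, continuing with the df generated above (columns 2, 4, 6 sit at positions 1, 3, 5; since row 4 of these columns is already missing, rows 3-5 now form runs of consecutive missing values):

df.iloc[[3, 5], [1, 3, 5]] = np.nan
df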

Then forward-fill:
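Again using the pad method:

df.fillna(method='pad')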

As you can see, the consecutive missing values are also filled from the preceding values, and completely. We can limit how many consecutive values are filled through the limit= parameter.

df.fillna(method='pad', limit=1)  # fill at most one consecutive missing value

In addition to the above filling methods, you can also fill specific columns or rows with statistics computed by Pandas itself, such as the mean. For example, fill columns C to E with the mean of each column, as sketched below.
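A sketch; numeric_only=True keeps the non-numeric Time column out of the mean calculation (on older Pandas versions, df.mean() skips it automatically):

df.fillna(df.mean(numeric_only=True)['C':'E'])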

Interpolation fill

Interpolation is a method from numerical analysis. In short, it uses a function (linear or non-linear) to estimate unknown values from known data. Interpolation is very common in data work, and its advantage is that it tries to restore the data itself.

We can perform linear interpolation through the interpolate() method. Other interpolation algorithms can be found in the official documentation.

# Generate a DataFrame with missing values
df = pd.DataFrame({'A': [1.1, 2.2, np.nan, 4.5, 5.7, 6.9],
                   'B': [.21, np.nan, np.nan, 3.1, 11.7, 13.2]})
df

     A      B
0  1.1   0.21
1  2.2    NaN
2  NaN    NaN
3  4.5   3.10
4  5.7  11.70
5  6.9  13.20

For the missing values above, filling with the preceding or following values, or with the mean, is unlikely to reflect the trend of the data. This is where interpolation works best. Let's try the default linear interpolation.
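For example (the result is assigned to df_interpolate so we can reuse it for plotting below):

df_interpolate = df.interpolate()
df_interpolate  # A[2] becomes 3.35; B[1] and B[2] become about 1.17 and 2.14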

The interpolation result clearly conforms to the trend of the data: A at index 2 becomes 3.35, exactly halfway between 2.2 and 4.5, and the two missing values in B fall evenly on the line from 0.21 to 3.1. Filling with the preceding or following values cannot achieve this.

Data visualization

NumPy, Pandas, and Matplotlib form a complete data analysis ecosystem, so the three tools are highly compatible and even share a large number of interfaces. When our data is in DataFrame format, we can directly use the DataFrame.plot method provided by Pandas, which calls the Matplotlib interface to draw common plots.

For example, we use the interpolated data df_interpolate from above to draw a line chart.
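For example:

df_interpolate.plot()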

Other plot styles are just as simple: simply specify the kind= parameter.
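For example, a bar chart:

df_interpolate.plot(kind='bar')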

For more plot styles and parameters, read the detailed instructions in the official documentation. Although Pandas plotting is not as flexible as Matplotlib, it is simple and easy to use, well suited to quickly presenting and previewing data.

In addition to the methods and techniques above, Pandas also provides commonly used facilities for:

  • Data computation, for example: DataFrame.add, etc.
  • Data aggregation, for example: DataFrame.groupby, etc.
  • Statistical analysis, for example: DataFrame.abs, etc.
  • Time series, for example: DataFrame.shift, etc.
