The main methods of Pandas

Pandas is a library built on top of NumPy that provides a large number of data exploration methods. Pandas can process and analyze data in formats such as .csv, .tsv, and .xlsx in a SQL-like manner.

The main data structures used by Pandas are the Series and DataFrame classes. These two classes are briefly introduced below:

  • Series is an object similar to a one-dimensional array, consisting of a set of data (of any NumPy data type) and an associated set of data labels (i.e., an index).
  • DataFrame is a two-dimensional data structure, that is, a table in which each column holds data of a single type. You can think of it as a dictionary of Series instances.

Let's start the experiment. We will demonstrate the main methods of Pandas by analyzing a dataset on customer churn at a telecom operator.

First load the necessary libraries, namely NumPy and Pandas.

Teaching code:

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Read the data through the read_csv() method, and then use the head() method to view the first 5 rows of data.
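As a runnable sketch, the snippet below reads the same kind of data from an in-memory CSV. In the experiment you would pass the path of the downloaded file (the name telecom_churn.csv mentioned in the comment is only an assumption), and the toy values here are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the real file; in the experiment you would call something like
# pd.read_csv('telecom_churn.csv') with the actual path (filename assumed).
csv_text = io.StringIO(
    "State,Account length,Area code,International plan,Churn\n"
    "KS,128,415,No,False\n"
    "OH,107,415,No,False\n"
    "NJ,137,415,No,False\n"
)
df = pd.read_csv(csv_text)

print(df.head())   # first 5 rows (this toy sample has only 3)
print(df.shape)    # (3, 5) -> (rows, columns)
```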

Each row corresponds to a customer, and each column corresponds to one of the customer's features.

Let us look at the dimensions, feature names, and feature types of the DataFrame.

df.shape

Next we try to print the column names.

df.columns


We can also use the info() method to output some overall information about the DataFrame.

The astype() method can change the type of a column. The following code converts the Churn (churn rate) feature to the int64 type.

df['Churn'] = df['Churn'].astype('int64')

The describe() method displays the basic statistical characteristics of the numerical features (int64 and float64): the number of non-missing values, the mean, the standard deviation, the range, the quartiles, and so on.

df.describe()


By explicitly specifying which data types to include through the include parameter, you can also view statistics for non-numeric features.
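A minimal sketch on a toy DataFrame (made-up values, column names modeled on the dataset) showing describe() for numeric features and the include parameter for non-numeric ones:

```python
import pandas as pd

# Toy stand-in for the churn dataset; values are made up
df = pd.DataFrame({
    'Total day minutes': [265.1, 161.6, 243.4, 299.4],
    'International plan': ['No', 'No', 'No', 'Yes'],
    'Churn': [False, False, False, True],
})

num_stats = df.describe()                            # numeric columns only
obj_stats = df.describe(include=['object', 'bool'])  # categorical / boolean

print(num_stats)   # count, mean, std, min, quartiles, max
print(obj_stats)   # count, unique, top, freq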

The value_counts() method summarizes categorical (type object) and boolean (type bool) features. Let us look at the distribution of Churn.

The above results show that out of 3333 customers, 2850 are loyal customers, with a Churn value of 0. When calling value_counts(), add the normalize=True parameter to display proportions instead of counts.
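A small sketch with made-up Churn values:

```python
import pandas as pd

df = pd.DataFrame({'Churn': [0, 0, 0, 1, 0, 1]})  # toy churn column

counts = df['Churn'].value_counts()                # absolute counts
shares = df['Churn'].value_counts(normalize=True)  # proportions

print(counts)   # 0 -> 4, 1 -> 2
print(shares)   # 0 -> 2/3, 1 -> 1/3
```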

Sort

A DataFrame can be sorted by the value of a variable (that is, a column). For example, sort by Total day charge (set ascending=False to sort in descending order).


In addition, you can sort by the values of multiple columns. The following sorts first by Churn in ascending order and then by Total day charge in descending order, with Churn taking priority over Total day charge.
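A sketch of both forms of sorting on toy data (values made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Churn': [0, 1, 0, 1],
    'Total day charge': [45.07, 27.47, 41.38, 50.90],
})

# Sort by one column, descending
by_charge = df.sort_values(by='Total day charge', ascending=False)

# Sort by several columns: Churn ascending, then Total day charge descending
by_both = df.sort_values(by=['Churn', 'Total day charge'],
                         ascending=[True, False])

print(by_charge.head())
print(by_both.head())
```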

Index and get data

DataFrame can be indexed in different ways.

Use DataFrame['Name'] to get a single column. For example, what is the churn rate?

For a company, a churn rate of 14.5% is a very bad figure; such a high churn rate may even lead to bankruptcy.

Boolean indexing is also very convenient. The syntax is df[P(df['Name'])], where P is a logical condition checked for each element of the Name column. The output is the rows of the DataFrame for which the condition P on the Name column holds.

Let us use boolean indexing to answer the following question: what are the mean values of the numerical features for churned users?

What is the average total time that churned users spend on calls during the day?

What is the longest international call made by loyal users (Churn == 0) who do not use the international plan (International plan == 'No')?
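A sketch of the three queries on a toy DataFrame (values made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Churn': [0, 0, 1, 1],
    'Total day minutes': [265.1, 161.6, 243.4, 299.4],
    'International plan': ['No', 'Yes', 'No', 'Yes'],
    'Total intl minutes': [10.0, 13.7, 12.2, 6.6],
})

# Mean values of the numerical features for churned users
churned_means = df[df['Churn'] == 1].mean(numeric_only=True)

# Average day-time minutes of churned users
avg_day = df[df['Churn'] == 1]['Total day minutes'].mean()

# Longest international call of loyal users without the international plan
longest_intl = df[(df['Churn'] == 0) &
                  (df['International plan'] == 'No')]['Total intl minutes'].max()

print(churned_means)
print(avg_day)       # (243.4 + 299.4) / 2
print(longest_intl)  # 10.0
```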

DataFrame can be indexed by column name, row name, and row number. The loc method is indexed by name, and the iloc method is indexed by number.

Use the loc method to output the rows labeled 0 to 5 and the columns from State to Area code.

Output the first 5 rows and first 3 columns through the iloc method (like a typical Python slice, the upper bound is excluded).

df[:1] and df[-1:] can get the first and last rows of the DataFrame.
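A sketch contrasting loc (label-based, both endpoints included) and iloc (position-based, end excluded) on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['KS', 'OH', 'NJ', 'OH', 'OK', 'AL'],
    'Account length': [128, 107, 137, 84, 75, 118],
    'Area code': [415, 415, 415, 408, 415, 510],
    'Churn': [0, 0, 0, 1, 0, 1],
})

# loc indexes by label: rows 0..5 and columns 'State' through 'Area code';
# with loc, BOTH endpoints are included
part_loc = df.loc[0:5, 'State':'Area code']

# iloc indexes by position: first 5 rows, first 3 columns;
# like a normal Python slice, the end is excluded
part_iloc = df.iloc[0:5, 0:3]

print(part_loc.shape)   # (6, 3)
print(part_iloc.shape)  # (5, 3)
print(df[:1])    # first row
print(df[-1:])   # last row
```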

Apply functions to cells, columns, rows

Next, apply the function max to each column through the apply() method, that is, output the maximum value of each column.

The apply() method can also apply a function to each row; to do so, specify axis=1. Lambda functions are very convenient in this case. For example, the following selects all states whose names start with 'W'.
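A sketch on toy data; note that the W-state filter is usually written by applying the lambda to the State column itself rather than with axis=1:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['WV', 'OH', 'WY', 'AL'],
    'Total day minutes': [265.1, 161.6, 243.4, 299.4],
    'Total day calls': [110, 123, 114, 71],
})

# Apply max to each column: the per-column maximum
col_max = df.apply(max)
print(col_max)

# Apply a lambda to the State column to keep states starting with 'W'
w_states = df[df['State'].apply(lambda state: state[0] == 'W')]
print(w_states)
```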

The map() method can replace the values in a column using a dictionary of the form {old_value: new_value}.

Of course, the replace() method can also achieve the same substitution.
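A sketch of both substitution methods on toy columns:

```python
import pandas as pd

df = pd.DataFrame({'International plan': ['No', 'Yes', 'No']})

# map() substitutes values using a {old_value: new_value} dictionary
d = {'No': False, 'Yes': True}
df['International plan'] = df['International plan'].map(d)
print(df)

# replace() achieves the same substitution
df2 = pd.DataFrame({'Voice mail plan': ['Yes', 'No']})
df2 = df2.replace({'Voice mail plan': {'Yes': True, 'No': False}})
print(df2)
```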

Grouping

The general form of grouping data under Pandas is:
df.groupby(by=grouping_columns)[columns_to_show].function()

Explanation:

  • The groupby() method groups according to the value of grouping_columns
  • Next, the columns of interest are selected (columns_to_show). If this is omitted, all non-groupby columns (that is, all columns except grouping_columns) are selected.
  • Finally, apply one or more functions.

In the following example, we group the data by the value of the Churn variable (0 or 1) and display the statistics of each group.

Similar to the example above, except that this time several functions are passed to the agg() method, which aggregates the grouped data.
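A sketch of both forms on toy data (values made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Churn': [0, 0, 1, 1],
    'Total day minutes': [265.1, 161.6, 243.4, 299.4],
    'Total night minutes': [244.7, 254.4, 162.6, 222.0],
})

columns_to_show = ['Total day minutes', 'Total night minutes']

# Descriptive statistics for each group
grouped = df.groupby(['Churn'])[columns_to_show].describe()
print(grouped)

# Aggregate each group with several functions at once
agg_stats = df.groupby(['Churn'])[columns_to_show].agg(
    ['mean', 'std', 'min', 'max'])
print(agg_stats)
```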

Summary Table

In Pandas, a pivot table is defined as follows: a pivot table is a common data-summarization tool in spreadsheet programs and other data-exploration software. It aggregates data according to one or more keys and distributes the data into rectangular areas based on the groupings along the rows and columns.


Now, use the pivot_table() method to view the average number of day, evening, and night calls for different area codes.
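A sketch with made-up call counts:

```python
import pandas as pd

df = pd.DataFrame({
    'Area code': [415, 415, 408, 408],
    'Total day calls': [110, 123, 114, 71],
    'Total eve calls': [99, 103, 110, 88],
    'Total night calls': [91, 103, 89, 97],
})

# Average day / evening / night call counts per area code
pivot = df.pivot_table(
    values=['Total day calls', 'Total eve calls', 'Total night calls'],
    index=['Area code'],
    aggfunc='mean')
print(pivot)
```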

A cross tabulation is a special pivot table used to calculate grouped frequencies. In Pandas, the crosstab() method is generally used to construct one.

Construct a cross table to view the joint distribution of Churn and International plan in the sample.

Build a cross table to view the joint distribution of Churn and Voice mail plan.
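A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'Churn': [0, 0, 0, 1, 1],
    'International plan': ['No', 'No', 'Yes', 'No', 'Yes'],
})

# Grouped frequencies of Churn vs. International plan
ct = pd.crosstab(df['Churn'], df['International plan'])
print(ct)

# With normalize=True the cells become shares of the total
ct_norm = pd.crosstab(df['Churn'], df['International plan'], normalize=True)
print(ct_norm)
```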

The above results show that most users are loyal and do not use the additional services (international plan, voice mail).

Adding and removing rows and columns of a DataFrame

There are many ways to add columns to a DataFrame. For example, use the insert() method to add a Total calls column holding the total number of calls for each user.

The above code creates an intermediate Series instance, total_calls. In fact, a column can be added directly, without creating this intermediate instance (as in the last line).

The Total charge column is added as the last column.
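A sketch on a toy DataFrame (only a few of the real call/charge columns are used, with made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    'Total day calls': [110, 123],
    'Total eve calls': [99, 103],
    'Total day charge': [45.07, 27.47],
    'Total eve charge': [16.78, 16.62],
})

# Via an intermediate Series plus insert()
total_calls = df['Total day calls'] + df['Total eve calls']
df.insert(loc=len(df.columns), column='Total calls', value=total_calls)

# Or add a column directly, without the intermediate instance;
# the new column lands in the last position
df['Total charge'] = df['Total day charge'] + df['Total eve charge']

print(df)
```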

Use the drop() method to delete columns and rows.
Part of the explanation of the above code:

  • Pass the corresponding index ['Total charge', 'Total calls'] and the axis parameter (1 means delete columns, 0 means delete rows; the default is 0) to drop().
  • The inplace parameter indicates whether to modify the original DataFrame (False means do not modify the existing DataFrame and return a new one; True means modify the current DataFrame in place).
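A sketch of drop() on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'State': ['KS', 'OH', 'NJ'],
    'Total charge': [61.85, 44.09, 58.16],
    'Total calls': [209, 226, 224],
})

# axis=1 deletes columns; with inplace=False (the default) a new
# DataFrame is returned and the original is left untouched
df_small = df.drop(['Total charge', 'Total calls'], axis=1)

# axis=0 (the default) deletes rows by index label
df_rows = df.drop([1, 2])

print(df_small.columns.tolist())  # ['State']
print(df_rows.shape)              # (1, 3)
```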

Predicting churn

First, build a cross table through the crosstab() method described above to view the relationship between the International plan variable and Churn, and use the countplot() method to build a count plot to visualize the result.

# Load modules and configure plotting
import matplotlib.pyplot as plt
import seaborn as sns


The figure above shows that the churn rate of users with international packages is much higher, which is an interesting observation. Perhaps the high cost of international calls makes customers very dissatisfied.

Similarly, check the relationship between Customer service calls and Churn, and visualize the results.

The above figure shows that after 4 customer service calls, the customer's churn rate has increased significantly.

In order to better highlight the relationship between Customer service calls and Churn, a binary attribute Many_service_calls can be added to the DataFrame, indicating more than 3 customer service calls (Customer service calls > 3). Look at its relationship with churn and visualize the result.

Now we can create another cross table relating Churn to International plan and the newly created Many_service_calls attribute.
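A sketch reproducing these steps on made-up data; the accuracy value computed here applies only to this toy sample, not the 85.8% figure from the experiment:

```python
import pandas as pd

df = pd.DataFrame({
    'Customer service calls': [1, 4, 0, 5, 2, 4],
    'International plan': [False, True, False, True, False, False],
    'Churn': [0, 1, 0, 1, 1, 0],
})

# Binary attribute: more than 3 customer service calls
df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')

# Cross table of the new attribute against Churn (margins adds totals)
print(pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True))

# Accuracy of the simple rule:
# (Many_service_calls & International plan) => Churn = 1, else 0
pred = (df['Many_service_calls'] & df['International plan']).astype(int)
accuracy = (pred == df['Churn']).mean()
print(accuracy)
```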

Review the content of this experiment:

  • The share of loyal customers in the sample is 85.5%. This means that the simplest model that predicts "loyal customers" has an 85.5% probability of guessing right. In other words, the accuracy of subsequent models should not be less than this number, and hopefully it will be significantly higher than this number.
  • A simple rule-based model, "(Customer service calls > 3) & (International plan = True) => Churn = 1, else Churn = 0", achieves an accuracy of 85.8%. Later we will discuss decision trees and see how to find similar rules automatically from the input data alone, instead of setting them manually. We obtained these two accuracy figures (85.5% and 85.8%) without applying any machine learning methods, and they can serve as baselines for subsequent models. If, after a lot of effort, we only increase the accuracy by 0.5%, then our efforts may be misdirected, since a simple model with just two rules already improved the accuracy by 0.3%.
  • Before training a complex model, it is recommended to preprocess the data, draw some charts, and make some simple assumptions. In addition, when applying machine learning to practical tasks, you usually start with simple solutions and then try more complex solutions.

Origin blog.csdn.net/bj_zhb/article/details/105551490