Getting Started with Pandas Easily: Practical Basics of Python Data Analysis

Preface

Recently, while chatting with some students who do data analysis, I noticed a problem that is common at the entry level. Many people who got into the field out of interest quickly pick up the basics of Python syntax and then dive headfirst into the classic "Python for Data Analysis". After gritting their teeth and finishing it, they seem to know a little about everything, yet in actual work they don't know where to start and their analysis is full of holes.

As for the reasons, insufficient understanding and insufficient practice are two old stumbling blocks that everyone has to overcome on their own. But there is another interesting and often overlooked factor: being overwhelmed by too many options at once.

What do I mean? Suppose I cannot swim at all and want to learn. The coach carefully breaks down the breaststroke for me, holds my waist, and lets me paddle in the water for 5 minutes. Then he immediately explains the butterfly stroke and has me paddle for another 5 minutes, then switches me to yet another stroke, which is still just paddling for 5 minutes. Finally, the coach throws me into the deep end of the pool and cheers me on.

As a complete beginner, I was taught 3 strokes and practised each for only 5 minutes. The result is that I learned no swimming skills at all; I only learned how to swallow pool water. When a beginner is shown multiple solutions to a single problem right at the start, with only limited practice of each, he will often flounder when facing a concrete problem.

Take Pandas as an example. Its multiple construction methods, multiple indexing methods, and multiple ways of achieving the same effect can easily throw beginners into exactly this kind of confusion. Trying to avoid that pitfall is my original intention in writing this Pandas basics tutorial: I hope that by sorting out and streamlining the knowledge points, I can offer some inspiration to the students who need it.

Now let's get to the point (enough rambling).

1. Introduction to Pandas

There is a saying circulating in the field: if you don't know Master Pan (Pandas) when doing analysis, your analysis will be in vain no matter how seasoned you are.

Pandas is a professional data analysis tool built on top of NumPy that can handle all kinds of data sets flexibly and efficiently; it is also the workhorse of our later analysis cases. It provides two main data structures, DataFrame and Series. We can crudely think of a DataFrame as a table in Excel and a Series as a single column of that table. All the Pandas operations we will learn and use later are built on these tables and columns (for an intuitive mapping between Pandas and Excel, I recommend "Compare EXCEL, Easily Learn Python Data Analysis", written by my good friend Zhang Junhong).

One thing needs to be emphasized here. Compared with Excel and SQL, Pandas only changes the way data is called and processed; the core is still applying a series of operations to the source data. Before the formal processing, the more important step is to think before acting: clarifying the purpose of the analysis and the analysis approach before touching the data will often get twice the result with half the effort.

2. Create, read and store

2.1 Create

How should we construct the following table in Pandas?

[Figure: the example table to construct — the case data, with columns for traffic source, source details, number of visitors, conversion rate, and unit price]

Don't forget, the first step is always to import our library: import pandas as pd

The most common way to construct a DataFrame is dictionary + list. The statement is very simple: open a pair of curly braces for the dictionary, then write out each column title and its corresponding column values in turn (the values must be given as a list). The order of the columns does not matter:

[Figure: constructing the DataFrame in Jupyter Notebook (left) and the corresponding table in Excel (right)]

The left side is what the DataFrame looks like in Jupyter Notebook; mapped to Excel, it corresponds to the table on the right. The column labels, the row index, and the underlying data can be inspected through the DataFrame's columns, index, and values attributes.

PS: if we do not specify an index when creating the DataFrame, Pandas automatically generates an integer index starting from 0.
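
To make this concrete, here is a minimal sketch of the dictionary + list construction. The column names and values are made-up stand-ins for the case data, not the exact figures from the screenshot:

```python
import pandas as pd

# Each key becomes a column title; each list becomes that column's values.
df = pd.DataFrame({
    'source': ['Search', 'Direct', 'Social'],
    'source_detail': ['-baidu', '-homepage', '-weibo'],
    'visitors': [10450, 8200, 3500],
    'conversion_rate': ['9.98%', '8.25%', '5.10%'],
    'unit_price': [102.3, 98.7, 120.5],
})

print(df)          # the table itself
print(df.columns)  # column labels
print(df.index)    # row index (defaults to 0, 1, 2, ... when not specified)
print(df.values)   # the underlying values as a NumPy array
```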

2.2 Reading

More often, we read existing data files directly into Pandas to work on. Here are two very similar reading methods: one for files in CSV format, the other for files in Excel format (.xlsx and .xls suffixes).

Reading a CSV file:

[Figure: reading a CSV file with read_csv]

The engine parameter selects the parsing engine. When reading CSV files, engine='python' is often specified to avoid errors caused by Chinese characters and encodings. Reading an Excel file has the same flavor:

[Figure: reading an Excel file with read_excel]
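
Both readers follow the same pattern; here is a sketch under the assumption that the files sit in the working directory (the file names are placeholders):

```python
import pandas as pd

# engine='python' is the more tolerant parser, often used for files that
# contain Chinese characters; adjust encoding (e.g. 'gbk') to match the file.
df = pd.read_csv('case_data.csv', engine='python', encoding='utf-8')

# Reading Excel works the same way; sheet_name selects a worksheet if needed.
df2 = pd.read_excel('case_data.xlsx')
```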

It's that easy. read_csv and read_excel also take other parameters, such as header, sep, and names, which are worth exploring on your own. In practice, data sources are usually fairly regular, and in most cases they can be read directly.

2.3 Storage

Storing is just as simple, and very similar:

[Figure: saving the data with to_csv and to_excel]
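
A sketch of the matching writers (again with placeholder file names):

```python
# index=False keeps the automatically generated 0, 1, 2, ... index out of the file;
# 'utf-8-sig' helps Excel display Chinese characters in the CSV correctly.
df.to_csv('result.csv', index=False, encoding='utf-8-sig')
df.to_excel('result.xlsx', index=False)
```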

3. Getting to know the data quickly

Here we take our case data as an example to quickly get familiar with viewing the first or last N rows, getting an overview of the data format, and computing basic statistics.

3.1 Viewing the head and tail of the data

Often we want a quick overview of the data. df.head() shows the first 5 rows by default, and correspondingly df.tail() shows the last 5 rows. Both functions accept a number to control how many rows are shown; for example, df.head(10) shows the first 10 rows.

[Figure: output of df.head()]
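
For reference:

```python
df.head()    # first 5 rows by default
df.tail()    # last 5 rows
df.head(10)  # first 10 rows
```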

3.2 Viewing the format

df.info() tells us, in a single step, the data type of each column and how many values are missing:

[Figure: output of df.info()]
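
The call itself is a one-liner:

```python
df.info()  # prints row/column counts, per-column dtypes, non-null counts and memory usage
```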

From this output you can directly read off the number of rows and columns, the memory size of the data set, the data type of each column, and how many non-null values each column contains.

3.3 Statistical overview

df.describe() quickly computes the key statistical indicators of the numeric data, such as the mean, median, standard deviation, and so on:

[Figure: output of df.describe()]
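
The call:

```python
df.describe()  # summary statistics for the numeric columns only
```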

We originally had 5 columns of data, so why does the result contain only two? Because this operation works on numeric columns only. count is the number of non-null values in each column; mean, std, min, and max are the column's mean, standard deviation, minimum, and maximum; and 25%, 50%, and 75% are the corresponding quantiles.

4. Basic processing methods of columns

Here, we borrow the logic of SQL's four classic operations to sort out the basic ways of handling columns: add, delete, select, and modify.

**Warm reminder:** when using Pandas, try to avoid processing data row by row or cell by cell, Excel-style. Gradually develop column-oriented thinking: the values in a column all share the same nature, so column-wise processing is very fast.

4.1 Add

To add a column, use the form df['new column name'] = new column values, assigning the values based on the existing data:
[Figure: adding a new column]
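
A hypothetical example of deriving a new column from an existing one (the column names follow the earlier sketch):

```python
# Every row gets its own computed value.
df['visitors_in_thousands'] = df['visitors'] / 1000
```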

4.2 Delete

We use the drop function to delete the specified column. axis=1 indicates that the operation applies to columns. If inplace is True, the source data is modified directly; otherwise the source data stays unchanged.

[Figure: dropping a column with drop]
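
For instance, dropping the column we just added:

```python
# Without inplace=True, drop returns a new DataFrame and leaves df untouched.
df_dropped = df.drop('visitors_in_thousands', axis=1)

# With inplace=True, df itself is modified and nothing is returned.
df.drop('visitors_in_thousands', axis=1, inplace=True)
```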

4.3 Select

What if we want to select a single column? df['column name'] does it:

[Figure: selecting a single column]

What about selecting multiple columns? Pass in a list: df[['first column', 'second column', 'third column', ...]]

[Figure: selecting multiple columns]
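
Using the stand-in column names from before:

```python
visitors = df['visitors']                          # a single column -> Series
subset = df[['source', 'visitors', 'unit_price']]  # a list of columns -> DataFrame
```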

4.4 Modify

This one takes more work. Complex filtering and modification based on specific conditions and rows will be discussed in detail later with cases. Here we only cover the simplest kind of change: df['old column name'] = a single value or a column of values, which overwrites the original column values.
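
Two minimal examples (the 5% adjustment is purely hypothetical):

```python
# Overwrite an existing column with a single value, broadcast to every row ...
df['unit_price'] = 100

# ... or with a whole column of values of the same length.
df['unit_price'] = df['unit_price'] * 1.05
```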

5. Common data types and operations

5.1 String

The string type is one of the most commonly used formats. String operations in Pandas are almost identical to native Python string operations; the only difference is that '.str' needs to be added before the operation.

Warm reminder: when we first use df2.info() to check the data types, all non-numeric columns are reported as object. We won't dig into the deeper differences between the object and str types here; for regular practical use, you can simply treat object as corresponding to the str format, int64 to the int format, and float64 to the float format.

In the case data, we notice that every string in the source details column starts with a '-' symbol, probably a leftover from how the system exported the data. It is ugly and useless, so we remove it:

[Figure: stripping the leading '-' with the .str accessor]

Generally speaking, the cleaned column is assigned back to replace the original column:

[Figure: assigning the cleaned column back over the original]
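
A sketch, assuming the source details live in the stand-in 'source_detail' column and every value starts with '-':

```python
# .str exposes vectorised string methods; strip the leading '-' and
# assign the cleaned values back over the original column.
df['source_detail'] = df['source_detail'].str.lstrip('-')
```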

5.2 Numeric type

For numeric data, the most common operation is arithmetic, which splits into operations against a single value and operations between columns of equal length.

Take the case data as an example. We know the number of visitors from each source; now suppose we want to add 10,000 visitors to every channel. How do we do it?

[Figure: adding 10,000 to the visitors column]

Just select the visitors column and add 10,000. Pandas automatically adds 10,000 to the value in every row, and other operations with a single value (subtraction, multiplication, division) work the same way.
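
In code (using the stand-in column name):

```python
# The scalar is broadcast to every row of the column.
df['visitors'] = df['visitors'] + 10000
```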

Operations between columns are just as concise. The source data contains the number of visitors, the conversion rate, and the unit price, but in real work we care more about the sales contributed by each channel (sales = visitors × conversion rate × unit price).

The corresponding statement (with the stand-in column names): df['sales'] = df['visitors'] * df['conversion_rate'] * df['unit_price']

But then why does it throw a flood of errors?

The reason for the error is that numeric data and non-numeric data are being multiplied together: Pandas reads the conversion rate, with its '%' sign, as a string. We need to strip the percent sign first and then convert the column to floating-point numbers:

[Figure: stripping the '%' sign from the conversion rate]

Note that this operation turns 9.98% into 9.98, so we also need to divide the conversion rate by 100 to recover the true percentage value:

[Figure: converting the conversion rate to float and dividing by 100]

Then, multiply the three indicators together to compute sales:

[Figure: computing the sales column]
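
Putting the three steps together (assuming the conversion rate column holds strings such as '9.98%'):

```python
# 1. Strip the '%' sign, leaving strings such as '9.98'.
df['conversion_rate'] = df['conversion_rate'].str.replace('%', '')

# 2. Convert to float and divide by 100 to recover the true rate (9.98 -> 0.0998).
df['conversion_rate'] = df['conversion_rate'].astype('float') / 100

# 3. Multiply the three columns element-wise to get sales.
df['sales'] = df['visitors'] * df['conversion_rate'] * df['unit_price']
```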

5.3 Time type

Time series in Pandas is deep water. Here we only explain the most basic everyday date format; students interested in time series can look up the relevant material to learn more.

Take the case data as an example. Our channel data was extracted on August 2, 2019, and channel data from other dates may be involved later, so we need to add a time column to tell them apart. The common date formats in Excel are '2019-8-3' or '2019/8/3'; let's implement this with Pandas:

[Figure: adding a date column]

In real business, Pandas will sometimes read date fields from a file as plain strings. Here we first assign the string '2019-8-3' to a new date column, and then use the to_datetime() function to convert the string type into a time format:

[Figure: converting the date column with to_datetime]
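
A sketch of the two steps (the column name 'date' is a stand-in):

```python
import pandas as pd

# Assign the extraction date as a plain string, then convert the whole column.
df['date'] = '2019-8-3'
df['date'] = pd.to_datetime(df['date'])  # dtype becomes datetime64[ns]
```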

Once converted to a time format (datetime64 here), we can process this data with the usual date arithmetic. For example, suppose I want to know, at extraction time, how many days are left until the end of the year ('2019-12-31'): just subtract the dates (the function accepts a sequence of date-like strings as well as a single string):

[Figure: number of days remaining until the end of the year]
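
For example (the column name 'days_to_year_end' is made up):

```python
# Subtracting datetimes yields a timedelta ("x days") for every row.
df['days_to_year_end'] = pd.to_datetime('2019-12-31') - df['date']
```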

Isn't it very simple?

Original post: blog.csdn.net/maiya_yaya/article/details/131780105