Big data (5): Detailed explanation of the basic application of Pandas (2)

Column introduction

Combining my own experience with internal materials, this column summarizes a set of Python tutorials. At 3-5 chapters a day, you can work through Python comprehensively in as little as a month and move on to hands-on development. After finishing it, you will definitely level up. Come on, keep at it!

For all articles, please visit the column: "Python Full Stack Tutorial (0 Basics)"
Also recommended is the recently updated column "Detailed Explanation of High-Frequency Interview Questions at Major Tech Companies". It provides detailed answers to frequently asked testing interview questions from recent years, drawing on my own years of work experience and guidance from senior colleagues. It aims to help testing and Python students pass interviews smoothly and land a satisfactory offer!



Detailed explanation of the basic application of Pandas (2)

Applications of DataFrames

Create DataFrame object

Creating a DataFrame object from a two-dimensional array

code:

import numpy as np
import pandas as pd

scores = np.random.randint(60, 101, (5, 3))
courses = ['语文', '数学', '英语']
ids = [1001, 1002, 1003, 1004, 1005]
df1 = pd.DataFrame(data=scores, columns=courses, index=ids)
df1

output:

      语文  数学  英语
1001    69    80    79
1002    71    60   100
1003    94    81    93
1004    88    88    67
1005    82    66    60

Creating a DataFrame object from a dictionary

code:

scores = {
    '语文': [62, 72, 93, 88, 93],
    '数学': [95, 65, 86, 66, 87],
    '英语': [66, 75, 82, 69, 82],
}
ids = [1001, 1002, 1003, 1004, 1005]
df2 = pd.DataFrame(data=scores, index=ids)
df2

output:

      语文  数学  英语
1001    62    95    66
1002    72    65    75
1003    93    86    82
1004    88    66    69
1005    93    87    82

Reading a CSV file to create a DataFrame object

CSV files can be read with the read_csv function of the pandas module. read_csv takes many parameters; several of the important ones are listed below.

  • sep / delimiter: the delimiter, which defaults to ,.
  • header: the position of the table header (column index); the default value infer uses the content of the first row as the header.
  • index_col: the column(s) to use as the row index (labels).
  • usecols: the columns to load, specified by position or by column name.
  • true_values / false_values: which values should be treated as the booleans True / False.
  • skiprows: the rows to skip, specified by line number, index or a function.
  • skipfooter: the number of lines at the end of the file to skip.
  • nrows: the number of rows to read.
  • na_values: which values should be treated as null.
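As a quick sketch of a few of these parameters, the snippet below parses an inline, made-up CSV (the file name, columns and values are invented for illustration only):

```python
import io

import pandas as pd

# An inline, made-up CSV using ';' as the delimiter and 'N/A' for nulls
csv_text = """id;name;score
1;Alice;90
2;Bob;N/A
3;Carol;85
4;Dave;70
5;Eve;60
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',              # delimiter is ';' instead of the default ','
    index_col='id',       # use the id column as the row index
    na_values=['N/A'],    # treat 'N/A' as a null value
    nrows=3               # read only the first 3 rows of data
)
df
```

The result has only 3 rows, Bob's score is NaN, and the id column has become the row index.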

code:

df3 = pd.read_csv('2018年北京积分落户数据.csv', index_col='id')
df3

output:

     name   birthday    company       score
id				
1    杨x    1972-12    北京利德xxxx	  122.59
2    纪x    1974-12    北京航天xxxx	  121.25
3    王x    1974-05	  品牌联盟xxxx    118.96
4    杨x    1975-07	  中科专利xxxx    118.21
5    张x    1974-11	  北京阿里xxxx    117.79
...  ...    ...        ...            ...
6015 孙x    1978-08	  华为海洋xxxx	  90.75
6016 刘x    1976-11	  福斯流体xxxx    90.75
6017 周x    1977-10	  赢创德固xxxx    90.75
6018 赵x	   1979-07	  澳科利耳xxxx    90.75
6019 贺x	   1981-06	  北京宝洁xxxx    90.75
6019 rows × 4 columns

Note: if you need the CSV file from the example above, you can get it from the Baidu cloud disk link below; the data is in the "Learning Data Analysis from Scratch" directory. Link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g, extraction code: e7b4.

Reading an Excel file to create a DataFrame object

Excel files can be read with the read_excel function of the pandas module. This function is very similar to read_csv above, with an extra sheet_name parameter to specify the name of the worksheet; unlike with CSV files, there are no parameters such as sep or delimiter. In the code below, the skiprows parameter of read_excel is a lambda function, which specifies that only the header row and about 10% of the data rows in the Excel file are read, skipping the rest.

code:

import random

df4 = pd.read_excel(
    io='小宝剑大药房2018年销售数据.xlsx',
    usecols=['购药时间', '社保卡号', '商品名称', '销售数量', '应收金额', '实收金额'],
    skiprows=lambda x: x > 0 and random.random() > 0.1
)
df4

Note: if you need the Excel file from the example above, you can get it from the Baidu cloud disk link below; the data is in the "Learning Data Analysis from Scratch" directory. Link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g, extraction code: e7b4.

output:

    购药时间			社保卡号	    商品名称    销售数量	应收金额	实收金额
0	2018-03-23 星期三	10012157328		强力xx片	 1			13.8		13.80
1	2018-07-12 星期二	108207828	    强力xx片	 1	        13.8		13.80
2	2018-01-17 星期日	13358228	    清热xx液	 1		    28.0		28.00
3	2018-07-11 星期一	10031402228		三九xx灵	 5			149.0		130.00
4	2018-01-20 星期三	10013340328		三九xx灵	 3			84.0		73.92
...	...					...				...		...			...			...
618	2018-03-05 星期六	10066059228		开博xx通	 2			56.0		49.28
619	2018-03-22 星期二	10035514928		开博xx通	 1			28.0		25.00
620	2018-04-15 星期五	1006668328	    开博xx通	 2			56.0		50.00
621	2018-04-24 星期日	10073294128		高特xx灵	 1			5.6			5.60
622	2018-04-24 星期日	10073294128		高特xx灵	 10			56.0		56.0
623 rows × 6 columns

Creating a DataFrame object by reading data from a database via SQL

The read_sql function of the pandas module can read data from a database via a SQL statement to create a DataFrame object; the second parameter of the function represents the database connection. For a MySQL database, we can use pymysql or mysqlclient to create a database connection and obtain a Connection object, which is exactly what read_sql expects as its second parameter. The code is as follows.

code:

import pymysql

# 创建一个MySQL数据库的连接对象
conn = pymysql.connect(
    host='47.104.31.138', port=3306,
    user='guest', password='Guest.618',
    database='hrs', charset='utf8mb4'
)
# 通过SQL从数据库读取数据创建DataFrame
df5 = pd.read_sql('select * from tb_emp', conn, index_col='eno')
df5

Tip: to run the code above, you need to install the pymysql library first. If it is not installed yet, you can run !pip install pymysql in a Notebook cell and then execute the code above. The code connects to my MySQL database deployed on Alibaba Cloud: public IP 47.104.31.138, user name guest, password Guest.618, database hrs, table tb_emp, character set utf8mb4. You are welcome to use this database, but please do not access it maliciously.

output:

        ename    job     mgr      sal    comm    dno
eno						
1359	胡一刀   销售员	3344.0   1800   200.0   30
2056	乔峰	   分析师	 7800.0   5000   1500.0	 20
3088	李莫愁	  设计师	2056.0   3500   800.0   20
3211	张无忌	  程序员	2056.0   3200   NaN     20
3233	丘处机	  程序员	2056.0   3400	NaN     20
3244	欧阳锋	  程序员	3088.0   3200	NaN     20
3251	张翠山	  程序员	2056.0   4000	NaN     20
3344	黄蓉	   销售主管	7800.0   3000	800.0   30
3577	杨过	   会计	  5566.0   2200   NaN	  10
3588	朱九真	  会计	 5566.0   2500   NaN	 10
4466	苗人凤	  销售员	3344.0   2500	NaN     30
5234	郭靖	   出纳	  5566.0   2000   NaN	  10
5566	宋远桥	  会计师	7800.0   4000   1000.0  10
7800	张三丰	  总裁	 NaN      9000   1200.0  20

Basic properties and methods

Before explaining the properties and methods of DataFrame, we first read the three tables from the hrs database mentioned earlier and create three DataFrame objects. The code is as follows.

import pymysql

conn = pymysql.connect(
    host='47.104.31.138', port=3306, 
    user='guest', password='Guest.618', 
    database='hrs', charset='utf8mb4'
)
dept_df = pd.read_sql('select * from tb_dept', conn, index_col='dno')
emp_df = pd.read_sql('select * from tb_emp', conn, index_col='eno')
emp2_df = pd.read_sql('select * from tb_emp2', conn, index_col='eno')

The resulting three DataFrame objects are shown below.

Department table (dept_df), where dno is the department number, and dname and dloc are the department's name and location respectively.

    dname  dloc
dno
10	会计部	北京
20	研发部	成都
30	销售部	重庆
40	运维部	天津

Employee table (emp_df), where eno is the employee number, and ename, job, mgr, sal, comm and dno represent the employee's name, position, supervisor's number, monthly salary, allowance and department number respectively.

        ename    job        mgr      sal     comm    dno
eno
1359	胡一刀    销售员	   3344.0	1800	200.0	30
2056	乔峰	    分析师	    7800.0	 5000	 1500.0	 20
3088	李莫愁	   设计师	   2056.0	3500	800.0	20
3211	张无忌	   程序员	   2056.0	3200	NaN     20
3233	丘处机	   程序员	   2056.0	3400	NaN	    20
3244	欧阳锋	   程序员	   3088.0	3200	NaN     20
3251	张翠山	   程序员	   2056.0	4000	NaN	    20
3344	黄蓉	    销售主管   7800.0	3000	800.0	30
3577	杨过	    会计	     5566.0	  2200	  NaN	  10
3588	朱九真	   会计	    5566.0	 2500	 NaN	 10
4466	苗人凤	   销售员	   3344.0	2500	NaN	    30
5234	郭靖	    出纳	     5566.0	  2000	  NaN	  10
5566	宋远桥	   会计师	   7800.0	4000	1000.0	10
7800	张三丰	   总裁	    NaN      9000	 1200.0	 20

Explanation: the mgr and comm columns are of type int in the database, but because they contain missing (null) values, their data type becomes float after being read into the DataFrame, since we usually use the float value NaN to represent nulls.
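This promotion to float is easy to reproduce with a tiny, self-contained example (the values below are made up):

```python
import numpy as np
import pandas as pd

# An all-integer column keeps an integer dtype...
s1 = pd.Series([200, 1500, 800])
print(s1.dtype)

# ...but as soon as a missing value appears, the column is
# promoted to float, because NaN is a floating-point value
s2 = pd.Series([200, 1500, None])
print(s2.dtype)
```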

The second employee table (emp2_df) has the same structure as the employee table above, but holds different employee data.

        ename    job    mgr     sal      comm    dno
eno
9800	骆昊	   架构师	7800	30000	 5000	 20
9900	王小刀	  程序员  9800	   10000	1200	20
9700	王大锤	  程序员  9800    8000 	600	    20

The properties of a DataFrame object are shown in the table below.

attribute    description
at / iat     gets a single value from the DataFrame by label (at) or by position (iat)
columns      the column index of the DataFrame
dtypes       the data type of each column of the DataFrame
empty        whether the DataFrame is empty
loc / iloc   gets a group of values from the DataFrame by label (loc) or by position (iloc)
ndim         the number of dimensions of the DataFrame
shape        the shape of the DataFrame (number of rows and columns)
size         the number of elements in the DataFrame
values       the two-dimensional array holding the DataFrame's data
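A small self-contained sketch (with made-up data) showing several of these properties in action:

```python
import pandas as pd

# Made-up scores, just to illustrate the properties above
df = pd.DataFrame(
    {'语文': [90, 80], '数学': [85, 95]},
    index=[1001, 1002]
)

print(df.shape)              # number of rows and columns
print(df.size)               # number of elements
print(df.ndim)               # number of dimensions
print(df.empty)              # whether the DataFrame is empty
print(df.at[1001, '语文'])   # single value by label
print(df.iat[0, 0])          # single value by position
print(df.values)             # the underlying two-dimensional array
```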

Among the methods of DataFrame, the first one to understand is info(), which helps us learn basic information about a DataFrame, as shown below.

code:

emp_df.info()

output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 1359 to 7800
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ename   14 non-null     object 
 1   job     14 non-null     object 
 2   mgr     13 non-null     float64
 3   sal     14 non-null     int64  
 4   comm    6 non-null      float64
 5   dno     14 non-null     int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 1.3+ KB

If you need to view the head or tail of a DataFrame, you can use the head() or tail() method. The default parameter of both methods is 5, meaning they fetch the first 5 or last 5 rows of the DataFrame, as shown below.

emp_df.head()

output:

        ename    job    mgr    sal    comm  dno
eno						
1359	胡一刀   销售员	3344   1800  200   30
2056	乔峰	   分析师	 7800   5000  1500	20
3088	李莫愁	  设计师	2056   3500  800   20
3211	张无忌	  程序员	2056   3200  NaN   20
3233	丘处机	  程序员	2056   3400	 NaN   20

Retrieving data

Indexing and Slicing

If you want to fetch a certain column of a DataFrame, for example the ename column of emp_df above, you can use either of the two methods below.

emp_df.ename

or

emp_df['ename']

Executing the code above shows that what we get is a Series object. In fact, a DataFrame is the result of combining multiple Series objects.
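This can be checked directly with a tiny made-up DataFrame standing in for emp_df:

```python
import pandas as pd

# A tiny made-up DataFrame standing in for emp_df
df = pd.DataFrame({'ename': ['胡一刀', '乔峰'], 'sal': [1800, 5000]})

col = df['ename']
print(type(col))                   # a Series object
print(col.index.equals(df.index))  # each column shares the DataFrame's index
```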

If you want to fetch a certain row of a DataFrame, you can use the integer position or the index we set. For example, to fetch the data of the employee whose number is 2056, the code is as follows.

emp_df.iloc[1]

or

emp_df.loc[2056]

Executing the code above, we find that fetching a single row or a single column of a DataFrame always yields a Series object. Of course, we can also fetch multiple rows or columns via fancy indexing, and the result of fancy indexing is still a DataFrame object.

Get multiple columns:

emp_df[['ename', 'job']]

Get multiple rows:

emp_df.loc[[2056, 7800, 3344]]

If you want to fetch or modify the data in a single cell of a DataFrame object, you need to specify both the row and the column index. For example, to fetch the job information of the employee whose number is 2056, the code is as follows.

emp_df['job'][2056]

or

emp_df.loc[2056]['job']

or

emp_df.loc[2056, 'job']

We recommend the third method, because it performs only one indexing operation. If you want to change this employee's job title to "Architect", you can use the code below.

emp_df.loc[2056, 'job'] = '架构师'

Of course, we can also fetch multiple rows and multiple columns through slicing operations; I am sure many of you have already thought of this.

emp_df.loc[2056:3344]

output:

        ename    job        mgr      sal     comm    dno
eno
2056	乔峰	    分析师	    7800.0	 5000	 1500.0	 20
3088	李莫愁	   设计师	   2056.0	3500	800.0	20
3211	张无忌	   程序员	   2056.0	3200	NaN     20
3233	丘处机	   程序员	   2056.0	3400	NaN	    20
3244	欧阳锋	   程序员	   3088.0	3200	NaN     20
3251	张翠山	   程序员	   2056.0	4000	NaN	    20
3344	黄蓉	    销售主管   7800.0	3000	800.0	30

Data filtering

We mentioned fancy indexing above, so you have probably already thought of boolean indexing. As with ndarray and Series, we can filter the data of a DataFrame object with a boolean index. For example, to filter out of emp_df the employees whose monthly salary exceeds 3500, the code is as follows.

emp_df[emp_df.sal > 3500]

output:

        ename    job        mgr      sal     comm    dno
eno
2056	乔峰	    分析师	    7800.0	 5000	 1500.0	 20
3251	张翠山	   程序员	   2056.0	4000	NaN	    20
5566	宋远桥	   会计师	   7800.0	4000	1000.0	10
7800	张三丰	   总裁	    NaN      9000	 1200.0	 20

Of course, we can also combine multiple conditions to filter data. For example, to filter out of emp_df the employees whose monthly salary exceeds 3500 and whose department number is 20, the code is as follows.

emp_df[(emp_df.sal > 3500) & (emp_df.dno == 20)]

output:

        ename    job        mgr      sal     comm    dno
eno
2056	乔峰	    分析师	    7800.0	 5000	 1500.0	 20
3251	张翠山	   程序员	   2056.0	4000	NaN	    20
7800	张三丰	   总裁	    NaN      9000	 1200.0	 20

Besides boolean indexing, the query method of a DataFrame object can also filter data. The parameter of query is a string representing the filtering expression, which better matches the habits of Python programmers. Next, we use the query method to reproduce the effect above; the code is as follows.

emp_df.query('sal > 3500 and dno == 20')

Reshaping data

Sometimes, the raw data we need for analysis does not all come from one place. In the example above, we read three tables from a relational database and obtained three DataFrame objects, but real work may require us to bring their data together. For example, emp_df and emp2_df both hold employee data with exactly the same structure, so we can use the concat function provided by pandas to splice two or more DataFrames together. The code is as follows.

all_emp_df = pd.concat([emp_df, emp2_df])

output:

        ename    job        mgr      sal     comm    dno
eno
1359    胡一刀    销售员	   3344.0	1800	200.0	30
2056    乔峰	    分析师	    7800.0	 5000	 1500.0	 20
3088    李莫愁	   设计师	   2056.0	3500	800.0	20
3211    张无忌	   程序员	   2056.0	3200	NaN     20
3233    丘处机	   程序员	   2056.0	3400	NaN	    20
3244    欧阳锋	   程序员	   3088.0	3200	NaN     20
3251    张翠山	   程序员	   2056.0	4000	NaN	    20
3344    黄蓉	    销售主管   7800.0	3000	800.0	30
3577    杨过	    会计	     5566.0	  2200	  NaN	  10
3588    朱九真	   会计	    5566.0	 2500	 NaN	 10
4466    苗人凤	   销售员	   3344.0	2500	NaN	    30
5234    郭靖	    出纳	     5566.0	  2000	  NaN	  10
5566    宋远桥	   会计师	   7800.0	4000	1000.0	10
7800    张三丰	   总裁	    NaN      9000	 1200.0	 20
9800    骆昊	    架构师     7800.0	 30000	 5000.0	 20
9900    王小刀	   程序员     9800.0	10000	1200.0	20
9700    王大锤	   程序员     9800.0	8000	600.0	20

The code above splices the two sets of employee data into one DataFrame. Next, we use the merge function to merge the employee table and the department table into one table; the code is as follows.

First, reset the index of all_emp_df with the reset_index method, so that eno is no longer the index but an ordinary column. Setting the inplace parameter of reset_index to True means the reset is performed on all_emp_df directly instead of returning a new, modified object.

all_emp_df.reset_index(inplace=True)

Merge the data with the merge function; of course, you can also call the merge method of a DataFrame object to achieve the same effect.

pd.merge(dept_df, all_emp_df, how='inner', on='dno')

output:

    dno dname  dloc eno   ename  job      mgr     sal    comm
0   10	会计部	北京	3577  杨过	会计	   5566.0  2200   NaN
1   10	会计部	北京	3588  朱九真  会计     5566.0  2500   NaN
2   10	会计部	北京	5234  郭靖	出纳	   5566.0  2000   NaN
3   10	会计部	北京	5566  宋远桥  会计师   7800.0	 4000   1000.0
4   20	研发部	成都	2056  乔峰	架构师   7800.0  5000	 1500.0
5   20	研发部	成都	3088  李莫愁  设计师   2056.0	 3500   800.0
6   20	研发部	成都	3211  张无忌  程序员   2056.0	 3200   NaN
7   20	研发部	成都	3233  丘处机  程序员   2056.0	 3400   NaN
8   20	研发部	成都	3244  欧阳锋  程序员   3088.0	 3200   NaN
9   20	研发部	成都	3251  张翠山  程序员   2056.0	 4000   NaN
10  20	研发部	成都	7800  张三丰  总裁     NaN     9000   1200.0
11  20	研发部	成都	9800  骆昊    架构师   7800.0  30000	 5000.0
12  20	研发部	成都	9900  王小刀  程序员	 9800.0	 10000  1200.0
13  20	研发部	成都	9700  王大锤  程序员	 9800.0	 8000   600.0
14  30	销售部	重庆	1359  胡一刀  销售员	 3344.0	 1800   200.0
15  30	销售部	重庆	3344  黄蓉    销售主管 7800.0	 3000   800.0
16  30	销售部	重庆	4466  苗人凤  销售员   3344.0	 2500   NaN

The first parameter of the merge function is the left table of the merge and the second is the right table; anyone with SQL experience will find these terms familiar. As you might have guessed, merging DataFrame objects is very similar to a table join in a database. The how parameter specifies how the two tables are merged, with four options: left, right, inner and outer; on specifies the column on which the merge is based, equivalent to the join condition in a SQL table join. If the column names differ between the left and right tables, you can use the left_on and right_on parameters instead of on to specify them separately.
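A minimal sketch of left_on / right_on, using two hypothetical tables whose join columns have different names (the tables and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical tables: the department key is called `dno` on the
# left but `dept_no` on the right, so `on` alone will not work
dept = pd.DataFrame({'dno': [10, 20], 'dname': ['会计部', '研发部']})
emp = pd.DataFrame({'ename': ['杨过', '乔峰'], 'dept_no': [10, 20]})

merged = pd.merge(dept, emp, how='inner', left_on='dno', right_on='dept_no')
print(merged)
```

Note that both key columns are kept in the result, since their names differ.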

If you modify the code above slightly and change the how parameter to left, think about what the result of the code will be.

pd.merge(dept_df, all_emp_df, how='left', on='dno')

The result has one more row than the previous output, shown below. This is because left means a left outer join: all rows of the left table dept_df are kept, but since no employee in all_emp_df belongs to department 40, the corresponding positions are filled with null values.

17  40  运维部  天津  NaN  NaN  NaN  NaN  NaN  NaN
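The same behaviour can be reproduced with a tiny, made-up pair of tables:

```python
import pandas as pd

# Department 40 has no employees, so after a left join the
# employee columns of that row are filled with NaN
dept = pd.DataFrame({'dno': [30, 40], 'dname': ['销售部', '运维部']})
emp = pd.DataFrame({'ename': ['胡一刀'], 'dno': [30]})

result = pd.merge(dept, emp, how='left', on='dno')
print(result)
```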


Origin blog.csdn.net/ml202187/article/details/132488078