Detailed explanation of the basic application of Pandas (2)
Applications of DataFrames
Create DataFrame object
Create a DataFrame object from a two-dimensional array
code:
import numpy as np
import pandas as pd

scores = np.random.randint(60, 101, (5, 3))
courses = ['语文', '数学', '英语']
ids = [1001, 1002, 1003, 1004, 1005]
df1 = pd.DataFrame(data=scores, columns=courses, index=ids)
df1
output:
语文 数学 英语
1001 69 80 79
1002 71 60 100
1003 94 81 93
1004 88 88 67
1005 82 66 60
Create a DataFrame object from a dictionary
code:
scores = {
'语文': [62, 72, 93, 88, 93],
'数学': [95, 65, 86, 66, 87],
'英语': [66, 75, 82, 69, 82],
}
ids = [1001, 1002, 1003, 1004, 1005]
df2 = pd.DataFrame(data=scores, index=ids)
df2
output:
语文 数学 英语
1001 62 95 66
1002 72 65 75
1003 93 86 82
1004 88 66 69
1005 93 87 82
Read a CSV file to create a DataFrame object
A CSV file can be read with the read_csv function of the pandas module. read_csv takes many parameters; several important ones are listed below.

- sep / delimiter: the field delimiter, which defaults to ,.
- header: the position of the table header (column index); the default value infer uses the content of the first row as the header.
- index_col: the column to use as the row index (labels).
- usecols: the columns to load, specified by position or by name.
- true_values / false_values: which values are treated as boolean True / False.
- skiprows: the rows to skip, specified by row number, index, or a function.
- skipfooter: the number of trailing rows to skip.
- nrows: the number of rows to read.
- na_values: which values are treated as null.
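The parameters described above can be tried out on a small in-memory CSV; this is a minimal sketch with made-up data (io.StringIO stands in for a file on disk).

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a file on disk (made-up data).
csv_text = """id;name;score
1001;Alice;90
1002;Bob;N/A
1003;Carol;78
1004;Dave;85
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                          # this file uses ';' instead of ','
    index_col='id',                   # use the 'id' column as the row index
    usecols=['id', 'name', 'score'],  # load only these columns
    na_values=['N/A'],                # treat 'N/A' as a null value
    nrows=3,                          # read only the first 3 data rows
)
print(df)
```

Note that because 'N/A' is mapped to a null value, the score column is read as float rather than int.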
code:
df3 = pd.read_csv('2018年北京积分落户数据.csv', index_col='id')
df3
output:
name birthday company score
id
1 杨x 1972-12 北京利德xxxx 122.59
2 纪x 1974-12 北京航天xxxx 121.25
3 王x 1974-05 品牌联盟xxxx 118.96
4 杨x 1975-07 中科专利xxxx 118.21
5 张x 1974-11 北京阿里xxxx 117.79
... ... ... ... ...
6015 孙x 1978-08 华为海洋xxxx 90.75
6016 刘x 1976-11 福斯流体xxxx 90.75
6017 周x 1977-10 赢创德固xxxx 90.75
6018 赵x 1979-07 澳科利耳xxxx 90.75
6019 贺x 1981-06 北京宝洁xxxx 90.75
6019 rows × 4 columns
Note : If you need the CSV file in the above example, you can obtain it through the following Baidu cloud disk address, and the data is in the directory of "Learning Data Analysis from Scratch". Link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g, extraction code: e7b4.
Read an Excel file to create a DataFrame object
An Excel file can be read with the read_excel function of the pandas module. This function is very similar to read_csv above, with one additional parameter, sheet_name, that specifies the name of the worksheet; unlike with CSV files, there are no sep or delimiter parameters. In the following code, the skiprows parameter of read_excel is a lambda function, which keeps the header and a random sample of roughly 10% of the data rows in the Excel file and skips the rest.
code:
import random
df4 = pd.read_excel(
io='小宝剑大药房2018年销售数据.xlsx',
usecols=['购药时间', '社保卡号', '商品名称', '销售数量', '应收金额', '实收金额'],
skiprows=lambda x: x > 0 and random.random() > 0.1
)
df4
Note : If you need the Excel file in the above example, you can get it through the following Baidu cloud disk address, and the data is in the directory of "Learning Data Analysis from Scratch". Link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g, extraction code: e7b4.
output:
购药时间 社保卡号 商品名称 销售数量 应收金额 实收金额
0 2018-03-23 星期三 10012157328 强力xx片 1 13.8 13.80
1 2018-07-12 星期二 108207828 强力xx片 1 13.8 13.80
2 2018-01-17 星期日 13358228 清热xx液 1 28.0 28.00
3 2018-07-11 星期一 10031402228 三九xx灵 5 149.0 130.00
4 2018-01-20 星期三 10013340328 三九xx灵 3 84.0 73.92
... ... ... ... ... ... ...
618 2018-03-05 星期六 10066059228 开博xx通 2 56.0 49.28
619 2018-03-22 星期二 10035514928 开博xx通 1 28.0 25.00
620 2018-04-15 星期五 1006668328 开博xx通 2 56.0 50.00
621 2018-04-24 星期日 10073294128 高特xx灵 1 5.6 5.60
622 2018-04-24 星期日 10073294128 高特xx灵 10 56.0 56.0
623 rows × 6 columns
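The skiprows callable works the same way in read_csv, and it is easier to see its behavior with a deterministic rule. In the sketch below (made-up data, an in-memory CSV standing in for the Excel file), every 10th data row is kept instead of a random 10% sample, so the result is reproducible.

```python
import io

import pandas as pd

# 100 data rows plus a header, standing in for the Excel file above.
csv_text = 'n,v\n' + '\n'.join(f'{i},{i * 2}' for i in range(100))

# skiprows may be a callable: it receives each row number and returns True
# for rows to skip. Row 0 (the header) is always kept; here we keep every
# 10th data row -- a deterministic stand-in for the random 10% sample above.
df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda x: x > 0 and x % 10 != 1,
)
print(len(df))   # 10
```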
Create a DataFrame object by reading data from a database with SQL
The read_sql function of the pandas module can create a DataFrame object by reading data from a database with a SQL statement; the second parameter of the function is the database connection. For a MySQL database, we can use pymysql or mysqlclient to create a database connection and obtain a Connection object, which is exactly the second argument that read_sql requires. The code is as follows.
code:
import pandas as pd
import pymysql

# Create a connection object for the MySQL database
conn = pymysql.connect(
host='47.104.31.138', port=3306,
user='guest', password='Guest.618',
database='hrs', charset='utf8mb4'
)
# Read data from the database via SQL to create a DataFrame
df5 = pd.read_sql('select * from tb_emp', conn, index_col='eno')
df5
Tip: To run the code above, you need to install the pymysql library first. If you have not installed it, execute !pip install pymysql in a Notebook cell, then run the code above. The code connects to my MySQL database deployed on Alibaba Cloud, public IP address: 47.104.31.138, user name: guest, password: Guest.618, database: hrs, table name: tb_emp, character set: utf8mb4. You are welcome to use this database, but please do not access it maliciously.
output:
ename job mgr sal comm dno
eno
1359 胡一刀 销售员 3344.0 1800 200.0 30
2056 乔峰 分析师 7800.0 5000 1500.0 20
3088 李莫愁 设计师 2056.0 3500 800.0 20
3211 张无忌 程序员 2056.0 3200 NaN 20
3233 丘处机 程序员 2056.0 3400 NaN 20
3244 欧阳锋 程序员 3088.0 3200 NaN 20
3251 张翠山 程序员 2056.0 4000 NaN 20
3344 黄蓉 销售主管 7800.0 3000 800.0 30
3577 杨过 会计 5566.0 2200 NaN 10
3588 朱九真 会计 5566.0 2500 NaN 10
4466 苗人凤 销售员 3344.0 2500 NaN 30
5234 郭靖 出纳 5566.0 2000 NaN 10
5566 宋远桥 会计师 7800.0 4000 1000.0 10
7800 张三丰 总裁 NaN 9000 1200.0 20
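If no MySQL server is available, the same read_sql call also works with Python's built-in sqlite3 module. Below is a self-contained sketch with made-up employee data; SQLite stands in for the MySQL hrs database.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database standing in for the MySQL 'hrs' database;
# the table contents here are made up for the sketch.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    create table tb_emp (eno integer primary key, ename text, sal integer);
    insert into tb_emp values (9800, 'Alice', 30000), (9900, 'Bob', 10000);
""")

df = pd.read_sql('select * from tb_emp', conn, index_col='eno')
print(df)
conn.close()
```

read_sql officially accepts raw DBAPI connections only for sqlite3; for other databases a SQLAlchemy connectable is the documented interface, though a pymysql Connection as shown above generally works as well.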
Basic properties and methods
Before explaining the properties and methods of DataFrame, let us first read the three tables from the hrs database mentioned above and create three DataFrame objects. The code is as follows.
import pandas as pd
import pymysql
conn = pymysql.connect(
host='47.104.31.138', port=3306,
user='guest', password='Guest.618',
database='hrs', charset='utf8mb4'
)
dept_df = pd.read_sql('select * from tb_dept', conn, index_col='dno')
emp_df = pd.read_sql('select * from tb_emp', conn, index_col='eno')
emp2_df = pd.read_sql('select * from tb_emp2', conn, index_col='eno')
The resulting three DataFrame objects are shown below.
Department table (dept_df), where dno is the department number, and dname and dloc are the department's name and location respectively.
dname dloc
dno
10 会计部 北京
20 研发部 成都
30 销售部 重庆
40 运维部 天津
Employee table (emp_df), where eno is the employee number, and ename, job, mgr, sal, comm, and dno represent the employee's name, job title, supervisor's number, monthly salary, allowance, and department number respectively.
ename job mgr sal comm dno
eno
1359 胡一刀 销售员 3344.0 1800 200.0 30
2056 乔峰 分析师 7800.0 5000 1500.0 20
3088 李莫愁 设计师 2056.0 3500 800.0 20
3211 张无忌 程序员 2056.0 3200 NaN 20
3233 丘处机 程序员 2056.0 3400 NaN 20
3244 欧阳锋 程序员 3088.0 3200 NaN 20
3251 张翠山 程序员 2056.0 4000 NaN 20
3344 黄蓉 销售主管 7800.0 3000 800.0 30
3577 杨过 会计 5566.0 2200 NaN 10
3588 朱九真 会计 5566.0 2500 NaN 10
4466 苗人凤 销售员 3344.0 2500 NaN 30
5234 郭靖 出纳 5566.0 2000 NaN 10
5566 宋远桥 会计师 7800.0 4000 1000.0 10
7800 张三丰 总裁 NaN 9000 1200.0 20
Explanation: In the database, the mgr and comm columns are of type int, but because they contain missing values (nulls), their data type becomes float after being read into the DataFrame, since the float value NaN is conventionally used to represent missing values.
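This int-to-float promotion is easy to reproduce with a tiny made-up frame:

```python
import pandas as pd

# 'sal' has no missing values; 'comm' has one (None).
df = pd.DataFrame({
    'sal': [1800, 5000, 3500],
    'comm': [200, 1500, None],
})
print(df.dtypes)   # sal stays int64, comm becomes float64
```

If you want integer columns that can hold missing values, pandas also offers the nullable Int64 extension dtype.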
The employee table (emp2_df) has the same structure as the employee table above but holds different employee data.
ename job mgr sal comm dno
eno
9800 骆昊 架构师 7800 30000 5000 20
9900 王小刀 程序员 9800 10000 1200 20
9700 王大锤 程序员 9800 8000 600 20
The properties of a DataFrame object are shown in the table below.

| Attribute | Description |
|---|---|
| at / iat | get a single value from the DataFrame by label / position |
| columns | the column index of the DataFrame |
| dtypes | the data type of each column of the DataFrame |
| empty | whether the DataFrame is empty |
| loc / iloc | get a group of values from the DataFrame by label / position |
| ndim | the number of dimensions of the DataFrame |
| shape | the shape of the DataFrame (number of rows and columns) |
| size | the number of elements in the DataFrame |
| values | the two-dimensional array corresponding to the DataFrame's data |
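These attributes can be checked on a tiny made-up frame:

```python
import pandas as pd

# A small frame to exercise the attributes listed above.
df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]}, index=['x', 'y'])

print(df.shape)           # (2, 2)
print(df.ndim)            # 2
print(df.size)            # 4
print(df.empty)           # False
print(df.at['x', 'b'])    # 3.0 -- one value, by row and column label
print(df.iat[0, 1])       # 3.0 -- the same value, by integer position
print(list(df.columns))   # ['a', 'b']
```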
As for DataFrame methods, the first one to know is info(), which helps us understand basic information about a DataFrame, as shown below.
code:
emp_df.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14 entries, 1359 to 7800
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ename 14 non-null object
1 job 14 non-null object
2 mgr 13 non-null float64
3 sal 14 non-null int64
4 comm 6 non-null float64
5 dno 14 non-null int64
dtypes: float64(2), int64(2), object(2)
memory usage: 1.3+ KB
If you need to view the first or last rows of a DataFrame, you can use the head() or tail() method. The default argument of both methods is 5, i.e. they return the first 5 or last 5 rows of the DataFrame, as shown below.
emp_df.head()
output:
ename job mgr sal comm dno
eno
1359 胡一刀 销售员 3344 1800 200 30
2056 乔峰 分析师 7800 5000 1500 20
3088 李莫愁 设计师 2056 3500 800 20
3211 张无忌 程序员 2056 3200 NaN 20
3233 丘处机 程序员 2056 3400 NaN 20
Retrieve data
Indexing and Slicing
If you want to get a single column of a DataFrame, for example the ename column of emp_df above, you can use either of the following two methods.
emp_df.ename
or
emp_df['ename']
Executing the above code shows that what we get is a Series object. In fact, a DataFrame is the result of combining multiple Series objects.
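A quick way to see this relationship, with made-up data:

```python
import pandas as pd

emp = pd.DataFrame({'ename': ['Alice', 'Bob'], 'sal': [5000, 3000]},
                   index=[2056, 3344])

col = emp['ename']                    # selecting a column yields a Series
print(type(col).__name__)             # Series
print(col.name)                       # ename -- the Series keeps the column name
print(col.index.equals(emp.index))    # True -- it shares the frame's row index
```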
If you want to get a single row of a DataFrame, you can use the integer position or the index we set; for example, to fetch the data of the employee whose number is 2056, the code is as follows.
emp_df.iloc[1]
or
emp_df.loc[2056]
Executing the above code, we find that fetching a single row or column of a DataFrame yields a Series object. Of course, we can also get data from multiple rows or columns through fancy indexing, and the result of fancy indexing is still a DataFrame object.
Get multiple columns:
emp_df[['ename', 'job']]
Get multiple rows:
emp_df.loc[[2056, 7800, 3344]]
If you want to get or modify the data in a single cell of a DataFrame, you need to specify both the row and column index; for example, to get the job title of the employee whose number is 2056, the code is as follows.
emp_df['job'][2056]
or
emp_df.loc[2056]['job']
or
emp_df.loc[2056, 'job']
We recommend the third method, because it performs only one indexing operation. If you want to change this employee's job title to "架构师" (architect), you can use the code below.
emp_df.loc[2056, 'job'] = '架构师'
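The single .loc lookup is preferred for writes not only for speed: the chained form df['job'][2056] = ... may operate on a temporary copy and trigger pandas' SettingWithCopyWarning. A self-contained sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'job': ['分析师', '程序员']}, index=[2056, 3211])

# Read and write a single cell with ONE .loc indexing operation.
print(df.loc[2056, 'job'])      # 分析师
df.loc[2056, 'job'] = '架构师'
print(df.loc[2056, 'job'])      # 架构师
```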
Of course, we can also get multiple rows and columns through slicing; you have probably already thought of this.
emp_df.loc[2056:3344]
output:
ename job mgr sal comm dno
eno
2056 乔峰 分析师 7800.0 5000 1500.0 20
3088 李莫愁 设计师 2056.0 3500 800.0 20
3211 张无忌 程序员 2056.0 3200 NaN 20
3233 丘处机 程序员 2056.0 3400 NaN 20
3244 欧阳锋 程序员 3088.0 3200 NaN 20
3251 张翠山 程序员 2056.0 4000 NaN 20
3344 黄蓉 销售主管 7800.0 3000 800.0 30
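Notice that the slice above runs from label 2056 through label 3344 inclusive. Unlike Python list slicing, a .loc label slice includes both endpoints, while .iloc positional slicing excludes the stop. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30, 40]}, index=[1001, 1002, 1003, 1004])

# .loc slices by label and includes BOTH endpoints...
print(len(df.loc[1002:1004]))   # 3 rows: 1002, 1003 and 1004
# ...while .iloc slices by position and excludes the stop, like Python lists.
print(len(df.iloc[1:3]))        # 2 rows: 1002 and 1003
```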
Data filtering
We mentioned fancy indexing above, and you have probably already thought of Boolean indexing as well. As with ndarray and Series, we can filter the data of a DataFrame with a Boolean index. For example, to filter out of emp_df the employees whose monthly salary exceeds 3500, the code is as follows.
emp_df[emp_df.sal > 3500]
output:
ename job mgr sal comm dno
eno
2056 乔峰 分析师 7800.0 5000 1500.0 20
3251 张翠山 程序员 2056.0 4000 NaN 20
5566 宋远桥 会计师 7800.0 4000 1000.0 10
7800 张三丰 总裁 NaN 9000 1200.0 20
Of course, we can also combine multiple conditions to filter data; for example, to filter out of emp_df the employees whose monthly salary exceeds 3500 and whose department number is 20, the code is as follows.
emp_df[(emp_df.sal > 3500) & (emp_df.dno == 20)]
output:
ename job mgr sal comm dno
eno
2056 乔峰 分析师 7800.0 5000 1500.0 20
3251 张翠山 程序员 2056.0 4000 NaN 20
7800 张三丰 总裁 NaN 9000 1200.0 20
Besides Boolean indexing, the query method of a DataFrame object can also filter data. The parameter of query is a string expression describing the filter condition, which reads more naturally to Python programmers. Next, we use the query method to reproduce the result above; the code is as follows.
emp_df.query('sal > 3500 and dno == 20')
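A query expression can also reference Python variables from the enclosing scope with the @ prefix; here is a sketch with made-up data standing in for emp_df:

```python
import pandas as pd

# Made-up employee data, standing in for emp_df above.
df = pd.DataFrame({'sal': [1800, 5000, 4000], 'dno': [30, 20, 20]})

min_sal = 3500
# Inside the expression, '@name' refers to a Python variable in scope.
result = df.query('sal > @min_sal and dno == 20')
print(result)
```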
Reshape data
Sometimes the raw data we need for analysis does not all come from one place. As in the example above, we read three tables from a relational database and got three DataFrame objects, but real work may require us to bring their data together. For example, emp_df and emp2_df both hold employee data with exactly the same structure, so we can use the concat function provided by pandas to concatenate two or more DataFrames; the code is as follows.
all_emp_df = pd.concat([emp_df, emp2_df])
output:
ename job mgr sal comm dno
eno
1359 胡一刀 销售员 3344.0 1800 200.0 30
2056 乔峰 分析师 7800.0 5000 1500.0 20
3088 李莫愁 设计师 2056.0 3500 800.0 20
3211 张无忌 程序员 2056.0 3200 NaN 20
3233 丘处机 程序员 2056.0 3400 NaN 20
3244 欧阳锋 程序员 3088.0 3200 NaN 20
3251 张翠山 程序员 2056.0 4000 NaN 20
3344 黄蓉 销售主管 7800.0 3000 800.0 30
3577 杨过 会计 5566.0 2200 NaN 10
3588 朱九真 会计 5566.0 2500 NaN 10
4466 苗人凤 销售员 3344.0 2500 NaN 30
5234 郭靖 出纳 5566.0 2000 NaN 10
5566 宋远桥 会计师 7800.0 4000 1000.0 10
7800 张三丰 总裁 NaN 9000 1200.0 20
9800 骆昊 架构师 7800.0 30000 5000.0 20
9900 王小刀 程序员 9800.0 10000 1200.0 20
9700 王大锤 程序员 9800.0 8000 600.0 20
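concat stacks rows and aligns columns by name; it does not check the combined index for duplicate labels unless asked. A small sketch with made-up data:

```python
import pandas as pd

# Two small frames with the same columns, standing in for emp_df and emp2_df.
a = pd.DataFrame({'sal': [1800, 5000]}, index=[1359, 2056])
b = pd.DataFrame({'sal': [30000]}, index=[9800])

all_df = pd.concat([a, b])   # rows are stacked; columns are aligned by name
print(all_df.shape)          # (3, 1)

# verify_integrity=True makes concat raise if the combined index
# would contain duplicate labels.
checked = pd.concat([a, b], verify_integrity=True)
```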
The above code concatenates the two employee DataFrames. Next, we use the merge function to merge the data of the employee table and the department table into one table; the code is as follows.
First, reset the index of all_emp_df with the reset_index method, so that eno is no longer the index but an ordinary column. Setting the inplace parameter of reset_index to True means that the index is reset on all_emp_df directly rather than returning a new, modified object.
all_emp_df.reset_index(inplace=True)
Then merge the data with the merge function; of course, you can also call the merge method of a DataFrame object to achieve the same effect.
pd.merge(dept_df, all_emp_df, how='inner', on='dno')
output:
dno dname dloc eno ename job mgr sal comm
0 10 会计部 北京 3577 杨过 会计 5566.0 2200 NaN
1 10 会计部 北京 3588 朱九真 会计 5566.0 2500 NaN
2 10 会计部 北京 5234 郭靖 出纳 5566.0 2000 NaN
3 10 会计部 北京 5566 宋远桥 会计师 7800.0 4000 1000.0
4 20 研发部 成都 2056 乔峰 架构师 7800.0 5000 1500.0
5 20 研发部 成都 3088 李莫愁 设计师 2056.0 3500 800.0
6 20 研发部 成都 3211 张无忌 程序员 2056.0 3200 NaN
7 20 研发部 成都 3233 丘处机 程序员 2056.0 3400 NaN
8 20 研发部 成都 3244 欧阳锋 程序员 3088.0 3200 NaN
9 20 研发部 成都 3251 张翠山 程序员 2056.0 4000 NaN
10 20 研发部 成都 7800 张三丰 总裁 NaN 9000 1200.0
11 20 研发部 成都 9800 骆昊 架构师 7800.0 30000 5000.0
12 20 研发部 成都 9900 王小刀 程序员 9800.0 10000 1200.0
13 20 研发部 成都 9700 王大锤 程序员 9800.0 8000 600.0
14 30 销售部 重庆 1359 胡一刀 销售员 3344.0 1800 200.0
15 30 销售部 重庆 3344 黄蓉 销售主管 7800.0 3000 800.0
16 30 销售部 重庆 4466 苗人凤 销售员 3344.0 2500 NaN
The first parameter of the merge function is the left table of the merge, and the second is the right table; these two terms will be familiar to anyone with SQL experience. As you might have guessed, merging DataFrame objects is very similar to a table join in a database: the how parameter specifies how the two tables are merged, with the four options left, right, inner, and outer; and the on parameter specifies the column on which the merge is based, equivalent to the join condition of a SQL table join. If the join columns of the left and right tables have different names, you can use the left_on and right_on parameters instead of on to specify them separately.
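Here is a sketch of left_on / right_on with made-up tables whose join columns have different names:

```python
import pandas as pd

# Made-up tables whose join columns have DIFFERENT names.
dept = pd.DataFrame({'dept_no': [10, 20], 'dname': ['会计部', '研发部']})
emp = pd.DataFrame({'ename': ['杨过', '乔峰'], 'dno': [10, 20]})

# When the key column names differ, use left_on / right_on instead of on.
merged = pd.merge(dept, emp, how='inner', left_on='dept_no', right_on='dno')
print(merged)
```

Both key columns (dept_no and dno) are kept in the result; you can drop one afterwards if it is redundant.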
If you slightly modify the above code and change the how parameter to left, think about what the result of the code will be.
pd.merge(dept_df, all_emp_df, how='left', on='dno')
The result has one more row than the previous output, shown below. This is because left means a left outer join: all rows of the left table dept_df are returned, but since there is no employee in all_emp_df whose department number is 40, the corresponding positions are filled with null values.
17 40 运维部 天津 NaN NaN NaN NaN NaN NaN
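The same effect can be reproduced in miniature with made-up tables, where one department has no matching employee:

```python
import pandas as pd

# Made-up tables: department 40 has no matching employee row.
dept = pd.DataFrame({'dno': [10, 20, 40], 'dname': ['会计部', '研发部', '运维部']})
emp = pd.DataFrame({'ename': ['杨过', '乔峰'], 'dno': [10, 20]})

left = pd.merge(dept, emp, how='left', on='dno')
print(left)   # the row for dno == 40 has NaN in the ename column
```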