Python scientific computing: Pandas
DataFrame, the core data structure provided by Pandas, maps to JSON very naturally, so conversion between the two is convenient.
Series and DataFrame are the two core data structures, representing a one-dimensional sequence and a two-dimensional table respectively. Based on these two structures, data can be imported into and manipulated with Pandas easily.
Data structures: Series and DataFrame
A Series is like a fixed-length, ordered dictionary: when stored, it corresponds to two ndarrays (values and index). This is its biggest difference from a dict, whose number of elements is not fixed.
A Series has two basic attributes: index and values. By default the index is the increasing integer sequence 0, 1, 2, ...; you can also specify one, e.g. index = ['a', 'b', 'c', 'd'].
import pandas as pd
from pandas import Series, DataFrame
x1 = Series([1, 2, 3, 4])
x2 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(x1)
print(x2)
0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64
x1 uses the default index, while x2 uses the index we specified; in the printed output, each line shows the index first, followed by the value.
Alternatively, you can create a Series from a dictionary:
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = Series(d)
print(x3)
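As a quick sketch of the index and values attributes described above (the variable names here are just illustrative):

```python
import pandas as pd
from pandas import Series

# build a Series from a dict; the dict keys become the index
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = Series(d)

# index and values are the two basic attributes of a Series
print(list(x3.index))   # ['a', 'b', 'c', 'd']
print(list(x3.values))  # [1, 2, 3, 4]
```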
A DataFrame is a table-like data structure, similar to a database table.
It has both row and column indices, and can be viewed as a dictionary of Series that share the same index.
Suppose we want to print some students' test scores:
import pandas as pd
from pandas import Series, DataFrame
data = {'Chineses': [66, 96, 93, 90, 80], 'English': [75, 65, 85, 88, 90], 'Math': [30, 98, 66, 77, 90]}
df1 = DataFrame(data)
df2 = DataFrame(data, index=['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei'], columns=['English', 'Math', 'Chineses'])
print(df1)
print(df2)
The column index of df2 is ['English', 'Math', 'Chineses'] and the row index is ['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei'], so the output of df2 is:
English Math Chineses
Zhangfei 75 30 66
Guanyu 65 98 96
Zhaoyun 85 66 93
Huangzhong 88 77 90
Dianwei 90 90 80
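Since a DataFrame can be viewed as a dictionary of Series sharing one index, selecting a single column gives back a Series; a small sketch using the scores above:

```python
import pandas as pd
from pandas import DataFrame

data = {'Chineses': [66, 96, 93, 90, 80],
        'English': [75, 65, 85, 88, 90],
        'Math': [30, 98, 66, 77, 90]}
df2 = DataFrame(data,
                index=['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei'],
                columns=['English', 'Math', 'Chineses'])

# a single column is a Series that keeps the row index
english = df2['English']
print(type(english).__name__)  # Series
print(english['Guanyu'])       # 65
```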
Data import and export
Pandas allows importing data from xlsx or csv files, and exporting a DataFrame to xlsx, csv, and other formats.
import pandas as pd
from pandas import Series, DataFrame
score = DataFrame(pd.read_excel('data.xlsx'))
score.to_excel('data1.xlsx')
print(score)
If the xlrd or openpyxl packages are missing when running this, install them with pip install xlrd openpyxl.
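The csv round trip works the same way; a minimal sketch (the file name scores_demo.csv is made up here):

```python
import pandas as pd
from pandas import DataFrame

score = DataFrame({'Chineses': [66, 96], 'English': [75, 65]},
                  index=['Zhangfei', 'Guanyu'])
score.to_csv('scores_demo.csv')                       # export to csv
loaded = pd.read_csv('scores_demo.csv', index_col=0)  # import it back
print(loaded.loc['Guanyu', 'English'])                # 65
```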
Data cleaning
data = {'Chineses': [66, 96, 93, 90, 80], 'English': [75, 65, 85, 88, 90], 'Math': [30, 98, 66, 77, 90]}
df2 = DataFrame(data, index=['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei'], columns=['English', 'Math', 'Chineses'])
The cleaning process generally involves the following steps:
1. Delete unnecessary rows or columns from the DataFrame
Use the drop() function to delete unwanted rows or columns, e.g. to delete the Chineses column:
df2 = df2.drop(columns=['Chineses'])
To delete the Zhangfei row:
df2 = df2.drop(index=['Zhangfei'])
2. Rename columns to make the column names easier to recognize
Use rename(columns=new_names, inplace=True), e.g. Chineses -> YuWen, English -> YingYu:
df2.rename(columns={'Chineses': 'YuWen', 'English': 'YingYu'}, inplace=True)
3. Remove duplicate values
drop_duplicates() automatically removes duplicate rows:
df = df.drop_duplicates()  # drop duplicate rows
4. Format problems
Changing the data format
In many cases the data format is non-standard; use the astype function to normalize it, e.g. convert the Chineses column to str or to int64 (np.int64 requires import numpy as np):
df2['Chineses'].astype('str')
df2['Chineses'].astype(np.int64)
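Note that astype returns a new Series rather than modifying the column in place, so assign the result back; a small sketch with made-up string scores:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df2 = DataFrame({'Chineses': ['66', '96']})
# astype returns a new Series; assigning it back makes the change stick
df2['Chineses'] = df2['Chineses'].astype(np.int64)
print(df2['Chineses'].dtype)  # int64
```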
Whitespace in the data
First convert the data to str so it is easy to manipulate, then delete the spaces around the values with the strip functions:
# strip spaces on both sides
df2['Chineses'] = df2['Chineses'].map(str.strip)
# strip spaces on the left
df2['Chineses'] = df2['Chineses'].map(str.lstrip)
# strip spaces on the right
df2['Chineses'] = df2['Chineses'].map(str.rstrip)
strip can also delete a specific symbol, e.g. if the Chineses field contains a dollar sign:
df2['Chineses'] = df2['Chineses'].str.strip('$')
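A sketch of stripping whitespace and a symbol together (the '$' values are invented for illustration):

```python
import pandas as pd
from pandas import DataFrame

df2 = DataFrame({'Chineses': ['$66', ' $96 ']})
# convert to str first, then strip the spaces, then the dollar sign
df2['Chineses'] = df2['Chineses'].astype(str).str.strip().str.strip('$')
print(list(df2['Chineses']))  # ['66', '96']
```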
Case conversion
To unify names, city names, etc., you can convert case with upper(), lower(), and title():
# all uppercase
df2.columns = df2.columns.str.upper()
# all lowercase
df2.columns = df2.columns.str.lower()
# capitalize the first letter
df2.columns = df2.columns.str.title()
Finding null values
Some fields may contain null values (NaN); look for them with the isnull function.
To see where the NaN values are, call df.isnull() on the table df; the result is True wherever a value is null.
To find out which columns contain nulls, use df.isnull().any(); the result is True for every column that has a null.
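A short sketch of both isnull checks, with one NaN planted in the Math column:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'Math': [30, np.nan, 66], 'English': [75, 65, 85]})

# element-wise: True wherever a value is NaN
print(df.isnull())

# column-wise: True for every column that contains a NaN
has_null = df.isnull().any()
print(bool(has_null['Math']))     # True
print(bool(has_null['English']))  # False
```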
Using the apply function for data cleaning
To convert the values of the name column to uppercase:
df['name'] = df['name'].apply(str.upper)
You can also define your own function and use it with apply. Here double_df multiplies the original value by 2, and is then applied to the '语文' column of df1:
def double_df(x):
    return 2 * x
df1[u'语文'] = df1[u'语文'].apply(double_df)
You can define more complex functions too. For a DataFrame, add two columns: 'new1' is m times the sum of the '语文' and '英语' scores, and 'new2' is n times that sum:
def plus(df, n, m):
    df['new1'] = (df[u'语文'] + df[u'英语']) * m
    df['new2'] = (df[u'语文'] + df[u'英语']) * n
    return df
df1 = df1.apply(plus, axis=1, args=(2, 3))
axis=1 means the function is applied row by row (along the column axis); axis=0 applies it column by column. args passes the extra arguments, i.e. n=2 and m=3, which are used inside plus to generate the new df.
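The plus example above can be checked with a tiny two-row DataFrame (the data values are made up for the sketch):

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({u'语文': [66, 95], u'英语': [65, 85]})

def plus(row, n, m):
    # with axis=1, each row arrives here as a Series
    row['new1'] = (row[u'语文'] + row[u'英语']) * m
    row['new2'] = (row[u'语文'] + row[u'英语']) * n
    return row

df1 = df1.apply(plus, axis=1, args=(2, 3))
print(list(df1['new1']))  # [393, 540]  (sum of the two scores, times 3)
print(list(df1['new2']))  # [262, 360]  (sum of the two scores, times 2)
```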
Statistics
With the describe() function, we can get a comprehensive overview of the data:
df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
print(df1.describe())
count 5.000000
mean 2.000000
std 1.581139
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 4.000000
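The numbers above can also be read off programmatically, since describe() returns a DataFrame:

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
stats = df1.describe()  # summarizes the numeric columns only
print(stats['data1']['count'])  # 5.0
print(stats['data1']['mean'])   # 2.0
print(stats['data1']['max'])    # 4.0
```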
Merging data tables
Data often comes from multiple channels, so multiple source tables need to be combined:
# create two DataFrames
df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'A', 'B', 'C'], 'data2': range(5)})
1. Join on a specified column
# join on the name column
df3 = pd.merge(df1, df2, on='name')  # returns the rows whose name value appears in both df1 and df2
2. inner join
# inner join is merge's default; it takes the intersection of the keys. Here the key shared by df1 and df2 is name, so the join is on the name field
df3 = pd.merge(df1, df2, how='inner')
3. left join
# a left join keeps the first DataFrame as the base, with the second DataFrame as a supplement
df3 = pd.merge(df1, df2, how='left')  # data1 and name come from the first table; data2 is NaN except where the names match
df3:
   data1      name  data2
0      0  Zhangfei    0.0
1      1    Guanyu    1.0
2      2         a    NaN
3      3         b    NaN
4      4         c    NaN
4. right join
df3 = pd.merge(df1, df2, how='right')
df3:
   data1      name  data2
0    0.0  Zhangfei      0
1    1.0    Guanyu      1
2    NaN         A      2
3    NaN         B      3
4    NaN         C      4
5. outer join
# equivalent to taking the union of the two DataFrames
df3 = pd.merge(df1, df2, how='outer')
df3:
   data1      name  data2
0    0.0  Zhangfei    0.0
1    1.0    Guanyu    1.0
2    2.0         a    NaN
3    3.0         b    NaN
4    4.0         c    NaN
5    NaN         A    2.0
6    NaN         B    3.0
7    NaN         C    4.0
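The four join types above can be sanity-checked by their row counts:

```python
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'A', 'B', 'C'], 'data2': range(5)})

inner = pd.merge(df1, df2, how='inner')  # only the 2 shared names
left = pd.merge(df1, df2, how='left')    # all 5 rows of df1
right = pd.merge(df1, df2, how='right')  # all 5 rows of df2
outer = pd.merge(df1, df2, how='outer')  # union of the names: 8 rows
print(len(inner), len(left), len(right), len(outer))  # 2 5 5 8
```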
How to use Pandas in a SQL way
SQL statements can be used directly in Python to operate on pandas DataFrames.
The main function of pandasql is sqldf, which accepts two parameters: a SQL query and a set of environment variables, globals() or locals(). With it you can query a DataFrame with SQL directly in Python:
import pandas as pd
from pandas import DataFrame
from pandasql import sqldf, load_meat, load_births
df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
pysqldf = lambda sql: sqldf(sql, globals())
sql = "select * from df1 where name = 'Zhangfei'"
print(pysqldf(sql))
lambda is used to define an anonymous function:
lambda argument_list: expression
argument_list is the parameter list, and expression is an expression using those parameters; the function returns the value of the expression.
pysqldf = lambda sql: sqldf(sql, globals())
Here the input parameter is sql, and the return value is the result of sqldf(sql, globals()); globals() is passed in because the SQL statement refers to the global variable df1.
Practice: using the data below, create a DataFrame, clean it, and add a '总分' (total) column holding the sum of each person's three subject scores:

Name | Chinese | English | Math |
---|---|---|---|
Zhang Fei | 66 | 65 | |
Guan Yu | 95 | 85 | 98 |
Zhao Yun | 95 | 92 | 96 |
Huang Zhong | 90 | 88 | 77 |
Dian Wei | 80 | 90 | 90 |
Dian Wei | 80 | 90 | 90 |
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd
data = {'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['English', 'Math', 'Chinese'])
# remove duplicate rows
df = df.drop_duplicates()
# reorder the columns
cols = ['Chinese', 'English', 'Math']
df = df.filter(cols, axis=1)
# change the column names to Chinese
df.rename(columns={'Chinese': '语文', 'English': '英语',
                   'Math': '数学'}, inplace=True)
def total_score(df):
    df['总分'] = df['语文'] + df['英语'] + df['数学']
    return df
# sum the scores with the apply method
df = df.apply(total_score, axis=1)
# or sum them directly like this
# df['总分'] = df['语文'] + df['英语'] + df['数学']
# sort by total score from high to low; a missing value is still present at this point
df.sort_values(['总分'], ascending=[False], inplace=True)
# print the score sheet; Zhang Fei has a null value
print(df.isnull().sum())
print(df.describe())
print(df)
# fill Zhang Fei's missing value with the mean of the Math scores
df['数学'].fillna(df['数学'].mean(), inplace=True)
# recompute the totals and print the score sheet again
df = df.apply(total_score, axis=1)
print(df.isnull().sum())
print(df.describe())
print(df)