Python scientific computing: Pandas

DataFrame is a data structure provided by Pandas. It maps very well to JSON, so converting between the two is convenient.

Series and DataFrame are the two core data structures, representing a one-dimensional sequence and a two-dimensional table structure respectively. Based on these two data structures, data can be imported into and processed by Pandas.

Data structures: Series and DataFrame

A Series is a fixed-length, dictionary-like sequence. It is fixed-length because, when stored, it corresponds to two ndarrays. This is also the biggest difference from a dictionary, since the number of elements in a dictionary is not fixed.

A Series has two basic attributes: index and values. By default the index is an increasing integer sequence 0, 1, 2, ..., but it can also be specified, for example index = ['a', 'b', 'c', 'd'].

import pandas as pd
from pandas import Series, DataFrame

x1 = Series([1, 2, 3, 4])
x2 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
print(x1)
print(x2)

0    1
1    2
2    3
3    4
dtype: int64
a    1
b    2
c    3
d    4
dtype: int64

x1 uses the default index, while x2 has an explicitly specified index. In the printed output, the index is on the left and the corresponding value on the right.

You can also create a Series from a dictionary:

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = Series(d)
print(x3)
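For reference, the printed x3 matches the explicitly indexed Series above:

a    1
b    2
c    3
d    4
dtype: int64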

A DataFrame is a table-like data structure, similar to a database table.

It has both row and column indices, and can be regarded as a dictionary of Series that all share the same index.
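A minimal sketch of that view, built from two Series that share the same row index (the names and scores are illustrative values taken from the example below):

import pandas as pd
from pandas import Series, DataFrame

chinese = Series([66, 96], index=['Zhangfei', 'Guanyu'])
english = Series([75, 65], index=['Zhangfei', 'Guanyu'])
df = DataFrame({'Chineses': chinese, 'English': english})
print(df)  # the two columns are aligned on the shared row index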

For example, suppose we want to output some students' test scores:

import pandas as pd
from pandas import Series, DataFrame

data = {'Chineses': [66, 96, 93, 90, 80], 'English': [75, 65, 85, 88, 90], 'Math': [30, 98, 66, 77, 90]}
df1 = DataFrame(data)
df2 = DataFrame(data, index=['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei'], columns=['English', 'Math', 'Chineses'])
print(df1)
print(df2)

The column index of df2 is ['English', 'Math', 'Chineses'] and its row index is ['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei']. The output of df2 is:

            English  Math  Chineses
Zhangfei         75    30        66
Guanyu           65    98        96
Zhaoyun          85    66        93
Huangzhong       88    77        90
Dianwei          90    90        80

Data import and export

Pandas can import data from xlsx, csv, and other formats, and can also export results to xlsx, csv, and other file types.

import pandas as pd
from pandas import Series, DataFrame

score = DataFrame(pd.read_excel('data.xlsx'))
score.to_excel('data1.xlsx')
print(score)

If you get errors about missing packages while running this, install the openpyxl and xlrd packages with pip install.
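Reading and writing csv files works the same way; a minimal sketch, assuming files named score.csv and score_out.csv:

import pandas as pd

score = pd.read_csv('score.csv')             # import from a csv file (assumed file name)
score.to_csv('score_out.csv', index=False)   # export to csv without writing the row index
print(score)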

Data cleaning

data = {'Chineses': [66, 96, 93, 90, 80], 'English': [75, 65, 85, 88, 90], 'Math': [30, 98, 66, 77, 90]}
df2 = DataFrame(data, index=['Zhangfei', 'Guanyu', 'Zhaoyun', 'Huangzhong', 'Dianwei'], columns=['English', 'Math', 'Chineses'])

The cleaning process usually goes through the following steps:

1. Remove unnecessary rows and columns from the DataFrame

Use the drop() function to delete unwanted rows or columns. For example, to delete the Chinese score column 'Chineses':

df2 = df2.drop(columns=['Chineses'])

To delete the 'Zhangfei' row:

df2 = df2.drop(index=['Zhangfei'])
2. Rename columns so the column names are easier to recognize

Use rename(columns=new_names, inplace=True), for example to rename Chineses -> YuWen and English -> YingYu:

df2.rename(columns={'Chineses': 'YuWen', 'English': 'YingYu'}, inplace=True)
3. Remove duplicate values

drop_duplicates() automatically removes duplicate rows:

df = df.drop_duplicates()  # drop duplicate rows

4. Fix format problems

Changing the data format

In many cases data formats are not standardized. Use the astype function to normalize them, for example to convert the Chineses column to str or to int64:

import numpy as np

df2['Chineses'] = df2['Chineses'].astype('str')            # convert to str
# or: df2['Chineses'] = df2['Chineses'].astype(np.int64)   # convert to int64
Removing spaces in the data

Converting to the str type first makes the data easier to manipulate; then, to remove the spaces around the values, use the strip functions:

# remove spaces on both sides
df2['Chineses'] = df2['Chineses'].map(str.strip)
# remove spaces on the left
df2['Chineses'] = df2['Chineses'].map(str.lstrip)
# remove spaces on the right
df2['Chineses'] = df2['Chineses'].map(str.rstrip)

You can also use strip to remove a specific symbol, for example if the Chineses field contains a dollar sign '$':

df2['Chineses'] = df2['Chineses'].str.strip('$')
Case conversion

To unify names, city names, and so on, you can use case conversion with upper(), lower(), and title():

# all uppercase
df2.columns = df2.columns.str.upper()
# all lowercase
df2.columns = df2.columns.str.lower()
# capitalize the first letter
df2.columns = df2.columns.str.title()
Finding null values

Some fields may contain null values (NaN); use the isnull function to find them.

If you want to see where the NaN values are in a data table df, call df.isnull(); the positions that are null come back as True.

If you want to know which columns contain null values, use df.isnull().any(); columns that contain a null come back as True.
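A minimal sketch of both checks on a small illustrative DataFrame (the column names here are only for demonstration):

import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['Zhangfei', 'Guanyu'], 'score': [66, np.nan]})
print(df.isnull())        # element-wise: True where the value is NaN
print(df.isnull().any())  # per column: True if the column contains any NaN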

Using the apply function for data cleaning

For example, convert the values in the name column to uppercase:

df['name']  =  df['name'].apply(str.upper)

You can also define your own function and use it with apply. Here double_df is defined to return the original value times 2, and is then applied to double the values in the '语文' (Chinese) column of df1:

def double_df(x):
	return 2*x
df1[u'语文'] = df1[u'语文'].apply(double_df)

You can define more complex functions as well. For the DataFrame, add two new columns: 'new1' is m times the sum of the '语文' (Chinese) and '英语' (English) scores, and 'new2' is n times that same sum:

def plus(df,n,m):
	df['new1'] = (df[u'语文']+df[u'英语']) * m
	df['new2'] = (df[u'语文'] + df[u'英语']) * n 
	return df
df1 = df1.apply(plus,axis=1,args=(2,3,))

axis=1 means the function is applied across the columns, i.e. row by row, while axis=0 applies it down each column. args passes the extra parameters, here n=2 and m=3, which the plus function uses to generate the new df.
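A minimal sketch of the axis parameter on a tiny throwaway DataFrame (the column names are only illustrative):

import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
print(demo.apply(sum, axis=0))  # down each column: a -> 3,  b -> 30
print(demo.apply(sum, axis=1))  # across each row:  0 -> 11, 1 -> 22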

Statistics

With the describe() function we can get a comprehensive overview of the data:

df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
print(df1.describe())

          data1
count  5.000000
mean   2.000000
std    1.581139
min    0.000000
25%    1.000000
50%    2.000000
75%    3.000000
max    4.000000

Merging data tables

When data comes from multiple channels, several source tables often need to be merged into one:

# create two DataFrames
df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'A', 'B', 'C'], 'data2': range(5)})
1. Join on a specified column
# join on the name column
df3 = pd.merge(df1, df2, on='name')   # returns the rows whose name value appears in both df1 and df2
2. Inner join
# inner join is merge's default and keeps the intersection of the keys; the key shared by df1 and df2 is name, so the join is on the name field
df3 = pd.merge(df1, df2, how='inner')
3. Left join
# a left join takes the first DataFrame as the base and the second DataFrame as the supplement
df3 = pd.merge(df1, df2, how='left')   # data1 and name come from the first table; data2 is NaN except for the matching rows

df3:

   data1      name  data2
0      0  Zhangfei    0.0
1      1    Guanyu    1.0
2      2         a    NaN
3      3         b    NaN
4      4         c    NaN

4. Right join
df3 = pd.merge(df1, df2,how='right') 

df3:

   data1      name  data2
0    0.0  Zhangfei      0
1    1.0    Guanyu      1
2    NaN         A      2
3    NaN         B      3
4    NaN         C      4

5. Outer join
# equivalent to taking the union of the two DataFrames
df3 = pd.merge(df1,df2,how = 'outer')

df3:

   data1      name  data2
0    0.0  Zhangfei    0.0
1    1.0    Guanyu    1.0
2    2.0         a    NaN
3    3.0         b    NaN
4    4.0         c    NaN
5    NaN         A    2.0
6    NaN         B    3.0
7    NaN         C    4.0

Using Pandas the SQL way

SQL statements can be used directly in Python to operate on Pandas DataFrames.

The main function in pandasql is sqldf. It accepts two parameters: a SQL query string and a set of environment variables, globals() or locals(). With it you can query a DataFrame directly with SQL in Python:

import pandas as pd
from pandas import DataFrame
from pandasql import sqldf, load_meat, load_births

df1 = DataFrame({'name': ['Zhangfei', 'Guanyu', 'a', 'b', 'c'], 'data1': range(5)})
pysqldf = lambda sql: sqldf(sql, globals())
sql = "select * from df1 where name = 'Zhangfei'"
print(pysqldf(sql))

lambda is used to define an anonymous function:

lambda argument_list: expression

argument_list is the parameter list, and expression is an expression written in terms of those parameters; the lambda returns the value obtained by evaluating the expression.
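A trivial illustration of the syntax:

add_one = lambda x: x + 1
print(add_one(3))  # prints 4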

pysqldf = lambda sql: sqldf(sql, globals())

The input parameter is the SQL string, and the return value is the result of running sqldf on that SQL. globals() is also passed to sqldf because the SQL references the global variable df1.

Practice: create a DataFrame from the data below and clean it, while adding a 'total' (总分) column that sums each person's scores in the three subjects.

Name         Chinese  English  Math
Zhang Fei    66       65
Guan Yu      95       85       98
Zhao Yun     95       92       96
Huang Zhong  90       88       77
Dian Wei     80       90       90
Dian Wei     80       90       90
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pandas as pd

data = {'Chinese': [66, 95, 93, 90, 80, 80], 'English': [65, 85, 92, 88, 90, 90],
        'Math': [None, 98, 96, 77, 90, 90]}
df = pd.DataFrame(data, index=['张飞', '关羽', '赵云', '黄忠', '典韦', '典韦'],
                  columns=['English', 'Math', 'Chinese'])
# drop duplicate rows
df = df.drop_duplicates()
# reorder the columns
cols = ['Chinese', 'English', 'Math']
df = df.filter(cols, axis=1)
# rename the columns to Chinese
df.rename(columns={'Chinese': '语文', 'English': '英语',
                   'Math': '数学'}, inplace=True)


def total_score(df):
    df['总分'] = df['语文'] + df['英语'] + df['数学']
    return df


# compute the total score with the apply method
df = df.apply(total_score, axis=1)
# alternatively, the sum could be computed directly:
# df['总分'] = df['语文'] + df['英语'] + df['数学']
# sort by total score, descending; there are still missing values at this point
df.sort_values(['总分'], ascending=[False], inplace=True)
# print the score report; Zhang Fei has a null value
print(df.isnull().sum())
print(df.describe())
print(df)

# fill Zhang Fei's missing value with the mean of the Math scores
df['数学'].fillna(df['数学'].mean(), inplace=True)
# recompute the totals and print the score report again
df = df.apply(total_score, axis=1)
print(df.isnull().sum())
print(df.describe())
print(df)