Ten minutes to master pandas (Part 1) - the API from the official website
1. numpy and pandas
numpy is a matrix computation library and pandas is a data analysis library. Baidu Encyclopedia introduces pandas as follows.
pandas is a tool built on NumPy and created to solve data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to operate on large data sets efficiently. pandas offers many functions and methods that let us handle data quickly and conveniently. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.
2. Data types
numpy | pandas |
---|---|
ndarray: an n-dimensional array (matrix) | Series (similar to a one-dimensional array, or to a set of key-value pairs) |
ndarray is the only container type in numpy, although its elements can have many data types | DataFrame (for example, reading a CSV file yields a DataFrame) |
2. The API from the official website
2.1.Object creation
Because pandas is built on numpy, we import numpy together with pandas
import numpy as np
import pandas as pd
We create a Series from a list of values; pandas assigns a default integer index
s = pd.Series([1,3,5,np.nan,6,8])
print(s)
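To make the printed result easier to read, here is a minimal sketch that inspects the same Series. It only uses attributes that exist on every Series (dtype, index) and the isna() method.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The NaN entry forces a floating-point dtype for the whole Series
print(s.dtype)

# With no index argument, the labels are the integers 0..5
print(list(s.index))

# Count how many entries are missing
print(s.isna().sum())
```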
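Create a DataFrame from a NumPy array: index supplies the row labels (here a range of six dates, as in the official guide) and columns supplies the column labels
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))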
print(df)
Another way to create a DataFrame is to pass a dict whose values can each be converted into a column
df2 = pd.DataFrame({
'A':1.,
'B':pd.Timestamp('20130102'),
'C':pd.Series(1,index=list(range(4)),dtype='float32'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'
})
print(df2)
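Because each value in the dict has its own type, the resulting columns have different dtypes. The sketch below rebuilds the same df2 and prints them; scalars such as 1. and 'foo' are broadcast to all four rows.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
})

# One dtype per column
print(df2.dtypes)

# Four rows (taken from the length of C, D and E) and six columns
print(df2.shape)
```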
View the first rows with head(), or the last rows with tail()
df.head()
df.tail(3)
to_numpy() converts the underlying data to a NumPy ndarray
df.to_numpy()
A DataFrame with columns of different types can also be converted with to_numpy(); the result is then an ndarray of dtype object
df2.to_numpy()
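The difference between the two conversions can be seen on two small frames. This is a sketch with made-up frames (homog and mixed are illustrative names, not from the original).

```python
import pandas as pd

# All columns are floats: the array keeps a numeric dtype
homog = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
print(homog.to_numpy().dtype)

# Floats and strings together: NumPy falls back to dtype object
mixed = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})
print(mixed.to_numpy().dtype)
```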
Use df.describe() to view a quick statistical summary of the DataFrame
df.describe()
The T attribute returns the transpose of the DataFrame
df.T
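With random data the summary is hard to check by eye, so here is a small sketch on a fixed 3x4 frame (the values are an assumption chosen for illustration).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('ABCD'))

# describe() returns one row per statistic and one column per numeric column
summary = df.describe()
print(list(summary.index))

# Column A holds 0, 4, 8, so its mean is 4
print(summary.loc['mean', 'A'])

# Transposing swaps rows and columns: 3x4 becomes 4x3
transposed = df.T
print(transposed.shape)
```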
Use sort_index to sort by labels: axis=0 sorts by the row index and axis=1 sorts by the column labels. Set ascending=False for descending order; ascending=True (the default) sorts in ascending order
df.sort_index(axis=1,ascending=False)
Sort by the values of a column
df.sort_values(by='B')
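The three kinds of sorting are easier to compare on a tiny frame with known values. A minimal sketch, assuming an illustrative frame with string row labels:

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 2], 'A': [9, 8, 7]}, index=['y', 'z', 'x'])

# axis=0: order the rows by their index labels
by_rows = df.sort_index(axis=0)
print(list(by_rows.index))      # ['x', 'y', 'z']

# axis=1 with ascending=False: order the columns by label, descending
by_cols = df.sort_index(axis=1, ascending=False)
print(list(by_cols.columns))    # ['B', 'A']

# sort_values: order the rows by the values in column B
by_val = df.sort_values(by='B')
print(list(by_val.index))       # ['z', 'x', 'y']
```

Each call returns a new DataFrame; df itself is left unchanged.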
2.2.Selection
Note that standard Python / NumPy expressions for selecting data are intuitive, but when the amount of data is large we use the optimized pandas access methods .at, .iat, .loc and .iloc to get the data
Getting
Select a single column directly, which returns a Series
df['A']
Select rows with a slice
df[0:3]
Selection by label
Get the row that matches a label
df.loc[dates[0]]
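.loc also accepts a row selector and a column selector together. The sketch below assumes the date index starts at 20130101 and uses fixed integer values so the results can be checked.

```python
import numpy as np
import pandas as pd

# Assumed setup: six consecutive dates as row labels, values 0..23
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# One row by label, returned as a Series indexed by the column names
row = df.loc[dates[0]]
print(row.tolist())     # [0, 1, 2, 3]

# Label slices include BOTH endpoints, unlike positional slices
sub = df.loc['20130102':'20130104', ['A', 'B']]
print(sub.shape)        # (3, 2)

# A single scalar by row label and column label
val = df.loc[dates[0], 'A']
print(val)              # 0
```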
2.3.Selection by position
Select a row by its integer position
df.iloc[3]
Positions can also be given as a row and column pair, or as slices
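A minimal sketch of the positional forms of .iloc, using an illustrative frame with values 0..23 so each result is easy to verify:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

# A single row by position
print(df.iloc[3].tolist())                          # [12, 13, 14, 15]

# Row and column slices; the end position is excluded, as in Python lists
block = df.iloc[3:5, 0:2]
print(block.to_numpy().tolist())                    # [[12, 13], [16, 17]]

# Lists of positions pick arbitrary rows and columns
picked = df.iloc[[1, 2, 4], [0, 2]]
print(picked.to_numpy().tolist())                   # [[4, 6], [8, 10], [16, 18]]

# A single scalar by row and column position
print(df.iloc[1, 1])                                # 5
```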
2.4.Boolean indexing
Select rows using a boolean condition on a single column
df[df.A>0]
We can filter rows with the isin() method
df2=df.copy()  # copy the DataFrame
df2['E']=['one','one','two','three','four','three']  # insert a new column, one value per row
df2[df2['E'].isin(['two','three'])]  # keep the rows whose E value is in the list
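Boolean filters can be checked more easily on fixed data. The following sketch uses an illustrative four-row frame and also shows how two conditions can be combined with &.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [-1.0, 0.5, 2.0, -3.0],
    'E': ['one', 'two', 'three', 'four'],
})

# Keep rows where column A is positive
positive = df[df['A'] > 0]
print(list(positive.index))     # [1, 2]

# isin() takes a list-like of allowed values
chosen = df[df['E'].isin(['two', 'three'])]
print(list(chosen.index))       # [1, 2]

# Combine conditions with & and wrap each comparison in parentheses
both = df[(df['A'] > 0) & df['E'].isin(['three', 'four'])]
print(list(both.index))         # [2]
```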
2.5.Setting
Setting a new column automatically aligns the data by index
Each column is a Series, so a DataFrame can be thought of as a two-dimensional container of Series
s1 = pd.Series([1,2,3,4],index=pd.date_range('20130102',periods=4))
df['F']=s1
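The alignment is visible when the new Series does not cover every row. In this sketch the DataFrame index is assumed to start at 20130101, one day before s1 starts, so the first and last rows have no matching label.

```python
import pandas as pd

# Assumed setup: six dates starting one day before s1's first date
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame({'A': range(6)}, index=dates)

s1 = pd.Series([1, 2, 3, 4], index=pd.date_range('20130102', periods=4))

# Values are matched by index label, not by position
df['F'] = s1

# Rows without a matching label in s1 receive NaN
print(df['F'].isna().tolist())
```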
Set a value by label
df.at[dates[0],'A']=0
Set a value by position
df.iat[0,1]=0
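Both setters write a single cell in place. A small sketch on an illustrative frame filled with ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 2)), index=['a', 'b', 'c'], columns=['A', 'B'])

# .at addresses the cell by row label and column label
df.at['a', 'A'] = 0

# .iat addresses the cell by integer position: first row, second column
df.iat[0, 1] = 0

print(df.loc['a'].tolist())     # [0.0, 0.0]
```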
2.6.Missing data
pandas primarily uses np.nan to represent missing values, for example values that cannot be computed; they are displayed as NaN
Reindexing lets you change, add or delete the labels on a given axis, and returns a copy of the data
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
Drop any rows that contain missing data
df1.dropna(how='any')
Fill in missing data
df1.fillna(value=5)
Get a boolean mask that marks which values are NaN
pd.isna(df1)
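The three operations on missing data can be compared side by side. This sketch uses a small illustrative frame in which column E has two missing values.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [1.0, np.nan, np.nan]})

# Drop every row that has at least one NaN
dropped = df1.dropna(how='any')
print(len(dropped))             # 1

# Replace every NaN with 5
filled = df1.fillna(value=5)
print(filled['E'].tolist())     # [1.0, 5.0, 5.0]

# Boolean mask with the same shape as df1
mask = pd.isna(df1)
print(mask['E'].tolist())       # [False, True, True]
```

dropna() and fillna() return new objects here, so df1 still contains its NaN values afterwards.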
2.7.Operations
Compute the mean: axis 0 gives the mean of each column, axis 1 gives the mean of each row
df.mean(0)
df.mean(1)
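The direction of the axis argument is easy to mix up, so here is a minimal sketch on an illustrative 3x2 frame with values chosen to make the means obvious.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# axis 0 collapses the rows: one mean per column
col_means = df.mean(0)
print(col_means.tolist())       # [2.0, 20.0]

# axis 1 collapses the columns: one mean per row
row_means = df.mean(1)
print(row_means.tolist())       # [5.5, 11.0, 16.5]
```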
shift(2) moves the values down by two positions, so the first two entries become NaN and the last two values are dropped. Applying it twice in a row shifts by four positions in total
s=pd.Series([1,3,5,np.nan,6,8],index=dates)
s=s.shift(2)
s=s.shift(2)
s
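To see what each shift does, the sketch below keeps the intermediate result under its own name (once and twice are illustrative names) and assumes the date index starts at 20130101.

```python
import numpy as np
import pandas as pd

# Assumed setup: six consecutive dates as the index
dates = pd.date_range('20130101', periods=6)
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates)

# After one shift(2): NaN, NaN, 1, 3, 5, NaN (the original NaN moves too)
once = s.shift(2)

# After a second shift(2): NaN, NaN, NaN, NaN, 1, 3
twice = once.shift(2)

# Only the first two original values survive
print(twice.dropna().tolist())  # [1.0, 3.0]

# Two shifts of 2 give the same result as one shift of 4
print(twice.equals(s.shift(4)))
```

The index itself is not changed by shift(); only the values move.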