The previous article, "Data analysis tool pandas tutorial series (1): Starting with Series", covered Series, one of the basic pandas data structures, in detail. Today we turn to the other basic data structure, DataFrame.
A DataFrame is a tabular data structure made up of an ordered set of columns; it can be viewed as a dictionary of Series. For example:
/ | name | sex | course | grade |
---|---|---|---|---|
0 | Bob | male | math | 99 |
1 | Alice | female | english | 92 |
2 | Joe | male | chinese | 89 |
3 | Bob | male | chinese | 88 |
4 | Alice | female | chinese | 95 |
5 | Joe | male | english | 93 |
6 | Bob | male | english | 95 |
7 | Alice | female | math | 79 |
8 | Joe | male | math | 89 |
Common ways to create a DataFrame
Like a Series, a DataFrame has an index. The difference is that a Series has only one column besides its index, whereas a DataFrame usually has many. The DataFrame above has four, each with a name: name, sex, course, grade. These names can be used to look up a column, so they are called column names (a column index). For that reason, when talking about a DataFrame I prefer to distinguish the row index from the column index.
There are in fact N ways to create a DataFrame, and there is no need to master every one of them, since only two or three are commonly used. I do not intend to go through them all, which would look like showing off. Based on my own understanding, I group them into two categories, creating by column and creating by row, and cover the most representative method in each.
Both approaches below create the DataFrame from the example above.
Creating by column
import pandas as pd
# No row index is set, so the default one is used
df = pd.DataFrame({'name':['Bob','Alice','Joe']*3,
'sex':['male','female','male']*3,
'course':['math','english','chinese','chinese','chinese','english','english','math','math'],
'grade':[99,92,89,88,95,93,95,79,89]})
print(df)
Creating by row
data = [['Bob','male','math',99],
['Alice','female','english',92],
['Joe','male','chinese',89],
['Bob','male','chinese',88],
['Alice','female','chinese',95],
['Joe','male','english',93],
['Bob','male','english',95],
['Alice','female','math',79],
['Joe','male','math',89]]
columns = ['name','sex','course','grade']
df = pd.DataFrame(data=data,columns=columns)
print(df)
The printed result is the same as the table above.
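As one more row-oriented variant, here is a minimal sketch in which each row is a dictionary keyed by column name, so the column names do not need to be passed separately. The names `records` and `df_records` are my own and only for illustration.

```python
import pandas as pd

# Each dict is one row; the keys become the column names.
records = [
    {'name': 'Bob', 'sex': 'male', 'course': 'math', 'grade': 99},
    {'name': 'Alice', 'sex': 'female', 'course': 'english', 'grade': 92},
    {'name': 'Joe', 'sex': 'male', 'course': 'chinese', 'grade': 89},
]
df_records = pd.DataFrame(records)
print(df_records)
```

This form tends to be convenient when the rows already arrive as dictionaries, for example from parsed JSON.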
Basic attributes and overall description of a DataFrame
Attribute | Meaning |
---|---|
df.shape | Number of rows and columns of df |
df.index | Row index of df |
df.columns | Column index (column names) of df |
df.dtypes | Data type of each column of df |
df.values | Values of df, as a two-dimensional ndarray |
print(df.shape,'\n')
print(df.index,'\n')
print(df.columns,'\n')
print(df.dtypes,'\n')
print(df.values,'\n')
Note the data type of each column: because pandas infers data types on its own, grade is int64 rather than object.
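To illustrate the inference, here is a small sketch with a hypothetical frame `demo` (not part of the example above). A column of Python ints is inferred as int64, while the same numbers stored as strings are not parsed automatically and stay a non-numeric dtype until converted explicitly.

```python
import pandas as pd

demo = pd.DataFrame({
    'grade': [99, 92, 89],             # Python ints
    'grade_text': ['99', '92', '89'],  # the same numbers stored as strings
})
print(demo.dtypes)

# Strings are not converted on their own; cast explicitly when numbers are needed.
demo['grade_num'] = demo['grade_text'].astype('int64')
print(demo.dtypes)
```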
Function | Effect |
---|---|
df.head() | Shows the first n rows; the default is 5 |
df.tail() | Shows the last n rows; the default is 5 |
df.info() | Prints an overview of the whole DataFrame: number of rows and columns, column index, number of non-null values per column, and so on |
df.describe() | Shows summary statistics: count, mean, standard deviation, minimum, quartiles, maximum |
print(df.head(),'\n')
print(df.tail(3),'\n')
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print(df.describe(),'\n')
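By default describe() summarizes only the numeric columns. As a small sketch on a hypothetical frame `scores`, passing include='all' also summarizes text columns (count, number of unique values, most frequent value and its frequency).

```python
import pandas as pd

scores = pd.DataFrame({'name': ['Bob', 'Alice', 'Bob'],
                       'grade': [99, 92, 88]})

numeric_summary = scores.describe()            # numeric columns only
full_summary = scores.describe(include='all')  # text columns are summarized too
print(numeric_summary)
print(full_summary)
```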
Querying a DataFrame
loc[] and iloc[]
Readers of "Data analysis tool pandas tutorial series (1): Starting with Series" should already know that the i in iloc[] stands for integer, which means iloc[] can only query by position, while loc[] queries by row and column index (that is, by label). Besides querying, both can also modify data, and loc[] can additionally add new rows.
To make the difference visible, we first change the row index from 0-8 to 1-9 (both written here as closed intervals, whereas range() is closed on the left and open on the right):
df.index = range(1,10)
Suppose we want to complete a task: change Bob's math grade to 100.
With loc[], this is done as follows:
df.loc[1,'grade'] = 100
print(df,'\n')
With iloc[], the corresponding code is:
df.iloc[0,3] = 100
print(df,'\n')
iloc[] works purely by position and has nothing to do with the row and column index labels. That is why I changed the row index beforehand: it makes it easy to compare the first argument of iloc[] with that of loc[].
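The label/position distinction can be sketched on a small hypothetical frame `small` whose row labels start at 1. The same first row is reached with label 1 through loc[] and with position 0 through iloc[], and each accessor fails differently when given a value that only makes sense for the other.

```python
import pandas as pd

small = pd.DataFrame({'name': ['Bob', 'Alice', 'Joe'],
                      'grade': [99, 92, 89]})
small.index = range(1, 4)   # row labels are now 1, 2, 3

print(small.loc[1, 'name'])   # label 1    -> first row
print(small.iloc[0, 0])       # position 0 -> the same first row

# Label 0 no longer exists, so loc[] raises KeyError ...
try:
    small.loc[0, 'name']
except KeyError:
    print('loc[0] fails: there is no row labelled 0')

# ... while position 3 is out of range for iloc[], which raises IndexError.
try:
    small.iloc[3, 0]
except IndexError:
    print('iloc[3] fails: there are only 3 rows')
```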
The two queries above each address a single cell. In fact, loc[] and iloc[] also support block queries; sample code follows:
print(df.loc[[1,3,9],['name','grade']],'\n')
print(df.iloc[[0,2,8],[0,3]])
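Two related forms are worth a quick sketch, again on a hypothetical frame `scores`. Slices behave differently in the two accessors (loc[] includes the end label, iloc[] excludes the end position), and a boolean mask lets the earlier "Bob's math grade" task be done without knowing the row label in advance.

```python
import pandas as pd

scores = pd.DataFrame({'name': ['Bob', 'Alice', 'Joe', 'Bob'],
                       'course': ['math', 'english', 'chinese', 'chinese'],
                       'grade': [99, 92, 89, 88]},
                      index=range(1, 5))

by_label = scores.loc[1:3, ['name', 'grade']]   # labels 1, 2 and 3 (end included)
by_position = scores.iloc[0:3, [0, 2]]          # positions 0, 1, 2 (end excluded)
print(by_label.equals(by_position))

# A boolean mask selects rows by content rather than by label or position.
is_bob_math = (scores['name'] == 'Bob') & (scores['course'] == 'math')
scores.loc[is_bob_math, 'grade'] = 100
print(scores)
```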
Iterating over rows
for index,row in df.iterrows():
    print(index,': ',row['name'],row['sex'],row['course'],row['grade'])
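As an alternative sketch, itertuples() yields one namedtuple per row, with the row label available as `Index` and the columns as attributes; the pandas documentation describes it as generally faster than iterrows(). The frame `people` and the list `lines` are hypothetical names for this example only.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Bob', 'Alice'],
                       'course': ['math', 'english'],
                       'grade': [99, 92]})

lines = []
# Each row is a namedtuple: row.Index is the row label, the rest are the columns.
for row in people.itertuples():
    lines.append(f'{row.Index}: {row.name} {row.course} {row.grade}')
print('\n'.join(lines))
```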
Relationship with Series
You can create a DataFrame from Series:
names = pd.Series(['Bob','Alice','Joe']*3)
sexs = pd.Series(['male','female','male']*3)
courses = pd.Series(['math','english','chinese','chinese','chinese','english','english','math','math'])
grades = pd.Series([99,92,89,88,95,93,95,79,89])
df = pd.DataFrame({'name':names,'sex':sexs,'course':courses,'grade':grades})
Printing the result gives the same DataFrame as at the beginning of the article. This also counts as creating by column, but it is less commonly used than the approach described above.
A Series can be obtained from a DataFrame with df[column name]:
print(df['name'],type(df['name']),'\n')
Consequently, all Series operations apply to df['name']:
print(df['name'].values,type(df['name'].values),'\n')
print(df['name'].unique(),type(df['name'].unique()),'\n')
Here I need to correct a mistake in my previous article: series.values and series.unique() do not return a list. Although the printed result looks like a list (because the __str__() function is overridden), it is actually an ndarray, an array object similar to a list, and you can convert it to a list with .tolist().
print(df['name'].values.tolist(),type(df['name'].values.tolist()),'\n')
print(df['name'].unique().tolist(),type(df['name'].unique().tolist()),'\n')
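The difference is more than cosmetic. As a quick sketch with a hypothetical Series `grades`, the + operator means element-wise arithmetic on an ndarray but concatenation on a list.

```python
import pandas as pd

grades = pd.Series([99, 92, 89])

arr = grades.values     # numpy.ndarray
lst = grades.tolist()   # plain Python list

print(type(arr).__name__, type(lst).__name__)
print(arr + 1)          # element-wise addition on the array
print(lst + [1])        # list concatenation, not arithmetic
```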
One important Series operation was left out last time: apply(), which processes the data in a column. It accepts a lambda expression as its argument, or the name of an already-defined function (without the trailing ()). For example, we first subtract 10 points from every person's grade in every course and then add the 10 points back:
# A lambda expression suits fairly simple processing
df['grade'] = df['grade'].apply(lambda x: x - 10)
print(df,'\n')
# A named function suits more complex processing; this one is only an example
def operate(x):
    return x + 10
df['grade'] = df['grade'].apply(operate)
print(df)
Note that apply() returns its result rather than changing the column in place, and that the result must be assigned to df['grade'] rather than to df; otherwise df would be replaced by the grade column alone.
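For simple arithmetic like this, apply() is not strictly needed. The sketch below, on a hypothetical Series `grades`, shows that plain vectorized arithmetic gives the same result, and uses clip() as one way to keep a bonus from pushing a grade above 100 (that cap is my own assumption, not something the example above requires).

```python
import pandas as pd

grades = pd.Series([99, 92, 89])

via_apply = grades.apply(lambda x: x - 10)
vectorized = grades - 10   # same result, without calling a Python function per element
print(via_apply.equals(vectorized))

# clip() bounds the values, here so that a 10-point bonus never exceeds 100.
with_bonus = (grades + 10).clip(upper=100)
print(with_bonus.tolist())
```

apply() remains the better fit when the per-element logic is too involved to express with vectorized operations.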
Adding and deleting rows or columns
There are countless ways to add or delete rows and columns; here I only introduce a few commonly used ones.
Deleting rows or columns is done with the drop() function:
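# The first argument of drop() is the row label(s) or column label(s) to remove
# axis=0 deletes rows
df.drop([0,7,8],axis=0,inplace=True) # delete everyone's math grades
# axis=1 deletes columns
df.drop(['sex'],axis=1,inplace=True) # delete everyone's sex information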
print(df)
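A sketch of two variations, on a hypothetical frame `scores`: without inplace=True, drop() returns a new DataFrame and leaves the original untouched, and rows can also be removed by filtering on content, which avoids looking up which row labels belong to math.

```python
import pandas as pd

scores = pd.DataFrame({'name': ['Bob', 'Alice', 'Joe'],
                       'sex': ['male', 'female', 'male'],
                       'course': ['math', 'english', 'math'],
                       'grade': [99, 92, 89]})

# Without inplace=True the original frame is not modified.
no_sex = scores.drop(columns=['sex'])

# Keep only the rows whose course is not math.
no_math = no_sex[no_sex['course'] != 'math']
print(no_math)
print(scores.shape)
```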
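As with a Series, a row can be added or modified with loc[]; at[] does the same for a single cell, and older pandas versions also offered set_value(). If the row index already exists, the row is modified; otherwise a new row is added. Each of the following statements changes Alice's english grade to 100:
# The value does not have to be a list; any iterable works
df.loc[1] = ['Alice', 'english', 100]
# at[] addresses a single cell by row label and column label
df.at[1, 'grade'] = 100
# set_value() was deprecated and has been removed in pandas 1.0 and later; use at[] or loc[] instead
# df.set_value(1, df.columns, ['Alice', 'english', 100], takeable=False)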
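The modify-or-add behaviour can be sketched on a small hypothetical frame `scores`; the student 'Tom' is invented purely for this example. Assigning to an existing label overwrites that row, assigning to a new label appends one, and at[] sets a single cell.

```python
import pandas as pd

scores = pd.DataFrame({'name': ['Alice', 'Joe'],
                       'course': ['english', 'chinese'],
                       'grade': [92, 89]},
                      index=[1, 2])

scores.loc[1] = ['Alice', 'english', 100]   # label 1 exists -> the row is modified
scores.loc[3] = ['Tom', 'math', 85]         # label 3 is new -> a row is added
scores.at[2, 'grade'] = 90                  # at[] sets one cell by row and column label
print(scores)
```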
A column can be added with df[column name] = iterable or df.loc[:, column name] = iterable. For example, let us add a grade level: below 60 is a fail, 60-89 is good, and 90-100 is excellent:
level = []
for grade in df['grade'].values.tolist():
    if grade < 60:
        level.append('fail')
    elif grade < 90:
        level.append('good')
    else:
        level.append('excellent')
df['level'] = level
print(df)
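The same binning can be sketched without an explicit loop using pd.cut(). This is only one possible alternative, shown on a hypothetical Series `grades`; the English labels and the upper bin edge of 101 (so that 100 falls inside the last bin) are my own choices.

```python
import pandas as pd

grades = pd.Series([55, 60, 89, 90, 100])

# right=False closes each bin on the left: [0, 60), [60, 90), [90, 101).
levels = pd.cut(grades, bins=[0, 60, 90, 101], right=False,
                labels=['fail', 'good', 'excellent'])
print(levels.tolist())
```

The result is a categorical Series that can be assigned to a new column in the same way as the list above.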
That wraps up the two basic data structures in pandas. The next article in this pandas series will cover the various functions for reading and writing files.