Pandas数据结构以及学习总结
Data Structure
- Series
- DataFrame
- Panel
高纬度的数据结构包含低维度的数据结构
其中,Series的数据结构是size mutable大小可变的,而Series是size immutable大小不可变的
Series
常见的参数命令:
默认情况下,narray的索引是[0,1,2,...n]
import pandas as pd
import numpy as np
s = pd.Series()
print (s)
data = np.array(['a','b','c','d'])
s = pd.Series(data)
如果要定制data的使用规则,则可以使用如下方法:
-
- Create a Series from dict
通过字典构建Series
data= {'a':0,'b':1,'c':2}
s = pd.Series(data)
print (s)
s = pd.Series(data,index=['a','b','e'])
print(s)
-
- Create a Series from Scalar
import pandas as pd
import numpy as np
s = pd.Series(5,index=[0,1,2,3])
Accessing data from series with position
import pandas as pd
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
print (s[0])
Retrieve data with label
import pandas as pd
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
print(s[['a','b']])
Exercises:
1.Write a Pandas program to select the 'name' and 'score' columns from the following DataFrame and order by score (highest to lowest). Sample DataFrame: (i) exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'], 'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19], 'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1], 'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
import pandas as pd
import numpy as np
exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)
df[["name","score"]].sort_values(by='score',ascending=False)
2.Select 'name' , ‘attempts’ and 'score' columns in rows 1, 3, 5, 6 from the above DataFrame (i).
df[["name","attempts","score"]].iloc[[1,3,5,6],]
3.Write a Pandas program to select the rows where the number of attempts in the examination is greater than 2 from the above DataFrame(i).
df[df.attempts>2]
4.Write a Pandas program to count the number of rows and columns of the above DataFrame(i).
print("number of rows",len(df))
print("number of columns",len(df.columns))
5.Write a Pandas program to select the rows where the score is missing, i.e. is NaN.
df[df.score.isnull()]
6.Write a Pandas program to select the rows the score is between 12 and 20 (inclusive).
df[(df.score>=12)&(df.score<=20)]
7.Write a Pandas program to select the rows where number of attempts in the examination is less than 2 and score greater than 10.
8.Write a Pandas program to change the score in row '3' to 11.5.
df.score.iloc[3] = 11.5
9.Write a Pandas program to calculate the sum of the examination attempts by the students.
sum(df.attempts)
10.Write a Pandas program to calculate the mean score for each different student in DataFrame(i) and list out student(s) name and score(s) who is/are more than mean score.
df[df.score>df.mean().score].loc[:,['name','score']]
References:
https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm
https://www.w3resource.com/python-exercises/pandas/index-dataframe.php