Pandas Data Structures and Study Notes

Data Structure

Series
DataFrame
Panel (deprecated in pandas 0.20 and removed in 0.25)

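Higher-dimensional data structures act as containers for lower-dimensional ones: a DataFrame is a container of Series.

All pandas data structures are value mutable. A DataFrame is size mutable, whereas a Series is size immutable.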
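A minimal sketch of the container relationship and of DataFrame mutability, assuming a small made-up DataFrame (the column names x, y and z are illustrative only and do not come from the notes above):

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

# A DataFrame is a container of Series: selecting one column gives a Series
col = df['x']
print(type(col).__name__)

# Value mutable: an existing value can be changed in place
df.loc[0, 'x'] = 10

# Size mutable: a new column can be inserted into the DataFrame
df['z'] = df['x'] + df['y']
print(df.shape)
```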

Series

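Commonly used constructor parameters:

By default, when the data is an ndarray and no index is passed, the index is [0, 1, 2, ..., n-1], where n is the length of the data.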

import pandas as pd
import numpy as np
s = pd.Series(dtype='float64')  # empty Series; an explicit dtype avoids a warning on some pandas versions
print(s)

data = np.array(['a','b','c','d'])
s = pd.Series(data)
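To make the default index concrete, the sketch below builds the same Series, inspects its index, and then passes an explicit index argument. The labels 100 to 103 are arbitrary values chosen for illustration.

```python
import numpy as np
import pandas as pd

data = np.array(['a', 'b', 'c', 'd'])

# No index passed: pandas assigns the positions 0 .. len(data)-1
s_default = pd.Series(data)
print(list(s_default.index))

# Explicit index: it must have the same length as the data
s_custom = pd.Series(data, index=[100, 101, 102, 103])
print(s_custom[101])  # looked up by label, not by position
```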

To customize how the data is used, the following approaches are available:

1. Create a Series from a dict

When no index is passed, the dict keys become the index.

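data = {'a':0,'b':1,'c':2}
s = pd.Series(data)
print(s)

# When an index is passed, values are pulled out by label; 'e' is not a key in the dict, so its value is NaN
s = pd.Series(data,index=['a','b','e'])
print(s)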
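What happens to the labels 'c' and 'e' in the second Series can be checked directly. A small sketch using the same dict:

```python
import pandas as pd

data = {'a': 0, 'b': 1, 'c': 2}

# Only the labels listed in index are kept: 'c' is dropped,
# and 'e' has no matching key in the dict, so it becomes NaN
s = pd.Series(data, index=['a', 'b', 'e'])

print(s.isna().tolist())
print(s.dtype)  # the integers are upcast to float so that NaN can be stored
```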

2. Create a Series from a scalar

import pandas as pd
# The scalar is repeated to match the length of the index, so an index must be provided
s = pd.Series(5,index=[0,1,2,3])
print(s)
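A quick check of the scalar case, as a sketch: the single value is repeated once for every label in the index.

```python
import pandas as pd

# The scalar 5 is repeated once for every label in the index
s = pd.Series(5, index=[0, 1, 2, 3])

print(s.tolist())
print(len(s))
```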

Accessing data from series with position

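import pandas as pd
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
# Use .iloc for positional access; s[0] on a label-based index is deprecated in recent pandas versions
print(s.iloc[0])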
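Positional access also supports slices. The sketch below uses .iloc, which always interprets its argument as a position regardless of the labels in the index:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]          # first element
first_three = s.iloc[:3]   # positions 0, 1, 2 (the end position is exclusive)
last_three = s.iloc[-3:]   # last three elements

print(first)
print(first_three.tolist())
print(last_three.tolist())
```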

Retrieve data with label

import pandas as pd
s = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])

print(s[['a','b']])
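Label-based access can also be written explicitly with .loc. A sketch covering a single label, a list of labels, a label slice, and a label that does not exist:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

single = s.loc['a']            # one label returns a scalar
several = s.loc[['a', 'b']]    # a list of labels returns a Series
sliced = s.loc['b':'d']        # label slices include both endpoints

# A label that is not in the index raises KeyError
try:
    s.loc['f']
    missing_raises = False
except KeyError:
    missing_raises = True

print(single, several.tolist(), sliced.tolist(), missing_raises)
```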

Exercises:


1. Write a Pandas program to select the 'name' and 'score' columns from the following DataFrame and order by score (highest to lowest). The sample DataFrame (i) is built from the exam_data dict below.

import pandas as pd
import numpy as np
exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

df = pd.DataFrame(exam_data)
df[["name","score"]].sort_values(by='score',ascending=False)

2. Select the 'name', 'attempts' and 'score' columns in rows 1, 3, 5, 6 from the above DataFrame (i).

df[["name","attempts","score"]].iloc[[1,3,5,6]]

3. Write a Pandas program to select the rows where the number of attempts in the examination is greater than 2 from the above DataFrame (i).

df[df.attempts>2]

4. Write a Pandas program to count the number of rows and columns of the above DataFrame (i).

print("number of rows",len(df))

print("number of columns",len(df.columns))
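As an alternative sketch, both counts can be read at once from the shape attribute, which is a (rows, columns) tuple. The DataFrame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)

# shape is a (number of rows, number of columns) tuple
n_rows, n_cols = df.shape
print("number of rows", n_rows)
print("number of columns", n_cols)
```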

5. Write a Pandas program to select the rows where the score is missing, i.e. is NaN.

df[df.score.isnull()]

6. Write a Pandas program to select the rows where the score is between 12 and 20 (inclusive).

df[(df.score>=12)&(df.score<=20)]
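The same range filter can also be sketched with Series.between, which includes both endpoints by default. Rows with a NaN score fail the comparison and are left out. The DataFrame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)

# between(12, 20) is equivalent to (score >= 12) & (score <= 20)
in_range = df[df['score'].between(12, 20)]
print(in_range[['name', 'score']])
```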

7. Write a Pandas program to select the rows where number of attempts in the examination is less than 2 and score greater than 10.
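No answer is recorded for this exercise. One possible solution, sketched in the same style as the other answers, combines two boolean masks with &. The variable name result is my own choice, and the DataFrame is rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)

# Each condition needs its own parentheses because & binds tighter than < and >
result = df[(df.attempts < 2) & (df.score > 10)]
print(result)
```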

8. Write a Pandas program to change the score in row 3 to 11.5.

df.loc[3,'score'] = 11.5  # a single .loc call avoids chained assignment, which may not update df

9. Write a Pandas program to calculate the sum of the examination attempts by the students.

df['attempts'].sum()

10. Write a Pandas program to calculate the mean score of the students in DataFrame (i) and list the name and score of each student whose score is above the mean.

df.loc[df['score']>df['score'].mean(),['name','score']]
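A self-contained sketch of this exercise that takes the mean from the score column only. Depending on the pandas version, calling mean() on the whole DataFrame can fail because of the string columns, so restricting it to one column is the safer choice. NaN scores are skipped when the mean is computed.

```python
import numpy as np
import pandas as pd

exam_data = {'name': ['Ali', 'Abu', 'Katherine', 'Site', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'James'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
df = pd.DataFrame(exam_data)

# Mean of the eight non-missing scores
mean_score = df['score'].mean()

# Select rows above the mean and keep only the two columns of interest
above_mean = df.loc[df['score'] > mean_score, ['name', 'score']]
print(mean_score)
print(above_mean)
```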

References:

https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm

https://www.w3resource.com/python-exercises/pandas/index-dataframe.php