Common methods of the pandas DataFrame data structure

The examples below use the housing price prediction project from Chapter 2 of "Hands-On Machine Learning with Scikit-Learn and TensorFlow".

head() returns the first rows of the dataset, five by default.

print(housing.head())
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY
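head() also accepts a row count, and tail() does the same from the end of the frame. The following is a minimal sketch rather than the original project code: it rebuilds a small frame from the five rows printed above (the name `sample` and the inline CSV text are assumptions made for illustration) so it can run without the full dataset.

```python
import io

import pandas as pd

# Five rows copied from the head() output above; this is NOT the full dataset.
CSV_TEXT = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
"""

# "sample" is a hypothetical stand-in for the article's housing DataFrame.
sample = pd.read_csv(io.StringIO(CSV_TEXT))

first_three = sample.head(3)  # first 3 rows instead of the default 5
last_two = sample.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

Both calls return new DataFrames and leave `sample` unchanged.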

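info() gives a quick summary of the data: the total number of rows, the type of each attribute, and the number of non-null values per attribute.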

housing.info()  # info() prints the summary itself and returns None, so no print() is needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
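In the info() output above, total_bedrooms has 20433 non-null values out of 20640 entries, so 207 values are missing. When only the missing counts are of interest, they can be computed directly. The sketch below uses a small hand-built frame (the name `sample` and the single missing value are made up for illustration), not the real housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the NaN is inserted on purpose for the demo.
sample = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
        "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    }
)

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dtypes lists the type of each column, the same information info() reports.
print(sample.dtypes)
```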

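value_counts() lists the categories that occur in a column and how many rows fall into each one. Applied to the ocean_proximity column, it shows five categories; <1H OCEAN is the largest with 9136 rows, followed by INLAND with 6551.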

print(housing["ocean_proximity"].value_counts())
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
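value_counts() can also report each category as a fraction of all rows through its normalize parameter. The sketch below uses a short made-up Series (the name `proximity` and its six values are assumptions for illustration), not the real column.

```python
import pandas as pd

# Hypothetical miniature column with three categories.
proximity = pd.Series(
    ["<1H OCEAN", "<1H OCEAN", "<1H OCEAN", "INLAND", "INLAND", "NEAR BAY"],
    name="ocean_proximity",
)

# Absolute counts, sorted from the most to the least frequent category.
counts = proximity.value_counts()
print(counts)

# normalize=True returns the share of each category instead of the count.
shares = proximity.value_counts(normalize=True)
print(shares)
```

The shares sum to 1, which makes them easier to compare across datasets of different sizes.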

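describe() summarizes the numerical attributes. count is the number of non-null values; mean, min, and max are the average, minimum, and maximum; std is the standard deviation, which measures how spread out the values are. The 25%, 50%, and 75% rows are percentiles: for example, 25% of the rows have a housing_median_age below 18, and 50% have one below 29.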

print(housing.describe())
          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
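The percentile rows of describe() can be reproduced for a single column with quantile(). The sketch below uses only the five housing_median_age values from the head() output earlier (the name `ages` is an assumption for illustration), so its numbers differ from the full-dataset table above.

```python
import pandas as pd

# The five housing_median_age values shown in the head() output.
ages = pd.Series([41.0, 21.0, 52.0, 52.0, 52.0], name="housing_median_age")

# quantile() accepts a list of fractions and returns one value per fraction.
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)

# describe() reports the same three values under the labels 25%, 50%, 75%.
summary = ages.describe()
print(summary)
```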

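Finally, a note on visualization. DataFrame.hist() is a pandas method, but it draws with the matplotlib library, so matplotlib must be installed. It plots one histogram per numerical attribute. bins is the number of bars in each histogram (along the x-axis), and figsize is the figure size as (width, height) in inches.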

import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20, 15))
plt.show()  # show must be called; a bare plt.show does nothing
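On a machine without a display, such as a remote server, the figure can be written to an image instead of shown in a window. The following is a sketch under stated assumptions: the frame `sample`, the bin count, and the in-memory buffer are made up for illustration, and the Agg backend is chosen only so the code runs headless.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature frame built from values in the head() output.
sample = pd.DataFrame(
    {
        "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
        "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    }
)

# hist() returns an array of Axes, one per numerical column.
axes = sample.hist(bins=5, figsize=(8, 4))
titles = {ax.get_title() for ax in axes.flat}

# Render the figure to PNG bytes in memory; a file path works the same way.
figure = axes.flat[0].figure
png_buffer = io.BytesIO()
figure.savefig(png_buffer, format="png")
plt.close(figure)
```

Passing a file name to savefig() instead of the buffer writes the image to disk.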

Reposted from www.cnblogs.com/shixp/p/10761594.html