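The pandas Data Analysis Module

I. Introduction
pandas (Python Data Analysis Library) is a data analysis module built on top of NumPy. It provides a large set of standard data models together with the tools needed to work efficiently with large datasets, and it is one of the main reasons Python has become an efficient and powerful environment for data analysis.
pandas has historically provided three main data structures:
1) Series: a labeled one-dimensional array.
2) DataFrame: a labeled, size-mutable two-dimensional table.
3) Panel: a labeled, size-mutable three-dimensional array. Panel was deprecated in pandas 0.20 and removed in pandas 0.25; current versions use a DataFrame with a MultiIndex instead.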
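Because Panel is no longer available in current pandas releases, three-dimensional data is usually stored in a DataFrame with a two-level MultiIndex. The following is a minimal sketch of that idea; the item names, dates, and column names are made up for illustration and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 2 items x 3 dates x 2 measurements.
items = ["item1", "item2"]
dates = pd.date_range("2017-01-01", periods=3, freq="D")
index = pd.MultiIndex.from_product([items, dates], names=["item", "date"])

values = np.arange(12, dtype="float64").reshape(6, 2)
panel_like = pd.DataFrame(values, index=index, columns=["x", "y"])

# Selecting one label on the outer level gives an ordinary 2-D DataFrame.
item1 = panel_like.loc["item1"]
print(panel_like.shape)  # (6, 2)
print(item1.shape)       # (3, 2)
```

The outer index level plays the role of the third dimension, so each `loc` lookup on it returns a regular date-by-column table.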
II. Code
1. Creating a one-dimensional array
  >>> import pandas as pd
  >>> import numpy as np
  >>> x = pd.Series([1, 3, 5, np.nan])
  >>> print(x)
  0    1.0
  1    3.0
  2    5.0
  3    NaN
  dtype: float64
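The Series above uses the default integer labels 0 to 3. As a small sketch of what "labeled" means in practice, the same values can be given explicit labels; the labels `a` to `d` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# Same values as above, with hypothetical string labels instead of 0..3.
s = pd.Series([1, 3, 5, np.nan], index=["a", "b", "c", "d"])

print(s["b"])          # label-based access: 3.0
print(s.isna().sum())  # number of missing values: 1
print(s.sum())         # NaN is skipped by default: 9.0
```

The presence of `np.nan` is also why the integers are stored as `float64`: NaN is a floating-point value.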
2. Creating two-dimensional arrays
  >>> dates = pd.date_range(start='20170101', end='20171231', freq='D')  # daily frequency
  >>> print(dates)
  DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                 '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
                 '2017-01-09', '2017-01-10',
                 ...
                 '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
                 '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
                 '2017-12-30', '2017-12-31'],
                dtype='datetime64[ns]', length=365, freq='D')
  >>> dates = pd.date_range(start='20170101', end='20171231', freq='M')  # month-end frequency (spelled 'ME' in pandas 2.2 and later)
  >>> print(dates)
  DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', '2017-04-30',
                 '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31',
                 '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
                dtype='datetime64[ns]', freq='M')
  >>> df = pd.DataFrame(np.random.randn(12, 4), index=dates, columns=list('ABCD'))
  >>> print(df)
                     A         B         C         D
  2017-01-31 -0.682556  0.244102  0.450855  0.236475
  2017-02-28 -0.630060  0.590667  0.482438  0.225697
  2017-03-31  1.066989  0.319339  1.094953  1.716053
  2017-04-30  0.334944 -0.053049 -1.009493 -1.039470
  2017-05-31 -0.380778 -0.044429  0.075647  0.931243
  2017-06-30  0.867540  0.872197 -0.738974 -1.114596
  2017-07-31  0.423371 -1.086386  0.183820 -0.438921
  2017-08-31  1.285163  0.634134 -0.472973  1.281057
  2017-09-30 -1.002832 -0.888122 -1.316014 -0.070637
  2017-10-31  1.735617 -0.253815  0.554403  1.536211
  2017-11-30  2.030384  0.667556  1.012698  0.239479
  2017-12-31  2.059718 -0.089050  1.420517  0.224578
  >>> df = pd.DataFrame(np.random.randint(1, 100, size=(12, 4)), index=dates, columns=list('ABCD'))
  >>> print(df)  # prints 12 rows of random integers from 1 to 99 in columns A to D; the values differ on every run
  >>> df = pd.DataFrame({'A': [np.random.randint(1, 100) for i in range(4)],
  ...                    'B': pd.date_range(start='20130101', periods=4, freq='D'),
  ...                    'C': pd.Series([1, 2, 3, 4], index=list(range(4)), dtype='float32'),
  ...                    'D': np.array([3] * 4, dtype='int32'),
  ...                    'E': pd.Categorical(["test", "train", "test", "train"]),
  ...                    'F': 'foo'})
  >>> print(df)
      A          B    C  D      E    F
  0  15 2013-01-01  1.0  3   test  foo
  1  11 2013-01-02  2.0  3  train  foo
  2  91 2013-01-03  3.0  3   test  foo
  3  91 2013-01-04  4.0  3  train  foo
  >>> df = pd.DataFrame({'A': [np.random.randint(1, 100) for i in range(4)],
  ...                    'B': pd.date_range(start='20130101', periods=4, freq='D'),
  ...                    'C': pd.Series([1, 2, 3, 4], index=['zhang', 'li', 'zhou', 'wang'], dtype='float32'),
  ...                    'D': np.array([3] * 4, dtype='int32'),
  ...                    'E': pd.Categorical(["test", "train", "test", "train"]),
  ...                    'F': 'foo'})
  >>> print(df)
          A          B    C  D      E    F
  zhang  36 2013-01-01  1.0  3   test  foo
  li     86 2013-01-02  2.0  3  train  foo
  zhou   10 2013-01-03  3.0  3   test  foo
  wang   79 2013-01-04  4.0  3  train  foo
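The listings above call the legacy `np.random` functions without a seed, so every run prints different numbers. If reproducible output is wanted, one option is NumPy's `Generator` API with a fixed seed. This is a sketch of that alternative rather than the approach the original article uses; the seed value 0 and the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice

dates = pd.date_range(start="2017-01-01", periods=12, freq="D")

# Standard-normal floats, the Generator counterpart of np.random.randn(12, 4).
df_float = pd.DataFrame(rng.standard_normal((12, 4)), index=dates, columns=list("ABCD"))

# integers(1, 100) draws from 1..99, like np.random.randint(1, 100).
df_int = pd.DataFrame(rng.integers(1, 100, size=(12, 4)), index=dates, columns=list("ABCD"))

print(df_float.shape, df_int.shape)
```

Re-running the script with the same seed reproduces the same two tables, which makes printed examples easier to compare.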
3. Viewing two-dimensional data
  >>> df.head()  # shows the first 5 rows by default
          A          B    C  D      E    F
  zhang  36 2013-01-01  1.0  3   test  foo
  li     86 2013-01-02  2.0  3  train  foo
  zhou   10 2013-01-03  3.0  3   test  foo
  wang   79 2013-01-04  4.0  3  train  foo
  >>> df.head(3)  # first 3 rows
          A          B    C  D      E    F
  zhang  36 2013-01-01  1.0  3   test  foo
  li     86 2013-01-02  2.0  3  train  foo
  zhou   10 2013-01-03  3.0  3   test  foo
  >>> df.tail(2)  # last 2 rows
         A          B    C  D      E    F
  zhou  10 2013-01-03  3.0  3   test  foo
  wang  79 2013-01-04  4.0  3  train  foo
4. Viewing the index, column names, and data
  >>> df.index
  Index(['zhang', 'li', 'zhou', 'wang'], dtype='object')
  >>> df.columns
  Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
  >>> df.values
  array([[36, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'foo'],
         [86, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'foo'],
         [10, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'foo'],
         [79, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'foo']], dtype=object)
5. Viewing summary statistics
  >>> df.describe()  # count, mean, standard deviation, minimum, quartiles, maximum
                 A         C    D
  count   4.000000  4.000000  4.0
  mean   52.750000  2.500000  3.0
  std    36.068222  1.290994  0.0
  min    10.000000  1.000000  3.0
  25%    29.500000  1.750000  3.0
  50%    57.500000  2.500000  3.0
  75%    80.750000  3.250000  3.0
  max    86.000000  4.000000  3.0
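To make the `describe()` figures less opaque, the sketch below recomputes two of them for column A. It assumes the four values of A shown in the listing above (36, 86, 10, 79) and shows that the reported standard deviation is the sample standard deviation (`ddof=1`), not NumPy's default population value.

```python
import numpy as np
import pandas as pd

a = pd.Series([36, 86, 10, 79], name="A")  # column A from the listing above

summary = a.describe()
print(summary["mean"])           # 52.75
print(round(summary["std"], 6))  # 36.068222

# NumPy defaults to ddof=0, so ddof=1 is needed to match pandas.
same_std = np.isclose(summary["std"], np.std(a.to_numpy(), ddof=1))
# The "50%" row is the median.
same_median = summary["50%"] == a.median()
print(same_std, same_median)     # True True
```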
6. Transposing two-dimensional data
  >>> df.T
                   zhang                   li                 zhou  \
  A                   36                   86                   10
  B  2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00
  C                    1                    2                    3
  D                    3                    3                    3
  E                 test                train                 test
  F                  foo                  foo                  foo

                    wang
  A                   79
  B  2013-01-04 00:00:00
  C                    4
  D                    3
  E                train
  F                  foo
 
7. Sorting
  >>> df.sort_index(axis=0, ascending=False)  # sort by row labels, descending
          A          B    C  D      E    F
  zhou   10 2013-01-03  3.0  3   test  foo
  zhang  36 2013-01-01  1.0  3   test  foo
  wang   79 2013-01-04  4.0  3  train  foo
  li     86 2013-01-02  2.0  3  train  foo
  >>> df.sort_index(axis=1, ascending=False)  # sort by column labels, descending
           F      E  D    C          B   A
  zhang  foo   test  3  1.0 2013-01-01  36
  li     foo  train  3  2.0 2013-01-02  86
  zhou   foo   test  3  3.0 2013-01-03  10
  wang   foo  train  3  4.0 2013-01-04  79
  >>> df.sort_index(axis=0, ascending=True)
          A          B    C  D      E    F
  li     86 2013-01-02  2.0  3  train  foo
  wang   79 2013-01-04  4.0  3  train  foo
  zhang  36 2013-01-01  1.0  3   test  foo
  zhou   10 2013-01-03  3.0  3   test  foo
  >>> df.sort_values(by='A')  # sort by the values in column A
          A          B    C  D      E    F
  zhou   10 2013-01-03  3.0  3   test  foo
  zhang  36 2013-01-01  1.0  3   test  foo
  wang   79 2013-01-04  4.0  3  train  foo
  li     86 2013-01-02  2.0  3  train  foo
  >>> df.sort_values(by='A', ascending=False)  # descending order
          A          B    C  D      E    F
  li     86 2013-01-02  2.0  3  train  foo
  wang   79 2013-01-04  4.0  3  train  foo
  zhang  36 2013-01-01  1.0  3   test  foo
  zhou   10 2013-01-03  3.0  3   test  foo
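The sorting calls above each use a single key. `sort_values` also accepts a list of columns with a matching list of directions, which is handy when one column has ties. The sketch below is an illustration on a reduced copy of the table (only columns A and E); it is not taken from the original article.

```python
import pandas as pd

# Reduced copy of the example table: column E contains ties.
df = pd.DataFrame(
    {"A": [36, 86, 10, 79], "E": ["test", "train", "test", "train"]},
    index=["zhang", "li", "zhou", "wang"],
)

# Sort by E ascending, then by A descending within each value of E.
result = df.sort_values(by=["E", "A"], ascending=[True, False])
print(result.index.tolist())  # ['zhang', 'zhou', 'li', 'wang']

# sort_values returns a new object; the original order is untouched.
print(df.index.tolist())      # ['zhang', 'li', 'zhou', 'wang']
```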
 
8. Selecting data
From this point on, column A holds different values than in the earlier listings, presumably because the DataFrame was re-created with new random numbers.
  >>> df['A']  # select a column
  zhang     1
  li        1
  zhou     60
  wang     58
  Name: A, dtype: int64
  >>> df[0:2]  # select several rows with a slice
         A          B    C  D      E    F
  zhang  1 2013-01-01  1.0  3   test  foo
  li     1 2013-01-02  2.0  3  train  foo
  >>> df.loc[:, ['A', 'C']]  # select several columns
          A    C
  zhang   1  1.0
  li      1  2.0
  zhou   60  3.0
  wang   58  4.0
  >>> df.loc[['zhang', 'zhou'], ['A', 'D', 'E']]  # select by rows and columns at the same time
          A  D     E
  zhang   1  3  test
  zhou   60  3  test
  >>> df.loc['zhang', ['A', 'D', 'E']]  # a single row label returns a Series
  A       1
  D       3
  E    test
  Name: zhang, dtype: object
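The selections above are all label-based. Two other common forms are position-based selection with `iloc` and filtering with a boolean mask. The sketch below uses a reduced copy of the table (columns A and D only) as assumed sample data, and also shows that a label slice with `loc` includes its end point while a position slice does not.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 1, 60, 58], "D": [3, 3, 3, 3]},
    index=["zhang", "li", "zhou", "wang"],
)

first_two = df.iloc[0:2]         # rows by position, end excluded
by_label = df.loc["zhang":"li"]  # rows by label, end included
big_a = df[df["A"] > 50]         # rows where the condition is True

print(first_two.index.tolist())  # ['zhang', 'li']
print(by_label.index.tolist())   # ['zhang', 'li']
print(big_a.index.tolist())      # ['zhou', 'wang']
```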
9. Modifying and setting data
  >>> df.iat[0, 2] = 3  # set the value at a given row and column position
  >>> print(df)
          A          B    C  D      E    F
  zhang   1 2013-01-01  3.0  3   test  foo
  li      1 2013-01-02  2.0  3  train  foo
  zhou   60 2013-01-03  3.0  3   test  foo
  wang   58 2013-01-04  4.0  3  train  foo
  >>> df.loc[:, 'D'] = [np.random.randint(50, 60) for i in range(4)]  # replace the values of a column
  >>> print(df)
          A          B    C   D      E    F
  zhang   1 2013-01-01  3.0  57   test  foo
  li      1 2013-01-02  2.0  52  train  foo
  zhou   60 2013-01-03  3.0  57   test  foo
  wang   58 2013-01-04  4.0  56  train  foo
  >>> df['C'] = -df['C']  # negate the values of a column
  >>> print(df)
          A          B    C   D      E    F
  zhang   1 2013-01-01 -3.0  57   test  foo
  li      1 2013-01-02 -2.0  52  train  foo
  zhou   60 2013-01-03 -3.0  57   test  foo
  wang   58 2013-01-04 -4.0  56  train  foo
10. Handling missing values
  >>> df1 = df.reindex(index=['zhang', 'li', 'zhou', 'wang'], columns=list(df.columns) + ['G'])
  >>> print(df1)
          A          B    C   D      E    F   G
  zhang   1 2013-01-01 -3.0  57   test  foo NaN
  li      1 2013-01-02 -2.0  52  train  foo NaN
  zhou   60 2013-01-03 -3.0  57   test  foo NaN
  wang   58 2013-01-04 -4.0  56  train  foo NaN
  >>> df1.iat[0, 6] = 3  # set one element; the rest of the column stays NaN
  >>> print(df1)
          A          B    C   D      E    F    G
  zhang   1 2013-01-01 -3.0  57   test  foo  3.0
  li      1 2013-01-02 -2.0  52  train  foo  NaN
  zhou   60 2013-01-03 -3.0  57   test  foo  NaN
  wang   58 2013-01-04 -4.0  56  train  foo  NaN
  >>> pd.isnull(df1)  # test for missing values; returns a table of True/False
             A      B      C      D      E      F      G
  zhang  False  False  False  False  False  False  False
  li     False  False  False  False  False  False   True
  zhou   False  False  False  False  False  False   True
  wang   False  False  False  False  False  False   True
  >>> df1.dropna()  # return only the rows without missing values
         A          B    C   D     E    F    G
  zhang  1 2013-01-01 -3.0  57  test  foo  3.0
  >>> df1['G'] = df1['G'].fillna(5)  # fill missing values; assigning back avoids the unreliable chained inplace=True form
  >>> print(df1)
          A          B    C   D      E    F    G
  zhang   1 2013-01-01 -3.0  57   test  foo  3.0
  li      1 2013-01-02 -2.0  52  train  foo  5.0
  zhou   60 2013-01-03 -3.0  57   test  foo  5.0
  wang   58 2013-01-04 -4.0  56  train  foo  5.0
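Beyond filling with a constant, two variations are often useful: filling each column with its own value (for example the column mean) and dropping rows only when specific columns are missing. The following sketch uses a reduced copy of `df1` (columns A and G) as assumed sample data.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"A": [1, 1, 60, 58], "G": [3.0, np.nan, np.nan, np.nan]},
    index=["zhang", "li", "zhou", "wang"],
)

# A dict maps column names to fill values; here G is filled with its own mean.
filled = df1.fillna({"G": df1["G"].mean()})
# subset restricts the missing-value check to the listed columns.
kept = df1.dropna(subset=["G"])

print(filled["G"].tolist())   # [3.0, 3.0, 3.0, 3.0]
print(kept.index.tolist())    # ['zhang']
print(df1["G"].isna().sum())  # 3, the original frame is unchanged
```

Both calls return new objects, so the result has to be assigned to a name to be kept.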
11. Operating on data
  >>> df1.mean(numeric_only=True)  # column means; missing values are skipped, non-numeric columns are excluded
  A    30.0
  C    -3.0
  D    55.5
  G     4.5
  dtype: float64
  >>> df.mean(axis=1, numeric_only=True)  # row-wise means over the numeric columns
  zhang    18.333333
  li       17.000000
  zhou     38.000000
  wang     36.666667
  dtype: float64
  >>> df1.shift(1)  # shift the data down by one row
            A          B    C     D      E    F    G
  zhang   NaN        NaT  NaN   NaN    NaN  NaN  NaN
  li      1.0 2013-01-01 -3.0  57.0   test  foo  3.0
  zhou    1.0 2013-01-02 -2.0  52.0  train  foo  5.0
  wang   60.0 2013-01-03 -3.0  57.0   test  foo  5.0
  >>> df1['D'].value_counts()  # count how often each value occurs
  57    2
  56    1
  52    1
  Name: D, dtype: int64
  >>> print(df1)
          A          B    C   D      E    F    G
  zhang   1 2013-01-01 -3.0  57   test  foo  3.0
  li      1 2013-01-02 -2.0  52  train  foo  5.0
  zhou   60 2013-01-03 -3.0  57   test  foo  5.0
  wang   58 2013-01-04 -4.0  56  train  foo  5.0
  >>> df2 = pd.DataFrame(np.random.randn(10, 4))
  >>> print(df2)
            0         1         2         3
  0 -0.939904 -1.856658 -0.281965  0.203624
  1  0.350162  0.060674 -0.914808  0.135735
  2 -1.031384 -1.611274  0.341546 -0.363671
  3  0.139464 -0.050959 -0.810610 -0.772648
  4 -1.146810 -0.791608  1.488790 -0.490004
  5 -0.100707 -0.763545 -0.071274 -0.298142
  6 -0.212014  0.809709  0.693196  0.980568
  7 -0.812985 -0.000325 -0.675101 -0.217394
  8  0.066969 -0.084609 -0.433099  0.535616
  9 -0.319120 -0.532854  1.321712 -1.751913
  >>> p1 = df2[:3]
  >>> print(p1)
            0         1         2         3
  0 -0.939904 -1.856658 -0.281965  0.203624
  1  0.350162  0.060674 -0.914808  0.135735
  2 -1.031384 -1.611274  0.341546 -0.363671
  >>> p2 = df2[3:7]
  >>> print(p2)
            0         1         2         3
  3  0.139464 -0.050959 -0.810610 -0.772648
  4 -1.146810 -0.791608  1.488790 -0.490004
  5 -0.100707 -0.763545 -0.071274 -0.298142
  6 -0.212014  0.809709  0.693196  0.980568
  >>> p3 = df2[7:]
  >>> print(p3)
            0         1         2         3
  7 -0.812985 -0.000325 -0.675101 -0.217394
  8  0.066969 -0.084609 -0.433099  0.535616
  9 -0.319120 -0.532854  1.321712 -1.751913
  >>> df3 = pd.concat([p1, p2, p3])  # concatenate the pieces row-wise
  >>> print(df3)
            0         1         2         3
  0 -0.939904 -1.856658 -0.281965  0.203624
  1  0.350162  0.060674 -0.914808  0.135735
  2 -1.031384 -1.611274  0.341546 -0.363671
  3  0.139464 -0.050959 -0.810610 -0.772648
  4 -1.146810 -0.791608  1.488790 -0.490004
  5 -0.100707 -0.763545 -0.071274 -0.298142
  6 -0.212014  0.809709  0.693196  0.980568
  7 -0.812985 -0.000325 -0.675101 -0.217394
  8  0.066969 -0.084609 -0.433099  0.535616
  9 -0.319120 -0.532854  1.321712 -1.751913
  >>> df2 == df3
        0     1     2     3
  0  True  True  True  True
  1  True  True  True  True
  2  True  True  True  True
  3  True  True  True  True
  4  True  True  True  True
  5  True  True  True  True
  6  True  True  True  True
  7  True  True  True  True
  8  True  True  True  True
  9  True  True  True  True
  >>> df4 = pd.DataFrame({'A': [np.random.randint(1, 5) for i in range(8)],
  ...                     'B': [np.random.randint(10, 15) for i in range(8)],
  ...                     'C': [np.random.randint(20, 30) for i in range(8)],
  ...                     'D': [np.random.randint(80, 100) for i in range(8)]})
  >>> print(df4)
     A   B   C   D
  0  4  11  24  91
  1  1  13  28  95
  2  2  12  27  91
  3  1  12  20  87
  4  3  11  24  96
  5  1  13  21  99
  6  3  11  22  95
  7  2  13  26  98
  >>> df4.groupby('A').sum()  # group by column A and sum each group
      B   C    D
  A
  1  38  69  281
  2  25  53  189
  3  22  46  191
  4  11  24   91
  >>> df4.groupby(['A', 'B']).mean()  # group by two columns and average each group
           C     D
  A B
  1 12  20.0  87.0
    13  24.5  97.0
  2 12  27.0  91.0
    13  26.0  98.0
  3 11  23.0  95.5
  4 11  24.0  91.0
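The `groupby` calls above apply one function to every column. When different columns need different statistics, named aggregation with `agg` is one way to express it. The sketch below rebuilds `df4` from the values printed above; the output column names `rows`, `c_mean`, and `d_max` are made up for this example.

```python
import pandas as pd

# df4 rebuilt from the values printed in the listing above.
df4 = pd.DataFrame({
    "A": [4, 1, 2, 1, 3, 1, 3, 2],
    "B": [11, 13, 12, 12, 11, 13, 11, 13],
    "C": [24, 28, 27, 20, 24, 21, 22, 26],
    "D": [91, 95, 91, 87, 96, 99, 95, 98],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df4.groupby("A").agg(
    rows=("B", "count"),
    c_mean=("C", "mean"),
    d_max=("D", "max"),
)
print(summary)
```

Each group gets one row in `summary`, indexed by the value of A, with exactly the three requested columns.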
12. Plotting with matplotlib
  >>> import pandas as pd
  >>> import numpy as np
  >>> import matplotlib.pyplot as plt
  >>> df = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
  >>> print(df)
               B          C
  0     0.089886   0.511081
  1     1.323766   1.584758
  2     1.489479  -0.438671
  3     0.831331  -0.398021
  4    -0.248233   0.494418
  ..         ...        ...
  995 -14.088767 -20.492952
  996 -14.169056 -20.666777
  997 -14.798708 -19.960555
  998 -15.766568 -19.395622
  999 -17.281143 -19.089793

  [1000 rows x 2 columns]
  >>> df['A'] = pd.Series(list(range(len(df))))
  >>> print(df)
               B          C    A
  0     0.089886   0.511081    0
  1     1.323766   1.584758    1
  2     1.489479  -0.438671    2
  3     0.831331  -0.398021    3
  4    -0.248233   0.494418    4
  ..         ...        ...  ...
  995 -14.088767 -20.492952  995
  996 -14.169056 -20.666777  996
  997 -14.798708 -19.960555  997
  998 -15.766568 -19.395622  998
  999 -17.281143 -19.089793  999

  [1000 rows x 3 columns]
  >>> plt.figure()
  <matplotlib.figure.Figure object at 0x000002A2A0B10F28>
  >>> df.plot(x='A')
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A12FE7F0>
  >>> plt.show()
Output:
[Figure: line plot of columns B and C against column A]

  >>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
  >>> print(df)
            a         b         c         d
  0  0.504434  0.190875  0.001687  0.327372
  1  0.406844  0.602029  0.912075  0.815889
  2  0.828534  0.985910  0.094662  0.552089
  3  0.198843  0.818785  0.750649  0.967054
  4  0.498494  0.151378  0.417506  0.264438
  5  0.655288  0.672788  0.088616  0.433270
  6  0.493127  0.009254  0.179479  0.396655
  7  0.419386  0.910986  0.020004  0.229063
  8  0.671469  0.612189  0.374920  0.407093
  9  0.414978  0.033499  0.756025  0.717849
  >>> df.plot(kind='bar')
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A17BD7B8>
  >>> plt.show()
Output:
[Figure: grouped bar chart of columns a, b, c, and d]

  >>> df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
  >>> df.plot(kind='barh', stacked=True)
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002A2A3784390>
  >>> plt.show()
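The plots above are shown interactively with `plt.show()`. When running as a script or on a machine without a display, a common alternative is to draw onto explicit axes and save the figure to a file. The sketch below puts both bar charts side by side; the `Agg` backend, the seed, the figure size, and the file name `pandas_bars.png` are all assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed so the picture is reproducible
df = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "d"])

# One figure with two axes; passing ax= tells pandas where to draw.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.plot(kind="bar", ax=axes[0], title="bar")
df.plot(kind="barh", stacked=True, ax=axes[1], title="barh, stacked")

fig.tight_layout()
fig.savefig("pandas_bars.png")  # hypothetical output file name
```

Passing `ax=` keeps pandas from creating a figure of its own, so both charts end up in the single saved image.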


 
 
 
 


Reposted from cakin24.iteye.com/blog/2388398