Pandas Data Operations Explained: A Summary

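Introduction to pandas

pandas is a tool built on NumPy, created for data analysis tasks. It incorporates a large number of libraries and some standard data models, and provides the tools needed to work efficiently with large datasets, along with many functions and methods for processing data quickly and conveniently. pandas is a core data analysis library for Python: it offers fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive.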

1. Reading data

First, install the pandas library with pip install pandas.

Import pandas, conventionally under the alias pd:

import pandas as pd

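1.1 Getting sample data: the Boston housing dataset

Load the Boston housing data from sklearn.datasets (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this step needs an older scikit-learn version):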

from sklearn.datasets import load_boston
boston = load_boston()
# Print the description of the boston dataset
print("Description of the Boston housing dataset:\n", boston.DESCR)

Output:

Description of the Boston housing dataset:
 .. _boston_dataset:

Boston house prices dataset
---------------------------
**Data Set Characteristics:**  

    :Number of Instances: 506 
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

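The Boston housing dataset has 14 attributes in total (13 predictors plus the target): CRIM (per-capita crime rate by town), ZN (proportion of residential land zoned for lots over 25,000 sq. ft.), INDUS (proportion of non-retail business acres), CHAS (whether the tract bounds the Charles River), NOX (nitric oxides concentration), RM (average number of rooms per dwelling), AGE (proportion of owner-occupied units built before 1940), DIS (weighted distance to employment centres), RAD (index of accessibility to radial highways), TAX (property-tax rate), PTRATIO (pupil-teacher ratio), B (a statistic computed from the proportion of Black residents by town), LSTAT (percentage of lower-status population), and MEDV (median home value, the target). Original link: https://blog.csdn.net/f18896984569/article/details/127759937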

Where is this data stored? Printing the object with print(boston) reveals the location; part of the output is listed below. The file is at: D:\\pythonProject\\venv\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv

 per $10,000\n        - PTRATIO  pupil-teacher ratio by town\n        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town\n        - LSTAT    % lower status of the population\n        - MEDV     Median value of owner-occupied homes in $1000's\n\n    :Missing Attribute Values: None\n\n    :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980.   N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems.   \n     \n.. topic:: References\n\n   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n", 'filename': 'D:\\pythonProject\\venv\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv'}

Process finished with exit code 0

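Opening that path shows the file boston_house_prices.csv. Its timestamp is not the current time, which shows the file was already on disk before this run (it sits inside the installed scikit-learn package).

Opening the file shows the data below; the first 11 rows are listed: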

CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  B       LSTAT  MEDV
0.00632  18    2.31   0     0.538  6.575  65.2  4.09    1    296  15.3     396.9   4.98   24
0.02731  0     7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.9   9.14   21.6
0.02729  0     7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
0.03237  0     2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
0.06905  0     2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.9   5.33   36.2
0.02985  0     2.18   0     0.458  6.43   58.7  6.0622  3    222  18.7     394.12  5.21   28.7
0.08829  12.5  7.87   0     0.524  6.012  66.6  5.5605  5    311  15.2     395.6   12.43  22.9
0.14455  12.5  7.87   0     0.524  6.172  96.1  5.9505  5    311  15.2     396.9   19.15  27.1
0.21124  12.5  7.87   0     0.524  5.631  100   6.0821  5    311  15.2     386.63  29.93  16.5
0.17004  12.5  7.87   0     0.524  6.004  85.9  6.5921  5    311  15.2     386.71  17.1   18.9
0.22489  12.5  7.87   0     0.524  6.377  94.3  6.3467  5    311  15.2     392.52  20.45  15

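The first line of the file states that the data has 506 records and 13 variables; the last column is the median house price. We delete that first line to make the data easier to work with, copy the file into the current working directory, and also save a copy in Excel format.

Reading Excel files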

def read_excel(io: {engine, parse},
               sheet_name: int = 0,
               header: int = 0,
               names: Any = None,
               index_col: Any = None,
               usecols: Any = None,
               squeeze: bool = False,
               dtype: Any = None,
               engine: {__ne__} = None,
               converters: Any = None,
               true_values: Any = None,
               false_values: Any = None,
               skiprows: Any = None,
               nrows: Any = None,
               na_values: Any = None,
               keep_default_na: bool = True,
               na_filter: bool = True,
               verbose: bool = False,
               parse_dates: bool = False,
               date_parser: Any = None,
               thousands: Any = None,
               comment: Any = None,
               skipfooter: int = 0,
               convert_float: bool = True,
               mangle_dupe_cols: bool = True,
               storage_options: Optional[Dict[str, Any]] = None)

Example: read data from an Excel file; by default all rows are read:

df=pd.read_excel('boston_house_prices.xls')
print(df)

Reading CSV files

The read_csv function has even more parameters:

def read_csv(filepath_or_buffer: PathLike[str],
             sep: Any = lib.no_default,
             delimiter: Any = None,
             header: str = "infer",
             names: Any = None,
             index_col: Any = None,
             usecols: Any = None,
             squeeze: bool = False,
             prefix: Any = None,
             mangle_dupe_cols: bool = True,
             dtype: Any = None,
             engine: Any = None,
             converters: Any = None,
             true_values: Any = None,
             false_values: Any = None,
             skipinitialspace: bool = False,
             skiprows: Any = None,
             skipfooter: int = 0,
             nrows: Any = None,
             na_values: Any = None,
             keep_default_na: bool = True,
             na_filter: bool = True,
             verbose: bool = False,
             skip_blank_lines: bool = True,
             parse_dates: bool = False,
             infer_datetime_format: bool = False,
             keep_date_col: bool = False,
             date_parser: Any = None,
             dayfirst: bool = False,
             cache_dates: bool = True,
             iterator: bool = False,
             chunksize: Any = None,
             compression: str = "infer",
             thousands: Any = None,
             decimal: str = ".",
             lineterminator: Any = None,
             quotechar: str = '\"',
             quoting: int = csv.QUOTE_MINIMAL,
             doublequote: bool = True,
             escapechar: Any = None,
             comment: Any = None,
             encoding: Any = None,
             dialect: Any = None,
             error_bad_lines: bool = True,
             warn_bad_lines: bool = True,
             delim_whitespace: bool = False,
             low_memory: Optional[bool] = _c_parser_defaults["low_memory"],
             memory_map: bool = False,
             float_precision: Any = None,
             storage_options: Optional[Dict[str, Any]] = None)

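Example: read CSV data. All rows are read by default; here nrows=5 limits the read to the first 5 rows:

df = pd.read_csv(
    # Path to the data file; the keyword name may be omitted and the path passed positionally
    filepath_or_buffer='boston_house_prices.csv',
    # Field separator; for CSV files the default is a comma. Another common choice is '\t'
    sep=',',
    # Skip the first line of the data file
    # skiprows=1,
    # nrows: read only the first n rows; if not given, all rows are read
    nrows=5,
)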
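As an alternative to deleting the description line from the file by hand, the skiprows parameter shown above can skip it at read time. The following is a minimal sketch under stated assumptions: raw_text is a hypothetical in-memory stand-in for the CSV file (one description line, a header, and three records copied from the data shown earlier), not the real file contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: a description line,
# then the header row and three records from the Boston data.
raw_text = """506,13
CRIM,ZN,INDUS,CHAS,NOX
0.00632,18,2.31,0,0.538
0.02731,0,7.07,0,0.469
0.02729,0,7.07,0,0.469
"""

# skiprows=1 drops the description line, so the next line becomes the header.
# nrows=2 then keeps only the first two data rows.
df = pd.read_csv(io.StringIO(raw_text), sep=",", skiprows=1, nrows=2)
print(df)
```

With a real file, the same keyword arguments would be passed together with the file path instead of the StringIO object.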

2. Saving data

Saving to an Excel file (writing the .xls format requires the xlwt package to be installed):

df.to_excel('boston_part.xls')

Saving to a CSV file:

df.to_csv('boston_part.csv')
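By default to_csv also writes the row index as an extra first column, which then shows up as an unnamed column when the file is read back. A small sketch of avoiding that with index=False follows; the two-row frame and the in-memory buffer are illustrative assumptions so the example runs without touching the disk.

```python
import io

import pandas as pd

# Two records from the Boston data, used only for illustration.
df = pd.DataFrame({"CRIM": [0.00632, 0.02731], "MEDV": [24.0, 21.6]})

# Write to an in-memory buffer instead of a file; index=False omits the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same two columns, with no extra index column.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```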

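3. Reading by position and slicing

This is done with the iloc indexer:

newdf = df.iloc[:, :], where row and column positions are counted from 0.

Example: read the value at a given position, such as row 5, column 5 (position [4, 4]):

df = pd.read_csv('boston_house_prices.csv')
value = df.iloc[4, 4]  # a single scalar, not a DataFrame
print(value)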

Read the first 5 rows and first 5 columns:

df = pd.read_csv('boston_house_prices.csv')
df=df.iloc[:5,:5]
print(df)

Result:

      CRIM    ZN  INDUS  CHAS    NOX
0  0.00632  18.0   2.31     0  0.538
1  0.02731   0.0   7.07     0  0.469
2  0.02729   0.0   7.07     0  0.469
3  0.03237   0.0   2.18     0  0.458
4  0.06905   0.0   2.18     0  0.458

Read 5 rows at a given position, with all columns:

df = pd.read_csv('boston_house_prices.csv')
df=df.iloc[10:15,:]
print(df)

Output:

       CRIM    ZN  INDUS  CHAS    NOX  ...  TAX  PTRATIO       B  LSTAT  MEDV
10  0.22489  12.5   7.87     0  0.524  ...  311     15.2  392.52  20.45  15.0
11  0.11747  12.5   7.87     0  0.524  ...  311     15.2  396.90  13.27  18.9
12  0.09378  12.5   7.87     0  0.524  ...  311     15.2  390.50  15.71  21.7
13  0.62976   0.0   8.14     0  0.538  ...  307     21.0  396.90   8.26  20.4
14  0.63796   0.0   8.14     0  0.538  ...  307     21.0  380.02  10.26  18.2

Reading all rows of selected columns works the same way.
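A minimal sketch of that column case, assuming a small hand-built frame with three records from the Boston data in place of the CSV file. It also shows the label-based loc indexer, which is not used in the text above, for comparison.

```python
import pandas as pd

# Three records from the Boston data, built inline for illustration.
df = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729],
    "ZN": [18.0, 0.0, 0.0],
    "INDUS": [2.31, 7.07, 7.07],
    "CHAS": [0, 0, 0],
    "NOX": [0.538, 0.469, 0.469],
})

# All rows, columns at positions 1 and 2 (the end position 3 is excluded).
by_position = df.iloc[:, 1:3]

# The same columns selected by label with loc.
by_label = df.loc[:, ["ZN", "INDUS"]]

print(by_position)
print(by_label)
```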

4. Merging and concatenating data

pd.concat([df1, df2], axis=1) concatenates data horizontally (side by side):

df = pd.read_csv('boston_house_prices.csv')
df1=df.iloc[:,:13]
df2=df.iloc[:,13]
print(df1,df2)
df3=pd.concat([df1,df2],axis=1)
print(df3)

Concatenating vertically (stacking rows):

df = pd.read_csv('boston_house_prices.csv')
df1=df.iloc[:5,:]
df2=df.iloc[5:10,:]
print(df1,df2)
df3=pd.concat([df1,df2],axis=0)
print(df3)
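Vertical concatenation keeps each piece's original row labels, which matters when the pieces are not stacked in their original order. The sketch below, built on a small illustrative frame rather than the CSV file, compares the default behavior with ignore_index=True, which renumbers the rows from 0.

```python
import pandas as pd

# Four records from the Boston data, built inline for illustration.
df = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729, 0.03237],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})
top = df.iloc[:2, :]
bottom = df.iloc[2:, :]

# Stacking in reverse order keeps the original labels: 2, 3, 0, 1.
stacked = pd.concat([bottom, top], axis=0)

# ignore_index=True discards the old labels and renumbers: 0, 1, 2, 3.
renumbered = pd.concat([bottom, top], axis=0, ignore_index=True)

print(stacked)
print(renumbered)
```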

5. Selecting data by condition

Select only the rows whose median house price is greater than 30, using the condition df['MEDV'] > 30:

df = pd.read_csv('boston_house_prices.csv')
df=df[df['MEDV']>30]
print(df)
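Several conditions can be combined in the same way. The following sketch assumes a small inline frame with five records from the Boston data; each condition goes in its own parentheses, joined with & (and) or | (or) rather than the Python keywords and / or.

```python
import pandas as pd

# Five records from the Boston data, built inline for illustration.
df = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})

# Rows where both conditions hold.
both = df[(df["MEDV"] > 30) & (df["RM"] > 7)]

# Rows where at least one condition holds.
either = df[(df["MEDV"] > 30) | (df["RM"] > 7)]

print(both)
print(either)
```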

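6. Deleting data by condition

Delete the rows with a house price greater than 30:

df = pd.read_csv('boston_house_prices.csv')
indexname = df[df['MEDV'] > 30].index  # index labels of the rows to delete
df.drop(indexname, inplace=True)
print(df)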
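The same deletion can also be expressed as keeping the rows that do not match the condition. This is a sketch on a small inline frame (an assumption made so the example runs without the CSV file), comparing drop with an inverted boolean mask.

```python
import pandas as pd

# Five records from the Boston data, built inline for illustration.
df = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})

expensive = df["MEDV"] > 30

# Option 1: collect the matching index labels and drop them (returns a new frame).
dropped = df.drop(df[expensive].index)

# Option 2: invert the mask with ~ and keep what is left.
kept = df[~expensive]

print(dropped)
print(kept)
```

Both options leave the original df unchanged, unlike drop with inplace=True.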

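7. Statistical functions

df = pd.read_csv('boston_house_prices.csv')
print(df['MEDV'].mean())  # Mean of one column; returns a single number. Missing values are skipped automatically.
print(df[['MEDV', 'LSTAT']].mean())  # Mean of each of the two columns; returns two numbers as a Series
print(df[['MEDV', 'LSTAT']])
print(df[['MEDV', 'LSTAT']].mean(axis=1))  # Mean across the two columns for each row; returns a Series with one value per row.
# axis=1 computes across the columns, giving one result per row. axis=0 (the default) computes down the rows, giving one result per column. Mixing them up is common; when in doubt, try it and check.
print(df['MEDV'].max())  # Maximum
print(df['MEDV'].min())  # Minimum
print(df['MEDV'].std())  # Standard deviation
print(df['MEDV'].count())  # Number of non-missing values
print(df['MEDV'].median())  # Median
print(df['MEDV'].quantile(0.25))  # 25% quantile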
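Because the axis argument is easy to mix up, a tiny frame makes the difference visible. The sketch below uses three records from the Boston data, built inline as an assumption so it runs without the CSV file.

```python
import pandas as pd

# Three records from the Boston data, built inline for illustration.
df = pd.DataFrame({
    "MEDV": [24.0, 21.6, 34.7],
    "LSTAT": [4.98, 9.14, 4.03],
})

# axis=0 (default): one mean per column, indexed by column name.
col_means = df.mean(axis=0)

# axis=1: one mean per row, indexed by row label.
row_means = df.mean(axis=1)

print(col_means)
print(row_means)
```

Here col_means has two entries (MEDV and LSTAT), while row_means has three, one for each row.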

8. Sorting data

8.1 Sorting by index

sort_index() is the pandas function for sorting by index. By default it sorts by the row index in ascending order.

df = pd.read_csv('boston_house_prices.csv',nrows=5,
                  index_col=['CRIM'],  # use this column as the index
                  usecols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS'])
print(df)
df1=df.sort_index()
print('sort_index:')
print(df1)

Output:

         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
sort_index:
         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622

By default the index is sorted in ascending order. To sort in descending order instead:

df = pd.read_csv('boston_house_prices.csv',nrows=5,
                  index_col=['CRIM'],
                  usecols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS'])
print(df)
df1=df.sort_index(ascending=False)
print('sort_index:')
print(df1)

Output:

         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
sort_index:
         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900

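8.2 Sorting by value

Passing a single column name to sort_values() sorts the rows by that column. The ascending parameter selects ascending or descending order; the default is ascending.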

df = pd.read_csv('boston_house_prices.csv',nrows=5,
                  index_col=['CRIM'],
                  usecols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS'])
print(df)
df1=df.sort_values('NOX')
print('sort_values:')
print(df1)

Output:

         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
sort_values:
         ZN  INDUS  CHAS    NOX     RM   AGE     DIS
CRIM                                                
0.03237   0   2.18     0  0.458  6.998  45.8  6.0622
0.06905   0   2.18     0  0.458  7.147  54.2  6.0622
0.02731   0   7.07     0  0.469  6.421  78.9  4.9671
0.02729   0   7.07     0  0.469  7.185  61.1  4.9671
0.00632  18   2.31     0  0.538  6.575  65.2  4.0900
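sort_values() also accepts a list of column names, with a matching list for ascending, so ties in the first column are broken by the second. A minimal sketch follows, assuming an inline frame with the NOX and RM values of the five rows shown above and a default integer index instead of CRIM.

```python
import pandas as pd

# NOX and RM for the first five records, built inline for illustration.
df = pd.DataFrame({
    "NOX": [0.538, 0.469, 0.469, 0.458, 0.458],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147],
})

# Sort by NOX ascending; within equal NOX values, sort by RM descending.
sorted_df = df.sort_values(["NOX", "RM"], ascending=[True, False])
print(sorted_df)
```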

9. Modifying values

Modify the values of a given field according to a condition.

df1.loc[df1["NOX"]>0.50,"DIS"] = 0  # for rows where NOX is greater than 0.50, set DIS to 0
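To see the effect of this kind of conditional assignment, here is a small self-contained sketch. The three-row frame is an assumption built from values in the data above; the modification is made on a copy so the original frame stays available for comparison.

```python
import pandas as pd

# NOX and DIS for three records, built inline for illustration.
df = pd.DataFrame({
    "NOX": [0.538, 0.469, 0.458],
    "DIS": [4.09, 4.9671, 6.0622],
})

# Work on a copy so the original values are kept.
modified = df.copy()

# The row selector is a boolean mask; the column selector is the field to change.
modified.loc[modified["NOX"] > 0.50, "DIS"] = 0

print(df)
print(modified)
```

Only the first row has NOX above 0.50, so only its DIS value becomes 0.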

Reposted from blog.csdn.net/soderayer/article/details/129546835