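This post covers the pandas data type hierarchy, dtype conversion, conversions between Python, NumPy, and pandas data, and fixes for problems that come up during conversion.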

pandas data types 2018/12/11

1. Data types

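1. Checking data types:
    df.info()
    df.dtypes
    series.dtype
    df.dtypes.value_counts()  # replaces get_dtype_counts(), which was removed in pandas 1.0
# A column that holds values of several types (for example numbers and strings) has dtype object.
# Mixed numeric types such as int32 and float32 are upcast to a common numeric dtype rather than object.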
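
To make the dtype rules above concrete, here is a minimal sketch. The column names `n`, `x`, and `mixed` are made up for illustration, and the integer column is given an explicit `int64` dtype so the result does not depend on the platform default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": np.array([1, 2, 3], dtype="int64"),
    "x": [1.0, 2.5, 3.0],
    "mixed": [1, "two", 3.0],  # numbers and a string in one column
})

# dtype of every column, as plain strings
dtype_names = df.dtypes.astype(str).to_dict()

# how many columns have each dtype (the job get_dtype_counts() used to do)
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()

# mixed numeric scalars end up in a float dtype, not object
upcast = pd.Series([np.int32(1), np.float32(2.0)])

print(dtype_names)
print(dtype_counts)
print(upcast.dtype)
```

Only the `mixed` column falls back to object, because a string cannot share a numeric dtype with the numbers around it.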

2. Example: list all subtypes of a generic dtype

import numpy as np

def subdtypes(dtype):
    # Recursively collect a NumPy scalar type and all of its subclasses
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]

subdtypes(np.generic)


[numpy.generic,
     [[numpy.number,
                 [[numpy.integer,
                        [[numpy.signedinteger,
                                [numpy.int8,numpy.int16,numpy.int32,numpy.int32,numpy.int64,numpy.timedelta64]],
                        [numpy.unsignedinteger,
                                [numpy.uint8,numpy.uint16,numpy.uint32,numpy.uint32,numpy.uint64]]]
                   ],
                 [numpy.inexact,
                        [[numpy.floating,
                                [numpy.float16, numpy.float32, numpy.float64, numpy.float64]],
                        [numpy.complexfloating,
                                [numpy.complex64, numpy.complex128, numpy.complex128]]]]
                  ]
       ],
     [numpy.flexible,
                        [[numpy.character,
                                [numpy.bytes_, numpy.str_]],
                        [numpy.void,
                                [numpy.record]]]],
                        numpy.bool_,numpy.datetime64,numpy.object_]
 ]
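
The hierarchy above is what `np.issubdtype` walks when it tests whether a dtype belongs to a family. The sketch below uses only NumPy's public type names and `pandas.api.types.is_numeric_dtype`; the `checks` dictionary is just a convenient way to collect the results.

```python
import numpy as np
import pandas as pd

checks = {
    "int32 is integer": np.issubdtype(np.int32, np.integer),
    "float64 is number": np.issubdtype(np.float64, np.number),
    "float64 is integer": np.issubdtype(np.float64, np.integer),
    # bool_ sits directly under generic, outside the number branch
    "bool_ is number": np.issubdtype(np.bool_, np.number),
}

# The same test works on the dtype of a pandas column
col = pd.Series([1.5, 2.5])
col_is_float = np.issubdtype(col.dtype, np.floating)
col_is_numeric = pd.api.types.is_numeric_dtype(col)

for name, result in checks.items():
    print(name, result)
```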

Data conversion: converting between Python, NumPy, and pandas 2019/1/10

1.1. Python to pandas

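Example 1.1: Python tuple/list/dict/array to Series/DataFrame
import array
import pandas as pd

# v can be any one of the following; each assignment replaces the previous one
v=(1,2)
v=[1,2]
v={'a':1,'b':2}
v=array.array('i',[1,2])

s=pd.Series(v)      # dict keys become the index; other inputs get a default integer index
df=pd.DataFrame([v])# dict keys become the column names; other inputs get default integer columns
pd.DataFrame.from_dict({'A': [1,2], 'B': [3,4]})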
'''
   A  B
0  1  3
1  2  4
'''
Example 1.2: NumPy array to Series/DataFrame
v=np.arange(4).reshape(2,2)
s=pd.Series(v.flatten())# Series data must be 1-D
df=pd.DataFrame(v)
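
The index and column rules from Examples 1.1 and 1.2 can be checked with a short sketch. All variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s_tuple = pd.Series((1, 2))                   # default integer index 0, 1
s_dict = pd.Series({"a": 1, "b": 2})          # dict keys become the index
df_dict = pd.DataFrame([{"a": 1, "b": 2}])    # dict keys become the columns

m = np.arange(4).reshape(2, 2)
s_flat = pd.Series(m.ravel())                 # flatten to 1-D before building a Series
df_m = pd.DataFrame(m)                        # a 2-D array maps straight to rows and columns

print(s_dict.index.tolist(), df_dict.columns.tolist(), df_m.shape)
```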

1.2. pandas to Python and NumPy (Series)

Example 2.1: Series to string/list/dict/array/xarray
s=pd.Series([1,2],index=list('ab'))
s.tolist()    # [1, 2]
s.to_dict()   # {'a': 1, 'b': 2}
s.to_string() # 'a    1\nb    2'
array.array('i',s)# array('i', [1, 2])
s.to_xarray() # requires the xarray package
''' 
<xarray.DataArray (index: 2)>
array([1, 2], dtype=int64)
Coordinates:
* index (index) object 'a' 'b'
'''
Example 2.2: Series to NumPy array
s.values# array([1, 2], dtype=int64)

Example 2.3: Series to DataFrame
s.to_frame()
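
A minimal sketch that gathers the Series conversions in one place. It uses `to_numpy()`, which the pandas documentation recommends over `.values`; the column name `value` passed to `to_frame` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2], index=list("ab"))

as_list = s.tolist()                  # plain Python list of values
as_dict = s.to_dict()                 # index labels become the keys
as_array = s.to_numpy()               # NumPy array of values
as_frame = s.to_frame(name="value")   # one-column DataFrame

print(as_list, as_dict, as_array, as_frame.shape)
```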

1.2. pandas to Python and NumPy (DataFrame)

Example 3.1: DataFrame to list/dict/string
df=pd.DataFrame([[1,2],[3,4]],index=list('ab'),columns=list('AB'))

np.array(df).tolist()# [[1, 2], [3, 4]]
df.stack().tolist()  # [1, 2, 3, 4]
df.to_dict()         # {'A': {'a': 1, 'b': 3}, 'B': {'a': 2, 'b': 4}}
df.to_string()       # '   A  B\na  1  2\nb  3  4'

Example 3.2: DataFrame to numpy.array
np.array(df) # array([[1, 2], [3, 4]], dtype=int64)
df.values    # same result as above
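
`to_dict` accepts an `orient` argument that changes the shape of the result, which is often handier than post-processing the default nested dict. A small sketch with the same frame as above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=list("ab"), columns=list("AB"))

rows = df.to_numpy().tolist()            # list of rows
records = df.to_dict(orient="records")   # one dict per row
by_column = df.to_dict(orient="list")    # column name -> list of values
by_row = df.to_dict(orient="index")      # index label -> row dict

print(rows)
print(records)
print(by_column)
print(by_row)
```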

1.3. Date format conversion

s.to_period([freq, copy]) # convert a Series from a DatetimeIndex to a PeriodIndex with the desired frequency

dt=pd.DatetimeIndex(['2018-10-14', '2018-10-15', '2018-10-16'])
dt.to_period('D')
# PeriodIndex(['2018-10-14', '2018-10-15', '2018-10-16'], dtype='period[D]', freq='D')
s.to_timestamp([freq, how, copy]) # convert a PeriodIndex back to a DatetimeIndex of timestamps at the start of each period

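2. Changing data types

# Ways to convert data types:
1) astype() for forced conversion
# Notes on converting to numbers:
# every value in the column must parse directly as a number; strings must not contain non-numeric characters such as ',' or '¥'; astype() may fail when the column has missing values.
2) custom functions
3) the helper functions to_numeric() and to_datetime()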
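
The astype() caveats above are easy to reproduce. This is a small sketch with made-up values; the `*_failed` flags exist only to record whether an exception was raised.

```python
import numpy as np
import pandas as pd

# Clean numeric strings convert without trouble
clean = pd.Series(["1", "2", "3"]).astype("int64")

# Thousands separators and currency symbols make astype() raise
dirty = pd.Series(["1,000", "¥5"])
try:
    dirty.astype(float)
    dirty_failed = False
except ValueError:
    dirty_failed = True

# A missing value cannot be cast to a NumPy integer dtype either
with_nan = pd.Series([1.0, np.nan])
try:
    with_nan.astype("int64")
    nan_failed = False
except ValueError:
    nan_failed = True

print(clean.tolist(), dirty_failed, nan_failed)
```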

Example 1: set the type with the dtype parameter when creating the DataFrame

df = pd.DataFrame([1], dtype='float')
df = pd.DataFrame([1], dtype=np.int8)

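Example 2: forced conversion with astype
(Column names in the sample data are kept in Chinese: 客户编号 customer ID, 客户姓名 customer name, 增长率 growth rate, 所属组 group, 状态 status.)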

data='客户编号 客户姓名 2018 2019 增长率 所属组 day month year 状态 \n' \
'4564651 张飞 ¥125,000.00 ¥162500.00 30% 500 12 10 2018 Y\n' \
'4564652 刘备 ¥920,000.00 ¥1012000.0 10% 700 26 5 2019 N\n' \
'4564653 关羽 ¥50,000.00 ¥62500.00 25% 125 24 2 2019 Y\n' \
'4564654 曹操 ¥15,000.00 ¥490000.00 4% 300 10 8 2019 Y\n'

from io import StringIO
df=pd.read_csv(StringIO(data), sep=r'\s+')
df.info() # show load information, mainly each column's dtype and non-null count

df['客户编号'] = df['客户编号'].astype('object') # convert the column and overwrite the original
df[['day', 'month']] = df[['day', 'month']].astype(int)

Example 3: custom functions for dtype conversion

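def convert_currency(value):
    # Strip thousands separators and currency symbols, then convert to float
    v = value.replace(',', '').replace('¥', '').replace('￥', '')
    return float(v)  # np.float was removed in NumPy 1.24; use the built-in float

# Full conversion of the 2018 and 2019 columns
df['2018'] = df['2018'].apply(convert_currency)
df['2019'] = df['2019'].apply(convert_currency)
# df['2019'].apply(lambda x: x.replace('￥', '').replace(',', '')).astype('float')

def convert_percent(value):
    # Turn a string such as '30%' into the fraction 0.3
    return float(value.replace('%', '')) / 100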

df['增长率']=df['增长率'].apply(convert_percent)
# df['增长率'].apply(lambda x: x.replace('%', '')).astype('float') / 100
df['状态'] = np.where(df['状态'] == 'Y', True, False)
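
The same cleaning can be done with vectorized string methods instead of `apply`. This is a sketch on small stand-alone Series with values copied from the sample data, not a replacement for the functions above.

```python
import pandas as pd

money = pd.Series(["¥125,000.00", "¥920,000.00"])
growth = pd.Series(["30%", "10%"])
status = pd.Series(["Y", "N"])

# Remove both yen-sign variants and the thousands separator in one pass
amounts = money.str.replace(r"[¥￥,]", "", regex=True).astype(float)

# Drop the trailing percent sign, then scale to a fraction
rates = growth.str.rstrip("%").astype(float) / 100

# A comparison already yields a boolean column
flags = status == "Y"

print(amounts.tolist(), rates.tolist(), flags.tolist())
```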

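Example 4: helper functions for type conversion, such as to_numeric() and to_datetime()

df['所属组']=pd.to_numeric(df['所属组'], errors='coerce').fillna(0)# coerce invalid values to NaN, then fill them with 0
df['date']=pd.to_datetime(df[['day', 'month', 'year']])# combine the year, month, and day columns into one timestamp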
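
A self-contained sketch of both helpers. The `parts` frame and the `'N/A'` entry are invented for illustration; `pd.to_datetime` assembles dates from a DataFrame only when the columns are named `year`, `month`, and `day`.

```python
import pandas as pd

parts = pd.DataFrame({"year": [2018, 2019], "month": [10, 5], "day": [12, 26]})
dates = pd.to_datetime(parts)   # one timestamp per row

raw_group = pd.Series(["500", "N/A", "125"])
# 'N/A' cannot be parsed, so it becomes NaN and is then replaced by 0
group = pd.to_numeric(raw_group, errors="coerce").fillna(0)

print(dates.tolist())
print(group.tolist())
```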

Example 5: specify the conversions directly when reading the data
df1 = pd.read_csv(StringIO(data), sep=r'\s+',
                  converters={
                      '客户编号': str,
                      '2018': convert_currency,
                      '2019': convert_currency,
                      '增长率': convert_percent,
                      '所属组': lambda x: pd.to_numeric(x, errors='coerce'),
                      '状态': lambda x: x == 'Y'  # a converter receives one cell as a string
                  })
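
To see converters at work without the full sample data, here is a reduced sketch. The three-column text in `raw` and its column names are hypothetical.

```python
from io import StringIO

import pandas as pd

raw = "id amount rate\n001 ¥1,000.00 30%\n002 ¥250.50 4%\n"

small = pd.read_csv(
    StringIO(raw),
    sep=r"\s+",
    converters={
        "id": str,  # keep leading zeros
        "amount": lambda v: float(v.replace("¥", "").replace(",", "")),
        "rate": lambda v: float(v.rstrip("%")) / 100,
    },
)

print(small)
```

Each converter is called once per cell with the raw string, so the columns arrive already typed and no second pass over the frame is needed.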

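Example 6: converting several columns

a = [['a1', '1.1', '1.2'], ['a2', '0.02', '0.03'], ['a3', '5', 'NG']]
df = pd.DataFrame(a, columns=['A1','A2','A3'])

df[['A2','A3']] = df[['A2','A3']].apply(pd.to_numeric)# raises ValueError because 'NG' cannot be parsed
df.apply(pd.to_numeric, errors='ignore')# converts the columns it can and leaves the rest unchanged (errors='ignore' is deprecated since pandas 2.2)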
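
Two ways around the failing call, sketched on the same data. `to_numeric_or_keep` is a hypothetical helper written for this example; it mimics the "convert if possible" behaviour without relying on `errors='ignore'`.

```python
import pandas as pd

a = [["a1", "1.1", "1.2"], ["a2", "0.02", "0.03"], ["a3", "5", "NG"]]
df = pd.DataFrame(a, columns=["A1", "A2", "A3"])

# Option 1: coerce, so unparseable cells become NaN
coerced = df[["A2", "A3"]].apply(pd.to_numeric, errors="coerce")

# Option 2: convert a column only if every value parses
def to_numeric_or_keep(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

kept = df.apply(to_numeric_or_keep)

print(coerced)
print(kept.dtypes)
```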

Example 7: automatic type inference with infer_objects()

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['3','2','1']}, dtype='object')
df.dtypes
a object
b object
dtype: object

df = df.infer_objects()# changes column 'a' to int64
# column 'b' holds strings, not integers, so it stays object
df.dtypes
a int64
b object
dtype: object

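3. Things to watch for during conversion

3.1. An int column with missing values is converted to float
How pandas promotes dtypes when missing values are introduced:
Original dtype    dtype after promotion
integer           float
boolean           object
float             no cast
object            no cast

The dtype has to be converted first, and then NaN is used to represent the missing values.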
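
The promotion rules in the table can be observed directly. The sketch also shows the nullable `Int64` extension dtype (available since pandas 0.24) as one way to keep integers alongside missing values; that dtype is not part of the original discussion and is included only as a pointer.

```python
import pandas as pd

ints = pd.Series([1, 2, None])           # integer + missing -> float64
bools = pd.Series([True, False, None])   # boolean + missing -> object

# Nullable integer dtype: values stay integers, the gap is <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")

print(ints.dtype, bools.dtype, nullable.dtype)
```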

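3.2. Numeric columns containing the empty string ''

# How null values appear in MySQL, Python, and pandas:
          null string    empty string    null numeric value
MySQL     Null           ''              Null
Python    None           ''              None
pandas    None           ''              NaN

A null string and an empty string look the same once written to CSV, so they cannot be told apart when the file is read back.
If later processing needs to treat the two cases differently, a single write/read round trip distorts the data.
It is advisable to reserve one unique string to stand for None.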
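
The round-trip problem and the sentinel workaround can be sketched as follows. The marker `__NONE__` is an arbitrary, hypothetical choice; any string that cannot occur in the real data would do.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "name": ["x", "", None]})

# Plain round trip: '' and None are both written as an empty field
plain = pd.read_csv(StringIO(df.to_csv(index=False)))
plain_missing = plain["name"].isna().tolist()

# Sentinel round trip: mark None before writing, keep '' as '' when reading
SENTINEL = "__NONE__"  # hypothetical marker
text = df.fillna(SENTINEL).to_csv(index=False)
back = pd.read_csv(StringIO(text), keep_default_na=False)
restored = [None if v == SENTINEL else v for v in back["name"]]

print(plain_missing)
print(restored)
```

`keep_default_na=False` stops read_csv from turning the empty field into NaN, so the empty string survives and only the sentinel is mapped back to None.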

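3.3. Numeric strings

When a column holds numeric strings, pd.read_csv infers it as a numeric type after loading it into a DataFrame.
Specify the dtype parameter when reading the CSV file to keep the column as strings.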
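
A minimal sketch of the difference, using an invented `code` column whose leading zeros matter:

```python
from io import StringIO

import pandas as pd

raw = "code,qty\n00123,5\n00456,7\n"

# Default inference: the codes become integers and lose their zeros
inferred = pd.read_csv(StringIO(raw))

# Explicit dtype: the codes stay strings
kept = pd.read_csv(StringIO(raw), dtype={"code": str})

print(inferred["code"].tolist(), kept["code"].tolist())
```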

4. Functions

DataFrame.select_dtypes(include=None, exclude=None)# select columns by dtype
Parameters:
include, exclude : list-like (the dtypes to include or exclude)
Returns: DataFrame
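
A brief usage sketch for select_dtypes. The frame and its column names are made up; `'number'` selects every numeric dtype and does not include booleans.

```python
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2],
    "f": [1.0, 2.0],
    "s": ["a", "b"],
    "b": [True, False],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
bool_cols = df.select_dtypes(include="bool").columns.tolist()
other_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, bool_cols, other_cols)
```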

Series.astype(dtype, copy=True, errors='raise', **kwargs)# convert the dtype of a column
DataFrame.astype(dtype, copy=True, errors='raise', **kwargs)

Parameters:
dtype : data type, or dict of column name -> data type
errors : {'raise', 'ignore'}, default 'raise' # 'ignore' suppresses the exception and returns the original object; astype does not accept 'coerce'
kwargs : keyword arguments to pass on to the constructor
Returns: type of caller

Index.astype(dtype, copy=True)

Parameters:
dtype : numpy dtype or pandas type
copy : bool, default True

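Other conversion functions:
to_numeric() (conversion to numeric dtypes)
to_datetime() (conversion to datetime objects)
to_timedelta() (conversion to timedelta objects)
Parameter errors: 'raise', 'ignore', 'coerce'
# 'raise' (the default) raises on values that cannot be parsed; 'ignore' returns the input unchanged; 'coerce' turns them into NaN (NaT for datetimes and timedeltas)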
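
The effect of the errors parameter on the three helpers, sketched with deliberately bad inputs (`'x'`, `'not a date'`, and `'bad'` are placeholders):

```python
import pandas as pd

# Default behaviour: an unparseable value raises
try:
    pd.to_numeric(pd.Series(["1", "x"]))
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable values become NaN or NaT
nums = pd.to_numeric(pd.Series(["1", "x"]), errors="coerce")
dates = pd.to_datetime(pd.Series(["2019-01-10", "not a date"]), errors="coerce")
deltas = pd.to_timedelta(pd.Series(["1 days", "bad"]), errors="coerce")

print(raised)
print(nums.tolist(), dates.tolist(), deltas.tolist())
```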

48 Converting between Python, NumPy, and pandas data, and converting data types (summary) (tcy)
