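train_test_split errors caused by data types when splitting a dataset: checking column dtypes in pandas, forcing a dtype change, and converting a single column or several columns at once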

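Example, taken from the data preparation step before training a decision tree. train_test_split is used as follows (file_path and feature_name are placeholders defined elsewhere):
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv(file_path)
    df_p = df[df['_c1'] == 1].head(389081)
    df_n = df[df['_c1'] == 2].head(389081)
    frames = [df_p, df_n]
    df = pd.concat(frames)
    df = df.fillna(0)
    # df = pd.to_numeric(df, errors='coerce')  # to numeric, unparseable values become NaN (fails here: to_numeric only accepts 1-D input)
    target = df['_c1'].to_numpy()  # convert one column to np.ndarray (as_matrix() was removed in pandas 1.0)
    target = pd.to_numeric(target, errors='coerce')  # parses values as numbers; this is not a plain dtype cast
    data = df[feature_name].astype(float).to_numpy()  # forced dtype conversion with astype()
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=0)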
Notes:
pandas.DataFrame.as_matrix returns a NumPy array. It was deprecated in pandas 0.23.0 and removed in 1.0; use DataFrame.to_numpy() or DataFrame.values instead. From the old documentation:
DataFrame.as_matrix(columns=None)
Convert the frame to its Numpy-array representation.
Parameters:
columns : list, optional, default: None
   If None, return all columns, otherwise, returns specified columns.
Returns:
values : ndarray
   If the caller is heterogeneous and contains booleans or objects, the result will be of dtype=object. See Notes.
Return is NOT a Numpy-matrix, rather, a Numpy-array.
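Before calling train_test_split it can help to check which feature columns are not numeric. The following is a minimal sketch, not the original pipeline: the small DataFrame and the column names f1 and f2 are made up for illustration, while _c1 and feature_name follow the example above.

```python
import pandas as pd

# Hypothetical toy data: f1 arrives as strings and contains one bad value
df = pd.DataFrame({
    '_c1': [1, 2, 1, 2],
    'f1': ['0.5', '1.5', 'bad', '2.0'],
    'f2': [1, 2, 3, 4],
})
feature_name = ['f1', 'f2']

# Find the feature columns whose dtype is not numeric
non_numeric = df[feature_name].select_dtypes(exclude='number').columns.tolist()

# Parse those columns as numbers; unparseable values become NaN
df[non_numeric] = df[non_numeric].apply(pd.to_numeric, errors='coerce')
df = df.fillna(0)

# Plain float ndarray for the features, ndarray for the labels
data = df[feature_name].to_numpy(dtype=float)
target = df['_c1'].to_numpy()
```

data and target can then be passed to train_test_split in place of the raw columns.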
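**Knowledge point 1: Numpy-matrix vs. Numpy-array: a matrix is always two-dimensional, an array can have any number of dimensions**
Learn indexing (retrieving a subset of elements)
Learn how to get the dimensions of a matrix/array (a matrix is 2-D only, an array can be N-D)
Initialization
Matrix operations: transpose, matrix product, element-wise product, dot product, rank, inverse, and so on
Matrices and arrays in NumPy
  NumPy has two basic data types: the array and the matrix. Both hold elements of a single type. Note that the NumPy documentation no longer recommends np.matrix for new code; regular arrays cover the same operations.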
    Below is a test program:

        # coding:utf-8
        import numpy as np
        # print(dir(np))
        M = 3
        #---------------------------Matrix---------------------------
        A = np.matrix(np.random.rand(M,M))  # random matrix
        print('Original matrix:',A)      # matrix A
        print('Shape of A:',A.shape)  # get the matrix size
        print('Transpose of A:',A.T)   # transpose of A
        print('sum=',np.sum(A,axis=1)) # sum across each row
        print('sorted=',np.sort(A,axis=1)) # sort within each row
        print('sin(A[0])=',np.sin(A[0]))  # sine of the elements in the first row
        print('A*A.T=',A*A.T)  # matrix product A*A.T
        print('A.*A=',np.multiply(A,A)) # element-wise product
        print('mean(A)=',np.mean(A)) # mean of all elements; np.mean(A,axis=1) also works
        print('Rank(A)=',np.linalg.matrix_rank(A)) # rank of the matrix
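The axis argument used in np.sum and np.sort above is easy to misread. A small sketch on a fixed array (the values are chosen only for illustration) shows what each axis does:

```python
import numpy as np

A = np.array([[3, 1, 2],
              [0, 7, 5]])

row_sums = np.sum(A, axis=1)      # one sum per row
col_sums = np.sum(A, axis=0)      # one sum per column

sorted_rows = np.sort(A, axis=1)  # each row sorted left to right
sorted_cols = np.sort(A, axis=0)  # each column sorted top to bottom

print(row_sums, col_sums)
print(sorted_rows)
print(sorted_cols)
```

With axis=1 the operation runs along each row; with axis=0 it runs down each column.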

        #--------------------------Array-----------------------------#
        B = np.array(np.random.randn(2,M,M)) # an array can have more than two dimensions (3-D here)
        print('B =',B)  # original array
        print('Size(B)= [',B.shape[0],B.shape[1],B.shape[2],']; ndim(B)=',B.ndim)
        print('B[0]=',B[0]) # first slice along axis 0
        Position = np.where(B[0]<0) # np.where is used the same way as find (as in MATLAB)
        print('Positions where B[0]<0:',Position[0],'(row indices);',Position[1],'(column indices)')
        print('B[0][condition]=',B[0][B[0]<0])  # elements of B[0] that satisfy the condition
        print('Dot(B[0][0],B[0][0])=',np.dot(B[0][0],B[0][0])) # inner product, because both operands are 1-D vectors

Knowledge points:

(1) dir, shape, T, sum, sort, *, multiply, mean, linalg;

(2) creating an array, dot, where, boolean indexing
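The practical difference between the two types shows up in the * operator and in indexing. This is a minimal sketch with fixed values chosen for illustration, assuming a NumPy version that still ships np.matrix:

```python
import numpy as np

A = np.matrix([[1, 2], [3, 4]])   # always 2-D
B = np.array([[1, 2], [3, 4]])    # can be N-D

matrix_product = A * A            # for np.matrix, * is the matrix product
elementwise = B * B               # for ndarray, * is element-wise
array_matmul = B @ B              # matrix product for ndarray

row_of_matrix = A[0]              # stays 2-D: shape (1, 2)
row_of_array = B[0]               # drops a dimension: shape (2,)

print(matrix_product)
print(elementwise)
print(row_of_matrix.shape, row_of_array.shape)
```

Using plain arrays with @ gives the same matrix product without the special behavior of np.matrix.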
**Knowledge point 2: type conversion with pd.to_numeric(target, errors='coerce'), which turns invalid values into NaN**
Start with a very simple example:

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

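How can the columns be converted to appropriate types? In the example above, how can columns 2 and 3 be turned into floats? Is there a way to specify the types while building the DataFrame, or should the DataFrame be created first and the type of each column changed afterwards? Ideally this should be done dynamically, because there may be hundreds of columns and spelling out the type of each one is too tedious. Assume that every column holds values of a single type.

Solution

The available approaches are listed briefly below.
When creating the DataFrame

When creating a DataFrame, the type can be given directly through the dtype parameter. In the DataFrame constructor dtype applies to the whole frame, so this only helps when every column can take that type; read_csv accepts a per-column mapping instead: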


df = pd.DataFrame(a, dtype='float')  # example 1
df = pd.DataFrame(data=d, dtype=np.int8)  # example 2
df = pd.read_csv("somefile.csv", dtype = {'column_name' : str})
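To see what the dtype argument of read_csv changes, here is a small sketch that reads the same text twice. The CSV content and the column names id and price are invented for the example, and io.StringIO stands in for a real file:

```python
import io
import pandas as pd

csv_text = "id,price\n001,1.5\n002,2\n"

# Without dtype, pandas infers the types: id becomes an integer column
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtype, id is kept as text and price is read as float
typed_df = pd.read_csv(io.StringIO(csv_text), dtype={'id': str, 'price': float})

print(default_df['id'].tolist())
print(typed_df['id'].tolist())
```

Reading id as str keeps the leading zeros that type inference would otherwise drop.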

For a single column or Series

Here is a Series of strings; its dtype is object:

>>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
>>> s
0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

**Use to_numeric to convert to numbers. By default it cannot handle a non-numeric string such as 'pandas':**

>>> pd.to_numeric(s) # or pd.to_numeric(s, errors='raise')
ValueError: Unable to parse string

Invalid values can instead be coerced to NaN:

>>> pd.to_numeric(s, errors='coerce')
0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

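The third option is to skip the conversion when an invalid value is found (errors='ignore' is deprecated as of pandas 2.2; catching the exception explicitly is the suggested replacement):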

>>> pd.to_numeric(s, errors='ignore')
# the original Series is returned untouched
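After coercing, it is often useful to see which original values failed to parse. A minimal sketch using the same Series s as above (the variable names converted, invalid and cleaned are just for this example):

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

# Unparseable entries become NaN
converted = pd.to_numeric(s, errors='coerce')

# Entries that were present in the input but could not be parsed
invalid = s[converted.isna() & s.notna()]

# Keep only the rows that parsed successfully
cleaned = converted.dropna()

print(invalid.tolist())
print(cleaned.tolist())
```

Whether to drop such rows or fill them (for example with fillna(0)) depends on the data.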

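**For multiple columns or an entire DataFrame**

**Handling each column one by one is tedious when the operation has to be applied to many columns, so use DataFrame.apply to process every column.**

Given a DataFrame: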

>>> a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
>>> df = pd.DataFrame(a, columns=['col1','col2','col3'])
>>> df
  col1 col2  col3
0    a  1.2   4.2
1    b   70  0.03
2    x    5     0

Then you can write:

df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)

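After that, 'col2' and 'col3' have dtype float64, as desired.

However, you may not know which columns can reliably be converted to a numeric type. In that case, set the errors parameter: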

df.apply(pd.to_numeric, errors='ignore')

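The function is then applied to the whole DataFrame: columns that can be converted to a numeric type are converted, and columns that cannot (for example, because they contain non-numeric strings or dates) are left as they are.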
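Since errors='ignore' is deprecated as of pandas 2.2, the same "convert what can be converted" behavior can be sketched with a small wrapper. The helper name to_numeric_if_possible is hypothetical, not a pandas function:

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return the column parsed as numbers, or unchanged if parsing fails."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])

# Apply the wrapper column by column
converted = df.apply(to_numeric_if_possible)
print(converted.dtypes)
```

Here col2 and col3 become numeric while col1 is returned untouched.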

In addition, pd.to_datetime and pd.to_timedelta convert data to datetimes and timedeltas.
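Both functions accept errors='coerce' in the same way as to_numeric. A short sketch with made-up input strings:

```python
import pandas as pd

raw_dates = pd.Series(['2020-01-01', 'not a date'])
# Unparseable dates become NaT (the datetime counterpart of NaN)
dates = pd.to_datetime(raw_dates, errors='coerce')

# Strings describing durations become Timedelta values
deltas = pd.to_timedelta(['1 days', '2 hours'])

print(dates)
print(deltas)
```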
Soft conversion: automatic type inference

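Version 0.21.0 introduced the infer_objects() method, which converts DataFrame columns of object dtype to a more specific type.

For example, create a DataFrame with two object-dtype columns, one holding integers and the other holding strings of integers: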

>>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
>>> df.dtypes
a    object
b    object
dtype: object

Then infer_objects() changes the type of column 'a' to int64:

>>> df = df.infer_objects()
>>> df.dtypes
a     int64
b    object
dtype: object

Because the values of 'b' are strings rather than integers, 'b' is left unchanged.
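infer_objects() does not parse strings, so a column like 'b' still needs an explicit conversion. A minimal sketch that combines it with pd.to_numeric on the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# Soft conversion: 'a' holds real integers, so it becomes an integer column
df = df.infer_objects()

# 'b' holds strings, so it has to be parsed explicitly
df['b'] = pd.to_numeric(df['b'])

print(df.dtypes)
```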
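Forced conversion with astype

To force columns to a given type, use astype, for example df.astype(float). df.astype(int) works the same way, provided every value is a valid integer.

For example: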

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]: 
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0

df.dtypes
Out[17]: 
one      object
two      object
three    object

df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes
Out[19]: 
one       object
two      float64
three    float64
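astype also accepts a mapping from column name to type, and it raises an error instead of coercing when a value does not fit. A small sketch on the same data (the flag int_cast_failed exists only to make the failure visible):

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

# Per-column target types in a single call
df = df.astype({'two': float, 'three': float})
print(df.dtypes)

# A string such as '1.2' is not a valid integer literal, so astype(int) fails
try:
    pd.Series(['1.2', '70'], dtype=object).astype(int)
    int_cast_failed = False
except ValueError:
    int_cast_failed = True
print(int_cast_failed)
```

When invalid values are expected, pd.to_numeric with errors='coerce' is the more forgiving choice.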