Python data processing - understanding and practical application of None/NULL/NaN


1. None, null and NaN in Python

Note: Python has no null; it has only None, which has a similar meaning.

1. None

1) The data type
None represents a null value. It is a special Python object, and its type is NoneType.
None is the only value of the NoneType data type. We cannot create any other NoneType object, but we can assign None to any variable.

type(None)  # NoneType: None is Python's special empty value. It is not the same as 0, because 0 is a meaningful value while None means "no value".
type('')    # str

2) Features:

  • None does not support arithmetic operations; for example, None + 1 raises TypeError.
  • Comparing None with a value of any other type using == always returns False; ordering comparisons such as < raise TypeError in Python 3.
  • None has its own data type, NoneType, of which it is the only value; no other NoneType object can be created.
  • None is not the same as 0, an empty list, or an empty string.
  • None can be assigned to any variable, and a variable holding None can later be assigned another value.
  • None has no attributes such as len or size. To test whether a variable is None, use the is operator directly (x is None).
None == 0      # False
None == ''     # False
None == False  # False
dir(None)      # returns the list of attributes and methods of None
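
The features listed above can be checked with a short sketch like the following. The variable names are only illustrative, and the TypeError behaviour for arithmetic and ordering comparisons assumes Python 3.

```python
# Identity check: the idiomatic way to test for None.
value = None
is_none = value is None            # True

# Equality with other "empty" values is False.
equal_to_zero = (None == 0)        # False
equal_to_empty = (None == '')      # False
equal_to_false = (None == False)   # False

# None is falsy, but it is still a distinct object from 0, '' and [].
truthiness = bool(None)            # False

# Arithmetic and ordering comparisons raise TypeError in Python 3.
errors = []
for label, operation in [('add', lambda: None + 1),
                         ('less_than', lambda: None < 1)]:
    try:
        operation()
    except TypeError:
        errors.append(label)

print(is_none, equal_to_zero, equal_to_empty, equal_to_false, truthiness, errors)
```

Using is rather than == for the None test avoids surprises with objects that define their own equality.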


3) As the return value of a function without a return statement

For every function defined without a return statement, Python implicitly returns None at the end. A return statement without a value (that is, only the return keyword itself) also returns None.

def fun1():
    print('test')       # no return statement, so the function returns None

result = fun1()
print(result)           # None
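
As a small additional sketch, the two cases (no return statement, and a bare return) can be compared side by side. The function names no_return and bare_return are hypothetical and are not part of the original example.

```python
def no_return():
    pass                      # falls off the end: returns None


def bare_return(flag):
    if flag:
        return                # bare return: returns None
    return 'done'


results = (no_return(), bare_return(True), bare_return(False))
print(results)
```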


2. NaN

1) When NumPy or pandas is used to process data, entries that contain no data are automatically converted to NaN.

import pandas as pd
df=pd.read_csv('F:\\python_test\\demo.csv',header=None)
print(df)

[Figures omitted: the original data and the printed DataFrame.]
2) Features

  • NaN cannot be meaningfully compared with any value: every ordering comparison involving NaN returns False.
  • It is not equal to any value, including itself.
  • Its type is float, and arithmetic such as addition with any value yields NaN.
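import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 2, 3],
                      [2, 3, 4],
                      [3, 4, np.nan]],
                     index=list('abc'), columns=list('ABC'))
num = frame.iloc[2, 2]   # the NaN cell (row 'c', column 'C')
result = num + 2         # arithmetic with NaN yields NaN
result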
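
Because NaN is not equal to itself, an == test cannot detect it. A minimal sketch of the usual checks, using math.isnan, np.isnan and pd.isna (the variable names are illustrative only):

```python
import math

import numpy as np
import pandas as pd

nan = float('nan')

self_equal = (nan == nan)          # False: NaN is not equal to itself
greater = (nan > 0)                # False
less = (nan < 0)                   # False

# Reliable scalar checks.
checks = (math.isnan(nan), bool(np.isnan(np.nan)), bool(pd.isna(np.nan)))

# pd.isna also treats None as missing; math.isnan(None) would raise TypeError.
none_is_missing = bool(pd.isna(None))

# Element-wise detection on a Series.
s = pd.Series([1.0, np.nan, 3.0])
mask = s.isna().tolist()

print(self_equal, greater, less, checks, none_is_missing, mask)
```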



2. Practical application

1. Use read_sql to read null data and display NaN

After using Hive for data cleaning and feature processing, use Python to read the prepared tables from the Hive database. Generally, pyhive is used to connect to the database, and pandas reads the data.

How pandas reads a table from the Hive database

1. If the table field is a string, the dtype of the column in pandas is object. A NULL value in the field is read as the string 'NULL', and an empty value '' is read as the empty string ''.
2. If the table field is an int, the dtype of the column in pandas is float64. A NULL or empty value in the field is read as NaN.


If a field of the database table is a string and contains NULL, the NULL usually stands for an empty or abnormal value. When the table is read with pd.DataFrame or read_sql(), these values arrive as the string 'NULL' by default. Numerical processing then fails in many places, for example with ValueError: could not convert string to float: 'NULL' when converting the data type or when filling in missing values. It is therefore more convenient to convert 'NULL' to NaN first.
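
The failure and two possible fixes can be sketched on a small, made-up column. The DataFrame below is a hypothetical stand-in for a string field read from Hive, not the real table.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a string column read from Hive.
df = pd.DataFrame({'age': ['23', 'NULL', '41', '']})

# Direct conversion fails because 'NULL' is an ordinary string.
try:
    df['age'].astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Option 1: replace the placeholder strings with NaN, then convert.
age_replaced = df['age'].replace({'NULL': np.nan, '': np.nan}).astype(float)

# Option 2: let pd.to_numeric coerce anything non-numeric to NaN.
age_coerced = pd.to_numeric(df['age'], errors='coerce')

print(conversion_failed)
print(age_replaced.tolist())
print(age_coerced.tolist())
```

Option 2 is shorter, but it also turns any other unparseable string into NaN, so option 1 may be preferable when unexpected values should raise an error.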

from pyhive import hive
import pandas as pd
import numpy as np

# Missing-value statistics: non-missing count, missing count and missing rate per column
def na_count(data):
    data_count = data.count()
    missing_count = len(data) - data_count
    missing_rate = missing_count / len(data)
    na_result = pd.concat([data_count, missing_count, missing_rate], axis=1)
    return na_result


# Column names; they must match the SELECT list in number and order
columns = ['is_vice_card',
           'online_days',
           'calling_cnt',
           'age',
           'payment_type',
           'star_level_int',
           'cert_cnt',
           'channel_nbr',
           'payment_method_variable',
           'package_price_group']

sql = '''select is_vice_card,online_days,calling_cnt,age,payment_type
,star_level_int,cert_cnt,channel_nbr,payment_method_variable,package_price_group
 from database.v1_6_501_train_test'''

con = hive.connect(host='b1m6.hdp.dc', port=10000, auth='KERBEROS', kerberos_service_name="hive")
cursor = con.cursor()
cursor.execute(sql)  # run the query
result = cursor.fetchall()
data_pos_1 = pd.DataFrame(result, columns=columns)

print("Missing values in data_pos_1 before replacing 'NULL' with np.nan:\n", na_count(data_pos_1))

# Convert the string 'NULL' in the string fields to NaN
null_columns = ['is_vice_card', 'online_days', 'age', 'payment_type',
                'star_level_int', 'cert_cnt', 'channel_nbr',
                'payment_method_variable', 'package_price_group']
for col in null_columns:
    data_pos_1.loc[data_pos_1[col] == 'NULL', col] = np.nan

print("Missing values in data_pos_1 after replacing 'NULL' with np.nan:\n", na_count(data_pos_1))

Running the code gives the following result:

Missing values in data_pos_1 before replacing 'NULL' with np.nan:
                             0  1    2
is_vice_card             8289  0  0.0
online_days              8289  0  0.0
calling_cnt              8289  0  0.0
age                      8289  0  0.0
payment_type             8289  0  0.0
star_level_int           8289  0  0.0
cert_cnt                 8289  0  0.0
channel_nbr              8289  0  0.0
payment_method_variable  8289  0  0.0
package_price_group      8289  0  0.0

Missing values in data_pos_1 after replacing 'NULL' with np.nan:
                             0     1         2
is_vice_card             7854   435  0.052479
online_days              7854   435  0.052479
calling_cnt              8289     0  0.000000
age                      7854   435  0.052479
payment_type             7854   435  0.052479
star_level_int           7830   459  0.055375
cert_cnt                 6134  2155  0.259983
channel_nbr              7847   442  0.053324
payment_method_variable  7890   399  0.048136
package_price_group      8289     0  0.000000
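
The same statistics can also be produced with DataFrame.replace and isna. The following is a sketch on a small, made-up sample; the sample data and the column labels of the summary are assumptions for illustration, not values from the real table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking string columns read from Hive.
sample = pd.DataFrame({
    'age': ['23', 'NULL', '41', 'NULL'],
    'payment_type': ['A', 'B', 'NULL', 'C'],
})

# Before replacement, pandas sees no missing values at all.
missing_before = int(sample.isna().sum().sum())

# Replace the placeholder string in every column at once.
cleaned = sample.replace('NULL', np.nan)

# Non-missing count, missing count and missing rate per column.
summary = pd.DataFrame({
    'non_missing': cleaned.count(),
    'missing': cleaned.isna().sum(),
    'missing_rate': cleaned.isna().mean(),
})
print(summary)
```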

Alternatively, define the field as int in the database table, so that NULL values are read as NaN directly.

2. Making read_csv display null data as the string 'NULL'

By default, read_csv reads null data as NaN:

import pandas as pd
data_pos=pd.read_csv(file_pos,encoding='utf-8')
data_pos.head(10)

To keep null data as the string 'NULL' instead:

import pandas as pd
data_pos=pd.read_csv(file_pos,encoding='utf-8', na_filter=False)
# or
data_pos=pd.read_csv(file_pos,encoding='utf-8', keep_default_na=False)

# Check the rows of data_pos where online_days is the string 'NULL'
data_pos[data_pos['online_days']=='NULL'].head(10)
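
The difference between the default behaviour and the two options can be reproduced without a file. In this sketch the CSV text is made up and stands in for the file at file_pos.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the file at file_pos.
csv_text = "online_days,age\n120,NULL\nNULL,35\n"

# Default: 'NULL' is one of the strings pandas recognises as missing.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False: 'NULL' is kept as a literal string.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# na_filter=False: missing-value detection is switched off entirely.
unfiltered_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)

default_missing = int(default_df.isna().sum().sum())
raw_null_rows = int((raw_df['online_days'] == 'NULL').sum())
unfiltered_null_rows = int((unfiltered_df['age'] == 'NULL').sum())

print(default_missing, raw_null_rows, unfiltered_null_rows)
```

With keep_default_na=False it is still possible to pass an explicit na_values list, which is useful when only some placeholder strings should count as missing.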



Origin blog.csdn.net/sodaloveer/article/details/129791858