Data processing - MISSING VALUES

Missing data includes both missing records and missing fields within a record. Either kind affects the analysis and makes the results noticeably less certain.
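
Before choosing a treatment, it usually helps to measure how much is actually missing. The following is a minimal sketch; the DataFrame `demo` and its values are hypothetical and not taken from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
demo = pd.DataFrame({
    'value1': [12, 33, np.nan, 23, np.nan],
    'value2': ['A', np.nan, 'C', 'D', 'E'],
})

missing_per_column = demo.isnull().sum()           # number of missing fields per column
missing_share = demo.isnull().mean() * 100         # percentage of missing fields per column
rows_with_gaps = demo.isnull().any(axis=1).sum()   # records with at least one missing field

print(missing_per_column)
print(missing_share)
print('records with at least one missing field:', rows_with_gaps)
```

A high share in one column may argue for imputing or dropping that column, while a few scattered gaps may be cheaper to handle by deleting records.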

Ways to handle missing values: delete the records / impute (interpolate) the data / leave them untreated

1. Delete records

Detecting missing values - isnull, notnull

isnull: True for a missing value, False otherwise

notnull: False for a missing value, True otherwise

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Jupyter notebook magic; omit this line in a plain script
%matplotlib inline

s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['A', 'B', 'C', 'D', 'E', np.nan, np.nan, 'F', 'G', np.nan, 'G']})
# Create data

print(s.isnull())               # Series: returns a boolean Series
print(df.notnull())             # DataFrame: returns a boolean DataFrame
print(df['value1'].notnull())   # a single column selected by label: returns a boolean Series
print('------')

s2 = s[s.isnull() == False]
df2 = df[df['value2'].notnull()]   # note the difference from df[df['value2'].notnull()]['value1']
print(s2)
print(df2)
# Keep only the non-missing values
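
The note about `df[df['value2'].notnull()]['value1']` is easier to see side by side. In this small sketch (the DataFrame `demo` is hypothetical), the first expression keeps whole rows, while the second keeps only the `value1` column of those rows, which can itself still contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
demo = pd.DataFrame({
    'value1': [12, np.nan, 45, 23],
    'value2': ['A', 'B', np.nan, 'D'],
})

rows = demo[demo['value2'].notnull()]            # DataFrame: rows where value2 is present
col = demo[demo['value2'].notnull()]['value1']   # Series: value1 of those same rows

print(type(rows).__name__, rows.shape)
print(type(col).__name__, col.shape)
print(col.isnull().sum())   # value1 may still be missing in the rows that were kept
```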

 

 

 

Deleting missing values - dropna

s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['A', 'B', 'C', 'D', 'E', np.nan, np.nan, 'F', 'G', np.nan, 'G']})
# Create data

s.dropna(inplace=True)
df2 = df['value1'].dropna()
print(s)
print(df2)
# dropna can be called directly on a Series or a DataFrame
# Note the inplace parameter: the default, False, returns a new object instead of modifying the original
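
On a DataFrame, `dropna` also lets you decide which rows count as "missing". The sketch below uses a hypothetical `demo` frame to compare the `how`, `subset`, and `thresh` parameters; it is an illustration, not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
demo = pd.DataFrame({
    'value1': [12, np.nan, 45, np.nan],
    'value2': ['A', 'B', np.nan, np.nan],
})

any_missing = demo.dropna()                   # drop rows with at least one NaN (how='any' is the default)
all_missing = demo.dropna(how='all')          # drop rows only when every field is NaN
by_column = demo.dropna(subset=['value1'])    # only value1 decides whether a row is dropped
at_least_one = demo.dropna(thresh=1)          # keep rows with at least 1 non-missing value

print(len(any_missing), len(all_missing), len(by_column), len(at_least_one))
```

None of these calls changes `demo` itself, because `inplace` is left at its default of False.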










Filling / replacing missing data - fillna, replace

s = pd.Series([12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99])
df = pd.DataFrame({'value1': [12, 33, 45, 23, np.nan, np.nan, 66, 54, np.nan, 99, 190],
                   'value2': ['A', 'B', 'C', 'D', 'E', np.nan, np.nan, 'F', 'G', np.nan, 'G']})
# Create data

s.fillna(0, inplace=True)
print(s)
print('------')
# s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
# (signature as quoted in the original post; it differs between pandas versions)
# value: the fill value
# Note the inplace parameter

df['value1'] = df['value1'].ffill()
print(df)
print('------')
# The original used fillna(method='pad', inplace=True); the method argument is deprecated
# in recent pandas, so the equivalent ffill() is used here and assigned back to the column
# pad / ffill → fill with the previous value
# backfill / bfill → fill with the next value

s = pd.Series([1, 1, 1, 1, 2, 2, 2, 3, 4, 5, np.nan, np.nan, 66, 54, np.nan, 99])
s.replace(np.nan, 'missing data', inplace=True)
print(s)
print('------')
# df.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)
# (signature as quoted in the original post; it differs between pandas versions)
# to_replace → the value to be replaced
# value → the replacement value

s.replace([1, 2, 3], np.nan, inplace=True)
print(s)
# Replace several values with np.nan at once
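
Two further options are often useful with filling: a dict passed as `value` gives each column its own fill value, and `limit` caps how many consecutive gaps are filled. A minimal sketch, assuming a hypothetical `demo` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
demo = pd.DataFrame({
    'value1': [12, np.nan, np.nan, np.nan, 99],
    'value2': ['A', np.nan, 'C', np.nan, 'E'],
})

# A dict gives each column its own fill value
per_column = demo.fillna({'value1': 0, 'value2': 'unknown'})

# limit=1 forward-fills at most one consecutive NaN after each known value
capped = demo['value1'].ffill(limit=1)

print(per_column)
print(capped)
```

With `limit=1`, only the first gap after 12 is filled; the remaining two stay NaN, which makes long runs of missing data visible instead of silently smoothing them over.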















2. Interpolating missing values

Several approaches: mean / median / mode imputation, nearest-value imputation, interpolation

(1) Mean / median / mode imputation

s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan, 6, 6, 7, 12, 2, np.nan, 3, 4])
#print(s)
print('------')
# Create data

u = s.mean()       # mean
me = s.median()    # median
mod = s.mode()     # mode (returned as a Series, since there can be more than one)
print('mean: %.2f, median: %.2f' % (u, me))
print('mode:', mod.tolist())
print('------')
# Compute the mean / median / mode

s.fillna(u, inplace=True)
print(s)
# Fill with the mean
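
Filling with the median or the mode works the same way as filling with the mean. The sketch below reuses the sample values from this section; the variable names `by_median` and `by_mode` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan,
               6, 6, 7, 12, 2, np.nan, 3, 4])

by_median = s.fillna(s.median())   # less sensitive to outliers such as 12 than the mean
by_mode = s.fillna(s.mode()[0])    # mode() returns a Series; take its first entry

print(s.median(), s.mode().tolist())
print(by_median.isnull().sum(), by_mode.isnull().sum())
```

Which statistic is appropriate depends on the data: the mean suits roughly symmetric numeric data, the median suits skewed data, and the mode is the usual choice for categorical values.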















 

 

(2) Nearest-value imputation

s = pd.Series([1, 2, 3, np.nan, 3, 4, 5, 5, 5, 5, np.nan, np.nan, 6, 6, 7, 12, 2, np.nan, 3, 4])
#print(s)
print('------')
# Create data

s.ffill(inplace=True)   # the original used fillna(method='ffill'), which is deprecated in recent pandas
print(s)
# Fill each gap with the value before it
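
Forward filling cannot fill a gap at the very start of the data, and backward filling cannot fill one at the very end. A minimal sketch with a hypothetical series shows both cases, plus one possible way to combine them:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series([np.nan, 2, np.nan, np.nan, 5, np.nan])

forward = s.ffill()        # previous value; the leading NaN has no predecessor
backward = s.bfill()       # next value; the trailing NaN has no successor
both = s.ffill().bfill()   # forward first, then backward for whatever is left

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```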






 

 

(3) Interpolation - Lagrange polynomial

 

 

from scipy.interpolate import lagrange
x = [3, 6, 9]
y = [10, 8, 4]
print(lagrange(x, y))
print(type(lagrange(x, y)))
# The output lists the n coefficients of the polynomial
# Here the three output values are a0, a1, a2
# y = a0 * x**2 + a1 * x + a2 → y = -0.11111111 * x**2 + 0.33333333 * x + 10

print('Interpolated value at 10: %.2f' % lagrange(x, y)(10))
print('------')
# -0.11111111 * 100 + 0.33333333 * 10 + 10 = -11.11111111 + 3.33333333 + 10 = 2.22222222
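
To see where these numbers come from, the Lagrange polynomial can also be evaluated by hand as a weighted sum of basis polynomials, L(x) = sum_j y_j * l_j(x) with l_j(x) = prod_{i != j} (x - x_i) / (x_j - x_i). The helper `lagrange_eval` below is a hypothetical, plain-Python sketch of that formula, not part of scipy.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for i, xi in enumerate(xs):
            if i != j:
                basis *= (x - xi) / (xj - xi)   # one factor of the basis polynomial l_j(x)
        total += yj * basis
    return total

xs = [3, 6, 9]
ys = [10, 8, 4]
value_at_10 = lagrange_eval(xs, ys, 10)
print('%.2f' % value_at_10)   # 2.22
```

At each known point the polynomial reproduces the known value exactly; for example, `lagrange_eval(xs, ys, 6)` gives 8.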










(3) Interpolation - Lagrange polynomial, practical application

data = pd.Series(np.random.rand(100) * 100)
data[[3, 6, 33, 56, 45, 66, 67, 80, 90]] = np.nan
print(data.head())
print('Total number of values: %i' % len(data))
print('------')
# Create data

data_na = data[data.isnull()]
print('Number of missing values: %i' % len(data_na))
print('Share of missing values: %.2f%%' % (len(data_na) / len(data) * 100))
# Count the missing values

data_c = data.fillna(data.median())   # fill missing values with the median
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
data.plot.box(ax=axes[0], grid=True, title='Data distribution')
data.plot(kind='kde', style='--r', ax=axes[1], grid=True, title='Missing values dropped', xlim=[-50, 150])
data_c.plot(kind='kde', style='--b', ax=axes[2], grid=True, title='Missing values filled with median', xlim=[-50, 150])
# Density plots with the missing values dropped and with the median filled in

def na_c(s, n, k=5):
    y = s.iloc[max(n - k, 0):n + k + 1]   # take the window around position n, clipped at the start
    y = y[y.notnull()]                    # exclude nulls
    return lagrange(list(y.index), list(y))(n)
# Interpolation function: because the data set is small, use the 5 values before and after
# the null (10 values in total) to interpolate

na_re = []
for i in range(len(data)):
    if pd.isnull(data[i]):
        value = na_c(data, i)
        data[i] = value
        print(value)
        na_re.append(value)
data.dropna(inplace=True)   # drop any missing values that remain after interpolation
data.plot(kind='kde', style='--k', ax=axes[3], grid=True, title='After Lagrange interpolation', xlim=[-50, 150])
print('finished!')




















# Missing values interpolated
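
High-degree Lagrange polynomials can swing well outside the range of the data, so it can be worth comparing the result against something simpler. The sketch below uses pandas' built-in `Series.interpolate` with linear interpolation; this is a different technique swapped in for comparison, not the Lagrange method above, and the sample series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Each gap is filled from the straight line between its nearest known neighbours
linear = s.interpolate(method='linear')
print(linear.tolist())
```

If the linear and Lagrange results differ sharply for a given gap, that is a hint to inspect that gap before trusting either value.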

 

 

 

 

 

 

Origin www.cnblogs.com/Lilwhat/p/12446883.html