[Study notes in plain language] Data preprocessing in Li Mu's "Dive into Deep Learning 2.0"

These are my notes on Li Mu's online course "Dive into Deep Learning 2.0".
Video address: https://zhuanlan.zhihu.com/p/29125290
Full textbook: https://zh-v2.d2l.ai/
Textbook for this course: https://zh-v2.d2l.ai/chapter_preliminaries/pandas.html
Note address: https://gitee.com/lhm8013609/ mldl_-learning-notes/tree/master/1%E3%80%81DL_Limu/Notes

2021.05.08 Data preprocessing learning

The os module: file and directory methods

As an example, we first create an artificial dataset and store it in a CSV (comma-separated values) file, ../data/house_tiny.csv. Data stored in other formats can be processed in a similar manner. The os.makedirs call below ensures that the directory ../data exists. In the textbook, the comment #@save is a special tag: functions, classes, or statements marked with it are saved in the d2l package so that they can be called directly later (for example, d2l.mkdir_if_not_exist(path)) without being redefined.

import os

# os.makedirs() creates directories recursively
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row is one data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')
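The effect of exist_ok=True in the os.makedirs call above can be tried safely in a temporary directory. This is only a minimal sketch: the nested path and the variable names are made up for illustration and are not part of the course code.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'data', 'nested')  # hypothetical nested path

    os.makedirs(target, exist_ok=True)  # creates 'data' and 'data/nested'
    os.makedirs(target, exist_ok=True)  # second call does nothing, no error

    created = os.path.isdir(target)

    # Without exist_ok=True, an already existing directory raises an error.
    try:
        os.makedirs(target)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

Passing exist_ok=True is what makes the cell safe to run repeatedly in a notebook.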

Note:

The os.makedirs() method creates directories recursively.
  • Syntax:
    os.makedirs(name, mode=0o777, exist_ok=False)
  • Parameters:
    name: the directory to create, together with any missing parent directories; it can be a relative or absolute path.
    mode: permission mode.
    exist_ok: if True, no error is raised when the directory already exists.
The os.path.join() function joins path components into a single path.
  • Path notation:
  • . is the current directory.
  • .. is the parent of the current directory.
  • ./name is a file or folder called name in the current directory.
  • ../name is a file or folder called name in the parent of the current directory.
  • os.path.join('..', 'data') represents the path ../data, so the data folder is created in the parent of the current directory.
  • If a component is an absolute path (it starts with "/"), all previous components are discarded and joining continues from that component.
  • If several components start with "/", joining therefore starts from the last of them.
  • A component starting with "./" is not special: it is simply appended after the previous component.
  • Note: Linux and Windows differ in the separator inserted between components. The outputs shown below were produced on Windows. See also: Complete tutorial on python path splicing os.path.join() function
  • os.path.join('..', 'data', 'house_tiny.csv') represents the file house_tiny.csv in the data folder of the parent directory.
with open(data_file, 'w') as f: opens the file for writing and closes it automatically when the block ends; f.write() writes a string to the file.

import os

# On Windows, the results match the rules described above
print("1:",os.path.join('aaaa','/bbbb','ccccc.txt'))
print("2:",os.path.join('/aaaa','/bbbb','/ccccc.txt'))
print("3:",os.path.join('aaaa','./bbb','ccccc.txt'))

Output:

1: /bbbb\ccccc.txt
2: /ccccc.txt
3: aaaa\./bbb\ccccc.txt
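To compare the Windows and Linux behaviour on a single machine, the standard library modules ntpath and posixpath expose the join rules of each platform directly (os.path is one of the two, depending on the operating system). The sketch below reuses the three calls from above; the variable names are only illustrative.

```python
import ntpath      # path rules used by os.path on Windows
import posixpath   # path rules used by os.path on Linux/macOS

parts_list = [
    ('aaaa', '/bbbb', 'ccccc.txt'),
    ('/aaaa', '/bbbb', '/ccccc.txt'),
    ('aaaa', './bbb', 'ccccc.txt'),
]

# Join the same components under both sets of rules.
win_results = [ntpath.join(*parts) for parts in parts_list]
posix_results = [posixpath.join(*parts) for parts in parts_list]

for win, posix in zip(win_results, posix_results):
    print('windows:', win, '| posix:', posix)
```

In both cases the join restarts at the last component that begins with "/"; only the separator that gets inserted differs.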

Read file:

import pandas as pd
data = pd.read_csv(data_file)
data

Output:

	NumRooms	Alley	Price
0	NaN	Pave	127500
1	2.0	NaN	106000
2	4.0	NaN	178100
3	NaN	NaN	140000
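The NaN entries above appear because pd.read_csv treats the string "NA" as a missing value by default. The following sketch reads the same text from memory (io.StringIO stands in for the file) and shows the effect of turning that default off; data_demo and raw_demo are illustrative names.

```python
import io
import pandas as pd

csv_text = (
    'NumRooms,Alley,Price\n'
    'NA,Pave,127500\n'
    '2,NA,106000\n'
    '4,NA,178100\n'
    'NA,NA,140000\n'
)

# Default behaviour: "NA" is parsed as a missing value (NaN).
data_demo = pd.read_csv(io.StringIO(csv_text))
missing_per_column = data_demo.isna().sum()
print(missing_per_column)

# keep_default_na=False keeps "NA" as an ordinary string instead.
raw_demo = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw_demo.loc[0, 'NumRooms'])
```

Because NumRooms contains NaN, pandas stores it as a floating-point column, which is why 2 and 4 are shown as 2.0 and 4.0.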

2.2.2. Handling missing values

Note that "NaN" entries represent missing values. Typical ways to handle missing data are imputation and deletion: imputation replaces missing values with substitute values, while deletion simply discards them. Here we will consider imputation.
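The two strategies can be contrasted on a tiny numeric table. This is a minimal sketch with made-up variable names, using only the NumRooms and Price values from the dataset above.

```python
import numpy as np
import pandas as pd

rooms = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'Price': [127500, 106000, 178100, 140000],
})

# Deletion: drop every row that contains a missing value.
deleted = rooms.dropna()

# Imputation: replace each missing value with the mean of its column.
imputed = rooms.fillna(rooms.mean())

print(deleted)
print(imputed)
```

Deletion keeps only two of the four rows here, whereas imputation keeps all rows at the cost of inventing a value for NumRooms.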

Using integer-position indexing (iloc), we split the data into inputs and outputs, where the former is the first two columns and the latter is the last column. For missing numeric values in the inputs, we replace the "NaN" entries with the mean of the same column.

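# iloc[rows, columns]: the part before the comma selects rows, the part after selects columns
# start:stop selects positions start to stop-1, so 0:2 below means columns 0 and 1
# 2 is the column at position 2, i.e. the last column
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]

# numeric_only=True skips the non-numeric Alley column (newer pandas versions raise an error otherwise)
inputs = inputs.fillna(inputs.mean(numeric_only=True))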
print(inputs)

Output:

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
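The positional slicing used to split inputs and outputs can be checked on a small table. The sketch below uses a hypothetical three-column DataFrame; the column names a, b, c are placeholders.

```python
import pandas as pd

table = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

first_two_cols = table.iloc[:, 0:2]  # columns at positions 0 and 1 (stop is excluded)
last_col = table.iloc[:, 2]          # single position: returns a Series
same_last = table.iloc[:, -1]        # negative positions count from the end
cell = table.iloc[1, 0]              # second row, first column

print(first_two_cols)
print(last_col)
print(cell)
```

Writing -1 instead of 2 selects the last column without having to know how many columns the table has.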

Note:

iloc: selects rows and columns by integer position (for example, data.iloc[1] fetches the second row).
fillna() and mean() functions
  • fillna function form: fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

  • Parameters:

  • value: the value used to fill the missing entries.

  • More parameters: see "pandas fills missing values NaN with mean value | fillna method analysis"

  • mean() computes the average; see "mean() function usage in python's numpy library"

  • The most commonly used parameter is axis. Taking an m × n matrix and NumPy's mean as an example:
    axis not set: averages all m × n numbers and returns a single number
    axis = 0: collapses the rows, averages each column, and returns n values
    axis = 1: collapses the columns, averages each row, and returns m values

  • DataFrame.mean(), as used in inputs.mean() above
    For a DataFrame, the default is axis=0: the result is the mean of each column (a Series), with NaN values skipped.
    For a single column (a Series), the result is one number.
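The axis behaviour described in these notes can be verified on a small 2 × 3 matrix. This is only an illustrative sketch; the matrix values and column labels are arbitrary.

```python
import numpy as np
import pandas as pd

A = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])

overall = A.mean()           # no axis: one number over all 6 entries
per_column = A.mean(axis=0)  # collapse rows: one mean per column (3 values)
per_row = A.mean(axis=1)     # collapse columns: one mean per row (2 values)

# A pandas DataFrame averages each column by default.
frame_means = pd.DataFrame(A, columns=['x', 'y', 'z']).mean()

print(overall, per_column, per_row)
print(frame_means)
```

Note the different defaults: NumPy averages everything unless axis is given, while a DataFrame averages column by column, which is what fillna needs.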

Tips for editing text and code in Jupyter Notebook:
  • In a Markdown cell, pressing Enter twice leaves an empty line, which starts a new paragraph
  • In command mode, press d twice (dd) to delete a cell

For categorical or discrete values in inputs, we treat "NaN" as a category. Since the "Alley" column only takes two types of categorical values, "Pave" and "NaN", pandas can automatically convert this column into two columns, "Alley_Pave" and "Alley_nan". A row whose alley type is "Pave" has "Alley_Pave" set to 1 and "Alley_nan" set to 0. A row with a missing alley type has "Alley_Pave" and "Alley_nan" set to 0 and 1, respectively.

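# dtype=int keeps the indicator columns as 0/1 (newer pandas versions default to True/False)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

Output:

   NumRooms  Alley_Pave  Alley_nan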
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1

Note:

pd.get_dummies
  • pd.get_dummies(inputs, dummy_na=True)

  • By default, each categorical (string) column is split into one indicator column per distinct value; numeric columns such as NumRooms are left unchanged. dummy_na=True adds one more indicator column for missing values (NaN).

  • The Alley column above has only two values, Pave and NaN, so it is split into two columns, Alley_Pave and Alley_nan. In each row, the column matching that row's value is 1 and the other is 0.

  • See also: Encoding categorical variables - pd.get_dummies(), LabelEncoder(), OneHotEncoder()
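The effect of dummy_na can be seen by encoding the Alley column with and without it. This is a small sketch with illustrative variable names; dtype=int is passed so that the indicators print as 0/1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

alley = pd.DataFrame({'Alley': ['Pave', np.nan, np.nan, np.nan]})

# With dummy_na=True, missing values get their own indicator column.
with_na = pd.get_dummies(alley, dummy_na=True, dtype=int)

# Without it, rows with a missing value are all zeros.
without_na = pd.get_dummies(alley, dtype=int)

print(with_na)
print(without_na)
```

Without dummy_na, a missing alley is encoded as all zeros, which is why the notes treat NaN as its own category.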

2021.05.11

2.2.3. Convert to tensor format

Now that all entries in inputs and outputs are of numeric type, they can be converted to tensor format. When the data is in tensor format, it can be further manipulated through those tensor functions introduced in Section 2.1.

import torch

x, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
x, y

Output:

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))

Note: inputs holds the number of rooms (NumRooms) and the alley type (Alley) columns; outputs holds the price.

2.2.4. Summary

Like many other extension packages in the vast Python ecosystem, pandas is compatible with tensors.

Imputation and deletion can be used to deal with missing data.

2.2.5. Exercise

Create an original dataset with more rows and columns.

Remove the column with the most missing values.

Convert the preprocessed dataset to tensor format.

Operation:

# 1. Create the original dataset
import os
p_datafile = os.path.join('..', 'data', 'house.csv')
with open(p_datafile, 'w') as f:
    f.write('NumRoos,Alley,Size,Garden,Price\n')
    f.write('NA,Pave,100,Yes,127500\n')
    f.write('2,NA,200,Yes,187500\n')
    f.write('3,NA,150,No,155500\n')
    f.write('NA,NA,90,NA,100500\n')
    f.write('4,Pave,120,Yes,137500\n')
import pandas as pd

data1 = pd.read_csv(p_datafile)
data1

Output:

NumRoos	Alley	Size	Garden	Price
0	NaN	Pave	100	Yes	127500
1	2.0	NaN	200	Yes	187500
2	3.0	NaN	150	No	155500
3	NaN	NaN	90	NaN	100500
4	4.0	Pave	120	Yes	137500

Note: handling missing values

  • df.isnull()  # returns True where a value is missing, otherwise False
  • df.isnull().sum()  # returns the number of missing values in each column
  • df.dropna()  # drops rows that contain any missing value
  • df.dropna(axis=1)  # drops columns that contain any missing value
  • df.dropna(how='all')  # drops only rows in which all values are missing
  • df.dropna(thresh=4)  # keeps only rows with at least 4 non-missing values
  • df.dropna(subset=['C'])  # drops rows with a missing value in column 'C'
  • dddf = ddf.dropna(subset=['jie_num'], axis=0)  # drops rows with a missing value in column 'jie_num'
  • datanota = AData[AData['marital'].notna()]  # keeps only rows whose 'marital' value is not missing
Parameters of df.dropna():

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

  • axis: 0 drops rows, 1 drops columns; default 0
  • how: {'any', 'all'}, default 'any'. 'any' drops a row (or column) containing any NaN; 'all' drops it only if all of its values are NaN
  • thresh: int; keeps only rows (or columns) with at least this many non-NaN values
  • subset: list of labels; only these columns are checked for missing values when dropping rows
  • inplace: bool; whether to modify the DataFrame itself instead of returning a new one
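The dropna options listed above can be compared on one small DataFrame. This is a minimal sketch with invented data; columns A, B, C and the variable names are placeholders.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [np.nan, np.nan, 6.0, np.nan],
    'C': [7.0, 8.0, 9.0, np.nan],
})

counts = df.isna().sum()             # missing values per column
any_missing = df.dropna()            # keep rows without any NaN
all_missing = df.dropna(how='all')   # drop only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)   # keep rows with >= 2 non-NaN values
by_subset = df.dropna(subset=['C'])  # only column C decides which rows go
no_nan_cols = df.dropna(axis=1)      # drop every column containing a NaN

print(counts)
print(at_least_two)
```

Here thresh=2 keeps rows 0 and 2, because thresh counts the values that are present, not the ones that are missing.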

End

data1.isna().sum()  # number of missing values in each column

Output:

NumRoos    2
Alley      3
Size       0
Garden     1
Price      0
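dtype: int64

# 2. Drop the column with the most missing values
# idxmax() returns the label of the column with the largest missing-value count
data1 = data1.drop(columns=data1.isna().sum().idxmax())
# dropna(thresh=k) is different: it keeps only columns with at least k non-NaN values
# data1.dropna(axis=1, thresh=3)
# Fill missing numeric values with the mean of each numeric column
data1 = data1.fillna(data1.mean(numeric_only=True))
data1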

Output:

	NumRoos	Size	Garden	Price
0	3.0	100	Yes	127500
1	2.0	200	Yes	187500
2	3.0	150	No	155500
3	3.0	90	NaN	100500
4	4.0	120	Yes	137500
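One way to pick the column with the most missing values, independent of how many rows the table has, is to take idxmax() of the per-column missing counts and drop that label. The sketch below uses a reduced, hypothetical version of the table; houses, worst_column and trimmed are illustrative names.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 3.0, np.nan, 4.0],
    'Alley': ['Pave', np.nan, np.nan, np.nan, 'Pave'],
    'Size': [100, 200, 150, 90, 120],
})

missing_counts = houses.isna().sum()    # NumRooms 2, Alley 3, Size 0
worst_column = missing_counts.idxmax()  # label of the largest count
trimmed = houses.drop(columns=worst_column)

print(worst_column)
print(trimmed.columns.tolist())
```

If several columns tie for the most missing values, idxmax() returns the first of them, so only one column is removed.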

input1, output1 = data1.iloc[:, 0:3], data1.iloc[:, 3]
input1, output1

Output:

(   NumRoos  Size Garden
 0      3.0   100    Yes
 1      2.0   200    Yes
 2      3.0   150     No
 3      3.0    90    NaN
 4      4.0   120    Yes,
 0    127500
 1    187500
 2    155500
 3    100500
 4    137500
 Name: Price, dtype: int64)

# Split Garden into 3 indicator columns by value; dtype=int keeps them as 0/1
input1 = pd.get_dummies(input1, dummy_na=True, dtype=int)
input1

Output:

NumRoos	Size	Garden_No	Garden_Yes	Garden_nan
0	3.0	100	0	1	0
1	2.0	200	0	1	0
2	3.0	150	1	0	0
3	3.0	90	0	0	1
4	4.0	120	0	1	0

# 3. Convert to tensor format
import torch

a, b = torch.tensor(input1.values), torch.tensor(output1.values)
a, b

Output:

(tensor([[  3., 100.,   0.,   1.,   0.],
         [  2., 200.,   0.,   1.,   0.],
         [  3., 150.,   1.,   0.,   0.],
         [  3.,  90.,   0.,   0.,   1.],
         [  4., 120.,   0.,   1.,   0.]], dtype=torch.float64),
 tensor([127500, 187500, 155500, 100500, 137500]))

The difference between torch.Tensor and torch.tensor

In PyTorch, both torch.Tensor and torch.tensor can be used to create new tensors.

a = torch.Tensor([1, 2])
a

a=torch.tensor([1,2])
a

First, let's look at the fundamental difference between torch.Tensor() and torch.tensor().

torch.Tensor
torch.Tensor is a Python class; more specifically, it is an alias of the default tensor type torch.FloatTensor. torch.Tensor([1, 2]) calls the constructor of the Tensor class and generates a single-precision floating-point tensor.

a=torch.Tensor([1,2])
a.type()

torch.tensor()

torch.tensor() is just a Python function, and the function prototype is:

torch.tensor(data, dtype=None, device=None, requires_grad=False)

Among them, data can be: list, tuple, array, scalar and other types.
torch.tensor() always copies the data (rather than referencing it directly) and, depending on the original data type, generates the corresponding torch.LongTensor, torch.FloatTensor, or torch.DoubleTensor.
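The copying behaviour can be observed by modifying the source array afterwards. As a point of comparison, the sketch uses torch.from_numpy, which shares memory with the NumPy array instead of copying; the variable names are illustrative.

```python
import numpy as np
import torch

source = np.zeros(2, dtype=np.float64)

copied = torch.tensor(source)      # independent copy of the data
shared = torch.from_numpy(source)  # shares memory with the NumPy array

source[0] = 1.0                    # modify the original array afterwards

print(copied)  # unaffected by the change
print(shared)  # reflects the change
```

Both tensors keep the float64 dtype of the source array, which matches the torch.DoubleTensor case described above.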

import numpy as np 
a = torch.tensor([1, 2])
a.type()

Output:

'torch.LongTensor'

b = torch.tensor([1., 2.])
b.type()

Output:

'torch.FloatTensor'

c = np.zeros(2, dtype=np.float64)
c = torch.tensor(c)
c.type()

Output:

'torch.DoubleTensor'

a, b = torch.Tensor(1), torch.Tensor([1])
a, b

Output:

(tensor([1.4013e-45]), tensor([1.]))

In the former, the scalar 1 is interpreted as the size, so the result is an uninitialized tensor with one element whose value is arbitrary. In the latter, the list [1] is passed in as the data.
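The size-versus-value distinction is easier to see with a number other than 1. A minimal sketch, assuming the default floating-point type has not been changed from float32:

```python
import torch

by_size = torch.Tensor(3)    # 3 is a size: three uninitialized float32 values
by_data = torch.Tensor([3])  # [3] is data: one element with value 3.0
scalar = torch.tensor(3)     # torch.tensor treats 3 as data: a 0-dim integer tensor

print(by_size.shape, by_data.shape, scalar.shape)
print(by_data.dtype, scalar.dtype)
```

Since torch.tensor always interprets its argument as data and infers the dtype from it, it avoids this ambiguity.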

# astype converts the numeric type of the values in an array
x = np.array([1, 2, 2.5])
x.astype(int)

Output:

array([1, 2, 2])
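As the output shows, converting floats with astype(int) drops the fractional part rather than rounding. The sketch below makes this visible with a negative value and shows one possible way to round first, using np.rint; the variable names are illustrative.

```python
import numpy as np

values = np.array([1.9, -1.9, 2.5])

truncated = values.astype(int)         # fractional part is discarded (toward zero)
rounded = np.rint(values).astype(int)  # round to the nearest integer first

print(truncated)
print(rounded)
print(values.dtype)  # astype returns a new array; the original is unchanged
```

np.rint rounds halves to the nearest even integer, so 2.5 becomes 2 here.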

See also: "dtype() and astype() functions in simple terms"

Origin blog.csdn.net/weixin_43658159/article/details/116646515