[Study notes made simple] Data preprocessing in Li Mu's "Dive into Deep Learning 2.0"

This article collects my notes on Li Mu's Dive into Deep Learning 2.0 online course.
Video address: https://zhuanlan.zhihu.com/p/29125290.
Full textbook: https://zh-v2.d2l.ai/
Textbook for this course: https://zh-v2.d2l.ai/chapter_preliminaries/pandas.html
Note address: https://gitee.com/lhm8013609/mldl_-learning-notes/tree/master/1%E3%80%81DL_Limu/Notes

2021.05.08 Data preprocessing learning

Learning the os module's file and directory methods

As an example, we first create an artificial dataset and store it in a csv (comma-separated values) file, ../data/house_tiny.csv. Data stored in other formats can be processed in a similar manner. The os.makedirs call below ensures that the directory ../data exists. In the textbook, the comment #@save is a special tag: functions, classes, or statements below this tag are saved in the d2l package so that they can be called directly later (for example, d2l.mkdir_if_not_exist(path)) without redefinition.

import os

# os.makedirs() creates directories recursively
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row represents one data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

Note：

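The os.makedirs() method creates directories recursively: any missing intermediate directories are created as well.

The syntax of the makedirs() method is as follows:
os.makedirs(name, mode=0o777, exist_ok=False)
Parameters
name – the directory to be created recursively, which can be a relative or absolute path.
mode – permission mode.
exist_ok – if True, no error is raised when the directory already exists (as used in the code above).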
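
To make the recursive behaviour concrete, here is a minimal sketch that runs inside a temporary directory. The folder names 'data' and 'raw' are placeholders chosen for illustration; they are not part of the course material.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, 'data', 'raw')  # hypothetical folder names

    # One call creates both 'data' and 'data/raw'.
    os.makedirs(nested)
    created = os.path.isdir(nested)

    # Calling it again is harmless only because exist_ok=True.
    os.makedirs(nested, exist_ok=True)

    # With the default exist_ok=False, an existing directory raises an error.
    try:
        os.makedirs(nested)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

This is why the course code passes exist_ok=True: the cell can be re-run without failing once ../data already exists.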

The os.path.join() function joins path components into a single file path

Path notation
. represents the current directory.
.. represents the parent of the current directory.
./ refers to a file or folder in the current directory, depending on the name that follows.
../ refers to a file or folder in the parent directory, depending on the name that follows.

* os.path.join('..', 'data') represents the path ../data, so the makedirs call actually creates the data folder in the parent of the current directory.

Joining starts from the last component that begins with "/": that component is treated as absolute, and all components before it are discarded.
A component that begins with "./" is not absolute, so nothing is discarded; it is simply appended to the preceding components as-is.
!!!Note: Linux and Windows differ (for example, in the separator that is inserted: "/" versus "\"). The outputs shown below were obtained on Windows. See also: Complete tutorial on python path splicing os.path.join() function
os.path.join('..', 'data', 'house_tiny.csv') represents the file house_tiny.csv in the data folder of the parent directory

with open(data_file, 'w') as f: f.write()

File writing: opening the file in 'w' mode inside a with block writes to it and closes it automatically when the block ends.
For related usage, see:
python uses with open() as to read and write files;
with open() as f Usage
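
As a small illustration of the with open() pattern, the sketch below writes two lines and reads them back. The file name demo.csv and the temporary directory are assumptions made for the example only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    demo_file = os.path.join(root, 'demo.csv')  # hypothetical file name

    # 'w' creates the file, or truncates it if it already exists.
    with open(demo_file, 'w') as f:
        f.write('NumRooms,Alley,Price\n')
        f.write('2,NA,106000\n')

    # The file was closed when the block above ended; reopen it for reading.
    with open(demo_file) as f:
        lines = f.read().splitlines()

    # Leaving the with block closes the file, even without calling f.close().
    closed_after_block = f.closed

print(lines)
print(closed_after_block)
```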

import os

# On Windows, the results match the rules described above
print("1:", os.path.join('aaaa', '/bbbb', 'ccccc.txt'))
print("2:", os.path.join('/aaaa', '/bbbb', '/ccccc.txt'))
print("3:", os.path.join('aaaa', './bbb', 'ccccc.txt'))

Output:

1: /bbbb\ccccc.txt
2: /ccccc.txt
3: aaaa\./bbb\ccccc.txt
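
To compare platforms without switching machines, one option is to call the two standard-library flavours of os.path directly: ntpath (Windows rules) and posixpath (Linux/macOS rules). The sketch below reuses the three cases from above; it is an illustration, not part of the course code.

```python
import ntpath      # Windows flavour of os.path
import posixpath   # Linux/macOS flavour of os.path

cases = [
    ('aaaa', '/bbbb', 'ccccc.txt'),
    ('/aaaa', '/bbbb', '/ccccc.txt'),
    ('aaaa', './bbb', 'ccccc.txt'),
]

# Join the same components under both sets of rules.
windows_results = [ntpath.join(*parts) for parts in cases]
posix_results = [posixpath.join(*parts) for parts in cases]

for number, (win, posix) in enumerate(zip(windows_results, posix_results), 1):
    print(f"{number}: windows={win}  posix={posix}")
```

In these three cases the discard rule is the same on both platforms; only the separator inserted between components differs.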

Read file:

import pandas as pd
data = pd.read_csv(data_file)
data

Output:

	NumRooms	Alley	Price
0	NaN	Pave	127500
1	2.0	NaN	106000
2	4.0	NaN	178100
3	NaN	NaN	140000

2.2.2. Handling missing values

Note that "NaN" entries represent missing values. Typical ways to handle missing data are interpolation and deletion: interpolation replaces missing values with substitute values, while deletion discards them. Here we will consider interpolation.
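
The two strategies can be compared side by side on a tiny table. This is a minimal sketch: the DataFrame below is a numeric-only subset of the example data, built by hand for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'NumRooms': [np.nan, 2.0, 4.0, np.nan],
                   'Price': [127500, 106000, 178100, 140000]})

# Deletion: rows that contain any missing value are discarded.
deleted = df.dropna()

# Interpolation: missing values are replaced by the mean of their column.
imputed = df.fillna(df.mean())

print(deleted)
print(imputed)
```

Deletion keeps only two of the four rows here, which is one reason interpolation is often preferred on small datasets.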

Through positional index iloc, we divide the data into inputs and outputs, where the former is the first two columns of data, and the latter is the last column of data. For missing values in the inputs, we replace the "NaN" entries with the mean of the same column.

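# iloc[rows, columns]: the part before the comma selects rows, the part after selects columns
# a:b selects positions a up to (but not including) b, so 0:2 below means columns 0 and 1
# index 2 is the third column, i.e. the last one
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]

# numeric_only=True skips the string column "Alley"; recent pandas versions raise an error without it
inputs = inputs.fillna(inputs.mean(numeric_only=True))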
print(inputs)

Output:

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN

Note：

iloc function: fetch data by integer position (for example, the row at position 1, which is the second row)

data.iloc[:, 0:2] takes columns 0 and 1 of all rows (the end position 2 is excluded)
loc function: fetch data by the label in the row index "Index" (for example, the row whose "Index" is "A")
More: Detailed explanation of the usage of loc and iloc functions in Pandas (source code + examples)
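
The difference between position-based and label-based access is easiest to see on a table with non-numeric row labels. In this sketch the labels 'A', 'B', 'C' are hypothetical; the example dataset itself uses the default integer index.

```python
import pandas as pd

df = pd.DataFrame({'NumRooms': [3.0, 2.0, 4.0],
                   'Alley': ['Pave', None, None],
                   'Price': [127500, 106000, 178100]},
                  index=['A', 'B', 'C'])  # hypothetical row labels

first_two_cols = df.iloc[:, 0:2]   # positions 0 and 1; position 2 is excluded
by_position = df.iloc[1]           # the second row, found by position
by_label = df.loc['B']             # the same row, found by its label
label_slice = df.loc['A':'B']      # label slices include both end points

print(first_two_cols)
print(label_slice)
```

Note the asymmetry: an iloc slice excludes its end position, while a loc slice includes its end label.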

fillna() and mean() functions

fillna function form: fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Parameters:
value: the value used to fill missing entries.
More parameters: pandas fills missing values NaN with mean value | fillna method analysis
mean() function: computes the mean. See: mean() function usage in python's numpy library
The most commonly used parameter is axis. Taking an m * n matrix as an example:
axis not set: averages all m * n numbers and returns a single real number
axis = 0: collapses the rows, averages each column, and returns n values (a 1 * n result)
axis = 1: collapses the columns, averages each row, and returns m values (an m * 1 result)
mean(A) under the column-wise convention:
If A is a matrix, the output is the mean of each column (a vector).
If A is a column vector, the output is its mean (a number).
If A is a row vector, the output is likewise its mean (a number), the same as for a column vector.
For a table, this column-wise behaviour is the default of pandas' DataFrame.mean() (axis=0), whereas numpy's mean() without axis averages all elements.
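
The axis rules can be checked on a small numpy array. This is a minimal sketch with made-up numbers; keepdims=True is used at the end because numpy otherwise returns a flat 1-D array rather than an m * 1 matrix.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # m = 2 rows, n = 3 columns

overall = A.mean()            # no axis: one number over all m * n entries
per_column = A.mean(axis=0)   # collapse the rows: one mean per column
per_row = A.mean(axis=1)      # collapse the columns: one mean per row

# keepdims=True preserves the collapsed axis, giving an m * 1 result.
column_shape = A.mean(axis=1, keepdims=True).shape

print(overall, per_column, per_row, column_shape)
```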

Using jupyter notebook to edit text and code:

Pressing Enter twice leaves an empty line
Press d twice (dd) in command mode to delete a cell

For categorical or discrete values in inputs, we treat "NaN" as a category. Since the "Alley" column only takes two categorical values, "Pave" and "NaN", pandas can automatically convert this column into two columns, "Alley_Pave" and "Alley_nan". Rows with an alley type of "Pave" will have the value of "Alley_Pave" set to 1 and the value of "Alley_nan" set to 0. Rows missing an alley type will have "Alley_Pave" and "Alley_nan" set to 0 and 1 respectively.

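# dtype=int keeps the indicator columns as 0/1; recent pandas versions return True/False by default
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

Output:
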
   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1

Note

pd.get_dummies

See the official pandas documentation for pd.get_dummies.

pd.get_dummies(inputs, dummy_na=True)
By default, each categorical column is split into one indicator column per distinct value. dummy_na=True adds one more indicator column for missing values (NaN).
Here "Alley" has only two cases, Pave and NaN, so it is split into two columns. In each column, 1 marks the rows that have that value and 0 marks the rows that do not.
Encoding categorical variables: pd.get_dummies(), LabelEncoder(), OneHotEncoder()
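
The effect of dummy_na is easiest to see by encoding the same column with and without it. This is a minimal sketch on a hand-built copy of the "Alley" column; dtype=int is passed only so that the indicators print as 0/1.

```python
import numpy as np
import pandas as pd

alley = pd.DataFrame({'Alley': ['Pave', np.nan, np.nan, np.nan]})

# Without dummy_na, missing values get no column of their own.
without_na = pd.get_dummies(alley, dtype=int)

# With dummy_na=True, an extra indicator column marks the missing rows.
with_na = pd.get_dummies(alley, dummy_na=True, dtype=int)

print(without_na)
print(with_na)
```

Without dummy_na, a row with a missing alley type is encoded as all zeros, so the information that the value was missing is only implicit.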

2021.05.11

2.2.3. Convert to tensor format

Now that all entries in inputs and outputs are of numeric type, they can be converted to tensor format. When the data is in tensor format, it can be further manipulated through those tensor functions introduced in Section 2.1.

import torch

x, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
x, y

Output:

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))

Note: inputs are the number of rooms (NumRooms) and the alley type (Alley) from the first columns, and outputs are the price.

2.2.4. Summary

Like many other extension packages in the vast Python ecosystem, pandas can work together with tensors.

Interpolation and deletion can be used to deal with missing data.

2.2.5. Exercise

Create an original dataset with more rows and columns.

Remove the column with the most missing values.

Convert the preprocessed dataset to tensor format.

Operation:

# 1. Create the original dataset
import os
p_datafile = os.path.join('..', 'data', 'house.csv')
with open(p_datafile, 'w') as f:
    f.write('NumRoos,Alley,Size,Garden,Price\n')
    f.write('NA,Pave,100,Yes,127500\n')
    f.write('2,NA,200,Yes,187500\n')
    f.write('3,NA,150,No,155500\n')
    f.write('NA,NA,90,NA,100500\n')
    f.write('4,Pave,120,Yes,137500\n')

import pandas as pd

data1 = pd.read_csv(p_datafile)
data1

Output:

NumRoos	Alley	Size	Garden	Price
0	NaN	Pave	100	Yes	127500
1	2.0	NaN	200	Yes	187500
2	3.0	NaN	150	No	155500
3	NaN	NaN	90	NaN	100500
4	4.0	Pave	120	Yes	137500

Note: Handle missing values

df.isnull()  # Returns True where a value is missing, otherwise False
df.isnull().sum()  # Returns the number of missing values in each column
df.dropna()  # Drop rows that contain any missing value
df.dropna(axis=1)  # Drop columns that contain any missing value
df.dropna(how='all')  # Drop only rows in which every value is missing
df.dropna(thresh=4)  # Keep only rows with at least 4 non-missing values
df.dropna(subset=['C'])  # Drop rows with a missing value in column 'C'
dddf = ddf.dropna(subset=['jie_num'], axis=0)  # Drop rows with a missing value in column 'jie_num'
datanota = AData[AData['marital'].notna()]  # Keep only rows whose 'marital' value is not missing

* Parameter description of df.dropna():

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

axis: 0 drops rows, 1 drops columns; default 0
how: {'any', 'all'}, default 'any'. any: drop rows containing any NaN; all: drop rows in which every value is NaN
thresh: int, keep only rows (or columns) with at least that many non-NaN values
subset: list, only consider missing values in these columns
inplace: bool, whether to modify the DataFrame itself instead of returning a new one
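
The parameters above can be tried out on a small table. This is a minimal sketch; the columns 'A', 'B', 'C' and their values are made up so that each option keeps a different set of rows or columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, np.nan, 4.0],
                   'B': [1.0, 2.0, np.nan, 4.0],
                   'C': [1.0, np.nan, np.nan, np.nan]})

any_rows = df.dropna()                      # rows without any NaN
all_rows = df.dropna(how='all')             # drop only the all-NaN row
thresh_rows = df.dropna(thresh=2)           # rows with at least 2 non-NaN values
subset_rows = df.dropna(subset=['B'])       # rows where column 'B' is present
thresh_cols = df.dropna(axis=1, thresh=3)   # columns with at least 3 non-NaN values

print(list(any_rows.index), list(all_rows.index), list(thresh_rows.index))
print(list(subset_rows.index), list(thresh_cols.columns))
```

Note that thresh counts the values that are present, not the values that are missing.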

End

data1.isna().sum()  # number of missing values in each column

Output:

NumRoos    2
Alley      3
Size       0
Garden     1
Price      0
dtype: int64

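# 2. Drop the column with the most missing values
# idxmax() returns the name of the column with the largest NaN count (here 'Alley')
data1 = data1.drop(columns=data1.isna().sum().idxmax())
# Note: dropna(axis=1, thresh=k) keeps columns with at least k non-NaN values,
# so passing the maximum NaN count as thresh would only work here by coincidence

# Fill missing numeric values with the mean of the existing values in that column
data1 = data1.fillna(data1.mean(numeric_only=True))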
data1

Output:

	NumRoos	Size	Garden	Price
0	3.0	100	Yes	127500
1	2.0	200	Yes	187500
2	3.0	150	No	155500
3	3.0	90	NaN	100500
4	4.0	120	Yes	137500

input1, output1 = data1.iloc[:, 0:3], data1.iloc[:,3]
input1, output1

Output:

(   NumRoos  Size Garden
 0      3.0   100    Yes
 1      2.0   200    Yes
 2      3.0   150     No
 3      3.0    90    NaN
 4      4.0   120    Yes,
 0    127500
 1    187500
 2    155500
 3    100500
 4    137500
 Name: Price, dtype: int64)

input1 = pd.get_dummies(input1, dummy_na=True, dtype=int)  # split Garden into 3 columns by value (dtype=int keeps 0/1)
input1

Output:

NumRoos	Size	Garden_No	Garden_Yes	Garden_nan
0	3.0	100	0	1	0
1	2.0	200	0	1	0
2	3.0	150	1	0	0
3	3.0	90	0	0	1
4	4.0	120	0	1	0

# 3. Convert to tensor format
import torch

a, b = torch.tensor(input1.values), torch.tensor(output1.values)
a, b

Output:

(tensor([[  3., 100.,   0.,   1.,   0.],
         [  2., 200.,   0.,   1.,   0.],
         [  3., 150.,   1.,   0.,   0.],
         [  3.,  90.,   0.,   0.,   1.],
         [  4., 120.,   0.,   1.,   0.]], dtype=torch.float64),
 tensor([127500, 187500, 155500, 100500, 137500]))

The difference between torch.Tensor and torch.tensor

In PyTorch, both torch.Tensor and torch.tensor can be used to create new tensors.

a = torch.Tensor([1, 2])
a

a=torch.tensor([1,2])
a

First, let's look at where the difference between torch.Tensor() and torch.tensor() comes from.

torch.Tensor
torch.Tensor is a Python class; more specifically, it is an alias of the default tensor type torch.FloatTensor. torch.Tensor([1, 2]) calls the constructor of the Tensor class and generates a single-precision floating-point tensor.

a=torch.Tensor([1,2])
a.type()

torch.tensor()

torch.tensor() is just a Python function, and the function prototype is:

torch.tensor(data, dtype=None, device=None, requires_grad=False)

Here, data can be a list, tuple, array, scalar, or other such type.
torch.tensor() copies the data (rather than referencing it directly) and generates the corresponding torch.LongTensor, torch.FloatTensor, or torch.DoubleTensor according to the original data type.

import numpy as np 
a = torch.tensor([1, 2])
a.type()

Output:

'torch.LongTensor'

b = torch.tensor([1., 2.])
b.type()

Output:

'torch.FloatTensor'

c = np.zeros(2, dtype=np.float64)
c = torch.tensor(c)
c.type()

Output:

'torch.DoubleTensor'

a, b=torch.Tensor(1), torch.Tensor([1])
a, b

Output:

(tensor([1.4013e-45]), tensor([1.]))

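In the former, the scalar 1 is interpreted as a size, so the result is an uninitialized tensor with one element (hence the arbitrary value 1.4013e-45). In the latter, the list [1] is interpreted as the data.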
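
The copy behaviour of torch.tensor() can be observed by changing the source array afterwards. In this sketch, torch.from_numpy() is brought in only as a contrast because it shares memory with the array; it is not used elsewhere in these notes. The dtype checks assume PyTorch's default dtype has not been changed.

```python
import numpy as np
import torch

source = np.array([1.0, 2.0])

copied = torch.tensor(source)       # independent copy of the data
shared = torch.from_numpy(source)   # shares memory with the numpy array

source[0] = 100.0                   # modify the array after the conversion

# The class constructor always yields float32; the function infers the dtype.
float_default = torch.Tensor([1, 2])
inferred = torch.tensor([1, 2])

print(copied, shared)
print(float_default.dtype, inferred.dtype)
```

Only the tensor built with from_numpy() sees the change; the one built with torch.tensor() still holds the original values.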

# astype converts the numeric type of the values in an array
x = np.array([1, 2, 2.5])
x.astype(int)

Output:

array([1, 2, 2])
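
Note that converting floats with astype(int) truncates toward zero rather than rounding. The sketch below, with made-up values, contrasts it with rounding first via np.rint, which rounds halves to the nearest even integer.

```python
import numpy as np

x = np.array([1.7, -1.7, 2.5])

truncated = x.astype(int)            # drops the fractional part
rounded = np.rint(x).astype(int)     # rounds first, then converts

# astype returns a new array; the original keeps its float dtype.
original_dtype = x.dtype

print(truncated, rounded, original_dtype)
```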

See more: dtype() and astype() functions in simple terms
