This article contains my notes on Li Mu's online course Dive into Deep Learning 2.0 (Hands-on Deep Learning).
Video address: https://zhuanlan.zhihu.com/p/29125290.
Full textbook: https://zh-v2.d2l.ai/
Textbook section for this lesson: https://zh-v2.d2l.ai/chapter_preliminaries/pandas.html
Note address: https://gitee.com/lhm8013609/ mldl_-learning-notes/tree/master/1%E3%80%81DL_Limu/Notes
2021.05.08 Data preprocessing
The os module: file and directory methods
As an example, we first create an artificial dataset and store it in the CSV (comma-separated values) file ../data/house_tiny.csv. Data stored in other formats can be processed in a similar way. The textbook text refers here to a mkdir_if_not_exist function that ensures the directory ../data exists; in the code below, os.makedirs(..., exist_ok=True) plays that role. Note that in the textbook the comment #@save is a special tag: the functions, classes, or statements marked with it are saved in the d2l package so that they can be called directly later (for example, d2l.mkdir_if_not_exist(path)) without being redefined.
import os
# os.makedirs() creates directories recursively
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row is one data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')
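The directory-creation step can be tried in isolation. The following is a minimal sketch, assuming a throwaway directory from tempfile so that nothing outside it is touched; the folder names project and data are made up for illustration.

```python
import os
import tempfile

# Work inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'project', 'data')  # two levels that do not exist yet

    os.makedirs(target, exist_ok=True)  # creates both 'project' and 'data'
    os.makedirs(target, exist_ok=True)  # calling it again is harmless

    created = os.path.isdir(target)

    # Without exist_ok=True, an already existing directory raises FileExistsError.
    try:
        os.makedirs(target)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

This is why the notebook cell can be re-run safely: with exist_ok=True the second run does not fail on the existing ../data directory.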
Note:
The os.makedirs() method creates directories recursively.
- Syntax: os.makedirs(path, mode=0o777, exist_ok=False)
- Parameters:
  path: the directory to create recursively; it can be a relative or an absolute path.
  mode: the permission mode.
  exist_ok: if True, no error is raised when the directory already exists (this is what the code above uses).
The os.path.join() function joins path components into a single path.
- Path notation:
  .   the current directory
  ..  the parent of the current directory
  ./  a file or folder in the current directory, depending on the name that follows
  ../ a file or folder in the parent of the current directory, depending on the name that follows
- os.path.join('..', 'data') gives the path ../data, so the code above creates the data folder in the parent of the current directory.
- If a component starts with "/" (an absolute path), all components before it are discarded and joining restarts from that component.
- A component starting with "./" is not special: it is simply appended to the components before it.
- If several components start with "/", joining restarts from the last of them, and everything before it is discarded.
- Note: Linux and Windows differ (mainly in the separator that is inserted). The results shown here were obtained on Windows. See the comments of the article "Complete tutorial on python path splicing os.path.join() function".
- os.path.join('..', 'data', 'house_tiny.csv') gives the path of the file house_tiny.csv in the data folder of the parent directory.
with open(data_file, 'w') as f: f.write()
- Writes to the file; the with statement closes the file automatically when the block ends.
- For related usage, see:
  - "python uses with open() as to read and write files"
  - "with open() as f Usage"
import os
# On Windows, the results match the conclusions above
print("1:",os.path.join('aaaa','/bbbb','ccccc.txt'))
print("2:",os.path.join('/aaaa','/bbbb','/ccccc.txt'))
print("3:",os.path.join('aaaa','./bbb','ccccc.txt'))
Output:
1: /bbbb\ccccc.txt
2: /ccccc.txt
3: aaaa\./bbb\ccccc.txt
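To compare the two platforms without switching machines, the standard library exposes both sets of rules as ntpath (Windows) and posixpath (Linux/macOS); os.path is one of the two, depending on the system. A small sketch, reusing the three calls above:

```python
import ntpath      # Windows path rules, importable on any operating system
import posixpath   # POSIX path rules, importable on any operating system

cases = [
    ('aaaa', '/bbbb', 'ccccc.txt'),
    ('/aaaa', '/bbbb', '/ccccc.txt'),
    ('aaaa', './bbb', 'ccccc.txt'),
]

win_results = [ntpath.join(*parts) for parts in cases]
posix_results = [posixpath.join(*parts) for parts in cases]

for parts, win, posix in zip(cases, win_results, posix_results):
    print(parts, '->', repr(win), '|', repr(posix))
```

In these three cases the discard rule is the same on both platforms; only the inserted separator differs (a backslash on Windows, a forward slash on POSIX).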
Read file:
import pandas as pd
data = pd.read_csv(data_file)
data
Output:
   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
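The NaN entries appear because pd.read_csv treats the literal text NA as a missing value by default. The sketch below illustrates this with the same rows, read from an in-memory string (io.StringIO) instead of a file so that it runs anywhere; keep_default_na=False is shown only for contrast.

```python
import io
import pandas as pd

csv_text = (
    'NumRooms,Alley,Price\n'
    'NA,Pave,127500\n'
    '2,NA,106000\n'
    '4,NA,178100\n'
    'NA,NA,140000\n'
)

# Default parsing: the string 'NA' becomes a missing value (NaN).
data = pd.read_csv(io.StringIO(csv_text))
missing_per_column = data.isna().sum()

# With keep_default_na=False, 'NA' stays an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(missing_per_column)
print(raw['Alley'].tolist())
```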
2.2.2. Handling missing values
Note that "NaN" entries represent missing values. Typical ways to handle missing data are imputation (interpolation) and deletion: imputation replaces missing values with substitute values, while deletion simply discards them. Here we consider imputation.
Using the positional index iloc, we split data into inputs and outputs, where inputs is the first two columns and outputs is the last column. For missing numeric values in inputs, we replace the "NaN" entries with the mean of the same column.
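# iloc[rows, columns]: the part before the comma selects rows, the part after it selects columns
# A slice start:stop excludes stop, so 0:2 below selects columns 0 and 1
# Index 2 selects the third column, which is the last one
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
# numeric_only=True skips the text column Alley when computing the means
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)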
Output:
   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
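The positional split relies on how iloc slices work. The following sketch, on a tiny made-up DataFrame (the row labels a and b are hypothetical), contrasts iloc, whose slices exclude the end position, with loc, whose label slices include the end label.

```python
import pandas as pd

df = pd.DataFrame(
    {'NumRooms': [3.0, 2.0], 'Alley': ['Pave', None], 'Price': [127500, 106000]},
    index=['a', 'b'],  # hypothetical row labels, only for illustration
)

by_position = df.iloc[:, 0:2]             # columns at positions 0 and 1 (2 is excluded)
by_label = df.loc[:, 'NumRooms':'Alley']  # label slices include the end label
last_column = df.iloc[:, 2]               # a single column is returned as a Series

print(list(by_position.columns))
print(by_position.equals(by_label))
print(df.loc['b', 'Price'])
```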
Note:
iloc: selects data by integer position (for example, the second row, at position 1).
- data.iloc[:, 0:2] takes columns 0 and 1 of all rows; the end of the slice, 2, is excluded.
loc: selects data by the label in the row index "Index" (for example, the row whose "Index" is "A").
- More: "Detailed explanation of the usage of loc and iloc functions in Pandas (source code + examples)"
fillna() and mean()
- fillna signature: fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- Parameter value: the value used to fill the missing entries.
- More parameters: "pandas fills missing values NaN with mean value | fillna method analysis"
- mean() computes the average; see "mean() function usage in python's numpy library".
- The most commonly used parameter is axis. Taking an m * n matrix in numpy as an example:
  axis not set: averages all m * n numbers and returns a single number
  axis = 0: collapses the rows, averages each column, and returns n values (1 * n)
  axis = 1: collapses the columns, averages each row, and returns m values (m * 1)
- For a pandas DataFrame, mean() without an axis averages each column and returns one value per column; for a single column (a Series) it returns one number. Passing these column means to fillna therefore fills each column with its own mean.
Using jupyter notebook to edit text and code
- In a text cell, pressing Enter twice leaves a blank line.
- In command mode, pressing d twice (dd) deletes a cell.
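The axis behaviour described in the notes above can be checked on a small array. This is a sketch with made-up numbers; the DataFrame columns a and b are hypothetical.

```python
import numpy as np
import pandas as pd

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # m = 2 rows, n = 3 columns

overall = A.mean()           # all m * n values -> one number
per_column = A.mean(axis=0)  # collapse the rows -> n values
per_row = A.mean(axis=1)     # collapse the columns -> m values

# pandas: DataFrame.mean() works column by column by default.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 4.0, 8.0]})
filled = df.fillna(df.mean())  # each NaN receives the mean of its own column

print(overall, per_column, per_row)
print(filled)
```

Note that mean() skips NaN values in pandas by default, so the mean of column a is computed from 1.0 and 3.0 only.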
For categorical or discrete values in inputs, we treat "NaN" as a category. Since the "Alley" column takes only two kinds of values, "Pave" and "NaN", pandas can automatically convert it into two columns, "Alley_Pave" and "Alley_nan". Rows with an alley type of "Pave" will have "Alley_Pave" set to 1 and "Alley_nan" set to 0. Rows with a missing alley type will have "Alley_Pave" and "Alley_nan" set to 0 and 1 respectively.
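# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)
Output:
   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1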
Note:
pd.get_dummies
- Official documentation: see the pandas documentation for pd.get_dummies.
- pd.get_dummies(inputs, dummy_na=True)
- By default, each categorical column is split into one indicator column per distinct value. dummy_na=True adds one more indicator column for missing values (NaN).
- Here "Alley" has only the values Pave and NaN, so it is split into two columns, Alley_Pave and Alley_nan. In each row, the column that matches the row's value is 1 and the other is 0.
- See also: "Encoding categorical variables - pd.get_dummies(), LabelEncoder(), oneHotEncoder()"
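To see what dummy_na changes, the following sketch encodes the same small table twice. The DataFrame is rebuilt by hand here so the block is self-contained, and dtype=int is used so that the indicator columns print as 0/1.

```python
import numpy as np
import pandas as pd

inputs = pd.DataFrame({'NumRooms': [3.0, 2.0, 4.0, 3.0],
                       'Alley': ['Pave', np.nan, np.nan, np.nan]})

# dummy_na=True: one indicator column per value, plus one for NaN.
with_na = pd.get_dummies(inputs, dummy_na=True, dtype=int)

# dummy_na=False (the default): rows with NaN get 0 in every indicator column.
without_na = pd.get_dummies(inputs, dummy_na=False, dtype=int)

print(with_na)
print(without_na)
```

Only the text column Alley is encoded; the numeric column NumRooms is passed through unchanged.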
2021.05.11
2.2.3. Convert to tensor format
Now that all entries in inputs and outputs are of numeric type, they can be converted to tensor format. When the data is in tensor format, it can be further manipulated through those tensor functions introduced in Section 2.1.
import torch
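# to_numpy(dtype=float) converts every input column to floating point before building the tensor
x, y = torch.tensor(inputs.to_numpy(dtype=float)), torch.tensor(outputs.values)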
x, y
Output:
(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
Note: inputs holds the first two columns, the number of rooms (NumRooms) and the alley type (Alley); outputs holds the price (Price).
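The dtype=torch.float64 in the output comes from NumPy, whose default floating-point type is float64. A minimal sketch, assuming a small hand-written array in place of the real inputs, shows how the type is inherited and how to request float32 instead, which many PyTorch models expect:

```python
import numpy as np
import torch

values = np.array([[3.0, 1.0, 0.0],
                   [2.0, 0.0, 1.0]])  # NumPy defaults to float64

x64 = torch.tensor(values)                       # dtype is inherited from NumPy
x32 = torch.tensor(values, dtype=torch.float32)  # cast while converting
column_means = x32.mean(dim=0)                   # ordinary tensor operations apply

print(x64.dtype, x32.dtype, column_means)
```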
2.2.4. Summary
Like many other extension packages in the vast Python ecosystem, pandas is compatible with tensors.
Imputation (interpolation) and deletion can be used to deal with missing data.
2.2.5. Exercise
Create an original dataset with more rows and columns.
Remove the column with the most missing values.
Convert the preprocessed dataset to tensor format.
Operation:
# 1. Create the original dataset
import os
p_datafile = os.path.join('..', 'data', 'house.csv')
with open(p_datafile, 'w') as f:
    f.write('NumRoos,Alley,Size,Garden,Price\n')
    f.write('NA,Pave,100,Yes,127500\n')
    f.write('2,NA,200,Yes,187500\n')
    f.write('3,NA,150,No,155500\n')
    f.write('NA,NA,90,NA,100500\n')
    f.write('4,Pave,120,Yes,137500\n')
import pandas as pd
data1 = pd.read_csv(p_datafile)
data1
Output:
   NumRoos Alley  Size Garden   Price
0      NaN  Pave   100    Yes  127500
1      2.0   NaN   200    Yes  187500
2      3.0   NaN   150     No  155500
3      NaN   NaN    90    NaN  100500
4      4.0  Pave   120    Yes  137500
Note: Handling missing values
- df.isnull()  # returns True where a value is missing, otherwise False
- df.isnull().sum()  # returns the number of missing values in each column
- df.dropna()  # drops rows that contain any missing value
- df.dropna(axis=1)  # drops columns that contain any missing value
- df.dropna(how='all')  # drops only rows in which every value is missing
- df.dropna(thresh=4)  # keeps only rows with at least 4 non-missing values
- df.dropna(subset=['C'])  # drops rows with a missing value in column 'C'
- dddf = ddf.dropna(subset=['jie_num'], axis=0)  # drops rows with a missing value in column 'jie_num'
- datanota = AData[AData['marital'].notna()]  # keeps only rows in which column 'marital' is not missing
* Parameters of df.dropna():
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
- axis: 0 for rows, 1 for columns, default 0; the dimension along which data is dropped
- how: {'any', 'all'}, default 'any'. any: drop rows (or columns) containing any NaN; all: drop rows (or columns) in which every value is NaN
- thresh: int; keep only rows (or columns) with at least this many non-NaN values
- subset: list; only consider missing values in these columns
- inplace: bool; whether to modify the DataFrame itself instead of returning a new one
End
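The dropna variants listed above can be compared side by side. This is a small sketch on a made-up DataFrame (columns A, B, C are hypothetical); each result keeps the original row labels, which makes it easy to see what was dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [np.nan, np.nan, 6.0, np.nan],
    'C': [7.0, 8.0, 9.0, np.nan],
})

any_missing = df.dropna()                     # keep rows without any NaN
all_missing = df.dropna(how='all')            # drop only rows that are entirely NaN
at_least_two = df.dropna(thresh=2)            # keep rows with >= 2 non-NaN values
c_present = df.dropna(subset=['C'])           # only column 'C' is checked
dense_columns = df.dropna(axis=1, thresh=3)   # keep columns with >= 3 non-NaN values

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_two.index.tolist())
print(c_present.index.tolist())
print(list(dense_columns.columns))
```

The thresh examples show that the threshold counts values that are present, not values that are missing.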
data1.isna().sum()  # number of missing values in each column
Output:
NumRoos    2
Alley      3
Size       0
Garden     1
Price      0
dtype: int64
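# 2. Remove the column with the most missing values
# isna().sum() counts the NaN values per column; idxmax() returns the name of the column with the largest count
data1 = data1.drop(columns=data1.isna().sum().idxmax())
# Note: dropna(axis=1, thresh=n) keeps columns with at least n non-NaN values, so it is not a direct way to drop the column with the most NaN values
data1 = data1.fillna(data1.mean(numeric_only=True))  # fill missing numeric values with the column mean
data1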
Output:
   NumRoos  Size Garden   Price
0      3.0   100    Yes  127500
1      2.0   200    Yes  187500
2      3.0   150     No  155500
3      3.0    90    NaN  100500
4      4.0   120    Yes  137500
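A general way to remove the column with the most missing values, whatever the number of rows, is to look up its name with idxmax. The sketch below wraps this in a hypothetical helper, drop_most_missing_column (not part of pandas or d2l), and contrasts it with a thresh-based call on made-up data with six rows.

```python
import numpy as np
import pandas as pd

def drop_most_missing_column(df):
    """Return a copy of df without the column that has the most NaN values."""
    nan_counts = df.isna().sum()  # NaN count per column
    worst = nan_counts.idxmax()   # label of the column with the largest count
    return df.drop(columns=worst)

# Hypothetical data: 'Alley' has 3 NaN values, 'NumRooms' has 2.
df = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 3.0, np.nan, 4.0, 5.0],
    'Alley': ['Pave', np.nan, np.nan, np.nan, 'Pave', 'Pave'],
    'Price': [127500, 187500, 155500, 100500, 137500, 120000],
})

by_name = drop_most_missing_column(df)

# thresh counts NON-missing values: with thresh=3, 'Alley' (3 present values) is kept.
by_thresh = df.dropna(axis=1, thresh=int(df.isna().sum().max()))

print(list(by_name.columns))
print(list(by_thresh.columns))
```

If two columns tie for the largest count, idxmax returns the first of them, so only one column is removed.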
input1, output1 = data1.iloc[:, 0:3], data1.iloc[:,3]
input1, output1
Output:
(   NumRoos  Size Garden
 0      3.0   100    Yes
 1      2.0   200    Yes
 2      3.0   150     No
 3      3.0    90    NaN
 4      4.0   120    Yes,
 0    127500
 1    187500
 2    155500
 3    100500
 4    137500
 Name: Price, dtype: int64)
input1 = pd.get_dummies(input1, dummy_na=True, dtype=int)  # split Garden into 3 columns by value; dtype=int gives 0/1
input1
Output:
   NumRoos  Size  Garden_No  Garden_Yes  Garden_nan
0      3.0   100          0           1           0
1      2.0   200          0           1           0
2      3.0   150          1           0           0
3      3.0    90          0           0           1
4      4.0   120          0           1           0
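# 3. Convert to tensor format
import torch
# to_numpy(dtype=float) converts every input column to floating point before building the tensor
a, b = torch.tensor(input1.to_numpy(dtype=float)), torch.tensor(output1.values)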
a, b
Output:
(tensor([[  3., 100.,   0.,   1.,   0.],
         [  2., 200.,   0.,   1.,   0.],
         [  3., 150.,   1.,   0.,   0.],
         [  3.,  90.,   0.,   0.,   1.],
         [  4., 120.,   0.,   1.,   0.]], dtype=torch.float64),
 tensor([127500, 187500, 155500, 100500, 137500]))
The difference between torch.Tensor and torch.tensor
In PyTorch, both torch.Tensor and torch.tensor are used to create new tensors.
a = torch.Tensor([1, 2])
a
a=torch.tensor([1,2])
a
First, let's look at the fundamental difference between torch.Tensor() and torch.tensor().
torch.Tensor
torch.Tensor is a Python class; more specifically, it is an alias of the default tensor type torch.FloatTensor. torch.Tensor([1, 2]) calls the constructor of the Tensor class and generates a single-precision floating-point tensor.
a = torch.Tensor([1, 2])
a.type()
Output:
'torch.FloatTensor'
torch.tensor()
torch.tensor() is a Python function, and its prototype is:
torch.tensor(data, *, dtype=None, device=None, requires_grad=False, pin_memory=False)
Here data can be a list, tuple, array, scalar, or another such type.
torch.tensor() copies the data (rather than referencing it) and generates the corresponding torch.LongTensor, torch.FloatTensor, or torch.DoubleTensor according to the type of the original data.
import numpy as np
a = torch.tensor([1, 2])
a.type()
Output:
'torch.LongTensor'
b = torch.tensor([1., 2.])
b.type()
Output:
'torch.FloatTensor'
c = np.zeros(2, dtype=np.float64)
c = torch.tensor(c)
c.type()
Output:
'torch.DoubleTensor'
a, b=torch.Tensor(1), torch.Tensor([1])
a, b
Output:
(tensor([1.4013e-45]), tensor([1.]))
In the former, the scalar 1 is interpreted as a size, so the result is an uninitialized tensor with one element (the value shown is arbitrary). In the latter, the list [1] is passed in as the data.
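Two of the points above, that torch.tensor() copies its input and that an integer means a size for torch.Tensor but data for torch.tensor, can be checked directly. A minimal sketch; torch.from_numpy is included only as a contrast, since it shares memory with the NumPy array instead of copying.

```python
import numpy as np
import torch

source = np.array([1.0, 2.0, 3.0])

copied = torch.tensor(source)      # independent copy of the data
shared = torch.from_numpy(source)  # shares memory with the NumPy array

source[0] = 100.0                  # modify the original array in place

print(copied[0].item())  # the copy is unaffected
print(shared[0].item())  # the shared tensor sees the change

sized = torch.Tensor(3)   # an int is a size: 3 uninitialized values
valued = torch.tensor(3)  # an int is data: a 0-dimensional tensor holding 3
print(sized.shape, valued.shape)
```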
# astype() converts the element type of an array
x = np.array([1, 2, 2.5])
x.astype(int)
Output:
array([1, 2, 2])
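As the output shows, astype(int) drops the fractional part instead of rounding. The sketch below, with made-up values, contrasts that truncation with rounding first via np.rint:

```python
import numpy as np

x = np.array([1.7, -1.7, 2.5])

truncated = x.astype(int)         # drops the fractional part (towards zero)
rounded = np.rint(x).astype(int)  # rounds to the nearest integer first (ties go to even)

print(truncated.tolist(), rounded.tolist())
print(x.dtype)  # astype returns a new array; x itself keeps its type
```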
See more: "dtype() and astype() functions in simple terms"
If anything here is inaccurate, please let me know!