[Deep Learning] Chapter 2: Data

Deep Learning: DNN

1. Environment setup

I originally planned to start with the data, but it would not make sense to begin the lecture without an environment in place, so let's quickly run through the setup first.

Everything in this deep learning series is based on the PyTorch framework, so the first step is installing the PyTorch module. Install Anaconda first, then run conda install pytorch torchvision torchaudio cpuonly, which installs the CPU-only version of PyTorch. This assumes everything goes smoothly; if you hit a snag, a quick search on Baidu or your search engine of choice will usually sort it out. The GPU version is more involved to set up.
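If the installation went through, a quick sanity check in Python confirms that the CPU build of PyTorch is usable (the exact version printed will depend on what conda installed):

```python
# Quick check that the CPU-only PyTorch install works.
import torch

print(torch.__version__)             # installed PyTorch version
print(torch.cuda.is_available())     # expected to be False for the cpuonly build
print(torch.tensor([1.0, 2.0]) * 2)  # tensor([2., 4.]) -- a basic tensor op works
```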

We will also run into other third-party libraries later in the course; how to install and use them will be explained as they come up.

2. Data

Before learning an algorithm or an architecture, you must first understand what kind of data that algorithm or architecture applies to.

1. Data Dimension
The data set below is the first one we encounter when learning machine learning algorithms: the iris data set. Data like this is generally called two-dimensional tabular data: it has rows and columns, the rows are individual samples, the columns are the features of those samples, and there is no relationship between samples. This is exactly the type of data a DNN is suited to. In other words, if you want to run your data through a DNN, the data must be two-dimensional, with rows representing samples and columns representing features, it must be labeled, and the rows must be independent of one another. Only then can a DNN be used.
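As a concrete illustration, here is the iris data as a two-dimensional table, using scikit-learn (assumed to be installed) just to load it: 150 rows of samples, 4 feature columns, plus one label per row.

```python
# The iris data set as two-dimensional tabular data:
# rows = samples, columns = features, one label per row.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # features and labels
print(X.shape)                  # (150, 4): 150 samples, 4 features
print(iris.feature_names)       # the 4 feature (column) names
print(y[:5])                    # class labels, e.g. [0 0 0 0 0] -- supervised learning
```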

The following uses the Iris data set to introduce several concepts: samples, features, labels, classification, regression, supervised learning, unsupervised learning, etc. These concepts are also applicable to deep learning.


It is generally accepted that for small-scale data, machine learning models can achieve satisfactory results, while for very large-scale data, deep learning models are needed to get the desired results.
How do we define small-scale versus large-scale? Generally, a sample size in the tens of thousands, or below roughly 100,000, counts as small-scale data. Image, video, text, and audio data are generally large-scale: even the smallest black-and-white handwritten digit image already has 28x28 = 784 features, and data of this kind is large. For large-scale data, traditional machine learning algorithms are simply not up to the task; neural network models must be used to achieve the desired results.
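A tiny illustration of that feature count: flattening even the smallest 28x28 grayscale image into one table row already yields 784 feature columns.

```python
# Even the smallest black-and-white image becomes 784 features once flattened.
import torch

img = torch.zeros(28, 28)   # a single 28x28 grayscale image
row = img.reshape(-1)       # flatten into one "sample row"
print(row.shape)            # torch.Size([784])  -> 28 * 28 = 784 features
```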

2. Data types
Data types are the way a computer stores and organizes data, and different data processing frameworks each have their own underlying data structures. The following shows the iris data in several different data structures:

Among them, structure 1, the dictionary, is a native Python data structure; 2 is a NumPy array; and 3 is a pandas DataFrame. Since we use the PyTorch framework to learn DNNs, the data fed into the DNN must be structure 4, tensor-type data. As you can see, the conversion is very simple: just call torch.tensor().
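A minimal sketch of the same kind of data in the four structures and the torch.tensor() conversion (the values here are just a couple of made-up iris-style rows):

```python
# The same small table in the four structures mentioned above,
# and the one-line conversion to the tensor type that a DNN consumes.
import numpy as np
import pandas as pd
import torch

# 1. native Python dictionary
data_dict = {"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]}

# 2. NumPy array
data_np = np.array([[5.1, 3.5], [4.9, 3.0]])

# 3. pandas DataFrame
data_df = pd.DataFrame(data_dict)

# 4. torch tensor -- what actually gets fed to the network
t_from_np = torch.tensor(data_np, dtype=torch.float32)
t_from_df = torch.tensor(data_df.values, dtype=torch.float32)
print(t_from_np.dtype, t_from_df.shape)   # torch.float32 torch.Size([2, 2])
```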

3. Reading data
Converting the data to tensor type is not the whole story. You also have to package the features and labels together, split the data into training and test sets, and then cut it into small batches to be fed to the DNN batch by batch.
Why so complicated? Because the data in deep learning is generally massive, with very large sample sizes, say more than 100,000 samples. Training does not work like a machine learning algorithm, which loads all the data into memory and then fits a model. Deep learning loads data in batches and learns and iterates batch by batch. It is this mechanism that allows deep learning to process massive amounts of data while avoiding the limitation of insufficient memory.

In this process you could write your own functions to package the data, split it into training and test sets, and cut it into mini-batches, but here we focus on the functions PyTorch provides:
use TensorDataset to package the features and labels of the samples; use random_split to divide the packaged samples into training and test sets; use DataLoader to cut the training and test sets into small batches. The whole process is steps 3, 4, and 5 in the figure below, and a code sketch of the same three steps follows:
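Here is a minimal sketch of those three steps; the feature and label tensors are random placeholders standing in for the converted iris data, and the split sizes and batch size are just example values:

```python
# Package -> split -> batch, using PyTorch's own utilities.
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

X = torch.randn(150, 4)           # placeholder features: 150 samples x 4 features
y = torch.randint(0, 3, (150,))   # placeholder labels: 3 classes

# step 3: package features and labels together
dataset = TensorDataset(X, y)

# step 4: split into training and test sets (here 120 / 30)
train_set, test_set = random_split(dataset, [120, 30])

# step 5: cut into small shuffled batches
batchdata = DataLoader(train_set, batch_size=16, shuffle=True)
testdata = DataLoader(test_set, batch_size=16, shuffle=False)

for xb, yb in batchdata:          # each iteration yields one mini-batch
    print(xb.shape, yb.shape)     # torch.Size([16, 4]) torch.Size([16])
    break
```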

At this point, the batchdata object can be fed into a DNN for training and learning. But since we have not covered the architecture yet, we will hold off on the demonstration until the architecture chapter. For now, let's take a closer look at these objects:

In fact, you can implement all of the above without these utilities: you can write your own packaging function and your own function for cutting the data into mini-batches. One reminder: when writing the splitting and batching functions, you must remember to shuffle the data, otherwise training will fail later and the loss curve will come out as a flat horizontal line. A minimal hand-written sketch is shown below.
One more thing: writing it by hand is only suitable for small data sets. The amount of data in deep learning is often very large and is generally stored in a distributed manner; manually packaging, shuffling, and batching that data would require a large amount of storage space and computing resources, so implementing it by hand is not recommended.
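For reference, here is a minimal hand-written sketch under those caveats; the function name and split ratio are just illustrative choices, and the key line is the shuffle via torch.randperm:

```python
# A hand-written alternative (only sensible for small data sets):
# shuffle first, then split into train/test and cut the training set into batches.
import torch

def split_and_batch(X, y, train_ratio=0.8, batch_size=16):
    n = X.shape[0]
    idx = torch.randperm(n)        # the crucial shuffle step
    X, y = X[idx], y[idx]
    n_train = int(n * train_ratio)
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]
    batches = [(X_train[i:i + batch_size], y_train[i:i + batch_size])
               for i in range(0, n_train, batch_size)]
    return batches, (X_test, y_test)

batches, test = split_and_batch(torch.randn(150, 4), torch.randint(0, 3, (150,)))
print(len(batches), batches[0][0].shape)   # 8 torch.Size([16, 4])
```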

Note: when PyTorch performs operations such as data generation, packaging, shuffling, splitting, batching, and data preprocessing (such as converting data types or normalizing data), it only stores the logical relationships of these transformations rather than actually materializing new transformed copies of the data; it produces mapping or iterable objects, and when these objects are used, the results are looked up iteratively on demand. This clever low-level design is mainly there to cope with massive amounts of data.
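A small demonstration of that behavior: the DataLoader below is only a lightweight iterable, and a batch is actually assembled only when you iterate over it.

```python
# DataLoader (and random_split) do not materialize transformed data up front;
# they hold references and index mappings, and batches are built on demand.
import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(torch.randn(100000, 4), torch.randint(0, 3, (100000,)))
loader = DataLoader(dataset, batch_size=256, shuffle=True)

print(type(loader))               # a loader object, not a list of batches
first_batch = next(iter(loader))  # the batch is assembled only at this point
print(first_batch[0].shape)       # torch.Size([256, 4])
```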

4. A more universal data reading method

In reality, the data we obtain comes in all kinds of types, and beyond the data type, the storage layout also varies: for example, features and labels may be stored in two separate files. For this reason, PyTorch provides a Dataset class (torch.utils.data.Dataset) for encapsulating data into a structure PyTorch can recognize. By writing a class that inherits from Dataset, we can read our own data, and the resulting object inherits the methods and properties of the parent class, which we can then use directly later on, which is very convenient. Honestly, this approach is not that important while learning DNNs, but when you get to CNNs later, the data sets become much more complex and this class becomes very useful; there will be examples then. Here we first use the iris data set to demonstrate the general process of reading data by inheriting the Dataset class:
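A minimal sketch of such a class, assuming (purely for illustration) that the iris features and labels sit in two separate CSV files; the file names and class name are hypothetical:

```python
# A custom Dataset that reads features and labels from two separate files.
import pandas as pd
import torch
from torch.utils.data import Dataset

class IrisDataset(Dataset):
    def __init__(self, feature_file="iris_features.csv", label_file="iris_labels.csv"):
        # read the two files and convert them to tensors once, up front
        self.features = torch.tensor(pd.read_csv(feature_file).values,
                                     dtype=torch.float32)
        self.labels = torch.tensor(pd.read_csv(label_file).values.squeeze(),
                                   dtype=torch.long)

    def __getitem__(self, index):
        # maps an index to one sample's features and label
        return self.features[index], self.labels[index]

    def __len__(self):
        # number of samples; needed by random_split and DataLoader
        return self.features.shape[0]
```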

Note: when writing this class, the __getitem__ method must be implemented, because it is what retrieves the features and label of each sample by index and establishes the direct mapping to the actual data, so it cannot be omitted. Other methods are technically optional, but in practice __len__ is needed by random_split and the default DataLoader sampler, so it is usually written as well. Of course, you can add more methods according to your needs.

Once the data has been read this way, you can package the features and labels, split them into training and test sets, cut them into mini-batches, and train the model in a loop, just as before.
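For example, reusing the IrisDataset sketch above, the downstream steps look the same as before (the split sizes assume the 150-sample iris data):

```python
# The same split -> batch pipeline, now driven by the custom Dataset.
from torch.utils.data import random_split, DataLoader

dataset = IrisDataset()
train_set, test_set = random_split(dataset, [120, 30])
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)

for xb, yb in train_loader:
    # xb, yb are ready to be fed into a DNN inside the training loop
    print(xb.shape, yb.shape)
    break
```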

Summary: this section covered the kind of data a DNN applies to, and the process and steps for reading, packaging, splitting, and batching it.
The data suited to a DNN is two-dimensional tabular data, in which rows represent samples and columns represent features. A DNN is a supervised learning algorithm, so the data must be labeled. It is also best that there is no relationship between samples: if the two-dimensional table is time-series data, or two-dimensional data with natural-language semantics, a DNN cannot learn the relationships between samples; a DNN learns the relationships between features.
Since the DNN is an introduction to deep learning and is relatively simple, the data here is not complicated, so reading it is the same as in machine learning. The pipeline of converting data types, packaging, splitting into training and test sets, and cutting into mini-batches is not difficult; you can implement it by hand or with the library functions. The specific library functions have all been shown above, so you can look them up and use them directly.
