Machine Learning Tinkering, Part 2: Datasets, Classification, and Feature Engineering

Viewed from the perspective of a whole life, living is a process of constant choosing. Along the way there will be changes we expect and changes we do not, but however we choose, certain profound experiences will remain, preserved forever as an unalterable imprint on our lives. ("Diligence: How to Be a Great Person")

So far I haven't actually introduced NumPy or SciPy, and Scikit-learn has barely made an appearance. Don't worry: a good entry point usually does not start from the basics, even though the basics matter a great deal once you start applying things. That may sound contradictory, but once you work through it you will find that it is not.

To borrow a phrase from the DL4J community, machine learning has to handle data from many different sources and of many different types, such as log files, text documents, tabular data, images, and videos. The goal of data processing for machine learning is to convert these various kinds of data into a series of values stored in a multi-dimensional array. The data may also require various kinds of preprocessing, including transformation, scaling, normalization, merging, partitioning into training and test datasets, and random shuffling.
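As a rough illustration of a couple of those steps – the arrays X and y below are made-up placeholders, not data from this post – scaling and a shuffled train/test split with NumPy and scikit-learn might look like this:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Made-up raw data: 6 samples with 2 numeric features, plus labels.
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [1.5, 220.0],
              [3.0, 150.0],
              [2.5, 170.0],
              [1.8, 210.0]])
y = np.array([0, 0, 0, 1, 1, 0])

# Scale each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Shuffle and split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, shuffle=True, random_state=0)
print(X_train.shape, X_test.shape)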

Let's continue to look at the core three steps and try to expand -

1. Read the data – what are the data sources? What types of data are there? Is the data open or encrypted? How do we read large amounts of data?
2. Preprocess the data – can the raw data be used directly? What data format is better suited to machine learning? What is so-called "vectorization"?
3. Choose the right model and learning algorithm – what makes a good algorithm? How do we optimize the model? What are the hardware requirements?

Of course, questions can go on and on if you want.

Datasets

Let’s start with datasets. In fact, one of the hardest problems in machine learning has nothing to do with machine learning itself: it’s how to get the right data in the right format.

Without data, a machine does not know what to learn, just as we would not know what to learn when faced with a blank book.

Datasets in different fields require a lot of domain knowledge to judge, because the dataset implicitly encodes what we want the machine to pay attention to, whether that is recognizing license plates, distinguishing different kinds of flowers, or understanding a piece of music.

The dataset we will actually use is the Iris dataset (also known as the iris flower dataset), a classic dataset dating back to the 1930s.
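Since scikit-learn ships a copy of this dataset, loading and inspecting it is straightforward; a minimal sketch:

from sklearn.datasets import load_iris

data = load_iris()
print(data['feature_names'])   # sepal length, sepal width, petal length, petal width
print(data['target_names'])    # setosa, versicolor, virginica
print(data['data'].shape)      # (150, 4): 150 samples, 4 features each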

Visualization

A concept that comes up here is classification. Generally speaking, we call the measurements in the data features, and a feature set is a collection of properties of a thing, for example the sepal length, sepal width, petal length, and petal width mentioned in the book.

There may of course be other features, but for classification these may already be enough to tell the species apart. That is exactly the problem classification tries to solve: given samples that carry a class label, design a rule, and then use that rule to assign labels to other samples (for example, a new iris photo taken with a phone). Spam filtering is the same kind of problem: based on messages that users have flagged as spam or not spam, can we decide whether a newly arriving message is spam?

The book suggests getting to know the dataset through visualization first. I think that works for a small dataset, but a large amount of data may not be suitable for visualizing up front, because the process itself takes a lot of time.
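For a dataset as small as Iris, that first look can be as simple as plotting two features against each other and coloring the points by species. This is only an illustrative sketch, not the book's exact figure:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']
target = data['target']

# Plot petal length (index 2) against petal width (index 3), one color per species.
for species_index, name in enumerate(data['target_names']):
    mask = (target == species_index)
    plt.scatter(features[mask, 2], features[mask, 3], label=name)
plt.xlabel(data['feature_names'][2])
plt.ylabel(data['feature_names'][3])
plt.legend()
plt.show()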

Data Analysis

Before going further, I suggest you not get caught up in the code in the book just yet, and briefly step back to think about one question: how do we get the right data in the right format?

As we said earlier, there are countless types of data. Multiply that by the dimensions of different fields and the volume explodes. Everyone likes to talk casually about big data these days, but if massive data is never analyzed and processed, it is just a pile of information, unreadable by humans and machines alike.

Therefore, data analysis and processing are very important. There are objections, though, that having machines blindly imitate humans is not the best way to learn: AlphaGo Zero, AlphaGo's successor, defeated AlphaGo without ever studying human games, so is statistics-based learning from human experience really the way forward? That is still being explored.

But in the current situation, we do need to continue to pay attention to data analysis and processing.

Building the Simplest Classifier

Having said all that, following the author's approach we can already build a simple classifier and train a model that makes classification predictions. The code is below:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']
feature_names = data['feature_names']
species = data['target_names'][data['target']]

# Drop setosa so only versicolor and virginica remain (the harder pair to separate).
setosa = (species == 'setosa')
features = features[~setosa]
species = species[~setosa]
virginica = (species == 'virginica')

# Threshold on petal width (feature index 3); petal length (index 2) is the other axis.
t = 1.75
p0, p1 = 3, 2

# Not defined in the book's snippet: set to True for a colour figure, False for greyscale.
COLOUR_FIGURE = False
if COLOUR_FIGURE:
    area1c = (1., .8, .8)
    area2c = (.8, .8, 1.)
else:
    area1c = (1., 1., 1.)
    area2c = (.7, .7, .7)

# Axis limits with a small margin around the data.
x0, x1 = features[:, p0].min() * .9, features[:, p0].max() * 1.1
y0, y1 = features[:, p1].min() * .9, features[:, p1].max() * 1.1

# Shade the two decision regions on either side of the threshold.
plt.fill_between([t, x1], [y0, y0], [y1, y1], color=area2c)
plt.fill_between([x0, t], [y0, y0], [y1, y1], color=area1c)
plt.plot([t, t], [y0, y1], 'k--', lw=2)
plt.plot([t - .1, t - .1], [y0, y1], 'k:', lw=2)

# Plot the two classes with different markers.
plt.scatter(features[virginica, p0], features[virginica, p1], c='b', marker='o')
plt.scatter(features[~virginica, p0], features[~virginica, p1], c='r', marker='x')

plt.ylim(y0, y1)
plt.xlim(x0, x1)
plt.xlabel(feature_names[p0])
plt.ylabel(feature_names[p1])
# plt.savefig('../1400_02_02.png')
plt.grid()
plt.show()

A simple classification diagram

I would guess that the various parameter settings, and the concepts the author introduces alongside them, may leave you confused at this point. That is what weak fundamentals feel like, and you may already be getting discouraged. I would encourage you to keep reading: the fundamentals will be filled in later.

If you have been doing data analysis and data mining for a long time, learning machine learning will feel easy; the difficulty may lie only in optimizing and implementing the different algorithms.

For those starting from scratch, this may be the most discouraging part. The enthusiasm for machine learning is suddenly doused by the cold water of reality, and it feels as though you are back at the starting point.

Classification

What does a classification model consist of? The book divides it into three parts -

1. The model structure – here we use a threshold on one feature to split the data.
2. The search procedure – here we try as many combinations of features and thresholds as possible.
3. The loss function – we use the loss function to decide which of the possibilities are not too bad (we are not looking for a perfect solution). We can define it in terms of training error or in some other way, for example aiming for the highest possible accuracy. In general, we want the loss function to be minimized. A minimal sketch of all three parts is shown after this list.
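The book's own implementation differs in its details, but a rough sketch of this idea on the Iris data – an exhaustive search over features and thresholds, scored by training accuracy – might look like this:

import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']
species = data['target_names'][data['target']]

# As before, keep only versicolor and virginica.
keep = (species != 'setosa')
features = features[keep]
is_virginica = (species[keep] == 'virginica')

best_acc = -1.0
best_feature, best_threshold = None, None

# Search: for every feature, try every observed value as a candidate threshold.
for fi in range(features.shape[1]):
    for t in features[:, fi]:
        predictions = (features[:, fi] > t)
        acc = np.mean(predictions == is_virginica)   # score by training accuracy
        if acc > best_acc:
            best_acc = acc
            best_feature, best_threshold = fi, t

print('best feature index:', best_feature)
print('best threshold:', best_threshold)
print('training accuracy:', best_acc)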

What is the goal of classification? Simply put, it is to distinguish what an object is from what it is not, which is easy for humans but very hard for machines.

Here the author introduces a very important concept: feature engineering. As the name suggests, it is essentially an engineering activity whose purpose is to extract as much useful information as possible from the raw data in the form of features, and to provide those features to the learning algorithm for building a model.
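This is not an example from the book, but one common, minimal form of feature engineering is deriving a new feature from existing ones; for instance, a rough "petal area" built from petal length and width:

import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
features = data['data']

# Derive a new feature: petal length (index 2) times petal width (index 3),
# a crude proxy for petal area, appended as a fifth column.
petal_area = features[:, 2] * features[:, 3]
features_extended = np.column_stack([features, petal_area])
print(features_extended.shape)   # (150, 5)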

It should be added that classification data is manually labeled data. Preprocessing turns it into vectors the machine can learn from, so that learning can begin. Algorithms that learn to classify from such labeled data are also called supervised learning.

There is one more concept: the vector, which comes from linear algebra. A vector is essentially a column of numbers, and it is also the basic data representation in machine learning.
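In the Iris data, for example, each sample is already a four-dimensional vector of measurements; a tiny sketch, assuming scikit-learn:

from sklearn.datasets import load_iris

data = load_iris()
first_sample = data['data'][0]
print(first_sample)        # e.g. [5.1 3.5 1.4 0.2]: one number per feature
print(first_sample.shape)  # (4,): a vector with four components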

In the chapter on datasets, the author also covers the nearest neighbor classification algorithm and binary (two-class) classification, which I will not introduce here; we will practice them when we meet them on their own.

Summary

Now to wrap up today's content: we covered datasets, visualization, classification algorithms, supervised learning, vectors, and feature engineering, and with a simple classification example clarified the first two of the core steps, reading data and preprocessing data. In truth we are still far from mastering machine learning. If machine learning is taught purely from a hands-on perspective, it easily becomes a fluent tour of tools without any understanding of what lies behind them, running a few plots locally to amuse yourself; if it is taught purely as algorithm analysis, it scares off people who want to get started. In real applications the difficulties you meet go far beyond knowing a few algorithms; the fundamentals, and the way of thinking, may matter more.

Reference resources

1. "Machine Learning System Design"
2. "Data Mining and Data Operation Practice"
3. "Python Data Analysis Basic Course: NumPy Learning Guide (Second Edition)"
4. Dataset and Machine Learning Introduction: https://deeplearning4j.org/en/data-sets-ml
