Getting Started Guide: The Math You Need to Know for Data Science

Want to get into data science but don't know where to start? This getting started guide looks at the math that data science actually uses.

Mathematics is like an octopus: its "tentacles" reach almost every discipline. Some subjects only brush against math, while others are tightly wrapped in its tentacles. Data science belongs to the latter group. If you want to work in data science, you will have to deal with mathematics. If you have a degree in mathematics or another degree that emphasizes quantitative skills, you may be wondering whether everything you learned was necessary. And if you don't have that background, you may be wondering: how much math do you really need to work in data science? In this article, we will explore what data science means and discuss how much math you actually need. Let's start with what "data science" actually means.

 

What data science means is somewhat in the eye of the beholder! At Dataquest, we define data science as the discipline of using data and advanced statistics to make predictions. It is a professional discipline focused on making sense of data that is sometimes messy and inconsistent (although the problems data scientists solve vary). Statistics is the only mathematical field mentioned in this definition, but data science often involves other areas of mathematics as well. Learning statistics is a good start, but data science also uses algorithms to make predictions. These are called machine learning algorithms, and there are hundreds of them.

 

Covering in depth how much math each of those algorithms requires is beyond the scope of this article. Instead, this article discusses the math you need for the following commonly used algorithms:

  • Naive Bayes

  • Linear Regression

  • Logistic Regression

  • K-Means Clustering

  • Decision Tree

 

Now let's look at the math each algorithm actually requires!

 

Naive Bayes classifier

 

Definition: Naive Bayes classifiers are a family of algorithms based on a common principle: the value of a particular feature is independent of the value of any other feature. Naive Bayes lets us predict the probability of an event based on what we know about conditions related to that event. The name comes from Bayes' theorem, which can be written mathematically as follows:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Here A and B are events, and P(B) is not equal to 0. It looks complicated, but we can break it down into three parts:

 

  • P(A | B) is a conditional probability: the probability that event A occurs given that event B has occurred.

  • P(B | A) is a conditional probability: the probability that event B occurs given that event A has occurred.

  • P(A) and P(B) are the probabilities of event A and event B each occurring on its own, without regard to the other.
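
The three parts above can be combined in a few lines of Python. This is only a minimal sketch: the function name `bayes_posterior` and the spam-filter numbers are made up for illustration and do not come from the original article.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A | B) from Bayes' theorem. P(B) must not be 0."""
    if p_b == 0:
        raise ValueError("P(B) must not be 0")
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: A = "email is spam", B = "email contains the word 'offer'".
p_spam = 0.2                # P(A)
p_offer_given_spam = 0.5    # P(B | A)
p_offer_given_ham = 0.05    # P(B | not A)

# P(B) via the law of total probability.
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

p_spam_given_offer = bayes_posterior(p_offer_given_spam, p_spam, p_offer)
print(round(p_spam_given_offer, 3))  # prints 0.714
```

With these assumed numbers, seeing the word raises the probability of spam from 20% to about 71%.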

 

Mathematical knowledge required: If you want to understand the basic principles of the naive Bayes classifier, and of Bayes' theorem in all its uses, one course in probability theory is sufficient.

 

Linear Regression

 

Definition: Linear regression is the most basic type of regression. It helps us understand the relationship between two continuous variables. Simple linear regression takes a set of data points and fits a trend line through them that can be used to predict future values. Linear regression is an example of parametric machine learning. In parametric machine learning, the training process produces a mathematical function that best approximates the patterns found in the training set. You can then use that function to predict future results. In machine learning, such mathematical functions are called models. In the case of linear regression, the model can be expressed as:

$$y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$$

Here a_1, a_2, ..., a_n are the parameter values specific to the data set, x_1, x_2, ..., x_n are the feature columns we choose to use in the model, and y is the target column. The goal of linear regression is to find the parameter values that best describe the relationship between the feature columns and the target column. In other words, it finds the line that best fits the data, so that future results can be predicted from the trend line.

 

To find the optimal parameters of the linear regression model, we want to minimize the model's residual sum of squares. Residuals, often called errors, describe the difference between the predicted values and the true values. The residual sum of squares can be expressed as:

$$RSS = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \cdots + (y_n - \hat{y}_n)^2$$

Here \hat{y} is the predicted value of the target column and y is the actual value.
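
To make the idea of minimizing the residual sum of squares concrete, here is a minimal sketch for the simplest possible case: a single feature and a model of the form y = a_1 * x_1. The helper names and the data points are hypothetical. For this one-parameter model, setting the derivative of the RSS with respect to a_1 to zero gives a_1 = sum(x * y) / sum(x * x).

```python
def residual_sum_of_squares(y_true, y_pred):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))


def fit_single_feature(x, y):
    """Least-squares estimate of a_1 for the model y = a_1 * x_1."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def rss_for(a1, x, y):
    """RSS of the model y = a1 * x on the given data."""
    return residual_sum_of_squares(y, [a1 * xi for xi in x])


# Made-up data that roughly follows y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

a1 = fit_single_feature(x, y)
print(a1, rss_for(a1, x, y))

# Nudging the parameter in either direction can only increase the RSS.
print(rss_for(a1 + 0.1, x, y) > rss_for(a1, x, y))
print(rss_for(a1 - 0.1, x, y) > rss_for(a1, x, y))
```

Models with several features are usually fitted with a linear algebra routine rather than by hand, but the quantity being minimized is the same.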

 

Mathematical knowledge required: If you just want a quick look at linear regression, a basic statistics course is enough. If you want an in-depth conceptual understanding, you will probably want to know how the formula for the residual sum of squares is derived, which is covered in most advanced statistics courses.

 

Logistic Regression

 

Definition: Logistic regression focuses on estimating the probability of an event when the dependent variable is binary (that is, it takes only two values, 0 and 1). Like linear regression, logistic regression is an example of parametric machine learning. The result of training is therefore a mathematical function that best approximates the patterns in the training set. The difference is that a linear regression model outputs a real number, while a logistic regression model outputs a probability.

 

Just as the linear regression algorithm produces a model that is a linear function, the logistic regression algorithm produces a model that is a logistic function. Also called the sigmoid function, it maps any input value to a probability between 0 and 1. The sigmoid function can be expressed as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

So why does the sigmoid function always return a value between 0 and 1? Remember from algebra that raising a number to a negative power is the same as raising its reciprocal to the corresponding positive power. So e^{-x} = 1 / e^{x} is always positive, which makes the denominator 1 + e^{-x} greater than 1 and the whole fraction strictly between 0 and 1.
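
The behaviour of the sigmoid function is easy to check numerically. The sketch below uses only Python's standard `math` module. The branch for negative inputs is an implementation choice of this example (it avoids overflow in `math.exp` for large negative x) and is algebraically the same function.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic (sigmoid) function 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, multiply numerator and denominator by e^x
    # so that math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)


for value in (-10, -1, 0, 1, 10):
    print(value, sigmoid(value))

print(sigmoid(0))  # prints 0.5
```

Every printed value lies between 0 and 1, and an input of 0 maps to exactly 0.5, the point where the two outcomes are equally likely.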

 

Mathematical knowledge required: We have discussed exponents and probability here, so you need a solid understanding of algebra and probability to understand how the logistic algorithm works. If you want to understand the concepts in depth, I suggest you study probability theory along with discrete mathematics or real analysis.

 

K-Means Clustering

 

Definition: K-means clustering is an unsupervised machine learning algorithm used to categorize unlabeled data (i.e., data without defined categories or groups). The algorithm works by finding clusters in the data, where the number of clusters is represented by the variable k. It then iterates through the data, assigning each data point to one of the k clusters based on its features. K-means clustering relies throughout on the notion of distance to "assign" data points to clusters. Distance here means the amount of space between two given items. In mathematics, a function that describes the distance between any two elements of a set is called a distance function or metric. There are two common types: Euclidean distance and Manhattan distance. The standard Euclidean distance is defined as follows:

$$d\big((x_1, y_1), (x_2, y_2)\big) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

Here (x_1, y_1) and (x_2, y_2) are the coordinates of two points on the Cartesian plane. Although Euclidean distance is very widely used, it does not work well in some cases. Suppose you are walking through a big city: if a huge building blocks your path, saying "I am 6.5 units from my destination" is meaningless. To solve this problem, we can use the Manhattan distance. The Manhattan distance formula is as follows:

$$d\big((x_1, y_1), (x_2, y_2)\big) = |x_1 - x_2| + |y_1 - y_2|$$

As before, (x_1, y_1) and (x_2, y_2) are the coordinates of two points on the Cartesian plane.
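
Both distance functions, and the way K-means uses them to "assign" a point to a cluster, can be sketched in a few lines. The function names and the sample coordinates are illustrative assumptions, and only the assignment step of K-means is shown, not the full algorithm.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points on the plane."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)


def manhattan(p, q):
    """Distance when movement is restricted to horizontal and vertical steps."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def nearest_centroid(point, centroids, distance=euclidean):
    """Index of the centroid closest to `point` (the K-means assignment step)."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))


a, b = (0, 0), (3, 4)
print(euclidean(a, b))  # prints 5.0
print(manhattan(a, b))  # prints 7

centroids = [(0, 0), (5, 5)]  # hypothetical cluster centers, k = 2
print(nearest_centroid((1, 1), centroids))                      # prints 0
print(nearest_centroid((4, 3), centroids, distance=manhattan))  # prints 1
```

The full algorithm alternates this assignment step with recomputing each centroid as the mean of the points assigned to it.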

 

Mathematical knowledge required: In fact, you only need to know addition and subtraction and understand the basics of algebra to grasp the distance formulas. But to understand the geometry underlying each type of metric, I suggest taking a geometry course that covers both Euclidean and non-Euclidean geometry. For a deeper understanding of metrics and metric spaces, I would read up on mathematical analysis and take an elective course in real analysis.

 

Decision Tree

 

Definition: A decision tree is a flowchart-like tree structure that uses branches to describe every possible outcome of a decision. Each node in the tree represents a test on a particular variable, and each branch is an outcome of that test. Decision trees rely on information theory to determine how they are constructed. In information theory, the more one already knows about a topic, the less new information there is left to learn about it. One of the key measures in information theory is entropy. Entropy quantifies the amount of uncertainty in a given variable. Entropy can be expressed as:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$

In the above formula, P(x_i) is the probability of the random event x_i occurring. The base b of the logarithm can be any positive real number other than 1; common values are 2, e (approximately 2.718), and 10. The symbol that looks like a fancy "S" is the summation symbol: it means to keep adding up the function that follows it, with the number of terms determined by the lower and upper limits of the summation. After calculating entropy, we can start constructing the decision tree by using information gain, which tells us which split reduces entropy the most. The information gain formula is as follows:

$$IG(T, A) = H(T) - H(T \mid A)$$

Here H(T) is the entropy of the target column T, and H(T | A) is the entropy that remains after splitting on column A.

Information gain measures how many "bits" of information can be gained. In the case of a decision tree, we can calculate the information gain of every column in the data set to find the column that gives us the greatest information gain, and then split on that column.
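
The column-selection step described above can be sketched as follows. This is an illustrative example rather than a full decision tree implementation: the tiny weather-style data set, the column names, and the helper functions are all made up for this sketch, and entropy is measured in bits (base 2).

```python
import math
from collections import Counter


def entropy(labels, base=2):
    """Entropy of a list of labels, using observed frequencies as probabilities."""
    total = len(labels)
    return -sum((count / total) * math.log(count / total, base)
                for count in Counter(labels).values())


def information_gain(rows, column, target):
    """Entropy of the target minus the weighted entropy after splitting on `column`."""
    parent_entropy = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[target])
    remaining = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent_entropy - remaining


# Hypothetical data set: should we play outside?
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]

gains = {col: information_gain(rows, col, "play") for col in ("outlook", "windy")}
best_column = max(gains, key=gains.get)
print(gains, best_column)
```

In this made-up data set, "outlook" separates the target perfectly and "windy" tells us nothing, so the tree would split on "outlook" first.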

 

Mathematical knowledge required: Basic algebra and probability are all you need for a preliminary understanding of decision trees. If you want a deep conceptual understanding of probability and logarithms, I recommend courses in probability theory and algebra.

 

Final Thoughts

 

If you are still in school, I highly recommend taking some elective courses in pure and applied mathematics. They can certainly feel intimidating at times, but the good news is that you will be better equipped when you encounter these algorithms and need to know how to use them well. If you are not currently in school, I suggest going to your nearest bookstore and reading up on the topics mentioned in this article. If you can find books that cover probability, statistics, and linear algebra, I strongly suggest choosing ones that treat these topics in depth, so that you truly understand the principles behind the machine learning algorithms covered in this article and those that are not.

 

Original link: https://www.dataquest.io/blog/math-in-data-science/

Origin blog.csdn.net/sinat_26811377/article/details/104584583