The Connections and Differences Between Statistics and Machine Learning

1. Description

        I'm honestly tired of hearing this debate almost every day, both on social media and at my university. It usually comes bundled with some vague statement that explains nothing, and both sides are guilty of this. I hope that by the end of this article you will have a more informed position on these somewhat nebulous terms.

2. Argument

        Contrary to popular belief, machine learning has been around for decades. It was initially shunned because of its enormous computational demands and the limits of the computing power available at the time. However, thanks to the flood of data generated by the information explosion, machine learning has seen a renaissance in recent years.

        So, if machine learning and statistics are synonymous, why haven't we seen every university's statistics department close or transition to a "machine learning" department? Because they are different!

        I often hear a few vague statements on this topic, the most common being this:

        "The main difference between machine learning and statistics is their purpose. Machine learning models are designed to achieve the most accurate predictions. Statistical models are designed to infer relationships between variables.

        While this is technically correct, it doesn't give a particularly clear or satisfying answer. The main difference between machine learning and statistics is indeed their purpose. However, to say that machine learning is about accurate predictions and that statistical models are designed for inference is almost a meaningless statement unless you are well versed in the concepts.

        First, we must understand that statistics and statistical models are not the same. Statistics is the mathematical study of data. You can't do statistics unless you have data. Statistical models are models of data that are used to infer certain relationships in the data or to create models capable of predicting future values. Often, the two go hand in hand.

        So we really need to discuss two things: first, how statistics differs from machine learning, and second, how statistical models differ from machine learning models.

        To make this clearer, there are many statistical models that can make predictions, but predictive accuracy is not their strong suit.

        Likewise, machine learning models offer varying degrees of interpretability, from highly interpretable lasso regression to inscrutable neural networks, but they often sacrifice interpretability for predictive power.

        From a high-level perspective, this is a great answer. That's enough for most people. However, in some cases, this explanation can lead us to misunderstand the difference between machine learning and statistical modeling. Let's look at an example of linear regression.

3. Statistical Models and Machine Learning — Linear Regression Example

        In my opinion, it is the similarity of the methods used in statistical modeling and machine learning that leads people to think they are the same thing. This is understandable, but simply not true.

        The most obvious example is linear regression, which is probably the main cause of this misunderstanding. Linear regression is a statistical method: we can train a linear regressor and obtain the same result as a statistical regression model that aims to minimize the squared error between the data points.

        So in the machine learning case we do something called "training" the model: we use a subset of the data to fit the model, and we do not know how well it will perform until we "test" it on additional data that was not present during training, called the test set. In this case, the goal of machine learning is to achieve the best performance on the test set.
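
        As a minimal sketch of that workflow (using scikit-learn on synthetic data; the dataset and variable names are illustrative, not from the article):

```python
# Machine-learning view: train on one subset, judge the model only on held-out data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # a single feature
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)   # linear signal plus Gaussian noise

# Hold out data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# The model is evaluated purely by its performance on the unseen test set.
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```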

        For the statistical model, we find a line that minimizes the mean squared error across all of the data, assuming the data were generated by a linear process plus some random noise, which is usually Gaussian. No training set and no test set are required. In many cases, especially in research (such as the sensor example below), the point of the model is to characterize the relationship between the data and the outcome variable, not to make predictions about future data. We call this procedure statistical inference rather than prediction. We can still use the model to make predictions, and that may even be your main purpose, but the model is evaluated not on a test set but on the significance and robustness of its parameters.
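
        And a corresponding sketch of the statistical-modeling view (using statsmodels on the same kind of synthetic data): the model is fit on all of the data and judged through its parameters rather than through a test set.

```python
# Statistical-modeling view: fit on all the data, evaluate the parameters.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 1, size=200)   # assumed linear signal + Gaussian noise

X = sm.add_constant(x)           # add an intercept term
result = sm.OLS(y, X).fit()      # ordinary least squares over all the data

print(result.params)             # estimated intercept and slope
print(result.conf_int())         # confidence intervals for each parameter
print(result.pvalues)            # significance of each parameter
```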

        The goal of (supervised) machine learning is to obtain a model that makes repeatable predictions. We usually don't care whether the model is interpretable, although I personally recommend always checking that the model's predictions actually make sense. Machine learning is all about results; it is likely to thrive in a company where your worth is measured solely by your performance. Statistical modeling, by contrast, is more about finding relationships between variables and the significance of those relationships, while also catering for prediction.

        To give a concrete example of the difference between these two procedures, here is a personal one. By day I work as an environmental scientist dealing mostly with sensor data. If I am trying to prove that a sensor responds to a certain stimulus (such as a gas concentration), I use a statistical model to determine whether the signal response is statistically significant. I try to understand this relationship and test its repeatability so that I can accurately characterize the sensor response and draw inferences from the data. Some things I might test are whether the response is actually linear, and whether the response can be attributed to the gas concentration rather than to random noise in the sensor.

        In contrast, I also have access to an array of 20 different sensors that I can use to try to predict the response of my newly characterized sensor. This may seem a little strange if you don't know much about sensors, but it is currently an important area of environmental science. A model with 20 different variables predicting the outcome of my sensor is clearly about prediction, and I don't expect it to be particularly interpretable. Because of the nonlinearities introduced by chemical kinetics and the relationship between physical variables and gas concentrations, this model may need to be something more esoteric, such as a neural network. I would like the model to make sense, but as long as it makes accurate predictions, I will be happy.

        If I were trying to prove that a relationship between variables in my data is statistically significant to a certain degree, so that I could publish it in a scientific paper, I would use a statistical model rather than machine learning. This is because I care more about the relationship between the variables than about making predictions. Making predictions may still be important, but most machine learning algorithms lack interpretability, which makes it hard to prove relationships in the data (this is actually a big problem in academic research these days: researchers use algorithms they don't understand and draw spurious inferences).

Source: Analytics Vidhya

        It should be clear that the two approaches have different goals, although similar means are used to achieve them. The evaluation of a machine learning algorithm uses a test set to verify its accuracy. However, for statistical models, analysis of regression parameters with confidence intervals, significance tests, and other tests can be used to assess the validity of the model. Since these methods produce the same result, it's easy to see why people would assume they are the same.

4. Statistics and Machine Learning

        I think this misconception is nicely encapsulated in this ostensibly tongue-in-cheek 10-year challenge, comparing statistics and machine learning.

        However, it would be unreasonable to conflate the two terms based solely on the fact that both exploit the same basic concepts of probability. For example, if we declare that machine learning is just glorified statistics on this basis, we could just as well make the following statements:

        Physics is just glorified mathematics.

        Zoology is just glorified stamp collecting.

        Architecture is just glorified sandcastle building.

        These statements (especially the last one) are pretty ridiculous, and they all rest on conflating terms that are built on similar ideas (pun intended for the architecture example).

        In fact, physics is built on mathematics: it is the application of mathematics to understand physical phenomena that exist in reality. Physics also includes aspects of statistics, and the modern form of statistics is usually constructed from a framework consisting of Zermelo-Fraenkel set theory combined with measure theory to produce probability spaces. The two have a lot in common because they come from similar origins and apply similar ideas to reach logical conclusions. Likewise, architecture and sandcastle building probably have a lot in common - although I'm not an architect, so I can't give a sensible explanation - but they are clearly not the same.

        To give you an idea of the scope of this debate, there is actually a paper published in Nature Methods outlining the difference between statistics and machine learning. The idea might seem ludicrous, but it's kind of sad that this level of discussion is necessary.

        Before we move on, I'll quickly clear up two other common misconceptions associated with machine learning and statistics: artificial intelligence is not the same as machine learning, and data science is not the same as statistics. These are fairly uncontroversial points, so this will be quick.

        Data science is essentially computational and statistical methods applied to data, which may be small or large data sets. It also includes things like exploratory data analysis, where the data are examined and visualized to help the scientist understand them better and draw inferences from them. Data science further includes things like data wrangling and preprocessing, and so it involves some computer science, since it requires coding and setting up connections and pipelines between databases, web servers, and so on.
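
        As a tiny illustrative sketch of the exploratory side of that work (the file name below is a placeholder, not a real dataset from the article):

```python
# Exploratory data analysis: load, summarize, and visualize a dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sensor_readings.csv")   # hypothetical file
print(df.describe())                      # quick numeric summary
print(df.isna().sum())                    # missing values: where wrangling is needed

df.hist(figsize=(8, 6))                   # distribution of each numeric column
plt.tight_layout()
plt.show()
```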

        You don't necessarily need a computer to do statistics, but you can't really do data science without a computer. You can see again that while data science uses statistics, they are clearly not the same.

        Likewise, machine learning is not the same as artificial intelligence. In fact, machine learning is a subset of artificial intelligence. This is very obvious because we are teaching ("training") a machine to make generalizable inferences about a certain type of data based on previous data.

5. Machine learning is built on statistics

        Before we discuss how statistics and machine learning are different, let's discuss the similarities. We have already touched on this in the previous sections.

        Machine learning is built on a statistical framework. This should be obvious, since machine learning involves data, and data must be described using a statistical framework. However, thermodynamics, which emerges from statistical mechanics applied to large numbers of particles, is also built on a statistical framework. The concept of pressure is actually a statistic, and so is temperature. If you think this sounds ridiculous, fair enough, but it is true. That is why you cannot describe the temperature or pressure of a single molecule; the notion makes no sense. Temperature is a manifestation of the average energy produced by molecular collisions, so it is only for a large enough number of molecules that it makes sense to talk about the temperature of a house or of the outdoors.
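
        To make that claim concrete (a standard result from kinetic theory, not from the article itself): for an ideal gas, the temperature is just a rescaled average of the molecules' translational kinetic energy, $\langle E_k \rangle = \tfrac{3}{2} k_B T$, and such an average only carries meaning when it is taken over a large population of molecules.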

        Would you then concede that thermodynamics and statistics are the same? No; thermodynamics uses statistics to help us understand the interaction of work and heat in the form of transport phenomena.

        In fact, thermodynamics is built on much more than just statistics. Likewise, machine learning draws on a host of other areas of mathematics and computer science, for example:

  • ML theory from fields such as mathematics and statistics
  • ML algorithms from optimization, matrix algebra, calculus, and more
  • ML implementations drawing on computer science and engineering concepts (e.g. kernel tricks, feature hashing)

        When one starts coding in Python, pulls up the sklearn library, and starts using these algorithms, many of these concepts are abstracted away, so it becomes hard to see the differences. In this case, the abstraction leads to a kind of ignorance of what machine learning actually involves.

6. Statistical Learning Theory - The Statistical Basis of Machine Learning

        The main difference between statistics and machine learning is that statistics is based entirely on probability spaces. You can derive the whole of statistics from set theory, which describes how we can group numbers into categories called sets, and then impose a measure on such a set to ensure that everything sums to 1. We call this a probability space.

        Statistics makes no assumptions about the universe other than these notions of sets and measures. That is why, when we specify a probability space in very strict mathematical terms, we specify three things.

        A probability space, which we denote (Ω, F, P), consists of three parts (formalized just after the list):

  1. A sample space, Ω, which is the set of all possible outcomes.
  2. A set of events, F, where each event is a collection of zero or more outcomes.
  3. An assignment of probabilities to the events, P; that is, a function from events to probabilities.
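
        Written out formally (the standard Kolmogorov axioms; a textbook statement rather than something from the article), the requirement that "everything adds up to 1" becomes:

$$P(A) \ge 0 \ \text{ for all } A \in F, \qquad P(\Omega) = 1, \qquad P\Big(\bigcup_i A_i\Big) = \sum_i P(A_i) \ \text{ for pairwise disjoint } A_i \in F.$$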

        Machine learning is based on statistical learning theory, which is still based on the axiomatic notion of probability spaces. The theory was developed in the 1960s and expanded upon traditional statistics.

        There are several categories of machine learning, so I'll just focus on supervised learning here because it's the easiest to explain (although it's still somewhat esoteric because it's buried in mathematics).

        Statistical learning theory for supervised learning tells us that we have a set of n data points, which we denote S = {(x_i, y_i)}. Each point is described by some values we call features, given by x_i, and these features are mapped by some function to give the corresponding value y_i.

        It says that we know we have this data, and our goal is to find the function that maps the x values to the y values. We call the set of all possible functions that could describe this mapping the hypothesis space.

        In order to find this function, we have to give the algorithm some way of "learning" how best to tackle the problem. This is provided by something called a loss function. So, for each hypothesis (candidate function) we have, we evaluate how it performs by looking at its expected risk over all of the data.

        The expected risk is essentially the loss function integrated against the probability distribution of the data. If we knew the joint probability distribution of the mapping, finding the best function would be easy. However, this distribution is usually unknown, so our best option is to pick a candidate function and evaluate the loss empirically on the data we actually have. We call this the empirical risk.
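
        In symbols (standard statistical learning theory notation, matching the description above, with L denoting the loss function), the expected risk of a hypothesis f and its empirical counterpart over the n observed points are:

$$R(f) = \int L\big(f(x), y\big)\, dP(x, y), \qquad R_{\mathrm{emp}}(f) = \frac{1}{n}\sum_{i=1}^{n} L\big(f(x_i), y_i\big).$$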

        We can then compare the different functions and look for the hypothesis that gives us the minimum expected risk, i.e. the hypothesis that attains the smallest value (called the infimum) over all hypotheses on the data.

        However, the algorithm tends to cheat in order to minimize its loss function by overfitting the data. That's why after learning a function based on the training set data, the function is validated on the test data set, which was not present in the training set.

        We have just seen how the very nature of machine learning introduces the problem of overfitting and justifies the need for training and test sets. This is not an inherent feature of statistics, because there we are not trying to minimize an empirical risk.

        A learning algorithm that chooses the function minimizing the empirical risk is said to perform empirical risk minimization.
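
        In symbols (standard notation, with H denoting the hypothesis space introduced earlier), empirical risk minimization selects

$$\hat{f} = \underset{f \in H}{\arg\min}\ R_{\mathrm{emp}}(f) = \underset{f \in H}{\arg\min}\ \frac{1}{n}\sum_{i=1}^{n} L\big(f(x_i), y_i\big).$$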

7. Example

        Take the simple case of linear regression as an example. In the traditional sense, we try to minimize the error between a line and the data in order to find a function that can describe the data. In this case we typically use the mean squared error; we square the errors so that positive and negative errors do not cancel each other out. We can then solve for the regression coefficients in closed form.
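
        Concretely (standard ordinary-least-squares algebra, not specific to this article), with design matrix X and response vector y, minimizing the mean squared error gives the closed-form coefficients

$$\hat{\beta} = \underset{\beta}{\arg\min}\ \frac{1}{n}\lVert y - X\beta\rVert^2 = (X^{\top} X)^{-1} X^{\top} y,$$

        provided that X^T X is invertible.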

        As it happens, if we take the loss function to be the mean squared error and perform empirical risk minimization as prescribed by statistical learning theory, we end up with the same result as a traditional linear regression analysis.
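
        A minimal sketch of this equivalence (synthetic data and illustrative parameter choices): gradient descent on the mean squared error converges to the same coefficients as the closed-form least-squares solution.

```python
# Empirical risk minimization with squared-error loss vs. the closed-form solution.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.uniform(0, 10, size=200)])  # intercept + feature
y = X @ np.array([1.0, 3.0]) + rng.normal(0, 1, size=200)

# Closed-form "traditional" solution: beta = (X^T X)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Empirical risk minimization by gradient descent on the mean squared error.
beta = np.zeros(2)
learning_rate = 0.01
for _ in range(20000):
    gradient = 2 / len(y) * X.T @ (X @ beta - y)
    beta -= learning_rate * gradient

print("closed form:     ", beta_closed)
print("gradient descent:", beta)   # agrees to several decimal places
```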

        This is simply because the two cases are equivalent, just as performing maximum likelihood estimation on the same data (under the Gaussian noise assumption) gives the same result. Maximum likelihood takes a different route to the same goal, but nobody would argue that maximum likelihood is the same thing as linear regression. The simplest case clearly does not help to differentiate these methods.

        Another point to emphasize here is that in traditional statistical methods, there is no concept of training and testing sets, but we do use metrics to help us check the performance of our models. Therefore, the evaluation procedure is different, but both methods are able to provide us with statistically robust results.

        Another point is that the traditional statistical approach here hands us the optimal solution directly, because the solution has a closed form. It does not test any other hypotheses before converging to a solution. The machine learning method, by contrast, tries a sequence of different models and converges to a final hypothesis, which in this case coincides with the result of the regression algorithm.

        If we use a different loss function, the results will no longer coincide. For example, if we use the hinge loss (which is not differentiable everywhere, so standard gradient descent cannot be applied and other techniques such as proximal gradient descent are required), then the results will not be the same.
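
        For reference, the standard hinge loss on a label y ∈ {−1, +1} and prediction f(x) is

$$L\big(y, f(x)\big) = \max\big(0,\ 1 - y\, f(x)\big),$$

        which has a kink at y f(x) = 1 and is therefore not differentiable there, hence the need for subgradient or proximal methods.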

        The final comparison concerns model bias. One can ask a machine learning algorithm to test linear models, as well as polynomial models, exponential models, and so on, to see which hypothesis fits the data better given our chosen loss function. This amounts to enlarging the hypothesis space. In the traditional statistical sense, we choose one model and can evaluate its accuracy, but we cannot automatically have it select the best of 100 different models. Obviously there is always some bias in the model, stemming from the initial choice of algorithm. This is unavoidable, because finding an arbitrary function that best fits a given dataset is an NP-hard problem.
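
        A minimal sketch of that idea (synthetic data; the model degrees and the train/test split are illustrative choices): fit polynomial hypotheses of increasing degree and compare them on held-out data.

```python
# Enlarging the hypothesis space: compare polynomial models by test error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)   # nonlinear signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 2, 3, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree}: test MSE = {err:.4f}")
```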

8. So which one is better?

        This is actually a stupid question. As far as statistics vs. machine learning goes, machine learning would not exist without statistics, but machine learning is very useful in modern times due to the vast amount of data humans have access to since the information explosion.

        Comparing machine learning and statistical models is a bit difficult. Which you use largely depends on what your purpose is. If you just want to create an algorithm that can predict housing prices with high accuracy, or use data to determine whether someone is likely to contract certain types of diseases, machine learning may be a better approach. Statistical models may be a better approach if you are trying to demonstrate relationships between variables or make inferences from data.

Source:  Stack Exchange

        If you don't have a strong statistics background, you can still learn machine learning and take advantage of it; the abstractions provided by machine learning libraries make them easy to use as a non-expert. But you still need some understanding of the underlying statistical ideas in order to keep your models from overfitting and producing spurious inferences.

9. Where can I learn more?

        If you're interested in delving into statistical learning theory, there are many books and university courses on the topic. Here are some lecture courses I recommend:

9.520/6.860, Fall 2018 (mit.edu)

ECE 543: Statistical Learning Theory (Spring 2018) (illinois.edu)

Matthew Stewart

Origin blog.csdn.net/gongdiwudu/article/details/132164292