Andrew Ng: The six core algorithms of machine learning!

Datawhale practical content

Editor: Johngo | Source: AI Technology Review

Recently, Andrew Ng published a post in his AI newsletter "The Batch" summarizing the historical origins of several foundational algorithms in machine learning.

At the beginning of the article, Ng recalled a decision from his own research:

Many years ago, on one project, he had to choose between a neural network and a decision tree learning algorithm. Given the compute budget, he chose the neural network, setting aside boosted decision trees for a long time.

It was the wrong decision. "Fortunately, my team quickly revisited my choice, and the project succeeded," Ng said.

The experience, he reflected, shows how important it is to keep learning and refreshing the fundamentals. Like other fields of technology, machine learning keeps evolving as researchers and research results multiply. Yet some basic algorithms and core ideas stand the test of time:

  • Algorithms: linear and logistic regression, decision trees, etc.

  • Concepts: regularization, optimizing loss functions, bias/variance, etc.

In Ng's view, these algorithms and concepts underpin many machine learning models, from house-price predictors to text-to-image generators such as DALL·E.

In the latest article, Ng and his team trace the origins, uses, and evolution of six basic algorithms in more detail.

The six algorithms are: linear regression, logistic regression, gradient descent, neural networks, decision trees, and k-means clustering.

1

Linear Regression: Straight & Narrow

Linear regression is a key statistical method in machine learning, but credit for it did not come without a fight. Two brilliant mathematicians both claimed it, and 200 years later the question is still unresolved. The long-running dispute is a testament not only to the algorithm's remarkable utility but also to its essential simplicity.

So whose algorithm is linear regression?

In 1805, the French mathematician Adrien-Marie Legendre published the method of fitting a line to a set of points while trying to predict the position of a comet (celestial navigation was the science most valuable to global commerce at the time, much as artificial intelligence is today).


Caption: A sketch portrait of Adrien-Marie Legendre

Four years later, 24-year-old German prodigy Carl Friedrich Gauss insisted he had been using it since 1795, but thought it was too trivial to write about. Gauss' claim prompted Legendre to publish an anonymous article stating that "a very eminent geometer did not hesitate to adopt this method."


Caption: Carl Friedrich Gauss

Slope and bias: Linear regression is useful when the relationship between an outcome and the variables that influence it follows a straight line. For example, a car's fuel consumption is linearly related to its weight.

  • The relationship between a car's fuel consumption y and its weight x depends on the slope w of the line (how much fuel consumption rises with weight) and the bias term b (fuel consumption at zero weight): y=w*x+b.

  • During training, given a car's weight, the algorithm predicts the expected fuel consumption and compares it with the actual fuel consumption. It then minimizes the squared difference, typically via ordinary least squares, to hone the values of w and b (a code sketch follows this list).

  • Taking the car's drag into account as well yields more accurate predictions; the additional variable extends the line to a plane. In this way, linear regression can accommodate any number of variables/dimensions.
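
As a minimal sketch of the fitting step described above (the weights and fuel figures here are made up purely for illustration, not taken from the article), ordinary least squares can be written in a few lines of NumPy:

```python
import numpy as np

# Illustrative (made-up) data: car weight in tonnes, fuel consumption in L/100 km
x = np.array([1.0, 1.2, 1.5, 1.8, 2.0])
y = np.array([6.1, 6.8, 7.9, 9.2, 10.1])

# Ordinary least squares for y = w*x + b:
# stack a column of ones so the bias b is learned alongside the slope w
X = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"slope w = {w:.2f}, bias b = {b:.2f}")
print("prediction for a 1.6 t car:", w * 1.6 + b)
```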

Two steps to popularity: The algorithm immediately helped navigators track the stars, and later helped biologists (notably Charles Darwin's cousin Francis Galton) identify heritable traits in plants and animals. Those two developments unlocked linear regression's broad potential. In 1922, the British statisticians Ronald Fisher and Karl Pearson showed how linear regression fits into the general statistical framework of correlation and distributions, making it useful throughout the sciences. And, nearly a century later, the advent of computers provided the data and processing power to exploit it to a far greater extent.

Dealing with ambiguity: Of course, data are never measured perfectly, and some variables matter more than others. These facts of life have inspired more sophisticated variants. For example, linear regression with regularization (also known as ridge regression) encourages the model not to depend too heavily on any one variable, or rather, to rely evenly on the most important variables. If simplicity is the goal, another form of regularization (L1 instead of L2) produces the lasso, which encourages as many coefficients as possible to be zero. In other words, it learns to select the variables with high predictive power and ignores the rest. Elastic nets combine both types of regularization and are useful when data is sparse or features appear correlated. A sketch of these variants follows below.
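
As a rough illustration of the regularized variants just mentioned, here is a small scikit-learn sketch; the data is randomly generated for demonstration only and the alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Randomly generated demo data: 50 samples, 10 features, only 3 of which matter
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.1, size=50)

# L2 regularization spreads weight evenly across important features
ridge = Ridge(alpha=1.0).fit(X, y)
# L1 regularization drives many coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
# Elastic net mixes L1 and L2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
print("elastic net coefficients:", np.round(enet.coef_, 2))
```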

In every neuron: The simple version is still very useful today. The most common type of neuron in a neural network is a linear regression model followed by a nonlinear activation function, making linear regression a fundamental building block of deep learning.

2

Logistic Regression: Following a Curve

There was a time when logistic regression was used to classify just one thing: if you drank a bottle of poison, were you likely to be labeled "alive" or "dead"? Times have changed, and today, not only is calling emergency services a better answer to this question, but logistic regression is at the heart of deep learning.

Poison control:

The logistic function dates back to the 1830s, when Belgian statistician PF Verhulst invented it to describe population dynamics: Over time, the initial explosion of exponential growth flattens out as it consumes available resources, resulting in the characteristic logistic curve. More than a century later, American statistician EB Wilson and his student Jane Worcester devised logistic regression to calculate how much of a given hazardous substance is deadly.


Caption: PF Verhulst

Fitting the function: Logistic regression fits a logistic function to a dataset in order to predict, for a given event (e.g., ingesting strychnine), the probability of a particular outcome (e.g., untimely death).

  • Training adjusts the curve's center horizontally and the steepness of its middle vertically to minimize the error between the function's outputs and the data.

  • Shifting the center to the right or left means it takes more or less poison to kill the average person. A steep slope implies certainty: below the halfway point, most people survive; above it, most die. A gentle slope is more forgiving: below the middle of the curve, more than half survive; above it, less than half do.

  • Set a threshold between one outcome and the other, say 0.5, and the curve becomes a classifier. Enter a dose into the model and you'll know whether to plan a party or a funeral.
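
A minimal sketch of this fit-then-threshold idea, using scikit-learn on a toy, made-up dose/outcome dataset (the numbers are invented for illustration, not from the article):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: dose (mg) and outcome (0 = survived, 1 = died)
dose = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
outcome = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Fit the logistic curve to the data
model = LogisticRegression().fit(dose, outcome)

# Probability of death for a 4.5 mg dose, and the 0.5-threshold classification
print("P(death | 4.5 mg):", model.predict_proba([[4.5]])[0, 1])
print("predicted label:", model.predict([[4.5]])[0])
```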

More outcomes: Verhulst's work found the probabilities of binary outcomes, ignoring further possibilities such as which side of the afterlife a poisoning victim might land on. His successors extended the algorithm:

  • In the late 1960s, the British statistician David Cox and the Dutch statistician Henri Theil, working independently, extended logistic regression to situations with more than two possible outcomes (see the sketch after this list).

  • Further work produced ordinal logistic regression, where the outcome is an ordinal value.

  • To deal with sparse or high-dimensional data, logistic regression can utilize the same regularization techniques as linear regression.
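
As a rough sketch of the multi-outcome extension (multinomial logistic regression), here is a scikit-learn example on randomly generated demo data, not data from the article:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Randomly generated demo data with three classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)

# scikit-learn's LogisticRegression handles more than two outcomes
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Class probabilities for a new point
print(clf.predict_proba([[2.5, 2.5]]).round(3))
```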


Caption: David Cox

Multifunctional curves: Logistic functions describe a wide range of phenomena with fair accuracy, so logistic regression provides useful baseline predictions in many situations. In medicine, it estimates mortality and disease risk. In political science, it predicts election winners and losers. In economics, it forecasts business prospects. More importantly, it drives a subset of neurons (in which the nonlinearity is the sigmoid function) in a wide variety of neural networks.

3

Gradient Descent: Everything Is Downhill

Imagine hiking in the mountains after dusk and finding you can't see anything below your feet. Your phone battery is dead, so you can't use a GPS app to find your way home. Gradient descent might find you the fastest path down. Just be careful not to walk off a cliff.

Stars and carpets: Gradient descent is good for more than descending steep terrain. In 1847, the French mathematician Augustin-Louis Cauchy invented the algorithm to approximate the orbits of stars. Sixty years later, his compatriot Jacques Hadamard independently developed it to describe the deformation of thin, flexible objects such as carpets, which might make a hike on hands and knees a little easier. In machine learning, however, its most common use is to find the lowest point of a learning algorithm's loss function.


Caption: Augustin-Louis Cauchy

Climbing down: A trained neural network provides a function that computes a desired output given an input. One way to train the network is to minimize the loss, or error, in its output by iteratively computing the difference between the actual and desired outputs and then adjusting the network's parameter values to shrink that difference. Gradient descent does the shrinking, minimizing the function that computes the loss. The network's parameter values correspond to a position on the terrain, and the loss is the current altitude. As you descend, you improve the network's ability to compute outputs close to the desired ones. Visibility is limited because, in a typical supervised learning setting, the algorithm relies only on the network's parameter values and the gradient, or slope, of the loss function: that is, your position on the hill and the slope under your feet.

  • The basic approach is to move in the direction where the terrain descends steepest. The trick is to calibrate your stride. Too small a stride, and it takes a long time to make progress; too big, and you jump into uncharted territory, possibly uphill rather than downhill.

  • Given the current position, the algorithm estimates the direction of steepest descent by computing the gradient of the loss function. The gradient points uphill, so the algorithm steps in the opposite direction by subtracting a fraction of the gradient. A fraction α, called the learning rate, determines the step size before the gradient is measured again (a code sketch follows this list).

  • Repeat these few steps and hopefully you will reach a valley. Congratulations!
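
A minimal sketch of this loop, applied to a simple made-up convex function rather than the loss of any particular model:

```python
import numpy as np

def loss(params):
    # A made-up convex "terrain": minimum at (3, -2)
    return (params[0] - 3.0) ** 2 + (params[1] + 2.0) ** 2

def gradient(params):
    # Analytic gradient of the loss above
    return np.array([2.0 * (params[0] - 3.0), 2.0 * (params[1] + 2.0)])

params = np.array([0.0, 0.0])   # starting position on the terrain
alpha = 0.1                     # learning rate (step-size fraction)

for step in range(100):
    params = params - alpha * gradient(params)   # step against the gradient

print("final position:", params.round(3), "loss:", round(loss(params), 6))
```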

Stuck in a valley: Too bad your phone died, because the algorithm may not have pushed you to the bottom of a convex mountain. You may be stuck in a non-convex landscape of multiple valleys (local minima), peaks (local maxima), saddle points, and plateaus. In fact, tasks such as image recognition, text generation, and speech recognition are non-convex, and many variants of gradient descent have emerged to handle this. For example, the algorithm may carry momentum that helps it coast over small dips and bumps, making it more likely to reach a bottom. Researchers have engineered so many variants that it can seem as if there are as many optimizers as local minima. Fortunately, local and global minima tend to be roughly equal.
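
As a rough sketch of the momentum idea, here is the plain update from the previous snippet modified to carry a velocity term (the coefficients are illustrative, not the defaults of any specific published optimizer):

```python
import numpy as np

def gradient(params):
    # Same made-up gradient as before: minimum at (3, -2)
    return np.array([2.0 * (params[0] - 3.0), 2.0 * (params[1] + 2.0)])

params = np.array([0.0, 0.0])
velocity = np.zeros(2)
alpha, beta = 0.1, 0.9   # learning rate and momentum coefficient

for step in range(100):
    # Velocity accumulates past gradients, helping the step coast over small bumps
    velocity = beta * velocity - alpha * gradient(params)
    params = params + velocity

print("final position:", params.round(3))
```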

Optimal optimizers: Gradient descent is the clear choice for finding the minimum of any function. In cases where an exact solution can be computed directly (say, a linear regression task with a huge number of variables), it can approximate one, often faster and more cheaply. But it really comes into its own in complex, nonlinear tasks. Armed with gradient descent and a sense of adventure, you might just make it out of the mountains in time for dinner.

4

Neural Networks: Finding Functions

Let's get this out of the way first: the brain is not a set of graphics processing units, and if it were, it would run far more complex software than a typical artificial neural network. Yet neural networks were inspired by the brain's structure: layers of interconnected neurons, each computing its own output based on the states of its neighbors, with the resulting cascade of activity forming a thought, or recognizing a picture of a cat.

From biological to artificial: The idea that the brain learns through interactions between neurons dates back to 1873, but it wasn't until 1943 that the American neuroscientists Warren McCulloch and Walter Pitts modeled biological neural networks with simple mathematical rules. In 1958, the American psychologist Frank Rosenblatt developed the perceptron, a single-layer vision network implemented on punch cards, with the aim of building a hardware version for the U.S. Navy.


Caption: Frank Rosenblatt

Bigger is better: Rosenblatt's invention could only recognize classes separable by a single line. Later, the Ukrainian mathematicians Alexey Ivakhnenko and Valentin Lapa overcame this limitation by stacking networks of neurons in an arbitrary number of layers. In 1985, working independently, the French computer scientist Yann LeCun, David Parker, and the American psychologist David Rumelhart and his colleagues described the use of backpropagation to train such networks efficiently. In the first decade of the new millennium, researchers including Kumar Chellapilla, Dave Steinkraus, and Rajat Raina (in collaboration with Andrew Ng) pushed neural networks further by using graphics processing units, which allowed ever larger networks to learn from the vast amounts of data generated by the internet.

Fit for every task: The principle behind neural networks is simple: for any task, there is a function that performs it. A neural network forms a trainable function by combining many simple functions, each executed by a single neuron. A neuron's function is determined by adjustable parameters called weights. Given random values for these weights, and example inputs with their desired outputs, the weights can be altered iteratively until the trainable function performs the task at hand.

  • A neuron takes various inputs (for example, numbers representing pixels or words, or the outputs of the previous layer), multiplies them by its weights, adds the products, and feeds the sum through a nonlinear function, or activation function, chosen by the developer. Think of it as linear regression plus an activation function (a code sketch follows this list).

  • Training modifies the weights. For each example input, the network computes an output and compares it with the expected output. Backpropagation uses gradient descent to change the weights and reduce the difference between actual and expected outputs. Repeat this process enough times with enough (good) examples, and the network learns to perform the task.
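
A minimal from-scratch sketch of these mechanics: a single hidden-layer network trained with backpropagation on a made-up toy problem (XOR), written in NumPy. It only illustrates the forward/backward loop described above, not the architecture of any model discussed in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: XOR, which a single-layer perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 neurons, sigmoid activations
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.5  # learning rate
for step in range(10000):
    # Forward pass: each neuron is a linear regression plus an activation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of squared error, propagated layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent step on every weight and bias
    W2 -= alpha * h.T @ d_out
    b2 -= alpha * d_out.sum(axis=0)
    W1 -= alpha * X.T @ d_h
    b1 -= alpha * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```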

The black box: With luck, a well-trained network will do its job, but what you end up with is a function, often so complex (with thousands of variables and nested activation functions) that explaining how the network succeeds at its task is very difficult. Furthermore, a well-trained network is only as good as the data it learned from. For example, if the dataset was biased, the network's output will be biased too. If it contained only high-resolution images of cats, there's no knowing how it would respond to lower-resolution ones.

Common sense: The New York Times pioneered AI hype when it reported on Rosenblatt's 1958 perceptron, writing that the U.S. Navy expected a machine that would "walk, talk, see, write, reproduce itself and be conscious of its existence." While the perceptron of the day fell short, its descendants have produced a number of impressive models: convolutional neural networks for images; recurrent neural networks for text; and Transformers for images, text, speech, video, protein structures, and more. They have done amazing things, surpassing human-level performance at Go, for example, and approaching it at practical tasks like diagnosing X-ray images. Even so, they still struggle with common sense and logical reasoning.

5

Decision Trees: From Root to Leaf

What kind of "beast" was Aristotle? Porphyry, a follower of the philosopher who lived in Syria in the third century, came up with a logical way to answer the question. He arranged the "categories of being" that Aristotle had proposed from general to specific and assigned Aristotle to each category in turn: his existence was material rather than conceptual or spiritual; his body was animate rather than inanimate; his mind was rational rather than irrational. His classification, therefore, was human. Medieval logic teachers drew this sequence as a vertical flowchart: an early decision tree.

The difference in numbers: Fast-forward to 1963, when the University of Michigan sociologist John Sonquist and economist James Morgan first implemented decision trees in a computer while grouping survey respondents. Such work became commonplace with the advent of software for automatically training algorithms, and decision trees now appear in machine learning libraries including scikit-learn. The code was developed over a decade by four statisticians at Stanford University and the University of California, Berkeley. Today, writing a decision tree from scratch is a homework assignment in Machine Learning 101.

Roots in the air: A decision tree can perform classification or regression. It grows downward, from root to canopy, sorting input examples through a hierarchy of decisions into two (or more) classes. Consider a subject of the German physician and anthropologist Johann Blumenbach, who around 1776 first distinguished monkeys from apes (setting humans aside); before that, monkeys and apes had been lumped together. The classification depended on various criteria, such as whether the animal has a tail, a narrow or broad chest, whether it stands upright or crouches, and its level of intelligence. A decision tree trained to label such animals considers each criterion in turn, finally separating the two groups.

  • The tree starts from a root node that can be thought of as containing all the cases in a biological database: chimpanzees, gorillas, and orangutans, as well as capuchin monkeys, baboons, and marmosets. The root poses a question about whether a particular characteristic is present, splitting the examples into two child nodes, one with that characteristic and one without. Continuing in this way, the process ends with any number of leaf nodes, each containing mostly or entirely examples of one category.

  • To grow, the tree must find a decision for the root. It considers all features and their values (rear appendages, barrel chest, and so on) and chooses the one that maximizes the purity of the resulting split. Optimal purity means 100% of the examples of one class go to one child node and none go to the other. Splits are rarely 100% pure after a single decision, and they may never be. The process continues, generating level after level of child nodes, until purity barely increases by considering further features. At that point the tree is fully trained (a purity-computation sketch follows this list).

  • At inference time, a new example traverses the decision tree from top to bottom, evaluating one decision at each level, and receives the label of the leaf node it lands in.
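
A minimal sketch of the purity-driven split selection described above, using Gini impurity on a made-up dataset (the features and labels are invented for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means a perfectly pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(examples, feature):
    """Weighted impurity of the two child nodes produced by a yes/no feature."""
    yes = [label for feats, label in examples if feats[feature]]
    no = [label for feats, label in examples if not feats[feature]]
    n = len(examples)
    return (len(yes) / n) * gini(yes) + (len(no) / n) * gini(no)

# Made-up examples: (features, label)
examples = [
    ({"has_tail": True, "broad_chest": False}, "monkey"),
    ({"has_tail": True, "broad_chest": False}, "monkey"),
    ({"has_tail": False, "broad_chest": True}, "ape"),
    ({"has_tail": False, "broad_chest": True}, "ape"),
    ({"has_tail": False, "broad_chest": False}, "ape"),
]

# Pick the root decision: the feature whose split is purest (lowest impurity)
best = min(["has_tail", "broad_chest"], key=lambda f: split_impurity(examples, f))
print("root decision feature:", best)   # "has_tail" splits these classes perfectly
```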

Getting into the top 10: What if we want the decision tree to classify not just apes and monkeys but humans too, given Blumenbach's conclusion (later overturned by Charles Darwin) that humans are distinguished from apes by a broad pelvis, hands, and close-set teeth? The Australian computer scientist John Ross Quinlan made this possible in 1986 with ID3, which extended decision trees to support non-binary outcomes. In 2008, a further refinement named C4.5 made the list of top ten data mining algorithms curated by the IEEE International Conference on Data Mining. In a world of rampant innovation, that is staying power.

Cutting back the leaves: Decision trees do have some drawbacks. They can easily overfit the data by growing so many levels that a leaf node contains only a single example. Worse, they are prone to the butterfly effect: change one example and the resulting tree can look dramatically different.

Into the forest: The American statistician Leo Breiman and the New Zealand statistician Adele Cutler turned this trait to advantage by developing the random forest in 2001: an ensemble of decision trees, each of which processes a different, overlapping selection of examples, with the trees voting on the final result. Random forests and their cousin XGBoost are less prone to overfitting, which helps make them among the most popular machine learning algorithms. It's like having Aristotle, Porphyry, Blumenbach, Darwin, Jane Goodall, Dian Fossey, and 1,000 other zoologists in the room together, making sure your taxonomy is the best it can be. A brief sketch follows.
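
As a rough sketch of the ensemble idea, here is scikit-learn's RandomForestClassifier applied to a randomly generated demo dataset (not data from the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Randomly generated demo data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrapped sample of the data, vote on the label
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```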

6

K-Means Clustering: Groupthink

If you're standing close to other people at a party, chances are you have something in common. That's the idea behind using k-means clustering to group data points. Whether the groups formed through human agency or some other force, the algorithm will find them.

From explosions to dial tones: The American physicist Stuart Lloyd, an alumnus of Bell Labs' iconic innovation factory and of the Manhattan Project, which invented the atomic bomb, first proposed k-means clustering in 1957 to distribute information within digital signals, but the work was not published until 1982:


Paper address: https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf

Meanwhile, American statistician Edward Forgy described a similar method in 1965, leading to its alternative name "Lloyd-Forgy algorithm".

Finding hubs: Consider dividing the party into like-minded working groups. Given the positions of the attendees in the room and the number of groups to form, k-means clustering can divide them into roughly equal-sized groups, each gathered around a central point, or centroid.

  • During training, the algorithm initially designates k centroids by randomly selecting k people. (k must be chosen by hand, and finding an optimal value is sometimes very important.) It then grows k clusters by assigning each person to the nearest centroid (a code sketch follows this list).

  • For each cluster, it computes the average position of all people assigned to that group and assigns this average position as the new centroid. Each new centroid may not be occupied by a single person, but so what? People tend to gather around chocolate and fondue.

  • Having calculated new centroids, the algorithm reassigns each person to the centroid closest to them. It then computes new centroids, adjusts the clusters, and so on, until the centroids (and the groups around them) stop moving. From there, assigning new members to the right cluster is easy: have them take a place in the room and look for the nearest centroid.

  • Be forewarned: given the initial random centroid assignment, you may not end up in the same group as the lovable data-centric AI experts you hope to hang out with. The algorithm does a good job, but it's not guaranteed to find an optimal solution.
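
A minimal from-scratch sketch of the loop described above, using made-up 2-D "positions in the room" and NumPy only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "positions in the room": three loose groups of partygoers
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(2.5, 4), scale=0.5, size=(20, 2)),
])

k = 3
# Start with k centroids chosen as k random people
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):
    # Assign each person to the nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each centroid to the average position of its cluster
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
        break
    centroids = new_centroids

print("final centroids:\n", centroids.round(2))
```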

Different distances: Of course, the distance between clustered objects doesn't have to be spatial. Any measure between two vectors will do. For example, rather than grouping partygoers by physical distance, k-means clustering can segment them by their clothing, occupation, or other attributes. Online stores use it to segment customers by preferences or behavior, and astronomers can use it to group stars of the same type.

The power of data points: The idea has produced some notable variants:

  • K-medoids use actual data points as centroids, rather than the average position within a cluster: the medoid is the point that minimizes the total distance to all other points in its cluster. This variant is easier to interpret because each centroid is always a real data point.

  • Fuzzy C-means clustering lets data points participate in multiple clusters to varying degrees. Instead of a hard cluster assignment, each point gets a degree of membership based on its distance from each centroid.

N-dimensional carnival: Nevertheless, the algorithm in its original form remains widely useful, especially because, as an unsupervised algorithm, it doesn't require gathering expensive labeled data. It is also getting ever faster to use. For example, machine learning libraries including scikit-learn benefited from the 2002 addition of kd-trees, which partition high-dimensional data very quickly.

Original link:

https://read.deeplearning.ai/the-batch/issue-146/



Origin: blog.csdn.net/Datawhale/article/details/132267650