Machine Learning: Neural Network Basics

Introduction

Neural networks are loosely modeled on the workings of the human brain, with the goal of making software smarter.

Neural networks are generally a supervised learning method, which means a training set is required. Ideally, the training set contains ground-truth labeled examples (tags, classes). For example, in text sentiment analysis, the training set is a list of sentences together with their corresponding sentiments. (Note: unlabeled datasets can also be used to train neural networks, but only the most basic cases are considered here.)

For example, call the texts X and their labels Y. Some function defines the relationship between X and Y, that is, which features (words, phrases, sentence structure, and so on) give a sentence its negative or positive meaning. Early on, people identified these features manually, a process called feature engineering. Neural networks automate this process.
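
For example, a minimal sketch of such a labeled training set, written as Python lists (the sentences and labels here are made up):

# X holds the texts and Y holds the corresponding sentiment labels.
X = [
    "I loved this film",                # hypothetical example sentences
    "The plot was dull and slow",
    "Great acting and a clever script",
    "I want my money back",
]
Y = ["positive", "negative", "positive", "negative"]   # ground-truth labels

# The learner's job is to approximate a function f such that f(X[i]) ≈ Y[i],
# discovering useful features (words, phrases, structure) on its own.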

There are many ways to understand these concepts; choose whichever suits you, and be persistent about the learning itself. In the end, knowing the mathematics is a useful tool when it comes to optimization or experimentation.

Working Process

An artificial neural network consists of 3 components:

  • Input Layer
  • Hidden (computation) Layers
  • Output Layer

The learning process takes place in two steps:

  • Forward-Propagation: Guessing the answer
  • Back-Propagation: Minimizing the error between the actual answer and the guessed answer

Forward-Propagation

Randomly initialize the weights:

  • w1
  • w2
  • w3

The input-layer data is multiplied by the weights to form the hidden layer:

  • h1 = (x1 * w1) + (x2 * w1)
  • h2 = (x1 * w2) + (x2 * w2)
  • h3 = (x1 * w3) + (x2 * w3)

The output of the hidden layer is passed through a nonlinear function (the activation function) to form the guessed output (see the code sketch below):

  • y_ = fn( h1 , h2, h3 )
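
To make these steps concrete, here is a minimal Python sketch of the forward pass. The input values are made up, and since this section does not specify fn, a sigmoid over the sum of the hidden values is assumed:

import math
import random

# Hypothetical inputs and randomly initialized weights, following the steps above.
x1, x2 = 0.5, 0.8
w1, w2, w3 = (random.uniform(-1, 1) for _ in range(3))

# The input-layer data is multiplied by the weights to form the hidden layer.
h1 = (x1 * w1) + (x2 * w1)
h2 = (x1 * w2) + (x2 * w2)
h3 = (x1 * w3) + (x2 * w3)

# The hidden values pass through a nonlinear activation to form the guess y_.
# fn is not defined in the text; a sigmoid over the sum is assumed here.
def fn(a, b, c):
    return 1.0 / (1.0 + math.exp(-(a + b + c)))

y_ = fn(h1, h2, h3)
print(y_)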

Back-Propagation

  • The total error (total_error) is calculated by a cost function, whose inputs are the expected value y (the value in the training set) and the observed value y_ (the value produced by forward propagation)
  • The partial derivative of the error with respect to each weight is calculated (these partial derivatives measure how much each weight contributes to the total error)
  • Each derivative is then multiplied by a small number ( η ), called the learning rate
  • The result is subtracted from the respective weight

The result of backpropagation is the following updated weights:

  • w1 = w1 - (η * ∂(err) / ∂(w1))
  • w2 = w2 - (η * ∂(err) / ∂(w2))
  • w3 = w3 - (η * ∂(err) / ∂(w3))

Basically, we initialize the weights randomly, proceed as if they already give accurate answers, and then correct them step by step.

For those familiar with Taylor series, backpropagation reaches a similar end result, except that instead of working with an infinite series we optimize using only its first term.
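
To make the update rule concrete, here is a minimal sketch of gradient descent on a single-weight model. The linear model y_ = w * x, the squared-error cost, and all the numbers are assumptions for illustration only:

# One-weight toy model: y_ = w * x, with squared-error cost err = (y - y_) ** 2.
x, y = 2.0, 1.0          # a single training example (made up)
w = 0.9                  # randomly initialized weight
eta = 0.1                # learning rate η

for step in range(5):
    y_ = w * x                       # forward pass: the guess
    err = (y - y_) ** 2              # cost
    d_err_d_w = -2 * x * (y - y_)    # partial derivative ∂(err)/∂(w)
    w = w - eta * d_err_d_w          # update: w = w - (η * ∂(err)/∂(w))
    print(step, round(err, 4), round(w, 4))
# The error shrinks at every step and w approaches its optimal value of 0.5.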

Bias is an extra weight added to the hidden layer. Biases are also randomly initialized and updated in the same way as the hidden-layer weights. While the role of the hidden layer is to map the shape of the underlying function in the data, the role of the bias is to shift the learned function laterally so that it overlaps the original function.
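
As a small sketch, here is a single sigmoid neuron with and without a bias (the values are made up). The bias b shifts the activation curve along the input axis without changing its shape:

import math

def neuron(x, w, b=0.0):
    # Weighted input plus bias, passed through a sigmoid activation.
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

print(neuron(0.0, w=1.0))           # 0.5: without a bias the curve is centred at x = 0
print(neuron(0.0, w=1.0, b=2.0))    # ≈ 0.88: the bias shifts the curve sideways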

Partial Derivatives

Computing partial derivatives allows us to know the contribution of each weight to the error.
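
One way to see this numerically is a finite-difference check: nudge one weight at a time and watch how the error changes. The error surface below is made up purely for illustration:

def error(w1, w2):
    # A made-up error surface that depends on two weights.
    return (w1 - 0.3) ** 2 + 2 * (w2 + 0.1) ** 2

w1, w2, eps = 0.9, 0.4, 1e-6

# Approximate ∂(err)/∂(w1) and ∂(err)/∂(w2) by nudging one weight at a time.
d_w1 = (error(w1 + eps, w2) - error(w1, w2)) / eps
d_w2 = (error(w1, w2 + eps) - error(w1, w2)) / eps
print(d_w1, d_w2)   # ≈ 1.2 and ≈ 2.0: here w2 contributes more to the error than w1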

The need for derivatives is intuitive. Example: consider a neural network trying to find the optimal speed of a self-driving car. If the car finds it is going faster or slower than expected, the neural network changes the speed by accelerating or decelerating. And what are acceleration and deceleration? Derivatives of speed.

Explaining Partial Derivatives: Shooting Darts

Suppose several children are asked to throw darts at a dartboard, aiming for the center.

Now, if we measure only the total error and simply subtract the same correction from all the weights, we generalize the error across every student. Suppose one child is aiming too low while the others are not; if we ask all the children to aim higher, that child improves but the others are pushed off target.

Some students' errors may decrease, but the overall error still increases. By computing the partial derivatives, we can find the error that each weight contributes individually, and correcting each weight individually gives much better results.

Hyperparameters

Although neural networks are used to automate feature selection, there are still some parameters that we must set manually.

Learning Rate

The learning rate is a very critical hyperparameter. If the learning rate is too small, then even after training the neural network for a long time it will still be far from the optimal result.

Conversely, if the learning rate is too high, the learner jumps to conclusions prematurely and overshoots the optimum.
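
Both failure modes can be seen on a made-up one-dimensional cost err(w) = w², whose minimum is at w = 0 (the learning rates and the step count below are arbitrary):

def descend(eta, steps=20, w=1.0):
    # Gradient descent on err(w) = w ** 2, whose derivative is 2 * w.
    for _ in range(steps):
        w = w - eta * (2 * w)
    return w

print(descend(eta=0.001))   # too small: after 20 steps w is still close to where it started
print(descend(eta=0.1))     # reasonable: w ends up near the optimum at 0
print(descend(eta=1.1))     # too large: the updates overshoot and w diverges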

Activation Function

In simple terms, the activation function decides which neurons will be activated, that is, what information is passed on to the following layers. Without activation functions, a deep neural network loses much of its ability to learn representations, because a stack of purely linear layers collapses into a single linear transformation.

The nonlinearity of these functions is responsible for increasing the learner's degrees of freedom, allowing it to generalize high-dimensional problems in lower dimensions. Here are some examples of popular activation functions:
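
As a sketch, minimal implementations of three commonly used activation functions, sigmoid, tanh, and ReLU:

import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real number into the range (-1, 1).
    return math.tanh(x)

def relu(x):
    # Passes positive values through unchanged and clamps negative values to 0.
    return max(0.0, x)

print(sigmoid(1.0), tanh(1.0), relu(-1.0))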

Cost Function

Cost functions are at the heart of neural networks. The cost function calculates the loss between the expected and the observed results, and our goal is to minimize this loss. The cost function therefore effectively drives the neural network toward its objective.

The cost function is a measure of "how well" a neural network does, given a training sample and expected output. It may also depend on variables such as weights and biases.

The cost function is a single value, not a vector, because it evaluates the performance of the neural network as a whole. Some of the best known cost functions are:

  • Quadratic cost (mean squared error), also referred to as root mean square error
  • Cross entropy
  • Exponential (AdaBoost)
  • Relative entropy (Kullback–Leibler divergence, or information gain)

The quadratic cost is the simplest and most commonly used of these. For a single example it is defined as:

Loss = √((expected_output - real_output) ** 2)

The cost function in a neural network should satisfy two conditions:

  • The cost function must be able to be written as an average over training examples (see the sketch below)
  • The cost function must not depend on any activation values of the network other than the output values
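
As a concrete sketch of the first condition, here is a mean squared error cost written as an average over training examples (the values are made up):

import math

def mse(expected, observed):
    # The average of the squared differences over all training examples.
    return sum((e - o) ** 2 for e, o in zip(expected, observed)) / len(expected)

expected = [1.0, 0.0, 1.0]   # y values from the training set
observed = [0.8, 0.2, 0.6]   # y_ values from forward propagation

print(mse(expected, observed))              # ≈ 0.08
print(math.sqrt(mse(expected, observed)))   # its square root, the root mean square error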

Deep Networks

Deep learning is a class of machine learning algorithms that can learn deeper (more abstract) insights from data.

  • It uses a cascade of multiple layers of (nonlinear) processing units, arranged like a pipeline, for feature extraction and transformation.
  • It learns features (representations of the data) in an unsupervised manner; higher-level features (found in later processing layers) are derived from lower-level features (found in the initial processing layers).
  • The multiple levels of features correspond to different levels of abstraction, and together they form a hierarchy of concepts.

A Single-Layer Neural Network

In a single-layer neural network, whatever the first layer (the green neurons) learns is simply passed on to the output.

A Two-Layer Neural Network

In a two-layer neural network, whatever the green hidden layer learns is passed to the blue hidden layer, which learns further on top of it (learning about what the green layer learned). Therefore, the more hidden layers there are, the more the network can build on concepts learned in earlier layers.
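
As a sketch, stacking a second hidden layer simply feeds the first layer's outputs into another set of weights. The layer sizes, weight values, and sigmoid activation below are arbitrary choices for illustration:

import math

def layer(inputs, weights):
    # Each output neuron sums its weighted inputs and applies a sigmoid.
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(ws, inputs))))
            for ws in weights]

x = [0.5, 0.8]                                   # input layer (made-up values)
green = layer(x, [[0.1, 0.4], [-0.3, 0.2]])      # first hidden layer
blue = layer(green, [[0.7, -0.5], [0.2, 0.9]])   # second hidden layer learns from the first
output = layer(blue, [[1.0, -1.0]])              # output layer
print(output)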

Wide Neural Network vs Deep Neural Network

Adding more neurons to a layer does not give the network deeper insight; instead, it lets the network learn a larger number of concepts.

Example: learning English grammar requires understanding a large number of concepts. In this case, a wide single-layer neural network works much better than a deep neural network that is much smaller in width.

But in the case of learning the Fourier transform, the learner (the neural network) needs to learn deeply: there are not many concepts to learn, but each concept is complex enough to require deep learning.

Balance is Key

It is very tempting to use deep and wide neural networks for every task. This can be a very bad idea because:

  • Both obviously require more data to reach the desired minimum accuracy
  • Both have exponentially increased time complexity
  • A neural network that is too deep will try to decompose a fundamental concept ever further, at which point it starts making false assumptions about the concept and looking for pseudo-patterns that do not exist
  • A neural network that is too wide will try to find a larger number of features (measurable properties); similarly to the above, it will start making wrong assumptions about the data

Curse of Dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings.

Things like English grammar or stock prices have a great many features that affect them. Using machine learning, these features must be represented in an array/matrix of finite and relatively much smaller length than the number of features actually present. Doing this can create two problems:

  • Bias (incorrect assumptions made by the learner): high bias can cause the algorithm to miss relevant correlations between the features and the target outputs. This phenomenon is called underfitting.
  • Variance (insufficient learning of the features): small fluctuations in the training set lead to large deviations because knowledge of the features is incomplete. High variance leads to overfitting, where errors and noise are learned as if they were relevant information.

Trade-Off

It is typically impossible to have low bias and low variance.

In the early stages of training, the bias is large because the network's output is far from the desired output, while the variance is small because the data has so far had little influence. Late in training, the bias is small because the network has learned the underlying function.

However, if training goes on for too long, the network will also learn the noise specific to that dataset. This leads to high variance when it is tested on different datasets, because the noise varies from dataset to dataset. In practice, algorithms with high bias typically produce simpler models that do not tend to overfit, but they may underfit their training data, failing to capture important patterns or properties of the features. Models with low bias and high variance are usually structurally more complex, which lets them represent the training set more accurately; however, they may also represent a large share of the noise in the training set, making their predictions less accurate despite the added complexity.

Therefore, it is usually not possible to have low bias and low variance at the same time.
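
One common way to see the trade-off is to fit the same noisy data with models of increasing complexity. The sketch below uses polynomial degree as a stand-in for model complexity; the underlying function, noise level, and degrees are made-up choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 30)
x_test = np.linspace(-1, 1, 101)

def true_fn(x):
    # The (made-up) underlying function that the data follows.
    return x ** 3 - x

y_train = true_fn(x_train) + rng.normal(0, 0.1, size=x_train.shape)  # noisy training data
y_test = true_fn(x_test)                                             # the noise-free target

for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)   # polynomial degree = model complexity
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(degree, round(train_err, 4), round(test_err, 4))

# Degree 1 underfits (high bias: both errors stay high), degree 12 drives the
# training error down but tends to chase the noise (high variance), and
# degree 3 matches the underlying function well.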

Today, with abundant data and tools, we can easily create complex machine learning models. Bias only becomes a problem when the learner is not given enough information, so in practice dealing with overfitting becomes the central job. Providing more examples introduces more variation, including a larger number of patterns for the model to learn.

Further reading: "The Machine Learning Master"
