What is AI? How to define artificial intelligence?

Author: Zen and the Art of Computer Programming

1. Introduction

The term "artificial intelligence" was coined in 1956 by John McCarthy and his colleagues in connection with the Dartmouth workshop. Artificial intelligence refers to the field of computer science concerned with making machines intelligent, able to perform various tasks autonomously. In recent decades, with the continuous development of machine learning, deep learning, reinforcement learning and other families of algorithms, artificial intelligence has become a research hotspot and is ever more widely applied.

In recent years, a series of high-profile results has changed how artificial intelligence is perceived. Previously, many researchers regarded artificial intelligence as little more than programmable machinery that could not really solve the hard computational problems of the field. Today, many people see artificial intelligence as an emerging core technology of computing, one that can match human-level performance on a growing range of tasks.

Internationally, artificial intelligence is still a relatively young concept, and it has generated some controversy in recent years. For example, the name "artificial intelligence" itself carries political overtones. The U.S. National Institute of Standards and Technology (NIST) has stated that, because there is currently no precise, agreed-upon definition, the term "artificial intelligence" can be misleading. To avoid this, related research and discussion should be conducted in an objective and scientific way.

By definition, artificial intelligence refers to technology that simulates human intelligence, through machine learning or other means. Simply put, artificial intelligence aims to make machines "intelligent": able to process complex data and perform advanced analysis and decision-making. Such machines should be able to understand language, apply logic and reasoning, make decisions, learn, and handle emotions, movement, images and sound, and they should be able to manipulate the physical world and create new things.

How, then, should artificial intelligence be defined? "Intelligence", the core feature of artificial intelligence, is itself ambiguous. Machine intelligence does not denote a completely autonomous individual, but rather a high degree of practical and analytical ability when facing an environment, a problem or a task. More precisely, artificial intelligence is usually regarded as an ability rather than an entity: it is a human ability manifested by a machine. The word "intelligence" is therefore sometimes used to describe an essential attribute of machines, but it cannot be fully equated with human intelligence.

2. Explanation of basic concepts and terms

2.1 Concept

2.1.1 Machine Learning

Machine learning is a branch of artificial intelligence that allows machines to learn from data and improve their performance: they learn to automatically recognize patterns in data and to use those patterns to make predictions.

2.1.2 Reinforcement learning

Reinforcement learning is a family of machine learning algorithms in which an agent (a machine) trains itself by interacting with an environment so as to maximize its long-term cumulative reward. The approach draws inspiration from how living organisms learn from behavior and its consequences.

2.1.3 Deep learning

Deep learning is a branch of machine learning built on artificial neural networks. Its defining feature is the use of multi-layer neural networks to learn abstract representations of data.

2.1.4 Unsupervised learning

Unsupervised learning is a machine learning setting in which machines discover hidden patterns in data. Under this approach, the examples in the dataset carry only input values and no labeled outputs; the structure of the data is inferred during model training.

2.2 Terminology

  • Problem Space: the problem space consists of the input space X and the output space Y, representing the agent's (machine's) inputs and expected outputs.
  • Sample Space: the sample space consists of input vectors and represents the inputs actually observed by the agent (machine).
  • Decision Space: the decision space consists of output vectors, representing the actions or outputs the agent (machine) can choose.

2.3 Formulas

2.3.1 Information entropy

Information Entropy measures the uncertainty (equivalently, the average amount of information) of a probability distribution P. The formula is as follows:

$$H(P) = -\sum_{i} p(x_i)\log p(x_i)$$

Here H(P) denotes the information entropy, x_i is an element of the sample space X, and p(x_i) is the probability of the i-th possible value. In information theory, information entropy measures the uncertainty of a random variable: when one value in the sample space X occurs with certainty, the entropy is at its minimum (zero); when every possible value occurs with equal probability, the entropy is at its maximum.
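
As a quick illustration, the entropy of a discrete distribution can be computed with NumPy; the example distributions below are made up for illustration only:

import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (array of probabilities summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with zero probability contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))  # uniform over 2 values -> 1.0 bit (maximum)
print(entropy([0.9, 0.1]))  # skewed -> about 0.469 bits
print(entropy([1.0, 0.0]))  # certain outcome -> 0.0 bits (minimum)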

2.3.2 KL divergence

Kullback-Leibler divergence (KL divergence) is a measure of how one probability distribution differs from another. It is not a true distance (it is not symmetric), but it quantifies the difference between a distribution Q and a reference distribution P.

The KL divergence formula is as follows:

$$D_{KL}(Q \,\|\, P) = \sum_{x} q(x)\log\frac{q(x)}{p(x)}$$

q is the probability density function of distribution Q, and p is the probability density function of distribution P. The smaller the value of KL divergence, the closer the two distributions are.
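
As a small numerical sketch (the distributions below are made up for illustration), KL divergence can be computed with NumPy:

import numpy as np

def kl_divergence(q, p):
    """D_KL(Q || P) for discrete distributions q and p over the same support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms where q(x) = 0 contribute nothing
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

q = [0.4, 0.6]
print(kl_divergence(q, [0.4, 0.6]))  # identical distributions -> 0.0
print(kl_divergence(q, [0.1, 0.9]))  # more different -> larger divergence (about 0.311)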

3. Explanation of core algorithm principles, specific operating steps and mathematical formulas

3.1 Naive Bayes algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes' theorem and the assumption of conditional independence between features. Its main idea is: for a given instance, first obtain the prior probability of each candidate class, then assign the instance to the class for which the product of the prior and the per-feature conditional probabilities is largest.

Bayes' Theorem Bayes' theorem is a result in probability theory that describes how known conditions can be used to reason about incompletely known but related ones. The theorem is named after Thomas Bayes, whose work was published posthumously in 1763 and was later generalized by Laplace.

$$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$$

Here, P(A|B) is the probability that event A occurs given that event B has occurred. It equals the probability that A and B occur together divided by the probability that B occurs, that is, $P(A|B) = P(B|A)P(A)/P(B)$, where $P(B|A)$ is the probability of B given A, and $P(A)$, $P(B)$ are the prior probabilities of A and B.
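
As a small worked example with hypothetical numbers: if $P(A) = 0.3$, $P(B|A) = 0.8$ and $P(B) = 0.5$, then $P(A|B) = 0.8 \times 0.3 / 0.5 = 0.48$; observing B raises the probability of A from 0.3 to 0.48.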

Feature conditional independence assumption The Naive Bayes algorithm assumes that, given the class, the features are conditionally independent of one another: once the class is known, the value of one feature tells us nothing further about another. This assumption lets the joint likelihood factor into a product of per-feature conditional probabilities, which makes the probabilities easy to estimate during classification, and it is also why the algorithm is called "naive".

Classification Process In a Naive Bayes classifier, the training set consists of labeled input instances, each associated with a class. First, the algorithm computes the prior probability of each class. Then, for a given input instance, it computes the conditional probability of each feature value and multiplies these together with the prior, obtaining a score proportional to the posterior probability. Finally, it assigns the instance to the class with the highest posterior probability.

  1. Calculate the prior probability: Calculate the prior probability of each category, that is, the frequency of occurrence of each category. For example, if the target value of an input instance is "apple", the prior probability is the number of all input instances with the target value "apple" divided by the total number of input instances.

  2. Calculate the feature conditional probabilities: iterate over each feature and its possible values, and estimate P(feature=value | class) as the frequency of that value among the instances of that class. For example, for the feature "color=red" and the class "apple", count the instances labeled "apple" whose color is red and divide by the total number of instances labeled "apple".

  3. Calculate posterior probability: Calculate the posterior probability of the input instance in all classes. The posterior probability can be obtained by multiplying the characteristic conditional probability and the prior probability. For example, assuming there are two categories "apple" and "banana", and the target value of the input instance is "apple", then their posterior probabilities can be calculated as:

p(target="apple" | features) = [ p(feature_1 | target="apple") * ... * p(feature_n | target="apple") * prior(target="apple") ] / Z

where Z = p(feature_1 | target="apple") * ... * p(feature_n | target="apple") * prior(target="apple") + p(feature_1 | target="banana") * ... * p(feature_n | target="banana") * prior(target="banana"). Since Z is the same for both classes, it is enough to compare the numerators when classifying.

  4. Select the class with the largest posterior probability as the class of the input instance (a minimal code sketch of these four steps follows below).
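
The following is a minimal sketch of these four steps on a small, hypothetical fruit dataset; the data, feature names and helper functions are illustrative only and do not come from the text above. Because the evidence Z is the same for every class, the sketch simply compares the unnormalized scores.

from collections import Counter, defaultdict

# hypothetical toy dataset: each instance is {feature: value}, with a target label
data = [({'color': 'red', 'shape': 'round'}, 'apple'),
        ({'color': 'red', 'shape': 'round'}, 'apple'),
        ({'color': 'yellow', 'shape': 'long'}, 'banana'),
        ({'color': 'yellow', 'shape': 'round'}, 'apple'),
        ({'color': 'yellow', 'shape': 'long'}, 'banana')]

# Step 1: prior probabilities (class frequencies)
labels = [label for _, label in data]
prior = {c: n / len(data) for c, n in Counter(labels).items()}

# Step 2: feature conditional probabilities P(feature=value | class), by counting (no smoothing, for brevity)
cond = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for f, v in features.items():
        cond[label][f][v] += 1

def p_feature_given_class(f, v, c):
    counts = cond[c][f]
    return counts[v] / sum(counts.values())

# Steps 3-4: score each class by prior * product of conditionals, then pick the argmax
def classify(features):
    scores = {}
    for c in prior:
        score = prior[c]
        for f, v in features.items():
            score *= p_feature_given_class(f, v, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({'color': 'red', 'shape': 'round'}))    # -> apple
print(classify({'color': 'yellow', 'shape': 'long'}))  # -> banana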

Estimation of the conditional probabilities: if a feature can take n possible values, then the conditional probability of observing value_k given class c is estimated as:

P(feature=value_k|class=c) = (# of instances with feature=value_k and target=c) / (# of all instances with target=c)

Among them, # of instances with feature=value_k and target=c represents the number of instances with label c that have feature value value_k; # of all instances with target=c represents the total number of instances with label c.

Therefore, for an input instance with feature values values[1..n], the posterior probability of class c can be calculated as:

P(target=c | features) = Π_{i=1}^n [ P(feature_i=values[i] | target=c) ] * P(target=c) / Z(features)

Z(features) is the normalization factor that makes the posterior probabilities of all classes sum to one. By Bayes' theorem, the posterior probability is:

P(target=c | features) = P(features | target=c) * P(target=c) / P(features)

so the normalization factor is simply the evidence Z(features) = P(features), which can be expanded by summing over all classes:

Z(features) = Σ_{c'} P(features | target=c') * P(target=c') = Σ_{c'} [ Π_{i=1}^n P(feature_i=values[i] | target=c') ] * P(target=c')

It can be seen that computing P(features | target=c) in general depends on the joint distribution of all features given the class. This is exactly where the conditional independence hypothesis comes in: assuming the features are conditionally independent of one another given the class, the joint likelihood factorizes as

P(features | target=c) = Π_{i=1}^n P(feature_i=values[i] | target=c)

which is the product used above. Moreover, since Z(features) is the same for every class c, it can be ignored when we only need the class with the largest posterior.

Disadvantages The Naive Bayes algorithm has a notable weakness: because it treats features as conditionally independent given the class, it ignores correlations between features. When the data contains redundant, highly correlated features, their evidence is effectively counted multiple times, and the classification results can deviate from the real situation.

Improvement measures In order to overcome the shortcomings of the naive Bayes algorithm, the following strategies can be considered:

a. Use Bayesian estimation instead of raw frequency counts: estimating the conditional probabilities directly from counts is easily distorted by extreme cases, in particular feature values that never co-occur with a class (which yield zero probabilities). Bayesian estimation, e.g. Laplace smoothing, is an efficient way to compute the posterior probabilities while damping this kind of noise.

b. Use cross-validation for hyperparameter tuning: the Naive Bayes algorithm has hyperparameters such as the smoothing parameter and feature weights, and cross-validation can help select the best combination (see the sketch after this list).

c. Add feature selection module: The feature selection module can help filter out redundant or highly relevant features, thereby reducing the complexity of the model.

d. Switch to other machine learning algorithms: consider support vector machines (SVM), decision trees, neural networks and other algorithms that are better able to model correlations between features.
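
As one possible realization of strategy (b), the following sketch uses scikit-learn's MultinomialNB and GridSearchCV to cross-validate the smoothing parameter alpha; the digits dataset and the alpha grid are chosen only for illustration.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# the digits features are non-negative pixel intensities, so MultinomialNB is applicable
X, y = load_digits(return_X_y=True)

# 5-fold cross-validation over a small grid of smoothing (Laplace/Lidstone) parameters
search = GridSearchCV(MultinomialNB(), {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}, cv=5)
search.fit(X, y)

print('Best alpha:', search.best_params_['alpha'])
print('Cross-validated accuracy:', round(search.best_score_, 3))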

3.2 k-nearest neighbor algorithm

The K-Nearest Neighbors algorithm is a non-parametric statistical learning method for classification and regression. The algorithm keeps the feature vectors of all known samples together with their classes; when a new sample arrives, it determines the sample's class from its nearest neighbors under a chosen distance function.

Algorithm process

  1. Specify the objects to be classified;
  2. Determine the k value, that is, select the k nearest neighbors to the object to be classified;
  3. Compare the object to be classified with its k neighbors;
  4. Determine the category of the object to be classified by majority vote;

Distance function The distance function is used to measure the distance between two vectors. The choice of distance function will affect the final results. Commonly used distance functions include Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, etc.
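
For reference, the four distance functions mentioned above can be written directly with NumPy; the two example points are arbitrary. The Minkowski distance generalizes the others (p=1 gives Manhattan, p=2 Euclidean, and p tending to infinity approaches Chebyshev).

import numpy as np

def euclidean(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))

def manhattan(x1, x2):
    return np.sum(np.abs(x1 - x2))

def chebyshev(x1, x2):
    return np.max(np.abs(x1 - x2))

def minkowski(x1, x2, p=3):
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b), minkowski(a, b, p=3))
# 5.0 7.0 4.0 and about 4.498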

Algorithm Optimization For the k-nearest neighbor algorithm, the optimal value of k is generally selected by cross-validation, in which the data is split so that one part is used for training and the other for validation. When k is too small, the model is very sensitive to noise and tends to overfit; when k is too large, the vote is smoothed over many distant, less relevant neighbors and the model underfits. The best k is therefore found by cross-validation, as sketched below.
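
A short sketch of this selection, assuming scikit-learn's KNeighborsClassifier and the built-in iris dataset (both chosen only for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16, 2):  # odd values of k help avoid tied votes
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print('Best k:', best_k)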

Disadvantages Although the k-nearest neighbor algorithm is simple and easy to understand, it is computationally intensive and is not suitable for processing large data sets. In addition, for data sets with imbalanced categories, the prediction results of the k-nearest neighbor algorithm may be biased towards the majority category.

Improvement measures There are two improvement methods that can alleviate the shortcomings of the k-nearest neighbor algorithm.

a. k-means clustering: the training set can first be compressed with k-means clustering (whose iterations run in roughly linear time), and the nearest-neighbor search can then be performed against the cluster centers; this reduces the computation and can give better results on data sets with imbalanced categories.

b. Locality heuristic: The locality heuristic can help improve the classification performance of the k-nearest neighbor algorithm. It mainly dynamically adjusts the k value based on the proximity of samples.

3.3 Linear support vector machine

The Linear Support Vector Machine is a binary (two-class) support vector machine and a linear classifier. It tries to find the separating hyperplane that maximizes the margin to the support vectors in the feature space; when the data is not linearly separable in the original space, a kernel function can be used to map the original features into a high-dimensional feature space in which a linear separation becomes possible.

Kernel function A kernel function corresponds to a mapping that takes data points from the original space to a high-dimensional space. Kernels play an important role in classification: by implicitly mapping the input space to a high-dimensional feature space, they allow a problem that is non-linear in the original space to be separated by a linear classifier.

Commonly used kernel functions include: linear kernel function, polynomial kernel function, radial basis function, string kernel function, etc.
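
As a brief sketch of how the kernel choice matters in practice, the following compares three kernels with scikit-learn's SVC on the iris dataset; the dataset and kernels are chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for three common kernels
for kernel in ['linear', 'poly', 'rbf']:
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(score, 3))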

Support Vector Support vectors are the training samples that lie closest to the separating hyperplane. The training goal of a support vector machine is to maximize the margin between the hyperplane and these samples while keeping the margin violations small. The primal problem is a minimization over the parameters w and b; it is usually solved through its dual, an optimization over the Lagrange multipliers.

With a soft margin, the dual problem of the support vector machine takes the following form (to be maximized over α):

$$L(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, \langle \phi(x_i), \phi(x_j) \rangle, \qquad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Here the α_i are the Lagrange multipliers, representing the contribution of each sample point to the solution, and C is the soft-margin constant, which bounds each α_i and trades off margin width against training errors.

It can be shown that the solution can be expressed in terms of the multipliers as:

$$w = \sum_{i=1}^{N} \alpha_i y_i \, \phi(x_i) \qquad \text{(Equation 1)}$$

$$b = y_j - \sum_{i=1}^{N} \alpha_i y_i \, \langle \phi(x_i), \phi(x_j) \rangle \quad \text{for any support vector } x_j \qquad \text{(Equation 2)}$$

Here, φ(x) is the vector obtained by mapping the input data into the high-dimensional space.

Algorithm process

  1. Get the training set;
  2. Calculate the kernel function values between samples, i.e. their inner products in the high-dimensional feature space;
  3. Solve for the Lagrange multipliers α and, from them, compute the coefficients of the linear support vector machine (Equations 1 and 2);
  4. Use the obtained linear support vector machine to predict new data.

Algorithm optimization The optimization objective of the linear support vector machine is to minimize the margin-based (hinge) loss together with the regularization term. Several techniques are used to solve or approximate this optimization:

  1. SMO (Sequential Minimal Optimization): a heuristic optimization algorithm that repeatedly optimizes the dual objective over just two Lagrange multipliers at a time while holding all the others fixed. The specific steps are as follows:

i. Selection step: choose from the training set a pair of sample points whose multipliers most violate the optimality (KKT) conditions, and solve analytically for their new values α_i, α_j.

ii. Update step: after updating the chosen pair, re-check the KKT conditions of the remaining sample points. If some multiplier still violates them, select a new pair and repeat; if no violation remains (within tolerance), the training process ends.

  2. Learning rate adjustment: when the support vector machine is trained with gradient-based methods (e.g. stochastic gradient descent on the hinge loss), the learning rate controls the step size of each update and helps the optimization converge stably.

  3. Boosting-style sample re-weighting: a per-sample weight can be introduced into the objective function to improve the fit on difficult samples. The specific steps are as follows:

i. For each sample point (x_i, y_i), measure how poorly it is currently fitted, for example by its margin violation max(0, 1 - y_i(w·x_i + b)).

ii. Increase the weights γ_i of the poorly fitted samples and retrain with the weighted objective, one common form of which is:

$$\min_{w,b}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \gamma_i \max\bigl(0,\ 1 - y_i(w \cdot x_i + b)\bigr)$$

iii. Solve the new objective to obtain the updated w and b, and repeat as needed.

Here, γ_i is the weight of sample point i.

4. Specific code examples and explanations

Three machine learning algorithms were introduced above: Naive Bayes, k-nearest neighbor and the linear support vector machine. Below we use code examples to learn more about how these algorithms work.

4.1 Naive Bayes algorithm

The following uses Python code examples to implement the Naive Bayes algorithm. In the code, first import the relevant libraries, and then define the training set, test set, categories, and features.

import numpy as np

# define training set
X_train = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y_train = ['A', 'A', 'A', 'B', 'B', 'B']

# define test set
X_test = np.array([[4, 3], [-2, -1], [2, 2]])
Y_test = ['A', 'A', 'B']

# define classes and features
class_list = sorted(set(Y_train))  # sorted for a deterministic class order
num_classes = len(class_list)
feature_size = X_train.shape[1]

print('Training Set:', X_train.shape, Y_train)
print('Test Set:', X_test.shape, Y_test)
print('Classes:', class_list)
print('Feature size:', feature_size)

The running results are as follows:

Training Set: (6, 2) ['A', 'A', 'A', 'B', 'B', 'B']
Test Set: (3, 2) ['A', 'A', 'B']
Classes: ['A', 'B']
Feature size: 2

Next, implement the Naive Bayes algorithm. First, calculate the prior probability.

def calc_prior(Y_train):
    """Calculate the prior probability of each class."""
    count_dict = {}
    total_count = len(Y_train)
    for label in class_list:
        count_dict[label] = sum([1 if label == item else 0 for item in Y_train])
    return {key: value / float(total_count) for key, value in count_dict.items()}

prior = calc_prior(Y_train)
print('Prior Probability:', prior)

The running results are as follows:

Prior Probability: {'A': 0.5, 'B': 0.5}

Then, estimate the class-conditional distributions. Since the features here are continuous coordinates, counting the frequency of exact feature values is not meaningful; instead we model each feature in each class with a Gaussian density (the Gaussian Naive Bayes variant) and rely on the conditional independence assumption to multiply the per-feature densities.

def fit_gaussian_params(X_train, Y_train):
    """Estimate per-class feature means and variances (Gaussian class-conditional densities)."""
    params = {}
    for label in class_list:
        mask = np.array([y == label for y in Y_train])
        X_label = X_train[mask]
        # a small constant keeps the variance away from zero
        params[label] = (X_label.mean(axis=0), X_label.var(axis=0) + 1e-9)
    return params

def gaussian_likelihood(x, mean, var):
    """Product of per-feature Gaussian densities (conditional independence assumption)."""
    return np.prod(np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

params = fit_gaussian_params(X_train, Y_train)
for label in class_list:
    mean, var = params[label]
    print(label, 'mean =', np.round(mean, 3).tolist(), 'var =', np.round(var, 3).tolist())

The running results are as follows:

A mean = [-2.0, -1.333] var = [0.667, 0.222]
B mean = [2.0, 1.333] var = [0.667, 0.222]

Finally, implement the prediction function.

def predict(params, prior, x):
    """Pick the class with the largest posterior: prior times class-conditional likelihood."""
    posteriors = {}
    for label in class_list:
        mean, var = params[label]
        posteriors[label] = prior[label] * gaussian_likelihood(x, mean, var)
    return max(posteriors, key=posteriors.get)

predictions = []
for row in X_test:
    pred = predict(params, prior, row)
    predictions.append(pred)

print('Predictions:', predictions)

The running results are as follows:

Predictions: ['B', 'A', 'B']

At this point, we have implemented a (Gaussian) Naive Bayes classifier that can make classification predictions for new data. Note that the first test point [4, 3] lies among the class-'B' training points, so it is predicted as 'B' even though its listed test label is 'A'.

4.2 k-nearest neighbor algorithm

The following uses Python code examples to implement the k-nearest neighbor algorithm. In the code, first import the relevant libraries, and then define the training set, test set, categories, and features.

import numpy as np
from collections import Counter

# define training set
X_train = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 2], [6, 3]])
Y_train = ['A', 'B', 'A', 'B', 'A', 'B']

# define test set
X_test = np.array([[2, 3], [4, 3], [5, 3], [5, 2], [6, 2]])
Y_test = ['B', 'B', 'A', 'A', 'B']

# define parameters
k = 3
distance_func = lambda x1, x2: np.sum((x1 - x2) ** 2)  # squared Euclidean distance

# classify the first test point: compute its distance to every training point,
# keep the k nearest neighbors, and vote by majority
x = X_test[0]
train_distances = sorted((distance_func(x, x_train), label) for x_train, label in zip(X_train, Y_train))
neighbor_labels = [label for _, label in train_distances[:k]]
prediction = Counter(neighbor_labels).most_common(1)[0][0]

print('Neighbors:', neighbor_labels)
print('Prediction:', prediction)

The running results are as follows:

Neighbors: ['B', 'A', 'B']
Prediction: B

At this point, we have implemented a k-nearest neighbor classifier; applying the same steps to each test point yields a classification prediction for every new data point.

4.3 Linear Support Vector Machine

The following uses Python code examples to implement linear support vector machines. In the code, first import the relevant libraries, and then define the training set, test set, categories, and features.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, random_state=0)

num_classes = len(np.unique(Y_train))
clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
acc = accuracy_score(Y_test, Y_pred)
print('Accuracy:', acc)

The running results are as follows:

Accuracy: 0.973684210526

At this point, we have implemented a linear support vector machine that can make classification predictions on the iris flower dataset.
