Chapter 6|Spark MLlib Machine Learning (1)

MLlib is a machine learning library provided by Spark. By calling the algorithms encapsulated by MLlib, machine learning applications can be easily constructed. It provides a very rich set of machine learning algorithms, such as classification, regression, clustering and recommendation algorithms. In addition, MLlib standardizes the APIs used for machine learning algorithms, making it easier to combine multiple algorithms into a single Pipeline or workflow. Through this article, you can learn:

  • What is machine learning
  • Big data and machine learning
  • Machine learning classification
  • Introduction to Spark MLlib

Machine learning is a branch of artificial intelligence and an interdisciplinary subject that draws on probability theory, statistics, approximation theory, convex analysis, and computational complexity theory. Machine learning theory is mainly concerned with designing and analyzing algorithms that allow computers to "learn" automatically. Because learning algorithms rely heavily on statistical theory, machine learning is closely related to inferential statistics, and its theoretical side is also called statistical learning theory. In terms of algorithm design, machine learning theory focuses on learning algorithms that are both achievable and effective.

Source: Mitchell, T. (1997). Machine Learning. McGraw Hill.

What is machine learning

Applications of machine learning have spread across the branches of artificial intelligence, including expert systems, automatic reasoning, natural language understanding, pattern recognition, computer vision, and intelligent robots. As a sub-discipline of artificial intelligence, machine learning mainly studies how to let machines learn from past experience, model the uncertainty in data, and predict the future. Typical application areas include search, recommendation systems, spam filtering, face recognition, and voice recognition.

Big data and machine learning

In the era of big data, data is generated at an astonishing rate. The Internet, the mobile Internet, the Internet of Things, GPS, and other sources produce data continuously, and the storage and computing capacity needed to process it has grown just as quickly. This gave rise to a series of big data technologies, represented by Hadoop, that provide a reliable foundation for storing and processing this data.

Data, information, and knowledge form three levels, from the largest in volume to the smallest. Raw data alone rarely explains a problem; human experience has to be added to turn it into information. Information is what eliminates uncertainty: the "information asymmetry" we often talk about means that one side cannot obtain enough information, so some uncertainty is hard to eliminate. Knowledge is the highest level, which is why data mining is also called knowledge discovery.

The task of machine learning is to apply algorithms to big data and mine the knowledge hidden in it. The more training data there is, the better machine learning can show its advantages. Problems that machine learning could not solve well in the past, such as speech recognition and image recognition, can now be tackled with the help of big data technology, often with much better performance.

Machine learning classification

Machine learning is mainly divided into the following categories:

  • Supervised learning

    Often treated as synonymous with classification, although it also covers regression. The supervision comes from the labeled examples in the training data set. For example, in the postal code recognition problem, a set of handwritten postal code images and their corresponding machine-readable translations are used as training examples to supervise the learning of the classification model. Common supervised learning algorithms include linear regression, logistic regression, decision trees, naive Bayes, and support vector machines.

  • Unsupervised learning

    Essentially a synonym for clustering. The learning process is unsupervised because the input instances have no class labels. The task of unsupervised learning is to uncover latent structure in a given data set. For example, give the machine photos of cats and dogs without any labels and ask it to group them. The machine will end up dividing the photos into two groups, but it does not know which photos are cats and which are dogs; from its point of view the photos simply fall into two categories, A and B. Common unsupervised learning algorithms include K-means clustering and principal component analysis (PCA).

  • Semi-supervised learning

    Semi-supervised learning is a class of machine learning techniques that use both labeled and unlabeled instances when learning a model. The learner automatically exploits unlabeled samples to improve learning performance, without relying on external interaction.

    The real-world demand for semi-supervised learning is strong, because in practice a large number of unlabeled samples can often be collected easily, while obtaining labels costs manpower and material resources. For example, in computer-assisted medical image analysis, a large number of medical images can be obtained from hospitals, but it is unrealistic to expect medical experts to mark every lesion in them. This pattern of few labeled samples and many unlabeled samples is even more pronounced in Internet applications. In web page recommendation, for instance, users are asked to mark the pages they find interesting, but few users are willing to spend much time providing marks. As a result there are only a few marked pages, while countless pages on the Internet can serve as unmarked samples.

  • Reinforcement learning

    Also known as evaluative learning, reinforcement learning is an important machine learning method with many applications in areas such as intelligent robot control and analysis and prediction. The model most commonly used in reinforcement learning is the standard Markov Decision Process (MDP).
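
The cats-and-dogs example under unsupervised learning can be sketched with a small K-means routine in plain Python. This is only an illustration, not Spark code: the function name `kmeans`, the two-dimensional "photo features", and the choice of initial centers are all assumptions made up for the example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain K-means: assign each point to its nearest center, then move
    each center to the mean of the points assigned to it."""
    centers = [tuple(c) for c in centers]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        assignments = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: recompute each center as the mean of its members.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centers

# Hypothetical 2-D features for six unlabeled photos (made-up numbers).
photos = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]

# Use two of the points as initial centers (an arbitrary choice for the sketch).
labels, centers = kmeans(photos, centers=[photos[0], photos[3]])

# The algorithm only knows "group A" and "group B", not "cat" or "dog".
names = ["A" if label == 0 else "B" for label in labels]
print(names)
```

No labels are passed in anywhere; the grouping comes only from the distances between points, which is what makes the procedure unsupervised.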

Introduction to Spark MLlib

MLlib is Spark's machine learning library; it simplifies the engineering practice of machine learning. MLlib contains a rich set of machine learning algorithms, including classification, regression, clustering, collaborative filtering, and principal component analysis. Currently, MLlib is divided into two code packages: spark.mllib and spark.ml.

spark.mllib

Spark MLlib, the machine learning library that Spark originally provided, is an important part of Spark. It has one shortcoming: if the data set is complex and needs several rounds of processing, or if new data has to be combined with several already trained models for a comprehensive calculation, programs written with Spark MLlib become complicated in structure, or even difficult to understand and implement.

spark.mllib is the original, RDD-based algorithm API and is currently in maintenance mode. The library covers four common types of machine learning algorithms: classification, regression, clustering, and collaborative filtering. The key point is that no new features will be added to the RDD-based API.

spark.ml

Spark 1.2 introduced ML Pipeline. After several releases of development, Spark ML overcomes some of the shortcomings of MLlib in handling machine learning problems (complex and unclear processes) and gives users a machine learning library based on the DataFrame API, which makes building an entire machine learning application simple and efficient.

"Spark ML" is not an official name; it is used to refer to the MLlib library based on the DataFrame API. Compared with RDDs, DataFrames provide a friendlier API. Their benefits include Spark data sources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and a unified API across languages.

The Spark ML API provides many feature processing functions, such as feature selection, feature transformation, encoding of categorical values as numbers, regularization, and dimensionality reduction. In addition, the DataFrame-based ml library supports building machine learning pipelines, which organize the tasks of a machine learning workflow in an orderly way so that they are easy to run and migrate. Spark officially recommends using the spark.ml library.
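
To make the pipeline idea more concrete, here is a toy sketch in plain Python rather than Spark itself. In spark.ml a pipeline is a sequence of stages, where a transformer exposes `transform` and an estimator exposes `fit` and produces a fitted transformer; the sketch imitates only that convention. The class names `Tokenizer`, `VocabularyIndexer`, `VocabularyModel`, `ToyPipeline`, and `ToyPipelineModel` are hypothetical, and a list of dicts stands in for a DataFrame.

```python
class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]

class VocabularyModel:
    """Fitted transformer: maps known words to integer ids."""
    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Words that were not seen during fit are skipped.
        return [
            {**r, "ids": [self.index[w] for w in r["words"] if w in self.index]}
            for r in rows
        ]

class VocabularyIndexer:
    """Estimator: learns a word -> id mapping from the training rows."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r["words"]})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})

class ToyPipelineModel:
    """Result of fitting a pipeline: only transformers remain."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline:
    """Runs stages in order, fitting estimators on the data seen so far."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # feed the output to the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)

training = [{"text": "Spark makes pipelines simple"}, {"text": "pipelines chain stages"}]
model = ToyPipeline([Tokenizer(), VocabularyIndexer()]).fit(training)

# The same fitted stages are reused, in the same order, on new data.
result = model.transform([{"text": "Spark chain unknown"}])
print(result[0]["ids"])
```

The point of the design is that the whole preprocessing-plus-model chain is fitted once and then applied as a single object, so the same steps cannot be run in a different order or forgotten at prediction time.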

Data transformation

Data transformation is an important part of data preprocessing; it includes tasks such as normalization, discretization, and deriving new indicators. Spark ML provides a rich set of data transformation algorithms. For details, please refer to the official website. They are summarized as follows:

Among the above transformation algorithms, term frequency-inverse document frequency (TF-IDF), Word2Vec, and PCA are the most common. If you have done text mining, they should be familiar.
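
As a reminder of what TF-IDF computes, here is a minimal sketch in plain Python, not the Spark implementation. The function name `tf_idf` and the three sample documents are made up for illustration, and the smoothed IDF formula used here, log((N + 1) / (df + 1)), should be treated as an assumption because implementations differ in the exact form.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed form, see the lead-in).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    # Weight = raw term count in the document * IDF of the term.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in documents
    ]

docs = [
    "spark makes machine learning simple".split(),
    "spark runs on big data".split(),
    "machine learning needs data".split(),
]
weights = tf_idf(docs)

# "simple" occurs in one document, "spark" in two, so "simple" weighs more.
print(weights[0]["simple"] > weights[0]["spark"])
```

Terms that appear in many documents get a small weight, while terms concentrated in a few documents get a large one, which is why TF-IDF is a common first step in text mining.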

Data reduction

Big data is the foundation of machine learning and provides sufficient training data for it. When the amount of data is very large, data reduction techniques are needed to delete or reduce redundant dimensions and attributes and so streamline the data set. The idea is similar to sampling: the volume of data shrinks, but its integrity is preserved. The feature selection and dimensionality reduction methods provided by Spark ML are shown in the following table:

Feature selection and dimensionality reduction are commonly used in machine learning. The methods above can be used to reduce the number of features and eliminate noise while preserving the structure of the original data. Principal component analysis (PCA) in particular plays a very important role in both statistics and machine learning.
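
The idea behind PCA can be illustrated with a short plain-Python sketch that finds only the first principal component by power iteration on the covariance matrix. This is a swapped-in teaching technique, not how Spark computes PCA; the function name `first_principal_component` and the sample data are assumptions for the example.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the direction of greatest variance and the 1-D projection."""
    n, d = len(rows), len(rows[0])
    # Center the data so that every column has mean zero.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the eigenvector
    # with the largest eigenvalue, which is the first principal component.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto that direction: d features become 1.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected

# Made-up 2-D data lying close to the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
direction, projected = first_principal_component(data)

# The slope of the recovered direction should be close to 2.
print(direction[1] / direction[0])
```

Two correlated features are replaced by a single coordinate along the line they share, which is the sense in which dimensionality reduction shrinks the data while keeping its structure.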

Machine learning algorithm

Spark supports commonly used machine learning algorithms such as classification, regression, clustering, and recommendation. See the table below:

Summary

This article gives a general introduction to machine learning, including its basic concepts, its basic categories, and the Spark machine learning library. After reading it, you should have a preliminary understanding of machine learning. In the next article, I will share a machine learning application based on the Spark ML library, which mainly involves the LDA topic model and K-means clustering.

Origin blog.csdn.net/jmx_bigdata/article/details/107775424