Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence

[1] Raschka S, Patterson J, Nolet C. Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence[J]. Information (Switzerland), 2020.

Abstract

Introduction

Figure 1. The standard Python ecosystem for machine learning, data science, and scientific computing.

NumPy is a multidimensional array library with basic linear algebra routines, and the SciPy library adorns NumPy arrays with many important primitives, from numerical optimizers and signal processing to statistics and sparse linear algebra.
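As a rough illustration of this division of labor, a minimal sketch (the specific routines chosen here are illustrative, not taken from the paper):

```python
import numpy as np
from scipy import optimize, sparse

# NumPy: multidimensional arrays with basic linear algebra
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)  # solve the linear system Ax = b

# SciPy: a numerical optimizer operating on NumPy arrays
result = optimize.minimize(lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2,
                           x0=np.zeros(2))

# SciPy: sparse linear algebra on top of NumPy data
S = sparse.csr_matrix(np.eye(4))
print(x, result.x, S.nnz)
```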

While both NumPy and Pandas [10] (Figure 1) provide abstractions over a collection of data points with operations that work on the dataset as a whole, Pandas extends NumPy by providing a data frame-like object supporting heterogeneous column types and row and column metadata.
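A small sketch of the difference (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

# A NumPy array is homogeneous: every element shares one dtype
arr = np.array([[5.1, 3.5], [4.9, 3.0]])

# A Pandas DataFrame adds row/column metadata and supports
# heterogeneous column types in one tabular object
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],       # float column
    "species": ["setosa", "setosa"],  # string column
}, index=["sample_1", "sample_2"])    # row labels (metadata)

print(df.dtypes)           # per-column types
print(df.loc["sample_1"])  # label-based row access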

Pandas, NumPy, and SciPy remain the most user-friendly and recommended choices for many data science and computing projects.

In recent years, machine learning and deep learning technologies have advanced the state of the art in many fields, but one often-cited disadvantage of these technologies over more traditional approaches is a lack of interpretability and explainability.

Deep learning is particularly attractive for working with large, unstructured datasets, such as text and images. In contrast, most classical ML techniques were developed with structured data in mind; that is, data in a tabular form, where training examples are stored as rows, and the accompanying observations (features) are stored as columns. 
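To make the tabular convention concrete, a brief sketch using scikit-learn's bundled Iris data (assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris

# Classical ML expects tabular data: rows = training examples,
# columns = features (the accompanying observations)
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 examples (rows), 4 features (columns)
print(y.shape)  # (150,): one class label per row
```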

In this context, as a rule of thumb, we consider datasets with fewer than 1000 training examples as small, and datasets with between 1000 and 100,000 examples as medium-sized.

Figure 3. (a) The different stages of the AutoML process for selecting and tuning classical ML models. (b) AutoML stages for generating and tuning models using neural architecture search.
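The "selecting and tuning classical ML models" stage in panel (a) can be approximated with an ordinary cross-validated grid search; the sketch below is a simplified stand-in, not the AutoML tooling discussed in the paper:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Cross-validated hyperparameter search: a minimal stand-in for
# the model selection and tuning stage of an AutoML system
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Full AutoML systems extend this idea by also searching over model families and preprocessing steps, rather than a fixed grid for a single estimator.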

Deep Learning

Using classical ML, the predictive performance depends significantly on data processing and feature engineering for constructing the dataset that will be used to train the models. Classical ML methods, mentioned in Section 2, are often problematic when working with high-dimensional datasets: the algorithms are suboptimal for extracting knowledge from raw data, such as text and images [98]. Additionally, converting a training dataset into a suitable tabular (structured) format typically requires manual feature engineering. For example, in order to construct a tabular dataset, we may represent a document as a vector of word frequencies [99], or we may represent (Iris) flowers by tabulating measurements of the leaf sizes instead of using the pixels in a photograph as inputs [100].
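The word-frequency representation mentioned above can be reproduced with scikit-learn's CountVectorizer; a minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Manual feature engineering: represent each document as a vector
# of word frequencies (bag-of-words), yielding a tabular dataset
# with one row per document and one column per vocabulary word
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # rows = documents, columns = word counts
```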
Classical ML is still the recommended choice for most modeling tasks that are based on tabular datasets. However, aside from the AutoML tools mentioned in Section 3 above, it depends on careful feature engineering, which requires substantial domain expertise. Data preprocessing and feature engineering can be considered an art, where the goal is to extract useful and salient information from the collected raw data in such a manner that most of the information relevant for making predictions is retained. Careless or ineffective feature engineering can result in the removal of salient information and substantially hamper the performance of predictive models. 
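One common way to make such preprocessing less error-prone and reproducible is to encode it in a pipeline; the steps below are illustrative, not prescribed by the paper:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Bundling preprocessing with the model keeps feature
# transformations consistent between training and prediction
pipe = Pipeline([
    ("scale", StandardScaler()),  # illustrative preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```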
While some deep learning algorithms are capable of accepting tabular data as input, the majority of state-of-the-art methods achieving the best predictive performance are general-purpose and able to extract salient information from raw data in a somewhat automated way. This automatic feature extraction is an intrinsic component of their optimization task and modeling architecture. For this reason, deep learning is often described as a representation or feature learning method. However, one major downside of deep learning is that it is not well suited to smaller, tabular datasets, and parameterizing DNNs typically demands larger datasets, often between 50 thousand and 15 million training examples, for effective training.
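As an illustration of this automatic feature extraction, a minimal PyTorch sketch (the architecture and input shapes are assumptions for demonstration, not from the paper):

```python
import torch
import torch.nn as nn

# A small CNN: the convolutional layers learn feature extractors
# directly from raw pixels, replacing manual feature engineering
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # learned filters
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                 # classifier head
)

x = torch.randn(4, 1, 28, 28)  # batch of 4 raw 28x28 images
print(model(x).shape)          # torch.Size([4, 10])
```

Here the convolutional filters play the role that hand-crafted features play in classical ML: they are learned jointly with the classifier as part of the optimization task.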


Source: blog.csdn.net/moonlightpeng/article/details/121138527