Master Python machine learning from scratch in just fourteen steps



"Starting" is often the hardest, especially when there are too many choices, it is often difficult for a person to make a decision to make a choice. The goal of this tutorial is to help novices with little Python machine learning background grow into knowledgeable practitioners, using only freely available materials and resources. The main goal of this outline is to walk you through the vast number of resources available. Undoubtedly, there are many resources, but which ones are the best? Which ones are complementary? In what order is the most appropriate order to study these resources?

First, I'm assuming you're not an expert on:

  • machine learning

  • Python

  • Any machine learning, scientific computing, or data analysis library for Python

Of course, it would be nice if you had some basic understanding of the first two topics, but that isn't necessary; you'll just need to spend a little more time in the early stages.

Basic

Step 1: Basic Python Skills

Some basic understanding of Python is crucial if we are going to use it for machine learning. Fortunately, because Python is a widely used general-purpose programming language with applications in scientific computing and machine learning, finding a beginner's tutorial isn't difficult. Where you should start depends on your prior experience with Python and with programming in general.

First, you need to install Python. Since we will be using scientific computing and machine learning packages later, I recommend installing Anaconda. This is an industrial-strength Python distribution available on Linux, OS X, and Windows, complete with the packages needed for machine learning, including numpy, scikit-learn, and matplotlib. It also includes iPython Notebook, the interactive environment used in many of our tutorials. I recommend installing the current Python 3 release.


If you don't know programming, I suggest you start with the following free online books before moving on to the material that follows:

  • Learn Python the Hard Way, by Zed A. Shaw: https://learnpythonthehardway.org/book/

If you have programming experience but don't know Python, or know it only at a very basic level, I recommend the following two courses:

  • Google Developers Python Course (highly recommended for visual learners) (http://suo.im/toMzq)

  • An Introduction to Python Scientific Computing (from M. Scott Shell, UCSB Engineering) (a good introduction, about 60 pages): (http://suo.im/2cXycM)

If you want a 30-minute crash course in Python, watch this:

  • Learn X in Y minutes (X=Python): (http://suo.im/zm6qX)

Of course, if you are already an experienced Python programmer, you can skip this step. Even so, I recommend keeping the Python documentation close at hand: (https://www.python.org/doc/)

Step 2: Machine Learning Basic Skills

Zachary Lipton of KDnuggets has pointed out that people now evaluate "data scientists" by many different criteria. This actually reflects the state of machine learning as a field, because most of what data scientists do involves using machine learning algorithms to varying degrees. Is it necessary to be intimately familiar with kernel methods in order to effectively build and gain insight from SVMs? Of course not. As with almost everything in life, depth of theoretical mastery should be matched to practical application. A deep understanding of machine learning algorithms is beyond the scope of this article; it generally requires devoting significant time to more academic coursework, or at least intense self-study.

The good news is that you don't need a Ph.D.-like theoretical understanding of machine learning to practice—just as you don't need computer science theory to be an effective programmer.

People tend to give rave reviews to Andrew Ng's machine learning course on Coursera; however, my suggestion is to browse the course notes taken online by a previous student. Skip the notes specific to Octave (a Matlab-like language not relevant to your Python studies). Be aware that these are not official notes, but they do capture the relevant content of Andrew Ng's course material. Of course, if you have the time and interest, you can take Andrew Ng's machine learning course on Coursera now: (http://suo.im/2o1uD)

  • Unofficial Notes for Andrew Ng's class: (http://www.holehouse.org/mlclass/)

In addition to the Andrew Ng course mentioned above, there are many other courses to choose from online if you need them. For example, I really like Tom Mitchell; here is a video of a recent lecture of his (together with Maria-Florina Balcan), and it is very approachable.

  • Tom Mitchell's machine learning course: (http://suo.im/497arw)

You don't need to work through all the notes and videos right now. An effective approach is to go directly to the specific practice problems below when you feel ready, and refer back to the relevant parts of the notes and videos above as needed.

Step 3: Overview of Scientific Computing Python Packages

Alright, so we've got a handle on Python programming and some familiarity with machine learning. Beyond core Python, there are a number of open source libraries commonly used to carry out practical machine learning. Broadly speaking, these so-called scientific Python libraries can be used to perform basic machine learning tasks (the selection here is admittedly somewhat subjective):

  • numpy - mostly useful for its N-dimensional array objects 

    (http://www.numpy.org/)

  • pandas - Python data analysis library, including dataframes and other structures (http://pandas.pydata.org/)

  • matplotlib - a 2D plotting library that produces publication-quality graphs (http://matplotlib.org/)

  • scikit-learn - machine learning algorithms for data analysis and data mining tasks

    (http://scikit-learn.org/stable/)

A good way to learn about these libraries is to study the following material:

  • Scipy Lecture Notes, by Gaël Varoquaux, Emmanuelle Gouillart, and Olav Vahtras: (http://www.scipy-lectures.org/)

  • This pandas tutorial is also very good: 10 Minutes to Pandas: (http://suo.im/4an6gY)

You'll also see some other packages later in this tutorial, such as Seaborn, a matplotlib-based data visualization library. The aforementioned packages are just some of the core libraries commonly used in Python machine learning, but understanding them should save you from getting confused when you encounter other packages later.
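
To check that everything is installed and to see how these core libraries fit together, here is a minimal sketch touching numpy, pandas, matplotlib, and scikit-learn in one pass; it uses scikit-learn's small built-in iris dataset purely for illustration:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()                                 # small built-in dataset
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    print(df.describe())                               # pandas: summary statistics
    print(np.corrcoef(df.iloc[:, 0], df.iloc[:, 1]))   # numpy: correlation matrix
    df.plot.scatter(x=iris.feature_names[0], y=iris.feature_names[1])
    plt.show()                                         # matplotlib: render the plot

If this script runs and a scatter plot appears, your scientific Python stack is ready.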

Let's get started!

Step 4: Getting Started with Machine Learning in Python

First, check that you're ready:

  • Python: ready

  • Machine Learning Essentials: Ready

  • Numpy: ready

  • Pandas: Ready

  • Matplotlib: Ready

Now it's time to implement machine learning algorithms using scikit-learn, the Python machine learning standard library.

[Figure: scikit-learn flowchart]

Many of the following tutorials and exercises are done using iPython (Jupyter) Notebook, an interactive environment for executing Python statements. iPython Notebook can be easily found online or downloaded to your local computer.

  • An overview of iPython Notebook from Stanford:

    (http://cs231n.github.io/ipython-tutorial/)

Also note that the tutorials below are drawn from a range of online resources; if anything in them seems off, you can contact their authors directly. Our first tutorial starts with scikit-learn, and I suggest you read the following articles in order before working through it.

The following is an article introducing scikit-learn, the most commonly used general-purpose machine learning library for Python, which covers the K-nearest neighbor algorithm:

  • Introduction to scikit-learn by Jake VanderPlas: (http://suo.im/3bMdEd)

The following is a more in-depth and extended introduction, including starting and completing a project using a famous dataset:

  • Randal Olson's machine learning case notes: (http://suo.im/RcPR6)

The next post focuses on strategies for evaluating different models on scikit-learn, including train/test splits:

  • Model Evaluation by Kevin Markham: (http://suo.im/2HIXD)
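
To preview the train/test-split strategy those posts discuss, here is a minimal sketch; the classifier choice, split fraction, and random seed are illustrative:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)        # hold out 25% for testing
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))

Evaluating on data the model never saw during training is the simplest guard against overly optimistic accuracy estimates.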

Step 5: Implement basic machine learning algorithms in Python

With a basic grounding in scikit-learn, we can go further and explore more general and practical algorithms. We start with the well-known k-means clustering algorithm, a simple and efficient method that works well for unsupervised learning problems:

  • K-means clustering: (http://suo.im/40R8zf)
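
Before the tutorial, here is a minimal k-means sketch on synthetic data; the cluster count and random seed are illustrative:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data
    km = KMeans(n_clusters=3, random_state=42).fit(X)
    print(km.cluster_centers_)   # the learned centroids
    print(km.labels_[:10])       # cluster assignment for the first 10 samples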

Next we can return to classification problems and learn one of the most popular classification algorithms:

  • Decision Trees: (http://thegrimmscientist.com/tutorial-decision-trees/)

After understanding classification problems, we can move on to continuous numerical prediction:

  • Linear regression: (http://suo.im/3EV4Qn)

We can also apply the idea of regression to classification problems, namely logistic regression:

  • logistic regression: (http://suo.im/S2beL)
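
As a quick companion to these tutorials, here is a minimal logistic-regression sketch; the dataset choice and max_iter value are illustrative:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)  # extra iterations to converge
    print(clf.score(X_te, y_te))          # mean accuracy on held-out data
    print(clf.predict_proba(X_te[:3]))    # per-class probabilities for 3 samples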

Step 6: Implement advanced machine learning algorithms in Python

Now that we are familiar with scikit-learn, we can look at more advanced algorithms. The first is the support vector machine, a not-necessarily-linear classifier that relies on transforming the data into a higher-dimensional space.

  • Support vector machine: (http://suo.im/2iZLLa)
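
Here is a minimal SVM sketch; the RBF kernel performs the higher-dimensional transformation implicitly, and the kernel parameters shown are illustrative:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # nonlinear toy data
    clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
    print(clf.score(X, y))   # training accuracy; use a held-out set in practice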

Next, we can study random forests, an ensemble classifier, via the Kaggle Titanic competition:

  • Kaggle Titanic competition (using random forest): (http://suo.im/1o7ofe)

Dimensionality reduction algorithms are often used to reduce the number of variables used in a problem. Principal component analysis is a special form of unsupervised dimensionality reduction algorithm:

  • Dimensionality reduction algorithm: (http://suo.im/2k5y2E)
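
Here is a minimal PCA sketch, projecting the four-dimensional iris measurements onto the two directions of greatest variance:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)
    print(X_2d.shape)                      # (150, 2)
    print(pca.explained_variance_ratio_)   # variance captured by each component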

Before we get to step seven, we can take a moment to consider some of the progress that has been made in a relatively short period of time.

Using Python and its machine learning libraries, we have covered some of the most common and well-known machine learning algorithms (k-nearest neighbors, k-means clustering, support vector machines, etc.), as well as a powerful ensemble technique (random forests) and some additional machine learning tasks (dimensionality reduction and model validation techniques). Along with some basic machine learning tricks, we've started building a useful toolkit.

Next, we'll add some new, necessary tools.

Step 7: Python Deep Learning

[Figure: a neural network with many layers]

Deep learning is everywhere. It builds on neural networks dating back decades, but recent advances, starting a few years ago, have dramatically improved the capabilities of deep neural networks and generated widespread interest. If you're new to neural networks, KDnuggets has many articles detailing the field's recent flood of innovations, achievements, and accolades.

This final step does not aim to review every kind of deep learning, but rather to explore a few simple network implementations in two advanced, contemporary Python deep learning libraries. For readers interested in digging deeper into deep learning, I recommend starting with this free online book:

  • Neural Networks and Deep Learning, by Michael Nielsen: (http://neuralnetworksanddeeplearning.com/)

1. Theano

Link: (http://deeplearning.net/software/theano/)

Theano is the first Python deep learning library we will cover. Here is how the Theano authors describe it:

Theano is a Python library that enables you to efficiently define, optimize, and evaluate mathematical expressions involving multidimensional arrays.

The following introductory tutorial on deep learning with Theano is a bit long, but it is good enough, vividly written, and highly rated:

  • Theano Deep Learning Tutorial, by Colin Raffel (http://suo.im/1mPGHe)
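
To get a feel for Theano's style of symbolic computation before starting the tutorial, here is a minimal sketch modeled on the classic logistic example from Theano's own documentation (it assumes Theano is installed):

    import theano
    import theano.tensor as T

    x = T.dmatrix('x')                  # a symbolic matrix of doubles
    s = 1 / (1 + T.exp(-x))             # elementwise sigmoid, defined symbolically
    logistic = theano.function([x], s)  # compile the expression graph
    print(logistic([[0, 1], [-1, -2]]))

Defining computations as symbolic graphs and then compiling them is what lets Theano optimize and efficiently evaluate expressions over multidimensional arrays.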

2. Caffe

Link: (http://caffe.berkeleyvision.org/)

The other library we will test drive is Caffe. Again, let's start with the authors' description:

Caffe is a deep learning framework built with expressiveness, speed, and modularity in mind, developed by the Berkeley Vision and Learning Center and community contributors.

This tutorial is the highlight of the whole article. We've looked at several interesting examples above, but none competes with the following one, which implements Google's DeepDream using Caffe. This one is quite exciting! Once you've mastered the tutorial, try letting your processor run wild just for fun.

  • Realize Google DeepDream through Caffe: (http://suo.im/2cUSXS)

I won't promise that this will be quick or easy, but if you put in the time and follow the seven steps above, you'll be well on your way to understanding a large number of machine learning algorithms and to implementing them in Python using popular libraries, including some of the cutting-edge libraries of today's deep learning research.

Advanced

[Figure: machine learning algorithms]

This is the second part of a series on the 7 steps to mastering machine learning with Python. If you've completed the first part, you should have reached a satisfactory pace and level of proficiency; if not, you should probably review the previous article first (how much time that takes depends on your current level of understanding), and I guarantee it will be worth it. After a quick review, this post focuses more explicitly on several machine-learning-related task sets. Since we can safely skip some of the basic modules (Python basics, machine learning basics, and so on), we can jump straight into the different machine learning algorithms. This time we can also better categorize the tutorials by function.

Step 1: Machine Learning Fundamentals Review & A New Perspective

The previous part included the following steps:

1. Python basic skills

2. Basic machine learning skills

3. Overview of Python packages

4. Beginning Machine Learning with Python: Introduction & Model Evaluation

5. Machine learning topics on Python: k-means clustering, decision trees, linear regression & logistic regression

6. Advanced Machine Learning Topics in Python: Support Vector Machines, Random Forests, PCA Dimensionality Reduction

7. Deep Learning in Python

As mentioned above, if you are starting from scratch, I recommend reading the first part in order. That article also lists all the starter material for beginners, along with installation instructions.

However, if you've already read it, here are the fundamentals to begin with:

  • Key Machine Learning Terms Explained, by Matthew Mayo.

    (Address: http://suo.im/2URQGm)

  • Wikipedia entry: Statistical classification.

    (Address: http://suo.im/mquen)

  • Machine Learning: A Complete and Detailed Overview, by Alex Castrounis.

    (Address: http://suo.im/1yjSSq)

If you're looking for an alternative or complementary approach to learning the fundamentals of machine learning, I can refer you to Shai Ben-David's video lectures and the textbook he co-authored with Shai Shalev-Shwartz, which I'm currently working through:

  • Introduction to Machine Learning video lecture by Shai Ben-David, University of Waterloo.

    (Address: http://suo.im/1TFlK6)

  • Understanding Machine Learning: From Theory to Algorithms, by Shai Ben-David & Shai Shalev-Shwartz.

    (Address: http://suo.im/1NL0ix)

Remember, you don't need to read all of this introductory material before starting the series. The video lectures, textbook, and other resources can be consulted as you implement models or as the relevant concepts come up in later steps. Judge for yourself.

Step 2: More Classifiers

We start the new material by consolidating our classification techniques and introducing a few additional algorithms. While part one covered decision trees, support vector machines, and logistic regression, plus the ensemble classifier random forest, here we add k-nearest neighbors, naive Bayes classifiers, and multilayer perceptrons.

[Figure: Scikit-learn classifiers]

k-Nearest Neighbors (kNN) is an example of a simple classifier and a lazy learner: all computation happens at classification time rather than ahead of time in a training step. kNN is non-parametric, and decides how to classify a data instance by comparing it with the k nearest instances.

  • k-nearest neighbor classification using Python.

    (Address: http://suo.im/2zqW0t)
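
To see the lazy-learner behavior in code, here is a minimal sketch; note that fit() merely stores the training data:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # fit() just stores X and y
    print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))           # all work happens here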

Naive Bayes is a classifier based on Bayes' theorem. It assumes that there is independence between features and that the presence of any particular feature in a class is independent of the presence of any other feature in the same class.

  • Document Classification Using Scikit-learn, by Zac Stewart.

    (Address: http://suo.im/2uwBm3)

A multilayer perceptron (MLP) is a simple feedforward neural network consisting of multiple layers of nodes, where each layer is fully connected to the subsequent layer. Multilayer Perceptron was introduced in Scikit-learn version 0.18.

Start by reading an overview of MLP classifiers from the Scikit-learn documentation, then use the tutorial to practice implementing them.

  • Neural Network Models (Supervised), Scikit-learn Documentation.

    (Address: http://suo.im/3oR76l)

  • A Beginner's Guide to Neural Networks with Python and Scikit-learn 0.18! By Jose Portilla.

    (Address: http://suo.im/2tX6rG)
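
As a companion to those readings, here is a minimal sketch of the Scikit-learn MLP estimator; the layer size and iteration count are illustrative, and scaling the inputs first matters for neural networks:

    from sklearn.datasets import load_digits
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X = StandardScaler().fit_transform(X)        # zero mean, unit variance
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                        random_state=0).fit(X_tr, y_tr)
    print(mlp.score(X_te, y_te))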

Step 3: More Clustering

We now move on to clustering, a form of unsupervised learning. In the previous article, we discussed the k-means algorithm; here we introduce DBSCAN and expectation maximization (EM).

[Figure: Scikit-learn clustering algorithms]

First, read these introductory articles; the first is a quick comparison of k-means and EM clustering and a nice bridge to the new forms of clustering, and the second is an overview of the clustering techniques available in Scikit-learn:

  • Clustering Techniques Compared: A Concise Technical Overview, by Matthew Mayo.

    (Address: http://suo.im/4ctIvI)

  • Comparing different clustering algorithms on a toy dataset, Scikit-learn documentation.

    (Address: http://suo.im/4uvbbM)

Expectation-maximization (EM) is a probabilistic clustering algorithm: it determines the probability that an instance belongs to a particular cluster. EM approximates maximum likelihood or maximum a posteriori estimates of the parameters of a statistical model (Han, Kamber, and Pei). The EM process iterates from an initial set of parameters until clustering is maximized with respect to the k clusters.

First read the tutorial on the EM algorithm. Next, take a look at the relevant Scikit-learn documentation. Finally, follow the tutorial to implement EM clustering yourself using Python.

  • Expectation-Maximization (EM) Algorithm Tutorial, by Elena Sharova.

    (Address: http://suo.im/33ukYd)

  • Gaussian Mixture Models, Scikit-learn Documentation.

    (Address: http://suo.im/20C2tZ)

  • A Quick Introduction to Building Gaussian Mixture Models in Python, by Tiago Ramalho.

    (Address: http://suo.im/4oxFsj)

If Gaussian mixture models seem confusing at first, this relevant passage from the Scikit-learn documentation should allay any excess worry:

The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models.
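
To make that concrete, here is a minimal sketch of EM clustering via GaussianMixture (available since Scikit-learn 0.18); the synthetic blob data and component count are illustrative:

    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
    gmm = GaussianMixture(n_components=3, random_state=1).fit(X)  # fit runs EM
    print(gmm.means_)                 # estimated component means
    print(gmm.predict_proba(X[:3]))   # soft (probabilistic) cluster membership

Unlike k-means, the predict_proba output shows that each point belongs to every cluster with some probability.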

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) operates by grouping densely packed data points together and designating points in low-density regions as outliers.

First read and follow DBSCAN's example implementation from Scikit-learn's documentation, then follow the concise tutorial:

  • Demo of the DBSCAN clustering algorithm, Scikit-learn documentation.

    (Address: http://suo.im/1l9tvX)

  • Density-based clustering algorithm (DBSCAN) and implementation.

    (Address: http://suo.im/1LEoXC)
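
As a quick companion to those resources, here is a minimal DBSCAN sketch; the eps and min_samples values are illustrative and usually need tuning for each dataset:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
    db = DBSCAN(eps=0.3, min_samples=5).fit(X)
    print(np.unique(db.labels_))   # cluster ids; the label -1 marks noise points

Note that DBSCAN does not take a cluster count up front; the number of clusters emerges from the density structure of the data.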

Step 4: More Ensemble Methods

The previous article dealt with only a single ensemble method: random forests (RF). RF has been a top-performing classifier with great success over the past few years, but it is certainly not the only ensemble classifier out there. We'll take a look at bagging, boosting, and voting.

[Figure: boosting]

First, read these overviews of ensemble learners; the first is general, and the second covers them as they relate to Scikit-learn:

  • An Introduction to Ensemble Learners, by Matthew Mayo.

    (Address: http://suo.im/cLESw)

  • Ensembling Methods in Scikit-learn, Scikit-learn Documentation.

    (Address: http://suo.im/yFuY9)

Then, before moving on to the new ensemble methods, get back up to speed on random forests with a fresh tutorial:

  • Random Forests in Python, from Yhat.

    (Address: http://suo.im/2eujI)

Bagging, boosting, and voting are all different forms of ensemble classifiers, and all involve building multiple models; however, which algorithm the models are built from, which data the models use, and how the results are ultimately combined vary from scheme to scheme (a sketch of all three follows the list below):

  • Bagging: builds multiple models with the same classification algorithm while using different (independent) samples of data from the training set - Scikit-learn implements a bagging classifier

  • Boosting: builds multiple models with the same classification algorithm, chaining the models one after another in order to boost the learning of each subsequent model - Scikit-learn implements AdaBoost

  • Voting: builds multiple models with different classification algorithms, and uses a criterion to determine how the models' results are best combined - Scikit-learn implements a voting classifier
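
To make the distinctions concrete, here is a minimal sketch of all three schemes side by side (hyperparameters are illustrative defaults, not tuned values):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  VotingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    bagging = BaggingClassifier(n_estimators=50)    # resampled copies of one algorithm
    boosting = AdaBoostClassifier(n_estimators=50)  # sequentially reweighted models
    voting = VotingClassifier(estimators=[          # heterogeneous models, combined vote
        ('lr', LogisticRegression(max_iter=1000)),
        ('dt', DecisionTreeClassifier()),
        ('nb', GaussianNB())])
    for name, clf in [('bagging', bagging), ('boosting', boosting),
                      ('voting', voting)]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())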

So, why combine models? To approach the question from one specific angle, here is an overview of the bias-variance tradeoff as it relates to bagging, from the Scikit-learn documentation:

  • Single estimator versus bagging: bias-variance decomposition, Scikit-learn documentation.

    (Address: http://suo.im/3izlRB)

Now that you've read some introductory material on ensemble learners and have a basic understanding of a few specific ensemble classifiers, here's how to implement ensemble classifiers in Python using Scikit-learn from Machine Learning Mastery:

  • Implementing Ensemble Machine Learning Algorithms in Python Using Scikit-learn by Jason Brownlee.

    (Address: http://suo.im/9WEAr)

Step 5: Gradient Boosting

In the next step, we continue learning about ensemble classifiers by exploring one of the most popular contemporary machine learning algorithms. Gradient boosting has recently made a significant impact in machine learning, becoming one of the most popular and successful algorithms in Kaggle competitions.

[Figure: gradient boosting]

First, read an overview of gradient boosting:

  • Wikipedia entry: Gradient boosting. (Address: http://suo.im/TslWi)

Next, learn why gradient boosting is the "winning" method in Kaggle competitions:

  • Why does gradient boosting work so well on so many Kaggle problems? Quora.

    (Address: http://suo.im/3rS6ZO)

  • Kaggle Guru Explains What Gradient Boosting Is, by Ben Gorman.

    (Address: http://suo.im/3nXlWR)

While Scikit-learn has its own implementation of gradient boosting, we will deviate slightly and use the XGBoost library, which, as mentioned, is a faster implementation.

The following links provide some additional information about the XGBoost library, as well as about gradient boosting (out of necessity):

  • Wikipedia entry: XGBoost.

    (Address: http://suo.im/2UlJ3V)

  • XGBoost library on GitHub.

    (Address: http://suo.im/2JeQI8)

  • XGBoost documentation.

    (Address: http://suo.im/QRRrm)

Now, follow this tutorial to bring it all together:

  • An Implementation Guide to XGBoost Gradient Boosted Trees in Python, by Jesse Steinweg-Woods.

    (Address: http://suo.im/4FTqD5)

You can also consolidate what you've learned with these more succinct examples:

  • XGBoost example on Kaggle (Python).

    (Address: http://suo.im/4F9A1J)

  • A Simple Tutorial on the Iris Dataset and XGBoost by Ieva Zarina.

    (Address: http://suo.im/2Lyb1a)
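
After working through those examples, a minimal sketch like the following can serve as a template; it uses XGBoost's scikit-learn-style wrapper, and the hyperparameter values are illustrative only:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = XGBClassifier(n_estimators=100, learning_rate=0.1,
                          max_depth=3).fit(X_tr, y_tr)   # boosted tree ensemble
    print(model.score(X_te, y_te))                       # held-out accuracy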

Step 6: More Dimensionality Reduction

Dimensionality reduction is the use of a procedure to obtain a set of principal variables, reducing the number of variables used for model building from the initial count to a much smaller one.

There are two main forms of dimensionality reduction:

  • Feature selection - selecting a subset of the relevant features.

    (Address: http://suo.im/4wlkrj)

  • Feature extraction - building an informative and non-redundant set of derived feature values.

    (Address: http://suo.im/3Gf0Yw)

The following is a pair of commonly used feature extraction methods.


Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible).

The definition above is taken from the PCA Wikipedia entry, which you can read further if interested. In any case, the overview/tutorial below is quite thorough:

  • Principal Component Analysis in 3 Simple Steps, by Sebastian Raschka.

    (Address: http://suo.im/1ahFdW)

Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination can be used as a linear classifier or, more commonly, as a dimensionality reduction step before subsequent classification.

LDA is closely related to analysis of variance (ANOVA) and regression analysis, which likewise attempt to express a dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, while discriminant analysis has continuous independent variables and a categorical dependent variable (i.e., the class label).

The definition above is also from Wikipedia. Here's the full read:

  • Linear Discriminant Analysis - Bit by Bit, by Sebastian Raschka.

    (Address: http://suo.im/gyDOb)

Are you confused about the actual difference between PCA and LDA for dimensionality reduction? Sebastian Raschka made the following clarification:

Both linear discriminant analysis (LDA) and principal component analysis (PCA) are linear transformation techniques commonly used for dimensionality reduction. PCA can be described as an "unsupervised" algorithm, since it "ignores" class labels, and its goal is to find directions that maximize variance in a dataset (so-called principal components). In contrast to PCA, LDA is "supervised" and computes the direction of the axis ("linear discriminant") that maximizes the separation between multiple classes.

For a brief description of this, read the following:

  • What is the difference in dimensionality reduction between LDA and PCA? By Sebastian Raschka.

    (Address: http://suo.im/2IPt0U)
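
A minimal side-by-side sketch makes the distinction concrete: PCA never sees the class labels, while LDA requires them (with iris's three classes, LDA can produce at most two discriminant axes):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    X_pca = PCA(n_components=2).fit_transform(X)      # unsupervised: ignores y
    X_lda = LinearDiscriminantAnalysis(
        n_components=2).fit_transform(X, y)           # supervised: uses y
    print(X_pca.shape, X_lda.shape)                   # both (150, 2)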

Step 7: More Deep Learning

The previous article provided an introduction to neural networks and deep learning. If your learning has gone relatively smoothly so far and you want to consolidate your understanding of neural networks and practice implementing a few common models, please read on.


First, look at some basic material for deep learning:

  • Deep Learning Key Terms and Explanations, by Matthew Mayo

  • 7 Steps to Understanding Deep Learning, by Matthew Mayo.

    (Address: http://suo.im/3QmEfV)

Next, try some concise overviews/tutorials on TensorFlow, Google's open source software library for machine intelligence (an efficient deep learning framework and pretty much the best neural network tool available today):

  • Machine Learning Stepping Stone: An Introduction to TensorFlow Anyone Can Understand (Parts 1 and 2)

  • Entry-Level Explainer: An Introduction to TensorFlow for Complete Beginners (Parts 3 and 4)

Finally, try out these tutorials directly from the TensorFlow website, which implements some of the most popular and common neural network models:

  • Recurrent Neural Networks, Google TensorFlow Tutorial.

    (Address: http://suo.im/2gtkze)

  • Convolutional Neural Networks, Google TensorFlow Tutorial.

    (Address: http://suo.im/g8Lbg)
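
The tutorials above target TensorFlow's lower-level API. As a quick smoke test of your installation, here is a minimal sketch of a small fully connected network, assuming TensorFlow 2.x with its bundled Keras API:

    import tensorflow as tf

    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
    x_tr, x_te = x_tr / 255.0, x_te / 255.0            # scale pixels to [0, 1]

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_tr, y_tr, epochs=1)                    # one epoch as a smoke test
    print(model.evaluate(x_te, y_te, verbose=0))       # [loss, accuracy]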

In addition, an article on 7 steps to mastering deep learning is currently in the works, focusing on using high-level APIs on top of TensorFlow to make model implementation easier and more flexible. I'll add a link here once it's finished.

Related:

  • 5 eBooks You Should Read Before Entering the Machine Learning Industry.

    (Address: http://suo.im/SlZKt)

  • 7 steps to understanding deep learning.

    (Address: http://suo.im/3QmEfV)

  • Key machine learning terms and explanations.

    (Address: http://suo.im/2URQGm)

Source: blog.csdn.net/weixin_56135535/article/details/132076642