Part 1 Introduction to Machine Learning
Getting Started with Machine Learning
Machine learning is the extraction of knowledge from data. It is a research field at the intersection of statistics, artificial intelligence and computer science. It is also known as predictive analytics or statistical learning.
1.1 Why Machine Learning
In the early days of "smart" applications, many systems used hand-written "if" and "else" decision rules to process data or to respond to user input.
Spam filtering is one example: experts designed a system of rules to build an "intelligent" application. Hand-written decision rules are feasible for some applications, especially when people understand the process being modeled very well.
But hand-written decision rules have two disadvantages:
- The logic needed to make a decision is specific to a single domain and a single task, and even a slight change to the task may require rewriting the entire system.
- Designing the rules requires a deep understanding of how a human expert would make the decision.
One example where hand-written rules fail is face detection in images. The main problem is that the way a computer "perceives" pixels (the values that make up an image in the computer) is very different from the way humans perceive a face. Because of this difference in representation, it is basically impossible for a person to write down a good set of rules describing what makes up a face in a digital image.
With machine learning, however, feeding the program a large number of face images is enough for the algorithm to determine which features it needs to recognize a face.
1.1.1 Problems that machine learning can solve
The most successful machine learning algorithms are those that automate a decision-making process by generalizing from known examples. This setting is called supervised learning.
Supervised learning: machine learning algorithms that learn from input/output pairs.
The user gives the algorithm pairs of inputs and expected outputs, and the algorithm finds a way to produce the expected output for a given input. In particular, it can produce an output for an input it has never seen before, without human help (this is what automates the decision process).
Problems that supervised learning can solve: recognizing handwritten zip codes on envelopes, judging whether a tumor is benign based on medical images, detecting fraudulent credit card transactions.
An interesting observation about these examples is that, while the inputs and outputs look fairly simple, the data collection process is quite different in each of the three.
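To make the idea of learning from input/output pairs concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier. The toy data (hours studied, hours slept, and a pass/fail label) is hypothetical and only serves as an illustration; any other classifier could be used in its place.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: each input is [hours_studied, hours_slept],
# each expected output is 1 (passed) or 0 (failed).
X_train = [[1, 4], [2, 5], [8, 7], [9, 8]]
y_train = [0, 0, 1, 1]

# Learn from the known input/output pairs.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Produce outputs for inputs the model has never seen before.
X_new = [[7, 6], [1.5, 5]]
predictions = model.predict(X_new)
print(predictions)  # [1 0]
```

With n_neighbors=1, each new input simply receives the label of the closest training example, which is why the first point is predicted as 1 and the second as 0.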
Unsupervised learning: only the input data is known, and no output data is given to the algorithm.
Problems that unsupervised learning can solve: identifying the topics of a series of blog posts, segmenting customers into groups with similar preferences, detecting unusual access patterns to a website.
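The customer segmentation case can be sketched with a clustering algorithm. The example below uses scikit-learn's KMeans, chosen here only for illustration, on hypothetical customer data. No output labels are given to the algorithm; it groups the rows by similarity on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [visits per month, average basket value].
X = np.array([
    [2, 10], [3, 12], [2, 11],      # occasional, small purchases
    [20, 95], [22, 100], [21, 98],  # frequent, large purchases
])

# Ask for two groups; only the inputs X are provided, no expected outputs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

The cluster numbers themselves are arbitrary. What matters is that the first three customers end up in one group and the last three in the other.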
For both supervised and unsupervised learning tasks, it is very important to represent the input data in a form that the computer can understand. It is often useful to think of the data as a table.
Each data point you want to process (each email, each customer, each transaction) is a row in the table, and each attribute that describes a data point (such as the customer's age, the transaction amount, or the transaction location) is a column.
In machine learning, each entity or row is called a sample or data point, and each column (an attribute describing these entities) is called a feature.
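As a small illustration of the table view, the sketch below stores three hypothetical transactions in a NumPy array, with one row per sample and one column per feature. The feature names and values are made up for this example.

```python
import numpy as np

# Hypothetical feature names: one per column of the table.
feature_names = ["customer_age", "transaction_amount"]

# One row per sample (data point), one column per feature.
X = np.array([
    [34, 120.5],
    [51, 15.0],
    [27, 980.0],
])

n_samples, n_features = X.shape
print("samples: {}, features: {}".format(n_samples, n_features))
# samples: 3, features: 2
```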
1.1.2 Understanding the task and the data
Probably the most important part of the machine learning process is understanding the data you are working with and how it relates to the task you want to solve.
Before you start building a model, you need to understand what is in your dataset. Each algorithm differs in the type of input data and the kind of problem it is best suited for.
From a wider perspective, machine learning algorithms and methods are only one part of the process of solving a specific problem, so it is important to keep the big picture of the whole project in mind (focus on solving the problem).
1.2 Why choose Python
There are three main reasons to use Python for machine learning:
1. Python has become the common language of many data science applications. It combines the power of a general-purpose programming language with the ease of use of domain-specific scripting languages such as MATLAB or R.
2. Python has libraries for data loading, visualization, statistics, natural language processing, image processing, and more.
3. You can interact with your code directly, using a terminal or tools such as Jupyter Notebook.
1.3 scikit-learn
scikit-learn is an open source project that is free to use and distribute, and anyone can read its source code to see what happens behind the scenes. It is a very popular tool and the most famous Python machine learning library.
scikit-learn also works well with a large number of other Python scientific computing tools.
scikit-learn User Guide (http://scikit-learn.org/stable/user_guide.html)
scikit-learn depends on two other Python packages: NumPy and SciPy
For plotting and interactive development, you should also install matplotlib, IPython and Jupyter Notebook
If you already have Python installed, then you can install all the above packages with pip:
pip install numpy scipy matplotlib ipython scikit-learn pandas
Note: Run the pip command from a terminal in which pip is available (on Windows, pip is located in the Scripts folder of the Python installation), and wait for the downloads to finish.
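After the installation finishes, a quick way to confirm that everything is in place is to import each package and print its version. The following is a minimal sketch; note that scikit-learn is imported under the name sklearn, and that the versions printed will depend on your environment.

```python
import importlib

# Import names of the packages installed above
# (scikit-learn is imported as "sklearn").
packages = ["numpy", "scipy", "matplotlib", "IPython", "sklearn", "pandas"]

versions = {}
for name in packages:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        # The package is missing from this environment.
        versions[name] = None
    print("{}: {}".format(name, versions[name] or "not installed"))
```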
1.4 Necessary libraries and tools
Knowing about scikit-learn and how to use it is important, but a few other libraries will improve your experience as well. scikit-learn is built on top of the NumPy and SciPy scientific computing libraries. In addition to NumPy and SciPy, we will use pandas and matplotlib, and we will introduce Jupyter Notebook, a browser-based interactive programming environment.
1.4.1 Jupyter Notebook
Jupyter Notebook is an interactive environment for running code in the browser. It is very useful for exploratory data analysis. Although Jupyter Notebook supports many programming languages, we only need its Python support. It makes it easy to combine code, text, and images.
1.4.2 NumPy
NumPy is one of the fundamental packages for scientific computing in Python: its features include multidimensional arrays, advanced mathematical functions (such as linear algebra operations and Fourier transforms), and pseudorandom number generators
Note: In scikit-learn, the NumPy array is the basic data structure.
scikit-learn accepts data in the form of NumPy arrays, so any data you use has to be converted to a NumPy array. The core feature of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All elements of the array must be of the same type.
For objects of the NumPy ndarray class, we refer to them simply as "NumPy arrays" or "arrays"
# Input
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
---
# Output
x:
[[1 2 3]
 [4 5 6]]
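The requirement that all elements share one type can be checked through the array's shape and dtype attributes. The sketch below also shows what happens when integers and floats are mixed: NumPy converts every element to a common type.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print(x.shape)  # (2, 3): two rows, three columns
print(x.dtype)  # an integer type; the exact width depends on the platform

# Mixing integers and a float: every element is stored as a float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```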
1.4.3 SciPy
SciPy is a collection of functions for scientific computing in Python. It provides, among other things, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws on SciPy's collection of functions to implement its algorithms.
For us, the most important part of SciPy is scipy.sparse, which provides sparse matrices. Sparse matrices are another way to represent data in scikit-learn, and they are useful for 2D arrays that contain mostly zeros.
Usually it is not possible to create a dense representation of sparse data, because it would waste too much memory, so we need to create the sparse representation directly.
# Input
import numpy as np
from scipy import sparse
# Create a 2D NumPy array with ones on the diagonal and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
---
# Output
NumPy array:
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]
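The listing above only builds the dense array. As a minimal sketch of the sparse side, the code below converts the same array to SciPy's CSR format, which stores only the nonzero entries, and then builds the same matrix directly in COO format without ever creating the dense version. The variable names are chosen for illustration.

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)

# Convert the dense array to a sparse matrix in CSR format.
# Only the nonzero entries are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("Stored nonzero entries: {}".format(sparse_matrix.nnz))

# Build the same matrix directly in COO format from the values
# and their (row, column) positions, with no dense array involved.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO shape: {}".format(eye_coo.shape))
```

For a 4x4 matrix the saving is negligible, but for data with millions of mostly zero entries the direct construction is what keeps memory use manageable.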
1.4.4 matplotlib
matplotlib is the main scientific plotting library for Python. It provides functions for producing publication-quality visualizations such as line charts, histograms, and scatter plots. Visualizing your data and the various aspects of your analysis can give you important insights, and we will use matplotlib for all our visualizations. In Jupyter Notebook, you can use the %matplotlib notebook and %matplotlib inline commands to display figures directly in the browser.
We recommend the %matplotlib notebook command, which provides an interactive environment.
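A minimal plotting sketch, assuming matplotlib and NumPy are installed: it draws a sine curve as a line chart. In a notebook the figure appears below the cell; in a plain script you would add plt.show() at the end to open a window.

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced values between -10 and 10, and their sine.
x = np.linspace(-10, 10, 100)
y = np.sin(x)

# Draw one array against the other as a line chart with "x" markers.
fig, ax = plt.subplots()
ax.plot(x, y, marker="x")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
```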
1.4.5 pandas
pandas is a Python library for processing and analyzing data. It is built around a data structure called the DataFrame. Simply put, a pandas DataFrame is a table, similar to an Excel spreadsheet. pandas provides a large number of methods for modifying and operating on this table.
Unlike NumPy, which requires all elements of an array to be of the same type, pandas allows each column to have a different type. Another of its strengths is that it can read data from many file formats and databases.
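The following sketch builds a small DataFrame from a dictionary and selects rows by a condition. The people and values in it are hypothetical. Note that the Name and Location columns hold strings while the Age column holds integers.

```python
import pandas as pd

# Hypothetical data: each key becomes a column, each position a row.
data = {
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Location": ["New York", "Paris", "Berlin", "London"],
    "Age": [24, 13, 53, 33],
}
people = pd.DataFrame(data)

# Each column keeps its own type.
print(people.dtypes)

# Select all rows whose Age column is greater than 30.
older_than_30 = people[people["Age"] > 30]
print(older_than_30)
```

The selection keeps the rows for Peter and Linda, the only two entries with an age above 30.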
Copyright statement : The above content is partially excerpted from "Python Machine Learning Basic Tutorial" - O'Reilly Media, Inc.