Introduction to Machine Learning with Python and Essential Background Knowledge

Part 1 Introduction to Machine Learning


Machine learning is the extraction of knowledge from data. It is a research field at the intersection of statistics, artificial intelligence and computer science. It is also known as predictive analytics or statistical learning.

1.1 Why Machine Learning

In the early days of "smart" applications, many systems processed data or adjusted to user input using hand-crafted "if" and "else" decision rules.

Spam filtering is a typical example: experts designed a system of rules to decide which messages to flag, and the result was called an "intelligent application". Hand-made decision rules are feasible for some applications, especially when people understand the process being modeled very well.

But specifying decision rules by hand has two disadvantages:

  • The logic needed to make a decision applies only to a single domain and a single task, and even a slight change to the task may require rewriting the entire system

  • Designing the rules requires a deep understanding of how a human expert makes the decision
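To make the first disadvantage concrete, here is a minimal sketch of a hand-written rule system for spam. The keywords, the threshold and the function name `is_spam` are made up for illustration and do not come from any real filter.

```python
# Hypothetical hand-written spam rules; keywords and threshold are made up.
SPAM_KEYWORDS = ("free", "winner", "prize")


def is_spam(message: str) -> bool:
    """Classify a message using fixed if/else decision rules."""
    text = message.lower()
    if any(word in text for word in SPAM_KEYWORDS):
        return True
    elif text.count("!") >= 3:
        return True
    else:
        return False


print(is_spam("You are a WINNER, claim your FREE prize"))  # True
print(is_spam("Meeting moved to 10am tomorrow"))           # False
print(is_spam("Fr3e pr1ze inside"))                        # False: a slight variation slips through
```

The last call shows why such rules are brittle: a small change in the input is enough to require new rules.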

One example where this hand-coded approach fails is detecting faces in images. The main problem is that the way a computer "perceives" pixels (the pixels that make up an image in the computer) is very different from the way humans perceive a face. Because of this difference in representation, it is basically impossible for a human to come up with a good set of rules describing what constitutes a face in a digital image.

With machine learning, however, simply presenting a program with a large collection of face images is enough for the algorithm to determine which characteristics are needed to identify a face.

1.1.1 Problems that machine learning can solve

The most successful machine learning algorithms are those that automate decision-making processes by generalizing from known examples. This setting is called supervised learning.

Supervised learning: machine learning algorithms that learn from input/output pairs

The user provides the algorithm with pairs of inputs and desired outputs, and the algorithm finds a way to produce the desired output for a given input. In particular, it can produce an output for an input it has never seen before, without human help (this is the automation of the decision process)

Problems that supervised learning can solve: Recognizing handwritten zip codes on envelopes, judging whether a tumor is benign based on medical images, detecting fraudulent behavior in credit card transactions
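As a rough illustration of learning from input/output pairs, the sketch below trains a classifier on the handwritten digits dataset that ships with scikit-learn. The choice of dataset, the k-nearest-neighbors model and `n_neighbors=3` are assumptions made for this example, not something the text prescribes.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Inputs are 8x8 digit images flattened to 64 numbers; outputs are the digits 0-9.
digits = load_digits()

# Hold back part of the data to stand in for never-before-seen inputs.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0
)

# Learn from the input/output pairs in the training set.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Fraction of held-back images for which the predicted digit is correct.
accuracy = model.score(X_test, y_test)
print("Test accuracy: {:.2f}".format(accuracy))
```

The score on the held-back data is what measures generalization, because the model never saw those images during training.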

An interesting observation to note in these examples is that while the input and output appear to be fairly simple, the data collection process is quite different in the three examples

Unsupervised learning : only the input data is known, no output data is provided to the algorithm

Problems that unsupervised learning can solve: identifying the topic of a series of blog posts, segmenting customers into groups with similar preferences, detecting unusual access patterns to a website
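The customer segmentation case can be sketched with a clustering algorithm. The data below is made up, and k-means with two clusters is just one possible choice assumed for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customers: [visits per month, average basket value].
# No output labels are given to the algorithm.
customers = np.array([
    [2, 15.0], [3, 18.0], [1, 12.0],
    [20, 90.0], [22, 95.0], [19, 88.0],
])

# Ask for two groups; n_init and random_state are fixed for repeatability.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

# Customers with similar behavior receive the same cluster label.
print(labels)
```

The label numbers themselves are arbitrary; only which customers share a label is meaningful.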

Whether the task is supervised or unsupervised, it is very important to represent the input data in a form that the computer can understand. It is often useful to think of the data as a table.

Each data point you want to process (each email, each customer, each transaction) corresponds to a row in the table, and each attribute that describes a data point (such as customer age, transaction amount, or transaction location) corresponds to a column.

In machine learning, each entity (each row) is called a sample or data point, and each column (an attribute describing these entities) is called a feature.
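The table view maps directly onto a two-dimensional array. The following sketch uses a made-up table of three customers with two features each.

```python
import numpy as np

# Made-up table: one row per customer, columns = [age, transaction amount].
X = np.array([
    [34, 120.50],
    [51, 19.99],
    [27, 310.00],
])

# Rows are samples, columns are features.
n_samples, n_features = X.shape
print("samples: {}, features: {}".format(n_samples, n_features))

first_sample = X[0]    # one row: all features of the first customer
age_feature = X[:, 0]  # one column: the age of every customer
```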

1.1.2 Knowing your task and data

Probably the most important part of the machine learning process is understanding the data you are dealing with and how that data relates to the task you want to solve

Before you start building models, you need to understand the content of your dataset. Each algorithm differs in the type of input data and the problem it is best suited to solve.

From a larger perspective, machine learning algorithms and methods are only part of the process of solving a specific problem, and it is important to always keep in mind the big picture of the entire project (focus on problem solving)

1.2 Why choose Python

There are three main reasons to use Python for machine learning:

1. Python has become a common language for many data science applications, combining the power of a general-purpose programming language with the ease of use of a domain-specific scripting language such as MATLAB or R

2. Python has libraries for various functions such as data loading, visualization, statistics, natural language processing, image processing, etc.

3. Python lets you interact with the code directly, using a terminal or other tools such as Jupyter Notebook

1.3 scikit-learn

scikit-learn is an open source project, so it is free to use and distribute, and anyone can read its source code to see what happens behind the scenes. It is a very popular tool and the most prominent Python machine learning library.

scikit-learn can also be used with a large number of other Python scientific computing tools

scikit-learn User Guide (http://scikit-learn.org/stable/user_guide.html)

scikit-learn depends on two other Python packages: NumPy and SciPy

For plotting and interactive development, you should also install matplotlib, IPython and Jupyter Notebook

If you already have Python installed, then you can install all the above packages with pip:

pip install numpy scipy matplotlib ipython scikit-learn pandas

Note: Pay attention to where you run the pip command. On Windows, run it from the Scripts directory of your Python installation (or from any terminal where pip is on the PATH), and wait for the downloads to complete.
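Once pip has finished, one way to check the result is to ask Python which versions it can see. This is a small sketch using the standard library's importlib.metadata (Python 3.8 or later); the `PACKAGES` list simply repeats the names from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to pip above.
PACKAGES = ["numpy", "scipy", "matplotlib", "ipython", "scikit-learn", "pandas"]

installed = {}
for name in PACKAGES:
    try:
        installed[name] = version(name)
    except PackageNotFoundError:
        installed[name] = None  # not installed in this Python environment

for name, ver in installed.items():
    print("{}: {}".format(name, ver if ver else "NOT INSTALLED"))
```

A package reported as not installed usually means pip ran against a different Python installation than the one executing this script.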


1.4 Necessary libraries and tools

Knowing about scikit-learn and how to use it is important, but there are other libraries that can improve your programming experience as well. scikit-learn is based on the NumPy and SciPy scientific computing libraries. In addition to NumPy and SciPy, we will also use pandas and matplotlib, and we will introduce Jupyter Notebook, a browser-based interactive programming environment

1.4.1 Jupyter Notebook

Jupyter Notebook is an interactive environment for running code in the browser. It is very useful for exploratory data analysis. Although Jupyter Notebook supports many programming languages, we only need the Python support. It makes it easy to combine code, text and images in one place.

1.4.2 NumPy

NumPy is one of the fundamental packages for scientific computing in Python: its features include multidimensional arrays, advanced mathematical functions (such as linear algebra operations and Fourier transforms), and pseudorandom number generators

Note : In scikit-learn, NumPy arrays are the basic data structure

scikit-learn takes in data in the form of NumPy arrays, so any data you use has to be converted to a NumPy array. The core feature of NumPy is the ndarray class, a multidimensional (n-dimensional) array. All elements of the array must be of the same type.

For objects of the NumPy ndarray class, we refer to them simply as "NumPy arrays" or "arrays"

# Input
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
---
# Output
x:
[[1 2 3]
 [4 5 6]]
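The "n-dimensional" and "same type" properties can be inspected directly on an array. The short sketch below reuses the array from the example above; the second array, `mixed`, is added here only to show what happens when the inputs have different types.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
print(x.ndim)   # 2
print(x.shape)  # (2, 3)

# Mixing integers and floats: NumPy converts every element to one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
```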

1.4.3 SciPy

SciPy is a collection of functions for scientific computing in Python. It provides, among other things, advanced linear algebra routines, mathematical function optimization, signal processing, special mathematical functions, and statistical distributions. scikit-learn draws on SciPy's collection of functions to implement its algorithms.

For us, the most important part of SciPy is scipy.sparse, which provides sparse matrices. Sparse matrices are another representation used for data in scikit-learn.

Usually it is not practical to create a dense representation of sparse data (it wastes too much memory), so we need to create the sparse representation directly.

# Input
import numpy as np
from scipy import sparse

# Create a 2D NumPy array with ones on the diagonal and zeros everywhere else
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))
---
# Output
NumPy array:
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
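The dense array above can be turned into a sparse matrix, or the sparse matrix can be built directly without ever creating the dense version. The sketch below shows both routes with scipy.sparse; the choice of the CSR and COO formats is an assumption for this illustration, and other sparse formats exist.

```python
import numpy as np
from scipy import sparse

eye = np.eye(4)

# Route 1: convert an existing dense array to CSR format.
# Only the nonzero entries are stored.
sparse_matrix = sparse.csr_matrix(eye)
print("stored values:", sparse_matrix.nnz)

# Route 2: build the same matrix directly in COO format from
# the nonzero values and their (row, column) positions.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)), shape=(4, 4))

# Converting back to dense is only sensible for small matrices like this one.
print(eye_coo.toarray())
```

The second route is the one that matters for large data, because it never allocates the full dense array.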

1.4.4 matplotlib

matplotlib is the primary scientific plotting library in Python. It produces publication-quality visualizations such as line charts, histograms and scatter plots. Visualizing your data and different aspects of your analysis can give you important insights, and we will use matplotlib for all our visualizations. In Jupyter Notebook, you can use the %matplotlib notebook and %matplotlib inline commands to display figures directly in the browser.

It is recommended to use the %matplotlib notebook command, which can provide an interactive environment
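A minimal plotting sketch follows. It is written as a plain script, so it selects the non-interactive "Agg" backend and renders into an in-memory buffer; both are assumptions made so the example runs without a display. In a notebook you would drop the `matplotlib.use` line and rely on one of the %matplotlib commands above.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: script without a display; omit this in a notebook
import matplotlib.pyplot as plt
import numpy as np

# 100 evenly spaced points between -10 and 10, and their sine values.
x = np.linspace(-10, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots()
(line,) = ax.plot(x, y, marker="x")  # line chart with an "x" marker at each point
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")

# Render to an in-memory PNG; pass a filename instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```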

1.4.5 pandas

pandas is a Python library for data wrangling and analysis. It is built around a data structure called the DataFrame. Simply put, a pandas DataFrame is a table, similar to an Excel spreadsheet. pandas provides a large number of methods for modifying and operating on this table.

Unlike NumPy, which requires all elements in an array to be of the same type, pandas allows each column to have a different type. Another strength is that it can read data from many file formats and databases.
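As a small sketch of the DataFrame as a table, the example below builds one from a dictionary and selects rows by a condition. The names, locations and ages are made up for illustration.

```python
import pandas as pd

# Made-up data: each key becomes a column, each position a row.
data = {
    "Name": ["Ana", "Ben", "Chen", "Dara"],
    "Location": ["Lisbon", "Berlin", "Shanghai", "Dublin"],
    "Age": [24, 31, 45, 38],
}
df = pd.DataFrame(data)

# Columns keep their own types: text for Name and Location, integers for Age.
print(df.dtypes)

# Select all rows whose Age column is greater than 30.
older = df[df["Age"] > 30]
print(older)
```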

Copyright statement : The above content is partially excerpted from "Python Machine Learning Basic Tutorial" - O'Reilly Media, Inc.

Origin blog.csdn.net/qq_50587771/article/details/123223992