Basic knowledge of machine learning (full)

Table of contents

1. Overview of machine learning

1.1 Overview of Artificial Intelligence

1.1.1 AI usage scenarios

1.1.2 Small case of artificial intelligence

1.2 The development history of artificial intelligence

1.2.1 Turing Test

1.2.2 Development History

1.2.3 Summary

1.3 Main branches of artificial intelligence

1.3.1 Artificial intelligence, machine learning and deep learning

1.3.2 Introduction of main branches

1.3.3 Three essential elements for the development of artificial intelligence

1.3.4 Expansion: Comparison of GPU and CPU

1.4 Machine Learning Workflow

1.4.1 What is machine learning

1.4.2 Machine Learning Workflow

1.4.3 Introduction to the obtained datasets

1.4.4 Basic data processing

1.4.5 Feature Engineering

1.4.6 Machine Learning and Model Evaluation Concepts

1.5 Classification of Machine Learning Algorithms

1.5.1 Supervised learning

1.5.2 Unsupervised Learning

1.5.3 Semi-supervised learning

1.5.4 Reinforcement Learning

1.5.5 Summary

1.6 Model Evaluation

1.6.1 Classification model evaluation

1.6.2 Regression Model Evaluation

1.6.3 Fitting

1.7 Introduction to Azure Platform

1.8 Introduction to Deep Learning 【Understand】

1.8.1 Deep Learning -- Introduction to Neural Networks

1.8.2 What each layer of deep learning is responsible for

2. Machine learning basic environment installation and use

2.1 Library installation 

2.2 Use of Jupyter Notebook

2.2.1 Introduction to Jupyter Notebook

2.2.2 Why use Jupyter Notebook

2.2.3 Simple operation introduction

2.2.4 markdown function

3. Matplotlib in detail

3.1 HelloWorld of Matplotlib

3.1.1 What is Matplotlib

3.1.2 Why learn Matplotlib

3.1.3 Realize a simple Matplotlib drawing

3.1.4 Know Matplotlib image structure (expand, understand)

3.1.5 Matplotlib three-layer structure (expand, understand)

3.2 Line chart (plot) and basic drawing functions

3.2.1 Line chart drawing and picture saving

3.2.2 Improve the original line chart 1 (auxiliary display layer)

3.2.3 Add grid display

3.2.4 Add description information

3.2.5 Multiple plots

3.2.6 Set graphic style

3.2.7 Multiple coordinate system display - plt.subplots (object-oriented drawing method)

3.2.8 Line chart application scenarios

3.3 Common graphic drawing

3.3.1 Types and meanings of common graphics

3.3.2 Scatter plot drawing

3.3.3 Bar chart drawing

4. NumPy

4.1 Advantages of numpy

4.1.1 Introduction to numpy

4.1.2 ndarray introduction 

4.1.3 Comparison of operation efficiency between ndarray and Python native list

4.1.4 Advantages of ndarray

4.2 N-dimensional array-ndarray

4.2.1 Properties of ndarray

4.2.2 The shape of ndarray

4.2.3 Types of ndarrays

4.3 Basic Operation

4.3.1 Method of generating array

4.3.2 Indexing and slicing of arrays

4.3.3 Shape modification

4.3.4 Type modification

4.3.5 Deduplication of arrays

4.4 ndarray operations

4.4.1 Logic operations

4.4.2 General judgment function

4.4.3 np.where (ternary operator)

4.4.4 Statistics calculation

4.5 Mathematics: Matrices

4.5.1 Matrices and vectors

4.5.2 Addition and scalar multiplication

4.5.3 Matrix-vector multiplication

4.5.4 Matrix multiplication

4.5.5 Properties of matrix multiplication

4.5.6 Inverse and Transpose

4.6 Operations between arrays

4.6.1 Operations on Arrays and Numbers

4.6.2 Array-to-Array Operations: Broadcast Mechanism

4.6.3 Matrix Multiplication API

5. Pandas

5.1 Introduction to Pandas

5.1.1 Introduction to Pandas

5.1.2 Why use Pandas

5.1.3 Case

5.1.4 DataFrame

5.2 Basic Data Operation

5.2.1 Index operations

5.2.2 Assignment operation

5.2.3 Sorting

5.3 DataFrame operation

5.3.1 Arithmetic operations

5.3.2 Logic operations

5.3.3 Statistical calculation

5.3.4 Cumulative statistics function

5.3.5 Custom functions

5.4 Pandas drawing

5.5 File reading and storage

5.5.1 CSV

5.5.2 HDF5

5.5.3 JSON

5.5.4 Expansion

5.6 Advanced processing - missing value processing

5.6.1 How to deal with nan

5.6.2 Not missing value nan, with default mark

5.7 Advanced Processing - Data Discretization

5.7.1 Why Discretization

5.7.2 What is data discretization

5.7.3 Data packet operation

5.7.4 Packet data into one-hot encoding

5.8 Advanced Processing - Data Merging

5.8.1 pd.concat realizes data merging

5.8.2 pd.merge realizes data merging

5.9 Advanced Processing - Crosstab and PivotTable

5.9.1 What are the functions of crosstab and pivot table

5.9.2 Use crosstab (crosstab) to realize the above figure

5.9.3 Case - exploring the relationship between stocks and days of the week

5.9.4 Implementation using pivot_table (pivot table)

5.10 Advanced Processing - Grouping and Aggregation

5.10.1 What grouping and aggregation are

5.10.2 Group API

5.10.3 Starbucks Retail Store Data


1. Overview of machine learning

 

1.1 Overview of Artificial Intelligence

1.1.1 AI usage scenarios

1.1.2 Small case of artificial intelligence

https://quickdraw.withgoogle.com

https://pjreddie.com/darknet/yolo/

https://deepdreamgenerator.com/

1.2 The development history of artificial intelligence

                 

1.2.1 Turing Test

1.2.2 Development History

1.2.3 Summary

1.3 Main branches of artificial intelligence

1.3.1 Artificial intelligence, machine learning and deep learning

1.3.2 Introduction of main branches

                 


1.3.3 Three essential elements for the development of artificial intelligence

1.3.4 Expansion: Comparison of GPU and CPU

A CPU is good at control logic and I/O-heavy work, while a GPU is good at large-scale parallel computation.

1.4 Machine Learning Workflow

1.4.1 What is machine learning

1.4.2 Machine Learning Workflow

1.4.3 Introduction to the obtained datasets

        ​​​​​​ 

1.4.4 Basic data processing

That is, the data is cleaned by handling missing values, removing outliers, and so on.

1.4.5 Feature Engineering

(1) What is feature engineering

(2) Why do we need Feature Engineering?

(3) Feature engineering includes content

  • feature extraction
  • feature preprocessing
  • feature dimensionality reduction

 Feature extraction:

Feature preprocessing:

Feature dimensionality reduction:

1.4.6 Machine Learning and Model Evaluation Concepts

Machine learning: choose the appropriate algorithm to train the model.

Model Evaluation: Evaluate the trained model.

1.5 Classification of Machine Learning Algorithms

1.5.1 Supervised learning

Definition: The input data consists of input feature values and target values.
        - the output of the function can be a continuous value (called regression);

        - or the output can be one of a finite number of discrete values (called classification).

(1) Regression problem 

For example: predict housing prices, and fit a continuous curve based on the sample set.

(2) Classification problem

For example: judging benign or malignant according to the characteristics of the tumor, the result is "benign" or "malignant", which is discrete.

1.5.2 Unsupervised Learning

Definition: The input data consists of input feature values only.
        The input data is unlabeled and has no known outcome. Because the category of each sample is unknown, the sample set has to be grouped by the similarity between samples (clustering), trying to minimize the distance within each cluster and maximize the distance between clusters.

Example:

1.5.3 Semi-supervised learning

Definition: The training set contains both labeled sample data and unlabeled sample data.
Example:

1.5.4 Reinforcement Learning

Reinforcement learning: In essence it addresses decision-making problems, that is, making decisions automatically, and making them continuously over a sequence of steps.

The goal of reinforcement learning is to maximize the cumulative reward.
Example:

Supervised Learning vs. Reinforcement Learning:

1.5.5 Summary

1.6 Model Evaluation

        Model evaluation is an integral part of the model development process. It helps to find the model that best represents the data and to estimate how well the selected model will work in the future. Depending on the type of target value in the data set, model evaluation is divided into classification model evaluation and regression model evaluation.

1.6.1 Classification model evaluation

  • Accuracy: The ratio of the number of correct predictions to the total number of samples.
  • Precision: Of all samples predicted as positive, the proportion that are actually positive.
  • Recall: Of all samples that are actually positive, the proportion that are correctly predicted as positive.
  • F1-score: The harmonic mean of precision and recall; mainly used to evaluate the robustness of the model.
  • AUC indicator: Mainly used to evaluate models when the classes are imbalanced.
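The first four indicators can be computed directly from the counts of true/false positives and negatives. The following is a minimal sketch in plain Python; the function name classification_metrics and the sample labels are made up for illustration and are not from any library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted/labelled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels: 4 positive samples, 6 negative samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

Here TP=3, FP=1, FN=1 and TN=5, so accuracy is 8/10 while precision and recall are both 3/4.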

1.6.2 Regression Model Evaluation

1.6.3 Fitting

Model evaluation is used to evaluate the performance of the trained model. Poor performance can be roughly divided into two categories: overfitting and underfitting.

During the training process, you may encounter the following problem:
        the model fits the training data well and the training error is small, so why does it perform poorly on the test set? When this happens, the model may have a fitting problem.

(1) Underfitting 

(2) Overfitting

Overfitting: The machine learning or deep learning model fits the training samples too closely, which results in poor performance on the validation and test data sets.
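The difference between the two cases can be seen by comparing training error with test error. The sketch below is only an illustration under assumed conditions: the data is a synthetic noisy sine curve, and polynomials of degree 1, 3 and 9 stand in for models of increasing complexity. A model that is too simple has a large error everywhere, while a very flexible model drives the training error toward zero without a matching improvement on unseen data.

```python
import numpy as np

rng = np.random.default_rng(0)


def noisy_sine(n):
    """Synthetic data: one period of a sine curve plus Gaussian noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = noisy_sine(12)    # small training set
x_test, y_test = noisy_sine(200)     # unseen data

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

Comparing the two columns of the printout for each degree is the same check described above: a small training error combined with a much larger test error points to overfitting.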

1.7 Introduction to Azure Platform

Azure Machine Learning ("AML") is a web-based machine learning service launched by Microsoft on its public cloud, Azure. Machine learning is a branch of artificial intelligence that uses algorithms to let computers recognize patterns in large data sets. This approach enables historical data to be used to predict future events and behavior, in a way that is significantly superior to traditional forms of business intelligence.


Microsoft's goal: to simplify the process of using machine learning for widespread and easy application by developers, business analysts, and data scientists.


The service aims to "combine the power of machine learning with the simplicity of cloud computing."


AML is currently providing services on Microsoft's Global Azure cloud service platform, and users can apply for a free trial through the site: https://studio.azureml.net/.

URL of UCI Machine Learning Database: http://archive.ics.uci.edu/ml/
 

1.8 Introduction to Deep Learning 【Understand】

1.8.1 Deep Learning -- Introduction to Neural Networks

Deep learning demo link: http://playground.tensorflow.org

1.8.2 What each layer of deep learning is responsible for

Increase the number of layers: identify objects, organ layers, molecular layers, and atomic layers through more abstract concepts.

Increase the number of nodes: increase the types of substances in the same layer.

2. Machine learning basic environment installation and use

2.1 Library installation 

Libraries such as Matplotlib, Numpy, and Pandas will be used in the basic stage of machine learning.
Note:
        When installing each package, try to pin a stable version.

2.2 Use of Jupyter Notebook

2.2.1 Introduction to Jupyter Notebook

The Jupyter project aims to develop open source software, open standards, and services for interactive computing across dozens of programming languages. 

2.2.2 Why use Jupyter Notebook

Summary: Jupyter Notebook has more advantages than PyCharm for plotting and data display.

2.2.3 Simple operation introduction

Enter jupyter notebook in cmd (or any terminal) to start it; the notebook interface then opens in the browser.


2.2.4 markdown function

Press Esc, then M, to turn the selected cell into a Markdown cell.

3. Matplotlib in detail

3.1 HelloWorld of Matplotlib

3.1.1 What is Matplotlib

  • A library dedicated to drawing 2D charts (3D charts are also supported)
  • Very easy to use
  • Incremental, interactive approach to data visualization

3.1.2 Why learn Matplotlib

        Visualization is a key auxiliary tool throughout data mining: it helps us understand the data clearly so that we can adjust our analysis method.

  • Data can be visualized and presented more intuitively
  • Make data more objective and convincing

For example, the following two pictures are digital display and graphic display:

3.1.3 Realize a simple Matplotlib drawing

import matplotlib.pyplot as plt


# 1. Create the canvas
plt.figure(figsize=(20, 8), dpi=100)

# 2. Draw the image
x = [1, 2, 3]
y = [4, 5, 6]
plt.plot(x, y)

# 3. Display the image
plt.show()

3.1.4 Know Matplotlib image structure (expand, understand)

3.1.5 Matplotlib three-layer structure (expand, understand)

(1) Container layer

The container layer is mainly composed of Canvas, Figure, and Axes.


Canvas is the lowest system layer, which acts as a drawing board during the drawing process, that is, a tool for placing the canvas (Figure).
Figure is the first layer above the Canvas, and also the first layer of the application layer that needs to be operated by the user. It acts as a canvas during the drawing process.
Axes is the second layer of the application layer, which is equivalent to the role of the drawing area on the canvas during the drawing process.

  • Figure: Refers to the entire graphic (you can set the size and resolution of the canvas through plt.figure(), etc.)
  • Axes (coordinate system): the drawing area of the data
  • Axis (coordinate axis): An axis in the coordinate system, including size limits, scales, and scale labels

Features are:

  • A figure (image) can contain multiple axes (coordinate system/drawing area), but an axes can only belong to one figure.
  • An axes (coordinate system/drawing area) can contain multiple axis objects (coordinate axes): two for a 2D coordinate system, three for a 3D coordinate system.

(2) Auxiliary display layer

The auxiliary display layer is the content in Axes (drawing area) except for the image drawn according to the data, mainly including Axes appearance (facecolor), border line (spines), coordinate axis (axis), coordinate axis name (axis label), Axis scale (tick), axis scale label (tick label), grid line (grid), legend (legend), title (title), etc.


The setting of this layer can make the image display more intuitive and easier for users to understand, but it will not have a substantial impact on the image.

(3) Image layer 

The image layer refers to the image drawn from the data inside Axes through functions such as plot, scatter, bar, hist, and pie.

Summary:

  • Canvas (drawing board) is located at the bottom, generally out of reach of users
  • Figure (canvas) is built on top of Canvas
  • Axes (drawing area) is built on top of Figure
  • Auxiliary display layers such as axes and legends and image layers are built on Axes

3.2 Line chart (plot) and basic drawing functions

3.2.1 Line chart drawing and picture saving

In order to better understand all basic plotting functions, we integrate all basic API usage by plotting weather temperature changes.

(1) matplotlib.pyplot module

matplotlib.pyplot contains a series of drawing functions similar to those of MATLAB. Its functions act on the current coordinate system (Axes) of the current figure.
import matplotlib.pyplot as plt

(2) Line chart drawing and display

Show the weather in Shanghai for a week, for example, the weather and temperature from Monday to Sunday are as follows:

import matplotlib.pyplot as plt
# 1. Create the canvas
plt.figure(figsize=(10, 10))
# 2. Draw the line chart (image layer): days 1-7 against the daily temperature
plt.plot([1, 2, 3, 4, 5, 6, 7], [17, 17, 18, 15, 11, 11, 13])
# 3. Display the image
plt.show()

(3) Set canvas properties and save pictures

plt.figure(figsize=(),dpi=)

        figsize: the size of the figure as (width, height), in inches

        dpi: the resolution of the image, in dots per inch

plt.savefig(path)

Note: plt.show() releases the figure resources, so if you save the picture after displaying it, you will only save an empty picture. Therefore, plt.savefig() must be called before plt.show().
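The ordering described in the note can be sketched as follows. The output path and the file name weather.png are made up for illustration, and the Agg backend is selected only so that the sketch also runs on a machine without a display; in a notebook that line can be dropped.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

# 1. Create the canvas: 10 x 4 inches at 100 dots per inch
plt.figure(figsize=(10, 4), dpi=100)

# 2. Draw the line chart
plt.plot([1, 2, 3, 4, 5, 6, 7], [17, 17, 18, 15, 11, 11, 13])

# 3. Save first (hypothetical output location) ...
out_path = os.path.join(tempfile.mkdtemp(), "weather.png")
plt.savefig(out_path)

# 4. ... and only then display the image
plt.show()

print(out_path, os.path.getsize(out_path), "bytes")
```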

3.2.2 Improve the original line chart 1 (auxiliary display layer)

Case: display temperature change status.
Requirement: Draw a line chart of the per-minute temperature in a city during the hour from 11:00 to 12:00, with temperatures ranging from 15 to 18 degrees.

# Plot the temperature change
import random
import matplotlib.pyplot as plt

# 0. Prepare the x and y data (temperatures between 15 and 18 degrees)
x = range(60)
y_shanghai = [random.uniform(15, 18) for i in x]

# 1. Create the canvas
plt.figure(figsize=(20, 8), dpi=200)

# 2. Draw the line chart
plt.plot(x, y_shanghai)
# 2.1 Set the x-axis and y-axis ticks
y_ticks = range(40)
plt.yticks(y_ticks[::5])
x_ticks_label = ['11点{}分'.format(i) for i in x]  # labels read "11:<minute>"
plt.xticks(x[::5], x_ticks_label[::5])
# plt.xticks(x_ticks_label[::5])  # wrong: the first argument must be the numeric tick positions


# 3. Display the image
plt.show()

Solution to the Chinese character display problem:

 SimHei font download path: https://us-logger1.oss-cn-beijing.aliyuncs.com/SimHei.ttf
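One commonly used way to apply the font is through Matplotlib's rcParams, sketched below. This assumes the SimHei font has already been installed where Matplotlib can find it (and that the font cache has been rebuilt if needed); without the font, the settings are accepted but Chinese text still falls back to a default font. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

# Use SimHei as the first choice for sans-serif text
plt.rcParams["font.sans-serif"] = ["SimHei"]
# Keep minus signs readable when a CJK font is active
plt.rcParams["axes.unicode_minus"] = False

print(plt.rcParams["font.sans-serif"][0])
```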

3.2.3 Add grid display

To make it easier to read values off the graph, add a grid: plt.grid(True, linestyle='--', alpha=0.5)

        Parameters:
                linestyle -- the line style used to draw the grid

                alpha -- transparency, from 0 (fully transparent) to 1 (opaque)

3.2.4 Add description information

Add x-axis, y-axis description information and title.

plt.xlabel('time',fontsize=20)
plt.ylabel('temperature',fontsize=20)
plt.title('xxxxx',fontsize=20)

3.2.5 Multiple plots

Requirement: Add the temperature change of another city.
        The temperature changes in Beijing on the same day were collected, and the temperature ranged from 1 to 3 degrees. Adding another line to the same coordinate system is simple: just call plot again. You do need to distinguish the lines, as follows:

y_beijing = [random.uniform(1, 3) for i in x]
plt.plot(x, y_beijing, color='b', linestyle='-.', label='北京')  # label: Beijing
# Show the legend
plt.legend(loc='best')

3.2.6 Set graphic style

Show legend: plt.legend(loc='best')

Note: Setting label in plt.plot() alone does not display the legend; you also need to call plt.legend().

3.2.7 Multiple coordinate system display - plt.subplots (object-oriented drawing method)

matplotlib.pyplot.subplots(nrows=1, ncols=1, **fig_kw) creates a figure with multiple axes (coordinate systems/plot areas).

Note: plt.functionname() is the procedural drawing style, and axes.set_methodname() is the object-oriented drawing style.

# Plot the temperature change
import random
import matplotlib.pyplot as plt

# 0. Prepare the x and y data
x = range(60)
y_shanghai = [random.uniform(15, 18) for i in x]
y_beijing = [random.uniform(1, 14) for i in x]

# 1. Create the canvas with two coordinate systems side by side
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), dpi=100)

# 2. Draw the line charts (labels: 上海 = Shanghai, 北京 = Beijing)
axes[0].plot(x, y_shanghai, color='r', linestyle='--', label='上海')
axes[1].plot(x, y_beijing, color='g', linestyle='-.', label='北京')

# 2.1 Set the x-axis and y-axis ticks
x_ticks_label = ['11点{}分'.format(i) for i in x]  # labels read "11:<minute>"
y_ticks = range(40)
axes[0].set_xticks(x[::5])
axes[0].set_yticks(y_ticks[::5])
axes[0].set_xticklabels(x_ticks_label[::5])
axes[1].set_xticks(x[::5])
axes[1].set_yticks(y_ticks[::5])
axes[1].set_xticklabels(x_ticks_label[::5])

# 2.2 Add the grid
axes[0].grid(True, linestyle='--', alpha=1)
axes[1].grid(True, linestyle='--', alpha=1)


# 2.3 Add axis labels and titles (时间 = time, 温度 = temperature)
axes[0].set_xlabel('时间', fontsize=25)
axes[0].set_ylabel('温度', fontsize=25)
axes[0].set_title('上海', fontsize=25)
axes[1].set_xlabel('时间', fontsize=25)
axes[1].set_ylabel('温度', fontsize=25)
axes[1].set_title('北京', fontsize=25)


# 2.4 Show the legends
axes[0].legend(loc='best')
axes[1].legend(loc='best')

# 3. Display the image
plt.show()

3.2.8 Line chart application scenarios

  • Display the number of daily active users of the company's products (different regions)
  • Shows the number of app downloads per day
  • Shows the changes in the number of user clicks over time after the new product features are launched
  • Expansion: Draw various mathematical function images

Note: In addition to drawing line charts, plt.plot() can also be used to draw various mathematical function images.

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
mpl.rcParams['axes.unicode_minus'] = False  # Fix the display of minus signs

# 0. Prepare the data: 1000 evenly spaced points in [-10, 10]
x = np.linspace(-10, 10, 1000)
y = np.sin(x)

# 1. Create the canvas
plt.figure(figsize=(20, 8), dpi=100)

# 2. Draw the function image
plt.plot(x, y)
# 2.1 Show the grid
plt.grid()

# 3. Display the image
plt.show()

3.3 Common graphic drawing

https://matplotlib.org/index.html

3.3.1 Types and meanings of common graphics

Matplotlib can draw line charts, scatter plots, bar charts, histograms, and pie charts.
We need to know the meaning of different statistical graphs to decide which statistical graph to choose to present our data.

(1) Line chart 

Line chart: A statistical chart that expresses the increase or decrease of the statistical quantity with the rise or fall of the broken line.
Features: It can display the changing trend of data and reflect the changing situation of things. (change)
api: plt.plot(x, y)

(2) Scatter plot

Scatter plot: Use two sets of data to form multiple coordinate points, examine the distribution of coordinate points, judge whether there is a certain relationship between two variables or summarize the distribution pattern of coordinate points.
Features: Judging whether there is a quantitative correlation trend between variables, and displaying outliers (distribution rules).

api:plt.scatter(x, y)

(3) Bar chart

Bar chart: Data arranged in the columns or rows of a worksheet can be plotted as a bar chart.
Features: Plots discrete data, so you can see the size of each value at a glance and compare the differences between values. (Statistics/Comparison)
api: plt.bar(x, height, width=0.8, align='center', **kwargs)

(4) Histogram

Histogram: A chart that represents the distribution of data with a series of vertical bars of unequal height. Generally, the horizontal axis represents the data range, and the vertical axis represents the frequency.
Features: Plots continuous data to show the distribution of one or more sets of data (statistics)
api: matplotlib.pyplot.hist(x, bins=None)

(5) pie chart

Pie chart: It is used to represent the proportion of different categories, and compare various categories through the size of the arc.

Features: Proportion of classified data (proportion)
api: plt.pie(x, labels=, autopct=, colors=)

3.3.2 Scatter plot drawing

Demand: Explore the relationship between housing area and housing price.

import matplotlib.pyplot as plt

# Housing area data
x = [225.98, 247.07, 253.14, 457.85, 241.58, 301.01, 20.67, 288.64, 163.56, 120.06, 207.83, 342.75, 147.9, 53.06, 224.72, 29.51, 21.61, 483.21, 245.25, 399.25, 343.35]
# Housing price data
y = [196.63, 203.88, 210.75, 372.74, 202.41, 247.61, 24.9, 239.34, 140.32, 104.15, 176.84, 288.23, 128.79, 49.64, 191.74, 33.1, 30.74, 400.02, 205.35, 330.64, 283.45]

plt.figure(figsize=(20, 8), dpi=100)
plt.scatter(x, y)
plt.show()

3.3.3 Bar chart drawing

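import matplotlib.pyplot as plt

# Movie titles (x-axis categories) and their box-office takings
movie_name = ['雷神3∶诸神黄昏', '正义联盟', '东方快车谋杀案', '寻梦环游记', '全球风暴', '降魔传', '追捕', '七十七天', '密战', '狂兽', '其它']
x = range(len(movie_name))
y = [73853, 57767, 22354, 15969, 14839, 8725, 8716, 8318, 7916, 6764, 52222]

plt.figure(figsize=(20, 8), dpi=100)
# One bar per movie, each with its own color
plt.bar(x, y, width=0.5, color=['b', 'r', 'g', 'y', 'c', 'm', 'y', 'k', 'c', 'g', 'b'])
plt.xticks(x, movie_name, fontsize=15)
plt.grid()
plt.title('某月电影票房统计', fontsize=20)  # title: box-office statistics for one month
plt.show()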

4. NumPy

4.1 Advantages of numpy

  

4.1.1 Introduction to numpy

  • Numpy (Numerical Python) is an open source Python scientific computing library for fast processing of arrays of any dimension
  • Numpy supports common array and matrix operations. For the same numerical computing task, using Numpy is much more concise than using Python directly.
  • Numpy handles multidimensional arrays using the ndarray object, which is a fast and flexible container for large data.

4.1.2 ndarray introduction 

NumPy provides an N-dimensional array type, ndarray, that describes a collection of "items" of the same type.

4.1.3 Comparison of operation efficiency between ndarray and Python native list

One-dimensional arrays can be stored using Python lists, and multi-dimensional arrays can be realized by nesting lists, so why do you need to use Numpy's ndarray? 

Here we experience the benefits of ndarray by running a piece of code:

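import random
import numpy as np

# Build a Python list of 10,000 random floats
a = []
for i in range(10000):
    a.append(random.random())
# %time is an IPython/Jupyter magic, so run this in a notebook cell
%time sum1 = sum(a)

# The same data as an ndarray
b = np.array(a)
%time sum2 = np.sum(b)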

        From this, we can see that ndarray computes much faster, saving time.
        Machine learning is characterized by a large amount of numerical computation, so without a fast way to compute, Python could not achieve good results in the field of machine learning.
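Outside a notebook the %time magic is not available. A rough equivalent using the standard timeit module is sketched below; the list size and the number of repetitions are arbitrary choices, and the measured times depend on the machine, so no particular speed-up is claimed here.

```python
import random
import timeit

import numpy as np

# The same values stored as a Python list and as an ndarray
values = [random.random() for _ in range(100_000)]
array = np.array(values)

# Time 20 repetitions of each summation
list_seconds = timeit.timeit(lambda: sum(values), number=20)
array_seconds = timeit.timeit(lambda: np.sum(array), number=20)

print(f"built-in sum over list: {list_seconds:.4f} s for 20 runs")
print(f"np.sum over ndarray:    {array_seconds:.4f} s for 20 runs")
```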

4.1.4 Advantages of ndarray

(1) Memory block style

What is the difference between ndarray and the native python list? Please see a picture:

        From the figure, we can see that ndarray stores its data at contiguous memory addresses, which makes batch operations on array elements faster.
        This is because all elements in an ndarray have the same type, while the elements of a Python list can be of any type. An ndarray can therefore store its elements contiguously in memory, whereas a native Python list has to follow a reference to find each element. As a result, Numpy's ndarray is less versatile than Python's native list, but in scientific computing it saves a lot of loop statements, and the code is much simpler than with native lists.

(2) ndarray supports parallelized operations (vectorized operations)

(3) Efficiency is much higher than pure Python code

The bottom layer of Numpy is written in C, and it releases the GIL (Global Interpreter Lock) internally for many operations. Its speed on arrays is therefore not limited by the Python interpreter, so it is much more efficient than pure Python code.

4.2 N-dimensional array-ndarray

4.2.1 Properties of ndarray

Array properties reflect information inherent in the array itself.

4.2.2 The shape of ndarray

4.2.3 Types of ndarrays

dtype is numpy.dtype type, first look at what types are available for arrays:

Note: If no type is specified, integers usually default to int64 and floating-point numbers default to float64 (the default integer type depends on the platform and NumPy version).
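The properties, shape and type discussed in this section can be inspected directly on an array. The following small sketch uses made-up values; the dtype is given explicitly for the first array so that the output does not depend on the platform.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

print(a.shape)     # (2, 3)   rows and columns
print(a.ndim)      # 2        number of dimensions
print(a.size)      # 6        total number of elements
print(a.dtype)     # float32  element type
print(a.itemsize)  # 4        bytes per element

# Without an explicit dtype, decimals default to float64
b = np.array([1.5, 2.5])
print(b.dtype)     # float64
```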

4.3 Basic Operation

4.3.1 Method of generating array

(1) Generate an array of 0 and 1 

(2) Generated from an existing array

Generation method: np.array makes a copy by default, so the new array and the original do not affect each other; np.asarray does not copy when the input is already an ndarray, so both names refer to the same data.
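The difference becomes visible when the original array is modified afterwards. This is a minimal sketch with made-up values:

```python
import numpy as np

original = np.array([[1, 2, 3], [4, 5, 6]])

copied = np.array(original)     # independent copy of the data
aliased = np.asarray(original)  # no copy: the input is already an ndarray

# Change the original after both arrays were created
original[0, 0] = 100

print(copied[0, 0])         # 1    the copy is unaffected
print(aliased[0, 0])        # 100  the alias sees the change
print(aliased is original)  # True
```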

(3) Generate a fixed-range array

  • Generate equally spaced sequences: np.linspace (start, stop, num, endpoint)

  • Others are:

        numpy.arange(start,stop, step, dtype)

        numpy.logspace(start,stop, num)
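The three generators differ mainly in how the spacing is specified: linspace takes the number of points, arange takes the step, and logspace takes exponents of 10. A short sketch with arbitrary example ranges:

```python
import numpy as np

# 5 evenly spaced points from 0 to 10, endpoint included by default
evenly = np.linspace(0, 10, 5)
print(evenly)        # [ 0.   2.5  5.   7.5 10. ]

# The same range without the endpoint
half_open = np.linspace(0, 10, 5, endpoint=False)
print(half_open)     # [0. 2. 4. 6. 8.]

# Fixed step of 2; stop is excluded
stepped = np.arange(0, 10, 2)
print(stepped)       # [0 2 4 6 8]

# 3 points from 10**0 to 10**2, evenly spaced on a log scale
powers = np.logspace(0, 2, 3)
print(powers)        # [  1.  10. 100.]
```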

(4) Generate a random array 

Using the module: np.random

Uniform distribution:

  • np.random.rand(d0, d1, ... , dn) returns a set of uniformly distributed numbers in [0.0, 1.0).

  • np.random.uniform(low=0.0, high=1.0, size=None)
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
mpl.rcParams['axes.unicode_minus'] = False  # Fix the display of minus signs

# Draw 10,000,000 samples from the uniform distribution on [-1, 1)
x1 = np.random.uniform(-1, 1, 10000000)
plt.figure(figsize=(10, 3), dpi=100)
plt.hist(x=x1, bins=1000)
plt.show()

  • np.random.randint(low, high=None, size=None, dtype=int)
    Samples from a discrete uniform distribution to generate an integer or an N-dimensional integer array. If high is not None, the integers are drawn from [low, high); otherwise they are drawn from [0, low).

Normal distribution:

  •  np.random.randn(d0, d1, ...,dn)

        Function: Return one or more sample values from the standard normal distribution.

  • np.random.normal(loc=0.0, scale=1.0, size=None)
# 100 million samples take about 800 MB as float64; reduce the size if memory is limited
x2 = np.random.normal(1.75, 1, 100000000)
plt.figure(figsize=(20, 8), dpi=100)
plt.hist(x2, 1000)
plt.show()

  • np.random.standard_normal(size=None)
    Returns an array of the specified shape drawn from the standard normal distribution.

Case: Randomly generate the daily gains of 8 stocks over 2 weeks of trading days (10 days).

stock_change = np.random.normal(0, 1,(8,10))

4.3.2 Indexing and slicing of arrays

How to index one-dimensional, two-dimensional and three-dimensional arrays?
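In each case the index has one position per dimension, separated by commas, and every position accepts either a single index or a slice. The arrays below are built with np.arange purely so that the results are easy to check by eye.

```python
import numpy as np

a1 = np.arange(6)                     # [0 1 2 3 4 5]
a2 = np.arange(12).reshape(3, 4)      # 3 rows, 4 columns
a3 = np.arange(24).reshape(2, 3, 4)   # 2 blocks of 3 rows and 4 columns

print(a1[1:4])       # [1 2 3]
print(a2[0, 1])      # 1            row 0, column 1
print(a2[:2, 1:3])   # [[1 2]
                     #  [5 6]]      first two rows, columns 1 and 2
print(a3[1, 2, 3])   # 23           block 1, row 2, column 3
print(a3[0, :, 0])   # [0 4 8]      first column of block 0
```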

4.3.3 Shape modification

  • ndarray.reshape(shape[, order])
    Returns an array with the new shape (usually a view of the same data); the shape of the original array is unchanged.

  • ndarray.resize(new_shape[, refcheck])
    Modifies the original array in place.
  • ndarray.T
    Transposes the array, exchanging its rows and columns.
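The three operations can be compared side by side. In this sketch, refcheck=False is passed to resize only to avoid the reference check failing in interactive sessions, where the interpreter may hold extra references to the array; it is safe here because nothing else shares the array's data.

```python
import numpy as np

# reshape: returns a new array object, the original keeps its shape
a = np.arange(6)
b = a.reshape(2, 3)
print(a.shape, b.shape)   # (6,) (2, 3)

# resize: changes the array itself and returns None
c = np.arange(6)
c.resize((3, 2), refcheck=False)
print(c.shape)            # (3, 2)

# T: rows become columns
print(b.T.shape)          # (3, 2)
print(b.T[0].tolist())    # [0, 3]  the first column of b
```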

4.3.4 Type modification

  • ndarray.astype(type)
  • ndarray.tobytes([order]) (ndarray.tostring([order]) is a deprecated alias; prefer tobytes)

4.3.5 Deduplication of arrays

  • np.unique(arr) returns the sorted unique values of the array
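Type conversion and deduplication can be tried on a small array. The values below are invented for illustration; note that deduplication is the function np.unique, which takes the array as its argument.

```python
import numpy as np

prices = np.array([[1.9, 2.1], [3.7, 2.1]])

# astype returns a converted copy; float -> int drops the fractional part
as_int = prices.astype(np.int32)
print(as_int.tolist())      # [[1, 2], [3, 2]]

# tobytes returns the raw memory: 4 elements x 4 bytes each
raw = as_int.tobytes()
print(len(raw))             # 16

# np.unique flattens the array and returns the sorted distinct values
unique_values = np.unique(as_int)
print(unique_values)        # [1 2 3]
```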

4.4 ndarray operations

4.4.1 Logic operations

4.4.2 General judgment function

  • np.all()
  • np.any()
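A comparison applied to an ndarray produces a boolean array of the same shape, which can then be passed to np.all and np.any or used as a mask. A minimal sketch with made-up exam scores:

```python
import numpy as np

scores = np.array([[80, 55, 91],
                   [62, 47, 78]])

# Logic operation: element-wise comparison gives a boolean array
passed = scores >= 60
print(passed.tolist())    # [[True, False, True], [True, False, True]]

# General judgment functions
print(np.all(passed))     # False  not every score passed
print(np.any(passed))     # True   at least one score passed

# Boolean indexing: select the elements where the mask is True
print(scores[passed])     # [80 91 62 78]
```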

4.4.3 np.where (ternary operator)

More complex calculations can be performed by using np.where.

  • np.where()
  • Composite logic needs to be used in combination with np.logical_and and np.logical_or
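np.where(condition, x, y) picks x where the condition holds and y elsewhere, much like a ternary expression applied to every element. The temperatures and thresholds below are arbitrary example values.

```python
import numpy as np

temps = np.array([12.0, 17.5, 21.0, 30.5, 8.0])

# Single condition: 1 where warmer than 15 degrees, otherwise 0
warm = np.where(temps > 15, 1, 0)
print(warm)      # [0 1 1 1 0]

# Both conditions must hold: between 15 and 25 degrees
mild = np.where(np.logical_and(temps > 15, temps < 25), 1, 0)
print(mild)      # [0 1 1 0 0]

# Either condition may hold: below 10 or above 30 degrees
extreme = np.where(np.logical_or(temps < 10, temps > 30), 1, 0)
print(extreme)   # [0 0 0 1 1]
```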

4.4.4 Statistics calculation

In the field of data mining/machine learning, the value of statistical indicators is also a way for us to analyze problems.

Commonly used indicators are as follows:

  • min(a[, axis, out, keepdims])
  • max(a[, axis, out, keepdims])
  • median(a[, axis, out, overwrite_input, keepdims]) median
  • mean(a[, axis, dtype, out, keepdims]) mean
  • std(a[, axis, dtype, out, ddof, keepdims]) standard deviation
  • var(a[, axis, dtype, out, ddof, keepdims]) variance
  • np.argmax(a, axis=) index of the maximum
  • np.argmin(a, axis=) index of the minimum

When computing statistics, pay attention to the axis argument. For a two-dimensional array, axis=0 computes the statistic down each column (one result per column), and axis=1 computes it along each row (one result per row).
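The effect of the axis argument is easiest to see on a small two-dimensional array. The numbers here are chosen only so that the results can be verified by hand.

```python
import numpy as np

data = np.array([[1, 5, 3],
                 [4, 2, 6]])

print(np.max(data))             # 6              over the whole array
print(np.max(data, axis=0))     # [4 5 6]        one result per column
print(np.max(data, axis=1))     # [5 6]          one result per row
print(np.mean(data, axis=0))    # [2.5 3.5 4.5]
print(np.median(data))          # 3.5

# Positions of the extremes instead of their values
print(np.argmax(data, axis=1))  # [1 2]
print(np.argmin(data, axis=0))  # [0 1 0]
```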

4.5 Mathematics: Matrices

4.5.1 Matrices and vectors

(1) Matrix

(2) vector

4.5.2 Addition and scalar multiplication

4.5.3 Matrix-vector multiplication

4.5.4 Matrix multiplication

4.5.5 Properties of matrix multiplication

4.5.6 Inverse and Transpose

4.6 Operations between arrays

4.6.1 Operations on Arrays and Numbers

4.6.2 Array-to-Array Operations: Broadcast Mechanism

        Broadcasting applies when two ndarrays take part in an element-wise operation. Its purpose is to make mathematical operations between ndarrays (the core data structure of the numpy library) of different shapes convenient.
        When operating on two arrays, numpy compares their shapes (tuples) element by element, starting from the trailing dimension. Two arrays can be combined only if every pair of corresponding dimensions satisfies one of the following conditions:

  • The dimensions are equal
  • One of the dimensions is 1
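The rule can be checked on a few small shapes. In the sketch below, a (2, 3) array is combined with a (3,) array and a (2, 1) array, both of which are compatible, and with a (2,) array, which is not because the trailing dimensions 3 and 2 differ and neither is 1.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)   matches the trailing 3
col = np.array([[100], [200]])       # shape (2, 1) the 1 is stretched to 3

print((a + row).tolist())   # [[11, 22, 33], [14, 25, 36]]
print((a + col).tolist())   # [[101, 102, 103], [204, 205, 206]]

# Incompatible shapes raise a ValueError
try:
    a + np.array([1, 2])             # shape (2,) vs trailing dimension 3
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)     # True
```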

 

4.6.3 Matrix Multiplication API

  • np.matmul matrix multiplication
  • np.dot dot product
  • Note: For two-dimensional arrays the two give the same result; however, np.dot also supports multiplying a matrix by a scalar, while np.matmul does not.
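A small sketch of the two functions on 2 x 2 matrices with made-up values. The @ operator is included because it is the operator form of matrix multiplication for arrays.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

print(np.matmul(a, b).tolist())   # [[19, 22], [43, 50]]
print(np.dot(a, b).tolist())      # [[19, 22], [43, 50]]
print((a @ b).tolist())           # [[19, 22], [43, 50]]

# np.dot accepts a scalar operand
print(np.dot(a, 2).tolist())      # [[2, 4], [6, 8]]

# np.matmul rejects a scalar operand
try:
    np.matmul(a, 2)
    matmul_scalar_failed = False
except ValueError:
    matmul_scalar_failed = True
print(matmul_scalar_failed)       # True
```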

5. Pandas

5.1 Introduction to Pandas

5.1.1 Introduction to Pandas

  • A library developed by Wes McKinney in 2008
  • An open source python library dedicated to data mining
  • Based on Numpy, take advantage of the high performance of the Numpy module in computing
  • Based on matplotlib, it is easy to draw pictures
  • Unique Data Structure

5.1.2 Why use Pandas

Numpy can already help us process data, and combined with matplotlib it can solve problems such as data display, so why learn pandas?

  • Convenient data processing ability
  • Easy to read files
  • Encapsulates the drawing and calculation of Matplotlib and Numpy

5.1.3 Case

5.1.4 DataFrame

(1) DataFrame structure

DataFrame objects have both row and column indices

  • Row index: identifies the different rows (horizontal index); called index; axis 0 (axis=0)
  • Column index: identifies the different columns (vertical index); called columns; axis 1 (axis=1)

(2) DataFrame attribute

  • object.shape        the (rows, columns) shape of the DataFrame
  • object.index        row index of the DataFrame
  • object.columns      column index of the DataFrame
  • object.values       the underlying values as an array
  • object.T            transpose
  • object.head(5)      displays the first 5 rows. With no argument the default is 5 rows; pass N to display the first N rows
  • object.tail(5)      displays the last 5 rows. With no argument the default is 5 rows; pass N to display the last N rows
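These attributes and methods can be tried on a small DataFrame. The row labels, column names and values in this sketch are invented for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=["student_0", "student_1", "student_2", "student_3"],
    columns=["math", "english", "physics"],
)

print(scores.shape)           # (4, 3)
print(list(scores.index))     # the four row labels
print(list(scores.columns))   # ['math', 'english', 'physics']
print(type(scores.values))    # <class 'numpy.ndarray'>
print(scores.T.shape)         # (3, 4)
print(scores.head(2))         # first two rows
print(scores.tail(1))         # last row
```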

(3) DataFrame index settings

1. Modify the row and column index values:

Note: The index can only be replaced as a whole; modifying a single index value in place is wrong and raises an error.
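
A minimal sketch of the right and wrong ways to modify an index, assuming a small made-up DataFrame (the labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"open": [10.0, 11.0, 12.0]})

# Correct: replace the whole index at once
df.index = ["stock_0", "stock_1", "stock_2"]

# Wrong: an Index is immutable, so assigning to one element raises TypeError
try:
    df.index[1] = "stock_x"
    item_assignment_worked = True
except TypeError:
    item_assignment_worked = False
```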
        

2. Reset the index:

  • reset_index(drop=False)
    - Resets the index to a new default integer index.
    - drop: defaults to False, which keeps the original index as a column; if True, the original index values are discarded.

  • set_index(keys, drop=True)
    - keys: a column name or a list of column names.
    - drop: boolean, default True. When a column is used as the new index, it is removed from the columns.
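
The effect of the drop parameter in both calls can be sketched as follows, on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"close": [10.5, 11.2, 9.8]}, index=["a", "b", "c"])

# reset_index: drop=False (default) keeps the old index as a column named "index"
with_old = df.reset_index()
# drop=True discards the old index entirely
without_old = df.reset_index(drop=True)

# set_index: drop=True (default) removes the column that becomes the index
moved = with_old.set_index("index")
# drop=False keeps the column as well
copied = with_old.set_index("index", drop=False)
```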

3. Set up a new index case:

  • Create the DataFrame
  • Set the new index to month
  • Set a multi-level index of year and month, so that the DataFrame can represent three-dimensional data

    Note: After these settings, the DataFrame becomes a DataFrame with a MultiIndex.
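
The three steps can be sketched like this. The month, year and sale values are made up for illustration:

```python
import pandas as pd

# Step 1: create the DataFrame (hypothetical values)
df = pd.DataFrame({
    "month": [1, 4, 7, 10],
    "year": [2012, 2014, 2013, 2014],
    "sale": [55, 40, 84, 31],
})

# Step 2: use month as the new index
by_month = df.set_index("month")

# Step 3: use year and month together; the result has a MultiIndex
by_year_month = df.set_index(["year", "month"])
is_multi = isinstance(by_year_month.index, pd.MultiIndex)
```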

5.2 Basic Data Operation

To better understand these basic operations, we will read in some real stock data. File operations are introduced later; here we simply use the API.

5.2.1 Index operations

In the NumPy section we covered selecting data by index and by slice. Pandas supports similar operations, and you can also select by column name, row name, or a combination of the two.

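(1) Use the row and column index directly (column first, then row)

(2) Use indexes with loc (label-based) or iloc (position-based)

(3) Use ix to combine label and position indexes (ix was deprecated in pandas 0.20 and removed in pandas 1.0; use loc or iloc instead)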
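
A minimal sketch of the three styles, using made-up numbers in place of the stock data. Since ix is no longer available in current pandas, the mixed selection is written with loc and iloc instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(3, 4),
    index=["2018-02-27", "2018-02-26", "2018-02-23"],
    columns=["open", "high", "close", "low"],
)

# (1) Direct indexing: column first, then row
direct = df["open"]["2018-02-27"]

# (2) loc selects by label, iloc selects by position
by_label = df.loc["2018-02-26", "close"]
by_position = df.iloc[:2, :2]

# (3) Mixing positions and labels without ix:
#     turn positions into labels for loc, or labels into positions for iloc
mixed_loc = df.loc[df.index[0:2], ["open", "close"]]
mixed_iloc = df.iloc[0:2, df.columns.get_indexer(["open", "close"])]
```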

5.2.2 Assignment operation

5.2.3 Sorting

There are two kinds of sorting: sorting by index and sorting by content

  • Use df.sort_values(by=, ascending=) to sort by a single key or by multiple keys
    - ascending=False: descending
    - ascending=True: ascending (the default)
    - Note: by can accept multiple values; rows are sorted by the first key, and rows that tie are ordered by the following keys
  • Use df.sort_index() to sort by the index.
    The date index of this stock data originally ran from latest to earliest; after sorting it runs from earliest to latest
  • Use series.sort_values(ascending=True) to sort a Series.
    A Series has only one column, so the by parameter is not needed
  • series.sort_index() sorts a Series by its index, the same as for a DataFrame
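
The sorting calls above can be sketched on a small made-up table (the values are hypothetical, not the real stock data):

```python
import pandas as pd

df = pd.DataFrame(
    {"open": [23.5, 22.8, 23.5], "high": [25.9, 23.8, 24.1]},
    index=["2018-02-27", "2018-02-26", "2018-02-23"],
)

# Sort by one column, ascending
by_open = df.sort_values(by="open", ascending=True)

# Sort by two columns, descending: ties in "open" are broken by "high"
by_two = df.sort_values(by=["open", "high"], ascending=False)

# Sort by the index (earliest date first)
by_index = df.sort_index()

# A Series needs no "by" argument
s_sorted = df["high"].sort_values(ascending=True)
s_by_index = df["high"].sort_index()
```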

5.3 DataFrame operation

 

5.3.1 Arithmetic operations

Use the methods add, sub, ... directly, or use the operators +, -, ...

  • add(other)
  • sub(other)
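
A minimal sketch of both forms, with made-up open and close prices:

```python
import pandas as pd

df = pd.DataFrame({"open": [23.0, 22.5], "close": [24.5, 23.0]})

# Method form and operator form give the same result
plus_one = df["open"].add(1)
same_as_operator = plus_one.equals(df["open"] + 1)

# Subtract one column from another element by element
diff = df["close"].sub(df["open"])
```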

5.3.2 Logic operations

(1) Logical operators <, >, |, &

  • For example, filter the rows with p_change > 2

  • Combine several conditions, e.g. filter the rows with p_change > 2 and open > 15
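
The two filters can be sketched as follows. The column names follow the text; the values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "open": [23.5, 14.2, 16.0, 12.1],
    "p_change": [2.7, 3.1, 0.4, -1.2],
})

# A comparison produces a Boolean Series, which is then used to select rows
mask = df["p_change"] > 2
rising = df[mask]

# Each condition needs its own parentheses when combined with & or |
rising_and_high = df[(df["p_change"] > 2) & (df["open"] > 15)]
```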

(2) Logical operation function

  • query(expr)
    - expr: query string
  • isin(values)
    - values: the values to test membership against
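
A short sketch of the two functions on made-up data (column names as in the text):

```python
import pandas as pd

df = pd.DataFrame({
    "open": [23.5, 14.2, 16.0, 12.1],
    "p_change": [2.7, 3.1, 0.4, -1.2],
})

# query takes the condition as a string that refers to column names
q = df.query("p_change > 2 & open > 15")

# isin marks the rows whose value is one of the listed values
picked = df[df["open"].isin([14.2, 16.0])]
```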

5.3.3 Statistical calculation

(1)describe()

(2) Statistics function

These statistics were covered in detail in the NumPy section. Here we demonstrate min (minimum), max (maximum), mean (average), median, var (variance), std (standard deviation), and mode:

By default, each statistic is computed column by column (axis=0). To compute it row by row, specify axis=1.

  • max(), min()
  • std(), var()
  • median()
  • idxmax(), idxmin()   get the index label of the maximum/minimum value
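
A minimal sketch of these functions and of the axis argument, on a small made-up table:

```python
import pandas as pd

df = pd.DataFrame(
    {"open": [1.0, 4.0, 2.0], "close": [3.0, 2.0, 5.0]},
    index=["d1", "d2", "d3"],
)

summary = df.describe()     # count, mean, std, min, quartiles, max per column

col_max = df.max()          # default axis=0: one value per column
row_max = df.max(axis=1)    # axis=1: one value per row
col_median = df.median()
idx_of_max = df.idxmax()    # index label where each column reaches its maximum
```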

5.3.4 Cumulative statistics function

The cumulative statistics functions can operate on both Series and DataFrame objects.
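
A minimal sketch, assuming the functions meant here are the pandas cumulative methods cumsum, cummax, cummin and cumprod; the input values are made up:

```python
import pandas as pd

s = pd.Series([1, 3, 2, 5])

running_sum = s.cumsum()    # sum of the first 1, 2, 3, ... values
running_max = s.cummax()    # largest value seen so far
running_min = s.cummin()    # smallest value seen so far
running_prod = s.cumprod()  # product of the first 1, 2, 3, ... values

# The same methods work column by column on a DataFrame
df_sum = pd.DataFrame({"a": [1, 2], "b": [10, 20]}).cumsum()
```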

5.3.5 Custom functions

  • apply(func, axis=0)
    - func: custom function
    - axis=0: the default, applies the function to each column; axis=1 applies it to each row

Example: define a function that computes the maximum minus the minimum of each column
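
The example can be sketched as follows, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"open": [1.0, 4.0, 2.0], "close": [3.0, 2.0, 5.0]})

# axis=0: the function receives each column as a Series
col_range = df.apply(lambda x: x.max() - x.min(), axis=0)

# axis=1: the function receives each row as a Series
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)
```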

5.4 Pandas drawing

pandas.DataFrame.plot

5.5 File reading and storage

Most of our data is stored in files, so pandas supports a wide range of IO operations. The pandas API supports many file formats, such as CSV, SQL, XLS, JSON, and HDF5.
Note: HDF5 and CSV files are the most commonly used.

5.5.1 CSV

(1) read_csv

pandas.read_csv(filepath_or_buffer, sep=',', usecols=None)

  • filepath_or_buffer: file path
  • sep: delimiter, a comma by default
  • usecols: the column names to read, given as a list

(2) to_csv
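
A minimal round-trip sketch. To stay self-contained it reads from an in-memory string instead of a file path; the column names and values are made up:

```python
import io
import pandas as pd

csv_text = "date,open,close\n2018-02-27,23.5,24.1\n2018-02-26,22.8,23.5\n"

# Read only two of the three columns
df = pd.read_csv(io.StringIO(csv_text), sep=",", usecols=["date", "close"])

# With no path given, to_csv returns the CSV text; index=False omits the row index
out = df.to_csv(index=False)

# Reading the written text back gives the same table
round_trip = pd.read_csv(io.StringIO(out))
```

In practice a file path would be passed to both functions in place of the in-memory buffer.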

5.5.2 HDF5

(1) read_hdf and to_hdf

5.5.3 JSON

JSON is a commonly used data exchange format. It is often used for communication between front end and back end, and is also chosen as a storage format, so we need to know how Pandas reads and writes JSON.

(1)read_json

 

(2)to_json
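
A minimal round-trip sketch, assuming one JSON record per line (orient="records" with lines=True). The column names and values are hypothetical, and an in-memory buffer stands in for a file:

```python
import io
import pandas as pd

df = pd.DataFrame({"headline": ["a", "b"], "is_sarcastic": [0, 1]})

# Write one JSON object per row, one row per line
text = df.to_json(orient="records", lines=True)

# Read it back with the same settings
back = pd.read_json(io.StringIO(text), orient="records", lines=True)
```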

5.5.4 Expansion

HDF5 is the preferred storage format:

  • HDF5 supports compression when storing. One available method is blosc, which is fast and is supported by pandas.
  • Compression improves disk utilization and saves space
  • HDF5 is also cross-platform and can easily be migrated to Hadoop

5.6 Advanced processing - missing value processing

5.6.1 How to deal with nan

 

 

5.6.2 Missing values that are not NaN but use a default marker

The data is like this:

Analysis of the approach:

  • 1. First replace '?' with np.nan
    df.replace(to_replace=, value=)
            - to_replace: the value before replacement
            - value: the value after replacement
  • 2. Then handle the missing values (delete or fill them)
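
The two steps can be sketched as follows. The column names and values are hypothetical; the columns hold strings, as they would after reading a file that contains '?' markers, and dropna is used here as one possible way to handle the missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"clump": ["5", "?", "3"], "nuclei": ["1", "10", "?"]})

# Step 1: turn the '?' marker into a real missing value
cleaned = df.replace(to_replace="?", value=np.nan)
n_missing = int(cleaned.isna().sum().sum())

# Step 2: handle the missing values, here by dropping the affected rows,
# then convert the remaining strings to numbers
dropped = cleaned.dropna().astype(float)
```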

5.7 Advanced Processing - Data Discretization

 

 

5.7.1 Why Discretization

The purpose of discretizing continuous attributes is to simplify the data structure: discretization reduces the number of distinct values a continuous attribute takes. Discretization methods are often used as tools in data mining.

5.7.2 What is data discretization

Discretizing a continuous attribute means dividing its value range into several discrete intervals, and then using different symbols or integer values to represent the attribute values that fall into each sub-interval.
There are many discretization methods; here we use one of the simplest:

  • Original height data: 165, 174, 160, 180, 159, 163, 192, 184
  • Suppose the heights are divided into intervals: 150~165, 165~180, 180~195

This divides the data into three intervals, which we can label short, medium, and tall, and finally process into a "dummy variable" matrix:

5.7.3 Data grouping operation

Tools used:

  • pd.qcut(data, q):
    Splits the data into q groups containing roughly equal numbers of items. It is generally used together with value_counts to count the number of items in each group
  • series.value_counts(): counts the number of items in each group

Custom interval grouping:

  • pd.cut(data, bins): groups the data by the custom interval edges given in bins
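
A minimal sketch of both tools, reusing the height data from the earlier example:

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])

# qcut: 4 groups with (roughly) equal numbers of items
quartiles = pd.qcut(heights, 4)
quartile_counts = quartiles.value_counts()

# cut: custom interval edges, giving (150, 165], (165, 180], (180, 195]
custom = pd.cut(heights, bins=[150, 165, 180, 195])
custom_counts = custom.value_counts()
```

Note that each interval produced by pd.cut is closed on the right by default, so 165 falls into (150, 165] and 180 into (165, 180].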
     ​​​​​​​

5.7.4 Converting the grouped data to one-hot encoding

Generate a Boolean column for each category; for each sample, exactly one of these columns takes the value 1. This is called one-hot encoding.

Alias: dummy variables
Convert the table on the left in the figure below to the form on the right:

pandas.get_dummies(data, prefix=None)

  • data: array-like, Series, or DataFrame
  • prefix: prefix for the generated column names (the group name)
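
A minimal sketch on the height data. The labels short, medium and tall and the prefix "height" are choices made for this illustration:

```python
import pandas as pd

heights = pd.Series([165, 174, 160, 180, 159, 163, 192, 184])
groups = pd.cut(
    heights,
    bins=[150, 165, 180, 195],
    labels=["short", "medium", "tall"],
)

# One column per group; each row has exactly one column set
dummies = pd.get_dummies(groups, prefix="height")

# Depending on the pandas version the columns are Boolean or 0/1 integers;
# astype(int) gives 0/1 in either case
as_int = dummies.astype(int)
```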

5.8  Advanced Processing - Data Merging

If your data consists of multiple tables, sometimes it is necessary to merge different contents together for analysis. 

 

5.8.1 pd.concat realizes data merging

pd.concat([data1, data2], axis=1)
        - Concatenates by row or by column: axis=0 stacks the tables vertically (adding rows), axis=1 joins them side by side (adding columns), aligned on the row index
        For example, we merge the one-hot encoding just produced with the original data:
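
A minimal sketch of both directions, using a small made-up height table and its one-hot encoding:

```python
import pandas as pd

data = pd.DataFrame({"height": [165, 174, 192]})
groups = pd.cut(
    data["height"],
    bins=[150, 165, 180, 195],
    labels=["short", "medium", "tall"],
)
dummies = pd.get_dummies(groups, prefix="height")

# axis=1: place the one-hot columns next to the original column
side_by_side = pd.concat([data, dummies], axis=1)

# axis=0: stack two tables on top of each other
stacked = pd.concat([data, data], axis=0)
```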
        

         

5.8.2 pd.merge realizes data merging

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

  • Merges two DataFrames on their common key columns, or on keys specified separately for the left and right sides
  • left: a DataFrame object
  • right: another DataFrame object
  • how: the join type, e.g. 'inner' (default), 'outer', 'left', or 'right'
  • on: column names to join on; they must be present in both the left and right DataFrame objects
  • left_on=None, right_on=None: the key columns of the left and right DataFrame, specified separately
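
A minimal sketch of the parameters above, using two small made-up tables:

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": ["B0", "B1", "B3"]})

# inner: keep only the keys present in both tables
inner = pd.merge(left, right, how="inner", on="key")

# outer: keep the keys from either table, filling gaps with NaN
outer = pd.merge(left, right, how="outer", on="key")

# left_on / right_on: the key column has a different name on each side
right_renamed = right.rename(columns={"key": "rkey"})
left_join = pd.merge(left, right_renamed, how="left", left_on="key", right_on="rkey")
```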

 

5.9 Advanced Processing - Crosstab and PivotTable

 

5.9.1 What are the functions of crosstab and pivot table

5.9.2 Using crosstab to produce the figure above

Crosstab: a crosstab counts how often each value of one column occurs together with each value of another column (to find the relationship between the two columns)

  • pd.crosstab(value1, value2)
  • DataFrame.pivot_table([], index=[])

5.9.3 Case - exploring the relationship between stocks and days of the week

 

However, the counts are only the number of good and bad days for each day of the week; they do not give the ratio. How do we get it?

  • Sum the total number of days for each weekday (Monday, etc.), then divide the counts by that total to get the ratio.
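
The count-then-divide idea can be sketched as follows. The weekday and p_change values are made up, and the 0/1 column rose (1 for a day on which the price went up) is a hypothetical name introduced for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "weekday": [0, 0, 1, 1, 1, 2],
    "p_change": [1.2, -0.5, 0.3, 2.1, -1.0, 0.8],
})
df["rose"] = (df["p_change"] > 0).astype(int)

# Count the falling (0) and rising (1) days for each weekday
count = pd.crosstab(df["weekday"], df["rose"])

# Total number of days per weekday, then divide each row by its total
total = count.sum(axis=1)
ratio = count.div(total, axis=0)
```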

Visualization: 

 

5.9.4 Implementation using pivot_table (pivot table)
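
A minimal sketch of the pivot_table approach, on made-up data with a hypothetical 0/1 column rose (1 for a rising day). Because pivot_table aggregates with the mean by default, the mean of a 0/1 column is directly the ratio of rising days:

```python
import pandas as pd

df = pd.DataFrame({
    "weekday": [0, 0, 1, 1, 1, 2],
    "rose": [1, 0, 1, 1, 0, 1],
})

# Mean of "rose" per weekday = share of rising days for that weekday
table = df.pivot_table(["rose"], index="weekday")
```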

5.10 Advanced Processing - Grouping and Aggregation

Grouping and aggregation is a common way of analyzing data, usually combined with statistical functions to examine each group.
In fact, the crosstab and pivot table above also group the data, so they are a form of grouping, but they mainly compute counts or ratios.

5.10.1 What grouping and aggregation are

Grouping without aggregation has no meaning, so grouping and aggregation are generally inseparable. 

5.10.2 Group API

DataFrame.groupby(key, as_index=False)

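  • key: the column(s) to group by; more than one can be given
  • as_index: whether the group keys become the index of the result (the pandas default is True)

Case: price data for pens of different colors
        - group by color, then aggregate the price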
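
The case can be sketched as follows. The table contents and the column names color, object and price1 are made up for illustration:

```python
import pandas as pd

col = pd.DataFrame({
    "color": ["white", "red", "green", "red", "green"],
    "object": ["pen", "pencil", "pencil", "ashtray", "pen"],
    "price1": [5.56, 4.20, 1.30, 0.56, 2.75],
})

# Group by color, then aggregate the price with the mean;
# the group keys become the index of the result
mean_by_color = col.groupby("color")["price1"].mean()

# as_index=False keeps color as an ordinary column instead
as_frame = col.groupby("color", as_index=False)["price1"].mean()
```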

5.10.3 Starbucks Retail Store Data

​​​​​​​

​​​​​​​

Grouping by multiple keys: suppose we group by province and city together.
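
A minimal sketch of grouping by two keys. The store table here is entirely hypothetical (it is not the Starbucks dataset), and the column names province, city and store are assumptions:

```python
import pandas as pd

stores = pd.DataFrame({
    "province": ["Guangdong", "Guangdong", "Guangdong", "Zhejiang"],
    "city": ["Shenzhen", "Shenzhen", "Guangzhou", "Hangzhou"],
    "store": ["s1", "s2", "s3", "s4"],
})

# Passing a list of keys groups by every (province, city) combination;
# the result is indexed by a MultiIndex
counts = stores.groupby(["province", "city"])["store"].count()
```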


Origin blog.csdn.net/m0_58086930/article/details/126097452