A Comprehensive Python Data Analysis Process

Table of contents

1. Data reading

1.1. Call

1.2. Common file reading

2. Data preprocessing

2.1. Missing value processing

2.2. Duplicate value processing

2.3. Feature encoding

2.4. Dimensionality reduction (optional)

2.5. Feature Scaling

2.6. Other key functions

3. Building a model

3.1. Linear models

3.1.1. Least squares method

3.1.2. Ridge Regression

3.1.3. Lasso

3.1.4. Bayesian regression (BayesianRidge)

3.1.5. Logistic regression

3.2. Support Vector Machine (SVM)

3.3. Stochastic Gradient Descent (SGD)

3.4. Nearest Neighbor

3.5. Gaussian Naive Bayes

3.6. Decision tree

3.7. Ensemble methods

3.7.1. Bagging

3.7.2. Random Forest

3.7.3. Extreme Random Trees


1. Data reading

1.1. Call

# Import the pandas library
import pandas as pd

1.2. Common file reading

CSV file: a tabular file with feature values separated by commas.

df_origin=pd.read_csv(filepath_or_buffer)

General delimited text files: read_table reads delimiter-separated files (tab-separated by default).

df_origin=pd.read_table(filepath_or_buffer)

Pickle file

df_origin=pd.read_pickle(filepath_or_buffer)

Other formats: Excel, JSON, HTML, etc. each have a corresponding pandas read_* function.
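For example (a minimal sketch; the file names are placeholders, and read_excel additionally requires an Excel engine such as openpyxl to be installed):

df_excel = pd.read_excel('data.xlsx')    # Excel workbook
df_json = pd.read_json('data.json')      # JSON records
df_html = pd.read_html('page.html')[0]   # read_html returns a list of tables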

Reference: pandas.read_pickle — pandas 1.4.3 documentation

2. Data preprocessing

2.1. Missing value processing

# Inspect missing values
df_origin.isna()

# Drop rows with missing values
df_nona = df_origin.dropna()

# Fill missing values
# Replace missing values with a constant
df_nona = df_origin.fillna(0)

# Fill missing values from a dict; a, b, c are feature names,
# e.g. every missing value in feature a becomes 0
df_nona = df_origin.fillna({'a': 0, 'b': 1, 'c': 2})

# Fill missing values with the previous record's value in the same feature (forward fill)
df_nona = df_origin.fillna(method='ffill')

Reference: pandas.DataFrame.fillna — pandas 1.4.3 documentation

2.2. Duplicate value processing

# Inspect duplicates: scanning the table top-down, the first occurrence is marked False,
# and rows that repeat an earlier row are marked True
df_nona.duplicated()

# Drop duplicate rows
df_nodup = df_nona.drop_duplicates()

# Drop rows that are duplicated in specific columns
df_nodup = df_nona.drop_duplicates(subset=['Price'])

Note on duplicate handling: blindly dropping duplicates changes the data distribution. One option is to remove the duplicates but keep one copy of each record and give it a weight equal to its original number of occurrences. For example, if 18 duplicates of a record are removed, keep the remaining record with Weight = 19, as sketched below.
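A minimal sketch of this weighting idea, assuming all columns are used to define a duplicate (variable and column names are illustrative):

# Count occurrences of each unique row, keep one copy, and store the count as Weight
df_weighted = (df_nona.groupby(df_nona.columns.tolist())
                      .size()
                      .reset_index(name='Weight'))
# df_weighted['Weight'] can later be passed to a model's sample_weight argument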

Reference:

pandas.DataFrame.drop_duplicates — pandas 1.4.3 documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html

2.3. Feature encoding

Since most models can only work with numeric values, nominal (categorical) features (strings or other types) need to be converted into integer codes.

# OrdinalEncoder maps each feature value to an integer in the range 0 to n_categories - 1
from sklearn import preprocessing
enc = preprocessing.OrdinalEncoder()
enc.fit(df_nodup[['X']])  # X is the feature name; encoders expect a 2-D input
enc


Since OrdinalEncoder turns non-numerical data into ordered integers, it implies an ordering that can bias the model. Instead, one-of-K encoding, also known as one-hot or dummy encoding, can be used.

It turns a feature with n possible values into a binary vector of length n in which exactly one position is 1 and the rest are 0.

enc = preprocessing.OneHotEncoder()
enc.fit(df_nodup[['X']])  # X is the feature name; pass the relevant columns as a 2-D selection
enc

# Inspect the transformed values (illustration from the scikit-learn docs,
# assuming the encoder was fit on these three categorical features)
enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()
# array([[1., 0., 0., 1., 0., 1.],
#        [0., 1., 1., 0., 0., 1.]])

Ref: https://www.sklearncn.cn/40/

2.4. Dimensionality reduction (optional)

For 2-dimensional data, PCA constructs a new 2-dimensional basis such that the original data points fall along one of the new axes. The variance along the other axis is then (close to) zero, so that axis carries no information and the 2-dimensional vectors can be reduced to 1 dimension. The same idea applies in N dimensions.

Since dimensionality reduction combines multiple original features into new ones, it is not suitable when the goal is to interpret the relationship between individual features and the label (for example, with linear regression).

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca = pca.fit(df_nodup)
df_dr = pca.transform(df_nodup)  # obtain the transformed matrix

print(pca.explained_variance_ratio_)
print(pca.singular_values_)
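If the target dimensionality is not known in advance, n_components can instead be given as the fraction of variance to keep (an optional variant, not required by the workflow above):

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
df_dr = pca.fit_transform(df_nodup)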

2.5. Feature Scaling

Standardize all features so that each has mean 0 and variance 1 (or scale them to a fixed range such as [0, 1] or [-1, 1]).

# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Don’t cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # apply same transformation to test data

2.6. Other key functions

# Append rows; df2 is a DataFrame with the same columns
df_nodup.append(df2)

# Series are combined row-wise with concat
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])
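Note that DataFrame.append is deprecated in recent pandas releases; pd.concat also combines DataFrames row-wise (df2 is assumed to have the same columns):

# Equivalent row-wise combination of DataFrames
pd.concat([df_nodup, df2], ignore_index=True)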

3. Building a model

3.1. Linear models

3.1.1. Least squares method

Fits a linear model that minimizes the sum of squared errors.

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(X, Y)
reg.coef_
# array([0.5, 0.5])

3.1.2. Ridge Regression

A penalty is imposed on the magnitude of the coefficients to address the overfitting problem of ordinary least squares.

from sklearn import linear_model
reg = linear_model.Ridge(alpha=0.5)
reg.fit(X, Y)

3.1.3. Lasso

The difference from Ridge is that the penalty added to the ordinary loss function is the L1 norm rather than the L2 norm.

from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit(X, Y)
reg.predict(test_X)

3.1.4. Bayesian regression (BayesianRidge)

Bayesian regression includes the regularization parameters in the estimation procedure: rather than being set by hand, they are tuned to the data during fitting.

from sklearn import linear_model
reg = linear_model.BayesianRidge()
reg.fit(X, Y)

3.1.5. Logistic regression
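Logistic regression is a linear model for classification: it models class probabilities with the logistic (sigmoid) function. A minimal sketch (X, y and test_X are assumed to be the feature matrix, class labels, and new samples, as in the other examples):

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
clf.predict(test_X)        # predicted class labels
clf.predict_proba(test_X)  # predicted class probabilities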

3.2. Support Vector Machine (SVM)

Effective in high-dimensional spaces.

# Classification
from sklearn import svm
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

# Regression
from sklearn import svm
regr = svm.SVR()
regr.fit(X, y)

3.3. Stochastic Gradient Descent (SGD)

SGD is well suited to large-scale and sparse machine learning problems. It is strongly recommended to standardize all features to mean 0 and variance 1 (or scale them to a fixed range such as [0, 1] or [-1, 1]).

# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # Don’t cheat - fit only on training data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) # apply same transformation to test data
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="hinge", penalty="l2")
clf.fit(X, y)

3.4. Nearest Neighbor

from sklearn.neighbors import NearestNeighbors
import numpy as np
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
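Here indices[i] lists the n_neighbors training points closest to sample i (the point itself comes first), and distances[i] holds the corresponding distances. The fitted object can also be queried with new points (new_X is a placeholder for unseen samples):

# Find the nearest training samples for new, unseen points
distances_new, indices_new = nbrs.kneighbors(new_X)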

3.5. Gaussian Naive Bayes

The likelihood of each feature given the class is assumed to follow a Gaussian distribution.

from sklearn import datasets
iris = datasets.load_iris()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
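A quick sanity check on the fit (a follow-up to the example above, counting training samples that are misclassified):

print((iris.target != y_pred).sum())  # number of mislabeled points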

3.6. Decision tree

# Classification
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf.predict(X)
clf.predict_proba(X)  # probability of each class for every sample

# Regression
from sklearn import tree
regr = tree.DecisionTreeRegressor()
regr = regr.fit(X, y)
regr.predict(test_X)

3.7. Ensemble methods

3.7.1. Bagging

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(),
                            max_samples=0.5, max_features=0.5)
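The bagging ensemble is then trained and used like any other classifier (X and y as in the earlier examples):

bagging.fit(X, y)
bagging.predict(X)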

3.7.2. Random Forest

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)

3.7.3. Extreme Random Trees

At each node, candidate split thresholds are drawn at random for each feature, and the best of these random thresholds is chosen as the splitting rule.

Effect: this reduces the variance of the model a little, at the cost of a slight increase in bias.

from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
                  random_state=0)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
                             random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()  # 0.98...

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()  # 0.999...

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                           min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean() > 0.999  # True
