Extended Learning: More Data Analysis Libraries

01 Preface

Beyond the "big three" Python data analysis libraries mentioned above, you will also need to model, predict, evaluate, and visualize data. To round out your toolkit, it is worth learning a few more useful libraries. The following sections introduce them one by one.

02 Use of derivative libraries

Seaborn

Seaborn is a Python visualization library built on matplotlib that provides a high-level interface for drawing statistical graphics. It can be used to visualize the distribution of a data set, matrix data, regression models, and more.

Common functions:

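sns.histplot(): Draw a univariate distribution plot

import seaborn as sns
import numpy as np

# Generate a set of random data
x = np.random.normal(loc=0, scale=1, size=1000)

# Draw the univariate distribution: histogram with a KDE curve, plus rug ticks
sns.histplot(x, kde=True, bins=20)
sns.rugplot(x)

Explanation: In this example, we use numpy to generate a set of normally distributed data with a mean of 0 and a standard deviation of 1, and use the histplot() function to draw its univariate distribution. The older call sns.distplot(x, kde=True, rug=True, bins=20) produced a similar figure, but distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot().

The kde parameter specifies whether to overlay a kernel density estimate curve, and the bins parameter specifies the number of histogram bins. rugplot() draws a small tick on the axis for each observation, which is what the rug parameter of distplot() used to do.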

sns.jointplot(): plot bivariate distribution

import seaborn as sns
import numpy as np

# Generate two sets of correlated random data
x = np.random.normal(loc=0, scale=1, size=1000)
y = 0.5 * x + np.random.normal(loc=0, scale=0.5, size=1000)

# Draw the bivariate distribution plot
sns.jointplot(x=x, y=y, kind='scatter')

Explanation: In this example, we use numpy to generate two sets of correlated random data, and use the jointplot() function to plot their joint (bivariate) distribution together with the marginal distribution of each variable.

The kind parameter specifies the type of plot to draw; a scatter plot is selected here.

sns.pairplot(): plot a multivariate distribution

import seaborn as sns

# Load the iris dataset
iris = sns.load_dataset('iris')

# Draw the multivariate (pairwise) distribution plot
sns.pairplot(data=iris, hue='species')

Explanation: In this example, we load the iris data set that ships with seaborn and use the pairplot() function to draw the pairwise relationships between all of its numeric variables.

The hue parameter specifies the variable used to color the different categories.

sns.boxplot(): Draw a boxplot

import seaborn as sns

# Load the iris dataset
iris = sns.load_dataset('iris')

# Draw a box plot of petal length for each species
sns.boxplot(data=iris, x='species', y='petal_length')
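
A box plot summarizes each group by its quartiles: the box spans the first to the third quartile, the line inside marks the median, and points beyond the whiskers are drawn as outliers. The arithmetic behind a single box can be sketched with numpy. The sample values below are made up for illustration, and the 1.5 x IQR whisker rule is the seaborn/matplotlib default (whis=1.5).

```python
import numpy as np

# Made-up sample with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles define the box; the median is the line inside it
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points still inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is plotted as an individual outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)            # 3.25 5.5 7.75
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers)                  # [50]
```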

sns.heatmap(): Draw a heat map

A heatmap is a two-dimensional graph that uses color to represent the relative size of each value in a matrix. In data analysis and visualization, heatmaps are often used to explore correlations between variables or to visualize matrix data.

Here is an example of drawing a heatmap using the Seaborn library:

import seaborn as sns
import numpy as np

# Create a 3x3 matrix of random values
data = np.random.randn(3, 3)

# Draw the heatmap
sns.heatmap(data, annot=True, cmap='coolwarm')

Explanation: In this example, we use NumPy to generate a 3x3 matrix, and use Seaborn's heatmap() function to draw a heatmap of that matrix.

The annot=True parameter displays the value of each cell on the heatmap, and the cmap='coolwarm' parameter specifies the color map.
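
The correlation use case mentioned earlier can be sketched in the same way. The column names a, b, and c and the synthetic data are hypothetical, chosen only so that one pair is strongly correlated and another weakly anti-correlated. Fixing vmin=-1 and vmax=1 is a design choice here so that the color scale is centered on zero.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for this sketch
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),  # strongly positively related to a
    "c": -a + rng.normal(scale=2.0, size=200),     # weakly negatively related to a
})

# Pairwise Pearson correlation matrix (3x3, ones on the diagonal)
corr = df.corr()

# Draw the matrix as a heatmap with the color scale pinned to [-1, 1]
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```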

Besides random data, we can also draw heatmaps from real data sets. For example, the flights data set that ships with Seaborn can be used to draw a heatmap of passenger counts by month and year:


import seaborn as sns

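# Load the dataset
flights = sns.load_dataset('flights')

# Reshape the dataset into matrix form
# (keyword arguments are required by pivot() in pandas 2.0 and later)
data = flights.pivot(index='month', columns='year', values='passengers')

# Draw the heatmap
sns.heatmap(data, cmap='YlGnBu')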

Explanation: In this example, we load the flights data set that ships with Seaborn and use the pivot() function to reshape it into matrix form, where each row is a month, each column is a year, and each cell holds the number of passengers for that month and year. Then, we draw a heatmap of this matrix using the heatmap() function and specify the color map with the cmap='YlGnBu' parameter.
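
To make the reshaping step concrete, here is a minimal sketch of pivot() on a tiny hand-made table. The passenger numbers are invented for illustration and are not taken from the flights data set.

```python
import pandas as pd

# Long format: one row per (year, month) observation; the numbers are made up
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 20, 30, 40],
})

# Wide (matrix) format: months become rows, years become columns
wide = long_df.pivot(index="month", columns="year", values="passengers")
print(wide)

# A single cell is addressed by (row label, column label)
print(wide.loc["Jan", 1950])
```

Note that plain string month labels are sorted alphabetically in the result. The flights data set avoids this because its month column is an ordered categorical.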

Scikit-learn

Scikit-learn is a Python library for machine learning that includes various classification, regression, and clustering algorithms. It provides a simple and consistent interface that makes training and evaluating models very easy.

Common functions:

sklearn.model_selection.train_test_split(): split the data set into training set and test set

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation: In this example, we use sklearn's built-in iris data set and use the train_test_split() function to split it into a training set and a test set.

The test_size parameter specifies the proportion of the data held out as the test set, and the random_state parameter sets the seed of the random number generator so that the split is reproducible.
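
A quick way to check what the split produced is to look at the resulting shapes. The sketch below also passes stratify=y, which is an addition not used in the example above; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 balanced classes

# stratify=y keeps the 1:1:1 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print(np.bincount(y_test))          # [15 15 15]
```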

sklearn.linear_model.LinearRegression(): Linear regression model

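from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load the dataset (load_boston() was removed in scikit-learn 1.2,
# so the bundled diabetes dataset is used here instead)
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Train the linear regression model
model = LinearRegression()
model.fit(X, y)

# Print the model parameters
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)

Explanation: In this example, we use one of sklearn's built-in regression data sets and the LinearRegression class to train a linear regression model. Then, we output the intercept and coefficients of the model. The Boston house price data set used in older tutorials is no longer available, because load_boston() was removed in scikit-learn 1.2.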
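
Since the goal is to evaluate as well as fit, a hedged sketch of scoring a linear regression on held-out data follows. It uses the bundled diabetes data set as a stand-in (load_boston() is not available in scikit-learn 1.2 and later), and the 70/30 split and seed are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 30% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Mean squared error (lower is better) and R^2 (closer to 1 is better)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('MSE:', mse)
print('R^2:', r2)
```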

sklearn.tree.DecisionTreeClassifier(): Decision tree classification model

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train the decision tree classifier
model = DecisionTreeClassifier()
model.fit(X, y)

# Print the model's accuracy on the training set
print('Accuracy:', model.score(X, y))

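Explanation: In this example, we use sklearn's built-in iris data set and the DecisionTreeClassifier class to train a decision tree classifier. Then, we output the accuracy of the model on the training set. Because the model is scored on the same data it was trained on, this number is an optimistic estimate of how well it will do on new data.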
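
Training-set accuracy tends to flatter an unpruned decision tree, which can memorize its training data. A minimal sketch of a fairer check, assuming a 70/30 split with an arbitrary seed, compares the two scores:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training part only
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# score() returns mean accuracy for classifiers
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print('Training accuracy:', train_acc)
print('Test accuracy:', test_acc)
```

The test accuracy is the number to report. A large gap between the two scores usually points to overfitting.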

sklearn.cluster.KMeans(): K-Means clustering model

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate a set of random data
X, y = make_blobs(n_samples=1000, centers=4, random_state=42)

# Cluster the data with the K-means algorithm
model = KMeans(n_clusters=4)
model.fit(X)

# Print the clustering result
print('Cluster labels:', model.labels_)

Explanation: In this example, we use sklearn's built-in make_blobs() function to generate a set of random data, and use the KMeans() function to perform K-means clustering. Then, we output the clustering results.
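
Beyond the labels, a fitted KMeans model exposes the cluster centers and the inertia (the within-cluster sum of squared distances), and it can assign new points to the nearest center. The sketch below sets n_init and random_state explicitly for reproducibility; both values, and the two new points, are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 1000 two-dimensional points drawn around 4 centers
X, y = make_blobs(n_samples=1000, centers=4, random_state=42)

model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

print(model.cluster_centers_.shape)  # (4, 2): one 2-D center per cluster
print('Inertia:', model.inertia_)    # within-cluster sum of squared distances

# Assign previously unseen points to their nearest cluster center
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
new_labels = model.predict(new_points)
print('Labels for new points:', new_labels)
```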

sklearn.metrics.accuracy_score(): Calculate the classification accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train the decision tree classifier
model = DecisionTreeClassifier()
model.fit(X, y)

# Predict
y_pred = model.predict(X)

# Print the accuracy
print('Accuracy:', accuracy_score(y, y_pred))

Explanation: In this example, we use sklearn's built-in iris data set and the DecisionTreeClassifier class to train a decision tree classifier. After making predictions, we use accuracy_score() to output the accuracy of the model on the training set.
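
Accuracy is simply the fraction of predictions that match the true labels. A small worked example, with made-up labels, shows that accuracy_score() agrees with the manual calculation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

manual = np.mean(y_true == y_pred)        # correct predictions / all predictions
library = accuracy_score(y_true, y_pred)  # same quantity computed by scikit-learn

print(manual, library)  # 0.8 0.8
```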

Statsmodels

Statsmodels is a Python library for statistical modeling that provides a variety of statistical models and data exploration tools. Statsmodels can be used for regression analysis, time series analysis, non-parametric estimation, etc.

Common functions (with statsmodels.api imported as sm):

sm.OLS(): least squares linear regression
sm.Logit(): logistic regression model
sm.GLM(): generalized linear model
sm.tsa.ARIMA(): autoregressive integrated moving average model
sm.stats.anova_lm(): ANOVA
sm.graphics.plot_regress_exog(): Draw linear regression fitting graph
sm.qqplot(): Draw QQ graph
sm.tsa.seasonal_decompose(): Seasonal decomposition
sm.stats.ttest_ind(): Independent sample t test
scipy.stats.ttest_rel(): paired sample t test (provided by SciPy; sm.stats has no ttest_rel())
sm.stats.proportions_ztest(): binomial distribution proportion test
scipy.stats.kendalltau(): Kendall Tau correlation coefficient (provided by SciPy, not by sm.nonparametric)
scipy.stats.kruskal(): Kruskal-Wallis rank sum test (provided by SciPy, not by sm.nonparametric)
sm.regression.mixed_linear_model.MixedLM(): Mixed linear model
sm.stats.DescrStatsW(): Descriptive statistical analysis

Because this kind of statistical regression analysis is not used very often here, I will not give detailed examples for each function.

Origin blog.csdn.net/qq_54015136/article/details/129527670