[Machine learning notes] Analysis in Python: Red wine quality analysis (data exploration)

Data set: winemag-data_first150k.csv

First import the data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols, glm
# Read the data set into a pandas DataFrame
wine = pd.read_csv('C:\\Machine-Learning-with-Python-master\\data\\winemag-data_first150k.csv', sep=',', header=0)
wine.columns = wine.columns.str.replace(' ', '_')
print(wine.head())

 

View data set row and column information

# Check the number of rows and columns in the data set
print("The data set has {} rows and {} columns".format(wine.shape[0], wine.shape[1]))
wine.columns

The columns have the following meanings:

Column name    Meaning
country        Country the wine comes from
description    Description of the wine's taste, smell, appearance, feel, etc.
designation    Vineyard within the winery where the grapes for the wine come from
points         Score Wine Enthusiast gave the wine on a scale of 1-100 (although they say they only post reviews for wines scoring >= 80)
price          Cost of a bottle of the wine
province       Province or state the wine comes from
region_1       Wine-growing area within the province or state
region_2       More specific region within the wine-growing area (may be blank)
variety        Type of grapes used to make the wine
winery         Winery that produced the wine

Display records in data set

 

Check each column of the data set for null values
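The original output for this step is not reproduced here. As a minimal sketch, one common way to count missing values per column is shown below on a tiny made-up frame (the rows are hypothetical and are not taken from winemag-data_first150k.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; the values are made up
wine = pd.DataFrame({
    'country': ['US', 'France', None, 'Italy'],
    'price': [20.0, np.nan, 15.0, np.nan],
    'points': [87, 90, 85, 88],
})

# isnull() marks missing cells; sum() counts them per column
null_counts = wine.isnull().sum()
print(null_counts)
```

On the real data, the same call on wine should show which columns contain missing entries.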

 

Descriptive statistics of each column

 

Focus on the descriptive statistics for price
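The code for these two steps is missing from this copy of the article. A likely approach, sketched here on a small made-up frame (hypothetical values), is describe() for all numeric columns and then for price alone:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; the values are made up
wine = pd.DataFrame({
    'points': [85, 88, 90, 92],
    'price': [10.0, 20.0, np.nan, 30.0],
})

# Summary statistics for every numeric column
print(wine.describe())

# Summary statistics for price only; missing prices are excluded from the count
price_stats = wine['price'].describe()
print(price_stats)
```

Note that count reports only the non-missing prices, so it can be smaller than the number of rows.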

wine['province'].value_counts().head(10).plot.bar()

 

The figure above shows that California accounts for far more of the reviewed wines than any other province. A natural follow-up question is what percentage of all the wines come from California. This bar chart shows absolute counts, but it is more useful to know the relative proportion.
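A minimal sketch of how the relative proportion can be plotted: divide the counts by the total number of rows before drawing the bars. The frame below is a made-up stand-in for wine, and the name province_share is only illustrative:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; the values are made up
wine = pd.DataFrame({
    'province': ['California'] * 5 + ['Washington'] * 3 + ['Tuscany'] * 2
})

# Divide the counts by the number of rows to get each province's share
province_share = wine['province'].value_counts().head(10) / len(wine)
ax = province_share.plot.bar()
print(province_share)
```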

From the picture above, we can see that California wines account for almost one third of the wines reviewed by the wine magazine.
Bar charts are very flexible: the height can represent anything as long as it is a number, and each bar can represent anything as long as it is a category.
The province categories in the example above are nominal data (no inherent size or order). The other kind of categorical data is ordinal data, whose categories have a meaningful order, such as the number of reviews at each rating score in this data set.
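As a sketch of the ordinal case, the number of reviews per score can be drawn with the bars sorted by score rather than by count. The scores below are made up, and points_counts is an illustrative name:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; the values are made up
wine = pd.DataFrame({'points': [80, 85, 85, 90, 90, 90, 100]})

# sort_index() orders the bars by score, which keeps the ordinal order visible
points_counts = wine['points'].value_counts().sort_index()
ax = points_counts.plot.bar()
```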

As you can see from the picture above, every wine's overall score is between 80 and 100 points. That gives about 20 distinct values, which a bar chart can just about display. What if the scores ranged from 0 to 100? A bar chart could no longer show every category clearly. In that case, we need a line chart.
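To illustrate, here is a minimal sketch with randomly generated scores from 0 to 100 (purely synthetic data, not the wine ratings), where a line chart replaces the crowded bar chart:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic scores in the range 0-100; made up for illustration only
rng = np.random.default_rng(0)
wine = pd.DataFrame({'points': rng.integers(0, 101, size=1000)})

# One point per score value, connected by a line instead of one bar each
points_counts = wine['points'].value_counts().sort_index()
ax = points_counts.plot.line()
```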

Use area chart

When only one variable is plotted, the difference between an area chart and a line chart is mainly visual. In this case, they can be used interchangeably.
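A sketch of the area version, again on made-up scores; the only change from a line chart is calling plot.area() instead of plot.line():

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; the values are made up
wine = pd.DataFrame({'points': [80, 85, 85, 90, 90, 90, 95, 95, 100]})

points_counts = wine['points'].value_counts().sort_index()

# Same data as a line chart, with the region under the line filled in
ax = points_counts.plot.area()
```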

 

Each sample has a quality score ranging from 1 to 10, and the results of several physical and chemical tests

 

A histogram is drawn as a series of rectangles of equal width and varying height. The width represents the interval of the data and the height represents the count or frequency within that interval.
View wine price distribution
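The plotting code for this step is not included in this copy. A minimal sketch, assuming the histogram was drawn with pandas' plot.hist() on the price column (the prices below are made up, with one deliberate outlier):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical prices; the single very expensive bottle stretches the x-axis
wine = pd.DataFrame({'price': [12, 15, 18, 20, 20, 25, 30, 45, 60, 500]})

# plot.hist() splits the price range into 10 equal-width bins by default
ax = wine['price'].plot.hist()
```

With a skewed column like this, almost every value falls into the first bin.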

 

 

The picture above shows that prices are mainly distributed between 0 and 200. Because the price distribution is heavily skewed, the overall price range is too large for the chart to show any detail.

len(wine[wine['price'] > 200])/len(wine)

 

The calculation shows that wines priced above 200 make up only about 0.005 (0.5%) of the data, which can be ignored. To deal with the skew, re-examine the price distribution for price < 200.
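A sketch of this filtering step on made-up prices (the real ratio quoted above comes from the full data set, not from this toy frame; share_above_200 and under_200 are illustrative names):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical prices with one outlier above 200
wine = pd.DataFrame({'price': [12, 15, 18, 20, 20, 25, 30, 45, 60, 500]})

# Share of wines priced above 200, computed as in the text
share_above_200 = len(wine[wine['price'] > 200]) / len(wine)

# Keep only wines under 200 and redraw the histogram
under_200 = wine[wine['price'] < 200]
ax = under_200['price'].plot.hist()
```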


A chart drawn with the default settings looks rather plain: there is no chart title, and the label text on the x-axis is a bit small. Let's modify it below.

 

After the adjustment, the chart is clearer than when we started and conveys the analysis results to the reader better. The parameters are explained one by one below.
1. Chart size: use the figsize=(width, height) parameter, for example figsize=(10, 5).
2. Color: use the color parameter, for example color='mediumvioletred'. For the available values, see the table linked at the end of the article.
3. Tick label text size: use the fontsize parameter, for example fontsize=16.
4. Title and title size: plot.bar(title='xxx') sets the title, but pandas provides no parameter for the title text size. Under the hood, the pandas visualization tools are built on matplotlib, so the size can be set with matplotlib's set_title function:

ax = xxx.plot.bar()
ax.set_title('title', fontsize=20)

The variable ax is a matplotlib Axes object (displayed as AxesSubplot in older matplotlib versions).
5. Remove the black border of the chart.
sns.despine(bottom=True, left=True)
This uses a new library, seaborn, which will be covered in detail later.
6. Display the chart completely; sometimes the chart labels get cut off.
plt.tight_layout() only takes the axis labels, tick labels and title into account.
7. Charts containing Chinese characters need a font that supports them. None are used in this article, so the setting is shown separately:

import matplotlib
font = {
    'family': 'SimHei'
}
matplotlib.rc('font', **font)
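Putting settings 1 to 6 from the list above together, a styled bar chart might look like the sketch below. The counts and the title text are made up for illustration; only the parameters come from the list:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical review counts per province; the values are made up
counts = pd.Series([5, 3, 2], index=['California', 'Washington', 'Tuscany'])

# 1-3: chart size, bar color and tick label size are passed to plot.bar()
ax = counts.plot.bar(figsize=(10, 5), color='mediumvioletred', fontsize=16)

# 4: the title size is set through matplotlib's set_title
ax.set_title('Wine reviews by province', fontsize=20)

# 5: remove the border lines (top and right are removed by default)
sns.despine(bottom=True, left=True)

# 6: adjust the layout so labels are not cut off
plt.tight_layout()
```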

A scatter plot places one variable on the horizontal axis and another on the vertical axis; the pattern formed by the points reflects the relationship between the two variables.
Let's take a look at the relationship between wine prices and ratings:

With all the points plotted, it is hard to see any relationship between wine prices and ratings. Because a scatter plot cannot effectively handle many points mapped to the same position, we sample the data and redraw the chart with only 100 points to show the relationship better.
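The sampling code is not shown in this copy. A minimal sketch, assuming the price < 100 filter used later in the article and pandas' sample() method, run on synthetic data (the generated prices and scores are made up):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Synthetic prices and scores, generated only for illustration
rng = np.random.default_rng(0)
n = 1000
price = rng.uniform(5, 150, n).round()
points = np.clip(np.round(82 + price / 15 + rng.normal(0, 2, n)), 80, 100).astype(int)
wine = pd.DataFrame({'price': price, 'points': points})

# Draw only 100 randomly chosen wines to reduce overlapping points
sampled = wine[wine['price'] < 100].sample(n=100, random_state=0)
ax = sampled.plot.scatter(x='price', y='points')
```

Dropping the sample() call plots every point, which is the crowded version described above.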

 

As you can see from the picture above, wines with higher prices tend to receive higher ratings when they are reviewed. This also shows that the scatter plot is most effective for relatively small data sets and for variables with a large number of unique values.
To deal with the overplotting caused by repeated data points, besides sampling the data you can also use a hex plot.

A hex plot aggregates the points in space into hexagons, and then colors these hexagons based on the values inside each hexagon.

wine[wine['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)

When testing with price < 200, the points turned out to be concentrated in the 0-100 price range (that picture is not shown here), so the filter condition was adjusted to price < 100.

The picture above reveals information that the scatter plot did not: the wines reviewed by the wine magazine cluster at around 87.5 points and a price of about 20 US dollars.

A stacked chart is a type of chart that plots variables one on top of the other.
Recalling the univariate bar chart from the previous article, here you can simply use the stacked parameter to stack multiple bars.
The original tutorial introduces a new data set at this point. Here we instead process the source data as follows to count how many times each score was given to the wines of the top 5 wineries:

# Top 5 wineries by number of reviews
winery = wine['winery'].value_counts().head(5)
wine_counts = pd.DataFrame({'points': range(80, 101)})
for name in winery.index:
    winery_grouped = wine[wine['winery'] == name]
    points_series = winery_grouped['points'].value_counts().sort_index()
    df = pd.DataFrame({'points': points_series.index, name: list(points_series)})
    wine_counts = wine_counts.merge(df, on='points',how='left').fillna(0)
wine_counts.set_index('points', inplace=True)
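As an aside, the same kind of table can be built more compactly with pd.crosstab. This is an alternative sketch, not the code used in the article: the frame is made up, top_n is a hypothetical parameter (set to 2 here so the toy data suffices), and the columns come out in alphabetical order with integer counts rather than in frequency order with floats:

```python
import pandas as pd

# Hypothetical stand-in for the wine DataFrame; the values are made up
wine = pd.DataFrame({
    'winery': ['A', 'A', 'A', 'B', 'B', 'C'],
    'points': [85, 85, 90, 85, 88, 92],
})

top_n = 2  # the article uses the top 5 wineries
top_wineries = wine['winery'].value_counts().head(top_n).index
subset = wine[wine['winery'].isin(top_wineries)]

# Count reviews per (score, winery) pair, then add the scores that never occur
wine_counts = pd.crosstab(subset['points'], subset['winery'])
wine_counts = wine_counts.reindex(range(80, 101), fill_value=0)
print(wine_counts.loc[85])
```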

 

wine_counts.plot.bar(stacked=True)

Stacked bar charts share the advantages and disadvantages of univariate bar charts. They are most suitable for nominal data or for ordinal data with a small number of categories.
Another simple example is the stacked area chart.

wine_counts.plot.area()

 

Like the univariate area chart, the multivariate area chart is suitable for displaying nominal data or interval data.
Stacked charts are visually very attractive, but they have two main limitations.
The first limitation: the second variable of a stacked chart must have a very limited number of possible values. Eight is sometimes cited as the recommended upper limit. Many data set fields do not meet this criterion and require further data processing.
The second limitation: poor readability, which makes it difficult to read off specific values. For example, looking at the picture above, can you tell which winery has more reviews at a score of 87.5: Testarossa (orange), Williams (blue), or DFJ (green)? It is really hard to say!

Bivariate line chart

wine_counts.plot.line()

A line chart makes up for the poor readability of stacked charts. In this chart, we can easily answer the question from the previous example: at a score of 87.5, which winery has the most reviews? We can see that Columbia Crest is the highest.

 

Origin blog.csdn.net/seagal890/article/details/105319859