Practical Tips for Improving Exploratory Data Analysis!


Translator: Zhang Feng, member of Datawhale

A practical guide to making EDA simpler (and more beautiful)!

Original link: https://towardsdatascience.com/practical-tips-for-improving-exploratory-data-analysis-1c43b3484577

Introduction

Exploratory data analysis (EDA) is a necessary step before using any machine learning model. The EDA process demands focus and patience from data analysts and data scientists: it often takes a lot of hands-on work with one or more visualization libraries before the data yields meaningful insights.

In this article, I will share some tips, based on my personal experience, on how to simplify the EDA process and make it more convenient. In particular, I'll introduce three important tips I've learned while wrestling with EDA:

1. Use the non-trivial graph that best suits your task;

2. Make full use of the functions of the visualization library;

3. Find faster ways to produce the same content.

Note: In this post, we will use wind energy data provided by Kaggle [2] to make infographics. Let's start!

Tip 1: Don't be afraid to use non-trivial graphs

I learned how to apply this technique while writing a research paper [1] on wind energy analysis and forecasting. While doing EDA for this project, I needed to create a summary matrix reflecting all the relationships between the wind energy parameters, in order to find out which parameters influence each other the most. The first idea that popped into my head was to build an "old-fashioned" correlation matrix, the kind I've seen in many data science and data analysis projects.

As we all know, a correlation matrix is used to quantify and summarize the linear relationships between variables. In the code snippet below, NumPy's corrcoef function is applied to the feature columns of the wind energy data. Seaborn's heatmap function then plots the resulting correlation matrix as a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# Build the correlation matrix and plot it as a heatmap
correlation_matrix = np.corrcoef(data[cols].values.T)
hm = sns.heatmap(correlation_matrix,
                 cbar=True, annot=True, square=True, fmt='.3f',
                 annot_kws={'size': 15},
                 cmap='Blues',
                 yticklabels=['P', 'Ws', 'Power_curve', 'Wa'],
                 xticklabels=['P', 'Ws', 'Power_curve', 'Wa'])

# Save the figure
plt.savefig('image.png', dpi=600, bbox_inches='tight')
plt.show()

Figure 1 Example of a correlation matrix

From the chart, we can conclude that there is a strong correlation between wind speed and active power. But I think many people will agree with me that this visualization is not an easy way to interpret the results, since all we have here are numbers.

A scatterplot matrix is a great alternative to a correlation matrix: it lets you visualize the pairwise relationships between the features of your dataset in one place. Seaborn's sns.pairplot builds one:
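As a rough illustration of what each cell of the correlation matrix measures, the sketch below computes the Pearson coefficient by hand in plain Python. The function name pearson and the sample values are made up for illustration and do not come from the wind data set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical wind speeds and two made-up responses
speeds = [3, 5, 7, 9, 11]
linear = [2 * s + 1 for s in speeds]   # exactly linear in speed
cubic = [s ** 3 for s in speeds]       # monotonic but nonlinear

r_linear = pearson(speeds, linear)     # 1.0 for an exact linear relation
r_cubic = pearson(speeds, cubic)       # about 0.95, below 1
print(round(r_linear, 3), round(r_cubic, 3))
```

The second value stays below 1 even though the cubic response is fully determined by the speed. The coefficient only captures linear dependence, which is one more reason to look at the scatterplots and not only at the numbers.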

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

# Build the scatterplot matrix
sns.pairplot(data[cols], height=2.5)
plt.tight_layout()

# Save the figure
plt.savefig('image2.png', dpi=600, bbox_inches='tight')
plt.show()

Figure 2 Example of scatterplot matrix

By looking at the scatterplot matrix, we can quickly see how the data is distributed and whether it contains outliers. However, the main disadvantage of this type of chart is redundancy: because the data is plotted pairwise, every pair of features appears twice, once on each side of the diagonal.

In the end, I decided to combine the two charts into one: the lower-left part contains scatterplots of the selected parameters, and the upper-right part contains bubbles of different sizes and colors, where a larger circle means a stronger linear dependence between the two parameters. The diagonal of the matrix shows the distribution of each feature: a narrow peak means that the parameter does not vary much, while other features vary more widely.

The code to build the summary matrix is as follows. The grid consists of three parts: fig.map_lower, fig.map_diag and fig.map_upper:
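The amount of repetition is easy to count. The sketch below uses only the standard library and the four short column names from the snippets above to compare the number of off-diagonal panels with the number of distinct feature pairs.

```python
from itertools import combinations, permutations

cols = ['P', 'Ws', 'Power_curve', 'Wa']

# Every ordered (row, column) pair with different features is one panel
off_diagonal_panels = list(permutations(cols, 2))

# Unordered pairs: each one is drawn twice, mirrored across the diagonal
unique_pairs = list(combinations(cols, 2))

print(len(off_diagonal_panels))  # 12 panels in the 4 x 4 grid
print(len(unique_pairs))         # 6 distinct pairs
```

Half of the off-diagonal panels therefore carry no new information. If only the lower triangle is wanted, sns.pairplot also accepts corner=True.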

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
cols = ['P', 'Ws', 'Power_curve', 'Wa']

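# Build the summary matrix
def correlation_dots(*args, **kwargs):
    # Pearson correlation between the two columns passed in by PairGrid
    corr_r = args[0].corr(args[1], 'pearson')
    corr_text = f'{corr_r:.2f}'
    ax = plt.gca()
    ax.set_axis_off()
    # Bubble size follows |r|; bubble color follows the signed r
    marker_size = abs(corr_r) * 3000
    ax.scatter([.5], [.5], marker_size,
               [corr_r], alpha=0.5,
               cmap='Blues',
               vmin=-1, vmax=1,
               transform=ax.transAxes)
    # Print the coefficient inside the bubble, with a font size that grows with |r|
    font_size = abs(corr_r) * 40 + 5
    ax.annotate(corr_text, (.5, .5), xycoords='axes fraction',
                ha='center', va='center', fontsize=font_size)

sns.set(style='white', font_scale=1.6)
fig = sns.PairGrid(data[cols], aspect=1.4, diag_sharey=False)
fig.map_lower(sns.regplot)
fig.map_diag(sns.histplot)
fig.map_upper(correlation_dots)

# Save the figure
plt.savefig('image3.jpg', dpi=600, bbox_inches='tight')
plt.show()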

Figure 3 Example of a summary matrix

The summary matrix combines the advantages of the two charts studied earlier: its lower-left section mimics a scatterplot matrix, and its upper-right section graphically reflects the numerical results of the correlation matrix.
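To get a feel for how the upper-right bubbles scale, the sizing arithmetic from correlation_dots can be tried on a few coefficients. This is a small standalone sketch: the helper name bubble_style and the sample coefficients are hypothetical, while the two formulas are the ones used in the code above.

```python
def bubble_style(corr_r):
    """Return (marker_size, font_size) for a correlation coefficient."""
    marker_size = abs(corr_r) * 3000   # area grows with the strength of the link
    font_size = abs(corr_r) * 40 + 5   # never smaller than 5 points
    return marker_size, font_size

# Made-up coefficients, from strongly negative to strongly positive
for r in (-0.9, -0.5, 0.1, 0.95):
    size, font = bubble_style(r)
    print(f'r={r:+.2f}  marker_size={size:.0f}  font_size={font:.0f}')
```

Both sizes depend on the absolute value only, so a strong negative correlation gets a bubble as large as a strong positive one. The sign is carried by the color, which is mapped from the signed coefficient between vmin=-1 and vmax=1.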

Tip 2: Get the most out of your visualization library's capabilities

From time to time I need to present EDA results to colleagues and customers, so visualization is an important aid for me in this task. I always try to add elements such as arrows and annotations to a chart to make it more attractive and readable.

Let's return to the EDA example for the wind energy project discussed above. When it comes to wind energy, one of the most important parameters is the power curve. A power curve for a wind turbine (or an entire wind farm) is a graph showing how much electricity is produced at different wind speeds. Turbines do not operate at low wind speeds: they start up at the cut-in speed, which is usually between 2.5 and 5 m/s. The turbine reaches rated power when the wind speed is between 12 and 15 m/s. Finally, each turbine has an upper wind speed limit, the cut-out speed, up to which it can operate safely. Once this limit is reached, the turbine stops generating electricity until the wind speed drops back into the operating range.

The data set includes theoretical power curves (typical curves provided by the manufacturer, without any outliers) and actual curves (obtained by plotting measured power against wind speed). The latter usually contain many points outside the ideal theoretical shape, which can be caused by turbine failures, SCADA measurement errors, or unplanned maintenance.

Now, we'll create a figure that shows both types of power curves. First, without any additions other than the legend:
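The three regions described above can be written down as a small piecewise function. This is only a sketch under stated assumptions: the function name ideal_power, the default speeds, the 3600 kW rated power and the cubic ramp between cut-in and rated speed are all illustrative choices, not values taken from the manufacturer's curve in the data set.

```python
def ideal_power(wind_speed, cut_in=3.0, rated_speed=12.5,
                cut_out=24.5, rated_power=3600.0):
    """Idealized turbine output in kW for a wind speed in m/s."""
    # Below cut-in or above cut-out the turbine produces nothing
    if wind_speed < cut_in or wind_speed > cut_out:
        return 0.0
    # Between rated and cut-out speed the output is capped at rated power
    if wind_speed >= rated_speed:
        return rated_power
    # Assumed cubic ramp from 0 at cut-in to rated power at rated speed
    ramp = (wind_speed ** 3 - cut_in ** 3) / (rated_speed ** 3 - cut_in ** 3)
    return rated_power * ramp

for v in (2.0, 8.0, 15.0, 30.0):
    print(v, round(ideal_power(v), 1))
```

Real power curves are smoother around the rated speed, so this shape is only meant to make the cut-in, rated and cut-out regions concrete.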
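One simple way to pick out such points, sketched here in plain Python, is to flag every record whose measured power is far from the theoretical value at the same wind speed. The function name flag_outliers, the 500 kW tolerance and the three sample records are hypothetical and serve only to illustrate the idea.

```python
def flag_outliers(records, tolerance_kw=500.0):
    """Keep records whose actual power deviates too much from the theoretical one.

    Each record is a tuple (wind_speed, actual_kw, theoretical_kw).
    """
    return [rec for rec in records
            if abs(rec[1] - rec[2]) > tolerance_kw]

# Made-up records: two close to the curve, one far below it
records = [
    (5.0, 400.0, 420.0),
    (12.0, 3300.0, 3400.0),
    (18.4, 1805.0, 3600.0),
]

outliers = flag_outliers(records)
print(outliers)
```

A fixed tolerance is the crudest possible rule. In practice the threshold would have to be tuned to the turbine and to the noise level of the measurements.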

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)

# Build the plot
plt.scatter(data['Ws'], data['P'], color='steelblue', marker='+', label='actual')
plt.scatter(data['Ws'], data['Power_curve'], color='black', label='theoretical')
plt.xlabel('Wind Speed')
plt.ylabel('Power')
plt.legend(loc='best')

# Save the figure
plt.savefig('image4.png', dpi=600, bbox_inches='tight')
plt.show()

Figure 4 A "silent" power curve

As you can see, the chart needs explaining, since it contains no other details.

But what if we added lines to highlight the three main areas of the graph, designating cut-in speed, rated speed, and cut-out speed, and added a note with an arrow to show one of the outliers?

Let's see how the graph looks in this case:

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)

# Build the plot
plt.scatter(data['Ws'], data['P'], color='steelblue', marker='+', label='actual')
plt.scatter(data['Ws'], data['Power_curve'], color='black', label='theoretical')

# Add vertical lines, text labels and an arrow
plt.vlines(x=3.05, ymin=10, ymax=350, lw=3, color='black')
plt.text(1.1, 355, r"cut-in", fontsize=15)
plt.vlines(x=12.5, ymin=3000, ymax=3500, lw=3, color='black')
plt.text(13.5, 2850, r"nominal", fontsize=15)
plt.vlines(x=24.5, ymin=3080, ymax=3550, lw=3, color='black')
plt.text(21.5, 2900, r"cut-out", fontsize=15)
plt.annotate('outlier!', xy=(18.4,1805), xytext=(21.5,2050),
            arrowprops={'color':'red'})

plt.xlabel('Wind Speed')
plt.ylabel('Power')
plt.legend(loc='best')

# Save the figure
plt.savefig('image4_2.png', dpi=600, bbox_inches='tight')
plt.show()

Figure 5 A "talkative" power curve

Tip 3: Always find a faster way to get the same result

When analyzing wind energy data, we often want a comprehensive picture of the wind energy potential. So, in addition to how wind power changes over time, we need a chart showing how wind speed varies with wind direction.

To plot how wind power changes over time, the following code can be used:

import pandas as pd
import matplotlib.pyplot as plt

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)

# Resample the 10-minute data to hourly values
data['Date/Time'] = pd.to_datetime(data['Date/Time'])
fig = plt.figure(figsize=(10,8))
group_data = (data.set_index('Date/Time')).resample('H')['P'].sum()

# Plot wind power over time
group_data.plot(kind='line')
plt.ylabel('Power')
plt.xlabel('Date/Time')
plt.title('Power generation (resampled to 1 hour)')

# Save the figure
plt.savefig('wind_power.png', dpi=600, bbox_inches='tight')
plt.show()

The resulting plot is shown below:


Figure 6 Dynamic changes of wind energy

As one might notice, the dynamic profile of wind energy has a rather complex irregular shape.
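It is worth being clear about what the hourly resampling with sum() produces. The sketch below regroups 10-minute readings by hour using only the standard library. The constant 600 kW readings are made up, and treating each value as the average power over its 10-minute interval is an assumption about the data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Two hours of made-up 10-minute readings at a constant 600 kW
start = datetime(2018, 1, 1, 0, 0)
readings = [(start + timedelta(minutes=10 * i), 600.0) for i in range(12)]

# Group the readings by the hour they fall into
hourly = defaultdict(list)
for timestamp, power_kw in readings:
    hour = timestamp.replace(minute=0, second=0, microsecond=0)
    hourly[hour].append(power_kw)

summary = {
    hour: {
        'sum_kw': sum(values),                    # what resample(...).sum() gives
        'mean_kw': sum(values) / len(values),     # average power over the hour
        'energy_kwh': sum(values) * 10 / 60,      # each reading covers 10 minutes
    }
    for hour, values in hourly.items()
}

for hour, stats in summary.items():
    print(hour, stats)
```

Under that assumption, the summed series is six times the average hourly power, so its y-axis is not in kW. Using mean() or converting to kWh keeps the units easier to explain when the chart is shown to others.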

A wind rose, or polar rose plot, is a special chart used to represent the distribution of meteorological data, usually the distribution of wind speed by direction [3]. The windrose library, which is built on matplotlib, makes it easy to build such visualizations, for example:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from windrose import WindroseAxes

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
wd  = data['Wa']
ws = data['Ws']

# Plot the normalized wind rose as a stacked histogram
ax = WindroseAxes.from_ax()
ax.bar(wd, ws, normed=True, opening=0.8, edgecolor='white')
ax.set_legend()

# Save the figure
plt.savefig('windrose.png', dpi = 600, bbox_inches = 'tight')
plt.show()

Figure 7 Wind rose diagram based on available data

Looking at the wind rose, we can see that there are two main wind directions: northeast and southwest.

But how do we merge these two images into one? The most obvious way is to use add_subplot. But this is not an easy task due to the specifics of the windrose library:
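Behind the picture, a wind rose is essentially a histogram over direction sectors. A minimal sketch of that binning step is shown below. The choice of eight 45-degree sectors, the function name sector and the sample directions are assumptions made for illustration and are independent of how the windrose library bins the data internally.

```python
from collections import Counter

SECTORS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']

def sector(direction_deg):
    """Map a direction in degrees (0 = north, clockwise) to a compass sector."""
    # Shift by half a sector so that N covers 337.5 to 22.5 degrees
    index = int(((direction_deg + 22.5) % 360) // 45)
    return SECTORS[index]

# Made-up wind directions in degrees
directions = [40, 50, 60, 200, 210, 230, 350]
counts = Counter(sector(d) for d in directions)

for name in SECTORS:
    print(f'{name:>2}: {counts[name]}')
```

Counting the samples per sector in this way is what makes dominant directions, such as the northeast and southwest here, stand out as the longest petals.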

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from windrose import WindroseAxes

# Read the data
data = pd.read_csv('T1.csv')
print(data)

# Rename the columns to give them shorter titles
data.rename(columns={'LV ActivePower (kW)':'P',
                     'Wind Speed (m/s)':'Ws',
                     'Theoretical_Power_Curve (KWh)':'Power_curve',
                     'Wind Direction (°)': 'Wa'},inplace=True)
data['Date/Time'] = pd.to_datetime(data['Date/Time'])

fig = plt.figure(figsize=(10,8))

# Draw both plots as subplots of the same figure
ax1 = fig.add_subplot(211)
group_data = (data.set_index('Date/Time')).resample('H')['P'].sum()
group_data.plot(kind='line', ax=ax1)
ax1.set_ylabel('Power')
ax1.set_xlabel('Date/Time')
ax1.set_title('Power generation (resampled to 1 hour)')

# The windrose import above makes projection='windrose' available
ax2 = fig.add_subplot(212, projection='windrose')

wd = data['Wa']
ws = data['Ws']

ax2.bar(wd, ws, normed=True, opening=0.8, edgecolor='white')
ax2.set_legend()

# Save the figure
fig.savefig('image5.png', dpi=600, bbox_inches='tight')
plt.show()

In this case the result is this:


Figure 8 Single picture showing wind energy dynamics and wind rose diagram

The main disadvantage of this is that the two subplots are of different sizes, so there is a lot of white space around the wind rose plot.

For convenience, I suggest an alternative approach, using the Python Imaging Library (PIL) [4], which requires only a dozen lines of code:

import numpy as np
import PIL
from PIL import Image

# List the images to merge
list_im = ['wind_power.png','windrose.png']
imgs = [PIL.Image.open(i) for i in list_im]

# Find the smallest image; all images are resized to match it
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]

# For vertical stacking, use vstack (it needs a list, not a generator)
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
imgs_comb = PIL.Image.fromarray(imgs_comb)

# Save the merged image
imgs_comb.save('image5_2.png', dpi=(600,600))

Here the output looks a bit prettier: both images are the same size, because the code picks the smallest image and rescales the others to match:

A single image with wind dynamics and a wind rose obtained using PIL
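The size-matching step can be checked without opening any image. The sketch below reproduces the min_shape selection on plain (width, height) tuples. The helper name smallest_shape and the two sizes are made up for illustration.

```python
def smallest_shape(sizes):
    """Pick the (width, height) tuple with the smallest width + height."""
    return sorted((sum(size), size) for size in sizes)[0][1]

# Hypothetical pixel sizes of a wide line chart and a square wind rose
sizes = [(6000, 4800), (4800, 4800)]
target = smallest_shape(sizes)
print(target)

# Resizing to the target changes the aspect ratio of the wider image
for width, height in sizes:
    before = width / height
    after = target[0] / target[1]
    print(f'{width}x{height}: aspect {before:.2f} -> {after:.2f}')
```

Because every image is forced to the same width and height, an image with a different aspect ratio gets stretched or squeezed. If that matters, resizing to a common width while keeping each image's own proportions is a possible alternative for vertical stacking.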

By the way, PIL also lets us stack images horizontally. For example, we can compare the "silent" and "talkative" power curves side by side:

import numpy as np
import PIL
from PIL import Image

list_im = ['image4.png','image4_2.png']
imgs = [PIL.Image.open(i) for i in list_im]

# Pick the smallest image and resize the others to match it (any other shape could be used here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]

# For horizontal stacking, use hstack (it needs a list, not a generator)
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])

# Save the merged image
imgs_comb = PIL.Image.fromarray(imgs_comb)
imgs_comb.save('image4_merged.png', dpi=(600,600))

Figure 9 Comparing the two power curves

Conclusion

In this article, I shared three tips on how to make the EDA process easier. I hope these suggestions are useful and that you will start applying them to your own data tasks.

These techniques fit perfectly with the formula I try to apply when doing EDA: Customize → Itemize → Optimize.

You may ask, why does this even matter? It matters because:

  • It is very important to customize the chart according to the specific needs at hand. For example, instead of making a lot of infographics, think about how to combine several graphs into one, as we did when we made the summary matrix, which combines the best of both scatter and correlation plots.

  • All charts should speak for themselves. Therefore, you need to know how to itemize what is important in a chart, making it detailed and easy to read. Compare how big the difference is between the "silent" and "talkative" power curves.

  • Finally, every data scientist should learn how to optimize the EDA process to make work easier (and life easier). It is not always necessary to use the add_subplot option if two plots need to be merged into one.

What else? I can definitely say that EDA is a very creative and interesting step in the process of working with data (not to mention that it's super important).

Make your infographic sparkle like a diamond and don't forget to enjoy the process!

Reference list

  1. Data-driven applications of wind energy analysis and forecasting: The case of “La Haute Borne” wind farm. https://doi.org/10.1016/j.dche.2022.100048

  2. Wind energy data: https://www.kaggle.com/datasets/bhavikjikadara/wind-power-generated-data?resource=download

  3. Tutorial on the windrose library: https://windrose.readthedocs.io/en/latest/index.html

  4. PIL library: https://pillow.readthedocs.io/en/stable/index.html


Origin blog.csdn.net/Datawhale/article/details/132505341