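Preface

May and June: once again I'm stuck in a loop of recurring project bugs. To be fair, I keep getting better at analyzing logs, but my coding skills haven't improved much. After all, fixing bugs mostly means adding or modifying some business logic on top of existing code. I do learn a lot when I touch the native code, but those parts rarely break; the vast majority of problems are pits dug by logic that our own team added or changed. Filling them in is a long and frustrating process.
Time was limited, so this round I again picked only five short pieces to translate, trying to raise the quality while also freeing up some spare time to study other things.
Still, this round of translation was very rewarding for me. For two of the articles I took detailed notes and looked up a good deal of related material while reading, and I learned a lot.
One of them is The Art of Data Visualization. It only uses web performance analysis as an example and briefly introduces the scenarios that suit various common chart types, but that is exactly what I have needed to learn recently. For business reasons, I often have to integrate third-party algorithms and evaluate their performance. Our team, however, has always analyzed performance by printing logs, which is inconvenient and often misses abnormal changes. I needed some way to make our performance analysis more efficient, and this article pointed me in the right direction.
The other is an introduction to probabilistic data structures. I chose it because I spotted Bloom Filter in it, which reminded me of my university days writing web crawlers for a professor... I took the chance to revisit Bloom Filter and to learn about HyperLogLog and Count-Min Sketch. I have a feeling I will be using them before long.
As before, four of this issue's articles were accepted:
Speaking of copyright, I'm honestly not sure whether posting blogs in this side-by-side English-Chinese format infringes the copyright of the English originals. But without the side-by-side format, posting these blogs would be pointless. If it does infringe, I'll take them down later.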

Copyright

Translator: StoneDemo, a member of the 云+社区 (Cloud+ Community) translation group
Original article: The Art of Data Visualization
Original author: Kameerath Kareem

The Art of Data Visualization

题目：（数据可视化的艺术）

In my last blog, we looked at how data is aggregated based on the trend of data. In this article, we discuss how this data is represented to users in a more meaningful way.

在我上一篇博客中，我们研究了如何依据数据趋势聚合数据。在本文中，我们将讨论如何以更有意义的方式将这些数据呈现给用户。

Raw data from thousands of websites across different geographies, covering network components, page performance, availability, and page content metrics, is saved in huge databases. When this data is presented to people without being properly organized and categorized, it is difficult to read, analyze, and draw conclusions from.

跨越不同地域的数千个网站的原始数据保存在庞大的数据库中，这些原始数据即是网站正在测量的网络组件、页面性能、可用性，以及页面内容指标（Page content metrics）。当我们将这些数据呈现给他人而没有对其进行正确组织和分类时，这将导致难以阅读、分析和确定结论。

Presenting these data sets by organizing and categorizing in a graphical format makes it easier to achieve your goals. Here, we will look at different chart types that are used more frequently in performance analytics and that are used in various scenarios based on the data type.

通过图形方式来组织和分类这些数据集，并将其呈现，则可以更轻松地达成您的目的。接下来，我们将看到各种各样的图表类型，这些图表常常会在性能分析中使用到，并且在基于数据类型的各种场景中也适用。

Most commonly used chart types:

Bar chart.

Line chart.

Scatterplot chart.

Histogram.

Cumulative distribution chart.

Geo chart.

Bubble chart.

常用的图表类型有如下几种：

条形图（Bar chart）。
折线图（Line chart）。
散点图（Scatterplot chart）。
直方图（Histogram）。
累积分布图（Cumulative distribution chart）。
地理图（Geo chart）。
气泡图（Bubble chart）。

To determine the chart types that represent a set of data accurately, let’s look at some real-world scenarios in performance analytics.

为了确定能够准确表示一组数据的图表类型，我们来看看性能分析中的一些实际场景。

Use Case 1

（使用案例之其一）

Often when analyzing performance data, we come across situations where we need to rank the data based on certain qualitative data. For example, consider the qualitative data for performance of a website across different cities in the US; let’s try to determine which chart would help interpret the data in the best way.

通常在分析性能数据时，我们会遇到需要根据某些定性数据（Qualitative data）对数据进行排名的情况。例如，考虑美国不同城市网站性能的定性数据，让我们试试确定哪种图表有助于以最佳方式解释数据。

Bar charts display the data in the form of vertical bars. This works in scenarios where we need to compare different qualitative data that can be categorized. So, bar graphs are appropriate when we want to represent ranking data in performance analysis.

条形图以垂直线条形式展示数据。这适用于需要比较可分类的不同定性数据的情况。因此，当我们想要在性能分析中展示排名数据时，使用条形图是恰当的。

Catchpoint’s digital experience intelligence platform provides the option to generate bar graphs at distinct levels of breakdown, which is an effective way to represent qualitative data in ranked order.

Catchpoint 的数字体验智能平台提供了以不同级别的分解来生成条形图的选项，这是按排名顺序展示定性数据的一个有效方法。

[Figure: bar chart. X axis: city name; Y axis: page response time (ms).]

The above bar chart ranks web page load time across different cities in the US; by looking at this chart, it is easy to see which cities performed better than others.

上面的条形图展示了美国不同城市的网页加载时间排名。通过看这张图，我们很容易找出哪个城市比其他城市表现更好。
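
The ranking behind such a bar chart can be sketched in a few lines of Python. This is only an illustration: the city names and load times are made-up sample data, `rank_by_load_time` and `text_bar_chart` are hypothetical helper names, and the text bars stand in for a real chart.

```python
# Hypothetical sample data: average page load time in ms per city.
load_time_ms = {
    "New York": 2140,
    "Chicago": 1830,
    "Dallas": 2610,
    "Seattle": 1520,
    "Miami": 2980,
}


def rank_by_load_time(samples):
    """Return (city, ms) pairs sorted from fastest to slowest."""
    return sorted(samples.items(), key=lambda item: item[1])


def text_bar_chart(ranked, width=40):
    """Render one text bar per city, scaled so the slowest city fills `width`."""
    slowest = max(ms for _, ms in ranked)
    lines = []
    for city, ms in ranked:
        bar = "#" * max(1, round(ms / slowest * width))
        lines.append(f"{city:<10} {bar} {ms} ms")
    return lines


ranked = rank_by_load_time(load_time_ms)
for line in text_bar_chart(ranked):
    print(line)
```

Sorting before drawing is what turns a plain bar chart into a ranking: the fastest and slowest cities end up at the two ends of the chart.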

Use Case 2

（使用案例之其二）

Consider another scenario where performance data needs to be studied over a period to see if there is any change in performance.

考虑另一种情况：我们需要研究一段时间内的性能数据，以查看性能是否有任何变化。

A line graph can be used to represent the continuous distribution of a website's quantitative performance data over a specific period. This can help determine the range of time when the performance was affected. Catchpoint offers the flexibility to plot line graphs for 10 different metrics at once, providing in-depth detail for finding the root cause of an issue.

折线图可以用来表示特定时期内，网站的定量性能数据的连续分布。这有助于确定性能受到影响的时间范围。Catchpoint 可以灵活地一次绘制 10 种不同指标的折线图，提供深入的细节，以找出问题的根源。

[Figure: line chart of website performance metrics over time]

From the above line graph chart, we see that there was a change in the performance in the month of October as there was an increase in the total number of contents on the page.

从上面的折线图中，我们看到 10 月份的性能表现发生了变化，原因是页面上的内容总数有所增加。

So, a line graph can help you understand performance variations and analyze the root cause behind a change in performance over a period of time.

因此，折线图可帮助您了解性能变化，并且分析出一段时间内性能变化背后的根本原因。
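
As a rough sketch of the data behind such a line chart, the Python below averages samples per month and reports the month with the largest change from the previous one. The dates and load times are invented for illustration, and `monthly_means` and `largest_shift` are hypothetical helper names, not part of any Catchpoint API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (ISO date, page load time in seconds).
samples = [
    ("2017-08-03", 3.1), ("2017-08-17", 3.0),
    ("2017-09-05", 3.2), ("2017-09-21", 3.1),
    ("2017-10-04", 4.6), ("2017-10-19", 4.8),
    ("2017-11-02", 4.7), ("2017-11-16", 4.9),
]


def monthly_means(points):
    """Group samples by YYYY-MM and average each group, in calendar order."""
    by_month = defaultdict(list)
    for date, value in points:
        by_month[date[:7]].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}


def largest_shift(series):
    """Return (month, delta) for the biggest month-over-month change."""
    months = list(series)
    index = max(
        range(1, len(months)),
        key=lambda i: abs(series[months[i]] - series[months[i - 1]]),
    )
    return months[index], series[months[index]] - series[months[index - 1]]


series = monthly_means(samples)
month, delta = largest_shift(series)
print(f"Largest change in {month}: {delta:+.2f} s")
```

The monthly averages are the points a line chart would connect; the jump that the helper reports is the kind of step a reader would spot visually on the plotted line.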

Use Case 3

（使用案例之其三）

Error filtering is an important part of data analytics. It helps identify the different errors and the times at which they occurred, which in turn helps evaluate the availability of the website; hence, charts of errors over time are frequently used in performance analysis to monitor the availability of a website.

错误过滤（Error filtering）是数据分析的重要组成部分。它能帮助识别不同的错误以及发生错误的时间，从而评估网站的可用性。这也有助于评估网站的可用性，因此，此图表类型经常用于性能分析中，以监控网站的可用性。

Some solutions offer an effortless way to filter different error types in a specific time frame. A scatterplot chart is a straightforward way to visualize all these errors; it plots every test run that had a failure.

一些解决方案提供了一种轻松的方式来过滤特定时间范围内不同的错误类型。散点图是能直观地展示所有这些错误的方法，它绘制出了每次失败的测试运行。

[Figure: scatterplot of failed test runs over a time interval]

The above graph shows all the errors that occurred in a specified time interval for a web test; each data point can be analyzed further by clicking on it and viewing the waterfall data.

上图展示了指定时间间隔内，网络测试所出现的所有错误，人们可以通过单击数据点并查看瀑布式数据（Waterfall data）来进一步分析每个数据点。
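
To make the error-filtering step concrete, here is a minimal Python sketch that keeps only the failed runs as points to plot, counts them by error type, and derives availability. The `Run` record, its fields, and the sample runs are assumptions made for this example, not the data model of any real monitoring product.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """One hypothetical test run in the monitored time window."""
    minute: int            # minutes since the start of the window
    response_ms: int       # measured response time
    error: Optional[str]   # None when the run succeeded


runs = [
    Run(0, 820, None),
    Run(5, 790, None),
    Run(10, 0, "DNS failure"),
    Run(15, 845, None),
    Run(20, 30000, "Timeout"),
    Run(25, 810, None),
    Run(30, 30000, "Timeout"),
    Run(35, 805, None),
]


def failed_points(all_runs):
    """Return (minute, error) pairs: the points an error scatterplot would show."""
    return [(r.minute, r.error) for r in all_runs if r.error is not None]


def availability_percent(all_runs):
    """Share of runs that completed without an error, as a percentage."""
    ok = sum(1 for r in all_runs if r.error is None)
    return 100.0 * ok / len(all_runs)


points = failed_points(runs)
by_type = Counter(error for _, error in points)
print(points)
print(dict(by_type), f"availability: {availability_percent(runs):.1f}%")
```

Each `(minute, error)` pair corresponds to one dot on the scatterplot; the availability figure is simply the complement of how many dots there are.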

A scatterplot can also be used to visualize different patterns of data for an in-depth root cause analysis. For example, consider a scenario where the page performance is impacted by the high response time of a file. Analyzing the data points reveals that the file was served from different servers and some of these servers were sending the file uncompressed and these uncompressed files added latency to the page load.

散点图也可以用来展示不同的数据模式，以便深入分析根本原因。例如，考虑到页面性能受文件高响应时间影响的情况。分析数据点揭示了来自不同服务器的文件中，有一些服务器未经压缩便发送文件，这些未压缩的文件增加了页面加载的延迟。

The scatterplot graph below shows different bands of data for file 1 and file 2, each of which has an uncompressed and a compressed version served from different servers. The response time of the compressed file was much better than that of the larger uncompressed file, as it takes longer to deliver more bytes of data from the server to the client.

下面的散点图展示了文件 1 和文件 2 的不同数据段，每个数据段都具有从不同服务器提供的未压缩和压缩版本。压缩文件的响应时间比较大的未压缩文件要好得多，因为从服务器向客户端发送更多字节的数据需要更长的时间。

[Figure: scatterplot of response times for file 1 and file 2, compressed and uncompressed]

Use Case 4

（使用案例之其四）

In performance analytics, it is important to know the number of data points present in the threshold range of a performance metric. This would be useful to evaluate how many users were affected by low performance and how many experienced reliable performance.

在性能分析中，了解落在性能指标阈值范围内的数据点的数量是非常重要的。这有助于评估有多少用户受到了低性能的影响，又有多少用户体验到了可靠的性能。

Categorizing data into range buckets will help you understand how many data points were within the desired threshold range for that website. It can also help with further analysis for the data sets that had low performance.

将数据分类到范围桶（Range buckets）中可帮助您了解有多少数据点位于该网站所需的阈值范围内。它有助于进一步分析性能较低的数据集。

A histogram chart can be used to represent data distribution in range buckets. Each bucket depicts the performance metric range and the number of data sets which fall in that range.

直方图可以用来表示范围桶中的数据分布。每个桶描述了性能指标范围，以及数据集中落入该范围的数据的数量。

[Figure: histogram of web page load time]

The histogram graph above shows the number of data runs on the Y axis and the range of web page load time on the X axis. The second bar shows that there were 232 runs which had web page response time in the range of 5.3-6 seconds.

上面的直方图展示了 Y 轴上的数据运行次数以及 X 轴上的网页加载时间范围。第二栏显示有 232 次运行，其网页响应时间在 5.3-6 秒范围内。
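
The bucketing that a histogram performs can be sketched as follows. The load times, the bucket start, and the one-second bucket width are arbitrary choices for this example (the chart described above uses its own ranges), and `bucket_counts` is a hypothetical helper name.

```python
# Hypothetical page load times in seconds, one value per test run.
load_times = [4.2, 4.8, 5.1, 5.4, 5.9, 5.5, 6.3, 6.7, 7.9, 5.0]


def bucket_counts(values, start, width, buckets):
    """Count values per half-open range [lo, hi); values outside all ranges are skipped."""
    counts = [0] * buckets
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < buckets:
            counts[index] += 1
    return [
        ((start + i * width, start + (i + 1) * width), counts[i])
        for i in range(buckets)
    ]


histogram = bucket_counts(load_times, start=4.0, width=1.0, buckets=4)
for (low, high), count in histogram:
    print(f"{low:.1f}-{high:.1f} s  {'#' * count} {count} runs")
```

Each printed row plays the role of one bar: the range is the X-axis label and the run count is the bar height.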

The histogram gives a range bucket for looking at the number of affected users while a cumulative distribution graph gives the percent of users who crossed the threshold value for that performance metric.

直方图为查看受影响的用户数提供了一个范围桶，而累积分布图则给出了超过该性能指标阈值的用户数量的百分比。

A cumulative distribution graph is a commonly used chart type for expressing performance metrics as percentiles; it plots the percentage of users whose performance metric was greater or less than the threshold for the website.

累积分布图是一种常用的图表类型，它用百分位数（Percentile）表示性能指标。它绘制出了性能指标大于或小于网站阈值的用户的百分比。

The graph below shows the CDF graph for web page response time.

下图显示了网页响应时间的累积分布图。

[Figure: cumulative distribution (CDF) chart of web page response time]

From the CDF graph above, we see that at the 90th percentile, the web page response time of the website is 10.3 seconds. This means that 10% of the users during the time frame in which the data was collected had an overall web page load time of more than 10.3 seconds.

从上面的累积分布图中，我们看到在第 90 百分位，网站的网页响应时间为 10.3 秒。这意味着，在收集到的数据的时间范围内，网页加载时间超过了 10.3 秒的用户占比为 10%。
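
The percentile reading above can be reproduced with a small Python sketch. It assumes the nearest-rank definition of a percentile (other definitions interpolate between values), and the ten response times are invented so that the 90th percentile happens to be 10.3 seconds; `percentile` and `cdf` are hypothetical helper names.

```python
import math

# Hypothetical web page response times in seconds, one per user.
response_times = [2.1, 3.4, 3.9, 4.4, 5.0, 5.6, 6.2, 7.8, 10.3, 14.9]


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def cdf(values, threshold):
    """Fraction of values that are less than or equal to `threshold`."""
    return sum(1 for v in values if v <= threshold) / len(values)


p90 = percentile(response_times, 90)
share_above = sum(1 for v in response_times if v > p90) / len(response_times)
print(f"90th percentile: {p90} s, users above it: {share_above:.0%}")
```

Evaluating `cdf` at every observed value gives the points of the cumulative distribution curve; the percentile is the same curve read in the other direction, from a percentage back to a response time.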

Use Case 5

（使用案例之其五）

When a website is hosted at multiple locations, it becomes necessary to evaluate its performance from different geographic points. Catchpoint offers geo charts that display performance based on the data point’s magnitude from green for good to red for bad performance.

当网站托管在多个地点时，我们有必要从不同的地理位置评估其性能。Catchpoint 提供了展示性能的地理统计图，其中绿色到红色的变化对应着性能从好到坏的变化。

[Figure: geo chart of website performance by region, colored from green (good) to red (bad)]

The geo chart above shows how the performance of a single website varies across geographies. From the graph, we see that users in the USA and Europe experienced the best web page load time, whereas users in China experienced a higher web page load time.

上面的地理图展示了单个网站的性能在不同地域间的差异。从图中，我们看到美国和欧洲的用户体验到了最佳的网页加载时间，而中国用户则体验到更长的网页加载时间。

Use Case 6

（使用案例之其六）

The chart types we have discussed so far focus on a single metric that is selected for evaluating performance. What if we want to evaluate more than one metric, or the performance of a set of different websites?

迄今为止，我们所讨论的图表类型都关注于可被选择用于评估性能的单个度量标准。如果我们想评估一个以上的度量标准，或一组不同网站的性能，这时候该怎么办呢？

In such scenarios, Bubble charts are a good option to evaluate multiple performance metrics for different websites in a single view.

在这种情况下，对于在单个视图中评估不同网站的多个性能指标，气泡图是一个很好的选择。

[Figure: bubble chart of Document Complete and Webpage Response for 3 websites]

The above bubble chart gives the performance (Document Complete, Webpage Response) of 3 different websites under a single view.

上述气泡图在单个视图下给出了 3 个不同网站的性能数据（Document Complete、Webpage Response）。

Conclusion

（总结）

As the scenarios above show, visualization is a powerful way to express data in a more meaningful manner. It helps in finding the root cause of an issue and drawing conclusions to narrow the areas that require optimization.

从上述场景中我们可以看出，可视化是以更有意义的方式表达数据的强力方法。它有助于找出问题的根本原因并得出结论，从而缩小需要优化的区域。

The different chart types available in Catchpoint help you slice and dice the data in diverse ways for analysis. In addition to analyzing the data, it is also important to monitor the performance trend across different web pages or competitors' websites to know how the system behaves over time.

Catchpoint 中提供的不同图表类型可帮助您以不同的方式分割和切分数据，以对数据进行分析。除了分析数据以外，监测不同网页或竞争对手网站的性能趋势也很重要，以了解系统随时间的变化情况。

[大数据文章之其二] 数据可视化的艺术
