Is the GitHub website reliable? Google engineers teach you to use BigQuery to find out

Visualizing data organized in unusual ways can often reveal interesting things. Felipe Hoffa recently used BigQuery to analyze and visualize nearly eight years of Reddit's activity and uptime data, and the results were very interesting. As a site reliability engineer just starting out with mission control, I always ask myself, "If I were the reliability engineer for this service, what would I do to solve this problem?"

This time, Felipe's approach is applied to GitHub: from the perspective of a reliability engineer, we will analyze some of GitHub's historical data. First, we need to determine whether the GitHub event data in the GitHub Archive, analyzed with BigQuery, is sufficient to infer the health of the GitHub website. GitHub defines many different event types for developer activity, but for the analysis in this article we focus only on events that correspond to successful requests to GitHub.

We can use this query statement:

#StandardSQL
SELECT TIMESTAMP_TRUNC(created_at, MINUTE) minute, COUNT(*)
FROM `githubarchive.month.201607`
GROUP BY 1
ORDER BY 1
This query returns the number of events that occurred on GitHub in each minute of July 2016. The created_at field records a timestamp with microsecond precision, which the query truncates to the minute. Grouping the results by the truncated timestamp lets the COUNT aggregate function count the events in each minute. A simple visualization of the query results gives the following image:

[Figure: number of GitHub events per minute, July 2016]
The graph above contains some very interesting data points that correspond to exceptionally low event counts, but from the graph alone it is difficult to tell whether a given minute is "normal" or "abnormal". We can therefore build a histogram of the per-minute event counts from the query results to make that judgment clearer.
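To make the histogram step concrete, here is a minimal Python sketch that buckets per-minute event counts into fixed-width bins. The function name, the bin width of 50, and the sample counts are all assumptions for illustration; they are not taken from the original analysis.

```python
from collections import Counter


def histogram(counts, bin_width=50):
    """Bucket per-minute event counts into fixed-width bins.

    Returns a dict mapping each bin's lower bound to the number of
    minutes whose event count falls in [lower, lower + bin_width).
    """
    bins = Counter((c // bin_width) * bin_width for c in counts)
    return dict(sorted(bins.items()))


# Hypothetical per-minute event counts, standing in for the query results.
sample_counts = [12, 180, 420, 455, 460, 510, 530, 610]
bin_width = 50

# Print a rough text histogram: one '#' per minute in each bin.
for lower, n in histogram(sample_counts, bin_width).items():
    print(f"{lower:>4}-{lower + bin_width - 1:<4} {'#' * n}")
```

With real data, the bins at the low end of the range are the ones to look at: a small, isolated group of minutes with very few events is what separates "abnormal" minutes from the bulk of the distribution.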

[Figure: histogram of GitHub events per minute, July 2016]
The histogram clearly shows that, at least for July 2016, GitHub is in an abnormal state whenever the total number of events processed per minute falls below 200. We assume that a very low event count in a given minute is not caused by unusually few end-user requests, but by problems with the website's own servers. Under this premise, there are two possible explanations: the user requests did not reach the servers, or the servers could not respond to them successfully. This gives us a flag that approximates whether GitHub's state at a given time is "healthy" or "unhealthy".
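The health flag described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and sample rows are hypothetical, and treating minutes that are missing from the query results as zero events is an assumption added here, based on the fact that a GROUP BY query only returns minutes that have at least one event.

```python
from datetime import datetime, timedelta

# Events per minute; threshold taken from the July 2016 histogram.
UNHEALTHY_BELOW = 200


def flag_minutes(rows, start, end, threshold=UNHEALTHY_BELOW):
    """Label every minute in [start, end) as 'healthy' or 'unhealthy'.

    rows: iterable of (minute, event_count) pairs, shaped like the
    query results. Minutes with no row are counted as zero events
    (an assumption of this sketch).
    """
    counts = dict(rows)
    flags = {}
    minute = start
    while minute < end:
        n = counts.get(minute, 0)
        flags[minute] = "healthy" if n >= threshold else "unhealthy"
        minute += timedelta(minutes=1)
    return flags


# Hypothetical sample rows; the minute at t0 + 2 is deliberately missing.
t0 = datetime(2016, 7, 1, 0, 0)
rows = [
    (t0, 480),
    (t0 + timedelta(minutes=1), 150),
    (t0 + timedelta(minutes=3), 510),
]

for minute, state in flag_minutes(rows, t0, t0 + timedelta(minutes=4)).items():
    print(minute.isoformat(), state)
```

The threshold is kept as a parameter because, as noted above, the value of 200 is only known to hold for July 2016.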
Click to read the full text: http://click.aliyun.com/m/10252/
