Exploring Open Source: Get the Complete GitHub Community Dataset

This article covers how to obtain and organize GitHub's open datasets, shares some hands-on techniques for cleaning up the data, and looks at a few of the stories hiding behind the seemingly plain numbers.

Preface

Obtaining and analyzing GitHub's projects and developers is one way to take a deep, honest look at how the open source world is evolving.

From the GH Archive project, we can see that there are currently at least twenty or thirty open source projects around the world that analyze GitHub data. They offer different features along different dimensions, and some have even gone offline over the years, becoming part of the "living fossils of the Internet". If you are interested, you can read the "Other" section at the end of this article for more on that.

In July last year, I posted on Weibo that I wanted to build an interesting little tool with the following capabilities:

Compared with simple "counting", I hope to put together something with minimal resource dependencies that can quickly produce reports such as "similar people", "projects with similar potential", and "a quick judgement of how many stars this project has really earned".
Of course, it may be possible to do this with CH (ClickHouse) alone, but not yet, so introducing some other black technology is not ruled out.

The Weibo post from back then

For us, GitHub's roughly 2TB of open data (2011-2022) is excellent test data: it is real, and its size suits general-purpose data analysis. It can be used to test a production pipeline and to verify whether a data analysis tool's usability and architecture are sound, and whether its performance meets expectations.

Before writing the program, let's first understand how to obtain GitHub's public data at a certain moment.

Get public data for past moments on GitHub

The GH Archive project provides GitHub's public event data from February 12, 2011 to the present, archived at hourly granularity.

If you want to get the data at a certain moment of a certain day, such as "20:00 on February 2, 2020", you can use the following command:

wget https://data.gharchive.org/2020-02-02-20.json.gz

To get a complete day's data, you need to enumerate the 24 hours of the day, like this:

# wget https://data.gharchive.org/2020-02-02-{0..23}.json.gz

wget https://data.gharchive.org/2020-02-02-0.json.gz
wget https://data.gharchive.org/2020-02-02-1.json.gz
wget https://data.gharchive.org/2020-02-02-2.json.gz
...
wget https://data.gharchive.org/2020-02-02-23.json.gz
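Incidentally, the commented one-liner at the top of this block already works in bash or zsh: the shell expands {0..23} into 24 separate URLs before wget ever sees them. If you want to preview what will be requested without downloading anything (a small check of my own, not part of the original workflow):

echo https://data.gharchive.org/2020-02-02-{0..23}.json.gz | tr ' ' '\n'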

Similarly, if you want the data for a whole month, such as "January 2020", you need to enumerate all 24 hours of every date in that month:

# wget https://data.gharchive.org/2020-01-{01..31}-{0..23}.json.gz

wget https://data.gharchive.org/2020-01-01-0.json.gz
wget https://data.gharchive.org/2020-01-01-1.json.gz
wget https://data.gharchive.org/2020-01-01-2.json.gz
...
wget https://data.gharchive.org/2020-01-01-23.json.gz
wget https://data.gharchive.org/2020-01-02-0.json.gz
...
wget https://data.gharchive.org/2020-01-31-22.json.gz
wget https://data.gharchive.org/2020-01-31-23.json.gz

It is similar if you want to obtain data for one year or several years.

Since a complete analysis naturally calls for the full dataset, we need to enumerate every date: the full list comes to roughly 100,000 download addresses.

Rather than writing these out by hand, it is far more practical to generate the tens of thousands of addresses with a small program.

Generate download links for GitHub datasets in batches

Here, let's grab all the data from 2011, when GH Archive's records begin, through 2022. I haven't written Node.js in a long time, so let's implement the program in Node this time:

process.env.TZ = "Etc/Universal"; // UTC +00:00

const {
    
     writeFileSync } = require("fs");

const config = {
    
    
  timeStart: new Date("2011-02-12 00:00:00") - 0,
  timeEnd: new Date("2022-12-31 23:00:00") - 0,
  interval: 60 * 60 * 1000, // 1h
};

let time = config.timeStart;
let result = [];
let count = 0;

while (time <= config.timeEnd) {
    
    
  timestamp = new Date(time)
    .toISOString()
    .replace(/(\d{4}-\d{2}-\d{2})T(\d{2}):.+/, "$1-$2")
    .replace(/0(\d{1})$/, "$1");

  result.push(`https://data.gharchive.org/${
      
      timestamp}.json.gz`);
  time += config.interval;
  count++;
}

writeFileSync("./urls.txt", result.join("\n"));

console.log(`${
      
      count} datasets.`);

In the program above, we use a while loop to enumerate every hour from February 12, 2011 to December 31, 2022.

Save the program as generate.js and run it with node generate.js. When it finishes, the log tells us that more than 100,000 download addresses have been generated. The dataset we are about to download contains at least five billion public events from the GitHub platform.

# node generate.js
104184 datasets.

We can use the head command to preview the content of the generated file:

# head urls.txt

https://data.gharchive.org/2011-02-12-0.json.gz
https://data.gharchive.org/2011-02-12-1.json.gz
https://data.gharchive.org/2011-02-12-2.json.gz
...

Quickly download GitHub datasets

If you want to finish downloading 100,000 files hosted on overseas servers in the shortest possible time, there are a few reliable approaches you can use, alone or in combination:

  1. Prepare plenty of downstream bandwidth, and don't let other network activity on your LAN compete with the download. (I used a 1 Gbps home connection.)
  2. Download with multiple tasks in parallel rather than sequentially. (Out of consideration for the server, I only ran 10 tasks concurrently.)
  3. Use a domestic cloud server, with object storage and a CDN as a relay. (A cloud server's bandwidth is modest, but its connection quality is better than home broadband; combined with a CDN's large bandwidth, it makes a low-cost data retrieval solution.)

If you don't have any of the above, that's fine too; data preparation will just take a little longer.

For the actual download, I recommend aria2 over wget or curl. Compared with wget or curl, aria2 natively supports parallel downloads, which makes better use of bandwidth and hardware and shortens the download time. Support for parallel downloads in wget varies across versions and distributions, and while it is certainly possible to batch downloads with xargs or parallel plus bash, that is more hassle than it is worth, isn't it?
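That said, if you prefer to stay with stock tools, a rough equivalent with curl looks something like this (a minimal sketch of my own, not the command used in this article; the parallelism and retry settings are just reasonable defaults):

# Download 10 files at a time with curl; -C - resumes partial files,
# --retry smooths over transient network errors
xargs -P 10 -n 1 curl -sS -O -C - --retry 3 < urls.txt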

Installing aria2 is simple:

# macOS
brew install aria2

# ubuntu / debian
apt-get update && apt-get install -y aria2

Pointing aria2 at the list of datasets we prepared and starting 10 parallel download tasks is even simpler:

aria2c -x 10 -i urls.txt
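A couple of extra aria2c flags are worth knowing (my additions, not part of the command above): -c resumes interrupted downloads, -d sets the output directory (presumably how the per-year directories seen in the report below were produced), and -j caps how many files download at once, while -x controls connections per server.

# Resume-friendly parallel download into a target directory
aria2c -c -j 10 -x 4 -d ./data -i urls.txt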

When aria2 finishes, we also get a simple download report:

2a7c02|OK  |   9.2MiB/s|/data/2021/2021-12-31-21.json.gz
6dcf29|OK  |    10MiB/s|/data/2021/2021-12-31-23.json.gz
6088b7|OK  |   3.3MiB/s|/data/2021/2021-12-31-22.json.gz
463d0c|OK  |   1.6MiB/s|/data/2021/2021-12-31-19.json.gz
deebcb|OK  |   852KiB/s|/data/2021/2021-12-31-15.json.gz
39ced0|OK  |   571KiB/s|/data/2021/2021-12-31-20.json.gz

Status Legend:
(OK):download completed.

However, a finished download does not guarantee that the data we have is complete and correct, both in terms of the number of files and the integrity of each file.

So we still need to do two more things: confirm that every file was downloaded, and confirm that each downloaded file is intact.

Completing the undownloaded GitHub dataset

When we "complete" the download of the GitHub dataset, we can first count the total number of downloaded data files:

# find . -type f -name '*.json.gz' | wc -l
103663

As you can see, the first pass fetched just over 103,000 files, which does not match the total number of download addresses we generated above: 521 files are missing.

// 103663 datasets currently on disk
// 104184 datasets expected

Files can be missing for several reasons: an unstable network, trouble on the target server, a problem in the download program aria2, or the data simply not existing in GH Archive (because GitHub itself was down at the time).

Whatever the reason, it is best to run a completion pass. First, we need a list of the files that have already been downloaded.

Get list of downloaded data files

Using find with a file-suffix filter over the download directory gives us a list of dataset files with their full paths.

# find . -type f -name '*.json.gz'

./2019/2019-01-14-3.json.gz
./2019/2019-02-05-12.json.gz
./2019/2019-01-01-9.json.gz
./2019/2019-08-05-0.json.gz
./2019/2019-09-19-15.json.gz
...

To make later processing easier, we can pipe the list through awk to drop the directory part and keep only the file names.

# find . -type f -name '*.json.gz' | awk -F "/" '{print $NF}'

2019-03-15-11.json.gz
2019-04-23-1.json.gz
2019-12-02-5.json.gz
2019-11-17-17.json.gz
2019-11-16-1.json.gz
...

Adjust the command to save the list of downloaded files to download.txt for later use.

find . -type f -name '*.json.gz' | awk -F "/" '{print $NF}' > download.txt

Using Diff to detect "missing" GitHub datasets

Here we can use a simple trick to quickly find, among 100,000 files, the datasets that went missing due to failed network requests.

First, use cat | sort to sort the download list and the list of downloaded files, saving them as a.txt and b.txt respectively:

cat urls.txt | sort > a.txt
cat download.txt | sort > b.txt
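Note that urls.txt contains full URLs while download.txt contains bare file names, so to make the two lists directly comparable you may want to normalize both sides to file names first. This is a small adjustment of my own, not one of the original commands:

# Strip the URL prefix so both lists contain only file names
awk -F '/' '{print $NF}' urls.txt | sort > a.txt
sort download.txt > b.txt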

Comparing the two files directly with diff gives us something like the following:

diff a.txt b.txt                                          
8,10d7
< 2020-08-05-0.json.gz
< ...

So we can use diff to get the difference between the two files, then grep and awk to extract the names of the files still to be downloaded:

diff a.txt b.txt | grep '<' | awk -F '< ' '{print $2}' > not-download.txt

Once we have the list of files that still need downloading, we simply hand it back to aria2:

aria2c -x 10 -i not-download.txt
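If your not-download.txt ended up holding bare file names rather than URLs (as it would with the normalized lists above), the same sed trick used later in this article can turn them back into download addresses (a sketch; not-download-urls.txt is just a name I made up):

# Turn bare file names back into full GH Archive URLs
sed -e 's#^#https://data.gharchive.org/#' not-download.txt > not-download-urls.txt
aria2c -x 10 -i not-download-urls.txt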

Check the integrity of downloaded files

Although GH Archive does not provide checksum files for its archives, we can use the gzip command to verify the integrity of each dataset file. For example:

gzip -t -v 2011-11-11-11.json.gz
2011-11-11-11.json.gz:	 OK

Integrity testing of datasets in batches

Faced with 100,000 files, a simple bash pipeline can run the check in batches and save the raw results to a file.

find . -type f -name '*.json.gz' | xargs -I {} gzip -v -t {} 2>&1 | tee verify.txt

You could also consider splitting the file list and running the command in parallel to speed up the check. Opening the result file, we see output like the following:

./2011-12-31-3.json.gz:	 OK
./2011-12-31-4.json.gz:	 OK
./2011-12-31-5.json.gz:	 OK
...

Of course, for better throughput we can also add the -P flag to xargs to run tasks in parallel. For example, rewriting the command above as xargs -I {} -P 4 gzip -t -v {} ... spreads the work across 4 processes automatically, with no need to split the list by hand. Note that -P support depends on the xargs version shipped with your Linux distribution or macOS; not every version supports it, and there are some compatibility quirks.
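For reference, the parallel variant of the batch check might look like this (a sketch; the parallelism level of 4 is arbitrary):

# Verify all archives with 4 gzip processes running in parallel
find . -type f -name '*.json.gz' | xargs -P 4 -I {} gzip -t -v {} 2>&1 | tee verify.txt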

The following compares the efficiency of xargs at different levels of parallelism:

# 0.01s user 0.02s system 0% cpu 26.518 total
xargs -I {} gzip -t -v {}  43.90s user 7.40s system 98% cpu 52.068 total
# 0.01s user 0.02s system 0% cpu 6.968 total
xargs -P 4 -I {} gzip -t -v {}  45.58s user 7.88s system 393% cpu 13.598 total
# 0.01s user 0.02s system 0% cpu 4.874 total
xargs -P 8 -I {} gzip -t -v {}  62.47s user 10.79s system 770% cpu 9.506 total

# 0.01s user 0.02s system 0% cpu 9.239 total
xargs -P 4 -I {} gzip -d {}  50.38s user 18.09s system 374% cpu 18.281 total
# 0.01s user 0.02s system 0% cpu 8.636 total
xargs -P 8 -I {} gzip -d {}  61.34s user 21.36s system 466% cpu 17.742 total

After all files have been checked, we can use grep -v "OK" to filter out the files that failed verification and need to be downloaded again.

# cat verify.txt | grep -v "OK"

./2011-02-16-18.json.gz:	
gzip: ./2011-02-16-18.json.gz: invalid compressed data--crc error

./2013-05-16-1.json.gz:	
gzip: ./2013-05-16-1.json.gz: invalid compressed data--crc error

gzip: ./2013-05-16-1.json.gz: invalid compressed data--length error

./2013-10-13-4.json.gz:	
gzip: ./2013-10-13-4.json.gz: invalid compressed data--crc error

./2013-10-15-10.json.gz:	
gzip: ./2013-10-15-10.json.gz: invalid compressed data--crc error

./2017-06-19-18.json.gz:	
gzip: ./2017-06-19-18.json.gz: unexpected end of file

./2017-08-31-9.json.gz:	
gzip: ./2017-08-31-9.json.gz: invalid compressed data--crc error

...

Organize files that need to be re-downloaded

First, use grep to save the entries with verification errors to a new file.

cat verify.txt | grep -v "OK" > error.txt

We can then use awk, grep, and sed to extract the file names of the datasets that need re-downloading, and another sed to assemble their download addresses:

cat error.txt | awk -F " " '{print $NF}' | grep ".json.gz" | sed -e 's/:$//g' | awk -F "/" '{print $NF}' | sed -e 's#^#https://data.gharchive.org/#'

After the command runs, we get a list of download addresses like the following:

https://data.gharchive.org/2011-02-16-18.json.gz
https://data.gharchive.org/2013-05-16-1.json.gz
https://data.gharchive.org/2013-10-13-4.json.gz
...

Save these broken files to a new download list, then use aria2 to re-download them and verify again, to make sure the data on disk is complete:

cat error.txt | awk -F " " '{print $NF}' | grep ".json.gz" | sed -e 's/:$//g' | awk -F "/" '{print $NF}' | sed -e 's#^#https://data.gharchive.org/#' > download.txt

aria2c -x 10 -i download.txt
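To close the loop, you may want to re-run the integrity check on just the re-downloaded files (a sketch of my own; it assumes the files landed in the current directory, and verify-redo.txt is a name I made up):

# Re-check only the files that were just re-downloaded
awk -F '/' '{print $NF}' download.txt | xargs -I {} gzip -t -v {} 2>&1 | tee verify-redo.txt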

That covers most of what you need to watch out for when fetching the full dataset from GitHub.

Other: A chat about GitHub and its public datasets

Next, let's talk about some of the stories behind GitHub and its datasets.

The booming state of GitHub

At the beginning of this year, GitHub passed 100 million users

GitHub is by far the largest developer community on the planet; this January it reached 100 million developers. Once the full download is done, even without any analytical database, we can see GitHub's vigorous growth trajectory just from the year-over-year change in data volume.

Using du -hs, we can see at a glance how fast GitHub's data volume has grown over the past decade or so.

# du -hs *

4.6G    2011
13G     2012
26G     2013
57G     2014
75G     2015
112G    2016
145G    2017
177G    2018
254G    2019
420G    2020
503G    2021
657G    2022

Plotting the numbers, we get a sharply rising curve; if we exclude the data after 2020, the slope of the growth is close to 45 degrees.

Rapidly growing platform data volume

From 2011 to 2014, GitHub's annual data volume doubled each year. After GitHub was acquired in June 2018, its data volume started a "road to soaring": the growth rate was not spectacular, but the absolute increase should not be underestimated, especially over the three years since the "black swan" event, during which GitHub's data has grown even faster.

Which projects, languages, and events drove the platform's rapid growth requires deeper "data drilling" and analysis; we will come back to that in a later article.

GitHub's downtime (service interruption)

Before doing any deep analysis, the list of missing files in the dataset already tells us when, over the past decade, GitHub failed to provide online service due to outages:

cat not-download.txt | awk -F '/' '{print $NF}' | sed -e 's/.json.gz//g'

The complete list of outages (service interruptions longer than one hour) adds up to 319 hours. A rough SLA calculation puts normal service at about 99.7% (roughly "two nines"). Extended outages started appearing in 2019 and 2020, and the "peak" of downtime came in 2021. In 2022, however, no outage lasted longer than an hour (at least from the perspective of GH Archive's data collection).

2016-10-21-18
2018-10-21-23
2018-10-22-0
2018-10-22-1
2019-05-08-12
2019-05-08-13
2019-09-12-8
2019-09-12-9
2019-09-12-10
2019-09-12-11
2019-09-12-12
2019-09-12-13
2019-09-12-14
2019-09-12-15
2019-09-12-16
2019-09-12-17
2019-09-12-18
2019-09-12-19
2019-09-12-20
2019-09-12-21
2019-09-12-22
2019-09-12-23
2019-09-13-0
2019-09-13-1
2019-09-13-2
2019-09-13-3
2019-09-13-4
2019-09-13-5
2020-03-05-22
2020-06-10-12
2020-06-10-13
2020-06-10-14
2020-06-10-15
2020-06-10-16
2020-06-10-17
2020-06-10-18
2020-06-10-19
2020-06-10-20
2020-06-10-21
2020-08-21-9
2020-08-21-10
2020-08-21-11
2020-08-21-12
2020-08-21-13
2020-08-21-14
2020-08-21-15
2020-08-21-16
2020-08-21-17
2020-08-21-18
2020-08-21-19
2020-08-21-20
2020-08-21-21
2020-08-21-22
2020-08-21-23
2020-08-22-0
2020-08-22-1
2020-08-22-2
2020-08-22-3
2020-08-22-4
2020-08-22-5
2020-08-22-6
2020-08-22-7
2020-08-22-8
2020-08-22-9
2020-08-22-10
2020-08-22-11
2020-08-22-12
2020-08-22-13
2020-08-22-14
2020-08-22-15
2020-08-22-16
2020-08-22-17
2020-08-22-18
2020-08-22-19
2020-08-22-20
2020-08-22-21
2020-08-22-22
2020-08-22-23
2020-08-23-0
2020-08-23-1
2020-08-23-2
2020-08-23-3
2020-08-23-4
2020-08-23-5
2020-08-23-6
2020-08-23-7
2020-08-23-8
2020-08-23-9
2020-08-23-10
2020-08-23-11
2020-08-23-12
2020-08-23-13
2020-08-23-14
2020-08-23-15
2021-08-25-17
2021-08-25-18
2021-08-25-19
2021-08-25-20
2021-08-25-21
2021-08-25-22
2021-08-25-23
2021-08-26-0
2021-08-26-1
2021-08-26-2
2021-08-26-3
2021-08-26-4
2021-08-26-5
2021-08-26-6
2021-08-26-7
2021-08-26-8
2021-08-26-9
2021-08-26-10
2021-08-26-11
2021-08-26-12
2021-08-26-13
2021-08-26-14
2021-08-26-15
2021-08-26-16
2021-08-26-17
2021-08-26-18
2021-08-26-19
2021-08-26-20
2021-08-26-21
2021-08-26-22
2021-08-26-23
2021-08-27-0
2021-08-27-1
2021-08-27-2
2021-08-27-3
2021-08-27-4
2021-08-27-5
2021-08-27-6
2021-08-27-7
2021-08-27-8
2021-08-27-9
2021-08-27-10
2021-08-27-11
2021-08-27-12
2021-08-27-13
2021-08-27-14
2021-08-27-15
2021-08-27-16
2021-08-27-17
2021-08-27-18
2021-08-27-19
2021-08-27-20
2021-08-27-21
2021-08-27-22
2021-10-22-5
2021-10-22-6
2021-10-22-7
2021-10-22-8
2021-10-22-9
2021-10-22-10
2021-10-22-11
2021-10-22-12
2021-10-22-13
2021-10-22-14
2021-10-22-15
2021-10-22-16
2021-10-22-17
2021-10-22-18
2021-10-22-19
2021-10-22-20
2021-10-22-21
2021-10-22-22
2021-10-23-2
2021-10-23-3
2021-10-23-4
2021-10-23-5
2021-10-23-6
2021-10-23-7
2021-10-23-8
2021-10-23-9
2021-10-23-10
2021-10-23-11
2021-10-23-12
2021-10-23-13
2021-10-23-14
2021-10-23-15
2021-10-23-16
2021-10-23-17
2021-10-23-18
2021-10-23-19
2021-10-23-20
2021-10-23-21
2021-10-23-22
2021-10-24-3
2021-10-24-4
2021-10-24-5
2021-10-24-6
2021-10-24-7
2021-10-24-8
2021-10-24-9
2021-10-24-10
2021-10-24-11
2021-10-24-12
2021-10-24-13
2021-10-24-14
2021-10-24-15
2021-10-24-16
2021-10-24-17
2021-10-24-18
2021-10-24-19
2021-10-24-20
2021-10-24-21
2021-10-24-22
2021-10-25-1
2021-10-25-2
2021-10-25-3
2021-10-25-4
2021-10-25-5
2021-10-25-6
2021-10-25-7
2021-10-25-8
2021-10-25-9
2021-10-25-10
2021-10-25-11
2021-10-25-12
2021-10-25-13
2021-10-25-14
2021-10-25-15
2021-10-25-16
2021-10-25-17
2021-10-25-18
2021-10-25-19
2021-10-25-20
2021-10-25-21
2021-10-25-22
2021-10-26-0
2021-10-26-1
2021-10-26-2
2021-10-26-3
2021-10-26-4
2021-10-26-5
2021-10-26-6
2021-10-26-7
2021-10-26-8
2021-10-26-9
2021-10-26-10
2021-10-26-11
2021-10-26-12
2021-10-26-13
2021-10-26-14
2021-10-26-15
2021-10-26-16
2021-10-26-17
2021-10-26-18
2021-10-26-19
2021-10-26-20
2021-10-26-21
2021-10-26-22
2021-10-26-23
2021-10-27-0
2021-10-27-1
2021-10-27-2
2021-10-27-3
2021-10-27-4
2021-10-27-5
2021-10-27-6
2021-10-27-7
2021-10-27-8
2021-10-27-9
2021-10-27-10
2021-10-27-11
2021-10-27-12
2021-10-27-13
2021-10-27-14
2021-10-27-15
2021-10-27-16
2021-10-27-17
2021-10-27-18
2021-10-27-19
2021-10-27-20
2021-10-27-21
2021-10-27-22
2021-10-27-23
2021-10-28-0
2021-10-28-1
2021-10-28-2
2021-10-28-3
2021-10-28-4
2021-10-28-5
2021-10-28-6
2021-10-28-7
2021-10-28-8
2021-10-28-9
2021-10-28-10
2021-10-28-11
2021-10-28-12
2021-10-28-13
2021-10-28-14
2021-10-28-15
2021-10-28-16
2021-10-28-17
2021-10-28-18
2021-10-28-19
2021-10-28-20
2021-10-28-21
2021-10-28-22
2021-10-28-23
2021-10-29-0
2021-10-29-1
2021-10-29-2
2021-10-29-3
2021-10-29-4
2021-10-29-5
2021-10-29-6
2021-10-29-7
2021-10-29-8
2021-10-29-9
2021-10-29-10
2021-10-29-11
2021-10-29-12
2021-10-29-13
2021-10-29-14
2021-10-29-15
2021-10-29-16
2021-10-29-17
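For reference, the 319-hour figure is simply the length of this list, and the availability estimate follows from the 104,184 archived hours counted earlier. A quick back-of-the-envelope check (my own; downtime.txt is a hypothetical file holding the list above):

wc -l < downtime.txt
# 319
awk 'BEGIN { printf "%.2f%%\n", (1 - 319 / 104184) * 100 }'
# 99.69%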

GitHub's most active peak moments

With the following command, we can find the hours when users and bots on the GitHub platform were most active (sort.txt is a size-sorted list of all archive files; see the sketch after the command):

tail -n 10 sort.txt | awk -F ' ' '{print $2}' | xargs -I {} du -hs {} | sort -r
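The sort.txt used here is not shown in the article; it can presumably be produced with du, for example as below (an assumption of mine, matching the tab-separated size-and-path fields the later script expects):

# Build a size-sorted list of every archive (size in kB, then path)
find . -type f -name '*.json.gz' -exec du -k {} + | sort -n > sort.txt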

The data results are currently as follows:

380M	./2022/2022-03-12-0.json.gz
336M	./2022/2022-05-19-0.json.gz
328M	./2022/2022-05-18-23.json.gz
321M	./2022/2022-05-19-2.json.gz
304M	./2022/2022-05-18-22.json.gz
291M	./2022/2022-02-26-6.json.gz
289M	./2022/2022-02-26-8.json.gz
286M	./2022/2022-02-26-7.json.gz
284M	./2022/2022-02-26-5.json.gz
281M	./2022/2022-03-11-23.json.gz

Sure enough, staying up late at night is in line with the habits of engineers.

GitHub's most active months

To find GitHub's most active month, we need to write a small program to aggregate the data:

const { readFileSync } = require("fs");

// Convert a size in kB to a human-readable string
const du = (size) => {
  const i = size == 0 ? 0 : Math.floor(Math.log(size) / Math.log(1024));
  return (size / Math.pow(1024, i)).toFixed(2) * 1 + " " + ["kB", "MB", "GB", "TB"][i];
};

const totalRecords = readFileSync("./sort.txt", "utf-8")
  .split("\n")
  .map((n) => n.trim())
  .filter((n) => n)
  .map((n) => {
    let [size, filename] = n.split("\t");
    size = parseInt(size, 10);
    filename = filename.split("/")[2].split(".")[0];

    const [year, month, day, hour] = filename
      .split("-")
      .slice(0, 4)
      .map((n) => parseInt(n, 10));

    return { size, year, month, day, hour };
  });

// Accumulate sizes by calendar month, across all years
const groupByMonth = totalRecords.reduce((prev, item) => {
  const { size, month } = item;
  prev[month] = prev[month] || 0;
  prev[month] += size;
  return prev;
}, {});

Object.keys(groupByMonth).forEach((key) => {
  groupByMonth[key] = du(groupByMonth[key]);
});

console.log(groupByMonth);
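Save it as, say, group-by-month.js (the filename is my own choice) and run it against the same sort.txt:

node group-by-month.js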

The final data is as follows:

{
  '1': '172.49 GB',
  '2': '184.24 GB',
  '3': '207.08 GB',
  '4': '196.82 GB',
  '5': '212.64 GB',
  '6': '198.75 GB',
  '7': '198.24 GB',
  '8': '198.49 GB',
  '9': '216.7 GB',
  '10': '209.58 GB',
  '11': '227.52 GB',
  '12': '226.62 GB'
}

Perhaps it is only at the end of the year that everyone remembers to "check in".

Stories about GitHub datasets

As mentioned at the beginning of the article, the GH Archive project shows at least twenty or thirty open source projects around the world that analyze GitHub data; some have long since gone offline and become part of the "living fossils of the Internet".

In the second half of last year, PingCAP launched OSS Insight, a real-time GitHub insight tool

Of the past year, the most memorable of these is "OSS Insight". It has a prettier interface than its various predecessors and supports some fairly basic data analysis. The PingCAP Cloud launched later also integrated this feature: you can pay a bit to spin up your own online demo in "minutes", though the data is limited (it seems to contain only data from 2022.01.01).

Although the official blog makes the project sound like a spur-of-the-moment idea from 2022, the seed was in fact planted at least a year earlier.


The origin of the OSS Insight project probably dates back to March 2021, when an interesting request came down from a boss. Whatever the reason, more "spur of the moment" boss requests that benefit the community would do no harm.

The story of GitHub data exploration, however, does not start in 2021 either; it can be traced back further, to 2020.

In 2020, ClickHouse for GitHub data mining

In 2020, engineers overseas used ClickHouse to analyze GitHub data and published a detailed write-up on the ClickHouse website. It may well be one of the prototypes of OSS Insight. Unfortunately, the dataset in that write-up stops at 2020 along with the article, many reproduction details are missing, and compared with OSS Insight it lacks a polished front-end interface.

Of course, GitHub's data exploration didn't just start in 2020.

Related references on the GH Archive website

The GH Archive website also lists other predecessors; that list of explorations and contributions is worth studying for anyone who wants to understand the open source world.

Finally

This article was finished during the Spring Festival holiday. My package deliveries were delayed, so all I could do was play with the data first, to avoid being left "without rice to cook" later. Recently, some teammates wanted to learn more about this dataset, so I took the opportunity to organize the content into a document, hoping it helps those of you with the same needs who are curious about the open source world.

–EOF




This article is licensed under the Attribution 4.0 International (CC BY 4.0) agreement. You are welcome to reprint or reuse it, provided you credit the source.

Author of this article: Su Yang

Created: February 23, 2023
Word count: 15,834 words
Reading time: about 32 minutes
Link to this article: https://soulteary.com/2023/02/23/exploring-github-open-datasets.html
