How to evaluate an open source project (2)丨Collaborative influence

Editor's note

"How should an open source project be evaluated?" has long been a contested question. Simple indicators such as the number of stars rarely reflect the quality of a project accurately. Many platforms and organizations have therefore launched dedicated tools, such as Gitee's "Gitee Index", which evaluates the health of an open source project along several dimensions: code activity, community activity, team health, popularity trend, and influence. The CHAOSS project under the Linux Foundation is committed to providing quantitative metrics for evaluating the development of open source communities and projects...

There are also attempts to systematically collect data from the open source community, including project popularity, reliability, and activity, to judge and evaluate the quality of an open source project.

Zhao Shengyu, a core member of X-lab Open Lab, has long researched open source theory and open source behavior data and has conducted related experiments. He recently began publishing a series of articles titled "How to Evaluate an Open Source Project". The series is still being updated, so stay tuned!

The main text below is the second article in the series.

Call for Papers:

The OSCHINA community looks forward to hearing more voices and invites all users to join the discussion. Which open source project is No. 1 in your mind? Which metric do you think matters most when evaluating an open source project?
You are welcome to take part in either of the following ways:
  • Submit a manuscript to the editorial department ([email protected]) stating your opinion and presenting your arguments. After review, we will recommend selected articles and authors and send a small gift.
  • Leave a comment in the comment area of the article. As of October 5th, the commenter with the most likes will also receive a small gift.
Participation rule: keep the discussion on the topic itself, justify your position, and elaborate your arguments as fully as possible.

Main text


About the author

Zhao Shengyu
Ph.D. in Computer Science from Tongji University, core member of X-lab Open Lab, focusing on research into open source theory and open source behavioral data.

This post follows the previous introduction to activity. It designs a new, higher-level algorithm on top of the activity metric proposed there, in the hope of solving some of the problems in measuring activity.

This article systematically introduces a method for assessing project influence based on the global developer collaboration network, which is of great help in analyzing the open source ecosystem as a whole. Besides evaluating the collaborative influence of all projects in a single pass, it also allows an in-depth exploration of the collaborative relevance between projects and can automatically determine a project's category.

The OpenGalaxy 2019 pictured above is built on the collaborative influence metric.

Background

The previous article discussed some thoughts and problems that arise when projects are analyzed directly with the statistical indicator of activity. In general, for a specific project, this kind of analysis can effectively track how the project as a whole is doing. By adjusting some parameters, it can also shape the value orientation of developers and play a guiding role, for example by encouraging maintainers to invest more effort in code review, or by attracting more developers to participate in the community.

In fact, however, this statistical algorithm ignores that open source is a complete ecosystem and that developers on GitHub are generally active on more than one project. Computing statistics for each project independently not only fails to use the relationships between projects, but also causes problems when projects are compared with one another.

The following two types of projects illustrate some of the problems of activity in global analysis:

  • 1. Projects with a large share of automated behavior. An example is the pddemo project, whose description reads "an Issue is automatically generated every minute in this repository". The repository's author uses automation to open a new Issue every minute, with a random title and no content, and periodically deletes the previously generated empty Issues. As a result, the project shows very high activity throughout the year, even though its only event type is Open Issue and every event comes from the same account. A similar project is signcla-probe-repo, a test project for the CLA signing program maintained by a Google test team. It generates a large number of automatically submitted PRs, all from bot accounts, to test the CLA signing function. The project currently has more than 460,000 PRs in total.
  • 2. Projects that do not fully use GitHub's features. This problem arises when certain event types are missing from a project. For example, many open source projects collaborate on GitHub but do not use GitHub Issues to submit and track requirements, relying on tools such as JIRA instead. The measured activity of such projects is therefore too low, and they are easily overlooked. This is especially true of many projects under the Apache Foundation, such as spark, hbase, and flink, none of which have enabled GitHub's Issue feature.

Given these problems, we want to make full use of the behavior data of open source projects to judge their collaborative influence. We therefore propose an open source collaboration network, which uses the collaboration of developers across different projects to calculate the influence of all projects in the global open source ecosystem.

Open Source Collaborative Networks and Collaborative Influence

Open Source Collaboration Network

The idea behind the open source collaboration network is very simple. The basic logic is: if developers are very active on two projects at the same time, then there is a high degree of collaborative relevance between the two projects. We do not consider the motivation behind this association for now; it is purely an observation of developer behavior. Follow-up sampling showed that the results largely match the original idea. In most cases, developers are highly active on two projects at the same time because the projects have an upstream/downstream relationship, or because they depend on or cooperate with each other in use.

The most typical example is the high relevance between VSCode and flutter. At first we suspected that the two projects shared a collaboration bot, which would explain the high relevance. In-depth analysis showed that the link is actually the developer Danny Tuppeny, a developer from the United Kingdom who is the author of Dart-Code, the Dart language plugin for VSCode, and a contributor to flutter. Since the Dart language is currently used mainly for flutter development, his high activity links VSCode and flutter, two of the world's top projects.

It is precisely the mathematical tool of the collaboration network that gives us an effective means of discovering such interesting relationships in the open source ecosystem.

The degree of collaboration in the network is also calculated in a very intuitive way. The activity of each developer on a project is computed as described in the previous article. Suppose the activity of developer $d$ on project $p1$ is $A_{d,p1}$ and the activity on project $p2$ is $A_{d,p2}$. The developer's contribution to the collaborative relevance of the two projects is then $\frac{A_{d,p1}A_{d,p2}}{A_{d,p1}+A_{d,p2}}$. This has the form of a harmonic mean (it equals half the harmonic mean of the two activities), which means a developer strongly affects the relevance between two projects only when they are very active on both. Being active on only one of them does not significantly increase the relevance of the two projects.
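The relevance calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `pair_contribution` and `build_collaboration_network`, the developer and project names, and the activity scores are all made up, and summing the per-developer contributions into one edge weight is an assumed aggregation that the text implies but does not spell out.

```python
from collections import defaultdict
from itertools import combinations


def pair_contribution(a1: float, a2: float) -> float:
    """One developer's contribution to the relevance of two projects.

    Implements A1 * A2 / (A1 + A2) from the text, i.e. half the harmonic mean.
    """
    if a1 <= 0 or a2 <= 0:
        return 0.0
    return a1 * a2 / (a1 + a2)


def build_collaboration_network(activity):
    """Build weighted edges from {developer: {project: activity}}.

    Returns {(p1, p2): weight} with p1 < p2. Contributions of all developers
    active on both projects are summed (an assumption of this sketch).
    """
    edges = defaultdict(float)
    for projects in activity.values():
        for p1, p2 in combinations(sorted(projects), 2):
            edges[(p1, p2)] += pair_contribution(projects[p1], projects[p2])
    return dict(edges)


# Hypothetical data: both developers have a total activity of 20.
activity = {
    "alice": {"proj_a": 10.0, "proj_b": 10.0},  # balanced across both projects
    "bob": {"proj_a": 19.0, "proj_b": 1.0},     # active almost only on proj_a
}
network = build_collaboration_network(activity)
```

With these made-up numbers, alice contributes 10 * 10 / 20 = 5.0 to the edge between `proj_a` and `proj_b`, while bob contributes only 19 * 1 / 20 = 0.95, even though both have the same total activity. This is the behavior the harmonic form is meant to produce.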

Collaborative influence

Based on the collaboration network of all projects in the global open source ecosystem constructed above, we can use graph analysis algorithms to calculate the collaborative influence of each project. Here we use the PageRank algorithm. Its calculation process is described at the link, so I will not go into details here.

PageRank is a classic algorithm. It was invented for ranking pages in the Google search engine, where it achieved very good results. The algorithm is computationally efficient, and it judges the quality of a web page not by the page's own content but by the reference relationships between pages.

Its basic assumption is that a high-quality page is cited by more pages, and that the pages a high-quality page cites are themselves of higher quality. With nothing more than this basic value assumption and the citation relationships among countless web pages on the Internet, Google can produce a good judgment and ranking of page quality. (We do not discuss here the cheating that publishing the algorithm may invite.)

The collaborative influence in the open source collaboration network rests on a very similar idea: a project with greater influence has collaborative relationships with more projects, and the projects that collaborate closely with a highly influential project also have higher influence.

With this understanding, the top 10 projects on GitHub by global collaborative influence in 2019 are as follows:
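To make the idea concrete, here is a minimal sketch of PageRank on an undirected, weighted collaboration network, computed by plain power iteration. It is an illustration, not the implementation behind the results in this article: the function name `weighted_pagerank`, the damping factor of 0.85, the iteration settings, and the toy graph are all assumptions, and the scores it returns sum to 1 rather than following whatever scale the published ranking uses.

```python
def weighted_pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on an undirected weighted graph.

    edges: {(p1, p2): weight}. Each project passes its rank to its
    neighbours in proportion to the edge weights.
    """
    neighbors = {}
    for (p1, p2), w in edges.items():
        if w <= 0:
            continue  # ignore empty edges so every node has positive strength
        neighbors.setdefault(p1, {})[p2] = w
        neighbors.setdefault(p2, {})[p1] = w
    nodes = sorted(neighbors)
    n = len(nodes)
    strength = {p: sum(neighbors[p].values()) for p in nodes}
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new_rank = {}
        for p in nodes:
            incoming = sum(rank[q] * w / strength[q] for q, w in neighbors[p].items())
            new_rank[p] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical network: one hub project collaborating with three others.
toy_edges = {
    ("hub", "lang_a"): 1.0,
    ("hub", "lang_b"): 1.0,
    ("hub", "lang_c"): 1.0,
}
influence = weighted_pagerank(toy_edges)
ranking = sorted(influence, key=influence.get, reverse=True)
```

In this toy graph `hub` comes first in `ranking`, because every other project is connected to it. This mirrors, on a very small scale, how a project linked to many language ecosystems ends up near the center of the network.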

Rank  Project                            Collaborative influence
1     microsoft/vscode                   1135
2     flutter/flutter                    645
3     kubernetes/kubernetes              624
4     DefinitelyTyped/DefinitelyTyped    564
5     microsoft/TypeScript               544
6     tensorflow/tensorflow              535
7     gatsbyjs/gatsby                    504
8     golang/go                          448
9     rust-lang/rust                     448
10    facebook/react-native              426

VSCode, an IDE project that is widely used and popular among developers, sits at the center of the open source world with a very high collaborative influence. Through plugin projects for various languages, VSCode has built strong relevance to those languages and to the top projects developed in them, which demonstrates its importance in the developer ecosystem.

For a more detailed description of the algorithm and its experimental results, see this blog.

Thoughts

  • 1. Collaborative influence is built on top of activity, yet it avoids many of the problems that activity causes and makes use of important relational information contained in the global data of the open source ecosystem.
  • 2. In the automated projects mentioned above, the number of certain behaviors is extremely large, but because their automation accounts are not associated with other projects, this abnormally high activity is eliminated when the network is constructed.
  • 3. This brings another major benefit: gaming the score is almost ineffective under this model. A high activity score within your own project cannot raise its influence unless developers from other projects in the ecosystem become active in your project.
  • 4. Developers of the aforementioned projects that do not use GitHub Issues are highly active not only in the project itself but also in its upstream and downstream projects, so these projects score better on influence than on the activity indicator.
  • 5. Because important information such as network relationships is used, the manually assigned weights in the original activity calculation matter less. As long as the rough value judgment holds, namely that code contribution counts for more than review and general issue discussion, changing the weights in activity hardly affects the influence ranking. Activity serves only as the base data for influence, and the structural information of the collaboration network gives the whole algorithm better stability and robustness.
  • 6. Besides calculating the collaborative influence of projects, building the collaboration network brings some additional analytical capabilities.
    • Collaboration silos. A collaboration silo is a group of projects whose active developers are not active in any other project, and in which no outside developers are active, so the group is detached from the larger open source ecosystem. Of the nearly 900,000 projects active in 2019, 90.4% form the core open source ecosystem, and fewer than 10% form a very large number of isolated islands. The most common are silos caused by language isolation, such as some developer groups in Japan, Russia, France, Ukraine, and Belarus.
    • Project clustering. Projects with a high degree of collaboration are connected in some way and may share similar attributes, so clustering methods can be used to determine a project's category. For the specific algorithm and results, see the previous blog.
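
The collaboration silos described above can be found with an ordinary connected-components search over the network. The sketch below is a hedged illustration: the function name `connected_components` and all project names are hypothetical, and treating the largest component as the "core ecosystem" and every other component as a silo is an assumption of this example, not a definition taken from the article.

```python
from collections import defaultdict, deque


def connected_components(projects, edges):
    """Group projects into connected components using breadth-first search.

    projects: iterable of all project names, including ones without edges.
    edges: iterable of (p1, p2) pairs with non-zero collaborative relevance.
    Returns a list of sets, largest component first.
    """
    adjacency = defaultdict(set)
    for p1, p2 in edges:
        adjacency[p1].add(p2)
        adjacency[p2].add(p1)
    seen = set()
    components = []
    for start in sorted(projects):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical ecosystem: a core, a two-project island, and a bot-only repository.
projects = {"core_a", "core_b", "core_c", "island_x", "island_y", "bot_repo"}
edges = [("core_a", "core_b"), ("core_b", "core_c"), ("island_x", "island_y")]

components = connected_components(projects, edges)
core, silos = components[0], components[1:]
```

Here `bot_repo` ends up as a silo of its own: its single automation account is not active anywhere else, so the project has no edges at all. That is also why its abnormally high activity cannot turn into influence.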

Problems

  • 1. The open source collaboration network can currently only be used to analyze projects. A similar network for analyzing developers has limited effectiveness, mainly because the large number of automated collaboration accounts gain a great advantage under this model.
  • 2. The open source collaboration network uses only the GitHub behavior data of developer collaboration. Much more metadata in the open source world could be exploited, but this model cannot currently accommodate it.
  • 3. The open source collaboration network uses only relationships. This is a limitation of the PageRank algorithm, but also one of its advantages: the result is independent of the initial values (or, more precisely, it has the Markov property). It also means that we cannot use prior knowledge in this model, such as the inherent properties of projects or developers.
  • 4. Although the open source collaboration network is not sensitive to the design of activity, it does place requirements on it. If low-cost behaviors such as star or fork are included in activity, connections appear between a large number of projects, which reduces the accuracy of project category detection. In other words, the quality of the current clustering depends largely on the underlying activity design.
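
The initial value independence mentioned in item 3 can be checked numerically. The following sketch runs the same power iteration from two very different starting vectors and compares the results. The function name `pagerank_from`, the damping factor, the iteration count, and the three-project graph are assumptions made for illustration only.

```python
def pagerank_from(edges, initial, damping=0.85, iterations=200):
    """Weighted PageRank by power iteration, started from a given vector.

    edges: {(p1, p2): weight} for an undirected graph with positive weights.
    initial: {project: positive starting score}; normalised to sum to 1.
    """
    neighbors = {}
    for (p1, p2), w in edges.items():
        neighbors.setdefault(p1, {})[p2] = w
        neighbors.setdefault(p2, {})[p1] = w
    nodes = sorted(neighbors)
    n = len(nodes)
    strength = {p: sum(neighbors[p].values()) for p in nodes}
    total = sum(initial[p] for p in nodes)
    rank = {p: initial[p] / total for p in nodes}
    for _ in range(iterations):
        # The comprehension reads the previous rank and then replaces it.
        rank = {
            p: (1.0 - damping) / n
            + damping * sum(rank[q] * w / strength[q] for q, w in neighbors[p].items())
            for p in nodes
        }
    return rank


# Hypothetical three-project network.
edges = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "a"): 2.0}

from_uniform = pagerank_from(edges, {"a": 1.0, "b": 1.0, "c": 1.0})
from_biased = pagerank_from(edges, {"a": 100.0, "b": 1.0, "c": 1.0})
max_gap = max(abs(from_uniform[p] - from_biased[p]) for p in from_uniform)
```

After enough iterations `max_gap` is negligibly small: giving project `a` a large head start changes nothing in the final scores. This is the robustness the text describes, and also the reason prior knowledge about a project cannot simply be injected through the starting values.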

Summary

In general, collaborative influence based on the open source collaboration network solves many problems of the statistics-based activity indicator, and provides a very effective means of evaluating and gaining insight into the influence of projects in the entire open source ecosystem. At the same time, it has some inherent problems of its own, which severely limit the scalability of the model and its indicators.

We hope to find a highly scalable graph model that can incorporate more diverse open source data, deliver deeper and more accurate insight into open source projects and developers, and avoid problems such as metric gaming and partitioning. For the value stream network model of the open source ecosystem, please continue to follow this series of articles.

