How to evaluate an open source project (3) - value flow network

Editor's note

"How to evaluate an open source project?" has long been a contested question. Simple indicators such as the number of stars cannot accurately reflect the quality of a project. Many platforms and organizations have therefore launched dedicated tools, such as Gitee's "Gitee Index" feature, which evaluates the health of an open source project along multiple dimensions: code activity, community activity, team health, popularity trends, and influence. The CHAOSS project under the Linux Foundation is committed to providing quantitative metrics for evaluating the development of open source communities and projects...

There are also attempts to systematically collect data from the open source community, including project popularity, reliability, and activity, to judge and evaluate the quality of an open source project.

Zhao Shengyu, a core member of X-lab Open Lab, has long studied open source theory and open source behavior data and has run related experiments. He recently began publishing a series of articles titled "How to Evaluate an Open Source Project". The series is still being updated, so stay tuned!

The following main text is the third in the series.

Call for Papers:

The OSCHINA community looks forward to hearing more voices and invites all users to join the discussion. Which open source project is No. 1 in your mind? Which metric do you think matters most when evaluating an open source project?
You are welcome to take part in either of the following ways:
  • Submit your manuscript to the editorial department ([email protected]), stating your opinion and presenting your arguments. After review, we will recommend selected articles and authors and send a small gift.
  • Leave a comment below the article. As of October 5th, the commenter with the most likes will also receive a small gift.
Participation rules: keep the discussion on the topic itself, justify your position, and elaborate your arguments as fully as possible.

Main text


About the author

Zhao Shengyu
Holds a doctorate in computer science from Tongji University; core member of X-lab open laboratory, focusing on open source theory and research on open source behavioral data.

This post continues the previous introductions to activity and collaborative influence. It aims to solve the problem that collaborative influence cannot accommodate more data, so that the open source ecosystem can be measured more comprehensively, and it introduces a highly scalable mathematical model that can absorb more metadata at any time without significant changes to the model. I also hope that more people interested in community metrics will join the discussion. For my contact information, see the About page.

A complete open source galaxy to explore.

Background

The previous two posts introduced a method for calculating an activity index based on weighted statistics of developer behavior, and a method for calculating a project's collaborative influence index on an open source collaboration network built from activity. Activity mainly addresses the cognitive difficulty caused by having too many separate indicators; through its behavior weights, it can also encourage positive developer behavior. Collaborative influence takes the perspective of the entire open source technology ecosystem. By taking collaboration between projects into account, it mitigates to some extent two problems of activity: malicious score inflation, and unstable rankings caused by manually assigned weights.

However, beyond some inherent problems of measurement accuracy, these two indicators share an important defect: the model has to be revised whenever new ecosystem data is introduced. The data of the open source software ecosystem extends far beyond GitHub's global behavior data, so we need a highly scalable mathematical model into which more open data can be integrated whenever it becomes available.

Value Flow Network

As Nadia Eghbal writes in her new book "Working in Public": "The purpose of consuming code is not to simply read and study it, but to use it. The value of open source does not come from its static qualities." and "However, measuring the value of code based on dependencies only gives us part of the equation. It matters who is using open source code, but doesn't it matter who developed it?" These two passages capture the basic logic we should follow when measuring open source software.

That is, whether open source software, or any open source digital product, is valuable depends on two sides: production and consumption. On the production side, even if two developers have the same level of activity, the value generated by an excellent developer is very different from the value generated by one who is just getting started. On the consumption side, a project that is under continuous development but has never been used has a different value from a project that may have been inactive for a long time but is used by hundreds or thousands of people.

Therefore, the value flow network sets out to analyze open source software from the perspective of the social value it generates. It builds a model running from the production side to the consumption side that can directly measure the social value of each piece of software and, working backward, derive the value of each developer. This is foundational work for building a complete open source economic ecosystem.

Start with collaborative influence

Let’s go back and look at the collaborative influence model from the previous article.

The original web page ranking algorithm, PageRank, is based on a probability model: when an Internet user is browsing a web page, which page are they likely to visit next? With high probability they follow a randomly chosen outbound link on the current page; otherwise they close the page and open a page chosen at random from all pages. In the final ranking, pages with a higher probability of being visited rank higher.

But this model can also be viewed from the perspective of value flow: each web page passes part of its value to the pages it links to and receives value from the pages that link to it. In addition, every web page has a base value. Once the value flow over the whole network has stabilized, the value of each web page is fully determined, and pages with greater value rank higher.

The open source collaboration network from the previous article can be viewed the same way: each project has a base value, and the associations between projects created by developer collaboration cause value to flow between projects. Once the whole network is stable, the collaborative influence of every project is determined.
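The value flow reading of the ranking algorithm can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the previous article: the function name `value_flow_rank`, the damping factor of 0.85, and the three-node example graph are all assumptions made for the sketch.

```python
def value_flow_rank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Iterate value flow until it stabilizes.

    links maps each node to the list of nodes it links to.
    damping is the share of a node's value that flows along its links
    (0.85 is an assumed, conventional choice).
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    value = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Base value that every node receives regardless of links.
        new_value = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split the outgoing value evenly over the outbound links.
                share = damping * value[node] / len(targets)
                for target in targets:
                    new_value[target] += share
            else:
                # A node without outbound links spreads its value over all nodes.
                share = damping * value[node] / n
                for target in nodes:
                    new_value[target] += share
        delta = sum(abs(new_value[x] - value[x]) for x in nodes)
        value = new_value
        if delta < tol:
            break
    return value


# Hypothetical three-node network: a links to b and c, b to c, c to a.
ranks = value_flow_rank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

The total value in the network stays constant from one iteration to the next; only its distribution over the nodes changes until the flow is stable.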

A simple example

From this point of view, we can give a more general method for constructing and computing a value network. The simplest example is given below: it adds more data relationships, especially data on the production and consumption sides, while still taking developers' contribution activity into account.

As shown in the figure above, this simple value network has two types of nodes: developers and projects. Developers and projects each have their own value. Developers are connected by follow relationships, similar to Follow on GitHub. Projects are connected by dependencies, that is, upstream and downstream usage relationships. In addition to a developer's activity on a project, we also add the concept of attention: one-way behavior initiated by a developer toward a project, such as star, watch, fork, or clone, which expresses interest in the project but feeds nothing back into it. The flow of value in this network is summarized in the following table, where each cell represents the value transfer from the row node to the column node:

|           | Project    | Developer           |
|-----------|------------|---------------------|
| Project   | dependency | activity, attention |
| Developer | activity   | follow              |

Under this model, in the value flow network of global open source projects and developers, each developer's value flows outward through their activity, their attention to projects, and their following of other developers, while each project's value flows outward through activity and dependencies. That is, most of the value I create flows to the projects I contribute to through my concrete contributions, another part flows to the projects I pay attention to but do not contribute to, and another part flows to the developers I follow. Part of a project's value returns to its developers through the contribution relationship, and part flows through dependencies to the upstream projects it relies on.

Since projects and developers each retain part of their own value, we can fold inherent properties that are not captured by the network into their initial values. For example, open source KOLs (key opinion leaders) can be given a higher initial value; because a fixed proportion of value is retained, this initial value continues to influence the result.

Whether this model converges to a stable solution is a relatively complex mathematical problem. Interested readers can refer to the appendix.
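The two-node-type network described above can be sketched as follows. This is a hedged illustration rather than the author's actual model: the function `solve_value_network`, the retention share of 0.15, the node names, the edge weights, and the initial values are all hypothetical. The edge directions follow the prose description: developers pass value to projects through activity and attention and to other developers through follow, while projects pass value to upstream projects through dependencies and back to developers through contribution.

```python
from collections import defaultdict


def solve_value_network(edges, initial, retain=0.15, tol=1e-12, max_iter=10_000):
    """Iterate value flow on a weighted, typed network until it stabilizes.

    edges is a list of (source, target, weight) tuples.
    initial maps every node to its intrinsic, off-network value.
    retain is the assumed share of value tied to that intrinsic value.
    """
    out_total = defaultdict(float)
    for source, _, weight in edges:
        out_total[source] += weight
    value = dict(initial)
    for _ in range(max_iter):
        # Retained part: proportional to the node's intrinsic value.
        new_value = {node: retain * initial[node] for node in initial}
        for source, target, weight in edges:
            # The rest flows out in proportion to the edge weights.
            new_value[target] += (1.0 - retain) * value[source] * weight / out_total[source]
        delta = sum(abs(new_value[node] - value[node]) for node in initial)
        value = new_value
        if delta < tol:
            break
    return value


# Hypothetical data: alice is a KOL, so she starts with a higher initial value.
initial = {"dev:alice": 2.0, "dev:bob": 1.0, "proj:lib": 1.0, "proj:app": 1.0}
edges = [
    ("dev:alice", "proj:lib", 5.0),  # activity
    ("dev:alice", "proj:app", 1.0),  # attention (for example a star)
    ("dev:bob", "proj:app", 4.0),    # activity
    ("dev:bob", "dev:alice", 1.0),   # follow
    ("proj:app", "proj:lib", 2.0),   # dependency on an upstream project
    ("proj:app", "dev:bob", 1.0),    # value returned through contribution
    ("proj:lib", "dev:alice", 1.0),  # value returned through contribution
]
values = solve_value_network(edges, initial)
```

Because every node in this toy network has at least one outgoing edge, the total value is conserved, and the retained share makes each iteration a contraction, so the loop settles on a single steady state. Adding a new relationship type only means adding edges; the solver itself does not change.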

Open Source Ecosystem Value Flow Network

The model above can be implemented and verified quickly and scales well, but the value network of the entire open source ecosystem is far more complex. The data between the production side and the consumption side goes far beyond this scope. On the consumption side in particular, being depended on by other projects is not how open source projects are ultimately consumed. All software is ultimately consumed by becoming a service that meets the needs of real users, so value should refer to a project's ultimate social utility, not to whether it has been integrated for secondary development. If the project produced by secondary development is itself not used by anyone, solves no real need, and in other words produces no social utility, then its value is limited.

Here, for reference, is a more complex value network that may not be feasible at the moment:

In the complex network structure represented in the above figure, its value flow can be observed from the following table:

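|                   | Project         | Software | Developer           | Company    | Foundation  | Investment agency | User |
|-------------------|-----------------|----------|---------------------|------------|-------------|-------------------|------|
| Project           | dependency, use | use      | activity, attention | ownership  | ownership   | investment        | -    |
| Software          | use             | -        | -                   | -          | -           | -                 | use  |
| Developer         | activity        | -        | follow              | employment | membership  | -                 | -    |
| Company           | ownership       | -        | employment          | subsidiary | sponsorship | investment        | -    |
| Foundation        | ownership       | -        | membership          | sponsorship| -           | -                 | -    |
| Investment agency | investment      | -        | -                   | investment | -           | -                 | -    |
| User              | activity        | -        | follow              | employment | membership  | -                 | -    |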

The figure above includes a large number of entity types related to the software ecosystem, such as software, companies, foundations, investment institutions, and users. From this still incomplete perspective, developers' activity and users' use of software are the fundamental sources of ecological value for all open source software. Investment by investment institutions, and the value that companies inject through employing developers and sponsoring foundations, are external inputs of value. All other value flows within the network along the relationships above.

In addition to the dependencies between projects described above, there is also the use relationship between software and projects. Traditional project dependencies mainly refer to secondary development, where a project is usually pulled in as a language-specific package. Use, by contrast, reflects the fact that a service delivered to end users is never provided by a single project alone: it also involves the underlying operating system, database, virtual machine, programming language, and other services it interacts with through RPC. All of these fall under software use.

Under this network model, if we can quantify the value and the flow mechanism of each part, then not only can the value of every entity be evaluated well, but as the data keeps improving, that value will also gradually approach the corresponding real social utility. This is the ultimate goal. This approach is in fact often used to solve complex systems; the result is usually a steady-state solution of the system, which reveals how the system should operate.

Of course, this is not a final model. Take security risk as an example: developers or companies that have long been engaged in monitoring and analyzing security vulnerabilities also bring great value to the open source ecosystem. But each such extension makes the network grow rapidly, so we will not expand on it here.
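One way to keep such a relationship table separate from any solver is to store it as plain data. The sketch below is only an illustration under assumptions: the names `SCHEMA` and `is_allowed`, the short type name `investor` for investment agencies, and the English relation labels are all chosen here and are not part of the original model. The entries copy the table row by row, read as row type to column type.

```python
# Relation schema copied from the table: (row type, column type) -> relation labels.
# Type names and labels are hypothetical English renderings.
SCHEMA = {
    ("project", "project"): {"dependency", "use"},
    ("project", "software"): {"use"},
    ("project", "developer"): {"activity", "attention"},
    ("project", "company"): {"ownership"},
    ("project", "foundation"): {"ownership"},
    ("project", "investor"): {"investment"},
    ("software", "project"): {"use"},
    ("software", "user"): {"use"},
    ("developer", "project"): {"activity"},
    ("developer", "developer"): {"follow"},
    ("developer", "company"): {"employment"},
    ("developer", "foundation"): {"membership"},
    ("company", "project"): {"ownership"},
    ("company", "developer"): {"employment"},
    ("company", "company"): {"subsidiary"},
    ("company", "foundation"): {"sponsorship"},
    ("company", "investor"): {"investment"},
    ("foundation", "project"): {"ownership"},
    ("foundation", "developer"): {"membership"},
    ("foundation", "company"): {"sponsorship"},
    ("investor", "project"): {"investment"},
    ("investor", "company"): {"investment"},
    ("user", "project"): {"activity"},
    ("user", "developer"): {"follow"},
    ("user", "company"): {"employment"},
    ("user", "foundation"): {"membership"},
}


def is_allowed(source_type, target_type, relation):
    """Return True if the table permits this relation between the two entity types."""
    return relation in SCHEMA.get((source_type, target_type), set())
```

With the table held as data, extending the network (for example with security researchers) is a matter of adding entries, and collected edges can be validated against the schema before any values are computed.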

Reflections

  • 1. The value flow network model aims to include as much of the data in the open source ecosystem as possible and, more importantly, to provide an upper-level model that remains compatible with additional data.
  • 2. The value flow network model decouples the mathematical model from the business model, so the open source ecosystem can be described at the upper level without regard to the underlying mathematics. For example, the more complex network structure above involves no mathematical model at all.
  • 3. Whether the value flow network model has a steady-state solution depends closely on its underlying mathematical constraints; interested readers can refer to the appendix. Once the business model is fixed, however, people with the relevant mathematical background can work together to adjust it so that it converges.
  • 4. What the value flow network model ultimately hopes to address is the construction of an economic system for the entire open source ecosystem, which will be discussed in the next article.

Open problems

  • 1. Although this model decouples the mathematical model from the business model and scales well, a business model designer who is not familiar with the underlying mathematics may well design a business model that has no stable solution, so the demands on business model designers are higher.
  • 2. To measure the economic system of the entire open source digital ecosystem accurately, this model requires a large amount of data, most of which is difficult to obtain and correlate. This will be very long-term work that requires large-scale collaboration, and it is part of the work we will introduce in the future.

Appendix

PageRank Convergence Problem

The convergence of PageRank is generally proved with the theory of stochastic processes, by viewing it as its equivalent Markov process. The corresponding transition matrix must satisfy two conditions:

  • 1. It meets the requirements of a stochastic process, that is, the transition matrix is a stochastic matrix.
  • 2. The transition matrix is a primitive matrix, which means it is irreducible and aperiodic. By the Perron-Frobenius theorem, the stochastic process then necessarily converges.

In a high-dimensional heterogeneous information network, a similar stochastic process must likewise satisfy these two conditions in order to converge. For more detail, please refer to the relevant literature.
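Both conditions can be checked mechanically for a small transition matrix. The sketch below is an illustration under assumptions: the function names are hypothetical, matrices are plain nested lists with rows summing to 1, and primitivity is tested with Wielandt's bound, which says a non-negative n×n matrix is primitive exactly when its power (n-1)²+1 has only positive entries. Only the zero pattern matters, so boolean matrix products are enough.

```python
def is_row_stochastic(matrix, tol=1e-9):
    """Condition 1: entries are non-negative and every row sums to 1."""
    return all(
        all(x >= 0 for x in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def is_primitive(matrix):
    """Condition 2: the power (n-1)^2 + 1 of the matrix is entrywise positive."""
    n = len(matrix)
    pattern = [[x > 0 for x in row] for row in matrix]

    def bool_mul(a, b):
        # Boolean matrix product: tracks which entries are non-zero.
        return [
            [any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]

    power = pattern
    for _ in range((n - 1) ** 2):  # after the loop: pattern ** ((n - 1) ** 2 + 1)
        power = bool_mul(power, pattern)
    return all(all(row) for row in power)


# A pure two-node cycle is stochastic and irreducible, but periodic.
cycle = [[0.0, 1.0], [1.0, 0.0]]
# Mixing in a uniform base value (assumed damping factor 0.85) makes it primitive.
damped = [[0.85 * x + 0.15 / 2 for x in row] for row in cycle]
```

The example shows why the base value matters: the plain cycle fails the second condition, while the damped version, in which every node receives a small share from every other node, satisfies both.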

 

