The Seven Big Data Traps, Slippery Statistics: The Power-Law Distribution

This is the last article in the series on statistical traps. Let's look at the last of the six pictures: players' income.

Figure F: Players' income

As in the other articles of this series, we have saved the most interesting distribution for last.

On average, a professional American football player earns about 1.5 million US dollars a year.

To be exact, the average income of the 1,999 players in the 2017 season was $1.489 million.

We also know that players' incomes follow some kind of distribution.

So, given what the first five pictures taught us, how well can the overall average represent the income of an individual player?

To recap, the average (here meaning the arithmetic mean) is a measure of the central tendency of the whole distribution.

In other words, if we replaced each player's actual salary with the average salary of $1.489 million, the players together would receive the same total that the entire league actually pays, which is about 2.97 billion US dollars.
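The arithmetic behind this statement can be checked with a short sketch. The player count of 1,999 comes from the salary data quoted in this article, and the variable names are purely illustrative. Because the mean is rounded to the nearest thousand dollars, the product only approximates the league total.

```python
# Figures quoted in the article; variable names are illustrative.
mean_salary = 1_489_000   # average 2017 salary in US dollars (rounded)
player_count = 1_999      # players with available salary data

# Paying every player the mean reproduces (approximately) the total payroll.
total_payroll = mean_salary * player_count
print(f"Total payroll: {total_payroll:,} US dollars")
```

The result is just under 3 billion US dollars, in line with the article's figure once the rounding of the mean is taken into account.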

If that were the case, many players would be happy with the arrangement.

Let's look at the distribution of player salaries in the figure below. The three largest bins, $0 to $499,000, $500,000 to $1 million, and $1 million to $1.5 million, all sit at or below the average.

In fact, of the 1,999 players for whom 2017 salary data is available, 1,532 earned less than the average. That is 76.6% of all players.
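As a rough illustration of how a few large values pull the mean above most of the data, the sketch below counts the share of values that fall under the mean. The sample list is hypothetical and is not real salary data; only the counts 1,532 and 1,999 come from the article.

```python
from statistics import mean

def share_below_mean(values):
    """Return the fraction of values strictly below the arithmetic mean."""
    avg = mean(values)
    return sum(1 for v in values if v < avg) / len(values)

# Hypothetical right-skewed sample (thousands of dollars), not real data.
sample = [300, 450, 500, 700, 900, 1_200, 2_000, 24_000]
print(share_below_mean(sample))

# The article's own counts: 1,532 of 1,999 players earn less than the mean.
article_share = round(1532 / 1999 * 100, 1)
print(article_share)
```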

If you assumed that any given player earns $1.5 million a year, the vast majority of players could only wish that were true.

A few players, however, might look down on that figure. Quarterback Kirk Cousins, for example, earned close to 24 million US dollars in 2017. He is barely visible at the far right of the histogram, yet his salary is 16 times the average.

Measured in standard deviations, Kirk Cousins's income is 10 standard deviations above the average.

If a player were that many standard deviations taller than the average height, he would be 8 feet 4 inches tall.

By comparison, the tallest person in the world (8 feet 1 inch) is a full 3 inches shorter than that.

The standard deviation of the salary data is $2.25 million, which is larger than the average value itself.
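The "16 times" and "10 standard deviations" figures can be reproduced from the numbers quoted above. This is a minimal sketch that assumes a salary of exactly $24 million for Kirk Cousins, whereas the article only says "close to 24 million".

```python
mean_salary = 1_489_000       # average salary quoted in the article
std_salary = 2_250_000        # standard deviation quoted in the article
cousins_salary = 24_000_000   # assumption: the article says "close to 24 million"

# How many times the average does he earn?
ratio = cousins_salary / mean_salary

# How many standard deviations above the average is his salary (z-score)?
z_score = (cousins_salary - mean_salary) / std_salary

# A standard deviation larger than the mean gives a ratio above 1.
spread_ratio = std_salary / mean_salary

print(f"{ratio:.1f} times the mean, z = {z_score:.1f}, std/mean = {spread_ratio:.2f}")
```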

As you can see, salaries follow a completely different kind of distribution. It is neither uniform nor normal.

This is the power-law distribution, which is as famous as the normal distribution and is ubiquitous in the social sciences.

Imagine the distribution of follower counts across social media accounts: relatively few accounts have a very large number of followers, thousands each, while most of the remaining accounts have only a handful or fewer.

The result is an incredibly long tail.
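A long tail of this kind can be simulated with Python's standard library. The sketch below draws samples from a Pareto distribution, one common power-law model. The shape parameter, sample size, and seed are assumptions chosen for illustration, and the values are not fitted to any real follower data.

```python
import random
from statistics import mean, median

random.seed(0)   # fixed seed so repeated runs give the same sample

alpha = 1.16     # assumed shape parameter; roughly matches an 80/20 split
followers = [random.paretovariate(alpha) for _ in range(10_000)]

avg = mean(followers)
mid = median(followers)
largest = max(followers)
below_avg = sum(1 for f in followers if f < avg) / len(followers)

# In a long-tailed sample the mean sits well above the median,
# and most values fall below the mean.
print(f"mean={avg:.2f} median={mid:.2f} max={largest:.2f} below mean={below_avg:.1%}")
```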

This is true for many things, such as book sales, website visitors, music on streaming services, and movie views.

As humans, we devote most of our attention, money, and love to relatively few people and products. In many areas of human life, the winners take a disproportionate share of the rewards.

This is why the power-law distribution is often associated with the Pareto principle, or the 80/20 rule: 80% of the benefits go to 20% of the people.

By the way, the numbers 80 and 20 are of course not exact.

They are just rhetorical shorthand for the idea that relatively few people receive a disproportionately large share of the benefits.

Taking football salaries as an example, 80% of total salary is earned by the top 800 players, who make up 40% of the league.

A full half of all salary in the league goes to just 214 players, a little more than 10% of the league.
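One way to arrive at figures like "the top 800 players earn 80%" is to rank the salaries and count how many top earners are needed to reach a given share of the total. The function below is a sketch of that idea; its name and the sample data are hypothetical, and the real salary list is not reproduced here.

```python
def players_needed_for_share(salaries, share):
    """Count the top earners whose combined pay reaches `share` of the total."""
    ranked = sorted(salaries, reverse=True)   # highest-paid first
    target = share * sum(ranked)
    running = 0
    for count, salary in enumerate(ranked, start=1):
        running += salary
        if running >= target:
            return count
    return len(ranked)

# Hypothetical salaries (arbitrary units), not real league data.
sample = [50, 21, 10, 9, 5, 3, 1, 1]
print(players_needed_for_share(sample, 0.5))   # top earners covering half
print(players_needed_for_share(sample, 0.8))   # top earners covering 80%

# Shares of the league quoted in the article, out of 1,999 players.
print(f"{800 / 1999:.1%} of players, {214 / 1999:.1%} of players")
```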

If we plot cumulative salary, starting on the left with the highest-paid player, Kirk Cousins, and then adding each following player's salary to the running total, we can see how skewed this distribution is, as shown below:

Contrast this with the cumulative plot of player heights, with the tallest player, Nate Wozniak, placed on the far left, as shown below:

This is almost a perfectly straight line.

Here is another way to think about it. Suppose you build a staircase in which the height of each step is proportional to the height of one player, with the tallest player as the first step above the ground and the shortest player as the last step at the top. From a distance, the differences between the steps would be almost invisible, and the staircase would look like a straight ramp.

Now do the same thing, but make each step proportional to one player's income, starting with the highest-paid player. The staircase will bend as sharply as the curve above.
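The two staircases can be compared numerically by computing cumulative shares of the total. The sketch below uses small hypothetical lists of heights and salaries, not real player data, and checks how much of the total has accumulated halfway up each staircase.

```python
from itertools import accumulate

def cumulative_shares(values):
    """Cumulative share of the total, starting from the largest value."""
    ranked = sorted(values, reverse=True)
    total = sum(ranked)
    return [running / total for running in accumulate(ranked)]

# Hypothetical data for illustration only.
heights = [80, 78, 77, 76, 75, 74, 73, 72]                      # inches
salaries = [24_000, 6_000, 2_000, 1_000, 700, 500, 480, 465]   # thousands of dollars

height_curve = cumulative_shares(heights)
salary_curve = cumulative_shares(salaries)

# Halfway along the staircase (after 4 of 8 steps):
halfway = len(heights) // 2 - 1
print(f"heights:  {height_curve[halfway]:.1%} of total height")
print(f"salaries: {salary_curve[halfway]:.1%} of total salary")
```

For the heights, the halfway point holds close to half of the total, which is the near-straight line. For the salaries, almost the entire total is already reached, which is the sharp bend.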

 

Conclusion

This analysis covers only one industry over one period of time, yet you can see that a handful of very large values sit far above the average.

In fact, the populations whose income follows a power-law distribution range very widely, from the players of a single league to all the people on Earth. This distribution is nothing like the distribution of people's heights, and nothing like rolls of the dice.

So far, all six distribution charts have been analyzed.

Together they give us a vivid example showing that the median or the mean only describes the data as a whole. If you insist on using these measures to label individuals, you have probably already stepped into a data trap.

The next article will introduce the pitfalls of statistical inference, so stay tuned.

 

Origin blog.csdn.net/qq_40433634/article/details/109123612