The Seven Big Data Traps, Slippery Statistics: Disheartening Descriptions

In statistics, the most basic and common branch is descriptive statistics: a data set is condensed into a few indicators that describe or summarize it.

For example:

  • The average income of all employees of a company
  • The college entrance examination score range of a class
  • The variability of returns on a stock portfolio
  • Average height of players in a team
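
As a rough illustration of indicators like these, here is a minimal Python sketch. The income figures are made up for illustration and do not come from any real company.

```python
import statistics

# Hypothetical monthly incomes for a small company (made-up numbers).
incomes = [3200, 3500, 3700, 4100, 4400, 5200, 15000]

mean_income = statistics.mean(incomes)        # arithmetic average
median_income = statistics.median(incomes)    # middle value when sorted
income_range = max(incomes) - min(incomes)    # distance between the extremes
income_stdev = statistics.stdev(incomes)      # sample standard deviation

print(f"mean={mean_income:.1f} median={median_income} "
      f"range={income_range} stdev={income_stdev:.1f}")
```

In this made-up sample, a single high earner pulls the mean above six of the seven incomes, while the median stays at 4100.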

Some people may ask: descriptive statistics is nothing more than summarizing data. Can something that simple really involve traps?

Of course it can.

Earlier in this series, even simple addition turned out to hide traps.

Descriptive statistics involves slightly more complicated quantities, such as the mean and the standard deviation, which leaves even more room for mistakes.

Descriptive statistics covers the central tendency of a data set, which is measured by statistics such as the mean and the median.

The most common mistakes do not happen when calculating these indicators; the formulas are really not difficult.

The real difficulty with measures of central tendency arises when they are presented to people. Shown a mean, for example, some people assume that because the mean is such-and-such, the individual values in the data set must look much the same.

Of course, this is a very lazy way of thinking, and lazy thinking easily falls into the pit. Here is an example from the sports world.

 

Here comes the example

The average player in the National Football League has the following statistics:

He is 25 years old, about 6 feet 2 inches tall, weighs 244.7 pounds, earns 1.5 million U.S. dollars a year, wears jersey No. 51, and has 13 characters in his full name (including spaces, hyphens, and so on).

These statements are statistical facts, derived from data on the 2,874 active players on the 2018 preseason rosters of the 32 National Football League teams.

Seeing these numbers, some people may think: pick any player, and his attributes will be very close to these averages; the error should not be large.

If that is your instinct, you are not far from the pit. Would you be surprised to find a player who is 9 feet 3 inches tall (50% above the mean)?

Some people would be, though they need not be. Clearly, preconceptions are at work here.
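
The height figure can be checked with a little arithmetic. The sketch below converts the mean height to inches, adds 50%, and converts back; for contrast, it applies the same 50% shift to the mean jersey number, a comparison added here purely for illustration.

```python
# Mean height from the text: 6 feet 2 inches.
mean_height_in = 6 * 12 + 2            # 74 inches

# A height 50% above the mean.
tall_in = mean_height_in * 1.5         # 111.0 inches
feet, inches = divmod(tall_in, 12)     # split back into feet and inches
print(f"{int(feet)} ft {int(inches)} in")

# The same 50% shift applied to the mean jersey number (illustrative contrast).
shifted_jersey = 51 * 1.5
print(shifted_jersey, 1 <= shifted_jersey <= 99)
```

A 50% shift produces an impossible height, yet it lands on a jersey number that is still well inside the 1 to 99 range. How surprising a deviation is depends on the shape of the distribution, not on the mean alone.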

Draw a histogram for each of the six attributes above, as shown below.

The shapes are all different. Try to guess which attribute each histogram represents.

The attribute behind each figure will be revealed later.

Now, try the matching exercise shown in the figure below.

Looking at these different distributions, think about whether the central tendency is representative of the data as a whole.

For example, in Figure A the leftmost bar is lower, and the rest of the data stays at roughly the same level. At the right end, the density does not taper off; the data simply stops. Which attribute could this be?

Now look at E. There is a dip in the middle. What could this be? No hurry; take your time guessing.

 

Okay, ten minutes is usually enough to work out the answers, which are shown in the figure below.

Reflect on the answers. What helped you get them right: common sense, background knowledge, quantitative estimation, or something else?

 

A. Uniform distribution: jersey numbers

Once the answer is revealed, it seems perfectly reasonable, doesn't it?

Let's take a look. In a perfectly uniform distribution, every value is equally likely to be drawn.

Of course, empirical data sets from the real world almost never follow a distribution perfectly. But as the figure below shows, the data is highly consistent overall. The exception is the first bar on the left, which accounts for only about 5% of the players.

 

The mean jersey number in this data set is 51.

Some background: jersey numbers run from 1 to 99, and there is no number 100.

This is why the histogram stops abruptly at the right end.

Given this distribution, someone who randomly samples one value, or a few, might well guess that the values should not differ much from the mean, and so boldly assume that the mean represents most of the data!

At that point, they have fallen all the way into the pit.
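
To see how poorly the mean stands in for individual values here, consider a minimal sketch. It assumes an idealized roster in which every number from 1 to 99 is equally common; the real data only approximates this.

```python
# Assumption: jersey numbers are perfectly uniform over 1..99 (idealized).
numbers = range(1, 100)
mean = sum(numbers) / len(numbers)

# Average distance between a number and the mean.
mean_abs_dev = sum(abs(n - mean) for n in numbers) / len(numbers)

# Share of numbers that fall within 5 of the mean.
near_share = sum(abs(n - mean) <= 5 for n in numbers) / len(numbers)

print(f"mean={mean}, mean abs deviation={mean_abs_dev:.2f}, "
      f"within 5 of mean={near_share:.1%}")
```

Under this assumption, a randomly chosen number sits almost 25 away from the mean on average, and only about one number in nine falls within 5 of it.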

However, here is the interesting part. During the 2018 preseason, only 27 of the 2,874 active players wore exactly No. 51.

This means that guessing the mean gives you a less than 1% chance of naming any given player's jersey number.

A bit of background: according to official rules, only players in the "center" position (the player in the middle of the offensive line who snaps the ball to the quarterback) can wear this number.

Setting aside the outliers, you will also find that whichever number you guess, you are right about 1% of the time.
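
Both percentages are easy to verify. The sketch below uses the two counts quoted above; the 1-in-99 baseline is an idealization that assumes every number is equally common.

```python
players_total = 2874      # active players, from the text
players_with_51 = 27      # players wearing No. 51, from the text

# Chance that a randomly chosen player wears No. 51.
share_51 = players_with_51 / players_total

# Idealized baseline: 99 equally likely numbers (an assumption).
uniform_share = 1 / 99

print(f"No. 51: {share_51:.2%}, uniform baseline: {uniform_share:.2%}")
```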

In 2018, the most common jersey number on the league's rosters was No. 38. Even if you guess No. 38 instead of the mean, you are right only 1.347% of the time.

Draw another histogram of the same data with the bin width set to 1, as shown in the figure below:
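
A histogram with a bin width of 1 is simply a count per jersey number. Here is a minimal sketch that uses a simulated roster, not the real 2018 data: it draws 2,874 numbers uniformly from 1 to 99 and counts them.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated roster, NOT the real data: 2,874 numbers drawn uniformly from 1..99.
roster = [random.randint(1, 99) for _ in range(2874)]

counts = Counter(roster)                   # one bin per jersey number
number, count = counts.most_common(1)[0]   # the mode and its frequency

for n in range(1, 11):                     # text histogram of the first ten bins
    print(f"{n:2d} {'#' * counts[n]}")
print(f"most common: No. {number}, {count / len(roster):.2%} of players")
```

Even in simulated data, the most common number covers only a small share of the players, so guessing the mode is not much better than guessing any other number.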

 

Statistics novices should at least take away the following:

For a uniform distribution, what you need to know is the minimum and maximum values. The mean and median simply sit at the center of that range and provide no additional information.
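
A quick check of this claim, as a sketch that assumes an idealized data set containing every number from 1 to 99 exactly once:

```python
import statistics

# Idealized uniform data: each jersey number 1..99 exactly once (an assumption).
values = list(range(1, 100))

# The midrange needs only the two endpoints.
midrange = (min(values) + max(values)) / 2

print(midrange, statistics.mean(values), statistics.median(values))
```

All three values coincide, so once the minimum and maximum are known, the mean and median add nothing new.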

But is No. 51 the "typical" jersey worn by NFL players? Of course, it is within the range of possible numbers, and no one would call it atypical, any more than jersey No. 1.

But the word "typical" conveys no useful information here. After all, each team has only a few centers.

In the next article, we will look at the second distribution, the most famous one in statistics: the normal distribution.

 


Origin blog.csdn.net/qq_40433634/article/details/108968547