Basic Concepts of Data Analysis

As the Internet continues to converge, big data analysis is bound to become the work of a key strategic department.

Like many things that exist first and are justified later, data analysts came into being because some companies actually needed them. The work they do and the skills they need have kept growing and maturing since.

When it comes to data analysis, Xiao Cheng thinks of Sherlock Holmes: to solve a case, you have to analyze data.

But as ordinary technical staff, readers don't need to be as "smart" as the characters in TV dramas. Mastering general knowledge and skills is enough to do the job, and abilities can be improved continuously from there.

Some organizations have defined, based on their own understanding, the skills a data analyst should have, as in the following picture from the Internet:
[Figure: Skills for Data Analysts]

The picture is reasonable as far as it goes. Readers who are determined to become data analysts can refer to the skill requirements it lists.

As a starting point for data analysis, this article introduces several concepts that come up often in the field.

Readers may find the concepts below dry, so skimming is recommended.

(1) Average

The average here is the arithmetic mean: the sum of the values divided by the number of values (or by the total of some other unit, such as time). It is used constantly, as in "each student gets 2 iPhones on average", "the average download speed is 1 MB/s", or "the average monthly cost is 4,000 yuan".

A flaw of the mean is that extreme values distort it: when the maximum or minimum is far out of line with the rest, the average stops being representative. One remedy is to remove the extreme values and average again.

For an example of this flaw, take a look at the following picture from the web:
[Figure: an unreasonable average]

The recruiter tells the reader that the average salary is 1,800, but once the reader actually takes the job, the salary turns out to be only 800.

This is an example of how the mean can mislead.

Look at another picture:
[Figure: household income]

The income gap between different levels is very large. If the incomes of several families are collected and averaged to represent the typical family income, the result is unreliable: the rich pull the average up, and the poor get "averaged" along with them.

For this kind of data, you can remove the extreme values and recompute, report the proportion falling in each interval, or use the median or mode described below.
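The effect of one extreme value on the mean can be sketched in a few lines of Python. The salary list below is hypothetical, chosen only so that the numbers match the 1,800 versus 800 recruiting example; dropping exactly one value from each end is also just one possible way to "remove the extreme values".

```python
from statistics import mean

# Hypothetical monthly salaries: nine regular employees and one manager.
salaries = [800] * 9 + [10800]

# The plain mean is pulled up by the single large value.
overall_mean = mean(salaries)

# Drop the smallest and the largest value, then average again.
trimmed = sorted(salaries)[1:-1]
trimmed_mean = mean(trimmed)

print("mean with the extreme value:", overall_mean)
print("mean after removing extremes:", trimmed_mean)
```

With this made-up data the overall mean is 1,800 even though nine out of ten people earn 800, which is what the re-averaged figure reports.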

(2) Median

The median is the value that separates the larger half of the values from the smaller half. Extremely large or small values do not affect it, so when such extremes are present, the median is a useful reference value.

For a sorted sequence with an odd number of values, the median is the middle value. For an even number of values, the median is the sum of the two middle values divided by 2.

For example, the median of 1, 2, 3, 4, 5 is 3.

For example, the median of 1, 2, 3, 4, 5, 6 is (3+4)/2=3.5.
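The odd/even rule can be written out directly. The following is a minimal sketch, assuming a non-empty list of numbers; the helper name `median_of` is made up for illustration, and Python's standard library already offers `statistics.median` for real use.

```python
from statistics import median

def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # the rule applies to sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([1, 2, 3, 4, 5]))         # 3
print(median_of([1, 2, 3, 4, 5, 6]))      # 3.5
print(median_of([1, 2, 3, 4, 1000]))      # 3, the extreme value has no effect
print(median([1, 2, 3, 4, 5, 6]))         # 3.5, standard-library equivalent
```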

(3) Mode

The mode is the value that occurs most frequently. A data set may have no mode, or it may have several.

For example, the mode of 1, 1, 2, 5, 3, 5, 1 is 1.

For example, the modes of 5, 4, 6, 2, 5, 6 are 5 and 6.

The mode says "this is what most cases look like", which makes it a useful reference.
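A small Python sketch of finding every mode by counting occurrences follows. The function name `modes` is hypothetical, and treating "every value occurs exactly once" as "no mode" is a convention assumed here; other tools (for example `statistics.multimode`) return all values in that case instead.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, in order of first appearance."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Assumed convention: several distinct values, none repeated -> no mode.
    if top == 1 and len(counts) > 1:
        return []
    return [value for value, count in counts.items() if count == top]

print(modes([1, 1, 2, 5, 3, 5, 1]))   # [1]
print(modes([5, 4, 6, 2, 5, 6]))      # [5, 6]
print(modes([1, 2, 3]))               # []
```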

(4) Absolute numbers and relative numbers

Absolute numbers are numbers stated on their own, without comparison to anything else. For example: the temperature is 27 degrees, there are 50 students in a class, the monthly salary is 50,000 yuan.

A relative number expresses one quantity in terms of another, such as a 10% gain, less than half of someone's weight, or a ratio of 1:3.

In simple terms, absolute numbers are plain counts or measurements, while relative numbers are generally percentages (or can be converted to percentages).

(5) Percentage and percentage point

An 80% increase in cost and a 30% decrease in speed are percentages, a form that comes up constantly.

A point, or percentage point, is one unit on the percentage scale, that is, 1%.

Percentage points are generally used to describe how much a percentage itself changes. For example, going from 3% to 5% is an increase of 2 percentage points.
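The difference between a change in percentage points and a relative change in percent is easy to mix up, so here is a minimal sketch. Both function names are made up for illustration, and the inputs are assumed to be already expressed in percent (3 means 3%).

```python
def percentage_point_change(old_pct, new_pct):
    """Change measured in percentage points: a plain difference."""
    return new_pct - old_pct

def relative_change_percent(old_pct, new_pct):
    """Change measured relative to the old value, in percent."""
    return (new_pct - old_pct) * 100 / old_pct

points = percentage_point_change(3, 5)      # 2 percentage points
relative = relative_change_percent(3, 5)    # roughly 66.7 percent

print(f"{points} percentage points, {relative:.1f}% relative increase")
```

Going from 3% to 5% is therefore an increase of 2 percentage points, but the rate itself grew by about two thirds.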

(6) Proportion and ratio

A proportion is the share of a part in the whole. For example, the failure rate is 0.01% (failures out of the total of failures plus successes), or male colleagues account for 70% of all colleagues.

A ratio compares the parts with one another. For example, the ratio of female students to male students is 1:3.
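To make the distinction concrete, the sketch below computes both figures from the same counts. The class of 10 female and 30 male students is hypothetical, picked to reproduce the 1:3 ratio, and the helpers assume positive whole-number counts.

```python
from math import gcd

def proportion(part, whole):
    """Share of the part in the whole, as a fraction between 0 and 1."""
    return part / whole

def ratio(a, b):
    """Ratio of one part to another, reduced to lowest terms."""
    divisor = gcd(a, b)
    return (a // divisor, b // divisor)

# Hypothetical class: 10 female and 30 male students.
female, male = 10, 30

female_to_male = ratio(female, male)                 # part compared with part
female_share = proportion(female, female + male)     # part compared with whole

print(f"ratio {female_to_male[0]}:{female_to_male[1]}")   # ratio 1:3
print(f"proportion {female_share:.0%}")                   # proportion 25%
```

The same class gives a ratio of 1:3 but a proportion of 25%, because the proportion divides by the whole class rather than by the other part.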

(7) Multiples and fan (doublings)

Generally, increases are expressed as multiples, such as 2 times. Decreases are expressed as percentages, such as a 30% decrease in revenue. Percentages can of course also be used for increases, such as a 300% increase in the number of participants.

The fan number counts doublings: N fan means 2 to the Nth power times the original.

If net income has doubled once (one fan), it has increased by 1 time the original amount, making it 2 times the original (2 to the power of 1).

Two fan means 4 times the original (2 to the power of 2), three fan means 8 times, and so on.
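The relation between the fan number, the resulting multiple, and the equivalent percentage increase can be tabulated with a short sketch. The function names are hypothetical; the arithmetic is just 2 to the Nth power and "multiple minus one".

```python
def multiple_after_fan(n):
    """Multiple of the original value after n doublings (n fan)."""
    return 2 ** n

def percent_increase(multiple):
    """Percentage increase that corresponds to a given multiple."""
    return (multiple - 1) * 100

for n in range(1, 4):
    m = multiple_after_fan(n)
    print(f"{n} fan -> {m} times the original, a {percent_increase(m)}% increase")
```

Note that "4 times the original" and "a 300% increase" describe the same change, which is why multiples and percentage increases should not be mixed without checking which one is meant.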

(8) Year-on-year and month-on-month

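Year-on-year (YoY) compares a period with the same period of the previous year. For example, if it is now May, the year-on-year comparison is against May of last year: "this month's major failures fell by 30% year-on-year".

Month-on-month (MoM) compares a period with the one immediately before it and is used to show trends, for example this week against last week, or this month against last month.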
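Both comparisons use the same percent-change formula and differ only in which earlier period serves as the baseline. The sketch below assumes a small dictionary of monthly failure counts; the months and numbers are invented so that the year-on-year result matches the 30% drop in the example.

```python
def percent_change(current, previous):
    """Percent change of current relative to previous."""
    return (current - previous) * 100 / previous

# Hypothetical counts of major failures per month.
failures = {"2023-05": 10, "2024-04": 8, "2024-05": 7}

# Year-on-year: May this year against May last year.
yoy = percent_change(failures["2024-05"], failures["2023-05"])

# Month-on-month: May against April of the same year.
mom = percent_change(failures["2024-05"], failures["2024-04"])

print(f"YoY: {yoy:+.1f}%")   # YoY: -30.0%
print(f"MoM: {mom:+.1f}%")   # MoM: -12.5%
```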


To sum up, this article briefly introduced concepts that come up often in data analysis, such as the average, percentages, the fan number, year-on-year and month-on-month.
