How to deal with the era of big data

In recent years, big data has gradually penetrated everyday life, from medical care to credit scoring; it now touches virtually every industry.
The term "big data" alone tells us that the volume of data is enormous. If the results are presented as nothing but raw numbers, most people cannot stare at them for more than ten seconds before their head starts to swim, and our customers' even more so. If this problem is not solved, it will seriously hamper the development of big data. Out of this need a profession has emerged: the data visualization engineer, whose job is to make the results of big data clear at a glance, reducing the time and the skill a reader needs to understand them.
This tutorial will be improved over time, with the goal of becoming a classic online course for training data visualization engineers.
Now let us begin the course: how to deal with the era of big data! I have summed up three effective techniques.
The three techniques:
● Abandon inaccurate sample data and analyze all data statistically.
Until now, the data we have been able to obtain and collect has been very limited, so the usual approach has been "random sampling analysis".
Definition of random sampling analysis: a method of estimating some characteristic of a population from a sample drawn according to the principle of randomness, that is, in a way that gives every unit in the population an equal chance of being selected.
Advantages: when inferring properties of the population from the sample data, the reliability of the inference can be measured objectively in terms of probability, which puts the inference on a scientific footing. For this reason, random sampling analysis is widely used in social surveys and social research.
Disadvantages: it is only practical when the number of units in the population is limited, otherwise the work of enumerating them is enormous; for complex populations it is difficult to guarantee that the sample is representative; and it cannot exploit what is already known about the population. When the scope of a survey is limited, or the subjects are poorly understood and hard to classify, designing a sound sample requires a good prior understanding of every unit in the population, which is usually impossible before the actual investigation, so the sample ends up poorly representative.
For example, to find out how satisfied Chinese citizens are with a policy, it is impossible to survey every citizen. The usual practice is to randomly select 10,000 people and let the satisfaction of these 10,000 represent everyone's.
To make the result as accurate as possible, we design the questionnaire as precisely as possible and make the sample sufficiently random.
This is the practice of the "small data era": when it is impossible to collect all the data, random sampling analysis has achieved great success in many fields.
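To make the sampling approach concrete, here is a minimal Python sketch (purely illustrative; the "true" satisfaction rate is made up) of estimating a population proportion from a simple random sample of 10,000 respondents, together with the usual 95% margin of error:

```python
import math
import random

random.seed(42)

TRUE_SATISFACTION = 0.63   # hypothetical "true" rate that, in reality, nobody knows
SAMPLE_SIZE = 10_000

# Simple random sample: every citizen is equally likely to be picked, and each
# sampled citizen answers 1 (satisfied) or 0 (not satisfied).
sample = [1 if random.random() < TRUE_SATISFACTION else 0 for _ in range(SAMPLE_SIZE)]

estimate = sum(sample) / SAMPLE_SIZE
# Standard 95% margin of error for a proportion from a simple random sample.
margin = 1.96 * math.sqrt(estimate * (1 - estimate) / SAMPLE_SIZE)

print(f"estimated satisfaction: {estimate:.3f} +/- {margin:.3f}")
```

With 10,000 respondents the margin of error is only about one percentage point, which is why sampling worked so well in the small data era.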
However, here comes the problem:
1. It depends on randomness, and true randomness is difficult to achieve. For example, randomly calling 10,000 households on landline phones is not truly random, because it ignores the fact that young people mostly use mobile phones.
2. It looks fine from a distance, but blurs as soon as you zoom in on any particular point. For example, we use 10,000 people, selected at random from the whole country, to represent the whole country; but if the same result is used to judge satisfaction in Tibet alone, it lacks precision. In other words, the analysis results cannot be applied to sub-populations (a rough calculation after this list shows how severe this is).
3. The results of a sample can only answer the questions you designed in advance; they cannot answer questions you only realize you have later.
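To see why point 2 bites, here is a rough back-of-the-envelope sketch (the population figures are approximate) of how few people from a single region end up in a nationwide sample of 10,000, and how wide the resulting margin of error becomes:

```python
import math

NATIONAL_POPULATION = 1_400_000_000   # approximate
TIBET_POPULATION = 3_600_000          # approximate
SAMPLE_SIZE = 10_000

# Expected number of respondents from Tibet in a simple nationwide random sample.
tibet_in_sample = SAMPLE_SIZE * TIBET_POPULATION / NATIONAL_POPULATION
print(f"expected respondents from Tibet: {tibet_in_sample:.0f}")   # about 26

# 95% margin of error for a proportion near 50% with only that many respondents.
margin = 1.96 * math.sqrt(0.25 / tibet_in_sample)
print(f"margin of error for Tibet alone: +/- {margin:.0%}")        # roughly +/- 19%
```

A nationwide sample that is accurate to about one percentage point overall can be off by nearly twenty points for a single region.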
In the "big data era", sample = population. Today, we have the ability to collect comprehensive and complete data.
Usually what we call big data is based on mastering all the data, at least as much data as possible.
● Focus on the completeness and messiness of the data, and relax the demand for accuracy in any single data point.
In the era of "small data", the first thing we need to solve is to reduce measurement errors. Because the information collected by itself is relatively small, it is necessary to ensure that the results are as accurate as possible It is necessary to ensure that the recorded information is correct, otherwise subtle errors will be magnified infinitely. Therefore, we must first optimize the measurement tool. And this is how modern science has developed. Kelvin, the physicist who formulated the international unit of temperature, once said: "Measuring is cognition." To be a good scientist you must be able to collect and manage data accurately.
In the era of "big data", we can easily obtain all the data, and the number is huge to trillions of data. Because of this, if we pursue the accuracy of each data, it will be unimaginable. If the accuracy of the data is weakened, the confusion of the data is inevitable.
However, if the amount of data is large enough, the confusion it brings will not necessarily bring bad results. It is for this reason that we relax the standards of data, and the more data we can collect, we can use this data to do more things.
To give an example: to measure the salt content of an acre of land with only one measuring instrument, that instrument must be accurate and must work continuously. But if there is one sensor per square meter, some individual readings will be wrong, yet combining all of them still yields a more accurate overall result.
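A quick simulation of that idea (the salt level, noise levels, and sensor count below are all made up for illustration) compares one precise instrument against thousands of cheap, noisy sensors whose readings are simply averaged:

```python
import random
import statistics

random.seed(0)

TRUE_SALT = 3.5        # hypothetical true salt content of the field
NOISY_SENSORS = 4000   # imagine one cheap sensor per square meter

# One precise instrument: small measurement error.
single_reading = TRUE_SALT + random.gauss(0, 0.05)

# Many cheap sensors: each reading is far noisier, and a few are outright broken.
readings = [TRUE_SALT + random.gauss(0, 1.0) for _ in range(NOISY_SENSORS)]
for i in range(0, NOISY_SENSORS, 500):   # every 500th sensor reports garbage
    readings[i] = random.uniform(0, 20)

combined = statistics.mean(readings)
print(f"single precise instrument: {single_reading:.3f}")
print(f"average of {NOISY_SENSORS} noisy sensors: {combined:.3f}")
```

Individually the cheap readings are nearly useless, but averaged together they land very close to the true value; with enough of them, the combined estimate rivals or beats the single precise instrument.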
Therefore, "big data" usually speaks in terms of more convincing probabilities, rather than relying on the precision of measurement tools. This requires us to re-examine the idea of ​​​​obtaining and collecting data. Due to the particularly large amount of data, we give up individual accuracy, and certainly cannot achieve individual accuracy.
For example, on a computer every file can be found through a path: to find a song, you first find the partition, then its folder, and finally, step by step, the song you want. That is the traditional method. If a computer has only a few partitions or folders, this kind of search works, but what if there were 100 million partitions, or a billion folders? The data on the network far exceeds the files on a personal computer and easily runs into the billions of items. If a strict classification scheme were used, both the people doing the categorizing and the people doing the querying would go mad. Therefore "tags" are now widely used on the Internet to retrieve pictures, videos, music and so on. Of course, people sometimes apply the wrong tag, which pains those who are used to precision, but accepting this "clutter" also pays dividends:
Because there are far more tags than categories, we can retrieve more content.
Content can be filtered by combining tags.
For another example, suppose we search for "white dove". "White dove" is associated with many things: an animal, say, or a brand, or a celebrity. Under a traditional taxonomy, "white dove" would be filed separately under the animal category, the brand category, and the celebrity category. One consequence is that the person searching may not know the other categories exist: if they only want the animal "white dove", they will never look in the brand or celebrity category. With tags, however, entering "white dove" + "animal" finds the desired result, as does "white dove" + "brand" or "white dove" + "celebrity".
It can be seen that using "tags" instead of "categories", despite introducing plenty of imprecise data, makes our searches more convenient precisely because of the large number of tags.
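A tiny sketch of the difference (the item names and tags below are invented for illustration): with tags, retrieval is simply a question of which items carry every tag in the query, however the tags are combined:

```python
# Hypothetical tag index: each item carries several tags instead of one category.
tagged_items = {
    "white_dove_photo.jpg":   {"white dove", "animal", "bird"},
    "white_dove_soap_ad.png": {"white dove", "brand", "advertisement"},
    "white_dove_concert.mp4": {"white dove", "celebrity", "music"},
}

def search(*wanted_tags):
    """Return every item whose tag set contains all of the requested tags."""
    wanted = set(wanted_tags)
    return [name for name, tags in tagged_items.items() if wanted <= tags]

print(search("white dove", "animal"))   # ['white_dove_photo.jpg']
print(search("white dove", "brand"))    # ['white_dove_soap_ad.png']
print(search("white dove"))             # all three items
```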
● Think about the correlations in the data, and give up the single-minded pursuit of causality.
Study the data itself first; there is no need to dig into the reasons the data came out this way. Let the data speak for itself.
Case in point:
Walmart is the largest retailer in the world and holds an enormous amount of retail data. Through sales data analysis, Walmart found that every time a seasonal hurricane approached, sales of flashlights and egg tarts both increased. As a result, when a seasonal hurricane is coming, Walmart places its stock of egg tarts next to the hurricane supplies to encourage customers to buy them.
Someone will surely ask, "Why do people buy egg tarts when a hurricane is coming?"
And this "why" is a causal relationship. And this "cause" is extremely difficult and complicated to analyze, and even if it is finally obtained, it is of little significance. For Walmart, when the hurricane hits, just lay out the tarts and that's it. This is where the data speaks for itself.
And we know that hurricanes have something to do with egg tarts, and it's okay to make money.
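The Walmart figures are not public, but the kind of analysis involved is easy to sketch (the daily numbers below are fabricated purely to illustrate the technique): compute the correlation between a hurricane indicator and tart sales, and act on it without ever modeling the cause:

```python
import statistics

# Fabricated daily data: 1 = a seasonal hurricane is approaching, 0 = a normal day.
hurricane  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
tart_sales = [120, 115, 340, 130, 310, 355, 118, 125, 330, 122]

# Pearson correlation between the hurricane indicator and tart sales (Python 3.10+).
r = statistics.correlation(hurricane, tart_sales)
print(f"correlation(hurricane, tart sales) = {r:.2f}")

# No attempt to explain *why*: a strong positive correlation alone is enough to
# justify stocking tarts next to the hurricane supplies when a storm is forecast.
```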
This is the way to deal with the era of big data: think about the correlations in the data and abandon the single-minded pursuit of causality.
This approach can help us understand the world better. Sometimes, causal reasoning can even give us wrong ideas.
For example: our parents told us to wear a hat and gloves when it is cold, or we would catch a cold. However, that is not actually how colds work. Or we eat at a restaurant, suddenly get a stomach ache, and assume the food was the problem, when in fact it is just as likely to be related to germs we were exposed to elsewhere.
Correlations provide a new perspective when analyzing problems, letting us hear what the data itself says. However, causality should not be abandoned entirely; rather, it should be examined on the foundation of scientifically established correlations.

A new question now arises: how do we make data clear at a glance in the era of big data? That is what the rest of this tutorial sets out to answer.
