ADP-01: Seven major data pitfalls. How many have you fallen into?

While reading Avoiding Data Pitfalls by Ben Jones, I noticed that in the first chapter the author quotes Joyce Brothers: "You need to give yourself permission to be human." The point is that you and I are mortals and will inevitably make mistakes. Nobody who works with data has avoided every pit, and most of us have fallen in more than once. Sometimes I only realized it long after falling in, and the regret is hard to describe. I hope these mistakes and lessons can pave the way for future data work.

Most data analysis practitioners have probably been in this situation: you are presenting a data report with striking insights, polished and consistent charts, and conclusions that seem beyond doubt. Just then, someone challenges you: how do you know which data in the collection or database is incomplete or problematic, how did you deal with it and on what basis, and how did you establish that the data you are using is actually fit for use? If you are not well prepared, questions like these are sure to make any presenter extremely nervous.

Being unaware of a cognitive blind spot, or ignoring it, is the root cause of falling into data pitfalls. In recent years especially, a wealth of data technologies has let data workers obtain ready-made data for analysis and research. Gradually, data analysts have stopped paying attention to how data flows at the engineering level, and some companies no longer require analysts to have this ability. From the perspective of division of labor, this is reasonable. The problem is that once the data goes wrong at the engineering level, analysts often fail to detect it in time and keep processing the already flawed data out of habit, which obviously leads to bad results. In practice, therefore, data analysts and data scientists still need solid data engineering skills.

Of course, data pitfalls have not materially held back the progress of data work. Over the years, data methods and technologies have given people a deeper understanding of chromosomes and nervous systems, and of geological and climatic phenomena, and have made it possible to draw extensive, detailed maps of galaxies. Product recommendation in e-commerce has become commonplace. Data applications of every kind are too numerous to list.

Correspondingly, as the saying goes, when virtue rises one foot, vice rises ten, and the harm done by data pitfalls keeps rising too. Misuse of data alone has caused great harm and loss, such as the failed predictive models of Wall Street financial analysts and the inaccuracies in Google's flu trend predictions. Clearly, our data applications are not always successful, and sometimes they backfire on us.

People seem to have an inertia about mistakes. When someone else makes one, we as bystanders often spot the problem quickly. When we make it ourselves, we may notice very late, sometimes only after being reminded, by which time the problem is no longer small. When this happens, don't complain; troubleshoot the problem and fix it as soon as possible.

There are many data pitfalls, and I am grateful to Ben Jones for organizing them into seven categories, which are introduced below.

Pitfall 1: epistemic errors: how we think about data

People often ask: what can data tell us? Perhaps the more important question is: what can't data tell us? Epistemology, a branch of philosophy, studies the reliability of knowledge and helps distinguish a justified belief from a merely subjective assumption. People often process data with a flawed way of thinking or on flawed premises, which leads to errors such as:

· Assuming that the data we use is a perfect reflection of reality

· Drawing conclusions about the future based on historical data alone

· Using data to confirm previously held beliefs rather than to test whether they are true

Avoiding cognitive errors, thinking clearly, and checking whether our premises are reasonable are the foundation of competent data analysis.

Pitfall 2: technical traps: how we process data

When we decide to analyze or research a specific problem with data, we usually go through these processing steps to get the data into the correct form: collection, storage, joining, transformation, and cleaning. Improper handling can lead to:

· Dirty data entries, with values that do not match their category levels

· Inconsistent or incompatible measurement units or date fields

· Missing or duplicated values when different data sets are combined, which changes the original data distribution

These steps are complicated and tedious, but they are what ensures an accurate analysis. Information is sometimes lost along the way, so think carefully about each step, and ideally record why you did it, so that every modification leaves a trace for later review. If you ignore problems in the data set and its processing and rush to a conclusion, you will certainly fall into a pit and feel the pain.
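As a rough illustration of how combining data sets can introduce duplicates and shift totals, here is a minimal sketch in plain Python. The tables, field names, and the helper functions left_join and duplicate_keys are all made up for this example; they are not from the book.

```python
from collections import Counter


def left_join(left, right, key):
    """Naive left join: each left row repeats once per matching right row."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # A left row with no match is kept once, unchanged.
        for match in index.get(row[key], [None]):
            merged = dict(row)
            if match is not None:
                merged.update(match)
            joined.append(merged)
    return joined


def duplicate_keys(rows, key):
    """Return the key values that appear more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical data: customer "A" was accidentally entered twice.
orders = [
    {"order_id": 1, "customer_id": "A", "amount": 100},
    {"order_id": 2, "customer_id": "B", "amount": 50},
]
customers = [
    {"customer_id": "A", "region": "East"},
    {"customer_id": "A", "region": "East"},
    {"customer_id": "B", "region": "West"},
]

joined = left_join(orders, customers, "customer_id")
total_before = sum(r["amount"] for r in orders)  # 150
total_after = sum(r["amount"] for r in joined)   # 250: order 1 is counted twice

print("duplicate keys:", duplicate_keys(customers, "customer_id"))
print("rows before/after join:", len(orders), len(joined))
print("total before/after join:", total_before, total_after)
```

Comparing row counts and totals before and after each join, and checking the lookup table for duplicate keys first, is one simple way to leave the kind of trace described above.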

Pitfall 3: mathematical mistakes: how we calculate data

Data processing is inseparable from mathematical calculation. Handling quantitative data improperly can lead to errors such as:

· Aggregating data at mismatched levels

· Miscalculating rates or ratios

· Confusing proportions with percentages

· Combining data measured in different units

The examples above all involve taking existing data fields and using them to create new ones. This does not seem difficult, but in real work the error rate is not low, and the resulting problems are usually not small. For example, in 1999 a mismatch between imperial and metric units in navigation software caused NASA's Mars Climate Orbiter to be lost, at a cost of 125 million U.S. dollars. That is no longer a pitfall; it is simply a black hole.
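To make the aggregation problem concrete, here is a small sketch showing that averaging per-group rates is not the same as computing the rate over the pooled totals. The store figures and the function names mean_of_rates and overall_rate are hypothetical, chosen only for illustration.

```python
def mean_of_rates(rows):
    """Average the per-store conversion rates (each store weighted equally)."""
    return sum(r["buyers"] / r["visitors"] for r in rows) / len(rows)


def overall_rate(rows):
    """Conversion rate over all visitors pooled together."""
    total_buyers = sum(r["buyers"] for r in rows)
    total_visitors = sum(r["visitors"] for r in rows)
    return total_buyers / total_visitors


# Hypothetical stores: a large one with a 10% rate, a tiny one with a 50% rate.
stores = [
    {"store": "A", "visitors": 1000, "buyers": 100},
    {"store": "B", "visitors": 10, "buyers": 5},
]

print(f"mean of store rates: {mean_of_rates(stores):.1%}")
print(f"overall rate:        {overall_rate(stores):.1%}")
```

Neither number is wrong in itself; the mistake is reporting one when the question calls for the other, so it helps to state the level of aggregation next to every ratio.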

Pitfall 4: statistical errors: how we compare data

"Lies, damned lies, and statistics" is a well-known saying in the West. It implies that some people manipulate numbers to mislead others. But when we work with statistics ourselves, we can also use them to deceive ourselves. Whether in descriptive or inferential statistics, opportunities for confusion are everywhere:

· Does our measure of central tendency or variation lead our research astray?

· Does the sample we are studying represent the population we expect to study?

· Is the comparison method we use effective and statistically reasonable?

There are countless similar questions, and the details are easy to miss. Because of the way of thinking that statistical problems demand, even experts sometimes get things wrong. For example, a true "simple random sample" is rarely easy to obtain, and explaining statistical significance or a p-value to a beginner is no easy task.
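As one small illustration of how a measure of central tendency can mislead, the sketch below compares the mean and the median of a skewed sample using Python's standard statistics module. The salary figures are invented for this example.

```python
from statistics import mean, median

# Hypothetical monthly salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 35, 38, 40, 400]

avg = mean(salaries)
mid = median(salaries)

# Count how many people earn less than the "average" salary.
below_average = sum(1 for s in salaries if s < avg)

print(f"mean:   {avg:.1f}")
print(f"median: {mid:.1f}")
print(f"{below_average} of {len(salaries)} salaries are below the mean")
```

With a single extreme value in the sample, the mean describes almost nobody, while the median stays close to what a typical person earns.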

Pitfall 5: analytical bias: how we analyze data

Data analysis is the core of data work: it is what produces conclusions and guides decisions. Although data analyst is a dedicated position, in practice many other people also rely on data analysis. Done well, analysis can raise the quality of work considerably; done poorly, it can seriously lower it, for example through:

· Overfitting a model to historical data

· Missing important information in the data set

· Invalid extrapolation or interpolation

· Using metrics that are not representative

Take Google's flu trend prediction as an example. Even though the search algorithm has kept improving and user feedback has been incorporated, this is still not enough to convince people that it can accurately predict the number of flu cases. In fact, it is still far from that.
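The overfitting item in the list above can be sketched with a toy comparison. This is only an illustration, not how any real flu model works: the data points are invented, roughly following y = 2x with noise, and the functions fit_line, memorize, and mse are hypothetical helpers written for this example.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(points):
    """A model that memorizes the history: it returns the y of the nearest known x."""
    def predict(x):
        nearest = min(points, key=lambda p: abs(p[0] - x))
        return nearest[1]
    return predict


def mse(model, points):
    """Mean squared error of model on points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Invented historical data (noisy) and new data (following y = 2x exactly).
train = [(1, 2.5), (2, 3.4), (3, 6.8), (4, 7.3), (5, 10.9), (6, 11.2)]
test = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8), (4.6, 9.2), (5.4, 10.8)]

line = fit_line(train)
memorizer = memorize(train)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(f"{name}: train MSE = {mse(model, train):.3f}, test MSE = {mse(model, test):.3f}")
```

The memorizing model reproduces the historical data perfectly yet does worse on new points than the simple line, which is why evaluating on data the model has never seen is a common guard against this pitfall.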

Pitfall 6: graphical errors: how we visualize data

Because visuals are intuitive and vivid, problems in this area are the ones most often noticed and discussed. Sometimes a visualization is dazzling and incomprehensible even to people in the industry: a pile of pie charts and column charts with countless slices, a y-axis off to one side, and no clue where to look. Fortunately, these problems have been studied for a long time and can be identified with the following questions:

· Does the chart match the current theme?

· If a point of view has been clearly expressed, is it necessary to put so much effort into the chart?

· Have rules of thumb been overused?

Of course, if we have fallen into any of the first five pitfalls, a perfectly chosen and correct chart is meaningless. But if we avoid the first five and then stumble at this step, it is a real pity.

Pitfall 7: design danger: how we dress up data

People generally appreciate good design. We drive to work in a well-designed car whose controls are all in the right place, and we sit at our desks in chairs that fit the contours of our bodies. So why should we then open a browser to look at garish but dull infographics or clumsy, tedious data dashboards? Design clearly matters, and we need to consider the following points:

· Does the color choice confuse the audience or make things clearer?

· Did you use enough creativity to appropriately beautify the chart, or did you miss an opportunity to add value to the chart with aesthetic elements?

· Is the created visual object easy to interact with, or is it confusing to use?

Good design elements can make the audience focus on the information to be expressed, rather than other irrelevant objects.

To sum up

These seven pitfalls are like seven double-edged swords: any of them can make or break our data work. But there is no need to fear them. When we find ourselves in one, we can learn to recover quickly or, better yet, learn to avoid them altogether.

In follow-up posts, these seven pitfalls will be explained in detail along with countermeasures. There will be a big easter egg at the end. Stay tuned.

 

Main references: Avoiding Data Pitfalls, Ben Jones; Statistics: From Data to Conclusions, Wu Xizhi.

For more content, please follow the official account of Haidata Lab.


Origin blog.csdn.net/qq_40433634/article/details/108771238