Garbage in, Garbage out!!!


Translator: Pan Yingkun, Business Administration, Central University of Finance and Economics

Mailbox: [email protected]

Avoiding Garbage In, Garbage Out: The Importance of Data Quality

High-quality data is an important factor for effective analysis

High-quality analysis depends on many factors: an in-depth understanding of the business problem being analyzed, and a team of experienced, knowledgeable data professionals with the right tools and techniques to perform the analysis.

But the most important ingredient for effective analysis is high-quality data. In this three-part article, we look at what high-quality data is and how to ensure that your analysis is based on the best data possible. Essentially, data quality comes down to three factors: input data, methodology, and quality control. In Part 1, we examine the input data.

Part 1: Understanding the input data

As the saying goes: garbage in, garbage out. High-quality input data is the foundation of reliable models and data sets. No matter how good a model is, if the input data used to build and run it are of poor quality (incomplete, outdated, biased, or inaccurate), the resulting predictions or data sets are very unlikely to be reliable.

You will sometimes hear data providers bluntly admit that no data is perfect. The reality is that data depend on how, where, when, and from whom they are collected, and any of these aspects can be a source of bias or error. It is therefore essential to understand the source of the input data and establish how trustworthy they are before starting any analysis.

Data, no matter how "new", are only a fleeting snapshot, and that snapshot inevitably lies in the past. Because of this, knowing when (recency and frequency) and how (process) the data were collected is essential for judging how "clean" they are, and it helps researchers make informed choices about methods and types of analysis.

The recency of the input data largely determines its ability to reflect the current state of affairs. All else being equal, data that are five years old are bound to be less representative than data that are five minutes old. The frequency of data collection also matters, because it affects the types of models researchers can use, how often those models can be calibrated, and how often the related predictions can be refreshed. As researchers, we face the unchangeable fact that we must use history to predict the future. It is our job to judge how well historical data reflect the present or predict the future, and to make adjustments when necessary. This is where a researcher's skills, experience, and domain knowledge come into play: building most models is fairly simple, and the real challenge is to use the results wisely.

The second key part of understanding input data is knowing how the data were collected. The collection process is never flawless, and that often introduces errors, outliers, and biases into the resulting data. Although researchers usually have little control over flaws in the collection method, it is important to understand them. For example, data on purchasing behavior collected through surveys will differ markedly from data collected at the point of sale (POS): what people say they do is often very different from what they actually do. Accordingly, the way researchers treat data from a survey system should differ from the way they treat data from a POS system. In some cases, the where, how, when, and from whom of data collection will sharply limit the types of techniques and analyses we can use.

When we receive data, we run a series of checks and ask a series of questions before putting the data to use. Here are some of the elements we check, and the level of detail we drill down to, when evaluating input data (a code sketch illustrating these checks follows the list):

1. How many unique records are there?
    1.1. How many duplicate records are there? Should there be duplicates?

2. How many fields are available in the data set, and what data types are they?
    2.1. For string fields, should they have a specific structure? For example, a postal-code column should contain 6 characters, and those characters should follow a specific pattern.
        a. How many records meet the required format?
        b. Is there a way to clean up records that do not conform to the required format?
        c. Should the cleaned records be treated differently in later analysis?
    2.2. For numerical variables, what are the range, variance, and central tendency?
        a. Are the values plausible? For example, if 99% of the data fall between 0 and 100 but 1% are negative or exceed 1,000, does that make sense?
        b. Are the outliers genuine, or were values introduced by errors in data collection, data entry, or processing?
        c. If outliers are detected, how should they be handled? Should they be excluded from all analyses or replaced with other estimates? (The answer depends on the nature of the outliers, the purpose of the analysis, and the type of model used.)
        d. Do aggregations of the variables make sense?
    2.3. For categorical variables, are all categories represented?
        a. Are the categories consistently and correctly labeled?

3. Are there any missing values? Are there empty or blank entries in specific cells of the data set?
    3.1. Do some records have more missing values than others?
    3.2. Do some fields have more missing values than others?
    3.3. How should missing values be handled? Should they be excluded from the analysis or replaced with other estimates? (The answer depends on the purpose of the analysis and the type of model used.)

4. How representative are the data?
    4.1. Is there a known bias in the way the data were collected? For example, because an online survey requires participants to have internet access, its results cannot be generalized to the entire population.
    4.2. Which geographic locations are represented in the data?
    4.3. How well do the data reflect the relative distribution of people or households within a geographic area?
    4.4. How do the attributes compare with similar attributes from other authoritative data sources? For example, if a customer database contains age, how closely does its age distribution match that of the overall population? If there are known gaps or biases in the data, is there enough information to correct for them?

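As a concrete illustration of how the checks in this list might be scripted, here is a minimal sketch in Python using pandas. It is not the exact set of scripts we run; the file name (customers.csv) and column names (postal_code, age, region) are hypothetical placeholders chosen only to mirror the examples above.

```python
import pandas as pd

# Hypothetical input file and column names, used only to illustrate the checklist above.
df = pd.read_csv("customers.csv")

# 1. Unique and duplicate records
print("rows:", len(df))
print("duplicate rows:", df.duplicated().sum())

# 2. Fields and data types
print(df.dtypes)

# 2.1 String fields: does the postal code follow the expected 6-character structure?
#     (Here a Canadian letter-digit pattern, A1A1A1, is assumed purely as an example.)
pattern = r"^[A-Z]\d[A-Z]\d[A-Z]\d$"
postal_ok = df["postal_code"].astype(str).str.strip().str.upper().str.match(pattern)
print("records with a valid postal code:", postal_ok.sum())

# 2.2 Numerical variables: range, central tendency, and possible outliers
print(df["age"].describe())
print("implausible ages:", ((df["age"] < 0) | (df["age"] > 120)).sum())

# 2.3 Categorical variables: are all expected categories present and consistently labeled?
print(df["region"].value_counts(dropna=False))

# 3. Missing values, by field and by record
print(df.isna().sum())
print("records with at least one missing value:", df.isna().any(axis=1).sum())
```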
Answering these questions helps researchers understand the input data and start planning how to use them to build reliable data sets and models. In Part 2 of this article, we examine the important role that methodology plays in ensuring data quality.

Part 2: The importance of the right methodology

In Part 1 of this article on data quality, we examined the input data and how recency, frequency, and the data collection process affect the quality and type of analysis that can be done. In this part, we turn to the role of methodology in creating high-quality data and the factors involved in choosing the right method. The discussion gets somewhat technical, but these concepts are worth grasping in order to understand how high-quality data are created.
Applying the correct method is essential to our work and to this discussion. "Method" refers to the techniques we use to build our data products and to execute custom projects. These techniques range from simple rule-based algorithms to machine learning methods. To a large extent, the type, quantity, reliability, and timeliness of the available data determine which methods we use.
It is helpful to think of methodology as a spectrum with model accuracy at one end and model generality at the other. There is no direct trade-off between accuracy and generality; the best models have both. However, modeling techniques typically start at one end of the spectrum and move toward the other through model training, calibration, and testing. The figure below places common modeling techniques along this continuum according to where they typically begin.

Figure 1. Method spectrum and common modeling techniques.
When deciding which techniques to use to build standard data sets or execute custom projects, we weigh the advantages and disadvantages of accuracy-focused techniques against those of generality-focused techniques, as shown in Table 1.
Table 1. Advantages and disadvantages of accuracy and generality

This table uses several technical terms that matter to anyone who works with data, methods, and models. Let's start with correlation and causation. Correlation is just a statistical indicator: a mathematical formula that compares two variables. Correlation does not establish that a real-world relationship exists between the two variables, nor does it say anything about the nature of such a relationship. Causality, on the other hand, explicitly concerns how attributes or phenomena act on one another.
For example, if we try to predict how many jelly beans an office worker eats in a day, we may find that certain variables are highly correlated with jelly-bean consumption, such as the amount of soda consumed in a day, the distance from the worker's desk to the jelly-bean bowl, and the number of hours the worker spends in the office. In this case, it is tempting to infer that high soda consumption leads to high jelly-bean consumption.
However, that would be an inappropriate conclusion. Jelly-bean consumption and soda consumption are related, but the connection is indirect. In this case, the factor driving jelly-bean consumption is more likely to be the worker's attitude toward nutrition; soda consumption is merely a proxy for it. If soda were removed from the office, jelly-bean consumption would likely increase rather than decrease. In fact, with further testing we might establish that the distance to the jelly-bean bowl (proximity) and the hours spent in the office (exposure) have a significant, direct causal relationship with jelly-bean consumption: the more a worker is exposed to the jelly beans, the more jelly beans the worker eats. Keep in mind: correlation is not causation.
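To see how this plays out numerically, here is a small simulated sketch (invented purely for illustration, not from the original article): an unobserved "attitude toward nutrition" drives both soda and jelly-bean consumption, so the two are strongly correlated even though soda has no causal effect, and controlling for the true driver makes the apparent effect vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Unobserved confounder: each worker's (lack of) concern for nutrition.
attitude = rng.normal(size=n)

# Both behaviours are driven by the same attitude plus independent noise;
# in this simulation soda has no causal effect on jelly-bean consumption.
soda = 2.0 * attitude + rng.normal(size=n)
jelly_beans = 3.0 * attitude + rng.normal(size=n)

print("corr(soda, jelly beans):", round(np.corrcoef(soda, jelly_beans)[0, 1], 2))

# Naive regression of jelly beans on soda: a large, spurious coefficient.
X = np.column_stack([np.ones(n), soda])
beta_naive, *_ = np.linalg.lstsq(X, jelly_beans, rcond=None)
print("naive soda coefficient:", round(beta_naive[1], 2))

# Controlling for the true driver (attitude) shrinks the soda coefficient toward zero.
X_ctrl = np.column_stack([np.ones(n), soda, attitude])
beta_ctrl, *_ = np.linalg.lstsq(X_ctrl, jelly_beans, rcond=None)
print("soda coefficient controlling for attitude:", round(beta_ctrl[1], 2))
```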
When evaluating modeling techniques and methods, other terms we need to understand are "out of sample", "out of time", and overfitting: three interrelated concepts. When we describe a model as overfitted, we essentially mean that it does not generalize well: it has mistaken random noise for systematic patterns. Tested against the data from which it was created, an overfitted model performs exceptionally well, meaning it makes few errors. However, when the model is tested on independent data, called "out-of-sample" data, rather than the data used to build it, its performance deteriorates sharply. Data that are both out of sample and drawn from a different time period are called "out of time". By testing a model out of sample and out of time, we avoid overfitting and learn the model's true fit.
For example, Figure 2 shows the errors of a model's predictions measured against the data used to train the model and against out-of-sample data. The figure tells us two things: 1) the model is much less accurate when applied out of sample; 2) beyond training step 12, out-of-sample performance gradually deteriorates, and past that point the model is clearly overfitting. The model produced at training step 12 is therefore the one that should be used for further analysis or for generating future predictions, because it has the best out-of-sample performance.
Figure 2. Comparing the training-data error with the out-of-sample error shows where the model starts to overfit.

When constructing data sets, we try to balance prediction accuracy and model generality. We test different modeling frameworks for almost every geographic level and set of variables that we generate. If data arrive with high frequency and reliability, we apply techniques that lean toward prediction accuracy. When data arrive infrequently or with low reliability, or when we have to predict far into the future, we focus on building a well-generalized model and do our best to capture genuine causality rather than mere correlation.
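The out-of-sample check described above can be sketched in a few lines of Python on synthetic data. This is only an illustrative stand-in, assuming that increasing polynomial degree plays the role of the "training steps" in Figure 2; it is not the modeling setup behind that figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a smooth signal plus noise, split into training and out-of-sample sets.
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
x_train, y_train = x[:100], y[:100]
x_test, y_test = x[100:], y[100:]   # held-out, "out-of-sample" data

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Increasing polynomial degree stands in for "training steps":
# training error keeps falling, but out-of-sample error eventually turns up.
results = []
for degree in range(1, 13):
    coefs = np.polyfit(x_train, y_train, degree)
    err_in = rmse(y_train, np.polyval(coefs, x_train))
    err_out = rmse(y_test, np.polyval(coefs, x_test))
    results.append((degree, err_out))
    print(f"degree {degree:2d}  train RMSE {err_in:.3f}  out-of-sample RMSE {err_out:.3f}")

# Keep the model with the best out-of-sample performance, not the best training fit.
best_degree = min(results, key=lambda r: r[1])[0]
print("selected degree:", best_degree)
```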
No single approach is universally applicable to every situation in which analytical models and data sets are created. Modeling techniques are designed to solve specific types of problems using specific types of data under a set of assumptions. Most techniques can be adapted to a variety of applications and input data types, but only within limits. When choosing the right method, it is important to understand both the limitations of the different modeling techniques and the limitations imposed by the input data.
In the first two parts of this article, we have seen how input data and methodology have a huge impact on data quality; no analysis project can get off the ground without them. In the final part, we examine the role of quality control in creating high-quality data.

Part 3: Quality control is essential

In the first two parts of this article on data quality, we examined the role that input data and methodology play in creating high-quality data. In this final part, we turn our attention to the third component: quality control, also known as quality assurance (QA). Quality control means evaluating the model and the data it generates, and such testing should be performed as frequently as possible.
Ensuring quality control
Quality control, or quality assurance (QA), boils down to two key elements: comparison against authoritative data sources, and expert judgment. When authoritative data are available, it is straightforward to calibrate models, test results, and evaluate prediction accuracy. This is an extremely valuable step in any modeling exercise; in essence, it is an extension of the cross-validation techniques used in statistics and machine learning. All good modelers build models, make predictions, measure the accuracy of those predictions, and then refine their models on that basis. No one should skip that final refinement step.
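As a rough sketch of the cross-validation idea mentioned here (not the exact QA pipeline we use), the example below scores a simple placeholder model against held-out folds, each fold playing the role of data the model never saw during fitting; the data set and model are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for a data set whose model we want to quality-check.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

model = LinearRegression()

# 5-fold cross-validation: each fold is held out in turn and plays the role of
# independent data the model never saw during fitting.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")

print("per-fold RMSE:", np.round(-scores, 2))
print("mean RMSE:", round(float(-scores.mean()), 2))
```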
The second element, judgment, is more challenging and somewhat subjective. In our business, a relatively long time can pass between when we make predictions and when we have authoritative data against which to evaluate them. In the case of DemoStats, we have to wait at least five years to evaluate and measure accuracy.
We spend as much time on quality control as we spend on model building. When we quality-check data, we use our experience, domain knowledge, and best judgment to test the reliability of the data and the models. Building competing models to test our core methodology is one of the tools we use for quality control. This process usually raises some very important questions: how comparable are the predictions? Which forecasts differ, and why? Which prediction is more credible? Is there a systematic difference between the two forecasts that we can exploit? In addition, when new authoritative data become available, we compare the various methods we use to determine whether the core methodology needs to be revised.
QA is an indispensable part of constructing data sets and ensuring their quality. Our investment in QA means that we keep improving our methods and data sets. It also means that our researchers do not become complacent. Without a thorough QA process, researchers can easily fall into the trap of using the same methods and data sources simply because those are what they used in the past. The last thing any company needs is complacent researchers!


In this three-part article, we have examined the challenges of creating high-quality data. We have seen that no data are perfect and that it is important to establish how clean the input data are. On the methodology side, no single approach fits all situations; methods must be chosen thoughtfully, weighing the nature of the data and how they will be used. Finally, creating high-quality data requires testing and evaluating models as frequently as possible, and then adjusting them in light of the evaluations and of new data. The quality of business decisions depends on the quality of the analysis behind them, and the quality of the analysis depends on the quality of the data. We should never forget this most basic relationship.
Source: https://environicsanalytics.com/en-ca/resources/blogs/ea-blog/2016/05/01/avoiding-garbage-in-garbage-out-the-importance-of-data-quality-part-1