The eight laws of data mining that every big data modeler should know

Data mining is the process of using business knowledge to discover, analyze, and interpret knowledge (or patterns) in data. Whether expressed in its natural form or artificially restated, this knowledge is new knowledge, and because it can deliver great value, people have taken up data mining in droves.

Data mining in its current form was born in the 1990s out of practice: it developed as a form of integrated business data analysis, supported by mining platforms that package suitable algorithms. Perhaps because data mining grew out of practice rather than theory, little attention was paid to understanding its process. In the late 1990s, the development of CRISP-DM standardized the data mining process, and it has been adopted and followed by more and more successful data mining practitioners.

Although CRISP-DM can guide how to implement data mining, it does not explain what data mining is or why the process is appropriate. In this article I present my proposed eight maxims, or "laws," of data mining (most of which are well known to practitioners), together with explanations of several other familiar points, to give a theoretical account of the data mining process.

My purpose is not to critique CRISP-DM, but many of its concepts are crucial to an understanding of data mining, and this article will also rely on CRISP-DM's common terminology. CRISP-DM is only the starting point of this discussion of the process.

First, the Business Goals Law: business objectives are the source of every data mining solution.

This defines the subject of data mining: data mining is concerned with solving business problems and achieving business goals. Data mining is not primarily a technology but a process, with business objectives at its core. Without a business goal there is no data mining (whether or not that goal is stated explicitly). Hence this law can be stated as: data mining is a business process.

Second, the Business Knowledge Law: business knowledge is central to every step of the data mining process.

This defines a key characteristic of the data mining process. A naive reading of CRISP-DM is that business knowledge matters only at the beginning, to define goals, and at the end, to implement the results. That reading misses a key property of the process: business knowledge is at the core of every step.

To make this concrete, I use the CRISP-DM phases to illustrate:

Business understanding must be based on business knowledge, and data mining goals must be mapped from business goals (a mapping that also depends on data knowledge and data mining knowledge);

Data understanding uses business knowledge to understand which data relate to the business problem, and how;

Data preparation uses business knowledge to shape the data so that the business problem can be posed and answered (see the third law, the Data Preparation Law, in more detail below);

Modeling creates predictive models with data mining algorithms and interprets the model's characteristics against the business objectives, that is, understands their business relevance;

Evaluation is understanding the model's impact on the business;

Deployment is putting the data mining results to work in a business process.

In short, without business knowledge, no step of the data mining process is valid; there is no "purely technical" step. Business knowledge guides the process toward useful results and makes those results recognizable as useful. Data mining is an iterative process, with business knowledge at its core, driving the continual improvement of results.

The reason behind this can be explained by what Alan Montgomery, in a point he raised about data mining in the 1990s, called the "chasm of representation." Montgomery observed that data mining goals concern business reality, while the data represent only a portion of that reality; there is a gap (or "chasm") between the data and the real world. In data mining, business knowledge bridges this gap: whatever is found in the data, only business knowledge can interpret it and reveal its importance, and whatever is missing from the data must be supplied by business knowledge. Only business knowledge can compensate for this incompleteness, which is why business knowledge is central to every step of the data mining process.

Third, the Data Preparation Law: data preparation matters more than any other part of the data mining process.

This is a famous maxim of data mining: the most demanding part of a data mining project is data acquisition and preprocessing. Informal estimates put it at 50%-80% of project time. The simplest explanation is that "data is hard," and automation is often used to ease this "burden" across each part of the preprocessing workload: data acquisition, data cleansing, data transformation. Automation is beneficial, and its advocates claim the technology can greatly reduce the preprocessing workload, but this claim is misleading about why data preprocessing is necessary to the data mining process.

The purpose of data preprocessing is to transform the data mining problem into formatted data that the analysis (for example, a data mining algorithm) can readily use. Any change to the data (including cleansing, min/max transformation, augmentation, and so on) means a change to the problem space, so the analysis must be exploratory. This is why data preprocessing matters, and why it occupies such a large share of the data mining process: it lets data miners deliberately manipulate the problem space, making it easier to find an analysis method that fits.

There are two ways to "shape" the problem space. The first is to convert the data into a fully formatted form ready for analysis; for example, most data mining algorithms require the data in a single table, with one record per example. Data miners know what form of data each kind of algorithm needs, so they can convert the data into a suitable format. The second is to make the data express more of the information relevant to the business problem; for example, in some domains, data miners understand aspects of a data mining problem through business knowledge and data knowledge. With such domain knowledge, data miners may find a suitable technical solution more easily by manipulating the problem space.

Thus it is business knowledge, data knowledge, and data mining knowledge that fundamentally make data preprocessing tractable. These aspects of data preprocessing cannot be achieved by simple automation.
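As a minimal sketch of the first way of shaping the problem space, the snippet below uses pandas to flatten hypothetical transaction records (all names and values are illustrative) into the one-record-per-example, single-table form that most data mining algorithms expect:

```python
import pandas as pd

# Hypothetical transaction data: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [20.0, 35.0, 5.0, 8.0, 12.0, 100.0],
    "channel": ["web", "store", "web", "web", "store", "web"],
})

# Shape the problem space: aggregate to one record per customer,
# choosing features that business knowledge says are relevant.
features = transactions.groupby("customer_id").agg(
    n_purchases=("amount", "size"),
    total_spend=("amount", "sum"),
    web_share=("channel", lambda s: (s == "web").mean()),
).reset_index()

print(features)
```

Which aggregates to compute (count, spend, channel mix) is exactly the business-knowledge decision the law describes; the mechanical reshaping is the easy part.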

This law also explains a puzzling phenomenon: even when a data warehouse has been built through data acquisition, cleansing, and integration, data preprocessing remains essential and still accounts for more than half of the data mining workload. Moreover, as CRISP-DM shows, even after the main data preparation phase, further preprocessing is needed in the iterative process of creating a useful model.

There are five factors that explain why experiment is necessary for finding a data mining solution:

The space of data mining problems is initially unknown;

Multiple problem spaces may be relevant to a given data mining goal;

The problem space can be manipulated through data preprocessing;

Models cannot be evaluated by purely technical means;

The business problem itself may change.

The last point deserves emphasis: business objectives change during data mining. CRISP-DM hints at this, but it is often not easily perceived. It is widely known that CRISP-DM is not a "waterfall" process in which each step simply follows the previous one; in fact, any CRISP-DM step may take place at any point in the project, and likewise business understanding may be present at any step. Business objectives are not simply given at the start; they run through the whole process. This may explain why some data mining projects begin without a clear business objective: their practitioners know that the business objective is itself a result of data mining, not something statically given.

Wolpert's "No Free Lunch" (NFL) theorem, as applied to machine learning, states that no state of bias (such as a particular algorithm) is better than any other when averaged over all possible problems (data sets). This is because, if we consider all possible problems, their solutions are uniformly distributed, so an algorithm (or bias) that is advantageous on one subset must be disadvantageous on another. This bears a striking resemblance to what data miners know: no algorithm fits every problem. But the problems and data sets addressed by data mining are not random, nor uniformly distributed over all possible problems; they are a biased sample. Why, then, should the NFL conclusion apply? The answer lies in the factors mentioned above: the initial problem space is unknown; multiple problem spaces may be relevant to the data mining goal; the problem space can be manipulated by data preprocessing; models cannot be evaluated by purely technical means; and the business problem itself may change. For these reasons, the problem space deployed in a data mining process is constantly changing during that process, so that, under these constraints, treating the data set as if the algorithm had been chosen at random is a valid model. For data mining, too, there is no free lunch.

This describes the data mining process in general. However, under certain conditions, such as a stable business goal and stable preprocessed data, an acceptable algorithm or combination of algorithms can solve the problem. In such cases, the general steps of the data mining process shrink. If the situation remains stable and persistent, the data miner's lunch is free, or at least relatively cheap. But stability of this kind is temporary, because both the understanding of the data (the second law) and the understanding of the problem (the eighth law) will change.
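The practical consequence of "no free lunch" is that the right model can only be found by experiment. A minimal sketch with scikit-learn (the candidate algorithms and the synthetic data set are illustrative stand-ins): rather than assuming one algorithm is best a priori, estimate each candidate's performance empirically and let the experiment decide.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# A synthetic stand-in for a business data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# No algorithm is best in advance: cross-validate each candidate
# and choose by measured performance, not by assumption.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Which candidate wins depends entirely on the data set; on a different problem the ranking can reverse, which is the point of the law.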

Fourth, the Pattern Law (David's Law): there are always patterns in the data.

This law was first proposed by David Watkins. We might expect some data mining projects to fail because no pattern that solves the business problem exists in the data, but this does not match data miners' experience.

The foregoing discussion already suggested why: anything interesting found in a data set is always business-relevant, so even when the expected pattern cannot be found, something else useful can be (this does match data miners' experience); and unless business experts expect a pattern to exist, the project would not be undertaken in the first place, so finding one should be no surprise, because business experts are usually right.

Watkins, however, proposed a simpler and more direct view: "the data always contain patterns." This matches data miners' experience even more closely than the earlier account. Watkins later refined the view: in data mining projects about customer relationships, there is always a pattern, namely that customers' future behavior is always related to their past behavior, and these patterns are clearly profitable (Watkins' law of customer relationship management). Data miners' experience, however, is not limited to customer relationship management; patterns exist in any data mining problem (Watkins' general law).

Watkins' general law can be explained as follows:

The business objective defines the domain of interest for the data mining project, and the data mining goal reflects this;

Data relevant to the business objective, and to the corresponding data mining goal, are generated by processes within this domain;

These processes are governed by rules, and the data they generate reflect those rules;

In these processes, the objective of data mining is to disclose the rules of the domain through pattern-finding technology (data mining algorithms) and, combined with business knowledge, to interpret the algorithms' results as discovered knowledge;

Since the data used in data mining are necessarily generated within this domain, the patterns they contain are inevitably shaped by these rules.

To summarize this view: patterns are always present in the data, because they are an inevitable by-product of the processes that generate the data. To discover the patterns, start from what you already know: business knowledge.

Discovering patterns with business knowledge is an iterative process: the patterns discovered in turn contribute to business knowledge, while business knowledge remains the main factor in interpreting the patterns. In this iteration, data mining algorithms simply connect hidden patterns with business knowledge.

If this interpretation is correct, then David's law is completely general. Unless no data relevant to the problem exist, there are always patterns to mine, in every data set in every domain.

Fifth, the Insight Law: data mining amplifies perception in the business domain.

How does data mining produce insight? This law approaches the core of data mining: why it must be a business process rather than a technical process. Business problems are solved by people, not by algorithms. The data miner and the business expert search for a solution to the problem, that is, a pattern in the problem domain that achieves the business goal. Data mining contributes, wholly or partly, to this cognitive process. The patterns disclosed by data mining algorithms are not usually intelligible to humans in the ordinary way; combining these algorithms with normal human perception is what makes the data mining process agile in nature. In the data mining process, the problem solver interprets the results produced by the algorithms and unifies them with a business understanding, which is why this is a business process.

This is similar to the concept of the "intelligence amplifier" from early artificial intelligence: AI's first practical results were not intelligent machines but tools called "intelligence amplifiers," which help human users improve their ability to obtain useful information. Data mining provides a similar "intelligence amplifier," helping business experts solve business problems they could not solve alone.

In short, data mining algorithms provide a pattern-finding capability beyond normal human abilities, and the data mining process allows data miners and business experts to integrate this capability into their business processes and their respective problems.

Sixth, the Prediction Law: prediction increases information locally by generalization.

"Prediction" has become an accepted description of what data mining models can do; we commonly say "predictive models" and "predictive analytics." This is because many popular data mining models are used to "predict the most likely outcome" (or to show how likely each possible outcome is). This is the typical use of classification and regression models.

But other kinds of data mining models, such as clustering and association models, are also described as "predictive," and there the term's meaning is looser. A clustering model may be described as "predicting" which group an individual belongs to, and an association model may be described as "predicting" one or more attributes on the basis of known attributes.

Similarly, we can analyze the use of "predict" across different subjects: a classification model may be said to predict customer behavior, or more precisely, to predict the target behavior of some customers, even if not all the predicted individuals behave as "predicted." A fraud detection model may be said to predict whether individual transactions are high risk, even if not all the flagged transactions are fraudulent.

This broad use of "prediction" is why "predictive analytics" has become a generic term for data mining and is widely used in business solutions. But we should recognize that this is not "prediction" in the everyday sense: we cannot expect to predict the behavior of one particular individual or the outcome of one particular fraud investigation.

What, then, is "prediction" in this sense? What do classification, regression, clustering, and association algorithms and their ensemble models have in common? The answer is "scoring," that is, applying a predictive model to a new sample. The model produces an estimate or score, which is a new piece of information about that sample; by generalization (induction), the information we hold about the sample is improved, drawing on the patterns found by the algorithm and embodied in the model. Notably, this new information is not "data" in the sense of something "given"; it is meaningful only in a statistical sense.
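A minimal sketch of "scoring" with scikit-learn (the data are a synthetic stand-in, and "churn" is an illustrative label): both a classifier and a clustering model attach new, statistically meaningful information to a previously unseen sample.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for historical business data.
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Classification: scoring attaches a probability to a new sample.
clf = LogisticRegression(max_iter=1000).fit(X, y)
new_sample = X[:1]  # pretend this is a fresh, unseen record
churn_score = clf.predict_proba(new_sample)[0, 1]

# Clustering: here "prediction" means assigning the sample to a group.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
group = km.predict(new_sample)[0]

print(f"score: {churn_score:.3f}, cluster: {group}")
```

In both cases the model adds an estimate, not a certainty: a probability, or a group membership, derived by generalizing from patterns in the training data.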

Seventh, the Value Law: the value of data mining results is not determined by the accuracy or stability of the predictive model.

Accuracy and stability are two common measures of predictive models. Accuracy is the proportion of predictions that are correct; stability means that when the data used to build the model change (while still describing the same thing), the predictions change little, if at all. Given the central role of the concept of prediction in data mining, accuracy and stability are often assumed to determine the value of a model's results. They do not.

Predictive models deliver value in two ways: by improving outcomes through better-targeted actions, and by conveying insight (or new knowledge) that leads to a change of strategy.

For the latter, the link between the value of the insight conveyed and accuracy is not close at all. Some predictive power may be needed to convince us that the discovered pattern is real, but a highly accurate yet completely opaque or very complex model is hard to understand and conveys little insight, while a simple model of lower accuracy may convey far more useful insight.

The separation between accuracy and value is less obvious when the model is used to improve actions, but a pointed question exposes it: "does the model predict the right thing, for the right reasons?" In other words, a model's value derives from its fit to the business problem just as much as from its predictive accuracy. For example, a customer churn model may be highly accurate and yet give ineffective business guidance if the customers it identifies for retention belong to a minimally profitable segment of the customer base. If the model does not fit the business problem, high accuracy does not improve its value.

The same is true of model stability: although stability is an interesting measure of a predictive model, it cannot substitute for the model's ability to convey business insight or to fit the business problem, and neither can any other technical measure.

In short, the value of a predictive model is not determined by technical indicators. Data miners should attend to predictive accuracy, model stability, and other technical measures only insofar as doing so does not harm the model's business insight and its fit to the business problem.
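The churn example above can be made concrete with a deliberately constructed toy calculation (all numbers are hypothetical): a model that is more accurate overall can still deliver less business value than a less accurate model that targets the right customers.

```python
# Hypothetical toy data: six customers, whether each actually churns,
# and the profit retained if a churner is correctly targeted.
actual_churn    = [1, 1, 1, 0, 0, 0]
profit_if_saved = [5, 5, 100, 0, 0, 0]  # the third customer is the big account

model_a = [1, 1, 0, 0, 0, 0]  # accurate, but only on the cheap customers
model_b = [0, 0, 1, 0, 0, 0]  # misses the cheap churners, finds the big account

def accuracy(pred):
    return sum(p == a for p, a in zip(pred, actual_churn)) / len(actual_churn)

def value(pred):
    # Profit retained by correctly targeting actual churners.
    return sum(v for p, a, v in zip(pred, actual_churn, profit_if_saved)
               if p == 1 and a == 1)

print(accuracy(model_a), value(model_a))  # 5/6 accurate, 10 retained
print(accuracy(model_b), value(model_b))  # 4/6 accurate, 100 retained
```

Model A wins on the technical metric; model B wins on the business one. Which model is "better" is a business question, not a technical one.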

Eighth, the Change Law: all patterns are subject to change.

The patterns discovered by data mining do not last forever. This is well known in many applications of data mining, but the universality of this property is not widely appreciated.

In marketing and CRM applications of data mining it is easy to see that patterns of customer behavior change over time. Behavior changes, markets change, competitors change, and the whole economy changes; predictive models become obsolete because of these changes, and when they can no longer predict accurately they should be updated on a regular basis.

The same is true in risk and fraud applications of data mining: fraud changes with the environment, because criminals change their behavior to stay ahead of fraud detection. Fraud detection applications must be designed to handle new, unknown types of fraud as well as old, familiar ones.
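A minimal sketch of acting on this law (the window size and threshold are illustrative assumptions, not recommendations): monitor a deployed model's rolling accuracy and flag it for retraining when performance decays.

```python
# Illustrative drift check: flag a model for retraining when its
# recent accuracy falls below a chosen threshold.
def needs_retraining(outcomes, window=100, threshold=0.8):
    """outcomes: list of 1 (correct prediction) / 0 (wrong), oldest first."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent) < threshold

# Simulated history: the model starts accurate, then the world changes.
history = [1] * 90 + [0] * 30
print(needs_retraining(history))  # recent errors pull accuracy below 0.8
```

In practice the monitored signal, window, and threshold would themselves be business decisions, consistent with the second and seventh laws.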

Certain kinds of data mining might be thought to discover patterns that do not change over time, for example, data mining in scientific applications: surely the universal laws found there do not change? Perhaps surprisingly, the answer is that even these patterns should be expected to change. The reason is that the patterns are not simply rules that exist in the world; they are reflections of the data, even though the underlying rules may indeed be static in some domains.

The patterns discovered by data mining are part of a cognitive process, a dynamic construction that data mining builds between the data describing the world and the understanding of the observer or business expert. Because our knowledge is continually developing and growing, we should expect the patterns to change as well. Tomorrow's data may look superficially similar, but it may capture different (perhaps subtly different) patterns, serve different purposes, and carry different semantics; the analysis, being driven by business knowledge, will change as that business knowledge changes. For all these reasons, the patterns will differ.

In short, all patterns change, not only because they reflect a changing world, but also because they reflect our changing understanding.

Postscript:

These eight laws are simply truths about data mining. Most of them are already well known to data miners, though some remain unfamiliar (for example, the fourth, fifth, and sixth). Most of the new ideas here lie in the explanations of these eight laws, which attempt to account for the reasons behind the familiar data mining process.

Why should we care why the data mining process takes the form it does? Beyond the simple desire for knowledge and understanding, there is a practical reason to explore these questions.

The data mining process exists in its current form because of technological developments: the popularity of machine learning algorithms, and of integrated platforms that make these algorithms easy for business users to adopt. Should we expect the data mining process to change as technology changes? It will change eventually, but if we understand why the process takes the form it does, we can distinguish what technology can change from what it cannot.

Some technical developments promise a revolutionary role in predictive analytics, such as automating data preprocessing and model building, and deploying predictive models within frameworks that integrate business rules. The eight laws of data mining and their explanations suggest that such developments will not change the nature of the data mining process. These laws, and the further development of these ideas, beyond their educational value for data miners, should be used to evaluate any future claim of revolutionary change in data mining.


Origin blog.csdn.net/sdddddddddddg/article/details/91471860