A big data application: beer and diapers

Today's hot topics: data mining and data analysis

Data mining

I spent a long time searching the Internet for a definition of data mining, and there are a great many of them. As the saying goes, out of all that water, let's take just a few scoops to drink:

1. The first definition comes from Think Tank Encyclopedia and takes two perspectives:

Technical point of view: Data mining is the non-trivial process of revealing hidden, previously unknown, and potentially valuable information in the large volumes of data held in a database. What does that mean? Databases today hold terabytes, petabytes, even exabytes of seemingly independent records, and mining aims to find the related data, or the relationships within the data. So how are they found? Data mining draws on artificial intelligence, machine learning, pattern recognition, statistics, database technology, and visualization, and uses methods such as artificial neural networks, genetic algorithms, nearest-neighbor algorithms, and decision trees to uncover relationships in the data. These methods deliver the typical functions of data mining: classification, clustering, association rules, feature extraction, and so on. At this point, data mining in the narrow sense is done.
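To make "classification" with a nearest-neighbor algorithm concrete, here is a minimal sketch. The customer features (store visits per month, average basket value), the labels, and the function name `knn_classify` are all hypothetical and chosen only for illustration; this is not a method taken from any of the definitions quoted here.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort the labelled points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (visits per month, average basket value) -> label
train = [
    ((1, 20), "occasional"),
    ((2, 25), "occasional"),
    ((1, 30), "occasional"),
    ((8, 90), "regular"),
    ((9, 110), "regular"),
    ((10, 95), "regular"),
]

label_a = knn_classify(train, (9, 100))
label_b = knn_classify(train, (2, 22))
```

A new customer is assigned to whichever group most of its nearest neighbors belong to. In practice the features would usually be rescaled first so that no single feature dominates the distance.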

Business perspective: Data mining is a new business information processing technology. Its main feature is that it extracts, transforms, analyzes, and models large volumes of business data in a business database in order to pull out the key data that supports business decisions. This description is very close to the broad concept of data analysis: data mining can be understood as deep data analysis, or as data analysis in the broad sense.

2. Next, let’s look at how Oracle defines data mining:

Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events.

This definition leans toward the technical point of view: data mining goes beyond simple data analysis, using artificial intelligence and other techniques, including complex algorithms, to identify relationships or trends in large amounts of data, and it segments the data in order to support decision-making.

Data analysis

Data analysis: the process of analyzing a large amount of collected data with appropriate statistical methods in order to extract useful information and form conclusions, studying the data in detail and summarizing it.

The statistical methods mentioned here mainly include the following:

Descriptive statistical analysis: This describes the basic data of the sample, including the frequency distribution and percentage breakdown of each variable, in order to understand how the sample is distributed. For example, given a dozen or so data points, you can compute the mean to indicate central tendency, and the range, variance, and standard deviation to indicate dispersion.
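As a small illustration of these descriptive measures, the sketch below computes them for a dozen data points with Python's standard `statistics` module. The sample values and the helper name `describe` are made up for the example.

```python
import statistics

# Hypothetical sample: a dozen daily sales counts (made-up numbers).
sample = [12, 15, 11, 18, 14, 13, 17, 16, 12, 15, 14, 11]

def describe(values):
    """Return the mean, range, sample variance, and sample standard deviation."""
    return {
        "mean": statistics.mean(values),          # central tendency
        "range": max(values) - min(values),       # spread between extremes
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # square root of the sample variance
    }

summary = describe(sample)
print(summary)
```

The mean summarizes where the data is centered, while the other three values describe how widely it is spread.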

Exploratory analysis: A systematic way of examining data that shows how the variables are distributed and uses scatter plot matrices and scatter plots to analyze the correlation between variables. Correlation analysis is the most common technique in exploratory analysis; its main purpose is to determine whether variables are positively correlated, negatively correlated, or uncorrelated.
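One common way to check the direction of a correlation is the Pearson coefficient. The sketch below is a plain-Python version; the cut-off `tol=0.1` used to call two variables "uncorrelated" is an arbitrary assumption for illustration, not a standard value, and the data is invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)

def direction(r, tol=0.1):
    """Classify a coefficient; `tol` is an assumed, illustrative cut-off."""
    if r > tol:
        return "positive"
    if r < -tol:
        return "negative"
    return "uncorrelated"

# Hypothetical variables
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
r_none = pearson([1, 2, 3, 4], [1, -1, -1, 1])
```

A coefficient near +1 or -1 indicates a strong linear relationship, and one near 0 indicates no linear relationship, although a nonlinear one may still exist.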

Both of these analyses also draw on other techniques: comparative analysis (for example the familiar period-over-period and year-on-year comparisons, which compare across time or space), group analysis (dividing data objects into groups according to certain characteristics), and regression analysis (which is better suited to variables tracked over a long period, such as a clothing store's sales in each quarter and month of every year, where a regression model built on past sales predicts sales for the same period now or in the future), among others.

What data mining and data analysis have most in common is that both aim to better support decision-making.
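The clothing-store example can be sketched with a year-on-year comparison and a least-squares line fitted to past sales. The sales figures and the names `growth_rate` and `fit_line` are hypothetical; a straight line is the simplest possible regression model and is an assumption here, since real sales data may need seasonal or nonlinear terms.

```python
def growth_rate(current, previous):
    """Relative change between two periods (year-on-year or period-over-period)."""
    return (current - previous) / previous

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical first-quarter sales of a clothing store over five years
years = [1, 2, 3, 4, 5]
q1_sales = [102, 108, 121, 129, 140]

# Year-on-year change of the latest first quarter against the previous one
yoy = growth_rate(q1_sales[-1], q1_sales[-2])

# Fit the trend and forecast the first quarter of year 6
slope, intercept = fit_line(years, q1_sales)
forecast = slope * 6 + intercept
```

Comparing the same quarter across years, rather than consecutive quarters, keeps seasonal effects from distorting the trend.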

Example: beer and diapers

This is a story about the retail empire Wal-Mart. After a routine data analysis, researchers unexpectedly discovered that the product most often purchased together with diapers was beer! Diapers and beer sound unrelated, but the result came from mining historical data and reflects a regularity in the data itself. The relationship was puzzling, but follow-up investigation showed that young fathers often went to the supermarket to buy diapers, and 30%-40% of them also picked up some beer as a treat for themselves. Wal-Mart then sold beer and diapers as a bundle, and as expected, sales of both increased.

In this case, the data mining stage mainly uses association rules, often called market basket analysis. A natural question arises: shoppers buying diapers may also buy cigarettes, so why is only beer mentioned? The answer involves a threshold: a pair of items whose co-occurrence does not reach the threshold cannot be considered associated. Here the number of baskets containing both beer and diapers exceeds the set threshold and is also greater than the number containing both cigarettes and diapers, so the cigarette pairing is ignored as an outlier (or cigarettes are treated as a related variable ranked after beer). After discovering the association, the researchers began tracking it. Tracking here does not mean direct observation in the store, but using artificial intelligence and other technical means to lock onto the observed values of these two variables and follow them over time. This follow-up research is itself a decision prompted by data mining.
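The role of the threshold can be illustrated with a minimal sketch that counts how often each pair of items appears in the same basket and keeps only the pairs whose share of all baskets (their "support") reaches a minimum. The baskets, the function name `frequent_pairs`, and the value `min_support=0.4` are invented for the example and are not Wal-Mart's actual data or settings.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "cigarettes"},
    {"diapers", "beer", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"diapers", "beer", "cigarettes"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs bought together often enough."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort items so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }

result = frequent_pairs(baskets, min_support=0.4)
print(result)
```

In this made-up data, beer and diapers appear together in 4 of 8 baskets, while cigarettes and diapers appear together in only 2 of 8, so only the beer pairing passes the threshold.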

In the data analysis stage, the 30%-40% figure mentioned above is a quantitative result of data analysis. Bundling the goods, or placing them on adjacent shelves, likewise shows data analysis supporting a decision.
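One way to arrive at a figure like 30%-40% is to compute, among all baskets that contain diapers, the share that also contain beer. In association-rule terms this is the confidence of the rule "diapers, then beer". The baskets and the function name `confidence` below are hypothetical, and the story does not say that the original figure was computed this way.

```python
def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0  # the rule never applies, so report zero
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)

# Hypothetical shopping baskets
baskets = [
    {"diapers", "beer"},
    {"diapers"},
    {"diapers", "milk"},
    {"diapers", "beer", "chips"},
    {"diapers", "bread"},
    {"beer"},
    {"milk", "bread"},
]

diapers_then_beer = confidence(baskets, "diapers", "beer")
print(f"{diapers_then_beer:.0%} of diaper baskets also contain beer")
```

Confidence is directional: the share of beer baskets that also contain diapers is generally a different number.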

Similar examples include chocolate and condoms.


Origin blog.csdn.net/lyw5200/article/details/109407744