Data mining algorithms and examples

In general, data mining algorithms fall into four types: classification, prediction, clustering, and association. The first two are supervised learning; the latter two are unsupervised learning and are descriptive, aimed at pattern recognition and discovery. Supervised learning means there is a target variable: we explore the relationship between the feature variables and the target variable, and the algorithm learns and is optimized under the supervision of that target. For example, a credit scoring model is typical supervised learning, where the target variable is "whether the customer defaults." The purpose of the algorithm is to learn the relationship between the feature variables (demographics, assets, and so on) and the target variable.

Classification algorithms

The biggest difference between classification algorithms and prediction algorithms is that the former have a discrete target variable (for example, whether a loan is overdue, whether a cell is a tumor cell, whether a message is spam), while the latter have a continuous target variable. Common classification algorithms include logistic regression, decision trees, KNN, Bayesian discriminant analysis, SVM, random forests, and neural networks.

Prediction algorithms

In prediction algorithms, the target variable is usually continuous. Common algorithms include linear regression, regression trees, neural networks, and SVM.

Unsupervised Learning

Unsupervised learning means there is no target variable; based on the data itself, the algorithm identifies the inherent patterns and relationships among the variables. Association analysis, for example, finds from the data that items A and B are related. Cluster analysis, for example, uses distance to divide all samples into several stable, distinguishable groups. Both are pattern recognition and analysis carried out without the supervision of a target variable.

Cluster analysis

The purpose of clustering is to segment the samples so that samples in the same group have similar characteristics, while samples in different groups differ markedly. Common clustering algorithms include k-means, hierarchical clustering, and density-based clustering.

Association analysis

The purpose of association analysis is to identify the intrinsic links between items. It often refers to market basket analysis, that is, finding which products consumers tend to buy together (such as swimming trunks and sunscreen) to support bundled sales.

Data mining applications and cases

 

Cases based on classification models (spam detection, tumor cell identification)

Spam detection

How does a mail system decide whether an email is spam? This belongs to text mining, and the Naive Bayes method is often used. The main principle is to judge a message by whether the words in its body often appear in spam. For example, if a message contains words such as "reimbursement", "invoice", and "promotion", the probability that it is judged as spam will be relatively high.

In general, determining whether a message is spam involves the following steps.

First, break the message body into individual words; suppose a message contains 100 words.

Second, using Bayesian conditional probability, calculate the probability that a message containing these 100 words is spam and the probability that it is normal mail. If the spam probability is greater than the normal-mail probability, the message is classified as spam.
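The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a tiny made-up training set and add-one (Laplace) smoothing; the function names train, log_posterior and classify are invented for this example, and the label "ham" stands for normal mail.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (list_of_words, label), label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def log_posterior(words, label, model):
    """Log of P(label) * product of P(word | label), up to a shared constant."""
    word_counts, class_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for w in words:
        # Add-one smoothing so unseen words do not give a zero probability.
        score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
    return score

def classify(words, model):
    """Pick the label with the larger posterior probability."""
    return max(("spam", "ham"), key=lambda label: log_posterior(words, label, model))

# Made-up training messages, already split into words (step one).
training = [
    (["invoice", "promotion", "discount"], "spam"),
    (["reimbursement", "invoice", "cheap"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["project", "report", "meeting"], "ham"),
]
model = train(training)
print(classify(["invoice", "promotion"], model))
print(classify(["meeting", "report"], model))
```

Working with log probabilities avoids numerical underflow when a message contains many words, which is why the sketch sums logarithms instead of multiplying probabilities.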

Tumor diagnosis

How do we determine whether cells are tumor cells? Tumor cells differ from normal cells, but it takes a very experienced doctor to judge from a biopsy. If machine learning lets a system recognize tumor cells automatically, efficiency improves greatly. Furthermore, combining the subjective (physician) and objective (model) judgments and cross-checking their results may make the conclusion more reliable.

How is this done? With a classification model. Briefly, there are two steps. First, describe each cell with a series of indicators, such as radius, texture, perimeter, area, smoothness, symmetry, and irregularity, which together form the cell feature data. Second, build a classification model on this wide table of cell features to identify tumor cells.
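As a rough illustration of the second step, here is a minimal k-nearest-neighbors (KNN) classifier, one of the classification algorithms listed earlier. The feature values and the labels "benign" and "malignant" are invented for this sketch, and the features are left unscaled for brevity; in practice they would normally be standardized first.

```python
import math
from collections import Counter

def knn_predict(train_rows, query, k=3):
    """train_rows: list of (feature_tuple, label).
    Returns the majority label among the k rows closest to query."""
    nearest = sorted(train_rows, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (radius, texture, smoothness) measurements per cell.
cells = [
    ((12.0, 14.0, 0.08), "benign"),
    ((11.5, 15.0, 0.09), "benign"),
    ((13.0, 13.5, 0.07), "benign"),
    ((20.0, 25.0, 0.12), "malignant"),
    ((19.0, 27.0, 0.11), "malignant"),
    ((21.5, 24.0, 0.13), "malignant"),
]

print(knn_predict(cells, (12.5, 14.5, 0.08)))
print(knn_predict(cells, (20.5, 26.0, 0.12)))
```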

Cases based on prediction models (predicting wine quality from chemical properties; predicting stock price fluctuations and trends from search engine data)

Judging wine quality

How is wine evaluated? Experienced people will say that taste matters most, and taste is affected by many factors, such as vintage, origin, climate, and the brewing process. Statisticians, however, do not have time to taste every wine; they believe that certain chemical properties can judge wine quality quite well. Many wineries in fact now do this, monitoring the chemical composition of the wine to control its quality and taste.

So, how is wine quality judged?

First, collect a large number of wine samples and measure their chemical properties, such as acidity, sugar, chloride content, sulfur content, alcohol, pH, and density.

Second, predict and judge wine quality and grade with a classification and regression tree model.
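To give a feel for how a regression tree works, the sketch below finds only the first split: it tries every feature and threshold and keeps the one that minimizes the sum of squared errors of the two resulting groups. A full tree would repeat this recursively on each group. The wine data and the helper names sse, best_split and predict are made up for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows):
    """rows: list of (feature_list, target).
    Returns (feature_index, threshold) with the lowest total SSE."""
    best = None
    for j in range(len(rows[0][0])):
        values = sorted({x[j] for x, _ in rows})
        # Candidate thresholds are midpoints between neighboring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for x, y in rows if x[j] <= threshold]
            right = [y for x, y in rows if x[j] > threshold]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, threshold)
    return best[1], best[2]

def predict(rows, split, x):
    """Predict the mean target of the training rows on the same side of the split."""
    j, threshold = split
    side = [y for f, y in rows if (f[j] <= threshold) == (x[j] <= threshold)]
    return sum(side) / len(side)

# Hypothetical samples: (alcohol %, volatile acidity) -> quality score.
wines = [
    ([9.0, 0.70], 4.0),
    ([9.5, 0.65], 5.0),
    ([10.0, 0.38], 5.0),
    ([11.5, 0.62], 6.0),
    ([12.0, 0.35], 7.0),
    ([12.5, 0.30], 7.0),
]

split = best_split(wines)
print(split)                                  # (0, 10.75): split on alcohol
print(predict(wines, split, [12.2, 0.33]))    # mean quality of the high-alcohol group
```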

Search engine search volume and stock price volatility

A butterfly in a South American rainforest that occasionally flaps its wings can cause a tornado in Texas two weeks later. Can Internet searches likewise affect a company's stock price?

Some time ago, it was documented that Internet keyword searches (such as "influenza") could predict an influenza outbreak one to two weeks ahead of the CDC.

Similarly, some scholars have found that changes in the Internet search volume for a company significantly affect the fluctuations and trend of its stock price; this is the so-called investor attention theory. The theory holds that the search volume for a company in search engines represents the degree of investor attention to its stock. When searches for a stock increase, investor attention to it is rising, so the stock is more likely to be bought by individual investors, which pushes the price up further and brings positive stock returns. This has been validated in numerous papers.

Case based on association analysis: Wal-Mart's beer and diapers

Beer and diapers is a very, very old story. It goes like this: Wal-Mart found an interesting phenomenon, namely that placing diapers and beer, two seemingly unrelated products, together significantly increased sales of both. The reason given is that American women usually stayed home to take care of the children, so they often asked their husbands to buy diapers on the way home from work, and the husbands would pick up beer for themselves at the same time. Wal-Mart discovered this association in its data and placed the two products side by side, greatly improving sales of both.

Beer and diapers is mainly about the association between products: if a large amount of data shows that consumers who buy product A also tend to buy product B, then there is an association between A and B. When we see two products bundled together in a supermarket, it is most likely the result of association analysis.
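The strength of an association between A and B is commonly measured with support, confidence, and lift. The sketch below computes all three for the rule "diapers -> beer" on a handful of invented shopping baskets; the function name rule_metrics and the data are assumptions made for illustration.

```python
def rule_metrics(baskets, a, b):
    """Support, confidence and lift of the rule a -> b.

    support    = share of baskets containing both a and b
    confidence = share of baskets with a that also contain b
    lift       = confidence divided by the overall share of baskets with b
    """
    n = len(baskets)
    count_a = sum(1 for basket in baskets if a in basket)
    count_b = sum(1 for basket in baskets if b in basket)
    count_ab = sum(1 for basket in baskets if a in basket and b in basket)
    support = count_ab / n
    confidence = count_ab / count_a
    lift = confidence / (count_b / n)
    return support, confidence, lift

# Made-up transactions.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
    {"milk"},
    {"bread", "chips"},
]

support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(support, confidence, lift)  # 0.375 0.75 1.5
```

A lift above 1 means that buyers of A purchase B more often than shoppers in general, which is the signal that motivates placing or bundling the two products together.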

Case based on cluster analysis: retail customer segmentation

Customer segmentation is quite common. Its function is to divide customers effectively into groups so that members within a group have similar attributes while the groups differ from one another. The purpose is to identify different customer groups and then design products and target promotions precisely for each group, saving marketing costs and improving marketing efficiency.

For example, to segment the retail customers of a commercial bank, calculate the distance between customers based on their feature variables (demographic, asset, liability, and settlement features). Then, according to that distance, group similar customers into one class to segment them effectively. Customers might be classified, for instance, as wealth-management preferers, fund preferers, demand-deposit preferers, bond preferers, risk balancers, and channel preferers.
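The distance-based grouping described above can be illustrated with a bare-bones k-means loop. This is a sketch under simplifying assumptions: two made-up features per customer, Euclidean distance, and initial centroids chosen by hand rather than at random.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    recompute each centroid as the mean of its members, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid as is
        if new_centroids == centroids:
            break  # converged: assignments no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (demand deposit balance, fund holdings), arbitrary units.
customers = [(1.0, 9.0), (2.0, 8.0), (1.5, 8.5),
             (9.0, 1.0), (8.0, 2.0), (8.5, 1.5)]

centroids, clusters = kmeans(customers, [customers[0], customers[3]])
print(centroids)
print(clusters)
```

Here the first group holds mostly funds and the second mostly deposits, loosely mirroring the "fund preferers" and "demand-deposit preferers" segments.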

Case based on outlier analysis: fraud detection in payment transactions

When you pay with Alipay or swipe a credit card, the system determines in real time whether the transaction is fraudulent. It judges by the time, place, merchant name, amount, frequency, and other elements of the transaction. The basic principle is to look for outliers. If your transaction is judged to be abnormal, it may be terminated.

Judging outliers should rely on a fraud rule base, which may contain two types of rules: event-based rules and model-based rules. Event-based rules check, for example, whether the transaction time is abnormal (a swipe in the early morning), whether the location is abnormal (a swipe away from the usual locations), whether the merchant is abnormal (a blacklisted cash-out merchant), whether the amount is abnormal (more than three standard deviations from the customer's normal mean), and whether the frequency is abnormal (dense, high-frequency swipes). Model-based rules use an algorithm to determine whether a transaction is fraudulent, usually by building a classification model from payment data, seller data, and settlement data.
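Two of the event-based rules can be sketched as simple checks. The early-morning window (00:00 to 06:00), the rule names, and the transaction history below are assumptions chosen for illustration; a real rule base would cover far more rules and tune the thresholds.

```python
import statistics

def amount_is_abnormal(history, amount, n_sigma=3.0):
    """True if amount is more than n_sigma standard deviations
    away from the mean of the customer's past amounts."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)  # sample standard deviation
    return abs(amount - mean) > n_sigma * std

def triggered_rules(history, amount, hour):
    """Return the names of the event rules that this transaction triggers."""
    rules = []
    if 0 <= hour < 6:  # assumed definition of an early-morning swipe
        rules.append("abnormal_time")
    if amount_is_abnormal(history, amount):
        rules.append("abnormal_amount")
    return rules

# Made-up past transaction amounts for one customer.
history = [120.0, 80.0, 100.0, 95.0, 110.0, 105.0, 90.0, 100.0]

print(triggered_rules(history, 5000.0, hour=3))   # both rules fire
print(triggered_rules(history, 130.0, hour=14))   # no rule fires
```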

Case based on collaborative filtering: e-commerce "guess you like" and recommendation engines

The "guess you like" feature of e-commerce sites should be familiar to everyone. When shopping on Jingdong Mall or Amazon, there are always sections such as "guess you like", "recommended for you based on your browsing history", "customers who bought this item also bought", and "what customers ultimately bought after viewing this item". These are all the output of a recommendation engine.

I really do like Amazon's recommendations; through "customers who bought this item also bought", I often find books of higher quality and wider recognition. In general, an e-commerce "guess you like" feature (that is, a recommendation engine) is based on a collaborative filtering algorithm, on top of which each site builds a rule base suited to its own characteristics. That is, the algorithm takes the choices and behavior of other consumers into account and, on this basis, builds a product similarity matrix and a user similarity matrix. From these it finds the most similar customers or the most closely associated products and completes the recommendation.
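A minimal item-based version of this idea is sketched below: each product is represented by a 0/1 vector over users, and products are compared by cosine similarity. The purchase data and function names are invented for the example, and cosine similarity is only one of several similarity measures used in practice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_item(purchases, item):
    """purchases: dict mapping user -> set of purchased items.
    Returns (best_other_item, all_similarity_scores) for the given item."""
    users = sorted(purchases)
    items = sorted(set().union(*purchases.values()))
    # One 0/1 column per item: which users bought it.
    vectors = {i: [1 if i in purchases[u] else 0 for u in users] for i in items}
    scores = {other: cosine(vectors[item], vectors[other])
              for other in items if other != item}
    return max(scores, key=scores.get), scores

# Made-up purchase histories.
purchases = {
    "u1": {"book_a", "book_b"},
    "u2": {"book_a", "book_b", "book_c"},
    "u3": {"book_a", "book_b"},
    "u4": {"book_c", "book_d"},
    "u5": {"book_d"},
}

best, scores = most_similar_item(purchases, "book_a")
print(best)
print(scores)
```

A customer viewing book_a would then be shown book_b under "customers who bought this item also bought", because the same users tend to buy both.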

Case based on social network analysis: seed customers in telecommunications

Seed customers and social networks were first studied in the telecommunications field. People's call records can sketch out their networks of relationships. In telecommunications, these networks are generally used to analyze customer influence, customer churn, and product diffusion.

Based on call logs, you can build a system of customer influence indicators. The indicators used roughly include the number of first-degree contacts, second-degree contacts, and third-degree contacts, the average call frequency, the average call volume, and so on. Social influence analysis suggests that when a high-influence customer churns, the customers associated with them tend to churn as well. For product diffusion, choosing high-influence customers as the starting point makes it easier to spread and penetrate a new package.

In addition, social network analysis has many applications and cases in banking (guarantee networks), insurance (organized fraud), and the Internet (social interaction).
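The first-degree and second-degree contact indicators can be derived from a call log by treating it as an undirected graph. The sketch below uses an invented call log and hypothetical function names; third-degree contacts would follow the same pattern with one more expansion step.

```python
from collections import defaultdict

def build_graph(calls):
    """calls: list of (caller, callee) pairs. Returns person -> set of contacts."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph

def contacts_by_degree(graph, person):
    """First-degree contacts are direct call partners; second-degree contacts
    are partners of those partners, excluding the person and first-degree ones."""
    first = set(graph.get(person, set()))
    second = set()
    for contact in first:
        second |= graph.get(contact, set())
    second -= first | {person}
    return first, second

# Made-up call records.
calls = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "dan"),
    ("cat", "dan"), ("dan", "eve"), ("bob", "ann"),
]

graph = build_graph(calls)
first, second = contacts_by_degree(graph, "ann")
print(sorted(first), sorted(second))
```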

Cases based on text analysis

Character recognition: the Scan King app

Phone cameras automatically recognize human faces, and some apps, such as Scan King, can scan a book and automatically convert the scanned content into a Word document. These are image recognition and character recognition (Optical Character Recognition). Image recognition is more complex; character recognition is relatively easier to understand.

The principle of character recognition, as far as I could find, is as follows, taking the character S as an example.

First, the character image is scaled to a standard pixel size, for example 12 * 16. Note that an image is composed of pixels; a character image consists of black and white pixels.

Second, extract the character's feature vector. How? With two-dimensional projection histograms: project the character (the 12 * 16 pixel image) onto the horizontal and vertical directions. There are 12 positions along the horizontal direction and 16 along the vertical direction. Count the black pixels in each of the 12 pixel columns and in each of the 16 pixel rows. This gives a 12-dimensional feature vector for the horizontal direction and a 16-dimensional one for the vertical direction, which together form a 28-dimensional feature vector for the character.

Third, a neural network learns from these character feature vectors in order to recognize and classify the characters.
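The feature extraction in the second step can be sketched as follows. The 12 * 16 bitmap of an "S" is hand-drawn for this example ("#" marks a black pixel), and the function name projection_features is an assumption; the neural network of the third step is not shown.

```python
def projection_features(image):
    """image: 16 rows of 12 pixels each, 1 = black and 0 = white.
    Returns 12 per-column black-pixel counts followed by 16 per-row counts."""
    column_counts = [sum(col) for col in zip(*image)]
    row_counts = [sum(row) for row in image]
    return column_counts + row_counts

# Hand-drawn 12-wide, 16-high character "S".
rows = [
    "...######...",
    "..########..",
    ".##......##.",
    ".##.........",
    ".##.........",
    "..##........",
    "...####.....",
    ".....####...",
    ".......###..",
    "........##..",
    ".........##.",
    ".........##.",
    ".##......##.",
    ".##......##.",
    "..########..",
    "...######...",
]
image = [[1 if ch == "#" else 0 for ch in row] for row in rows]

features = projection_features(image)
print(len(features))   # 28
print(features[:12])   # column projection
print(features[12:])   # row projection
```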

Literature and statistics: the authorship of Dream of Red Mansions

This is a very famous, still unresolved debate. Regarding the authorship of Dream of Red Mansions, it is generally held that the first 80 chapters were written by Cao Xueqin and the last 40 chapters by Gao E. The core question is whether the first 80 chapters and the last 40 chapters differ significantly in wording and sentence style.

This question has excited a group of statisticians. Some scholars base their judgment on word statistics: the frequencies of verbs, adjectives, adverbs, and function words, and the correlation coefficients between different parts of speech. Some scholars compare the use of function words (particles roughly equivalent to "its", "or", "also", "the", "not") to judge differences in style between the two parts. Some scholars compare how often scene elements (flowers, trees, food, medicine, and poetry) appear. All in all, the approach is to define a number of quantitative indicators and then test whether they differ significantly between the two parts, in order to judge the writing style.
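The simplest of these indicators, the relative frequency of function words in each part, can be computed as below. The two token lists are short English stand-ins, not text from the novel, and the function names are hypothetical; a real study would work on the segmented Chinese text and follow this with a significance test.

```python
from collections import Counter

def relative_frequencies(tokens, function_words):
    """Share of all tokens taken up by each function word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in function_words}

def frequency_gap(tokens_a, tokens_b, function_words):
    """Difference in relative frequency (part A minus part B) per function word."""
    freq_a = relative_frequencies(tokens_a, function_words)
    freq_b = relative_frequencies(tokens_b, function_words)
    return {w: freq_a[w] - freq_b[w] for w in function_words}

# Toy stand-ins for the two parts of a text.
part_a = "the cat and the dog and the bird".split()
part_b = "a cat or a dog or a bird".split()

gap = frequency_gap(part_a, part_b, ["the", "and", "or"])
print(gap)  # {'the': 0.375, 'and': 0.25, 'or': -0.25}
```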

Origin blog.csdn.net/sereasuesue/article/details/80437462