Common Types of Data Analysis Projects

1 Target Customer Characteristics Analysis

        In a typical target customer characteristics analysis, the business scenario may be a hypothetical exploration of customer characteristics before a trial run (inferred by simulating from historical data sources), or it may be an analysis, mining, and refinement based on real operational data after the trial run. Both have the same goal; they differ only in approach and data source. The analysis techniques also differ somewhat.

2 Target Customer Prediction (Response, Classification) Models

        The prediction models here include churn early-warning models, payment prediction models, renewal prediction models, and response models for operational campaigns. The data mining techniques mainly involved include logistic regression, decision trees, neural networks, and support vector machines. No single algorithm is always the best in every scenario. When building a response model, the data analyst tries several algorithms, then weighs the subsequent validation results against the resources and business value of the specific project before making the final choice.

        Depending on the actual proportion of responses in the modeling data, response models can be subdivided into ordinary response models and rare-event response models. Generally, if the response rate is below 1%, the problem should be treated as a rare-event response model. The core of that treatment is sampling: artificially increasing the proportion of response events in the analysis sample, which raises the concentration of responses so that the model can better capture the relationship between the independent and dependent variables.
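The sampling idea can be illustrated with a minimal Python sketch. It assumes one particular variant, random oversampling of responders, and the names `oversample_responders`, `responded`, and `target_rate` are hypothetical and not from the original article. Undersampling non-responders is an equally common alternative.

```python
import math
import random

def oversample_responders(rows, label_key="responded", target_rate=0.1, seed=0):
    """Duplicate responders at random until they make up at least target_rate."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if not positives or not negatives:
        return list(rows)  # nothing to rebalance
    # Smallest responder count p with p / (p + len(negatives)) >= target_rate
    needed = math.ceil(target_rate * len(negatives) / (1 - target_rate))
    extra = max(0, needed - len(positives))
    resampled = positives + [rng.choice(positives) for _ in range(extra)]
    return resampled + negatives

# Toy data (made up for illustration): 5 responders among 1,000 customers (0.5%)
customers = [{"id": i, "responded": 1 if i < 5 else 0} for i in range(1000)]
balanced = oversample_responders(customers, target_rate=0.1)
rate = sum(r["responded"] for r in balanced) / len(balanced)
print(f"response rate after oversampling: {rate:.3f}")
```

A model trained on the rebalanced sample sees an inflated response rate, so its predicted probabilities should be checked against data with the original proportions.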

        Besides predicting the probability of an individual response, a prediction model also reveals important relationships between the input variables and the target variable, which has significant business value in itself. For example, these relationships can be distilled into the key factors associated with a response. Although such a relationship is not necessarily causal (establishing causality requires further in-depth analysis), it often serves as an important reference for data-driven operations.

3 Defining the Activity Level of Operational User Groups

There is no uniform description of how activity level should be defined, but two basic points are most common:

  • The indicators that make up the activity level should be the core behavioral factors of the business scenario.
  • Whether an activity definition is appropriate or meaningful is judged by whether it effectively answers the ultimate goal of the business need.

        For example, suppose activity level needs to be defined so that users who reach a certain activity score can be converted relatively easily into paying customers. The ultimate goal of this analysis is then to promote conversion to paying customers, and an important criterion for evaluating the definition is how many actual paying customers are covered by the active user group it identifies.

Defining activity level mainly involves two statistical techniques: principal component analysis and data standardization.
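One way these two techniques might fit together is sketched below: standardize each behavioral indicator, then use the projection onto the first principal component as an activity score. This is an illustrative assumption, not the article's stated procedure. The function names and toy indicators are hypothetical, the first component is found by simple power iteration, and the code assumes no indicator is constant (a constant column would cause a division by zero).

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so indicators on different scales are comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def first_component(z, iterations=200):
    """Dominant eigenvector of the covariance matrix, via power iteration."""
    n, d = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / n for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def activity_scores(rows):
    """Project standardized indicators onto the first principal component."""
    z = standardize(rows)
    v = first_component(z)
    return [sum(x * w for x, w in zip(r, v)) for r in z]

# Toy indicators per user (made up): logins, page views, messages sent
users = [[1, 2, 1], [5, 8, 4], [10, 20, 9], [2, 3, 2]]
scores = activity_scores(users)
print([round(s, 2) for s in scores])
```

Whether such a score is a good activity definition still has to be judged against the business goal, for example by checking how many paying customers fall in the high-score group.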

4 User Path Analysis

        User path analysis is an analysis topic specific to the Internet industry. It studies the patterns and characteristics of how users move between pages and identifies frequently visited paths. Discovering these paths serves many business purposes, such as distilling the mainstream paths of specific user groups, optimizing and redesigning web pages, predicting which page a user is likely to visit next, and characterizing the browsing behavior of specific groups. The data used is mainly web server log data, which is usually massive in scale. Two kinds of techniques are commonly used: one is supported by algorithms, and the other traverses the main paths strictly in step-by-step order.
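As a rough illustration of finding frequently visited paths, the sketch below counts contiguous page sequences of a fixed length across sessions. It assumes the server logs have already been grouped into ordered per-session page lists; the function name `frequent_paths` and the page names are hypothetical.

```python
from collections import Counter

def frequent_paths(sessions, length=2, min_count=2):
    """Count contiguous page sequences of the given length across all sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return {path: c for path, c in counts.items() if c >= min_count}

# Toy sessions (made up for illustration)
sessions = [
    ["home", "search", "item", "cart"],
    ["home", "search", "item"],
    ["home", "item", "cart"],
]
paths = frequent_paths(sessions, length=2, min_count=2)
for path, count in paths.items():
    print(" -> ".join(path), count)
```

On real log volumes this counting would normally run in a distributed or streaming job, but the logic per session stays the same.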

        In practical Internet data operations, combining path analysis with other analysis and mining techniques generates greater value. Examples of this combination include using clustering to divide users into groups and then running path analysis on each group; comparing the path characteristics of paying and non-paying users to optimize page layout; and using abnormal patterns in the frequent paths of the ordering and payment steps to optimize the design of the payment page.

5 Cross-Selling Models

        Once a customer has purchased a product, the company looks for ways to retain that customer. Operations generally take two directions. The first is to delay customer churn, usually with a churn prediction model that identifies in advance the customers most likely to churn, so that customer care and retention measures can be applied. The second is to get customers to consume more goods and services, tapping customer profitability while meeting customer needs. The main model for this type of scenario is the cross-selling model.

        A cross-selling model analyzes and mines users' historical consumption data to identify combinations of goods that are clearly related (they may be purchased at the same time or in sequence). Different modeling methods are then used to model the likelihood that consumers purchase these related combinations, and the best-performing model is used to predict the likelihood that new customers will purchase a specific combination.

        Judging from the data mining practice of enterprises in China and abroad, there are at least four different approaches. The first uses association techniques, also known as basket analysis, to find goods that are likely to be purchased together, which are then targeted for promotion and bundling; this is cross-selling. The second borrows the idea of response models: build a prediction model for each of a few important goods, screen potential consumers through these models, and then market precisely to the top 5% of consumers most likely to buy. The third also borrows the idea of response models, but builds models for important pairwise combinations of goods to identify the potential customers most likely to purchase them. The fourth uses the clear tree-shaped rules of a decision tree to discover specific rules from specific data resources.

Suitable modeling techniques include association analysis; sequence analysis, which adds consideration of order on top of association analysis; and predictive modeling techniques such as logistic regression and decision trees.

6 Information Quality Models

In the e-commerce industry, the most direct and critical link between buyers and sellers is the massive catalog of product listings. It is therefore necessary to improve the quality and structure of product information so that it has complete elements, a sensible layout, and a friendly interface.

In the Internet industry, information quality models are mainly applied to optimizing the quality of product offers, the quality of online shops, and the quality of forum posts, and to filtering out illegal information.

        Sometimes the target variable of an information quality model is whether the information generated a transaction within a specific period, in which case the target variable is binary (yes or no). In other cases no clear target variable exists in the actual data, and expert scoring followed by model fitting is a suitable alternative strategy. For example, to weight the constituent elements of a product offer, the factors might include title length, number of pictures, proportion of optional attributes filled in, whether there is a tiered price range, whether the available quantity is filled in, whether there are operating instructions, and whether third-party online payment is supported. First draw a sample and ask industry experts to score it, then use those scores as the target variable and fit the relationship between these factors and the total score with various data mining models.

7 Service Support Models

Examples include getting sellers to buy suitable value-added products, getting sellers to renew suitable value-added products, filtering prohibited content from sellers' commercial information, and judging whether sellers' community posts are hot or cold.

8 User (Seller, Buyer) Stratification Models

        A stratification model is a compromise between coarse-grained operations and individual-level probability prediction models. It meets the need for finer-grained operations without requiring a prediction model to be built and maintained, so it is of greater value in the early stage of data-driven operations and at the strategic level.

        Typical scenarios are as follows. The customer service team needs to offer different scripts and service packages to different groups according to the stratification model. Company management needs to track how the stratification of sellers evolves, based on the number of online transactions of its core sellers. The operations team needs a customer stratification model to guide the design and execution of the corresponding operational plans.

        Common techniques for stratification models include statistical analysis (correlation analysis, principal component analysis), and may also include prediction (response, classification) models. For example, a prediction model can be used to find and rank the most important input variables, and these variables are then used for a rough stratification. The stratification indicators and thresholds are set according to the business case, and the resulting strata are checked against actual data to see whether they cover most real cases and remain stable over a certain period of time.

9 Seller (Buyer) Transaction Models

        The main types of analysis involved include: automatically matching (predicting) the goods a buyer is interested in (that is, product recommendation models); transaction funnel analysis (finding where users drop off in the transaction process, to help improve transaction conversion); buyer segmentation (to help personalize goods and services); and transaction path optimization (to improve the buyer's experience).
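The transaction funnel idea can be sketched in a few lines of Python: given the number of users reaching each step, compute the conversion and drop-off between consecutive steps. The step names and counts below are made up for illustration, and `funnel_report` is a hypothetical helper.

```python
def funnel_report(steps):
    """steps: ordered list of (step_name, user_count); returns per-transition rates."""
    report = []
    for (prev_name, prev_count), (name, count) in zip(steps, steps[1:]):
        conversion = count / prev_count if prev_count else 0.0
        report.append({
            "transition": f"{prev_name} -> {name}",
            "conversion": conversion,
            "drop_off": 1.0 - conversion,
        })
    return report

# Toy funnel (made-up counts)
steps = [("view item", 1000), ("add to cart", 200), ("place order", 100), ("pay", 80)]
report = funnel_report(steps)
worst = max(report, key=lambda r: r["drop_off"])
for row in report:
    print(f'{row["transition"]}: {row["conversion"]:.0%} converted')
print("largest drop-off:", worst["transition"])
```

The transition with the largest drop-off is usually the first candidate for closer inspection.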

10 Credit Risk Models

        Credit risk here includes fraud early warning, dispute early warning, and identification of high-risk users. Compared with conventional data mining models, credit risk models have a shorter useful life and need to be updated more frequently. Timeliness and accuracy are a major challenge, because fraudulent behavior changes in a largely random way.

11 Product Recommendation Models

11.1 Introduction to Product Recommendation

        Depending on business needs, e-commerce recommendation covers not only the main product recommendation but also query recommendation, product category recommendation, product tag recommendation, shop recommendation, and so on. Commonly used product recommendation models fall into rule-based models, collaborative filtering, and content-based recommendation. For rule-based models, the commonly used algorithm is Apriori; collaborative filtering involves algorithms such as k-nearest neighbors and factor models.

11.2 Association Rules (Apriori Algorithm)

Given an association rule X → Y, meaning that Y is inferred from X, support and confidence are formally defined as:

Support(X → Y) = number of records containing both X and Y / total number of records in the dataset

Confidence(X → Y) = number of records containing both X and Y / number of records containing X

Algorithm steps:

  • Compute the frequent 1-itemsets. Count how many times each item appears and keep the items whose support is not less than the minimum support; this gives the frequent 1-itemsets.
  • Compute the frequent 2-itemsets. Join the frequent 1-itemsets with themselves (that is, form the pairwise combinations of items), count the records containing each pair, and keep the pairs that meet the minimum support.
  • From the frequent 2-itemsets, compute the frequent 3-itemsets in the same way, with pruning: every nonempty subset of a frequent 3-itemset must itself be frequent.
  • Continue computing and pruning in this way until the set of frequent n-itemsets is empty.
  • Compute association rules from the frequent itemsets. That is, split each frequent itemset into different combinations of items to obtain candidate rules X → Y, compute the confidence of each, and remove the rules with low confidence.
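The steps above can be sketched as a small, unoptimized Python implementation. It is meant only to make the join, prune, and rule-generation steps concrete; the function names, thresholds, and toy baskets are illustrative assumptions, and production systems would normally use a tuned library instead.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent, k = {}, 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def association_rules(frequent, min_confidence):
    """Split each frequent itemset into X -> Y and keep high-confidence rules."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for x in combinations(itemset, r):
                x = frozenset(x)
                confidence = sup / frequent[x]  # support(X and Y) / support(X)
                if confidence >= min_confidence:
                    rules.append((x, itemset - x, sup, confidence))
    return rules

# Toy baskets (made up for illustration)
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
frequent = apriori(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.9)
for x, y, sup, conf in rules:
    print(set(x), "->", set(y), f"support={sup:.2f} confidence={conf:.2f}")
```

With these toy baskets and thresholds, the only rule that survives is beer → diapers, because every basket containing beer also contains diapers.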

11.3 Collaborative Filtering Algorithms

        Heuristic collaborative filtering algorithms consist of three main steps:

  • Collect user preference information
  • Find similar items or users
  • Generate recommendations

        The input to collaborative filtering is mainly user review data or behavioral data. These datasets are divided into explicit data and implicit data. Explicit data is mainly user rating data, such as a user's ratings of products. Explicit data has certain problems: users rarely leave ratings, and some ratings may be fraudulent, which makes rating data sparse or inaccurate. Implicit data refers to users' click, purchase, and search behavior, which implicitly reveals their preferences for products. Implicit data also has problems, such as how to tell whether a user is buying for themselves or as a gift.

(1) User-based collaborative filtering (User-based)

User-based collaborative filtering first finds other users similar to the current user based on historical behavior, then uses those similar users' ratings of other items to predict which items the current user might like.

An important step in collaborative filtering is computing user similarity, typically with the Pearson correlation coefficient or cosine similarity, computed over the items that both users have rated.
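As a minimal illustration, cosine similarity restricted to co-rated items might look like the sketch below. Representing each user's ratings as a dictionary from item to rating is an assumption made for this example, as are the function and variable names.

```python
from math import sqrt

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # no co-rated items, so no evidence of similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy ratings (made up): item "c" is ignored because only one user rated it
user_a = {"a": 3, "b": 4}
user_b = {"a": 6, "b": 8, "c": 1}
print(cosine_similarity(user_a, user_b))
```

Because all ratings are positive, cosine similarity on raw ratings tends to be high for most user pairs, which is one reason Pearson correlation (which subtracts each user's mean) is often preferred.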

        Another important step is predicting a user's rating for an item the user has not yet rated. Let $s(u, u')$ denote the similarity between users $u$ and $u'$, $N$ the set of neighbors of $u$, $U$ the set of users, $r_{u,i}$ the rating of user $u$ for item $i$, and $\bar{r}_u$ the average rating of user $u$. The predicted rating $p_{u,i}$ of user $u$ for item $i$ is computed as:

$$p_{u,i} = \bar{r}_u + \frac{\sum_{u' \in N} s(u, u') \, (r_{u',i} - \bar{r}_{u'})}{\sum_{u' \in N} |s(u, u')|}$$
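The prediction formula can be made concrete with a short Python sketch. It assumes Pearson similarity computed over co-rated items and takes the neighbor set to be the k most similar users (by absolute similarity) who have rated the target item; both choices, along with the names and toy ratings, are illustrative assumptions rather than the article's specification.

```python
from math import sqrt

def mean_rating(ratings, u):
    """Average of all ratings given by user u."""
    return sum(ratings[u].values()) / len(ratings[u])

def pearson(ratings, u, v):
    """Pearson correlation of users u and v over co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)) * \
          sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, u, item, k=2):
    """Mean-centered, similarity-weighted prediction of u's rating for item."""
    candidates = [(pearson(ratings, u, v), v)
                  for v in ratings if v != u and item in ratings[v]]
    neighbors = sorted(candidates, key=lambda sv: abs(sv[0]), reverse=True)[:k]
    base = mean_rating(ratings, u)
    den = sum(abs(s) for s, _ in neighbors)
    if den == 0:
        return base  # no informative neighbors: fall back to the user's mean
    num = sum(s * (ratings[v][item] - mean_rating(ratings, v))
              for s, v in neighbors)
    return base + num / den

# Toy ratings (made up): bob agrees with alice, carol rates the opposite way
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 2, "c": 3, "d": 4},
    "carol": {"a": 1, "b": 3, "c": 2, "d": 1},
}
prediction = predict(ratings, "alice", "d")
print(round(prediction, 2))
```

In this toy case the negatively correlated neighbor still contributes useful evidence: carol rated item d below her own average, which pushes alice's prediction above alice's average.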

(2) Item-based collaborative filtering (Item-based)

        Item-based collaborative filtering computes the similarity between items and uses it to predict user ratings. Item similarity can be computed with the Pearson correlation or cosine similarity; the formula given here is based on conditional probability. Let $s(i, j)$ denote the similarity between items $i$ and $j$, $\mathrm{freq}(i \wedge j)$ the probability that $i$ and $j$ co-occur, and $\alpha$ a damping factor used mainly to balance popular and unpopular items:

$$s(i, j) = \frac{\mathrm{freq}(i \wedge j)}{\mathrm{freq}(i) \cdot \mathrm{freq}(j)^{\alpha}}$$

        Next comes rating prediction. Let $p_{u,i}$ denote the predicted rating of user $u$ for item $i$, $S$ the set of items similar to item $i$, and $r_{u,j}$ the rating of user $u$ for item $j$:

$$p_{u,i} = \frac{\sum_{j \in S} s(i, j) \, r_{u,j}}{\sum_{j \in S} |s(i, j)|}$$
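A minimal Python sketch of the item-based approach follows. It assumes co-occurrence is measured as the fraction of baskets containing the items, and it takes the similar-item set to be every item the user has already rated; these choices, the function names, and the toy data are assumptions made for illustration.

```python
def item_similarity(baskets, i, j, alpha=0.5):
    """Conditional-probability similarity: freq(i and j) / (freq(i) * freq(j)**alpha)."""
    n = len(baskets)
    freq_i = sum(1 for b in baskets if i in b) / n
    freq_j = sum(1 for b in baskets if j in b) / n
    freq_ij = sum(1 for b in baskets if i in b and j in b) / n
    if freq_i == 0 or freq_j == 0:
        return 0.0
    return freq_ij / (freq_i * freq_j ** alpha)

def predict_item_based(user_ratings, baskets, item, alpha=0.5):
    """Similarity-weighted average of the user's ratings on other items."""
    sims = {j: item_similarity(baskets, item, j, alpha)
            for j in user_ratings if j != item}
    den = sum(abs(s) for s in sims.values())
    if den == 0:
        return None  # no similar rated items, so no prediction
    return sum(s * user_ratings[j] for j, s in sims.items()) / den

# Toy data (made up): co-occurrence baskets and one user's known ratings
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
user_ratings = {"a": 5, "b": 2}
prediction = predict_item_based(user_ratings, baskets, "c", alpha=1.0)
print(prediction)
```

A larger alpha penalizes very popular items more strongly, so they contribute less to the similarity of everything they co-occur with.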

 

Source: https://www.cnblogs.com/data-science-chinchilla/p/8976920.html


Origin www.cnblogs.com/jing-yan/p/12335171.html