3. Common types of data analysis projects in data operation - "Data Mining and Data Operation Practice"

3.1 Characteristic analysis of target customers

        In the typical feature analysis of target customers, the business scenario can be either virtual feature exploration before trial operation (a simulation based on historical data from comparable sources) or analysis, mining, and refinement based on real operational data after trial operation. The goals of the two are consistent, but the thinking differs, the data sources differ, and there are also some differences in analysis techniques.

3.2 Prediction (response, classification) model of target customers

        The forecasting models here include churn early-warning models, payment forecasting models, renewal forecasting models, operational activity response models, etc. The main data mining techniques involved include logistic regression, decision trees, neural networks, support vector machines, etc. No single algorithm is optimal for building a response model in every scenario, so data analysts try a variety of algorithms, weigh the resources and value of the specific business project against the subsequent validation results, and then make a final choice.

        According to the actual response proportion in the modeling data, response models can be further subdivided into common response models and rare-event response models. Generally speaking, if the response ratio is below 1%, it should be handled as a rare-event response model. The core technique is sampling: artificially amplifying the proportion of response events in the analysis sample increases their concentration, so that the modeling can better capture and fit the relationship between the independent and dependent variables.
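As a minimal sketch of the sampling idea (pure Python; the `target_ratio`, field name `y`, and data are illustrative), undersampling the non-response majority raises the concentration of rare response events in the modeling sample:

```python
import random

def rebalance(records, is_response, target_ratio=0.1, seed=42):
    """Undersample non-responders so responders make up ~target_ratio
    of the modeling sample (one common way to amplify rare events)."""
    rng = random.Random(seed)
    responders = [r for r in records if is_response(r)]
    others = [r for r in records if not is_response(r)]
    # keep every rare responder; sample the majority class down
    n_keep = int(len(responders) * (1 - target_ratio) / target_ratio)
    sample = responders + rng.sample(others, min(n_keep, len(others)))
    rng.shuffle(sample)
    return sample

# 0.5% responders in the raw data
data = [{"y": 1}] * 50 + [{"y": 0}] * 9950
balanced = rebalance(data, lambda r: r["y"] == 1, target_ratio=0.1)
print(len(balanced), sum(r["y"] for r in balanced))  # → 500 50
```

The same idea can be run the other way (oversampling the responders); either way, the predicted probabilities must later be corrected back to the true base rate.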

        In addition to effectively predicting the probability of individual responses, the relationship between the important input variables and the target variable revealed by the model itself also has business value; for example, it can be distilled into the associated factors that accompany a response. Although this relationship is not necessarily causal and requires further analysis, it often has important reference value for data operations.

3.3 Definition of Activity of Operation Groups

There is no uniform definition of activity, but there are two most common basic points:

  • The component indicators of activity should be the core behavioral factors in this business scenario
  • An important basis for judging whether a definition of activity is appropriate is whether it effectively serves the ultimate goal of the business need.

        For example, suppose activity must be defined so that users who reach a certain activity score can be easily converted into paying users. The ultimate goal of the analysis is then to promote conversion into paying users, and an important evaluation criterion is how many actual paying users are covered by the active user group that the definition identifies.

Two main statistical techniques are involved in defining activity: one is principal component analysis, and the other is data standardization.
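A minimal sketch of how the two techniques can combine (numpy; the behavioral indicators, their scales, and the data are entirely illustrative): standardize each core indicator, then use the first principal component as the weighting vector for a single activity score.

```python
import numpy as np

def activity_scores(X):
    """X: rows = users, cols = core behavioral indicators
    (e.g. logins, session minutes, actions). Standardize each
    indicator, then weight by the first principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score standardization
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs[:, -1]                         # first principal component
    if w.sum() < 0:                            # fix sign: more activity = higher score
        w = -w
    return Z @ w

rng = np.random.default_rng(0)
X = rng.random((100, 3)) * [30, 600, 50]       # logins, minutes, actions
scores = activity_scores(X)
print(scores.shape)
```

Standardization keeps indicators with large raw scales (e.g. session minutes) from dominating; the PCA weights then reflect the shared variation of the core behaviors rather than an arbitrary hand-picked weighting.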

3.4 User Path Analysis

        User path analysis is an analysis topic unique to the Internet industry. It analyzes the rules and characteristics of how users move between web pages and discovers frequently visited path patterns. These discoveries have many business uses, including refining the mainstream paths of a specific user group, optimizing and revising web page design, predicting the next page a user may browse, and characterizing the browsing behavior of a specific group. The data used for path analysis is mainly web server log data, which is usually massive in scale. Two types of techniques are common in path analysis: one is supported by an algorithm, and the other traverses the main paths strictly in step order.
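The strict step-by-step traversal can be sketched as a sliding-window count over session page sequences (pure Python; the page names and window depth are illustrative):

```python
from collections import Counter

def frequent_paths(sessions, depth=3, top=2):
    """Count length-`depth` page sequences across user sessions,
    i.e. traverse the main paths strictly in step order."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - depth + 1):
            counts[tuple(pages[i:i + depth])] += 1
    return counts.most_common(top)

sessions = [
    ["home", "list", "item", "cart", "pay"],
    ["home", "list", "item", "item2"],
    ["home", "search", "item", "cart", "pay"],
]
print(frequent_paths(sessions))
```

On real web server logs the sessions would first have to be reconstructed (by user ID and time gap) before any counting; that sessionization step usually dominates the work.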

        In the practice of Internet data operations, integrating pure path analysis with algorithmic analysis and mining techniques produces greater application value. One integration idea is to divide users into groups via clustering and then analyze the path characteristics of each group, for example comparing the path characteristics of paying and non-paying users to optimize page layout, or using frequently occurring unusual patterns in order-payment paths to optimize the payment page design.

3.5 Cross-selling model

        Once a customer buys a product, the company will try to retain that customer, generally along two operational directions. The first is to delay customer churn: a churn early-warning model identifies the customers most likely to churn in advance, and various customer-care measures are then taken to retain them. The second is to let customers consume more goods and services, tapping customer profit while meeting customer needs. In this type of scenario, the main model involved is the cross-selling model.

        The cross-selling model finds clearly related product combinations (purchased at the same time or in a sequential order) by analyzing and mining users' historical consumption data, then builds models of consumers' purchases of these related combinations with different modeling methods, and finally uses the best of these models to predict the likelihood that new customers will purchase a particular combination of items.

        From the data mining practice of Chinese and foreign enterprises, there are at least four quite different approaches. The first uses association techniques to find commodities likely to be purchased together, also known as shopping basket analysis, and targets them for promotion and bundling, which is cross-selling. The second borrows the idea of the response model: build prediction models for certain important commodities, filter potential consumers through these models, and then run precise marketing toward the most likely top 5% of consumers. The third, still borrowing the response-model idea, combines the important commodities in pairs to find the potential customers most likely to consume them. The fourth uses the clear tree-like rules of a decision tree to discover specific rules from specific data resources.

The corresponding modeling techniques include association analysis, sequence analysis (association analysis with the order of purchases taken into account), and predictive modeling techniques such as logistic regression and decision trees.

3.6 Information Quality Model

In the e-commerce industry, the most direct and critical link between buyers and sellers is the massive catalog of product listings. It is therefore necessary to improve the quality and structure of commodity information so that its elements are complete, its layout reasonable, and its interface friendly.

The application of the information quality model in the Internet industry mainly includes optimization of product offer quality, online store quality optimization, online forum posting quality optimization, and illegal information filtering optimization.

        Sometimes the target variable for an information quality model is whether the information generates a transaction within a specific time period; the target variable is then binary (yes or no). In other cases, no clear target variable exists in the actual data, and expert scoring plus model fitting is a suitable workaround. For example, to weight the components of a product offer (title length, number of pictures, proportion of optional attributes filled in, whether there is a tiered price range, whether total supply information is filled in, whether there is an operation description, whether online third-party payment is supported), first select a sample and have industry experts score it, then use these scores as the target variable and fit the relationship between the elements and the total score with various data mining models.
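A minimal sketch of the expert-scoring workaround, with entirely hypothetical offer features and expert ratings, fitting the element weights by ordinary least squares (any data mining model could stand in for the fit):

```python
import numpy as np

# Hypothetical offer samples: title length, number of pictures,
# supports third-party payment (1/0). Scores are illustrative expert ratings.
X = np.array([
    [30.0, 5, 1],
    [10.0, 1, 1],
    [25.0, 3, 0],
    [40.0, 8, 1],
    [15.0, 2, 0],
])
expert_score = np.array([85.0, 40.0, 60.0, 95.0, 50.0])

# Fit score ≈ X·w + b by least squares; w approximates each element's weight.
A = np.hstack([X, np.ones((len(X), 1))])       # append intercept column
coef, *_ = np.linalg.lstsq(A, expert_score, rcond=None)
fitted = A @ coef
print(np.round(coef, 2))
```

Once fitted, the same weights score every new offer automatically, replacing the experts for day-to-day operation while the expert panel is only needed to refresh the training sample.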

3.7 Service Assurance Model

Examples include prompting sellers to purchase suitable value-added products, reminding sellers to renew suitable value-added products, filtering and blocking sellers' spam (low-value) information, and judging whether sellers' community posts are hot or cold.

3.8 User (Seller, Buyer) Hierarchical Model

        The hierarchical (tiering) model is a compromise between extensive operation and prediction models based on individual probability. It accommodates the need for refinement without requiring the investment of building and maintaining a prediction model, and it has great application value in the initial and strategic stages of operational analysis.

        Common scenarios: the customer service team needs, according to a hierarchical model, to provide different scripts and corresponding service packages for different groups; enterprise management needs a tiered evolution view of sellers based on their online transaction volume; the operations team needs a customer tiering model to guide the formulation and execution of the corresponding operating plan.

        Commonly used techniques for hierarchical models include statistical analysis techniques (correlation analysis, principal component analysis) and the techniques of prediction (response, classification) models. For example: discover the most important input variables through a prediction model and use their ranking to roughly divide the tiers; determine the tiering indicators and thresholds according to the business situation; establish the prediction relationship between the input variables and the tier thresholds; check whether the model's predictions cover most actual cases; and verify with actual data whether the tiering remains stable over a certain length of time.
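Once the indicator and thresholds have been fixed with the business side, the tiering rule itself can be as simple as the following sketch (the indicator, thresholds, and tier labels are all illustrative):

```python
def assign_tier(monthly_gmv, thresholds=(10000, 1000, 100)):
    """Map a seller's key indicator (e.g. monthly GMV) to a tier.
    Thresholds are illustrative and would come from the business side."""
    for idx, cut in enumerate(thresholds):
        if monthly_gmv >= cut:
            return ["A", "B", "C"][idx]
    return "D"

print([assign_tier(v) for v in (50000, 2500, 300, 20)])  # → ['A', 'B', 'C', 'D']
```

The simplicity is the point: a rule like this is cheap to maintain, easy for the operations team to explain, and can later be validated or refined against a proper prediction model.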

3.9 Seller (Buyer) Transaction Model

        The main types of analysis involved include: automatic matching (prediction) of products buyers are interested in (i.e., the product recommendation model), the transaction funnel model (finding where users drop out of the transaction flow to help improve transaction efficiency), buyer segmentation (helping to provide personalized goods and services), and optimized transaction path design (enhancing the buyer's consumption experience).

3.10 Credit Risk Model

        Credit risk here includes fraud warning, dispute warning, identification of high-risk users, etc. Compared with conventional data analysis and mining, a credit risk model has a shorter shelf life and must be updated more frequently; its timeliness and accuracy face great challenges because changes in fraudulent methods are largely random.

3.11 Product Recommendation Model

3.11.1 Product recommendation introduction

        According to different business needs, besides the main product recommendation in e-commerce, there are also query recommendation, product category recommendation, product tag recommendation, store recommendation, etc. Commonly used product recommendation models divide mainly into rule-based models, collaborative filtering, and content-based recommendation models. For the rule-based model, the commonly used algorithm is the Apriori algorithm; collaborative filtering involves the k-nearest-neighbor algorithm, factor models, and so on.

3.11.2 Association Rules (Apriori Algorithm)

Given an association rule X→Y (Y is derived from X), support and confidence are formally defined as:

Support (X→Y) = number of records containing both X and Y / total number of records in the dataset

Confidence (X→Y) = number of records containing both X and Y / number of records containing X in the dataset
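With baskets represented as sets, the two definitions above translate directly into code (the item names are illustrative):

```python
def support(records, items):
    """Fraction of records containing every item in `items`."""
    hit = sum(1 for r in records if items <= r)
    return hit / len(records)

def confidence(records, x, y):
    """Confidence(X→Y) = support(X ∪ Y) / support(X)."""
    return support(records, x | y) / support(records, x)

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
print(support(baskets, {"milk", "bread"}))       # → 0.5
print(confidence(baskets, {"milk"}, {"bread"}))  # → 0.6666666666666666
```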

Algorithmic process:

  • Compute frequent 1-itemsets: count the occurrences of each commodity and keep those meeting the minimum support, giving the candidate item set.
  • Compute frequent 2-itemsets: join the frequent 1-itemsets with themselves (i.e., form all binary combinations of commodities), count the records for each combination, and keep those meeting the minimum support.
  • Compute frequent 3-itemsets from the frequent 2-itemsets in the same way, pruning along the way: every non-empty subset of a frequent 3-itemset must itself be frequent.
  • Continue until, after pruning, the frequent n-itemsets are empty.
  • Based on the frequent itemsets, compute the association rules: for the different item combinations within each frequent itemset, form every X→Y, calculate its confidence, and remove the low-confidence rules.

3.11.3 Collaborative Filtering Algorithm

        The heuristic collaborative filtering algorithm mainly includes three steps:

  • Collect user preference information
  • Find similar items or users
  • Generate recommendations

        The input datasets of collaborative filtering are mainly user rating datasets or behavior datasets, which divide further into explicit data and implicit data. Explicit data is mainly user rating data, such as a user's rating of a product; but explicit data has certain problems, for example users rarely leave ratings and some ratings may be fraudulent, resulting in sparse or untruthful rating data. Implicit data refers to the user's click, purchase, and search behavior, which implicitly reveals the user's preference for products; but implicit data has its own problems, such as distinguishing whether a user is buying for himself or buying a gift.

(1) User-based collaborative filtering

The user-based collaborative filtering algorithm first searches for other similar users according to the user's historical behavior information, and predicts the items that the current user may like according to the evaluation information of these similar users on other items.

In collaborative filtering, an important step is to calculate the similarity between users, generally with the Pearson correlation coefficient or cosine similarity, computed over the products that both users have rated.

        Another important step is to calculate the user's predicted score for unrated items. Let s(u,u') denote the similarity between user u and user u', N the neighbor set, U the user set, r_u,i user u's rating of item i, and r̄_u user u's average rating. User u's predicted rating p_u,i for item i is:

p_u,i = r̄_u + ( Σ_{u'∈N} s(u,u') · (r_u',i − r̄_u') ) / ( Σ_{u'∈N} |s(u,u')| )
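A minimal sketch of this prediction step, using the Pearson correlation over co-rated items as s(u,u') (the user IDs, items, and ratings are illustrative):

```python
import math

def pearson(ru, rv):
    """Pearson similarity over the items both users rated."""
    common = set(ru) & set(rv)
    if len(common) < 2:
        return 0.0
    mu = sum(ru[i] for i in common) / len(common)
    mv = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu) * (rv[i] - mv) for i in common)
    den = math.sqrt(sum((ru[i] - mu) ** 2 for i in common)
                    * sum((rv[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, u, item):
    """p_u,i = mean(u) + sum s(u,u')*(r_u',i - mean(u')) / sum |s(u,u')|"""
    ru = ratings[u]
    mean_u = sum(ru.values()) / len(ru)
    num = den = 0.0
    for v, rv in ratings.items():
        if v == u or item not in rv:
            continue
        s = pearson(ru, rv)
        mean_v = sum(rv.values()) / len(rv)
        num += s * (rv[item] - mean_v)
        den += abs(s)
    return mean_u + num / den if den else mean_u

ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 4},
    "u3": {"a": 1, "b": 5, "d": 2},
}
print(round(predict(ratings, "u1", "d"), 2))  # → 4.71
```

Note how the formula deducts each neighbor's mean before weighting: this corrects for neighbors who rate systematically high or low, so even a negatively correlated neighbor (like u3 here) contributes usable signal.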

(2) Item-based collaborative filtering

        The item-based collaborative filtering algorithm predicts user ratings by computing the similarity between items. Pearson correlation or cosine similarity can also be used for item similarity; here is a formula based on conditional probability. Let s(i,j) denote the similarity between item i and item j, freq(i∧j) the frequency with which i and j co-occur, and α a damping factor used to temper the influence of popular items:

s(i,j) = freq(i∧j) / ( freq(i) · freq(j)^α )

        Next, predict the score. Let p_u,i be user u's predicted rating for item i, S the set of items similar to item i, and r_u,j user u's rating of item j:

p_u,i = ( Σ_{j∈S} s(i,j) · r_u,j ) / ( Σ_{j∈S} |s(i,j)| )
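A minimal sketch of the item-based formulas, with co-occurrence frequencies computed from illustrative baskets (here S is simply taken as all items the user has rated):

```python
def item_sim(baskets, i, j, alpha=0.5):
    """s(i,j) = freq(i and j) / (freq(i) * freq(j)**alpha);
    alpha damps the advantage of very popular items."""
    n = len(baskets)
    fi = sum(1 for b in baskets if i in b) / n
    fj = sum(1 for b in baskets if j in b) / n
    fij = sum(1 for b in baskets if i in b and j in b) / n
    return fij / (fi * fj ** alpha) if fi and fj else 0.0

def predict(baskets, user_ratings, i, alpha=0.5):
    """p_u,i = sum_j s(i,j)*r_u,j / sum_j |s(i,j)| over items u rated."""
    num = den = 0.0
    for j, r in user_ratings.items():
        s = item_sim(baskets, i, j, alpha)
        num += s * r
        den += abs(s)
    return num / den if den else 0.0

baskets = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}]
user = {"a": 5, "b": 3}          # the current user's ratings
print(round(predict(baskets, user, "c"), 2))  # → 4.0
```

In production the item-item similarities are precomputed offline, which is the usual reason item-based filtering scales better than the user-based variant when users greatly outnumber items.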

 
