(Business direction) Data analysis knowledge + products

Data analysis methods

Trend analysis, comparative analysis, multi-dimensional decomposition, user-level inspection, funnel analysis, retention analysis, A/B testing, 4P theory, PESTEL analysis, SWOT analysis, 5W2H, logic trees, user behavior theory, the AARRR model


Data Index System

1 Overview

An indicator reflects a certain thing or phenomenon, describing concepts such as scale, degree, proportion, and structure at a given time and under given conditions; it usually consists of an indicator name and an indicator value.

  • Simple counting indicators: values obtained by repeatedly counting (adding 1), such as UV (Unique Visitors) and PV (Page Views)
  • Composite indicators: derived from simple counting indicators through arithmetic operations, such as bounce rate, purchase conversion rate, MAU (monthly active users), CTR = click UV / exposure UV, user retention rate = retained users / new users, and ARPU (average revenue per user)

(1) Splitting into a sum of sub-indicators by scenario

For example: DAU (daily active users) ≈ new users that day + retained users + returning users;

(2) Splitting into a product of sub-indicators according to some relationship

1) Splitting by logical relationship. For example:

  1. GMV (Gross Merchandise Volume) ≈ number of users × purchase frequency × average order value;
  2. Sales ≈ total number of users × payment rate × customer unit price;
  3. LTV (Life Time Value) = LT (Life Time) × ARPU (Average Revenue Per User);
  4. Return on investment (ROI) = annual profit (or average annual profit) / total investment × 100%

2) Splitting by time sequence.

Such as: channel recommendation effect ≈ display times x click rate x conversion rate
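As an illustration, the two multiplicative splits above can be written as trivial helper functions (all numbers below are hypothetical):

```python
def gmv(users, purchase_freq, avg_order_value):
    """GMV ≈ number of users × purchase frequency × average order value."""
    return users * purchase_freq * avg_order_value

def channel_effect(impressions, ctr, cvr):
    """Channel recommendation effect ≈ impressions × click rate × conversion rate."""
    return impressions * ctr * cvr

# Hypothetical figures: 10,000 users buying 1.2 times at 50 each,
# and a channel with 100,000 impressions, 5% CTR, 2% CVR.
print(gmv(10_000, 1.2, 50.0))               # 600000.0
print(channel_effect(100_000, 0.05, 0.02))  # 100.0 conversions
```

Splitting a headline metric this way makes it clear which factor to move: the same GMV lift can come from more users, higher frequency, or a larger basket.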

2. Indicators of various industries

2.1 Advertising cost indicators

  • CPC (Cost Per Click): advertising cost per click
  • CPA (Cost Per Action): pay per action (actual advertising effect)
  • CPM (Cost Per Mille): cost per thousand impressions = total spend / impressions × 1000

Comparison of the three:

  1. CPM charges at the first step: the advertiser pays as soon as the ad is shown to the audience.
  2. CPC charges at the second step: the advertiser pays when a user sees the ad and clicks it.
  3. CPA charges at the third step: the advertiser pays only when, after clicking and learning more about the offer, the user completes a specific action such as filling in a form, registering, downloading, or purchasing.
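A minimal sketch of the three pricing formulas, using an invented campaign (all figures hypothetical):

```python
def cpm(total_cost, impressions):
    """Cost per mille: total spend / impressions × 1000."""
    return total_cost / impressions * 1000

def cpc(total_cost, clicks):
    """Cost per click."""
    return total_cost / clicks

def cpa(total_cost, actions):
    """Cost per action (registration, download, purchase, ...)."""
    return total_cost / actions

# Hypothetical campaign: 500 spend, 200,000 impressions, 1,000 clicks, 50 signups.
print(cpm(500, 200_000))  # 2.5
print(cpc(500, 1_000))    # 0.5
print(cpa(500, 50))       # 10.0
```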

2.2 Game industry indicators

ARPPU (Average Revenue Per Paying User): the average revenue per paying user of the app over a given period.

2.3 Retail industry

Sell-through (dynamic sales) rate = number of SKUs with sales / total number of SKUs in the warehouse

Customer unit price (average transaction value) = sales / number of customers

Joint (attachment) rate = total units sold ÷ number of sales receipts = average number of items per transaction

Sell-out rate = cumulative sales ÷ total purchases
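The retail formulas above, sketched as helper functions with invented numbers:

```python
def avg_transaction_value(sales, customers):
    """Customer unit price = sales / number of customers."""
    return sales / customers

def attachment_rate(units_sold, receipts):
    """Joint rate = total units sold / number of receipts (items per basket)."""
    return units_sold / receipts

def sell_through_rate(units_sold, units_purchased):
    """Sell-out rate = cumulative sales / total purchases."""
    return units_sold / units_purchased

# Hypothetical store week: 12,000 revenue, 300 receipts, 900 items sold
# out of 1,500 purchased from suppliers.
print(avg_transaction_value(12_000, 300))  # 40.0
print(attachment_rate(900, 300))           # 3.0 items per transaction
print(sell_through_rate(900, 1_500))       # 0.6
```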

3. Indicator scaling

  • Saaty's 1-9 scale method: pairwise judgments of relative importance

4. Methods for establishing an indicator system

5. Building reports

6. Data reports

  1. Clarify the purpose of the analysis
  2. Break down the indicators to locate problems
  3. Break down the problems
  4. Expand the dimensions to explore differences in the indicators
  5. Write and polish the report


Indicator analysis for user profiling

User labels can be divided by label type into statistical, rule-based, and machine-learning-mined categories. By label dimension, they fall into common groups such as user attributes, user behavior, user consumption, and risk control.

RFM model

  • R (recency): time since the user's last purchase. R is reverse-scored: the larger R is, the lower the user's value.
  • F (frequency): how often the user purchases within a period (the key point is how the period is defined). The larger F is, the higher the user's value.
  • M (monetary): the user's spending amount, a direct measure of value contribution. The larger M is, the higher the user's value.
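A minimal, hypothetical RFM scoring sketch: each dimension is binned into scores 1-3 (the thresholds below are arbitrary examples, not standard values), with recency reverse-scored as described above:

```python
from datetime import date

def rfm_score(last_purchase, frequency, monetary, today=date(2024, 1, 1)):
    """Bin R, F, M into 1-3 scores (3 = highest value); thresholds are illustrative."""
    recency_days = (today - last_purchase).days
    r = 3 if recency_days <= 30 else 2 if recency_days <= 90 else 1   # reverse-scored
    f = 3 if frequency >= 10 else 2 if frequency >= 3 else 1
    m = 3 if monetary >= 1000 else 2 if monetary >= 200 else 1
    return r, f, m

# A recent, frequent, big spender vs. a lapsed, one-off, small spender:
print(rfm_score(date(2023, 12, 20), frequency=12, monetary=1500))  # (3, 3, 3)
print(rfm_score(date(2023, 6, 1), frequency=1, monetary=50))       # (1, 1, 1)
```

In practice the 3×3×3 score combinations are then mapped to segments such as "important value users" or "at-risk users".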
     

Retention analysis

Retention analysis model = "retention rules" + "filter conditions" + "table data display" + "visual data display" + "operation"


Goal decomposition — turning business goals into design goals

1. Behavior path analysis — studying user behavior data

The goal is decomposed along the user's behavior path (a visualization of what users click and browse), from which design levers for reaching the goal can be found.

The difficulty of this method is that it requires deep familiarity with the business and detailed knowledge of every user path. In practice you can "grasp the big and let go of the small": sort out the user's main path, study it, and temporarily set the side paths aside. For example, a user may need to pass through pages A→B→C→D→E→F to reach goal G; compare the UV of each page to find the leakiest step and optimize it.

2. Formula analysis method - a relatively open method

3. Data layering method - a more divergent method

  • User path data

  • User portrait data

  • Product data


Business analysis models

1. The 4P model

Product, Price, Place, and Promotion.

  • The first P, product: pay attention to the product's functions; the product should have unique selling points, with its functional appeal put first;
  • The second P, price: set different pricing strategies for different market positionings; pricing is grounded in the company's brand;
  • The third P, place (channel): the enterprise does not face consumers directly, but focuses on cultivating distributors and building a sales network;
  • The fourth P, promotion: promotional activities such as discounts or buy-one-get-one-free.

2. Boston matrix (market growth rate - relative market share matrix)

Analyzes and determines a company's product portfolio using sales growth rate (an indicator of market attractiveness) and market share (an indicator of the company's strength).

Business models

Big customer model, direct sales model, distribution model, free model, conference marketing model, community model, experiential marketing model, scene marketing model, community model


Analysis models

1. Funnel analysis

A funnel, simply put, abstracts a process in a website or app so that the conversion and loss at each step of the process can be observed.

The three elements of the funnel:

  • Time:
    The funnel's conversion cycle, i.e. the total time needed to complete every layer of the funnel. In general, the shorter the conversion cycle, the better, especially in industries with long cycles such as online education or B2B e-commerce. Looking at the time spent in each layer separately can also reveal problems: for example, if traffic from a certain channel spends a suspiciously uniform amount of time in one layer, that channel's traffic is likely abnormal.
  • Node:
    Each layer of the funnel is a node. The core indicator for a node is its conversion rate: conversion rate = traffic passing through this layer / traffic reaching this layer.
    The conversion rate of the whole funnel and of each layer clarifies the optimization direction: find the node with a low conversion rate and improve it.
  • Traffic:
    Traffic means people, and different groups inevitably behave differently in the same funnel. In Taobao's shopping funnel, for example, men and women convert at different rates, as do the young and the old.
    Classifying the crowd lets us check the conversion rate of a specific group quickly and locate problems more precisely.
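The node conversion formula above can be sketched on a hypothetical five-step funnel:

```python
# Hypothetical purchase funnel: traffic reaching each node.
funnel = [("visit", 10_000), ("view_item", 6_000), ("add_to_cart", 1_500),
          ("checkout", 600), ("pay", 480)]

def step_conversion(funnel):
    """Per-node conversion rate = traffic passing this layer / traffic reaching it."""
    return [(name, count / prev_count)
            for (_, prev_count), (name, count) in zip(funnel, funnel[1:])]

for name, rate in step_conversion(funnel):
    print(f"{name}: {rate:.0%}")       # view_item 60%, add_to_cart 25%, ...
overall = funnel[-1][1] / funnel[0][1]
print(f"overall: {overall:.1%}")       # overall: 4.8%
```

Here the leakiest node is add_to_cart (25%), so that step would be the first optimization target.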

1.1 The AARRR traffic funnel, also known as the pirate model

The five stages of the user's life cycle around a product:

  1. Acquisition: getting users
  2. Activation: raising user activity
  3. Retention: improving the retention rate
  4. Revenue: generating income
  5. Referral: self-propagation

Main indicators of concern at different stages:

  1. Daily new users
  2. The number of registered people, the number of completed novice tutorials, the number of people who have used the product at least once, and the number of subscriptions
  3. User engagement, time since last login, daily/monthly active users, churn rate
  4. Average revenue per user (ARPU), payment rate (PR or PUR), active paying accounts (APA), average revenue per paying user (ARPPU), lifetime value (LTV)
  5. K factor: K = (number of invitations each user sends to friends) × (conversion rate of invitees into new users)
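The K-factor formula, sketched with hypothetical numbers (with K < 1, each invitation wave shrinks, so growth from referrals alone stalls; K > 1 means viral growth):

```python
def k_factor(invites_per_user, invite_conversion_rate):
    """K = invitations sent per user × conversion rate of invitees into new users."""
    return invites_per_user * invite_conversion_rate

def users_after_cycles(seed_users, k, cycles):
    """Total users after several referral cycles starting from a seed cohort."""
    total = seed_users
    cohort = seed_users
    for _ in range(cycles):
        cohort = cohort * k   # each cohort invites the next one
        total += cohort
    return total

print(k_factor(5, 0.1))                  # 0.5
print(users_after_cycles(1000, 0.5, 3))  # 1000 + 500 + 250 + 125 = 1875.0
```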

2. Attribution Model

Precisely described, attribution is an agreed rule: according to the product's actual needs, the credit for reaching the goal (a conversion) is distributed across the touchpoints on the path using preset weights. A single conversion may involve many touchpoints, and a conversion need not be a sale: a registration, or even a visit, can count as a conversion, depending on the business.

  • First-touch attribution: suitable for companies whose brand is not yet well known; focusing on the channel that first brings in customers helps expand the market;
  • Last-touch attribution: suitable for businesses with few conversion paths and short cycles. Last-touch and first-touch are both single-channel attribution models;
  • Linear attribution: distributes credit equally across all touchpoints in the lookback window. The advantage is that no channel weights are needed and all channels are treated equally; the disadvantage is that high-quality channels get averaged down. It suits companies that want to stay in touch with customers and maintain brand awareness throughout the sales cycle, giving each channel an equal role in the customer's consideration process;
  • Time-decay attribution: among all touchpoints in the window, those closer to the conversion get more credit; suitable when the customer decision cycle and sales cycle are short;
  • Position-based attribution: focuses on the channel that first brought the lead and the one that finally closed the deal; a company that values both can choose this model. It combines first-touch, last-touch, and linear attribution: the first and last touchpoints each receive 40% of the credit, and the middle touchpoints share the remaining 20% equally;
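A sketch of two of the rules above, linear and position-based, on a hypothetical touchpoint path; the channel names are invented and the 40/20/40 split follows the description above:

```python
# Hypothetical path of touchpoints ending in one conversion worth 1.0 credit.
path = ["search_ad", "email", "social", "direct"]

def linear_attribution(path):
    """Every touchpoint gets an equal share of the credit."""
    share = 1.0 / len(path)
    return {ch: share for ch in path}

def position_attribution(path, first=0.4, last=0.4):
    """Position-based: 40% first touch, 40% last touch, middle splits the rest."""
    credit = dict.fromkeys(path, 0.0)
    credit[path[0]] += first
    credit[path[-1]] += last
    middle = path[1:-1]              # assumes at least 3 touchpoints
    for ch in middle:
        credit[ch] += (1.0 - first - last) / len(middle)
    return credit

print(linear_attribution(path))    # every channel gets 0.25
print(position_attribution(path))  # search_ad/direct ~0.4, email/social ~0.1
```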

3. Cohort analysis

Cohort analysis, also known as group analysis, is a common method in data analysis. The general process is to divide the data into consecutive, comparably defined groups (cohorts), perform the same analysis on each group, and compare the results across groups to reach a conclusion.

For example: compare the income of people born in the 1970s, 1980s, and 1990s at ages 20, 30, 40, and 50; or track the retention rate of each day's newly registered users over the following N days.

  • Commodity Cohort: Commodity LTV Model
  • User Cohorts: User Retention Models
  • Channel Cohorts: A Model for Channel Quality Analysis
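A minimal user-cohort retention sketch: group users by signup day, then apply the same retention computation to each cohort (user ids and activity below are invented):

```python
# Hypothetical data: user ids that signed up / were active on each day.
signups = {"day1": {1, 2, 3, 4}, "day2": {5, 6, 7, 8, 9}}
active  = {"day2": {1, 2, 5, 6, 7, 8, 9}, "day3": {1, 5, 6}}

def retention(cohort_users, active_users):
    """Share of a signup cohort that shows up in a later day's active set."""
    return len(cohort_users & active_users) / len(cohort_users)

# Next-day retention, computed identically for each cohort:
print(retention(signups["day1"], active["day2"]))  # 0.5
print(retention(signups["day2"], active["day3"]))  # 0.4
```

Laying these numbers out per cohort and per day gives the familiar triangular retention table.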

4. AHP Analytical Hierarchy Process

The Analytic Hierarchy Process simplifies complex problems and is computationally light; it is widely used in personnel assessment, multi-option comparison, evaluation of scientific and technological achievements, work-effectiveness evaluation, and many other fields. It is a multi-indicator comprehensive evaluation algorithm with two typical purposes:

  • Indicator weighting: for a given decision, the (subjective) importance of each factor differs; AHP derives weights for these indicators without collecting data
  • Quantitative option selection: combining the weighted factors, AHP computes a quantitative score for each candidate option

The core of hierarchical single ranking has roughly two steps.

Compute the weight vector of the judgment matrix:

  • Square-root method: multiply the entries of each row, take the n-th root, then normalize the resulting vector to obtain the weight vector
  • Sum method: normalize each column of the matrix, sum each row, then normalize the row sums

Check for consistency:

Linear-algebra background:

Theorem 1: If A is a consistency matrix, then the largest eigenvalue of A is λ_max = n, where n is the order of A, and all other eigenvalues of A are 0.

Theorem 2: An n-th order positive reciprocal matrix is a consistency matrix if and only if its largest eigenvalue λ_max = n; when the reciprocal matrix is inconsistent, necessarily λ_max > n.

Define the consistency index CI = (λ_max − n) / (n − 1). The larger CI is, the more inconsistent the matrix.

The largest eigenvalue can be computed from the weight vector as λ_max ≈ (1/n) Σ_i (A·w)_i / w_i, where A is the judgment matrix and w is the normalized weight vector.

To judge whether CI is small enough, a random consistency index RI is introduced: RI is the average consistency index of 1000 randomly constructed positive reciprocal matrices of the same order, and is simply looked up in a table.

(RI by matrix order n, from Saaty's table: n = 1, 2 → 0; 3 → 0.58; 4 → 0.90; 5 → 1.12; 6 → 1.24; 7 → 1.32; 8 → 1.41; 9 → 1.45.)

Finally, compute the consistency ratio CR = CI / RI; when CR < 0.1, the consistency test is passed.

For the hierarchical total ranking, consistency is checked analogously: its consistency ratio is CR = (Σ_j a_j·CI_j) / (Σ_j a_j·RI_j), where a_j are the weights of the criterion layer and CI_j, RI_j are the indices of each sub-matrix.
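The square-root method and consistency check above, sketched in pure Python on a hypothetical 3×3 judgment matrix (RI = 0.58 for n = 3, from Saaty's table):

```python
import math

# Hypothetical pairwise judgment matrix on Saaty's 1-9 scale.
A = [[1,   3,   5],
     [1/3, 1,   3],
     [1/5, 1/3, 1]]
n = len(A)

# 1. Square-root method: geometric mean of each row, then normalize -> weights.
geo = [math.prod(row) ** (1 / n) for row in A]
w = [g / sum(geo) for g in geo]

# 2. lambda_max ~= mean of (A w)_i / w_i.
Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
lam = sum(Aw[i] / w[i] for i in range(n)) / n

# 3. Consistency index and ratio.
CI = (lam - n) / (n - 1)
CR = CI / 0.58
print([round(x, 3) for x in w], round(CR, 3))  # weights ~[0.637, 0.258, 0.105]
```

Since CR comes out well below 0.1, this judgment matrix passes the consistency test.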

5. Time series models

5.1 AR(p) model

AR stands for Auto-Regression. An ordinary regression uses x to predict y, where x and y are generally different variables; auto-regression, as the name implies, regresses the series on itself, so both sides are the same time series. The model is:

X_t = δ + φ_1·X_(t−1) + … + φ_p·X_(t−p) + μ_t

Here X_t is the value at period t, determined by the values of the previous p periods; δ is a constant term, equivalent to the intercept in ordinary regression, and μ_t is the random error.

5.2 MA(q) model

MA stands for Moving Average. The model is:

X_t = μ + u_t + θ_1·u_(t−1) + … + θ_q·u_(t−q)

Here X_t is the value at period t, determined by the random errors of the previous q periods; μ is a constant term, equivalent to the intercept in ordinary regression, and u_t is the current period's random error. The core idea of the MA model is that each period's random shock carries over: the combined effect of the previous q shocks, together with the current shock, determines the value at period t.

5.3 ARMA(p,q) model

The ARMA model combines the two models above: the value at period t is related both to the values of the previous p periods and to the errors of the previous q periods, and these two parts jointly determine it. The model is:

X_t = δ + φ_1·X_(t−1) + … + φ_p·X_(t−p) + u_t + θ_1·u_(t−1) + … + θ_q·u_(t−q)

5.4 ARIMA(p,d,q) model

The ARIMA model modifies ARMA: where ARMA models the value of period t directly, ARIMA models the differenced series. Taking successive differences (X_t − X_(t−1)), repeated d times, is called d-th order differencing. The model is:

w_t = δ + φ_1·w_(t−1) + … + φ_p·w_(t−p) + u_t + θ_1·u_(t−1) + … + θ_q·u_(t−q)

Here w_t is the result of d-th order differencing at period t; the form is the same as ARMA, with X replaced by w.

When the data form a stationary time series, the first three models can be used; when they are non-stationary, the last one applies: ARIMA converts the non-stationary series into a stationary one by differencing.

5.5 ARIMA steps

1. Plot the time series and check stationarity; difference non-stationary data until the series is stationary.
2. Run a white-noise test on the stationary series. White noise is a purely random stationary sequence with zero mean and constant variance; a white-noise series carries no structure left to model.
3. If the series is stationary and not white noise, compute the ACF (autocorrelation) and PACF (partial autocorrelation) to identify the ARIMA order.
4. Estimate the parameters of the identified model, forecast the series, and evaluate the results.
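A sketch of steps 1-4 on synthetic data: a random walk with drift is non-stationary, one round of differencing makes it stationary, after which a simple AR(1) coefficient can be estimated by least squares. This is a stand-in for full ARIMA fitting, which in practice would use a library such as statsmodels:

```python
import random

# Synthetic non-stationary series: random walk with drift.
random.seed(0)
x, level = [], 10.0
for _ in range(300):
    level += 0.5 + random.gauss(0, 1)
    x.append(level)

def difference(series, d=1):
    """Apply first differencing d times (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def fit_ar1(series):
    """Least-squares estimate of phi in x_t = c + phi * x_{t-1} + e_t."""
    xs, ys = series[:-1], series[1:]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))

w = difference(x, d=1)   # the differenced series is i.i.d. noise around the drift
phi = fit_ar1(w)
print(round(phi, 2))     # close to 0: the differences show no AR structure
```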
 

6. Factor Analysis

Principal component analysis uses linear combinations of the variables to generate the same number of components, then keeps an appropriate number of them so as to retain as much of the overall information as possible. Factor analysis instead looks for the common factors that drive the variables, reducing variables with complex relationships to a few factors that reproduce the internal relationships among the original variables. These factors are latent, unobservable random variables.

Exploratory factor analysis does not assume in advance how many factors lie behind a set of variables or how they relate; the method is used to discover the factors and relationships.

Confirmatory factor analysis assumes in advance that certain factors lie behind the variables, and tests whether this assumption holds.

6.1 Steps

  1. Standardize the raw data X
  2. Compute the eigenvalues λ and eigenvectors U of the correlation matrix C
  3. Determine the number of common factors k
  4. Construct the initial factor loading matrix A = U_k·Λ_k^(1/2), where Λ_k holds the k largest eigenvalues and U_k the corresponding eigenvectors
  5. Build the factor model
  6. Apply a rotation to the initial loading matrix A. Rotation simplifies the structure of the loading matrix and clarifies relationships, making the factors easier to interpret: if the factors are assumed uncorrelated, use varimax (maximum-variance) orthogonal rotation; if correlated, use oblique rotation. The rotation yields a more ideal loading matrix A′.
  7. Express the factors as linear combinations of the variables; the coefficients can be obtained by least squares.
  8. Compute factor scores.

7. Correspondence Analysis

7.1 Introduction

In factor analysis, R-type and Q-type analyses target different objects: R-type factor analysis studies the correlation between variables (indicators, the columns), while Q-type factor analysis studies the correlation between samples (the rows). The two are traditionally set against each other, with samples and variables treated separately, so R-type and Q-type factor analysis cannot be performed at the same time. This is a major limitation of factor analysis.

Correspondence analysis, also called association analysis or R-Q factor analysis, overcomes this shortcoming: it integrates the advantages of R-type and Q-type factor analysis and processes the rows and columns of a cross-contingency table together. Using the idea of dimensionality reduction to simplify the data structure, it seeks to represent the relationship between rows and columns in a low-dimensional plot. It is a multivariate statistical method especially suited to studying multi-category attribute variables, widely used in market analysis, product positioning, advertising research, sociology, and so on.

8. DuPont Analytics (financial)

DuPont analysis (also known as the DuPont identity or DuPont model) breaks return on equity (ROE) into its drivers, letting investors focus on the key components of financial performance individually to identify strengths and weaknesses.

DuPont analysis expands the return-on-equity formula: ROE = net profit margin × asset turnover × equity multiplier.

Three financial metrics therefore drive ROE: operating efficiency, asset-use efficiency, and financial leverage. Operating efficiency is expressed as the net profit margin, i.e. net profit divided by total sales or revenue. Asset-use efficiency is measured by asset turnover. Leverage is measured by the equity multiplier, equal to average assets divided by average equity.

Application of DuPont analysis method in the actual financial statements of enterprises:

  • First, the net profit margin reflects the company's profitability.
  • Second, asset turnover reflects the company's operating efficiency.
  • Third, the equity multiplier reflects the company's leverage, and hence its solvency profile.
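The three-factor decomposition, sketched with hypothetical financials:

```python
def dupont_roe(net_income, revenue, total_assets, total_equity):
    """ROE = net profit margin × asset turnover × equity multiplier."""
    net_margin = net_income / revenue
    asset_turnover = revenue / total_assets
    equity_multiplier = total_assets / total_equity
    return net_margin, asset_turnover, equity_multiplier

# Hypothetical company: 120 net income, 1,000 revenue, 800 assets, 400 equity.
m, t, e = dupont_roe(net_income=120, revenue=1_000,
                     total_assets=800, total_equity=400)
print(m, t, e)      # 0.12 1.25 2.0
print(m * t * e)    # ROE = 0.3, same as net_income / equity = 120 / 400
```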


Analysis tools

1. Heat map analysis

Heat maps record users' mouse behavior and present it visually, helping optimize the layout of a website.

  • Mouse Move Heatmap
  • Mouse Click Heatmap
  • Mouse Scroll Heatmap
  • Link Heatmap


Checking analysis results

1. Consistency checks

  • Kappa test
  • ICC (intraclass correlation coefficient)
  • Kendall's W coefficient of concordance


Interview questions

Ⅰ. Number law

Ⅱ. Using an A/B test to evaluate an algorithm (business question)

1) Requirement

A shopping app has recently optimized the recommendation algorithm of its "Guess You Like" module, hoping to further improve recommendation accuracy and increase sales. The effect of the new algorithm is to be evaluated with an A/B test: 50% of users keep the original recommendation algorithm (control group) and 50% use the new one (experimental group). As the data analyst for this experiment, how would you evaluate the performance of the two groups? List the three most important indicators in order of importance and give your analysis process.

2) Approach

Indicators: sales of recommended products, click-through rate of recommended products, conversion rate of recommended products

Analysis process:

  1. Let the null hypothesis be that each indicator decreases or stays unchanged under the new recommendation algorithm; the alternative hypothesis is that the indicator increases.
  2. Choose a significance level of 5%, and determine the sample size and test duration from the expected lift in the indicator.
  3. Split the samples properly, launch the A/B test, and collect the data.
  4. Use a hypothesis test (e.g. a t-test) to compute the p-value and verify the effect.
  5. Draw conclusions: if p < 5%, reject the null hypothesis in favor of the alternative, i.e. the indicator improved under the new algorithm; otherwise the null cannot be rejected and the improvement is not demonstrated.
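Step 4 asks for a p-value; for a conversion-rate indicator this is often done with a one-sided two-proportion z-test (a close cousin of the t-test mentioned above), sketched here with hypothetical counts:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """One-sided z-test of H0: treatment rate <= control rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z > z)
    return z, p_value

# Hypothetical result: control 2.0% CVR, treatment 2.4% CVR, 50k users each.
z, p = two_proportion_ztest(1_000, 50_000, 1_200, 50_000)
print(round(z, 2), p < 0.05)  # large z, p well below 5% -> reject H0
```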


Product analysis

Product Requirements Document (PRD)

A PRD describes product requirements in detail, in a practical and implementable manner.

It includes the business flow chart, feature structure diagram, detailed feature descriptions, interface prototypes, etc.

Competitive Analysis

By analyzing competitors' products, discover their pain points and better explore and satisfy user needs.

Steps: collect basic data on competing products, manage the analysis process, analyze the competing products, and present the findings.

Five Elements of User Experience

The difference between ToB and ToC products

| Difference | ToB | ToC |
| --- | --- | --- |
| Business model | Usually contract-based: customers pay for the product | Free to try; monetization is usually indirect, via traffic |
| Usage scenarios | Relatively simple, mostly office scenarios | Many and complex; usage is fragmented and random |
| Business form | Mostly flat feature sets that can be sold separately | One core function, extended across multiple dimensions |
| Replacement cost, user stickiness | High replacement cost, long customization and deployment cycle, high stickiness | Poor usability or experience drives users away; low stickiness |
| Product capability | Emphasis on business process logic and negotiation/coordination | Emphasis on user models, transaction models, etc. |
| Data analysis | Market share, number of merchants served, renewal rate, etc. | Active users, user growth rate, conversion rate, etc. |
| Relationship with sales | Strongly tied to sales; must cooperate with the sales team | No direct sales team; usually an operations team |
| Scalability | Weak: grows account by account | Strong: can scale from point to surface |

Data collection

Code-based tracking: when the app or website loads, the third-party analytics SDK is initialized; when an event occurs, the corresponding reporting interface in the SDK is called to send the data. Highly flexible, but high labor cost.

Visual tracking: framework-level tracking in which business staff define tracked events by simply circling elements on the page through a visual interface. Low labor cost, but limited flexibility.

Codeless (full/auto) tracking: once the SDK is integrated, it captures and monitors all user behavior in the app and reports everything, with no extra code from developers; the full volume of data is collected.

Operations

Channel operations: using every available resource and traffic source to acquire new users for the product; free and paid channels, traffic exchanges, accumulated network resources, product appeal, insider recommendations, planned activities, content marketing, user word of mouth, and other means can all be directions for channel operations.




Origin blog.csdn.net/m0_64768308/article/details/124602205