R language data analysis notes - Cohort retention analysis


About the Author

Du Yu, member of the EasyCharts team and columnist for the R language Chinese community. Interests: Excel business charts, R language data visualization, and geographic data visualization.

Personal public account: Data Little Rubik's Cube (WeChat ID: datamofang), founder of "Data Little Rubik's Cube".




I believe anyone who does data analysis regularly has heard of cohort analysis. It is especially common in Internet operations, where it is used for scenarios such as customer retention analysis. In the past, most of this analysis was done with SQL and Excel.

While learning cohort retention analysis recently, I found a complete Python implementation written by a data analysis enthusiast abroad, who also thoughtfully provided the practice data. As a junior analyst who is more proficient in R than in Python, my first thought was naturally how to translate this code into R.

http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/

In the end, it took me a day of hard work to reproduce this cohort analysis in R, and I am sharing it here for anyone who is interested. Please forgive any shortcomings in the code; it is only a demo and has not been wrapped into functions yet.

library('xlsx')
library('ggplot2')
library('dplyr')
library('magrittr')
library('tidyr')
library('reshape2')

1. Data import:

setwd("D:/R/File/")
df <- read.xlsx('relay-foods.xlsx', sheetName = 'Purchase Data')


2. Data cleaning:

Retention analysis only needs a few fields, mainly the purchase date and the user ID. To analyze monthly retention, the date is truncated to year-month form, and the orders are grouped by user ID to find each user's first purchase date. The code is as follows:

2.1 Create the purchase month field

df$OrderPeriod = format(df$OrderDate,'%Y-%m')   # purchase month (year-month)
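As a quick illustration of the truncation step, here is a minimal base R sketch. The dates are made up for the example and are not taken from the practice data.

```r
# Toy order dates (hypothetical, for illustration only)
order_dates <- as.Date(c("2009-01-11", "2009-01-20", "2009-02-03"))

# format() with '%Y-%m' keeps only the year and month as a character string
order_period <- format(order_dates, "%Y-%m")
print(order_period)
# [1] "2009-01" "2009-01" "2009-02"
```

Because the result is a zero-padded string, sorting it alphabetically also sorts it chronologically, which the later grouping steps rely on.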

2.2 Create a user's first purchase field

# Compute each user's first purchase date
CohortGroup <- df %>%
  group_by(UserId) %>%
  summarize(CohortGroup = min(OrderDate))

# Reduce the first purchase date to year-month, then join it back onto the original order table
CohortGroup$CohortGroup <- CohortGroup$CohortGroup %>% format('%Y-%m')
df <- df %>% left_join(CohortGroup, by = 'UserId')
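The first-purchase step can also be checked on a tiny table. The sketch below uses base R only and a made-up `orders` data frame; it is meant to show what the `CohortGroup` column should look like after the join, not to replace the dplyr code.

```r
# Hypothetical orders: three users, five purchases
orders <- data.frame(
  UserId    = c(1, 1, 2, 2, 3),
  OrderDate = as.Date(c("2009-01-05", "2009-03-10", "2009-02-14",
                        "2009-02-20", "2009-03-01"))
)

# ave() returns, for every row, the minimum order date of that row's user
first_date <- as.Date(
  ave(as.numeric(orders$OrderDate), orders$UserId, FUN = min),
  origin = "1970-01-01"
)

# Truncate to year-month: this is the cohort each order belongs to
orders$CohortGroup <- format(first_date, "%Y-%m")
print(orders)
```

Every order of a given user carries the same cohort label, regardless of when the order itself was placed.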

2.3 Group by first purchase month and purchase month, then calculate the number of users, the number of orders, and the total payment amount (users and orders are counted as distinct IDs)

chorts <- df %>%
  group_by(CohortGroup, OrderPeriod) %>%
  summarize(
    UserId = n_distinct(UserId),
    OrderId = n_distinct(OrderId),
    TotalCharges = sum(TotalCharges)
  ) %>%
  rename(TotalUsers = UserId, TotalOrders = OrderId)

2.4 Within each cohort, number the purchase months sequentially

chorts <- chorts %>%
  arrange(CohortGroup, OrderPeriod) %>%
  group_by(CohortGroup) %>%
  mutate(CohortPeriod = row_number())
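One thing worth knowing about sequential numbering: it counts the months in which a cohort actually has orders, so a month with no orders shifts every later label forward. The base R sketch below uses made-up data and a hypothetical helper, `month_index()`, to compare the sequential label with a calendar-based one.

```r
# Hypothetical aggregated table; cohort 2009-01 has no orders in 2009-03
cohorts <- data.frame(
  CohortGroup = c("2009-01", "2009-01", "2009-01", "2009-02", "2009-02"),
  OrderPeriod = c("2009-01", "2009-02", "2009-04", "2009-02", "2009-03"),
  stringsAsFactors = FALSE
)
cohorts <- cohorts[order(cohorts$CohortGroup, cohorts$OrderPeriod), ]

# Sequential label: same idea as row_number() within each cohort
cohorts$CohortPeriod <- ave(seq_len(nrow(cohorts)), cohorts$CohortGroup,
                            FUN = seq_along)

# Hypothetical helper: turn "YYYY-MM" into a running month count
month_index <- function(period) {
  as.integer(substr(period, 1, 4)) * 12L + as.integer(substr(period, 6, 7))
}

# Calendar label: months elapsed since the cohort month, starting at 1
cohorts$CalendarPeriod <- month_index(cohorts$OrderPeriod) -
  month_index(cohorts$CohortGroup) + 1L
print(cohorts)
```

In this toy table the order placed in 2009-04 gets sequential label 3 but calendar label 4. If every cohort has orders in every month, the two labels agree.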

3. Calculate the number of new users in each cohort's first month and the retention rates

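# Size of each cohort: the number of users in its first period
cohort_group_size <- chorts %>%
  filter(CohortPeriod == 1) %>%
  select(CohortGroup, OrderPeriod, TotalUsers)

# Long table to wide table: one row per cohort period, one column per cohort
user_retention <- chorts %>%
  ungroup() %>%   # drop the grouping left over from the previous step
  select(CohortGroup, CohortPeriod, TotalUsers) %>%
  spread(CohortGroup, TotalUsers)

# Convert user counts into a share of each cohort's first-month size
user_retention[,-1] <- user_retention[,-1] %>% t() %>% `/`(cohort_group_size$TotalUsers) %>% t() %>% as.data.frame()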
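The double transpose in the last line works because R recycles a vector down the rows of a matrix: after `t()`, each row is one cohort, so dividing by the vector of cohort sizes divides every cohort by its own size. The sketch below, with made-up counts, shows the same calculation and compares it with base R's `sweep()`, offered here only as an alternative and not as the author's method.

```r
# Hypothetical counts: rows are cohort periods, columns are cohorts
counts <- matrix(
  c(10, 5, 2,
    20, 8, NA),
  nrow = 3,
  dimnames = list(NULL, c("2009-01", "2009-02"))
)
sizes <- c(10, 20)   # first-period size of each cohort, in column order

# Approach used in the article: transpose, divide, transpose back
via_transpose <- t(t(counts) / sizes)

# Alternative: sweep() divides each column by the matching element of sizes
via_sweep <- sweep(counts, 2, sizes, "/")

print(via_transpose)
```

Both approaches give 1.0 in the first row of every column, since each cohort is divided by its own first-period size.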

Convert wide table to long table

user_retention1 <- user_retention %>%
  select(1:5) %>%   # CohortPeriod plus the first four cohorts
  melt(
    id.vars = 'CohortPeriod',
    variable.name = 'CohortGroup',
    value.name = 'TotalUsers'
  )

4. Retention curve

ggplot(user_retention1, aes(CohortPeriod, TotalUsers)) +
  geom_line(aes(group = CohortGroup, colour = CohortGroup)) +
  scale_x_continuous(breaks = 1:15) +
  scale_colour_brewer(type = 'div')


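Prepare the data source for the retention heatmap:

# Transpose so that each row is a cohort, dropping the CohortPeriod row
user_retentionT <- t(user_retention) %>% .[2:nrow(.),] %>% as.data.frame
# After transposing, the row names are the cohort months; keep them in a column
# (named CohortPeriod here, although it now holds the cohort labels)
user_retentionT$CohortPeriod <- row.names(user_retentionT)
row.names(user_retentionT) <- NULL
user_retentionT <- user_retentionT[,c(16,1:15)]   # move the label column to the front
user_retentionT1 <- user_retentionT %>%
  melt(
    id.vars = 'CohortPeriod',
    variable.name = 'CohortGroup',
    value.name = 'TotalUsers'
  )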

5. Retention analysis heat map:

library("Cairo")
library("showtext")

font_add("myfont", "msyh.ttc")   # Microsoft YaHei; the font file must be available on the system
CairoPNG("C:/Users/RAINDU/Desktop/emoji1.png", 1000, 750)
showtext_begin()
ggplot(user_retentionT1, aes(CohortGroup, CohortPeriod, fill = TotalUsers)) +
  geom_tile(colour = 'white') +
  geom_text(aes(label = ifelse(TotalUsers != 0, paste0(round(100 * TotalUsers, 2), '%'), '')), colour = 'blue') +
  scale_fill_gradient2(limits = c(0, .55), low = "#00887D", mid = 'yellow', high = "orange",
                       midpoint = median(user_retentionT1$TotalUsers, na.rm = TRUE), na.value = "grey90") +
  scale_y_discrete(limits = rev(unique(user_retentionT1$CohortPeriod))) +
  scale_x_discrete(position = "top") +
  labs(title = "Cohort retention analysis of product XXX",
       subtitle = "Retention rate trend of product XXX from January 2009 to March 2010") +
  theme(
    text = element_text(family = 'myfont', size = 15),
    rect = element_blank()
  )
showtext_end()
dev.off()


Retention analysis is a tool used frequently in Internet data analysis. The R code in this article follows the approach of the Python code linked at the beginning. You can compare the strengths and weaknesses of the two versions as a reference for future analysis.

