Python Data Analysis in Practice: the Classic Cohort Analysis

Author | Zhou Zhipeng

Editor | Guo Rui

This article first gives a brief introduction to the concept of cohort analysis, then walks through the data overview, data cleaning, the analysis approach, and finally the implementation for a single month, trying to keep every step clear and reproducible. Follow along in practice and, whatever your starting level, both your understanding of the cohort model and your command of Pandas will level up. (Note: the complete real data set and code can be obtained via the link at the end of the article.)

What is cohort analysis?

Cohort analysis is a classic way of thinking in data analysis. The core idea is to divide users into different groups according to when their initial behavior took place, and then analyze how each group's behavior changes over time.

It is generally presented through a retention table like this:

Each row represents the customers newly acquired in a given month, and the following columns show how many of them are retained in the subsequent months.

By comparing horizontally, we can get a preliminary sense of customer retention and life cycle. By observing different cohorts vertically, we can find differences in retention between them, and in turn infer whether the customers acquired in a given month were precisely targeted.

The table looks simple and clear, and some sophisticated tools can produce it out of the box, but implementing it from raw order data in Python still takes some thought.

Data Overview

First, import the order data and take a look at what the source data looks like:
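The import step might look something like this (a minimal sketch; the file name 订单数据.xlsx is a placeholder for wherever the data actually lives):

import pandas as pd

# read the raw order data; the file name/format is an assumption
order = pd.read_excel('订单数据.xlsx')

# peek at the first few rows of the source data
order.head()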

This is the order data of a small grocery store, about as down-to-earth as data gets. The subsequent analysis will use four key fields: customer nickname (客户昵称), payment time, order status, and payment amount (支付金额).

Check the data volume and the missing values:
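Both can be checked with a single call:

# row count, column dtypes and non-null counts at a glance
order.info()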

The orders total 42,713 rows; apart from payment time, every field is complete (no missing values).

The overall format is well structured: payment time is in datetime format, and the payment amount and quantity fields are numeric.

Data cleaning

The focus of cleaning is to figure out why so many payment times are missing. We first filter the rows where payment time is null and take a look:
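A minimal sketch of the check, assuming the payment-time column is named 支付时间:

# rows whose payment time is missing
null_pay = order.loc[order['支付时间'].isnull(), :]
null_pay.head()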

At first glance, the rows with missing payment time mostly have an order status of "transaction failed." A preliminary inference is that the payment time is missing because no actual transaction took place.
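One way to verify the inference is to compare the order-status distributions of the incomplete and complete rows (the column name 订单状态 is an assumption):

# status distribution of rows missing a payment time
null_pay['订单状态'].value_counts()

# status distribution of rows with a payment time
order.loc[order['支付时间'].notnull(), '订单状态'].value_counts()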

Sure enough, the orders lacking a payment time are all in the "transaction failed" state, while the complete rows are all "transaction successful."

Next, we simply filter out the successfully traded orders:
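A minimal filter, assuming the success status is stored as the string '交易成功':

# keep only successfully traded orders
order = order.loc[order['订单状态'] == '交易成功', :]
len(order)  # 40339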

These 40,339 rows of data are the main battlefield of the cohort analysis.

The analysis approach

Let the retention table from the beginning make another appearance:

Thinking about how to generate this table directly in one go is really hair-pulling. A more sensible approach is to disassemble the table into building blocks.

Each row of the table represents one cohort, but every row follows the same underlying logic:

  • First, count the number of new customers in the month and record their nicknames;

  • Then take these customers, match them against the customers who purchased in each subsequent month, and count how many of them made a repeat purchase (i.e., were retained).

As long as we compute each month's new customers and their retention, then splice the results together, we get the coveted cohort retention table.

Implementing a single month

Following this line of thinking, the problem becomes much easier: implement the computational logic for a single month, then apply it to the other months.

The time dimension of the grocery store data differs from the table above. Since the analysis involves no time-series arithmetic, a "year-month" string label is easier to work with:
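One way to build the label, assuming the payment-time column 支付时间 is already datetime, so labels come out like '2019-10':

# derive a 'year-month' string label from the payment time
order['时间标签'] = order['支付时间'].dt.strftime('%Y-%m')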

The source order data runs from September 2019 to February 2020. We use the October 2019 data as the template for implementing a single row of the cohort analysis.
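Selecting the template month and counting its customers and orders might look like this (the label format follows the sketch above):

# orders placed in October 2019
target_month = order.loc[order['时间标签'] == '2019-10', :]

# distinct customers and total orders in the month
target_month['客户昵称'].nunique(), len(target_month)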

Evidently, in October 2019 a total of 7,336 customers placed 8,096 orders.

Next, we count the number of new customers in the month. "New" must be verified by traversing and matching against all earlier months; for October 2019, the only earlier data is September 2019:

Match against the historical data to verify and filter out October 2019's new customers:
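A sketch of the matching, reusing the variable names of the final code further below:

# all orders before October 2019 (here, only September 2019)
history = order.loc[order['时间标签'] == '2019-09', :]

# October's customers, one row per nickname
target_users = target_month.groupby('客户昵称')['支付金额'].sum().reset_index()

# keep nicknames never seen in the history: the genuinely new customers
new_target_users = target_users.loc[~target_users['客户昵称'].isin(history['客户昵称']), :]
len(new_target_users)  # 7083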

Then match each subsequent month's customer nicknames against October's to compute the monthly retention:
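The month-by-month check could be sketched like this:

# for each later month, count how many of October's new customers bought again
retention = []
for m in ['2019-11', '2019-12', '2020-01', '2020-02']:
    next_users = order.loc[order['时间标签'] == m, '客户昵称']
    retention.append(new_target_users['客户昵称'].isin(next_users).sum())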

Finally, prepend the month's new-customer count to the head of the list:
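Prepending the new-customer count yields one finished row of the retention table:

# new-customer count first, then the monthly retention figures
count = [len(new_target_users)] + retention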

October 2019 added 7,083 new customers; 539 of them were retained the next month (November). Retention then declined month by month, though in February 2020 the number of retained repurchasing customers rose slightly from the previous month.

The computation logic for the other months' new customers and retention is exactly the same.

Looping and merging

Back to our October 2019 template: first match against the order history to isolate the genuinely new customers of the month, then traverse the following months one by one, matching customer nicknames to count how many are retained.

To make the loop convenient, we introduce a list of month labels:

The complete code, with key comments, is as follows:

# list of month labels, in chronological order
month_lst = order['时间标签'].unique()
final = pd.DataFrame()

for i in range(len(month_lst) - 1):

    # a list as long as the month list, so every row has the same shape
    count = [0] * len(month_lst)

    # filter the current month's orders and group by customer nickname
    target_month = order.loc[order['时间标签'] == month_lst[i], :]
    target_users = target_month.groupby('客户昵称')['支付金额'].sum().reset_index()

    # the first month has no history to check, so every customer counts as new
    if i == 0:
        new_target_users = target_users
    else:
        # otherwise, collect all earlier orders
        history = order.loc[order['时间标签'].isin(month_lst[:i]), :]
        # keep only the customers who never appeared in the history
        new_target_users = target_users.loc[~target_users['客户昵称'].isin(history['客户昵称']), :]

    # the month's new-customer count goes in the first slot
    count[0] = len(new_target_users)

    # traverse the following months and compute retention
    for j, ct in zip(range(i + 1, len(month_lst)), range(1, len(month_lst))):
        # the later month's orders
        next_month = order.loc[order['时间标签'] == month_lst[j], :]
        next_users = next_month.groupby('客户昵称')['支付金额'].sum().reset_index()
        # number of this cohort's customers still purchasing in that month
        isin = new_target_users['客户昵称'].isin(next_users['客户昵称']).sum()
        count[ct] = isin

    # transpose the list into a one-row DataFrame labeled with the month
    result = pd.DataFrame({month_lst[i]: count}).T

    # append the row to the final table
    final = pd.concat([final, result])

final.columns = ['当月新增', '+1月', '+2月', '+3月', '+4月', '+5月']

Ta-da! We successfully got the expected data.

However, real retention tables are usually expressed as rates rather than counts, so we process the result slightly:
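A possible post-processing step: divide each retention column by the month's new-customer count to turn absolute counts into rates (the percentage formatting is a matter of taste):

# retention counts -> retention rates
rate = final[['+1月', '+2月', '+3月', '+4月', '+5月']].div(final['当月新增'], axis=0)

# optional: format as percentage strings for display
rate = rate.applymap(lambda x: format(x, '.2%'))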

And with that, we have the cohort analysis table we wanted. A quick scan already yields a couple of findings:

  • Viewed horizontally, monthly churn is severe: even in the best-performing month, only 12% of new customers are retained the following month; retention then declines steadily and stabilizes at around 6%.

  • Compared vertically, the 2019 month with the fewest new customers (only 2,042) attracted a relatively well-targeted crowd: its retention outperformed the other months.

  • ...

Due to limited space, the visualization part is left for interested readers to practice on their own ~

Full data and source code: https://pan.baidu.com/s/1x_f1a5-zqJdRAxMKsL70Ew (extraction code: aiqf)

Author: Zhou Zhipeng, three years in data analysis. Having deeply felt both the fun of data analysis and the frustration of learning it without real cases, he recently started the public account "Data Is Not Bragging," which regularly publishes data analysis techniques and interesting cases (with real data sets). Follows and exchanges are welcome.

Disclaimer: this article is a submission; the copyright belongs to the author.
