Python Quantitative Trading

Data analysis project case

As a first step, we import tushare:

# Import tushare
import tushare as ts
pro = ts.pro_api()
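Note that pro_api() requires a personal token from the tushare website; on a first run, set it once before creating the client (the full script later in this post does the same). A minimal sketch, with 'your token' as a placeholder:

import tushare as ts

ts.set_token('your token')  # placeholder: register at tushare.pro to get a real token
pro = ts.pro_api()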
Shanghai-Shenzhen-Hong Kong Stock Connect Standards
Many investors prefer Shanghai-Shenzhen-Hong Kong Stock Connect stocks because these stocks are favored by international investors. In tushare, we obtain the stock list through the stock_basic interface, where the is_hs parameter filters for Stock Connect eligibility. The specific rules are as follows:

Shanghai Stock Connect standard: is_hs="H";
Shenzhen Stock Connect standard: is_hs="S";
Others: is_hs="N".
# Shanghai Stock Connect stock list
pro.stock_basic(is_hs='H').head()

# Shenzhen Stock Connect stock list
pro.stock_basic(is_hs='S').head()

# Non-Stock-Connect stock list
pro.stock_basic(is_hs='N').head()

Listing status
In the stock_basic interface, we can use the list_status parameter to filter stocks with different listing statuses.

Listed: list_status="L";
Delisted: list_status="D";
Listing suspended: list_status="P".
# Normally listed stocks
pro.stock_basic(list_status='L').head()

# Delisted stocks
pro.stock_basic(list_status='D').head()

Exchange
We can also view the list of stocks listed on the Shanghai Stock Exchange and the Shenzhen Stock Exchange, just pass in the exchange parameter in stock_basic.

Shanghai Stock Exchange: exchange="SSE";
Shenzhen Stock Exchange: exchange="SZSE".
# Shanghai Stock Exchange stocks
pro.stock_basic(exchange='SSE').head()

Industry
In addition to the filter dimensions provided in the parameters, we can also use some fields to filter after obtaining all the stock list data. For example industry:

df = pro.stock_basic()

df.query('industry == "bank"').head()
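One caveat: stock_basic actually returns Chinese industry labels (for example 银行 for banks), so it is worth listing the distinct values before writing the query. A quick check, assuming df is the full list from above:

# Inspect the distinct industry labels so the query string matches exactly
print(df['industry'].unique())
# then filter with the actual label, e.g.
df.query('industry == "银行"').head()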

Listed sector
We can also select stocks by listing board, which is very useful: some investors focus on a particular board, such as the SME board or the ChiNext (GEM) board.

df = pro.stock_basic()

df.query('market == "GEM"').head()

Region
We can also select stocks belonging to a particular region. This screening method is very helpful when we want to look for regional linkage effects, or when positive news breaks for a particular region.

df = pro.stock_basic()
df.query('area == "Shenzhen"').head()

Basic information of listed companies
After obtaining the list of stocks, we may also want to know their basic information, such as registered capital, city, business scope, etc. At this time, we can use the stock_company interface.

The interface only provides some fields by default. If you need more fields, you can use the fields parameter to specify the required columns.

pro.stock_company().head(1).T

Get Shenzhen local stocks

Let's take a look at a case. For example, when the state helped small and medium-sized enterprises resolve their share-pledge risks, the Shenzhen State-owned Assets Supervision and Administration Commission was the first to respond. We want to see which stocks might benefit, so we first find the list of Shenzhen local stocks and check their main business and number of employees.

# Get the stock list of local listed companies in Shenzhen
import pandas as pd

# Get listed company information
fields = ('ts_code,exchange,chairman,province,'
          'city,office,employees,main_business')
df = pro.stock_company(fields=fields)
df.query('city == "Shenzhen"').head()

Get listed companies with more than 100,000 employees.
We may also want to filter for large companies by headcount. This can be done through the employees field.

df = pro.stock_company()
df.query('employees >= 100000')
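Since both interfaces return a ts_code column, they can also be combined with a merge, letting us filter on listing information and company information at the same time. A sketch under that assumption:

basic = pro.stock_basic(list_status='L')
company = pro.stock_company(fields='ts_code,city,employees')
merged = pd.merge(basic, company, on='ts_code')
# e.g. normally listed Shenzhen companies with a very large workforce
merged.query('city == "Shenzhen" and employees >= 100000').head()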

Isn't that simple? But the power of tushare goes far beyond this.
———————————————

This part shares how to obtain basic stock information. It covers three main points, which I hope you will find helpful:

(1) Obtain the data from Tushare
(2) Write it to MySQL
(3) Export it to Excel

Later I will add more complex features, such as scheduled updates and logging.

"""
接口:stock_basic
描述:获取基础信息数据,包括股票代码、名称、上市日期、退市日期等
更新频率:one week or long
"""
import tushare as ts
import pymysql
import pandas as pd

# Connect to the database, execute the given SQL, and return a DataFrame
def get_mysql_data(sql):
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         db="STOCK",
                         charset='utf8')
    cur = db.cursor()
    cur.execute(sql)
    data = cur.fetchall()
    columnDes = cur.description  # metadata describing the result columns
    columnNames = [columnDes[i][0] for i in range(len(columnDes))]
    df = pd.DataFrame([list(i) for i in data], columns=columnNames)
    cur.close()
    db.commit()
    db.close()
    return df
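
# Usage sketch (hypothetical query; assumes the stock_basic table created below):
# df = get_mysql_data("select ts_code, name, industry from stock_basic limit 5")
# print(df)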

# Export the finished table to Excel
def to_excel():
    sql = "select * from stock_basic"
    df = get_mysql_data(sql)
    df.to_excel(r'D:\0.STOCK\Excel_File\stock_basic.xlsx',index=False)
    print('<=== Export to Excel succeeded ===>')

# Return the data from the corresponding interface
def get_data():
    # Only needed on the first run or after the token expires
    ts.set_token('your token')
    pro = ts.pro_api()
    # Query the list of all normally listed, currently trading stocks
    data = pro.stock_basic(exchange='', list_status='L',
                           fields='ts_code,symbol,name,area,industry,market,exchange,list_status,list_date,is_hs')
    return data

# Connect to the database, clear the old table, and rewrite the data
def write_data():
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="123",
                         db="STOCK",
                         charset='utf8')
    cur = db.cursor()
    cur.execute("truncate TABLE stock_basic")
    para = []
    data = get_data()  # 获取接口的数据
    print('<=== Fetch succeeded ===>', data.shape[0], 'stocks in total')
    print(data.head())
    for i in range(0, len(data)):
        para.append((i+1, data.loc[i, 'ts_code'], data.loc[i, 'symbol'], data.loc[i, 'name'], data.loc[i, 'area'],
                     data.loc[i, 'industry'], data.loc[i, 'market'], data.loc[i, 'exchange'],
                     data.loc[i, 'list_status'],
                     data.loc[i, 'list_date'], data.loc[i, 'is_hs']))

    sql = """
    insert into stock_basic(id,ts_code,symbol,name,area,industry,market,exchange,list_status,list_date,is_hs) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
    """
    cur.executemany(sql, para)
    # Check how many rows were written
    sql_query = "SELECT ts_code from stock_basic"
    cur.execute(sql_query)
    sql_data = cur.fetchall()
    sql_num = len(sql_data)
    print('MySQL write succeeded:', sql_num, 'stocks written')
    cur.close()
    db.commit()
    db.close()

if __name__ == '__main__':
    write_data()  # clear the old table and rewrite the data
    to_excel()    # export to Excel
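The script assumes that a stock_basic table already exists in the STOCK database. A possible DDL matching the eleven columns inserted above (a sketch only; adjust types and lengths to your needs):

CREATE TABLE stock_basic (
    id          INT PRIMARY KEY,
    ts_code     VARCHAR(16),
    symbol      VARCHAR(16),
    name        VARCHAR(32),
    area        VARCHAR(32),
    industry    VARCHAR(32),
    market      VARCHAR(16),
    exchange    VARCHAR(16),
    list_status VARCHAR(4),
    list_date   VARCHAR(16),
    is_hs       VARCHAR(4)
);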

1. Stock Analysis

Requirements:

Use the tushare package to get the historical market data of a certain stock.

Output all dates on which the stock closed more than 3% higher than it opened.

Output all dates on which the stock opened more than 2% lower than the previous day's close.

If I buy 1 lot of shares on the first trading day of each month starting from January 1, 2010, and sell all my shares on the last trading day of each year, what is my total return up to today?

Requirement 1: Use the tushare package to obtain the historical market data of a certain stock.

# Get quotes
df = ts.get_k_data(code="600519", start='2000-01-01')

# Save to a local file
df.to_csv('./maotai.csv')

# Read the local csv file back
df = pd.read_csv('./maotai.csv')

# Delete the 'Unnamed: 0' column, convert the date column to datetime, and set it as the index
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

print(df.info())  # view the data type of each column
print(df)
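ts.get_k_data is the older free interface; the same daily bars are also available through the pro API's daily interface. A hedged equivalent sketch (pro.daily takes the suffixed code and compact dates, returns the newest rows first, and names the date column trade_date):

df = pro.daily(ts_code='600519.SH', start_date='20000101')
df['trade_date'] = pd.to_datetime(df['trade_date'])
df = df.set_index('trade_date').sort_index()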

Requirement 2: Output all the dates when the closing price of the stock is more than 3% higher than the opening price.

# (close - open) / open > 0.03 returns a boolean Series, which we use as a row index.
# Whenever analysis produces a boolean Series, the next step is usually to index the
# source data with it: rows where it is True are kept, rows where it is False are dropped.
# print((df['close'] - df['open']) / df['open'] > 0.03)

print(df.loc[(df['close'] - df['open']) / df['open'] > 0.03].index)

Requirement 3: Output all the dates when the opening price of the stock fell more than 2% from the closing price of the previous day.

# (today's open - yesterday's close) / yesterday's close < -0.02
# df['close'].shift(1) shifts the close column down one row, giving yesterday's close
# print(df['close'].shift(1))

print(df.loc[(df['open'] - df['close'].shift(1)) / df['close'].shift(1) < -0.02].index)

Requirement 4: If I buy 1 lot of shares on the first trading day of each month starting from January 1, 2010, and sell all my shares on the last trading day of each year, what is my total return up to today?

- Analysis:

  - Time range: 2010-2020

  - One lot = 100 shares

  - Buying: a full year requires buying 12 × 100 = 1,200 shares

  - Selling: a full year requires selling 1,200 shares

  - Unit price for buying and selling: the opening price

# Buy: the row for the first trading day of each month (take the opening price),
# i.e. the first row of data in each month, extracted by resampling
new_df = df['2010-01-01':]

# Resample by month and take the first opening price of each month: total spent buying
mairu = new_df.resample('M').first()['open'].sum() * 100

# Resample by year and take the last opening price of each year: total received selling
# ([:-1] drops the current, unfinished year)
maichu = new_df.resample('A').last()['open'][:-1].sum() * 1200

# Shares bought in the unfinished year are still held; value them at the latest close
# (600 = 6 months bought so far × 100 shares, at the time this was written)
yu = new_df['close'][-1] * 600

print(maichu - mairu + yu)
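One caveat: recent pandas versions deprecate the 'M' and 'A' resample aliases in favor of 'ME' (month end) and 'YE' (year end), so on a new environment the same resampling would read:

mairu = new_df.resample('ME').first()['open'].sum() * 100
maichu = new_df.resample('YE').last()['open'][:-1].sum() * 1200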

Double moving average strategy

Requirement 1: Calculate the 5-day and 30-day moving averages of the stock's historical data

- What is a moving average?

- For each trading day, the moving average of the previous N days can be calculated, and then these moving averages can be connected to form a line, which is called the N-day moving average. Commonly used moving average lines have indicators of 5 days, 10 days, 30 days, 60 days, 120 days and 240 days.

- 5 days and 10 days are reference indicators for short-term operations, called daily moving average indicators;

- 30 days and 60 days are medium-term moving average indicators, called quarterly moving average indicators;

- 120 days and 240 days are long-term moving average indicators, called annual average indicators.

- Calculation of the moving average: MA = (C1 + C2 + C3 + ... + Cn) / N, where C is the closing price of a given day and N is the moving-average period in days.
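The rolling-mean call used in the code below implements exactly this formula; a tiny self-contained sketch with made-up prices shows the equivalence:

import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])     # five made-up closing prices
ma5_manual = (10.0 + 11.0 + 12.0 + 13.0 + 14.0) / 5   # MA = (C1+...+C5)/5 = 12.0
ma5_rolling = close.rolling(5).mean().iloc[-1]        # same value via rolling()
assert ma5_manual == ma5_rolling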

df = ts.get_k_data(code="600519", start='2000-01-01')
df.to_csv('./maotai.csv')

df = pd.read_csv('./maotai.csv')
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

import matplotlib.pyplot as plt

ma5 = df['close'].rolling(5).mean()    # 5-day moving average
ma30 = df['close'].rolling(30).mean()  # 30-day moving average

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly

plt.plot(ma5[20:80], 'h-r', label='ma5')
plt.plot(ma30[20:80], 'h-b', label='ma30')
plt.legend()
plt.show()

Requirement 2:

- Analyze and output all golden cross dates and dead cross dates

- The golden cross and dead cross in stock analysis technology can be simply explained as:

- Analyze two lines in the indicator, one for a short period of time and the other for a longer period of time.

- If the short-term line turns upward and crosses above the longer-term line, this state is called a "golden cross".

- If the short-term line turns downward and crosses below the longer-term line, this state is called a "dead cross".

- Under normal circumstances, a golden cross tends to be followed by buying, and a dead cross by selling. Of course, golden and dead crosses are only one kind of indicator, and they must be combined with many others to improve the accuracy of the operation.
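In code, a golden cross is simply "ma5 is above ma30 today but was not above it yesterday" (a dead cross is the mirror image). The snippet below detects this with a compact boolean trick; a more literal sketch of the same idea, assuming df is loaded as before:

ma5 = df['close'].rolling(5).mean()
ma30 = df['close'].rolling(30).mean()
golden = (ma5 > ma30) & (ma5.shift(1) <= ma30.shift(1))  # crossed upward today
dead = (ma5 < ma30) & (ma5.shift(1) >= ma30.shift(1))    # crossed downward today
print(df.loc[golden].index)  # golden cross dates
print(df.loc[dead].index)    # dead cross dates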

# Analyze and output all golden cross dates and dead cross dates
df = pd.read_csv('./maotai.csv')
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

ma5 = df['close'].rolling(5).mean()
ma30 = df['close'].rolling(30).mean()

s5 = ma5[30:] < ma30[30:]
s30 = ma5[30:] > ma30[30:]
df = df[30:]

down = s5 & s30.shift(1)   # ma5 below ma30 today, above it yesterday
print(df.loc[down].index)  # dead cross dates

up = ~(s5 | s30.shift(1))  # ma5 above ma30 today, below it yesterday
print(df.loc[up].index)    # golden cross dates

Requirement 3: Starting from January 1, 2010, with initial capital of 100,000 yuan, I buy as much as possible at every golden cross and sell everything at every dead cross. What is my return up to today?

from pandas import Series

df = pd.read_csv('./maotai.csv')
df.drop(labels='Unnamed: 0', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

ma5 = df['close'].rolling(5).mean()
ma30 = df['close'].rolling(30).mean()

s5 = ma5[30:] < ma30[30:]
s30 = ma5[30:] > ma30[30:]
df = df[30:]

up = ~(s5 | s30.shift(1))  # golden cross
down = s5 & s30.shift(1)   # dead cross

up_code = Series(data=1, index=df.loc[up].index)
down_code = Series(data=0, index=df.loc[down].index)

# Combine and keep only the signals from 2010-01-01 onwards
# (Series.append was removed in pandas 2.0; pd.concat is the replacement)
s = pd.concat([up_code, down_code])
s = s.sort_index()['2010-01-01':]

first_money = 100000  # principal, never changed
money = first_money   # working capital: buys subtract from it, sells add to it
hold = 0              # number of shares held (100 shares = 1 lot)

for i in range(0, len(s)):
    # s.iloc[i] == 1: golden cross (buy); == 0: dead cross (sell)
    if s.iloc[i] == 1:
        # Buy as many full lots as possible with the available money,
        # at the opening price on the golden-cross day
        time = s.index[i]
        p = df.loc[time]['open']
        hand_count = money // (p * 100)  # how many lots the money can buy at most
        hold = hand_count * 100
        money -= hold * p
    else:
        # Sell all held shares at the opening price on the dead-cross day
        death_time = s.index[i]
        p_death = df.loc[death_time]['open']
        money += p_death * hold
        hold = 0

# If the last signal was a golden cross we may still be holding shares;
# value them at the latest closing price
last_money = hold * df['close'].iloc[-1]

# Total profit
print(money + last_money - first_money)

Demographic Analysis Project

- Requirements:

  - Import the files and view the raw data
  - Merge the population data with the state-abbreviation data
  - Delete the duplicate abbreviation column in the merged data
  - View the columns with missing data
  - Find which states/regions have NaN for state, deduplicated
  - Fill in the correct value for the state entries of those states/regions, removing all NaN from the state column
  - Merge in each state's area data (areas)
  - We will find that the area (sq. mi) column has missing data; find out which rows
  - Remove the rows with missing data
  - Find the 2010 nationwide population data
  - Compute the population density of each state
  - Sort, and find the state with the highest population density

# Import the files and view the raw data
abb = pd.read_csv(r'H:\py\courseware\2-courseware\2_data analysis\courseware\data\state-abbrevs.csv')    # state (full name), abbreviation
area = pd.read_csv(r'H:\py\courseware\2-courseware\2_data analysis\courseware\data\state-areas.csv')     # state (full name), area (sq. mi)
pop = pd.read_csv(r'H:\py\courseware\2-courseware\2_data analysis\courseware\data\state-population.csv') # state/region (abbreviation), ages, year, population

# Merge the population data with the state-abbreviation data
# (an outer join keeps the regions that have no abbreviation entry)
abb_pop = pd.merge(abb, pop, how='outer', left_on='abbreviation', right_on='state/region')
abb_pop

# Delete the duplicate abbreviation column in the merged data
abb_pop.drop(labels='abbreviation', axis=1, inplace=True)
abb_pop

# View columns with missing data
abb_pop.isnull().any(axis=0)

# Find which states/regions have NaN for state, deduplicated
abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()

# Remaining steps: fill in the correct state value for those regions to remove all
# NaN from the state column; merge in the state area data; find and drop the rows
# where area (sq. mi) is missing; extract the 2010 nationwide population data;
# compute each state's population density; sort and find the densest state.
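A hedged sketch of those remaining steps (this is the classic US-states exercise, where PR and USA are the two regions missing a state name):

# Fill in the state names for the two regions that have NaN
abb_pop.loc[abb_pop['state/region'] == 'PR', 'state'] = 'Puerto Rico'
abb_pop.loc[abb_pop['state/region'] == 'USA', 'state'] = 'United States'

# Merge in the state area data
abb_pop_area = pd.merge(abb_pop, area, on='state', how='left')

# Find and drop the rows where area (sq. mi) is missing
abb_pop_area['area (sq. mi)'].isnull().any()
abb_pop_area.dropna(subset=['area (sq. mi)'], inplace=True)

# 2010 total population per state
pop_2010 = abb_pop_area.query('year == 2010 and ages == "total"')

# Population density, sorted to find the densest state
density = pop_2010['population'] / pop_2010['area (sq. mi)']
print(pop_2010.assign(density=density).sort_values('density', ascending=False).head())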

Consumption record analysis

Part 1: data type processing

- Data loading
- Field meanings:
  - user_id: user ID
  - order_dt: purchase date
  - order_product: quantity of products purchased
  - order_amount: purchase amount
- Observe the data and check the column data types
- Check whether the data contains missing values
- Convert order_dt to a datetime type
- View the statistical description of the data: the average quantity of goods purchased per order and the average amount spent per order
- Add a month column to the source data: astype('datetime64[M]')

Code

import pandas as pd
from pandas import DataFrame

# Data loading
df = pd.read_csv(r'H:\py\Advanced\Data Analysis\Scientific Computing Basic Package-numpy\CDNOW_master.txt',
                 header=None, sep=r'\s+',
                 names=['user_id', 'order_dt', 'order_product', 'order_amount'])
df

# View the data types
df.info()

# Check whether the data contains missing values
# df.isnull().any()
df.notnull().all()

# Convert order_dt to a datetime type
df['order_dt'] = pd.to_datetime(df['order_dt'], format='%Y%m%d')
df

# View the statistical description of the data
df.describe()

# Add a month column to the source data: astype('datetime64[M]')
df['month'] = df['order_dt'].astype('datetime64[M]')
df


Part 2: monthly data analysis

- Total amount spent by users each month, plotted as a curve
- Total products purchased by all users each month
- Total number of orders each month
- Number of distinct consumers each month

Code

# Total amount spent by users each month
df.groupby(by='month')['order_amount'].sum()

# Plot it as a curve
import matplotlib.pyplot as plt
plt.plot(df.groupby(by='month')['order_amount'].sum())
df.groupby(by='month')['order_amount'].sum().plot()

# Total products purchased by all users each month
df.groupby(by='month')['order_product'].sum()

# Total number of orders each month
df.groupby(by='month')['user_id'].count()

# Number of distinct consumers each month
df.groupby(by='month')['user_id'].nunique()


Part 3: individual user consumption analysis

- Statistical description of each user's total spend and total number of orders
- Scatter plot of user spend vs. quantity of products purchased
- Histogram of each user's total spend (distribution of spend within 1,000)
- Histogram of each user's total purchase quantity (distribution of quantities within 100)

Code

# Statistical description of each user's total spend and total number of orders
df.groupby('user_id')['order_amount'].sum()
df.groupby('user_id')['order_product'].count()

# Scatter plot of user spend vs. quantity of products purchased
order_amount = df.groupby('user_id')['order_amount'].sum()
order_product = df.groupby('user_id')['order_product'].sum()
plt.scatter(order_amount, order_product)

# Histogram of each user's total spend (spend within 1,000)
df.groupby(by='user_id').sum().query('order_amount <= 1000')['order_amount'].hist()

# Histogram of each user's total purchase quantity (quantities within 100)
df.groupby(by='user_id').sum().query('order_product <= 100')['order_product'].hist()


Part 4: user consumption behavior analysis

- Monthly distribution of each user's first purchase, with counts, plotted as a line graph
- Time distribution of each user's last purchase, with counts, plotted as a line graph
- Proportion of new vs. old customers
  - New users: purchased only once
  - Old users: purchased multiple times
  - Analyze each user's first and last purchase time; agg(['func1','func2']) applies several aggregations to the grouped result
  - Compute the proportion of new and old customers
- User stratification
  - Build the rfm table from each user's total purchase quantity, total spend, and most recent purchase time
  - RFM model design:
    - R: the interval since the customer's most recent transaction; /np.timedelta64(1,'D') converts the interval to days
    - F: the total quantity of goods the customer purchased; the larger F is, the more frequently the customer trades, and vice versa
    - M: the customer's transaction amount; the larger M is, the higher the customer's value, and vice versa
  - Apply R, F, M to the rfm table
  - Stratify users by value into: important value customers, important keep customers, important win-back customers, important development customers, general value customers, general keep customers, general win-back customers, general development customers
  - Use the existing stratification function rfm_func

Code

import pandas as pd
from pandas import DataFrame

# Read the data
df = pd.read_csv(r'H:\py\Advanced\Data Analysis\Scientific Computing Basic Package-numpy\CDNOW_master.txt',
                 sep=r'\s+', header=None,
                 names=['user_id', 'order_dt', 'order_product', 'order_amount'])

# Convert to datetime format
df['order_dt'] = pd.to_datetime(df['order_dt'], format='%Y%m%d')

# Add a month column
df['month'] = df['order_dt'].astype('datetime64[M]')

# Monthly distribution of users' first purchases, with counts, as a line graph
df.groupby(by='user_id')['month'].min().value_counts().plot()

# Time distribution of users' last purchases, with counts, as a line graph
df.groupby(by='user_id')['month'].max().value_counts().plot()

# Proportion of new vs. old customers: a user whose first and last purchase dates
# are equal bought only once (new); otherwise they bought repeatedly (old)
new_old_user = df.groupby(by='user_id')['order_dt'].agg(['min', 'max'])
val = (new_old_user['min'] == new_old_user['max']).value_counts()

# Proportion of new users
val[True] / (val[True] + val[False])

# Proportion of old users
val[False] / (val[True] + val[False])


Build the RFM data table

# Build the rfm table: each user's total purchase quantity, total spend,
# and most recent purchase time
rfm = df.pivot_table(index='user_id',
                     aggfunc={'order_product': 'sum', 'order_amount': 'sum', 'order_dt': 'max'})
rfm

# R: the interval since the customer's most recent transaction
import numpy as np
new_date = df['order_dt'].max()  # latest date in the data, treated as "today"
rfm['R'] = -(rfm.groupby(by='user_id')['order_dt'].max() - new_date) / np.timedelta64(1, 'D')
rfm

# F: total quantity purchased (the larger F, the more frequent the transactions)
# M: transaction amount (the larger M, the higher the customer value)
# R, F, M are now all present in the rfm table

# Drop the order_dt column and rename the columns
rfm.drop(labels='order_dt', axis=1, inplace=True)
rfm.columns = ['M', 'F', 'R']
rfm


User stratification

def rfm_func(x):
    # After mean-centering, mark each of R, F, M as '1' if at or above the mean, else '0'
    level = x.map(lambda v: '1' if v >= 0 else '0')
    val = level.R + level.F + level.M
    dit = {'111': 'important value customer',
           '011': 'important keep customer',
           '101': 'important win-back customer',
           '001': 'important development customer',
           '110': 'general value customer',
           '010': 'general keep customer',
           '100': 'general win-back customer',
           '000': 'general development customer'}
    return dit[val]

# Center each column on its mean, then classify each user row by row
rfm['level'] = rfm.apply(lambda x: x - x.mean()).apply(rfm_func, axis=1)
rfm


Part 5: user life cycle

- Divide users into active users and other categories
- Count each user's purchases per month
- Mark whether each user purchased in each month: 1 if they purchased, otherwise 0
- Knowledge point: the difference between DataFrame's apply and applymap
  - applymap: returns a DataFrame; applies a function to every element of the DataFrame
  - apply: returns a Series; applies a function to each row or column of the DataFrame
- Classify users month by month:
  - unreg: wait-and-see users (no purchase yet; before the first purchase a user counts as unreg in those months)
  - unactive: purchased in an earlier month but made no purchase in the current month
  - new: made their first purchase in the current month
  - active: purchased in consecutive months; active in each of those months
  - return: purchased again after a gap of n months; the first month of the repeat purchase is a return month

Code

import pandas as pd
from pandas import DataFrame

df = pd.read_csv(r'H:\py\Advanced\Data Analysis\Scientific Computing Basic Package-numpy\CDNOW_master.txt',
                 sep=r'\s+', header=None,
                 names=['user_id', 'order_dt', 'order_product', 'order_amount'])
df['order_dt'] = pd.to_datetime(df['order_dt'], format='%Y%m%d')
df['month'] = df['order_dt'].astype('datetime64[M]')
df

# Count each user's purchases per month
month_sum = df.pivot_table(index='user_id', values='order_dt',
                           aggfunc='count', columns='month').fillna(0)

# Mark whether each user purchased in each month: 1 if purchased, otherwise 0
df_purchase = month_sum.applymap(lambda x: 1 if x > 0 else 0)
df_purchase


Differentiate between user categories

# Replace the 0/1 values in df_purchase with new, unactive, ..., producing a new
# status table (fixed algorithm)
def active_status(data):
    status = []  # one user's status, month by month
    for i in range(18):  # the CDNOW data spans 18 months
        # No purchase this month
        if data.iloc[i] == 0:
            if len(status) > 0:
                if status[i-1] == 'unreg':
                    status.append('unreg')     # still never purchased
                else:
                    status.append('unactive')  # purchased before, not this month
            else:
                status.append('unreg')  # no history yet
        # Purchased this month
        else:
            if len(status) == 0:
                status.append('new')  # purchased in the very first month
            else:
                if status[i-1] == 'unactive':
                    status.append('return')  # came back after a gap
                elif status[i-1] == 'unreg':
                    status.append('new')     # first-ever purchase
                else:
                    status.append('active')  # purchased in consecutive months
    return status

pivoted_status = df_purchase.apply(active_status, axis=1)

# Convert to list format and build a new table with the original index and columns
pivoted_status_list = pivoted_status.values.tolist()
new_start_info = DataFrame(data=pivoted_status_list,
                           index=month_sum.index, columns=month_sum.columns)
new_start_info


- Count the number of users in each status per month:

new_start_info_ct = new_start_info.apply(lambda x: pd.value_counts(x)).fillna(0)
new_start_info_ct

- Transpose to view the final result:

new_start_info_ct.T


Origin blog.csdn.net/qq_22473611/article/details/126311555