How to assign unique column name if the same set pattern repeats again in data frame using Pandas?

aeapen :

I am trying to create a new column Group (Cluster ) using the below logic.

LOgic :the script shud check the vendor,text field and the days column if the vendor,text and days value <=2 then the records shud be grouped into 1 cluster

My code

data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True)
data['Days']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)
g = data.groupby(['Text', 'Vendor', t], sort=False).ngroup()
data=data.assign(Group=g.add(1).astype(str).radd('Cluster'))

My current Output

 Invoice    Date    Text    Vendor  Days    Group       
1234567 1/1/2012    Repairs A       0        Cluster1   
1234568 2/1/2012    Repairs A       1        Cluster1   
1234569 4/1/2012    Repairs A       2        Cluster1
1234570 6/1/2012    Water   A       2        Cluster2  
1234571 9/1/2012    Water   A       3        Cluster2
1234572 9/1/2012    Car     A       0        Cluster3
1234573 9/1/2012    Bus     A       0        Cluster4
1234574 9/1/2012    Bike    A       0        Cluster5
1234575 9/1/2012    Repairs A       0        Cluster6
1234576 10/1/2012   Repairs A       1        Cluster6
1234577 11/1/2012   Repairs A       1        Cluster6
1234578 12/1/2012   Water   A       2        Cluster6
1234579 13/1/2012   Water   A       1        Cluster2
1234580 14/1/2012   Water   A       1        Cluster2

Expected Output

 Invoice    Date        Text    Vendor  Days    Group
    1234567 1/1/2012    Repairs A       0        Cluster1
    1234568 2/1/2012    Repairs A       1        Cluster1
    1234569 4/1/2012    Repairs A       2        Cluster1
    1234570 6/1/2012    Water   A       2        Cluster2
    1234571 9/1/2012    Water   A       3        Cluster2
    1234572 9/1/2012    Car     A       0        No Cluster
    1234573 9/1/2012    Bus     A       0        No Cluster
    1234574 9/1/2012    Bike    A       0        No Cluster
    1234575 9/1/2012    Repairs A       0        Cluster3
    1234576 10/1/2012   Repairs A       1        Cluster3
    1234577 11/1/2012   Repairs A       1        Cluster3
    1234578 12/1/2012   Water   A       2        Cluster4
    1234579 13/1/2012   Water   A       1        Cluster4
    1234580 14/1/2012   Water   A       1        Cluster4

Testing Data

  Invoice     Date      Text   Vendor   Days    Group   Expected Group
    1000001 1/1/2012    Repair  A       0   Cluster1     Cluster1
    1000003 2/1/2012    Repair  A       1   Cluster1     Cluster1
    1000005 4/1/2012    Repair  A       2   Cluster1     Cluster1
    1000007 6/1/2012    Water   A       2   No Cluster   No Cluster
    1000008 9/2/2012    Repair  A       34  Cluster2     No Cluster
    1000010 9/2/2012    Garden  A       0   Cluster3     Cluster2
    1000011 10/2/2012   Garden  A       1   Cluster3     Cluster2
    1000012 15/2/2012   Car     A       5   Cluster4     Cluster3
    1000013 16/2/2012   Car     A       1   Cluster4     Cluster3
    1000015 17/2/2012   Car     A       1   Cluster4     Cluster3
    1234574 17/2/2012   Bike    A       0   No Cluster   No Cluster

How can this be done in python?

jezrael :

Idea is create new Series g1 for consecutive groups by Text and Vendor columns and fill only duplicated by helper Series g1, last add non matched value by Series.reindex:

data['Date']=pd.to_datetime(data['Date'],infer_datetime_format=True, dayfirst=True)
data.sort_values(['Vendor','Date'],inplace=True)
data['Date_Difference']=(data['Date'].diff(1).dt.days).fillna(0)
data['Text']=data['Text'].fillna('No Value')
data['Vendor']=data['Vendor'].fillna('No Value')
diff= lambda x: x.diff().fillna(0).gt(2).cumsum()
t = data.groupby(['Text', 'Vendor']).Date_Difference.transform(diff)

g1 = data[['Text', 'Vendor']].ne(data[['Text', 'Vendor']].shift()).any(axis=1).cumsum()
m = g1.duplicated(keep=False)

g = data[m].groupby([g1, t], sort=False).ngroup()
clust = g.add(1).astype(str).radd('Cluster').reindex(data.index, fill_value='No Cluster')

data=data.assign(Group=clust)
print (data)
    Invoice       Date    Text Vendor  Days       Group  Date_Difference
0   1000001 2012-01-01  Repair      A     0    Cluster1              0.0
1   1000003 2012-01-02  Repair      A     1    Cluster1              1.0
2   1000005 2012-01-04  Repair      A     2    Cluster1              2.0
3   1000007 2012-01-06   Water      A     2  No Cluster              2.0
4   1000008 2012-02-09  Repair      A    34  No Cluster             34.0
5   1000010 2012-02-09  Garden      A     0    Cluster2              0.0
6   1000011 2012-02-10  Garden      A     1    Cluster2              1.0
7   1000012 2012-02-15     Car      A     5    Cluster3              5.0
8   1000013 2012-02-16     Car      A     1    Cluster3              1.0
9   1000015 2012-02-17     Car      A     1    Cluster3              1.0
10  1234574 2012-02-17    Bike      A     0  No Cluster              0.0

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=350474&siteId=1