Comparing lists with every record in DataFrame

Rishi :

I have a use case where I compare the lists in one column against each other. Code below:

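for i in range(0, len(counts95)):
    for j in range(i + 1, len(counts95)):
        for x in counts95['links'][i]:
            for y in counts95['links'][j]:
                # only fill rows that have not been linked yet (no overwriting)
                if x == y and counts95['linksoflinks'][j] is None:
                    # .loc avoids chained assignment on the dataframe
                    counts95.loc[j, 'linksoflinks'] = counts95['index'][i]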

The code works, but it is not Pythonic (four nested for loops) and it takes a very long time to run. The idea is to link records: if any element of the list in counts95['links'] for one row also appears in the list of a later row, update that later row's linksoflinks column with the index value of the earlier row, but only if linksoflinks is still None (no overwriting).

find the reference table below:

counts95 = pd.DataFrame({'index': [616351, 616352, 616353,6457754], 
                   'level0': [25,30,35,100],
                   'links' : [[1,2,3,4,5],[23,45,2],[1,19,67],[14,15,16]],
                   'linksoflinks' : [None,None,None,None]})

EDIT: New Dataframe

counts95 = pd.DataFrame({'index': [616351, 616352, 616353,6457754,6566666,464664683], 
                   'level0': [25,30,35,100,200,556],
                   'links' : [[1,2,3,4,5],[23,45,2],[1,19,67],[14,15,16],[1,14],[14,1]],
                   'linksoflinks' : [None,None,None,None,None,None]})

Desired output:

     index  level0            links  linksoflinks
0   616351      25  [1, 2, 3, 4, 5]         NaN
1   616352      30      [23, 45, 2]    616351.0
2   616353      35      [1, 19, 67]    616351.0
3  6457754     100     [14, 15, 16]         NaN
4  6566666     200           [1,14]    616351.0
5  6457754     556           [14,1]    616351.0

Andy L. :

Your desired output uses different values and a different column name compared to your sample dataframe constructor. I use your desired output dataframe for testing.

Logic:
For each sublist of links, we need to find the row index (I mean the index of the dataframe, NOT the column named index) of the first overlapping sublist above it. We then use these row indices with .loc on counts95 to get the corresponding values of column index. This takes several steps:

  • Compare each sublist to all sublists in links. A list comprehension is fast and efficient for this task. We use one to build a boolean 2D mask in which each subarray holds True for overlapping rows and False for non-overlapping rows (compare the 2D mask in the step-by-step section against column links and it becomes clearer).
  • We only want to compare from the current sublist back up to the top, i.e. standing on the current row, we only look backward. Therefore every forward comparison must be set to False. This is what np.tril does.
  • Inside each subarray of the 2D mask, the position of a True is the row index of a row that the current sublist overlaps with. We need the position of the first True, which is what np.argmax gives: it returns the position of the first maximum element of the array, with True treated as 1 and False as 0. On any subarray containing a True it therefore returns the first overlapping row index. On an all-False subarray, however, it returns 0. We handle all-False subarrays later with where.
  • After np.argmax, the 2D mask is reduced to a 1D array. Each element is the row index of the first overlapping sublist. Passing it to .loc gives the corresponding values of column index. However, the result wrongly includes the rows whose subarray in the 2D mask is all False. We want these rows to become NaN, which is what .where does.
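
To make the np.tril and np.argmax steps concrete, here is a minimal sketch on a made-up 3x3 mask. The mask values and the variable names (full, backward, first_hit, has_hit) are only for illustration and are not taken from the question's data.

```python
import numpy as np

# Hypothetical overlap mask: full[i][j] is True when row i and row j share an element.
full = np.array([[True,  True,  False],
                 [True,  True,  True],
                 [False, True,  True]])

# k=-1 keeps only the entries strictly below the diagonal,
# i.e. comparisons of row i against earlier rows j < i.
backward = np.tril(full, -1)

# Position of the first True in each row; an all-False row also gives 0.
first_hit = np.argmax(backward, axis=1)

# Tells a genuine hit at position 0 apart from "no overlap at all".
has_hit = backward.any(axis=1)
```

Row 0 and row 1 both get position 0 from argmax, but only row 1 has a real overlap, which is why the any() check is needed before trusting the result.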

Method 1:
Use a list comprehension to construct the boolean 2D mask m between each list in links and all lists in links. We only need backward comparison, so use np.tril with k=-1 to set the upper-right triangle and the diagonal of the mask, which represent forward comparison and self-comparison, to False. Finally, call np.argmax to get the position of the first True in each row of m, and chain where to turn the all-False rows of m into NaN.

c95_list = counts95.links.tolist()
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)
counts95['linkoflist'] = (counts95.loc[np.argmax(m, axis=1), 'index']
                                  .where(m.any(1)).to_numpy())

 Out[351]:
     index  level0            links  linkoflist
0   616351      25  [1, 2, 3, 4, 5]         NaN
1   616352      30      [23, 45, 2]    616351.0
2   616353      35      [1, 19, 67]    616351.0
3  6457754     100     [14, 15, 16]         NaN
4  6566666     200          [1, 14]    616351.0
5  6457754     556          [14, 1]    616351.0
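
As a sanity check, Method 1 can be compared against a plain-loop version of the rule from the question (first earlier row that shares an element, otherwise NaN). This is only a sketch: it rebuilds the dataframe with the values from the desired output, and the names vectorised, reference and matches are made up for the comparison.

```python
import numpy as np
import pandas as pd

# Values taken from the desired output in the question (level0 left out for brevity).
counts95 = pd.DataFrame({
    'index': [616351, 616352, 616353, 6457754, 6566666, 6457754],
    'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]],
})
c95_list = counts95['links'].tolist()

# Method 1, with the axis spelled out by keyword.
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list], -1)
vectorised = (counts95.loc[np.argmax(m, axis=1), 'index']
                      .where(m.any(axis=1)).to_numpy())

# Reference: for each row, the 'index' value of the first earlier row sharing an element.
ids = counts95['index'].tolist()
reference = []
for j, current in enumerate(c95_list):
    hit = np.nan
    for i in range(j):
        if set(current) & set(c95_list[i]):
            hit = ids[i]
            break
    reference.append(hit)

# equal_nan=True treats NaN entries in the same position as equal.
matches = np.array_equal(vectorised, np.array(reference, dtype=float), equal_nan=True)
```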

Method 2:
If your dataframe is big, comparing each sublist only to the part of links above it is faster. It is probably about 2x faster than method 1 on a big dataframe.

c95_list = counts95.links.tolist()
m = [[any(x in l2 for x in l1) for l2 in c95_list[:i]] for i,l1 in enumerate(c95_list)]
counts95['linkoflist'] = counts95.reindex([np.argmax(y) if any(y) else np.nan 
                                                   for y in m])['index'].to_numpy()
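
The rough 2x estimate is plausible from the number of sublist pairs each method compares. The sketch below only counts pairs on the sample links; it is not a benchmark, and pairs_method1 and pairs_method2 are names made up for the illustration.

```python
c95_list = [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]]
n = len(c95_list)

# Method 1 compares every row with every row (n * n pairs) before np.tril discards half.
pairs_method1 = sum(len(c95_list) for _ in c95_list)

# Method 2 compares row i only with the i rows above it (n * (n - 1) / 2 pairs).
pairs_method2 = sum(len(c95_list[:i]) for i in range(n))
```

For large n the ratio n * n to n * (n - 1) / 2 approaches 2, which fits the estimate above; actual timings will also depend on the list lengths and on where the first overlap occurs.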

Step by step (method 1)

m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)

Out[353]:
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [False, False, False, False, False, False],
       [ True, False,  True,  True, False, False],
       [ True, False,  True,  True,  True, False]])

argmax returns the position of the first True in each row, and also returns 0 (the position of the first False) for an all-False row.

In [354]: np.argmax(m, axis=1)
Out[354]: array([0, 0, 0, 0, 0, 0], dtype=int64)

Slicing using the result of argmax

counts95.loc[np.argmax(m, axis=1), 'index']

Out[355]:
0    616351
0    616351
0    616351
0    616351
0    616351
0    616351
Name: index, dtype: int64

Chain where to turn the rows that correspond to all-False rows of m into NaN

counts95.loc[np.argmax(m, axis=1), 'index'].where(m.any(1))

Out[356]:
0         NaN
0    616351.0
0    616351.0
0         NaN
0    616351.0
0    616351.0
Name: index, dtype: float64

Finally, the index of the output is different from the index of counts95, so just call to_numpy to get the ndarray to assign to the column linkoflist of counts95.
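
The reason to_numpy matters is that pandas aligns a Series on its index labels when it is assigned to a column. A small sketch on a made-up dataframe (df, s and the column names are hypothetical, and the labels here are unique, unlike the repeated 0 labels above):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})

# A Series whose labels are in a different order than df's index.
s = pd.Series([1.0, 2.0, 3.0], index=[2, 1, 0])

# Assigning the Series aligns by label: row 0 receives the value labelled 0.
df['aligned'] = s

# Assigning the underlying array ignores labels and fills by position.
df['positional'] = s.to_numpy()
```

With the result of Method 1 every label is 0, so label alignment cannot give the intended row-by-row assignment; the positional form is the one that works there.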
