I have a use case where I am comparing the lists in a column against each other. Code below:
for i in range(0, len(counts95)):
    for j in range(i + 1, len(counts95)):
        for x in counts95['links'][i]:
            for y in counts95['links'][j]:
                if x == y and counts95['linkoflinks'][j] is None:
                    counts95['linkoflinks'][j] = counts95['index'][i]
The code works, but it's not Pythonic (it uses 4 nested for loops) and it takes a huge amount of time to run. The main idea is to link records: if any element of the list in counts95['links'] for a row also appears in the list of a later row, update that later row's linksoflinks column with the 'index' value of the earlier row, but only if linksoflinks is still None (no overwriting).
The reference table is below:
counts95 = pd.DataFrame({'index': [616351, 616352, 616353, 6457754],
                         'level0': [25, 30, 35, 100],
                         'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16]],
                         'linksoflinks': [None, None, None, None]})
EDIT: New Dataframe
counts95 = pd.DataFrame({'index': [616351, 616352, 616353, 6457754, 6566666, 464664683],
                         'level0': [25, 30, 35, 100, 200, 556],
                         'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]],
                         'linksoflinks': [None, None, None, None, None, None]})
Desired output:
     index  level0            links  linksoflinks
0   616351      25  [1, 2, 3, 4, 5]           NaN
1   616352      30      [23, 45, 2]      616351.0
2   616353      35      [1, 19, 67]      616351.0
3  6457754     100     [14, 15, 16]           NaN
4  6566666     200           [1,14]      616351.0
5  6457754     556           [14,1]      616351.0
Your desired output uses different values and a different column name compared to your sample DataFrame constructor. I used the DataFrame from your desired output for testing.
Logic:
For each sublist of links, we need to find the row index (I mean the index of the DataFrame, NOT the 'index' column) of the first overlapping sublist. We will use these row indices with .loc on counts95 to get the corresponding values of the 'index' column. To achieve this, we need several steps:
- Compare each sublist to all sublists in links. A list comprehension is a fast and compact way to do this. We need a list comprehension that creates a boolean 2D mask array where each subarray contains True for overlapping rows and False for non-overlapping ones (compare the 2D mask in the step-by-step section with the links column and it will be clearer).
- We want to compare from the top down to the current sublist. I.e. standing at the current row, we only want to compare backward toward the top. Therefore, we need to set every forward comparison to False. This is what np.tril does.
- Inside each subarray of this 2D mask, the position of a True is the row index of a row that the current sublist overlaps with. We need the position of the first True, which is what np.argmax gives us: np.argmax returns the position of the first maximum element of an array, and True counts as 1 and False as 0. Therefore, on any subarray containing a True, it correctly returns the first overlapping row index. On an all-False subarray, however, it returns 0. We will handle all-False subarrays later with where.
- After np.argmax, the 2D mask is reduced to a 1D array. Each element of this array is the row index of the first overlapping sublist. Pass it to .loc to get the corresponding values of the 'index' column. However, the result also wrongly includes rows whose subarray in the 2D mask is all False. We want these rows turned into NaN, which is what .where does.
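The four steps above can be traced on a small example. This is only a sketch on hypothetical toy data (the frame toy and its values are made up for illustration and are not the question's DataFrame); it assumes a default RangeIndex so that positions from np.argmax are valid labels for .loc.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so that one row links to a row other than row 0.
toy = pd.DataFrame({'index': [10, 20, 30, 40, 50],
                    'links': [[1, 2], [3], [3, 4], [2, 4], [9]]})
lists = toy['links'].tolist()

# Step 1: full pairwise mask, full[i, j] is True when list i shares an element with list j.
full = np.array([[any(x in l2 for x in l1) for l2 in lists] for l1 in lists])

# Step 2: keep only comparisons with earlier rows (strictly below the diagonal).
m = np.tril(full, -1)

# Step 3: position of the first True per row; all-False rows also give 0.
first = np.argmax(m, axis=1)

# Step 4: look up 'index' of the first overlapping row, then set rows with no overlap to NaN.
linked = toy.loc[first, 'index'].where(m.any(axis=1)).to_numpy()
print(first)
print(linked)
```

Row 2 ([3, 4]) first overlaps row 1, and row 3 ([2, 4]) overlaps both row 0 and row 2 but is linked to row 0, the earliest one. Rows 0, 1 and 4 have no earlier overlap, so np.argmax gives 0 for them and where is what replaces the bogus lookup with NaN.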
Method 1:
Use a list comprehension to construct the boolean 2D mask m between each list in links and all lists in links. We only need backward comparisons, so use np.tril to set the upper-right triangle of the mask, which represents forward comparisons, to False (with k=-1 the diagonal is cleared too, so a row is never compared with itself). Finally, call np.argmax to get the position of the first True in each row of m, and chain where to turn the all-False rows of m into NaN.
c95_list = counts95.links.tolist()
# m[i, j] is True when list i overlaps list j and j is an earlier row (j < i)
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list], -1)
counts95['linkoflist'] = (counts95.loc[np.argmax(m, axis=1), 'index']
                                  .where(m.any(1)).to_numpy())
Out[351]:
     index  level0            links  linkoflist
0   616351      25  [1, 2, 3, 4, 5]         NaN
1   616352      30      [23, 45, 2]    616351.0
2   616353      35      [1, 19, 67]    616351.0
3  6457754     100     [14, 15, 16]         NaN
4  6566666     200          [1, 14]    616351.0
5  6457754     556          [14, 1]    616351.0
Method 2:
If your DataFrame is big, comparing each sublist only to the part of links above it is faster. It is probably about 2x faster than method 1 on a big DataFrame.
c95_list = counts95.links.tolist()
m = [[any(x in l2 for x in l1) for l2 in c95_list[:i]] for i, l1 in enumerate(c95_list)]
counts95['linkoflist'] = counts95.reindex([np.argmax(y) if any(y) else np.nan
                                           for y in m])['index'].to_numpy()
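As a possible variation on method 2 (not part of the methods above, and not benchmarked here), each list could be converted to a set once so that an overlap test becomes set.isdisjoint, and the scan over earlier rows could stop at the first hit. The helper name first_overlap_index is hypothetical, and the data is taken from the question's constructor. A minimal sketch under those assumptions:

```python
import numpy as np
import pandas as pd

counts95 = pd.DataFrame({
    'index': [616351, 616352, 616353, 6457754, 6566666, 464664683],
    'level0': [25, 30, 35, 100, 200, 556],
    'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]],
})

# Convert once so each overlap test is a set operation instead of a nested scan.
link_sets = [set(links) for links in counts95['links']]
index_values = counts95['index'].to_numpy()

def first_overlap_index(i):
    """Hypothetical helper: 'index' value of the first earlier row that shares
    an element with row i, or NaN when no earlier row overlaps."""
    return next((index_values[j] for j in range(i)
                 if not link_sets[i].isdisjoint(link_sets[j])), np.nan)

# Build a float array so NaN and the integer ids can live in one column.
counts95['linkoflist'] = np.array(
    [first_overlap_index(i) for i in range(len(counts95))], dtype=float)
print(counts95)
```

The generator inside next is lazy, so rows that overlap an early row are resolved after only a few comparisons. The worst case (no overlaps at all) still compares every pair of rows.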
Step by step (method 1)
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list],-1)
Out[353]:
array([[False, False, False, False, False, False],
[ True, False, False, False, False, False],
[ True, False, False, False, False, False],
[False, False, False, False, False, False],
[ True, False, True, True, False, False],
[ True, False, True, True, True, False]])
argmax returns the position of the first True in each row, but for an all-False row it also returns 0 (the position of its first False).
In [354]: np.argmax(m, axis=1)
Out[354]: array([0, 0, 0, 0, 0, 0], dtype=int64)
Slicing using the result of argmax
counts95.loc[np.argmax(m, axis=1), 'index']
Out[355]:
0 616351
0 616351
0 616351
0 616351
0 616351
0 616351
Name: index, dtype: int64
Chain where to turn the rows that are all False in m into NaN
counts95.loc[np.argmax(m, axis=1), 'index'].where(m.any(1))
Out[356]:
0 NaN
0 616351.0
0 616351.0
0 NaN
0 616351.0
0 616351.0
Name: index, dtype: float64
Finally, the index of this output is different from the index of counts95, so just call to_numpy to get an ndarray to assign to the linkoflist column of counts95.
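To check that method 1 reproduces the original nested loops, the two can be compared directly. The sketch below uses the data from the question's constructor and rewrites the question's loops over plain Python lists (an adaptation made here to avoid chained assignment on the DataFrame); the names method1, expected and same are only illustrative.

```python
import numpy as np
import pandas as pd

counts95 = pd.DataFrame({
    'index': [616351, 616352, 616353, 6457754, 6566666, 464664683],
    'level0': [25, 30, 35, 100, 200, 556],
    'links': [[1, 2, 3, 4, 5], [23, 45, 2], [1, 19, 67], [14, 15, 16], [1, 14], [14, 1]],
})

# Method 1: lower-triangular overlap mask, argmax, then where.
c95_list = counts95['links'].tolist()
m = np.tril([[any(x in l2 for x in l1) for l2 in c95_list] for l1 in c95_list], -1)
method1 = counts95.loc[np.argmax(m, axis=1), 'index'].where(m.any(axis=1)).to_numpy()

# Reference: the question's nested loops, written over plain lists.
idx = counts95['index'].tolist()
expected = [np.nan] * len(c95_list)
for i in range(len(c95_list)):
    for j in range(i + 1, len(c95_list)):
        # Only fill rows that are still empty (no overwriting).
        if np.isnan(expected[j]) and any(x in c95_list[j] for x in c95_list[i]):
            expected[j] = idx[i]

# equal_nan=True treats NaN in the same position as a match.
same = np.allclose(method1, np.array(expected, dtype=float), equal_nan=True)
print(same)
```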