How to speed up multiple str.contains searches for millions of rows?

SCool :

I have a dataframe of shop names that I'm trying to standardize. Small sample to test here:

import re
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'store': ['McDonalds', 'Lidls', 'Lidl New York 123', 'KFC', 'Lidi Berlin',
              'Wallmart LA 90210', 'Aldi', 'London Lidl', 'Aldi627', 'mcdonaldsabc123',
              'Mcdonald_s', 'McDonalds12345', 'McDonalds5555', 'McDonalds888', 'Aldi123',
              'KFC-786', 'KFC-908', 'McDonalds511', 'GerALDInes Shop'],
    'standard': np.nan
})

                store  standard
0           McDonalds       NaN
1               Lidls       NaN
2   Lidl New York 123       NaN
3                 KFC       NaN
4         Lidi Berlin       NaN
5   Wallmart LA 90210       NaN
6                Aldi       NaN
7         London Lidl       NaN
8             Aldi627       NaN
9     mcdonaldsabc123       NaN
10         Mcdonald_s       NaN
11     McDonalds12345       NaN
12      McDonalds5555       NaN
13       McDonalds888       NaN
14            Aldi123       NaN
15            KFC-786       NaN
16            KFC-908       NaN
17       McDonalds511       NaN
18    GerALDInes Shop       NaN

I set up a regex dictionary to search for a string, and insert a standardized version of the shop name into the column standard. This works fine for this small dataframe:

# set up the dictionary
regex_dict = {
 "McDonalds": r'(mcdonalds|mcdonald_s)',
 "Lidl" : r'(lidl|lidi)',
 "Wallmart":r'wallmart',
 "KFC": r'KFC',
 "Aldi":r'(\baldi\b|\baldi\d+)'
}

# loop through the dictionary, using str.contains to assign the standardized name
for regname, regex_formula in regex_dict.items(): 

    df.loc[df['store'].str.contains(regex_formula,na=False,flags=re.I), 'standard'] = regname

print(df)

                store   standard
0           McDonalds  McDonalds
1               Lidls       Lidl
2   Lidl New York 123       Lidl
3                 KFC        KFC
4         Lidi Berlin       Lidl
5   Wallmart LA 90210   Wallmart
6                Aldi       Aldi
7         London Lidl       Lidl
8             Aldi627       Aldi
9     mcdonaldsabc123  McDonalds
10         Mcdonald_s  McDonalds
11     McDonalds12345  McDonalds
12      McDonalds5555  McDonalds
13       McDonalds888  McDonalds
14            Aldi123       Aldi
15            KFC-786        KFC
16            KFC-908        KFC
17       McDonalds511  McDonalds
18    GerALDInes Shop        NaN

The problem is that I have about SIX million rows to standardize, with a regex dictionary much larger than the one shown here (many different shop names, some with misspellings, etc.).

What I would like to do is, on each loop, only run str.contains on rows that have not yet been standardized and skip the rows that have. The idea is to shrink the search space with each loop, thereby reducing the overall processing time.

I have tested filtering on the standard column, only performing str.contains on rows where standard is NaN, but it does not result in any real speedup. It still takes time to work out which rows are NaN before applying str.contains.

Here is what I tried to reduce the processing time each loop:

for regname, regex_formula in regex_dict.items(): 

    # only apply str.contains to rows where standard is still NaN
    df.loc[df['standard'].isnull() & df['store'].str.contains(regex_formula,na=False,flags=re.I), 'standard'] = regname

This works, but running it on my full 6 million rows makes no real difference in speed.
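
For what it's worth, this is roughly how I've been comparing the two loops (just a rough sketch: `standardize` and `skip_standardized` are my own helper names here, and the timings obviously depend on the real data):

import time

def standardize(frame, skip_standardized=False):
    """Run the str.contains loop, optionally masking out rows already standardized."""
    for regname, regex_formula in regex_dict.items():
        mask = frame['store'].str.contains(regex_formula, na=False, flags=re.I)
        if skip_standardized:
            mask &= frame['standard'].isnull()
        frame.loc[mask, 'standard'] = regname

for skip in (False, True):
    test_df = df.copy()                      # fresh copy so both runs see the same data
    start = time.perf_counter()
    standardize(test_df, skip_standardized=skip)
    print(f"skip_standardized={skip}: {time.perf_counter() - start:.2f}s")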

Is it even possible to speed this up on a dataframe of 6 million rows?

SCool :

I managed to reduce the time needed by 40% using this. It's the best I could do.

I create an empty dataframe called fixed_df, append the newly standardized rows to it, and delete the same rows from the original dataframe at the end of each loop. The search space shrinks with each loop as each shop is standardized, while fixed_df grows. In the end, fixed_df should hold all of the original rows, now standardized, and the original df should be empty.

# create an empty df to collect the standardized rows
fixed_df = pd.DataFrame()

# loop through the dictionary
for regname, regex_formula in regex_dict.items(): 

    # search for the regex, write the standardized name into the standard column
    df.loc[df['store'].str.contains(regex_formula, na=False, flags=re.I), 'standard'] = regname

    # get the index of the rows that were just standardized
    ind = df[df['standard'] == regname].index

    # append the standardized rows to the new df (append returns a new frame, so reassign)
    fixed_df = fixed_df.append(df[df.index.isin(ind)].copy())

    # drop the processed rows from the original df so the next iteration searches fewer rows
    df = df[~df.index.isin(ind)].copy()
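
Note that DataFrame.append is deprecated in recent pandas versions (and removed in pandas 2.0), so on a newer version the same idea can be written by collecting the standardized chunks in a list and concatenating once at the end. A sketch of that variant (same logic, just without append):

fixed_parts = []

for regname, regex_formula in regex_dict.items():

    # standardize the rows that match this regex
    matched = df['store'].str.contains(regex_formula, na=False, flags=re.I)
    df.loc[matched, 'standard'] = regname

    # set aside the standardized rows and keep only the unprocessed ones
    fixed_parts.append(df[matched].copy())
    df = df[~matched].copy()

# one concat at the end instead of repeated appends
fixed_df = pd.concat(fixed_parts)

# optional: restore the original row order
fixed_df = fixed_df.sort_index()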
