I have a dataframe of shop names that I'm trying to standardize. Small sample to test here:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'store': ['McDonalds', 'Lidls', 'Lidl New York 123', 'KFC', 'Lidi Berlin',
              'Wallmart LA 90210', 'Aldi', 'London Lidl', 'Aldi627',
              'mcdonaldsabc123', 'Mcdonald_s', 'McDonalds12345', 'McDonalds5555',
              'McDonalds888', 'Aldi123', 'KFC-786', 'KFC-908', 'McDonalds511',
              'GerALDInes Shop'],
    'standard': np.nan,
})
store standard
0 McDonalds NaN
1 Lidls NaN
2 Lidl New York 123 NaN
3 KFC NaN
4 Lidi Berlin NaN
5 Wallmart LA 90210 NaN
6 Aldi NaN
7 London Lidl NaN
8 Aldi627 NaN
9 mcdonaldsabc123 NaN
10 Mcdonald_s NaN
11 McDonalds12345 NaN
12 McDonalds5555 NaN
13 McDonalds888 NaN
14 Aldi123 NaN
15 KFC-786 NaN
16 KFC-908 NaN
17 McDonalds511 NaN
18 GerALDInes Shop NaN
I set up a regex dictionary to search for a string and insert a standardized version of the shop name into the standard column. This works fine for this small dataframe:
# set up the dictionary
regex_dict = {
    "McDonalds": r'(mcdonalds|mcdonald_s)',
    "Lidl": r'(lidl|lidi)',
    "Wallmart": r'wallmart',
    "KFC": r'KFC',
    "Aldi": r'(\baldi\b|\baldi\d+)',
}
import re

# loop through the dictionary, masking with str.contains
for regname, regex_formula in regex_dict.items():
    df.loc[df['store'].str.contains(regex_formula, na=False, flags=re.I), 'standard'] = regname
print(df)
store standard
0 McDonalds McDonalds
1 Lidls Lidl
2 Lidl New York 123 Lidl
3 KFC KFC
4 Lidi Berlin Lidl
5 Wallmart LA 90210 Wallmart
6 Aldi Aldi
7 London Lidl Lidl
8 Aldi627 Aldi
9 mcdonaldsabc123 McDonalds
10 Mcdonald_s McDonalds
11 McDonalds12345 McDonalds
12 McDonalds5555 McDonalds
13 McDonalds888 McDonalds
14 Aldi123 Aldi
15 KFC-786 KFC
16 KFC-908 KFC
17 McDonalds511 McDonalds
18 GerALDInes Shop NaN
The problem is I have about six million rows to standardize, with a regex dictionary much larger than the one shown here (many different shop names, with some misspellings, etc.).
What I would like to do is, at each loop iteration, only apply str.contains to the rows that have not yet been standardized, and skip the rows that already have been. The idea is to shrink the search space with each pass, reducing the overall processing time.
I have tested indexing by the standard column, only performing str.contains on rows where standard is NaN, but it does not result in any real speedup. It still takes time to figure out which rows are NaN before str.contains is applied.
Here is what I tried to reduce the processing time each loop:
for regname, regex_formula in regex_dict.items():
    # only apply str.contains to rows where standard is still NaN
    df.loc[df['standard'].isnull() & df['store'].str.contains(regex_formula, na=False, flags=re.I), 'standard'] = regname
This works, but running it on my full six million rows makes no real difference in speed.
Is it even possible to speed this up on a dataframe of six million rows?
I managed to reduce the time needed by about 40% using the approach below; it's the best I could do.
I create an empty dataframe called fixed_df to collect newly standardized rows, and delete those same rows from the original dataframe at the end of each loop iteration. The search space shrinks with every pass as each shop is standardized, while fixed_df grows. In the end, fixed_df should contain all the original rows, now standardized, and the original df should be empty (apart from any rows that match none of the patterns).
import re

# create an empty df to collect standardized rows
fixed_df = pd.DataFrame()

# loop through the dictionary
for regname, regex_formula in regex_dict.items():
    # search for the regex pattern, write the standardized name into the standard column
    df.loc[df['store'].str.contains(regex_formula, na=False, flags=re.I), 'standard'] = regname
    # get the index of the rows that were just standardized
    ind = df[df['standard'] == regname].index
    # move the fixed rows into the new df (note: append returns a copy and was
    # removed in pandas 2.0, so the result must be reassigned, here via concat)
    fixed_df = pd.concat([fixed_df, df.loc[ind]])
    # drop the processed rows from the original df, shrinking the next search
    df = df.drop(ind)
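Another option worth benchmarking (a minimal sketch, not tested at the six-million-row scale): combine all the patterns into a single regex with one named group per standard name, so the store column is scanned once instead of once per dictionary entry. For str.extract to return exactly one column per name, the inner groups have to be non-capturing, i.e. (?:...), so the dictionary below restates the question's patterns under that assumption.

```python
import re
import pandas as pd

# same mapping as in the question, but with inner groups made
# non-capturing so str.extract yields one column per standard name
regex_dict = {
    "McDonalds": r"(?:mcdonalds|mcdonald_s)",
    "Lidl": r"(?:lidl|lidi)",
    "Wallmart": r"wallmart",
    "KFC": r"KFC",
    "Aldi": r"(?:\baldi\b|\baldi\d+)",
}

df = pd.DataFrame({"store": ["McDonalds12345", "Lidi Berlin", "GerALDInes Shop"]})

# build one combined pattern with a named group per standard name
combined = "|".join(f"(?P<{name}>{pat})" for name, pat in regex_dict.items())

# one pass over the column: each row fills at most one named group
matches = df["store"].str.extract(combined, flags=re.I)

# idxmax(axis=1) picks the first group that matched; .where() keeps
# rows that matched nothing as NaN instead of defaulting to a name
df["standard"] = matches.notna().idxmax(axis=1).where(matches.notna().any(axis=1))
```

Group names must be valid Python identifiers, so mapping keys such as "McDonalds" or "KFC" work as-is, but keys containing spaces or hyphens would need renaming first.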