With the rise of LLMs, many datasets are released as DataFrames, so the demand for manipulating strings through Pandas keeps growing. This article benchmarks string-manipulation methods to see how they affect Pandas performance, because once the data passes a certain size, Pandas can start to behave strangely.
We created a 100,000-row test dataset with Faker.
Test Methods
Install:
!pip install faker
The way to generate test data is simple:
import pandas as pd
import numpy as np

def gen_data(x):
    from faker import Faker
    fake = Faker()
    outdata = {}
    for i in range(0, x):
        outdata[i] = fake.profile()
    return pd.DataFrame(outdata).T

n = 100000
basedata = gen_data(n)
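The dict-of-dicts pattern above is worth a closer look: pd.DataFrame treats each dict value as a column, so the transpose (.T) is what turns the records into rows. A minimal sketch of the same pattern with made-up stub records instead of Faker (the 'job'/'company' fields mirror those used later; the values are invented):

```python
import pandas as pd

# Same dict-of-dicts -> DataFrame -> transpose pattern as gen_data(),
# but with hand-written stub records instead of Faker profiles.
def gen_stub_data(x):
    outdata = {}
    for i in range(x):
        outdata[i] = {'job': f'job_{i}', 'company': f'company_{i}'}
    # Without .T the records would be columns; transposing makes them rows.
    return pd.DataFrame(outdata).T

df = gen_stub_data(3)
```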
Then mount Google Drive in Google Colab to store the output:
from google.colab import drive
drive.mount('/content/drive')
We created a set of very simple functions to test different ways of concatenating two strings:
# Each definition below replaces the previous one; only one version of
# process() is active per benchmark run.
def process(a, b):
    return ''.join([a, b])

def process(a, b):
    return a + b

def process(a, b):
    return f"{a}{b}"

def process(a, b):
    return f"{a}{b}" * 100
Next, create an empty DataFrame and write a helper function that adds each %%timeit output as a row:
# add a row to the dataframe using %%timeit output
def add_to_df(n, m, x, outputdf):
    outputdf.loc[len(outputdf.index)] = [m, n, x]

# output frame
outputdf = pd.DataFrame(columns=['method', 'n', 'timing'])
outputdf
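A quick sanity check of the helper on its own, with a placeholder timing value standing in for the real %%timeit result:

```python
import pandas as pd

# The helper appends one row per benchmark: method label, sample size,
# and the timing payload (a placeholder float here).
def add_to_df(n, m, x, outputdf):
    outputdf.loc[len(outputdf.index)] = [m, n, x]

outputdf = pd.DataFrame(columns=['method', 'n', 'timing'])
add_to_df(n=10000, m='Iterating over the rows', x=0.451, outputdf=outputdf)
```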
Then comes the code that runs each of the functions above and records the results back into Pandas.
# get a sample of data
n = 10000
suffix = 'fstring_100x'
data = basedata.copy().sample(n).reset_index()
Record the running time:
%%timeit -r 7 -n 1 -o
data['newcol'] = ''
for row in range(len(data)):
    data.at[row, 'newcol'] = process(data.at[row, 'job'], data.at[row, 'company'])
# 451 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Then record the result with the full function call:
m = "Iterating over the rows"
add_to_df(n=n, m=m, x=vars(_), outputdf=outputdf)
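The vars(_) call is the least obvious part: in IPython, `_` holds the last cell's value, here the TimeitResult object returned because of the -o flag, and vars() dumps its instance attributes into a plain dict. A sketch of the same mechanism with a stand-in class (the attribute names are illustrative, not TimeitResult's actual layout):

```python
# vars(obj) returns obj.__dict__, i.e. all instance attributes as a
# dict. Stub stands in for the TimeitResult that `%%timeit -o` returns;
# its attribute names here are made up for illustration.
class Stub:
    def __init__(self):
        self.average = 0.451
        self.repeat = 7

captured = vars(Stub())
```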
Tests
That is the whole harness; now let's run the experiments:
Iterrows (a native Pandas function), adding each row:
%%timeit -r 7 -n 1 -o
data['newcol'] = ''
for row, item in data.iterrows():
    data.at[row, 'newcol'] = process(item['job'], item['company'])
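Part of why iterrows is slow: it materializes a full pd.Series for every row before the string work even starts. A tiny runnable sketch (made-up rows, not the Faker data):

```python
import pandas as pd

def process(a, b):
    return a + b

# iterrows yields (index, Series) pairs; the per-row Series allocation
# is a large share of its overhead.
data = pd.DataFrame({'job': ['Engineer', 'Analyst'],
                     'company': ['Acme', 'Globex']})
data['newcol'] = ''
for row, item in data.iterrows():
    assert isinstance(item, pd.Series)  # each row is boxed into a Series
    data.at[row, 'newcol'] = process(item['job'], item['company'])
```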
Itertuples (safer, since tuples are immutable), adding each row:
%%timeit -r 7 -n 1 -o
data['newcol'] = ''
for row, job, company in data[['job', 'company']].itertuples():
    data.at[row, 'newcol'] = process(job, company)
Adding the strings with Pandas' native + operator:
%%timeit -r 7 -n 1 -o
data['newcol'] = data.job + data.company
Using the native function pandas.Series.add:
%%timeit -r 7 -n 1 -o
data['newcol'] = data.job.add(data.company)
Using DataFrame.apply:
%%timeit -r 7 -n 1 -o
data['newcol'] = data.apply(lambda row: process(row['job'], row['company']), axis=1)
Using list(map(...)):
%%timeit -r 7 -n 1 -o
data['newcol'] = list(map(process, data.job, data.company))
Pandas vectorization
%%timeit -r 7 -n 1 -o
data['newcol'] = process(data.job, data.company)
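One caveat worth making explicit: this "vectorized" call just passes whole Series into process(), so it only works when the active body uses operations that Series overload, such as a + b; the ''.join([a, b]) variant would raise a TypeError instead. A minimal sketch:

```python
import pandas as pd

# Passing whole Series works only because `+` is overloaded on Series;
# a body built on ''.join([a, b]) would fail with a TypeError.
def process(a, b):
    return a + b

jobs = pd.Series(['Engineer', 'Analyst'])
companies = pd.Series(['Acme', 'Globex'])
newcol = process(jobs, companies)
```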
numpy array vectorization
%%timeit -r 7 -n 1 -o
data['newcol'] = process(data.job.to_numpy(), data.company.to_numpy())
Explicitly use numpy vectorization on numpy arrays
%%timeit -r 7 -n 1 -o
data['newcol'] = np.vectorize(process)(data.job.to_numpy(), data.company.to_numpy())
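np.vectorize deserves a note: it is a convenience wrapper that still calls the Python function once per element rather than a compiled ufunc, which is why it cannot match true vectorized string addition. It also infers the output dtype from the first result, which can silently truncate longer strings later in the array, so the sketch below uses equal-length outputs:

```python
import numpy as np

# np.vectorize wraps the scalar function so it broadcasts over arrays,
# but internally it still loops in Python, one call per element.
def process(a, b):
    return f"{a}{b}"

vproc = np.vectorize(process)
# Equal-length outputs, since vectorize sizes the string dtype from the
# first result it sees.
out = vproc(np.array(['ab', 'cd']), np.array(['12', '34']))
```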
Optimized list comprehension
%%timeit -r 7 -n 1 -o
data['newcol'] = ''
data['newcol'] = [process(i, j) for i, j in zip(data.job, data.company)]
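As a sanity check before comparing timings, the concatenation approaches should all agree on the output; a tiny equivalence test (made-up rows, not the Faker data):

```python
import pandas as pd

def process(a, b):
    return a + b

# Three of the approaches, run on the same tiny frame; they should
# produce identical results, differing only in speed.
data = pd.DataFrame({'job': ['Engineer', 'Analyst'],
                     'company': ['Acme', 'Globex']})
via_series = (data.job + data.company).tolist()
via_map = list(map(process, data.job, data.company))
via_comp = [process(i, j) for i, j in zip(data.job, data.company)]
```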
And finally the resulting output:
outputdf.to_csv(f"./drive/MyDrive/{n}_{suffix}.csv")
Results
The results are shown below. I ran the tests with the three different functions above.
Native string addition: c = a + b
Time required to scale from 1,000 rows to 100,000 rows:
Visual comparison:
All the vectorized methods are very fast, and Pandas' standard string addition also vectorizes over NumPy arrays. Pandas' native methods scale roughly linearly, while list-map appears to grow with about the square root of N.
Using f-strings: c = f"{a}{b}"
With f-strings the results are interesting, and some are hard to explain.
Timing:
Visualization:
Time-wise, vectorization behaves as expected for DataFrames longer than about 10,000 rows.
The figure below shows the third function, the f-string repeated 100 times, which is more illustrative: the base time of the vectorized operation barely changes.
Summary
From the tests above we can draw a few conclusions:
1. It is the same old advice: do not use iterrows() or itertuples(), and try to avoid DataFrame.apply(), because all of these still loop over the rows.
2. Vectorized operations also work on strings, but to be safe, apply them to NumPy arrays rather than Series.
3. A list comprehension is exactly what its name says: it is still a loop over a list.
4. A few strange, unexplained results remain, but most of the cases are explainable.
https://avoid.overfit.cn/post/2633908f89b14e0bb14bcaab443c3fec
If you have a better explanation, please leave a comment.
Author: Dr. Mandar Karhade