Benchmarking Pandas string operations: a speed test of several methods

With the rise of LLMs, many datasets are now released as DataFrames, so the need to manipulate strings through Pandas keeps growing. This article benchmarks several string-manipulation methods to see how they affect Pandas performance, because once the data grows past a certain size, some of these approaches start to behave very differently.

We created a 100,000-row test dataset with Faker.

Test Methods

Install:

 !pip install faker

Generating the test data is simple:

 import pandas as pd
 import numpy as np
 
 def gen_data(x):
   from faker import Faker
   fake = Faker()
   outdata = {}
   for i in range(0,x):
     outdata[i] = fake.profile()
   return pd.DataFrame(outdata).T
 
 n= 100000
 basedata = gen_data(n)

Then mount Google Drive in Google Colab so the output can be stored there:

 from google.colab import drive
 drive.mount('/content/drive')

Next we create very simple functions to test various ways of concatenating two strings. They all share the name process, so only the most recently defined variant is active; redefine the one you want before each benchmark run.

 # variant 1: join a list of strings
 def process(a,b):
   return ''.join([a,b])

 # variant 2: native '+' concatenation
 def process(a,b):
   return a+b

 # variant 3: f-string
 def process(a,b):
   return f"{a}{b}"

 # variant 4: f-string, result repeated 100 times (a heavier workload)
 def process(a,b):
   return f"{a}{b}"*100

Create an empty DataFrame and write a function that adds the %%timeit output as a row to it:

 # add a row to the dataframe using %%timeit output
 def add_to_df(n, m, x, outputdf):
   outputdf.loc[len(outputdf.index)] = [m, n, x]
 
 # output frame
 outputdf = pd.DataFrame(columns=['method', 'n', 'timing'])
 outputdf

Then comes the code that runs each of the functions above and records the results in the output DataFrame. First, take a sample of the data:

 # get a sample of data
 n = 10000
 suffix = 'fstring_100x'
 data = basedata.copy().sample(n).reset_index()

Record the running time:

 %%timeit -r 7 -n 1 -o
 data['newcol'] = ''
 for row in range(len(data)):
   data.at[row ,'newcol'] = process(data.at[row, 'job'], data.at[row, 'company'])
 
 # 451 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 # <TimeitResult : 451 ms ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

Then record the timing with the helper function:

 m = "Iterating over the rows"
 add_to_df(n = n, m = m, x = vars(_), outputdf = outputdf)
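
As a side note (not from the original article): with the -o flag, %%timeit returns a TimeitResult object, which IPython binds to _, so vars(_) captures its per-run statistics. A minimal check of what gets stored:

 # inspect the captured TimeitResult (IPython object returned by %%timeit -o)
 res = _
 print(res.average, res.stdev)    # mean and standard deviation per run, in seconds
 print(sorted(vars(res).keys()))  # the raw fields that vars(_) stored above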

Tests

That is all of the setup code; now let's run the experiments:

iterrows (a Pandas native function), adding row by row

 %%timeit -r 7 -n 1 -o
 data['newcol'] = ''
 for row, item in data.iterrows():
   data.at[row ,'newcol'] = process(item['job'], item['company'])

itertuples (safer because the returned tuples are immutable), adding row by row

 %%timeit -r 7 -n 1 -o
 data['newcol'] = ''
 for row, job, company in data[['job','company']].itertuples():
   data.at[row ,'newcol'] = process(job, company)

Adding the columns as strings with the native + operator

 %%timeit -r 7 -n 1 -o
 data['newcol'] = data.job + data.company

Using the native pandas.Series.add function

 %%timeit -r 7 -n 1 -o
 data['newcol'] = data.job.add(data.company)

Using DataFrame.apply

 %%timeit -r 7 -n 1 -o
 data['newcol'] = data.apply(lambda row: process(row['job'],row['company']), axis=1)

Using list and map

 %%timeit -r 7 -n 1 -o
 data['newcol'] = list(map(process, data.job, data.company))

Pandas vectorization

 %%timeit -r 7 -n 1 -o
 data['newcol'] = process(data.job, data.company)
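
Note that whether this call is truly elementwise depends on which process() variant is active: the a + b version dispatches to Series.__add__, while the f-string versions format the whole Series object into a single string. A small check (not part of the original article) illustrates the difference:

 # '+' on two Series is elementwise; an f-string formats each Series' text representation
 s1 = pd.Series(['a', 'b'])
 s2 = pd.Series(['x', 'y'])
 print(s1 + s2)            # elementwise result: 'ax', 'by'
 print(f"{s1}{s2}"[:40])   # one long string built from the two Series' reprs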

NumPy array vectorization

 %%timeit -r 7 -n 1 -o
 data['newcol'] = process(data.job.to_numpy(), data.company.to_numpy())

Explicit NumPy vectorization (np.vectorize) on NumPy arrays

 %%timeit -r 7 -n 1 -o
 data['newcol'] = np.vectorize(process)(data.job.to_numpy(), data.company.to_numpy())

Optimized list comprehension

 %%timeit -r 7 -n 1 -o
 data['newcol'] = ''
 data['newcol'] = [process(i, j) for i, j in list(zip(data.job, data.company))]

And finally, save the resulting output:

 outputdf.to_csv(f"./drive/MyDrive/{n}_{suffix}.csv")

Results

The results are shown below. I ran the tests with three of the concatenation functions above.

Native string addition: c = a + b

Time required when scaling from 1,000 rows to 100,000 rows:

Visual comparison:

All vectorized methods are very fast, and the standard Pandas Series.add also vectorizes over the underlying NumPy arrays. Pandas' native methods are generally linear in N, while list-map appears to grow roughly with the square root of N.
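
To reproduce the scaling curves, the same measurement can be repeated for several sample sizes. A minimal sketch (not from the original notebook), reusing basedata, process and outputdf from above and using Python's timeit module instead of the %%timeit magic:

 import timeit

 # hypothetical helper: time one concatenation method at several sample sizes
 def scale_test(method_name, func, sizes=(1_000, 10_000, 100_000)):
   for n in sizes:
     sample = basedata.sample(n).reset_index()
     t = timeit.timeit(lambda: func(sample), number=3) / 3   # mean seconds per run
     outputdf.loc[len(outputdf.index)] = [method_name, n, t]

 # example: the list(map(...)) variant
 scale_test('list-map', lambda d: list(map(process, d.job, d.company)))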

Using an f-string: c = f"{a}{b}"

With the f-string, the results are interesting, and some are hard to explain.

Timing:

Visualization:

Time-wise, vectorization only behaves as expected for DataFrames longer than about 10,000 rows.

The figure below shows the third function, the *100 variant, which is more illustrative: the baseline time of the vectorized operation barely changes.

Summary

Through the above tests, we can summarize the results:

1. It is the same old advice: do not use iterrows() or itertuples(), and try to avoid DataFrame.apply(), because all of them still loop over the rows one at a time.

2. Vectorized operations also work for strings, but to be safe, operate on NumPy arrays (see the sketch after this list).

3. A list comprehension is exactly what its name says: it still builds a plain list.

4. There are still a few strange, hard-to-explain results, but most cases can be explained.
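
As a quick reference, a minimal sketch of the patterns that came out fastest above (assuming the a + b variant of process and the data sample from earlier):

 # fully vectorized: plain Series addition (fastest and simplest)
 data['newcol'] = data.job + data.company

 # explicit NumPy vectorization on the underlying arrays
 data['newcol'] = np.vectorize(process)(data.job.to_numpy(), data.company.to_numpy())

 # list comprehension: still a Python-level loop, but far faster than apply/iterrows
 data['newcol'] = [process(a, b) for a, b in zip(data.job, data.company)]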

https://avoid.overfit.cn/post/2633908f89b14e0bb14bcaab443c3fec
If you have a better explanation, please leave a comment.

Author: Dr. Mandar Karhade
