pymoments.blogspot.com

An easy way to parallelize pandas.apply processing

Pandarallel is an easy-to-use Python package that parallelizes pandas operations across multiple CPU cores. In this example, we build a pandas DataFrame from the text of sklearn's 20newsgroups dataset and apply a text preprocessing function row by row. In standard pandas we do this by calling .apply on the DataFrame. To parallelize it, we first need to initialize pandarallel (the initializer accepts an nb_workers argument that sets the number of CPU cores to use; the default is all available). Once pandarallel is initialized, we simply replace the .apply call with .parallel_apply. Note that parallelization comes with a memory cost compared to the standard pandas operation (the pandarallel documentation mentions roughly 2x the memory; I have not benchmarked that).
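As a rough sketch of that workflow (this is not the code from the gist): the cleaning logic and the names preprocess_text, preprocess_row, and run_example are made up for illustration, and the choice of nb_workers=2 and the "train" subset are assumptions. The third-party imports live inside run_example so the cleaning helper can be tried on its own.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_text(text: str) -> str:
    """Illustrative cleaning step: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def preprocess_row(row):
    """Row-wise wrapper: receives one DataFrame row and cleans its 'text' field."""
    return preprocess_text(row["text"])


def run_example(nb_workers: int = 2):
    """Run the same row-wise preprocessing with .apply and with .parallel_apply."""
    # Imported here so the helpers above work without these packages installed.
    import pandas as pd
    from pandarallel import pandarallel
    from sklearn.datasets import fetch_20newsgroups

    # Downloads the dataset on first use; .data is a list of raw posts.
    newsgroups = fetch_20newsgroups(subset="train")
    df = pd.DataFrame({"text": newsgroups.data})

    # Standard pandas: one core.
    serial = df.apply(preprocess_row, axis=1)

    # Pandarallel: initialize once, then swap .apply for .parallel_apply.
    pandarallel.initialize(nb_workers=nb_workers)
    parallel = df.parallel_apply(preprocess_row, axis=1)

    # Both paths should produce the same Series.
    assert serial.equals(parallel)
    return parallel
```

Calling run_example() runs both versions and checks that they agree; wrapping each call in a timer is a simple way to see whether the extra processes pay off for your data size.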

Github gist with code

Dependencies: Python 3.9, pandarallel==1.6.3, pandas==1.4.3, scikit-learn==1.1.2
