An easy way to parallelize pandas.apply processing


Pandarallel is an easy-to-use Python package that parallelizes pandas operations across multiple CPU cores. In this example, we use the text from sklearn's 20newsgroups dataset to create a pandas DataFrame and apply a text preprocessing function row-wise. In standard pandas, we do this by calling .apply on the DataFrame. To parallelize the call, we first initialize pandarallel; the initialization accepts an nb_workers argument that sets the number of CPU cores to use (by default, all available cores). Once pandarallel is initialized, we simply replace the .apply call with .parallel_apply. Note that parallelization comes with a memory cost compared to the standard pandas operation: the pandarallel documentation mentions roughly twice the memory, which I have not benchmarked.
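
The swap from .apply to .parallel_apply can be sketched roughly as follows. This is a minimal illustration, not the code from the gist: the preprocess function, the helper names (make_frame, clean_serial, clean_parallel), and the two sample strings are assumptions made up for this example, and the tiny in-memory DataFrame stands in for the 20newsgroups text so the snippet runs without a download.

```python
import re

import pandas as pd


def preprocess(text: str) -> str:
    """Hypothetical cleaning step: lowercase, keep letters only, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def preprocess_row(row: pd.Series) -> str:
    """Row-wise wrapper so the same function works with axis=1."""
    return preprocess(row["text"])


def make_frame() -> pd.DataFrame:
    """Tiny stand-in for the 20newsgroups text (made-up sample strings)."""
    return pd.DataFrame(
        {"text": ["The GPU driver crashed AGAIN!!", "Re: Space shuttle launch, 1993"]}
    )


def clean_serial(df: pd.DataFrame) -> pd.Series:
    """Standard pandas: one core, row by row."""
    return df.apply(preprocess_row, axis=1)


def clean_parallel(df: pd.DataFrame, nb_workers: int = 2) -> pd.Series:
    """Same call through pandarallel; raises ImportError if it is not installed."""
    from pandarallel import pandarallel

    # initialize() must run before parallel_apply becomes available on DataFrames
    pandarallel.initialize(nb_workers=nb_workers)
    return df.parallel_apply(preprocess_row, axis=1)


if __name__ == "__main__":
    df = make_frame()
    serial = clean_serial(df)
    try:
        parallel = clean_parallel(df, nb_workers=2)
    except ImportError:
        print("pandarallel is not installed; skipping the parallel run")
    else:
        # Both paths should produce the same cleaned text
        print(parallel.equals(serial))
```

On a frame this small the parallel version will likely be slower than plain .apply because of process start-up overhead; the gain should only show up on larger inputs such as the full dataset.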


GitHub gist with code

dependencies: python3.9, pandarallel==1.6.3, pandas==1.4.3, scikit-learn==1.1.2