How to check whether two pandas DataFrames are equal and to what extent

Pandas ships its own testing utilities, for example to check whether two DataFrame objects are equal. Here we use pandas.testing.assert_frame_equal to compare two DataFrame objects. df_two is a copy of df_one with the column order swapped, i.e. the content is the same but the columns appear in a different order. With its default settings, assert_frame_equal raises an AssertionError with detailed information about the cause: 'DataFrame.columns values are different'. If we want to check that the content of the two DataFrame objects is equal regardless of column order, we can either enforce the same column order on both by selecting the columns explicitly, or set check_like=True in assert_frame_equal. Note that check_like=True ignores the order of both the index and the columns; the data under each label must still match.
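As a minimal sketch of the comparison described above (the column names and values are made up for illustration and are not taken from the gist), the three cases could look like this:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative data; any small DataFrame works here.
df_one = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# Same content, but with the column order swapped.
df_two = df_one[["b", "a"]].copy()

# Strict comparison: column order matters, so this raises an AssertionError.
try:
    assert_frame_equal(df_one, df_two)
    strict_equal = True
except AssertionError as err:
    strict_equal = False
    print(err)  # explains which part of the frames differs

# Option 1: enforce the same column order on both frames before comparing.
assert_frame_equal(df_one, df_two[df_one.columns])

# Option 2: let pandas ignore the order of index and columns.
assert_frame_equal(df_one, df_two, check_like=True)
```

Option 1 is the more explicit of the two, while option 2 is handy when the frames come from sources that do not guarantee any particular column order.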


Github gist with code

dependencies: python3.10, pandas==1.4.3

An easy way to parallelize pandas.apply processing

Pandarallel is an easy-to-use Python package for parallelizing pandas operations across multiple CPU cores. In this example, we use the text from scikit-learn's 20newsgroups dataset to create a pandas DataFrame and apply a text preprocessing function row by row. In standard pandas we achieve this by calling .apply on the DataFrame. To leverage parallelization, we first need to initialize pandarallel (the initialization accepts an nb_workers argument to specify the number of worker processes; by default it uses all available cores). Once pandarallel is initialized, we simply replace the .apply call with .parallel_apply. Note that parallelization comes with a memory cost compared to the standard pandas operation (the pandarallel documentation mentions roughly twice the memory; I have not benchmarked that).
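The switch from .apply to .parallel_apply can be sketched roughly as follows. This is not the code from the gist: the tiny in-memory list stands in for the 20newsgroups texts, preprocess_row is a hypothetical preprocessing function, and nb_workers=2 is an arbitrary choice. The fallback to plain .apply when pandarallel is not installed is only there to keep the sketch runnable.

```python
import re

import pandas as pd


def preprocess_row(row: pd.Series) -> str:
    """Hypothetical preprocessing: lowercase and collapse non-letters to spaces."""
    return re.sub(r"[^a-z]+", " ", row["text"].lower()).strip()


# Stand-in for the 20newsgroups texts (assumption, to avoid a download).
df = pd.DataFrame(
    {"text": ["Hello, World!", "Pandas  is FAST?", "sci.space rocks", "One more row."]}
)

# Standard pandas: row-wise apply on a single core.
df["clean"] = df.apply(preprocess_row, axis=1)

try:
    from pandarallel import pandarallel

    # Initialize once; nb_workers limits the number of worker processes.
    pandarallel.initialize(nb_workers=2, progress_bar=False)
    # Same call as above, with .apply replaced by .parallel_apply.
    df["clean_parallel"] = df.parallel_apply(preprocess_row, axis=1)
except ImportError:
    # pandarallel is not installed: fall back to the single-core version.
    df["clean_parallel"] = df.apply(preprocess_row, axis=1)
```

On a frame this small the parallel version will most likely be slower because of process start-up overhead; the benefit only shows with many rows or an expensive function.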


Github gist with code

dependencies: python3.9, pandarallel==1.6.3, pandas==1.4.3, scikit-learn==1.1.2