Posts


How to use the union operator for typing instead of typing.Union

Since Python 3.10 we can use the union operator | instead of typing.Union to define a union of types. In this example, we write str | None to define a type that accepts str as well as NoneType. This is equivalent to writing typing.Union[str, None] or typing.Optional[str]. In general, we can define optional types with the union operator as T | None.
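
A minimal sketch of the idea (the function and variable names are illustrative, not taken from the gist):

from typing import Optional, Union

def greet(name: str | None = None) -> str:
    # str | None accepts both str and NoneType
    return f"Hello, {name}!" if name is not None else "Hello, stranger!"

# All three spellings describe the same type:
assert (str | None) == Union[str, None] == Optional[str]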


Github gist with code

dependencies: python3.10

How to check whether two pandas DataFrames are equal and to what extent


Pandas has its own testing methods to check, e.g., whether two DataFrame objects are equal. Here we use pandas.testing.assert_frame_equal to compare two DataFrame objects. df_two is a copy of df_one but with swapped column order, i.e. the content is the same but the order of the columns differs. Applying assert_frame_equal does indeed raise an AssertionError with detailed information about what caused the failure: 'DataFrame.columns values are different'. If we want to check whether the content of the two DataFrame objects is equal independent of the column order, we can either enforce that both DataFrame objects have the same column order by selecting the columns in the same order, or we can set check_like=True in assert_frame_equal to ignore the order of columns.
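
A condensed sketch of the comparison (the DataFrame content is made up for illustration):

import pandas as pd

df_one = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df_two = df_one[["b", "a"]]  # same content, swapped column order

# Raises AssertionError: DataFrame.columns values are different
# pd.testing.assert_frame_equal(df_one, df_two)

# Option 1: enforce the same column order before comparing.
pd.testing.assert_frame_equal(df_one, df_two[df_one.columns])

# Option 2: ignore the order of columns (and index) with check_like=True.
pd.testing.assert_frame_equal(df_one, df_two, check_like=True)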


Github gist with code

dependencies: python3.10, pandas==1.4.3

How to use zip with strict argument

We can use the built-in zip function to iterate over multiple iterables in parallel. If the iterables are not all of the same length, the loop terminates once the shortest one is exhausted, without raising an exception or warning. Starting with Python 3.10, zip accepts a strict argument, which defaults to False. Setting strict=True raises a ValueError if the iterables are not all of the same length. The implementation is such that zip runs until it exhausts the shortest iterable and only then raises the exception.
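
A small sketch of both behaviours:

names = ["a", "b", "c"]
values = [1, 2]

# Default: silently stops after the shortest iterable is exhausted.
assert list(zip(names, values)) == [("a", 1), ("b", 2)]

# Python 3.10+: strict=True raises once the length mismatch is detected.
try:
    list(zip(names, values, strict=True))
except ValueError as err:
    print(err)  # zip() argument 2 is shorter than argument 1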


Github gist with code

dependencies: python3.10

An easy way to parallelize pandas.apply processing


Pandarallel is an easy-to-use Python package to parallelize pandas operations across multiple CPU cores. In this example, we use the text from sklearn's 20newsgroups dataset to create a pandas DataFrame and apply a text preprocessing method row-wise. In standard pandas we achieve this by calling .apply on our DataFrame. To leverage parallelization, we first need to initialize pandarallel (the initialization accepts an nb_workers argument to specify the number of CPU cores to use; the default is all available cores). Once pandarallel is initialized, we simply replace the .apply call with .parallel_apply. Note that parallelization comes with a memory cost compared to the standard pandas operation (the pandarallel documentation mentions a factor of 2x in memory; I have not benchmarked that).
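
A sketch of the pattern, with a toy preprocessing function standing in for the one in the gist:

import pandas as pd
from pandarallel import pandarallel
from sklearn.datasets import fetch_20newsgroups

pandarallel.initialize(nb_workers=4)  # defaults to all available cores

df = pd.DataFrame({"text": fetch_20newsgroups(subset="train").data})

def preprocess(text: str) -> str:
    # stand-in for the real text preprocessing
    return text.lower().strip()

df["clean"] = df["text"].apply(preprocess)           # standard pandas
df["clean"] = df["text"].parallel_apply(preprocess)  # parallelized drop-in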


Github gist with code

dependencies: python3.9, pandarallel==1.6.3, pandas==1.4.3, scikit-learn==1.1.2

How to parametrize fixtures in pytest


Parametrization of arguments in test functions can be achieved with the built-in pytest.mark.parametrize decorator. Maybe a bit less known is the fact that pytest fixtures can be parametrized as well. pytest.fixture accepts a params argument, a list of parameters. A fixture with params will be executed multiple times, one time for each parameter in params. And each time a fixture is executed all the dependent tests (the ones that use that fixture) will be executed as well.

In this example, test_dict is a fixture that is parametrized with two parameters, min_dict and full_dict, meaning the test_dict fixture will be executed twice and therefore our test_instantiation will also be executed twice due to the fixture. On top of that, we have a parametrization of the arguments indentation and expected_linebreak in test_instantiation. So in total, test_instantiation will be called 4 times.
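
A reconstruction of the setup (the fixture and test names follow the post; the bodies are assumed):

import pytest

@pytest.fixture
def min_dict():
    return {"name": "minimal"}

@pytest.fixture
def full_dict():
    return {"name": "full", "description": "all fields set"}

# Runs once per entry in params; request.param holds the current
# parameter, which we resolve to the fixture of that name.
@pytest.fixture(params=["min_dict", "full_dict"])
def test_dict(request):
    return request.getfixturevalue(request.param)

@pytest.mark.parametrize("indentation, expected_linebreak", [(0, False), (4, True)])
def test_instantiation(test_dict, indentation, expected_linebreak):
    # 2 fixture params x 2 argument sets = 4 test runs
    assert isinstance(test_dict, dict)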

Github gist with code

dependencies: python3.9, pytest==7.1.2

How to read field values from environment variables with pydantic?


To read values and secrets from environment variables we can use Python's os module. Another option is a model class based on pydantic.BaseSettings. The nice thing about BaseSettings is that by default it tries to read field values from environment variables that are either named exactly like the field (in all caps), defined by Field.env, or defined in the model class config. In addition, we can also pass the values as kwargs during instantiation of the model class, which then take precedence over the values from the environment variables. If the expected environment variables are not set and no arguments are passed during init, pydantic raises a ValidationError.
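
A minimal sketch with an assumed model (pydantic v1 API, matching the listed dependency):

import os
from pydantic import BaseSettings, Field

class ApiSettings(BaseSettings):
    api_key: str                                  # read from env var API_KEY
    api_url: str = Field(..., env="MY_API_URL")   # read from env var MY_API_URL

os.environ["API_KEY"] = "secret-key"
os.environ["MY_API_URL"] = "https://example.com"

settings = ApiSettings()
assert settings.api_key == "secret-key"

# kwargs passed at init take precedence over environment variables:
assert ApiSettings(api_key="other-key").api_key == "other-key"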


Github gist with code

dependencies: python3.9, pydantic==1.9.2

How to protect your secrets from accidental exposure in tracebacks and logs?


Protect your secrets from accidental exposure by storing them in pydantic.SecretStr objects

When you have to deal with secrets like authorization credentials and passwords, you want to make sure that those values are not accidentally exposed in error messages, tracebacks, or logs. One easy way to improve the protection of sensitive values is to store them in SecretStr objects from pydantic. A SecretStr is formatted as "**********", which means that calling print() or str() on a SecretStr object exposes no sensitive information. To access the value of a SecretStr object we call .get_secret_value(), and since this only needs to happen when we hand the secret over to perform authentication or login, we reduce the chances of exposure.
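
A short sketch with an assumed model:

from pydantic import BaseModel, SecretStr

class DbConfig(BaseModel):
    user: str
    password: SecretStr

config = DbConfig(user="alice", password="super-secret")

print(config.password)  # **********  (also in tracebacks and logs)

# Reveal the value only at the moment it is actually needed:
assert config.password.get_secret_value() == "super-secret"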


Github gist with code

dependencies: python3.9, pydantic==1.9.2

A convenient FileHandler to read text from local files and files on AWS S3


A convenient FileHandler to read from local and S3 files with cloudpathlib.AnyPath

How can we build one class to handle reads from both local files and files in AWS S3? cloudpathlib is a nice package that can handle S3 paths (see my post on cloudpathlib.CloudPath). cloudpathlib.AnyPath is a polymorphic class that automatically instantiates a CloudPath or a pathlib.Path object, whichever is appropriate based on the input string. In this example, we build a pydantic.BaseModel with one field file of type AnyPath. This pydantic model is our FileHandler to read content from files, be it from a local file system or AWS S3. The nice part is that AnyPath does all the heavy lifting for us, since it instantiates the appropriate Path or S3Path objects with a common interface. That's why we can always just call .read_text() in the get_content method of our FileHandler. To test our FileHandler class we use moto.mock_s3 (see my post on using moto.mock_s3) to mock calls to AWS S3 and tempfile.NamedTemporaryFile to create a temporary local file. The simplicity of the FileHandler class is unfortunately a bit buried in the setup for testing it in this example.
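
A condensed sketch of the handler itself (the moto and tempfile test setup from the gist is omitted; this assumes cloudpathlib's pydantic integration, which lets pydantic validate AnyPath fields):

from cloudpathlib import AnyPath
from pydantic import BaseModel

class FileHandler(BaseModel):
    file: AnyPath  # becomes a pathlib.Path or an S3Path, as appropriate

    def get_content(self) -> str:
        # both backends share the same interface
        return self.file.read_text()

local = FileHandler(file="/tmp/example.txt")
remote = FileHandler(file="s3://my-bucket/example.txt")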


Github gist with code

dependencies: python3.9, boto3==1.24.51, cloudpathlib==0.10.0, moto==3.1.18, pydantic==1.9.2

How to initialize a dictionary from a list of keys?

How to initialize a dictionary from a list of keys using the classmethod fromkeys

The standard dict class has a classmethod .fromkeys which accepts a list of keys and optionally a value. In our example we have a list of feature names that are all of type float. Using fromkeys we can initialize a dictionary with the feature names as keys and float as the value. If no value is provided to fromkeys, it defaults to None.
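
A small sketch (the feature names are made up):

feature_names = ["age", "height", "weight"]

# Every key is mapped to the same value passed to fromkeys.
features = dict.fromkeys(feature_names, float)
# {'age': <class 'float'>, 'height': <class 'float'>, 'weight': <class 'float'>}

# Without a value, every key defaults to None.
dict.fromkeys(feature_names)
# {'age': None, 'height': None, 'weight': None}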


Github gist with code

dependencies: python3.9

How to handle AWS S3 paths using cloudpathlib?



You might be aware that pathlib.Path cannot properly deal with S3-like paths. cloudpathlib is an easy-to-use package that does handle AWS S3 paths as well as paths of other cloud providers. In this example we use cloudpathlib.CloudPath instead of pathlib.Path to instantiate an S3Path object from an S3 URL string. The interface of an S3Path is the same as for a PosixPath object. For example, we can call .parts on it to obtain all components of the S3Path.
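
A small sketch (the bucket and key are made up; the exact tuple shown for .parts is my expectation, not verified output):

from cloudpathlib import CloudPath

path = CloudPath("s3://my-bucket/data/file.csv")  # instantiates an S3Path

print(path.parts)  # e.g. ('s3://', 'my-bucket', 'data', 'file.csv')
print(path.name)   # file.csv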


Github gist with code

dependencies: python3.9, cloudpathlib==0.10.0

An easy way to integration test your calls to S3 with moto


Integration test boto3 methods with moto mock_s3

Moto is a great package to mock AWS services, and I think it is an easy way to have integration tests for your (non-critical) code that interacts with AWS, e.g. boto3-related calls. In our example we have a custom method that creates an AWS S3 bucket. Using the moto.mock_s3 decorator on our test method automatically mocks all calls to S3.
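
A minimal sketch (create_bucket is a stand-in for the custom method in the gist):

import boto3
from moto import mock_s3

def create_bucket(bucket_name: str) -> None:
    client = boto3.client("s3", region_name="us-east-1")
    client.create_bucket(Bucket=bucket_name)

@mock_s3
def test_create_bucket():
    create_bucket("my-test-bucket")
    client = boto3.client("s3", region_name="us-east-1")
    names = [b["Name"] for b in client.list_buckets()["Buckets"]]
    assert "my-test-bucket" in names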


Github gist with code

dependencies: python3.9, boto3==1.24.51, moto==3.1.18

How to obtain a Counter from a list of dictionaries similar to collections.Counter for simple types?



If we have a list of integers we can use collections.Counter to obtain a counter for those integers. If we have a list of dictionaries, simply applying collections.Counter will not work because dictionaries are not hashable. One way to get a counter of dicts from a list of dictionaries is to use itertools.groupby.

Note in the example that the city San José appears twice but in different countries. Our counter correctly keeps them separate since we use both keys, city and country, for sorting and groupby. If we used only city as the key (like in yesterday's post), the counter would show 2 for San José.
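
A sketch with made-up data to show the two-key grouping:

from itertools import groupby

cities = [
    {"city": "San José", "country": "Costa Rica"},
    {"city": "San José", "country": "USA"},
    {"city": "San José", "country": "USA"},
]

def by_city_and_country(d):
    return (d["city"], d["country"])

counter = {
    key: len(list(group))
    for key, group in groupby(sorted(cities, key=by_city_and_country),
                              key=by_city_and_country)
}
# {('San José', 'Costa Rica'): 1, ('San José', 'USA'): 2}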


Github gist with code

dependencies: python3.9

How to get the set of dictionaries from a list of dictionaries with duplicates?



To find the set of dictionaries from a list of dictionaries with duplicates we can use itertools.groupby. The important part is that the list of dictionaries is sorted by the keys of the dicts before applying groupby. In our example this is key='city'. To obtain the list of unique dicts we use only the key part of the tuple returned by the groupby iterator.
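
A small sketch with made-up data:

from itertools import groupby
from operator import itemgetter

cities = [{"city": "Berlin"}, {"city": "Paris"}, {"city": "Berlin"}]

# Sort so equal dicts end up adjacent, then keep only the key part
# of each (key, group) tuple that groupby yields.
unique = [key for key, _ in groupby(sorted(cities, key=itemgetter("city")))]
# [{'city': 'Berlin'}, {'city': 'Paris'}]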


Github gist with code

dependencies: python3.9

How to get all files in a directory and delete them?


Obtain all files in a pathlib.Path directory and delete them


If we have a directory as a Path object we can call .iterdir() on it to obtain a generator that yields all entries in that directory as Path objects. If we want to delete all files in a given directory, we can call .unlink() on every Path object returned by iterdir(). Using missing_ok=True avoids errors due to race conditions, i.e. when a file disappears between listing and deletion.
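
A minimal sketch (the directory path is hypothetical):

from pathlib import Path

directory = Path("/tmp/my-dir")

for path in directory.iterdir():
    if path.is_file():
        # missing_ok=True: no FileNotFoundError if the file vanished
        # between iterdir() and unlink()
        path.unlink(missing_ok=True)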


Github gist with code

dependencies: python3.9

Two ways to combine two dictionaries with unique keys


Two ways to combine and merge two dictionaries in Python: union operator and dict unpacking

How do we combine two dictionaries when they have no intersecting keys and we don't want to modify either of the two original dicts? First, we can combine them with the dict union operator, i.e. |. The dict union operator is available since Python 3.9. If a key appears in both dicts, the union operator takes the value from the right-hand-side dict, i.e. from the last seen. Second, we can use dict unpacking to merge our two dicts. For both methods the order of the keys in the combined dict is given by the order of the dicts in the union or unpacking operation.
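
A small sketch of both methods:

defaults = {"host": "localhost", "port": 8080}
extra = {"user": "alice"}

combined = defaults | extra        # 1) union operator, Python 3.9+
combined = {**defaults, **extra}   # 2) dict unpacking
# both: {'host': 'localhost', 'port': 8080, 'user': 'alice'}

# On a key conflict the right-hand-side / last-seen dict wins:
assert ({"a": 1} | {"a": 2}) == {"a": 2}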


Github gist with code

dependencies: python3.9

Instantiate a pydantic BaseModel from a dictionary object



To instantiate a pydantic.BaseModel we can call the parse_obj method of our model class and provide a dictionary with the content.
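
A minimal sketch with an assumed model:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

user = User.parse_obj({"name": "Ada", "age": 36})
assert user.age == 36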


Github gist with code

dependencies: python3.9, pydantic==1.9.2

Show a tqdm progress bar while uploading a file to S3 with boto3



With boto3 we can obtain an S3 client to upload a local file to AWS S3 with upload_file. For larger files it is nice to see the progress of the upload, which can be achieved by passing tqdm.tqdm as a callback.
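
A sketch of the callback wiring (the file name and bucket are made up):

import os
import boto3
from tqdm import tqdm

file_path = "data.bin"
s3 = boto3.client("s3")

with tqdm(total=os.path.getsize(file_path), unit="B", unit_scale=True) as bar:
    # upload_file calls Callback with the number of bytes
    # transferred since the last call
    s3.upload_file(file_path, "my-bucket", "data.bin", Callback=bar.update)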


Github gist with code

dependencies: python3.9, boto3==1.24.51, tqdm==4.64.0

How to create a directory and all of its parent directories if none or some of them do not exist yet?



If you want to create a directory and all of its parent directories, this can be achieved for pathlib.Path objects by calling Path.mkdir with parents=True. Setting exist_ok=True ensures that no FileExistsError is raised if the directory already exists. This way we can run the same code over and over again and always end up with the same result.
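
A minimal sketch (the target path is hypothetical):

from pathlib import Path

target = Path("/tmp/a/b/c")  # none of a, b, c need to exist yet

# parents=True creates missing parents; exist_ok=True makes the call
# idempotent, so rerunning it never raises FileExistsError.
target.mkdir(parents=True, exist_ok=True)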


Github gist with code

dependencies: python3.9