Data Modelling Challenge

Scenario

You are designing a database for an online bookstore. The system should handle customers, their orders, and the books being sold. A customer can place multiple orders, and each order can contain multiple books.

Tasks:

  1. Identify the main entities in this system.

  2. Describe the possible relationships between these entities (e.g., one-to-many, many-to-many).

  3. For each entity, design a simple data structure with at least 3 attributes and their respective type and constraints.

Deliverables:

  • A list of entities and their relationships.
  • A table structure for each entity.

Solution

  1. The information system has 3 entities: Customer, Order and Book.

  2. A customer can place multiple orders, but each order is associated with one and only one customer (one-to-many). An order can contain multiple books and a book can appear in different orders (many-to-many).

  3. Below is a list of attributes for each entity, with their types and constraints (physical data model):

Customer:

attribute        data_type     constraints
customer_id      INT           PRIMARY_KEY, AUTO_INCR
name             VARCHAR(100)  NOT NULL
email            VARCHAR(255)  NOT NULL, UNIQUE
phone_number     VARCHAR(15)   NULLABLE, UNIQUE
address          TEXT          NULLABLE
last_order_date  DATETIME      NULLABLE

Note: name is not marked UNIQUE, since two different customers can share the same name; email is the unique identifier here.

Order:

attribute     data_type       constraints
order_id      INT             PRIMARY_KEY, NOT NULL, AUTO_INCR
customer_id   INT             FOREIGN_KEY, NOT NULL
order_date    DATETIME        NOT NULL
total_amount  DECIMAL(10, 2)  NOT NULL

Book:

attribute        data_type       constraints
book_id          INT             PRIMARY_KEY, NOT NULL, AUTO_INCR
title            VARCHAR(100)    NOT NULL
price            DECIMAL(10, 2)  NOT NULL
description      TEXT            NULLABLE
publishing_date  DATETIME        NULLABLE
stock_quantity   INT             NOT NULL

Having created the tables for those 3 entities, it appears that a fourth entity is required to resolve the many-to-many relationship between an order and the book(s) it contains:

OrderDetails:

attribute   data_type       constraints
order_id    INT             FOREIGN_KEY, NOT NULL
book_id     INT             FOREIGN_KEY, NOT NULL
quantity    INT             NOT NULL
unit_price  DECIMAL(10, 2)  NOT NULL

Note: the pair (order_id, book_id) can serve as a composite primary key for this table.
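To check that the physical model holds together, here is a minimal sketch that creates the four tables in an in-memory SQLite database using Python's sqlite3 module. Several things are assumptions of this sketch rather than part of the exercise: the table names (orders is used because ORDER is a reserved SQL word), the composite primary key on order_details, a non-unique name column, and the SQLite-specific spelling of the auto-increment constraint.

```python
import sqlite3

# Assumed table names; "orders" avoids the reserved word ORDER.
SCHEMA = """
CREATE TABLE customer (
    customer_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    name            VARCHAR(100) NOT NULL,  -- assumed non-unique
    email           VARCHAR(255) NOT NULL UNIQUE,
    phone_number    VARCHAR(15) UNIQUE,
    address         TEXT,
    last_order_date DATETIME
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date   DATETIME NOT NULL,
    total_amount DECIMAL(10, 2) NOT NULL
);
CREATE TABLE book (
    book_id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title           VARCHAR(100) NOT NULL,
    price           DECIMAL(10, 2) NOT NULL,
    description     TEXT,
    publishing_date DATETIME,
    stock_quantity  INTEGER NOT NULL
);
CREATE TABLE order_details (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    book_id    INTEGER NOT NULL REFERENCES book(book_id),
    quantity   INTEGER NOT NULL,
    unit_price DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (order_id, book_id)  -- assumed composite key
);
"""


def create_bookstore_db() -> sqlite3.Connection:
    """Create the four tables in an in-memory database."""
    connection = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)
    return connection
```

With foreign keys switched on, inserting an order for a customer that does not exist is rejected, which is the one-to-many rule from task 2 enforced by the database itself.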

Acknowledgement

This scenario was designed using the following ChatGPT prompts (GPT-4o model):

  1. what is data modelling?

  2. can you recap all this information into tables?

  3. can you imagine a small exercise for me to check that I master the presented information?

Automate the creation of repositories

One thing that is particularly cumbersome at the very start of a new Python project is creating the structure of the repository.

It always looks the same and is never the most interesting part of the project:

.
├── Makefile
├── README.md
├── your_project_name_in_snake_case
│   ├── __init__.py
│   └── main.py
├── poetry.toml
├── pyproject.toml
└── tests
    ├── __init__.py
    └── test_dummy.py

The good news is that it can be automated!

The objective is to create the structure using a bash script, wrapped in a globally accessible shell command.

Here is how.

Objective

Whenever you want to initialise the structure of a new Python project, the solution should work as follows:

  1. Move to the root of the new project;

  2. There, run a single “create repo” command that creates the whole structure (README.md, Makefile, .gitignore, etc.) for you.

Workflow

(1) Create the .gitignore file and the templates for the other files to be created:

For example, you can find a .gitignore template for Python on the Internet (you can even ask ChatGPT).

(2) Create a bash script to coordinate the creation of the different elements:

I called mine .create-poetry-repo.sh. It contains a set of usual instructions.

For instance, it creates a tests/ folder:

mkdir tests/
touch tests/__init__.py

and populates it with a minimal test suite:

cat > tests/test_dummy.py << EOF
def test_dummy():
    assert True
EOF

(3) The script is configurable: it prompts the user for the project’s name:

echo -n "project name (camel-case): "
read project_name_camel_case

This collected variable is then reused across the script e.g. to create the README.md.

cat > README.md << EOF
# $project_name_camel_case
EOF

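For some use cases, the variable must be converted into snake case first, by replacing every hyphen with an underscore (strictly speaking, a hyphen-separated name is kebab case rather than camel case):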

project_name_snake_case=${project_name_camel_case//-/_}

It can then be used, e.g. to create the package folder:

mkdir $project_name_snake_case
touch $project_name_snake_case/__init__.py

(4) The logic (the bash script and template files such as the .gitignore) is then stored in a single source of truth, i.e. a GitHub repository:

Once you are satisfied with the logic you have encapsulated, upload your .sh script and your templates (e.g. the .gitignore file) to a GitHub repository.

Usage

Now, whenever you want to create a new Python project:

(1) Download the documents (e.g. .gitignore) and the bash script from the remote GitHub repository using the curl command (-o sets the local filename, -L follows redirects):

    zsh> curl -L -o <local_filename> https://raw.githubusercontent.com/<user>/<project>/<branch>/<path_to_remote_filename>

Notes:

  • after the .sh script has been downloaded, you must make it executable: chmod +x <executable_filename> usually does the job.
  • if your GitHub repository is private, you will need a personal access token, which you can create from the developer settings (github.com/settings/tokens); then pass it to the curl command:
curl -H "Authorization: token $GITHUB_API_REPO_TOKEN" \
     -H "Accept: application/vnd.github.v3.raw" \
     -O \
     -L https://raw.githubusercontent.com/<user>/<project>/<branch>/<path_to_remote_filename>
  • the token will be globally accessible from your zsh terminals if stored in the ~/.zshrc file:
export GITHUB_API_REPO_TOKEN="your_github_token"

(2) Execute the bash script:

    zsh> chmod +x .create-poetry-repo.sh
    zsh> ./.create-poetry-repo.sh

(3) To make it even more convenient, wrap the logic in a globally reusable alias stored in ~/.zshrc:

    alias create_poetry_project='curl -H "Authorization: token $GITHUB_API_REPO_TOKEN" \
      -H "Accept: application/vnd.github.v3.raw" \
      -O \
      -L https://raw.githubusercontent.com/<user>/<repo>/<branch>/.create-poetry-repo.sh && chmod +x .create-poetry-repo.sh && ./.create-poetry-repo.sh'
    alias cpp="create_poetry_project"

(4) Do not forget to apply the changes via:

    zsh> source ~/.zshrc

Now, whenever I am in the root of a new project, I just have to execute:

zsh> cpp
Project name (camel-case):

…for the structure to be automatically created!

SQL technical interview

Learning Objectives

  • Build up a general idea of what a SQL technical job interview is like;

  • Learn a handful of SQL best practices;

  • Get familiar with Python’s sqlite3 library.

Business-case

As a phone provider, you want to know which of your customers spent a total accumulated time of 10 or more minutes on the phone.

You can imagine that this list of identified top customers can be later used for marketing purposes.

This list of top customers should contain only the names, alphabetically sorted.

Hence, you have two tables:

  1. The list of customers;

  2. The calls history.

The table customers contains the following information:

customer_id name phone
1 Tim 1234
2 John 5678
3 Mona 9101

The table calls contains the following information:

id caller callee duration
1 1234 5678 4
2 1234 9101 2
3 5678 1234 5
4 5678 9101 7

On this simple test-case dataset, the returned table should be:

name
John
Tim

Solution

The steps are the following:

  1. For each caller – compute the total duration of their calls;

  2. For each callee – compute the total duration of their calls;

  3. A caller can at other times be a callee, and vice versa. Thus, both tables must be merged into one, using the phone number as the identifier;

  4. The total durations, as a caller and as a callee, must be summed up;

  5. A filter is then applied to retain only the rows for which the total duration is greater than or equal to 10 minutes;

  6. Last but not least, we replace the phone number with the associated name and sort the result alphabetically.

Local implementation

Setting sqlite3

To work with the data itself, you can set up your own local SQL database, store the dataset in it and query it with your own custom SQL statements.

Because it is easy to set up, I am using sqlite3, the Python library for SQLite (a serverless, file-based database).

The first step is to import the library. It is part of the Python standard library, so there is nothing to install:

import sqlite3

Then, connect to a local file named database.db in which the tables will be stored (the file is created if it does not exist yet). The call returns a connection object:

connection = sqlite3.connect("database.db")

Create a cursor object. You can think of the cursor as a handle through which you send SQL statements to the database and fetch their results:

cursor = connection.cursor()

Using the cursor, you can create our two tables, customers and calls:

cursor.execute(
  "CREATE TABLE IF NOT EXISTS customers(customer_id, name, phone)"
)
cursor.execute(
  "CREATE TABLE IF NOT EXISTS calls(id, caller, callee, duration)"
)

Then, prepare the data to be inserted into our tables (string values are written with single quotes, as double quotes are reserved for identifiers in standard SQL):

sql_query_customers = """
INSERT INTO
    customers (customer_id, name, phone)
VALUES
    (1, 'Tim', '1234'),
    (2, 'John', '5678'),
    (3, 'Mona', '9101');
"""

sql_query_calls = """
INSERT INTO
    calls (id, caller, callee, duration)
VALUES
    (1, '1234', '5678', 4),
    (2, '1234', '9101', 2),
    (3, '5678', '1234', 5),
    (4, '5678', '9101', 7);
"""

Execute the queries:

cursor.execute(sql_query_customers)
cursor.execute(sql_query_calls)

Do not forget to save the operation:

connection.commit()

Test that the data is actually there:

result = cursor.execute("SELECT * FROM customers")
print(result.fetchall())

This should return the following output:

[
    (1, 'Tim', '1234'),
    (2, 'John', '5678'),
    (3, 'Mona', '9101')
]

Do not forget to close the connection:

connection.close()

Now that our environment is set up, let’s work toward the solution.

Building the intermediate tables

The first step is to sum the call durations for each caller:

WITH caller_durations AS (
  SELECT
    caller AS phone_number,
    SUM(duration) AS duration
  FROM calls
  GROUP BY caller
)
SELECT * FROM caller_durations;

This gives us the following caller_durations table:

phone_number duration
1234 6
5678 12

Interpretation:

  • The customer associated with the phone number 1234 spent 6 minutes as a “caller”;

  • The customer associated with the phone number 5678 spent 12 minutes as a “caller”;

  • The customer associated with the phone number 9101 never called.

We do the same for the callee:

WITH callee_durations AS (
  SELECT
    callee AS phone_number,
    SUM(duration) AS duration
  FROM calls
  GROUP BY callee
)
SELECT * FROM callee_durations;

Which gives us the following callee_durations table:

phone_number duration
1234 5
5678 4
9101 9

Interpretation:

  • The phone number 1234 received a total of 5 minutes call;

  • The phone number 5678 received a total of 4 minutes call;

  • The phone number 9101 received a total of 9 minutes call.

The next step is to combine both tables using a JOIN statement (note that FULL JOIN requires SQLite 3.39 or later):

WITH joined_durations AS (
  SELECT
    caller_durations.phone_number AS phone_number_caller,
    caller_durations.duration AS duration_as_caller,
    callee_durations.phone_number AS phone_number_callee,
    callee_durations.duration AS duration_as_callee
  FROM caller_durations
  FULL JOIN callee_durations
  ON caller_durations.phone_number=callee_durations.phone_number
)
SELECT * FROM joined_durations;

This gives us the following joined_durations table:

phone_number_caller duration_as_caller phone_number_callee duration_as_callee
1234 6 1234 5
5678 12 5678 4
None None 9101 9

Interpretation:

  • The phone number “1234” spent a total of 6 minutes on the phone as a caller and 5 minutes as a callee;

  • The phone number “9101” never placed a phone call but received a couple of incoming phone calls for a total duration of 9 minutes;

  • When both are set, the columns phone_number_caller and phone_number_callee always refer to the same phone number; one of them is NULL (None in Python) when the number appears on one side only.

We can reduce the above table to something more “readable”:

WITH trimmed_durations AS (
  SELECT
    COALESCE(phone_number_caller, phone_number_callee) AS phone,
    COALESCE(duration_as_caller, 0) AS duration_as_caller,
    COALESCE(duration_as_callee, 0) AS duration_as_callee
  FROM joined_durations
)
SELECT * FROM trimmed_durations;

This gives us the following trimmed_durations table:

phone duration_as_caller duration_as_callee
1234 6 5
5678 12 4
9101 0 9

Note: the COALESCE function returns its first non-NULL argument, which makes it possible to replace a NULL value with a default one.
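As a quick check of that behaviour, here is a minimal sketch that evaluates COALESCE on literal values through sqlite3. The function name coalesce_demo is hypothetical, and the literals mirror the 9101 row above.

```python
import sqlite3


def coalesce_demo() -> tuple:
    """Evaluate COALESCE on a few literal values in an in-memory database."""
    connection = sqlite3.connect(":memory:")
    row = connection.execute(
        """
        SELECT
            COALESCE(NULL, '9101'),  -- first argument is NULL: fall back
            COALESCE(NULL, 0),       -- a missing duration becomes 0
            COALESCE(9, 0)           -- a non-NULL value is kept as-is
        """
    ).fetchone()
    connection.close()
    return row


if __name__ == "__main__":
    print(coalesce_demo())  # ('9101', 0, 9)
```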

Now, we can add the durations to get the total duration, i.e. the total number of minutes each customer spent on the phone:

WITH total_durations AS (
  SELECT
    phone AS phone,
    duration_as_caller + duration_as_callee AS total_durations
  FROM trimmed_durations
)
SELECT * FROM total_durations;

This gives us the following total_durations table:

phone total_durations
1234 11
5678 16
9101 9

It is now time to filter to only consider the durations >= 10:

WITH top_users AS (
  SELECT
    phone AS phone
  FROM total_durations
  WHERE total_durations >= 10
)
SELECT * FROM top_users;

This gives us the following top_users table:

phone
1234
5678

However, the phone numbers must be replaced by the associated names. Thus, a second JOIN is needed to map the values with the customers table. The ORDER BY clause is placed in the outer query, because the row order inside a CTE is not guaranteed to carry over to the final result:

WITH final_table AS (
  SELECT
    customers.name AS name
  FROM top_users
  LEFT JOIN customers
  ON top_users.phone=customers.phone
)
SELECT * FROM final_table ORDER BY name ASC;

Which gives us the expected result:

name
John
Tim
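The intermediate tables above are shown one at a time; to run them as a single statement they have to be chained in one WITH clause. Here is a minimal end-to-end sketch in Python. Note the deliberate substitution: instead of the FULL JOIN plus COALESCE route, it stacks callers and callees with UNION ALL before summing, which gives the same totals and also runs on SQLite versions older than 3.39. The function names load_dataset and top_customers are hypothetical.

```python
import sqlite3

CUSTOMERS = [(1, "Tim", "1234"), (2, "John", "5678"), (3, "Mona", "9101")]
CALLS = [
    (1, "1234", "5678", 4),
    (2, "1234", "9101", 2),
    (3, "5678", "1234", 5),
    (4, "5678", "9101", 7),
]

QUERY = """
WITH all_durations AS (
    -- every call counts once for the caller and once for the callee
    SELECT caller AS phone, duration FROM calls
    UNION ALL
    SELECT callee AS phone, duration FROM calls
),
total_durations AS (
    SELECT phone, SUM(duration) AS total_duration
    FROM all_durations
    GROUP BY phone
)
SELECT customers.name
FROM total_durations
JOIN customers ON customers.phone = total_durations.phone
WHERE total_durations.total_duration >= ?
ORDER BY customers.name ASC;
"""


def load_dataset(connection: sqlite3.Connection) -> None:
    """Create both tables and insert the test dataset using placeholders."""
    connection.execute("CREATE TABLE customers(customer_id, name, phone)")
    connection.execute("CREATE TABLE calls(id, caller, callee, duration)")
    connection.executemany("INSERT INTO customers VALUES (?, ?, ?)", CUSTOMERS)
    connection.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", CALLS)
    connection.commit()


def top_customers(connection: sqlite3.Connection, threshold: int = 10) -> list:
    """Return the sorted names of customers at or above the threshold."""
    return [row[0] for row in connection.execute(QUERY, (threshold,))]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_dataset(conn)
    print(top_customers(conn))  # ['John', 'Tim']
    conn.close()
```

Passing the threshold as a ? placeholder, rather than formatting it into the SQL string, keeps the query safe to reuse with user-provided values.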

Python Literal, New and Final Types

Python type annotations are an excellent way to communicate your intentions across your development teams (to enforce them, though, you need a type checker such as mypy).

This makes your code base more robust.

Python offers many types. Here we look in detail at Literal, NewType and Final.

To illustrate the concepts, we will work with the following Cheese object:

from dataclasses import dataclass

@dataclass
class Cheese:
    """
    Class representing a cheese object.
    """

    name: str
    price_per_kilo: float
    aoc: bool


if __name__ == "__main__":

    cheese = Cheese(
        name="roquefort",
        price_per_kilo=25.75,
        aoc=True
    )

Note: AOC stands for Appellation d’Origine Contrôlée. This label is intended to protect French cheeses from counterfeiting. Some cheeses are protected; some are not. There are over 1,600 varieties of cheese, each of them pairing well with specific wines.

Literal

The Literal type allows you to restrict the variable to a very specific set of values.

from dataclasses import dataclass
from typing import Literal

@dataclass
class Cheese:

    name: Literal["roquefort", "comté", "brie"]
    price_per_kilo: float
    aoc: bool


if __name__ == "__main__":

    cheese = Cheese(
        name="abondance",
        price_per_kilo=28,
        aoc=True
    )

If you run mypy on the command line against that file, you will get an error:

error: Argument "name" to "Cheese" has incompatible type "Literal['abondance']";
expected "Literal['roquefort', 'comté', 'brie']"  [arg-type]

Tip: In most development environments you can get the typechecker analysis in real time. In Visual Studio Code, you can get notified of errors as you type using the MyPy Type Checking extension: code --install-extension matangover.mypy.

Notes:

  • To restrict possible values of a variable, you can also use Python Enumerations. However, Literal is more lightweight.

  • You can also use Annotated types to specify more complex constraints (e.g. to constrain a string to a specific size or to match a regular expression) but this type is best served as a communication method.
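To make the comparison with enumerations concrete, here is a minimal sketch of the same restriction expressed with an Enum. The names CheeseName and parse_cheese_name are hypothetical. Unlike Literal, which is only checked by the type checker, the enumeration also rejects unknown values at runtime.

```python
from dataclasses import dataclass
from enum import Enum


class CheeseName(Enum):
    """Hypothetical enumeration of the allowed cheese names."""

    ROQUEFORT = "roquefort"
    COMTE = "comté"
    BRIE = "brie"


@dataclass
class Cheese:
    name: CheeseName
    price_per_kilo: float
    aoc: bool


def parse_cheese_name(raw_name: str) -> CheeseName:
    """Look the member up by value; raises ValueError for unknown names."""
    return CheeseName(raw_name)


if __name__ == "__main__":
    cheese = Cheese(
        name=parse_cheese_name("brie"),
        price_per_kilo=28,
        aoc=True
    )
    print(cheese.name.value)
```

The price of that runtime check is an extra class and a conversion step, which is why Literal remains the lighter option when a type checker is all you need.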

NewType

NewType takes an existing type and creates a distinct new type that has exactly the same fields and methods as the existing one. The distinction only exists for the type checker: at runtime, the new type behaves like the original.

NewType is useful in a handful of real-world scenarios.

For instance, to make sure you only operate on sanitized strings and thus prevent SQL injections, you could distinguish between a str and a SanitizedString.

The same goes for distinguishing a User from a LoggedInUser.

Let’s say we are a gastronomic restaurant. We only want to serve protected cheese to our customers.

For the service, a dedicated function called dispense_protected_cheese_to_customer makes sure that only protected cheese can be served: it only accepts our new ProtectedCheese type (based on the Cheese type) as an argument and refuses everything else.

from dataclasses import dataclass
from typing import NewType

@dataclass
class Cheese:

    name: str
    price_per_kilo: float
    aoc: bool


ProtectedCheese = NewType("ProtectedCheese", Cheese)


def prepare_for_serving(
    cheese: Cheese
) -> ProtectedCheese:
    protected_cheese = ProtectedCheese(cheese)
    # you can imagine other suitable methods being
    # applied here...
    return protected_cheese


def dispense_protected_cheese_to_customer(
    protected_cheese: ProtectedCheese
) -> None:
    # ...
    return None


if __name__ == "__main__":

    cheese = Cheese(
        name="charolais",
        price_per_kilo=47.60,
        aoc=True
    )

    protected_cheese = prepare_for_serving(cheese)
    dispense_protected_cheese_to_customer(protected_cheese)

You can try twisting around the above snippet yourself, by trying to serve a non-protected cheese to a customer:

protected_cheese = prepare_for_serving(cheese)
dispense_protected_cheese_to_customer(cheese)

Mypy will complain:

error: Argument 1 to "dispense_protected_cheese_to_customer" has incompatible type "Cheese";
expected "ProtectedCheese"  [arg-type]

Note: The prepare_for_serving function acts as a blessed function, creating our protected cheese from our original cheese type. By convention, the new type should only ever be created through such a set of blessed functions; the type checker cannot enforce this, so it is up to the team to stick to it.

One more thing: NewType is a useful pattern to be aware of. However, classes and invariants provide similar but much stronger guarantees against illegal states.
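As an illustration of that last point, here is a minimal sketch of a class that checks its invariant when it is constructed. It is a hypothetical alternative to the NewType version, not taken from the book: here ProtectedCheese is a wrapper class, and the rule that it must hold an AOC cheese is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cheese:
    name: str
    price_per_kilo: float
    aoc: bool


@dataclass(frozen=True)
class ProtectedCheese:
    """Wrapper that can only exist around an AOC cheese."""

    cheese: Cheese

    def __post_init__(self) -> None:
        # The invariant is checked at runtime, on every construction.
        if not self.cheese.aoc:
            raise ValueError(f"{self.cheese.name} is not a protected cheese")


if __name__ == "__main__":
    charolais = Cheese(name="charolais", price_per_kilo=47.60, aoc=True)
    print(ProtectedCheese(charolais))
```

Because both classes are frozen, the aoc flag cannot be flipped after the check, so an illegal ProtectedCheese cannot be built by accident, with or without a type checker.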

Final Types

Final types allow you to prevent a variable from being reassigned over time (e.g. if you do not want a constant to be changed by accident).

from typing import Final

RESTAURANT_NAME: Final = "Le Central"

Should you try to set the variable to a new value, mypy will complain:

from typing import Final

RESTAURANT_NAME: Final = "Le Central"

if __name__ == "__main__":

    RESTAURANT_NAME = "Le Bois sans feuilles"

error: Cannot assign to final name "RESTAURANT_NAME"  [misc]

Final is hence useful to prevent a variable from being rebound.
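Rebinding is the only thing Final guards against: the object a Final name refers to can still be mutated. A minimal sketch, where CHEESE_BOARD and add_cheese are hypothetical names:

```python
from typing import Final

CHEESE_BOARD: Final = ["roquefort", "comté"]


def add_cheese(name: str) -> list[str]:
    """Mutate the list in place; the name CHEESE_BOARD itself is not rebound."""
    # Accepted by the type checker: Final only forbids `CHEESE_BOARD = ...`.
    CHEESE_BOARD.append(name)
    return CHEESE_BOARD


if __name__ == "__main__":
    print(add_cheese("brie"))  # ['roquefort', 'comté', 'brie']
```

If the content must not change either, an immutable container such as a tuple is a better fit than a list.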

Note: Both restaurants belong to the Troisgros family. Both are based in Roanne. Le Bois sans feuilles is a prestigious three-star restaurant. A documentary about it was released in 2023: Menus Plaisirs – Les Troisgros.

Type Aliases and Variables Annotations

Even though it is cumbersome, you can add type annotations not only to functions but also to variables:

def reverse_string(string_candidate: str) -> str:
    reversed_string: str = string_candidate[::-1]
    return reversed_string

You can also alias a particularly long type:

# built-in generics (dict[...], list[...]) require Python 3.9+
DictOfLists = dict[str, list[int]]

def element_in_lists(
    dictionary: DictOfLists,
    element: int
) -> list[bool]:
    # one boolean per key: is the element in that key's list?
    return [element in values for values in dictionary.values()]

if __name__ == "__main__":

    my_dict = {
        "a": [1,2],
        "b": [],
        "c": [42, 2, -1]
    }

    result = element_in_lists(
        dictionary=my_dict,
        element=42
    )
    print(result)

Acknowledgment

This post is based on the excellent book: Robust Python: Write Clean and Maintainable Code, Patrick Viafore, O’REILLY.

Python SimpleNamespace

SimpleNamespace is a Python utility that allows you to create simple Python objects.

With it, you can turn a dictionary into an object where the keys are accessible using the “dot notation”.

python> from types import SimpleNamespace
python> person_dict = {
    "firstname": "John",
    "lastname": "Doe",
    "age": 29
}
python> person = SimpleNamespace(**person_dict)
python> person.firstname
'John'

python> person_dict.firstname
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'firstname'

You can see it as a simple data structure. You can use it instead of a class when you need an object that does not need to implement any kind of behavior.

SimpleNamespace vs. Dataclass vs. namedtuple

Here is a recap of the main strength of each type:

SimpleNamespace             Dataclass   namedtuple    classes
prototyping & dot notation  post_init   immutability  behaviors
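The first three columns of that recap can be sketched side by side as follows. The Person and PersonTuple types and the fullname field are hypothetical examples, not part of any library.

```python
from collections import namedtuple
from dataclasses import dataclass, field
from types import SimpleNamespace

# SimpleNamespace: attributes can be added freely, handy for prototyping.
person_ns = SimpleNamespace(firstname="John", lastname="Doe")
person_ns.age = 29


# Dataclass: __post_init__ derives a field right after construction.
@dataclass
class Person:
    firstname: str
    lastname: str
    fullname: str = field(init=False)

    def __post_init__(self) -> None:
        self.fullname = f"{self.firstname} {self.lastname}"


# namedtuple: dot notation too, but fields cannot be reassigned.
PersonTuple = namedtuple("PersonTuple", ["firstname", "lastname"])
person_nt = PersonTuple(firstname="John", lastname="Doe")


def try_rename(person: PersonTuple) -> bool:
    """Return True if the assignment was rejected (immutability)."""
    try:
        person.firstname = "Jane"  # type: ignore[misc]
    except AttributeError:
        return True
    return False


if __name__ == "__main__":
    print(person_ns.age, Person("John", "Doe").fullname, try_rename(person_nt))
```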

Use-Case: Mocking & Prototyping

You have an API endpoint you want to query.

This endpoint returns JSON data.

You have a Python function that takes this JSON data as input, transforms it and returns the transformed JSON as output.

Note: this is typically the case when you have, e.g., a Google Cloud Function listening to an outbound webhook: data is delivered by one external service and passed on to another external service after some transformation. Usually, this means extracting data from the received HTTP payload and preparing a JSON card to be sent to another endpoint via an HTTP POST request, for example alerts you want to send to Microsoft Teams.

Here is the snippet doing the aforementioned actions:

from typing import Any
import requests
from pprint import pprint
from types import SimpleNamespace


def request_random_user() -> requests.Response:
    """
    Method fetching one random user from the API.
    """

    response = requests.get(
        url="https://random-data-api.com/api/v2/users?size=1",
        timeout=300
    )
    response.raise_for_status()
    return response


def reduce_json_content(
    response: requests.Response
) -> dict[str, Any]:
    """
    Method returning a dict with selected
    subfields from the json content in the
    HTTP response.
    """

    content = response.json()
    selected_fields = ("first_name", "last_name", "username")
    return {field: content[field] for field in selected_fields}


def main() -> None:

    response = request_random_user()
    content = reduce_json_content(response)
    pprint(content)


if __name__ == "__main__":
    main()

Let’s give it a shot:

zsh> poetry run python main.py
{
    'first_name': 'Eula',
    'last_name': 'Morar',
    'username': 'eula.morar'
}

Now, let’s say you want to test the reduce_json_content method without calling the API.

You can simply pass mock JSON content to the function.

SimpleNamespace is the perfect tool for that.

Let’s edit the above snippet, replacing the last lines with the following:

if __name__ == "__main__":

    response_mock = SimpleNamespace()
    response_mock.json = lambda: {
        "first_name": "John",
        "last_name": "Doe",
        "username": "johndoe",
    }

    content = reduce_json_content(response_mock)
    pprint(content)

Note: response is of type requests.Response. It possesses a .json() method. Therefore, we need to mock this method using a lambda function. This lambda function does not take any parameters.

zsh> poetry run python main.py
{'first_name': 'John', 'last_name': 'Doe', 'username': 'johndoe'}

Voilà! You can then continue the development without calling the API every time.
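The same mock also fits naturally into a unit test. Here is a minimal, self-contained sketch: reduce_json_content is copied from above with its parameter typed as Any so that the example runs without the requests package, and the test name and the extra email field are hypothetical.

```python
from types import SimpleNamespace
from typing import Any


def reduce_json_content(response: Any) -> dict[str, Any]:
    """Same logic as above; anything with a .json() method is accepted."""
    content = response.json()
    selected_fields = ("first_name", "last_name", "username")
    return {field: content[field] for field in selected_fields}


def test_reduce_json_content() -> None:
    # The mock carries one extra field to check that it gets dropped.
    response_mock = SimpleNamespace(
        json=lambda: {
            "first_name": "John",
            "last_name": "Doe",
            "username": "johndoe",
            "email": "john.doe@example.com",
        }
    )
    assert reduce_json_content(response_mock) == {
        "first_name": "John",
        "last_name": "Doe",
        "username": "johndoe",
    }


if __name__ == "__main__":
    test_reduce_json_content()
```

A test runner such as pytest would pick up test_reduce_json_content by its name; no network access is needed.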

Git Configuration

Like any other tool, git can be configured.

You can script those configurations and store them in configuration files at different locations.

This article answers the following questions:

(1) Why use git configuration? What is it useful for?
(2) What can I configure and how?
(3) What might a configuration file look like?

It will also present a use-case: storing different git user profiles and their specific ssh keys.

Why use git config?

Having git configurations in place allows you to (1) have a clean commit history with your user information displayed on the GitHub or GitLab UI. It is a way to reconcile the user stored on the remote with the user stored on your local machine.

It also provides (2) a handy way to share configuration such as git aliases (see this article).

Last but not least, it allows you to (3) maintain different configurations (e.g. ssh and user identification) if, for instance, you need to switch users across several platforms (e.g. your ssh key for GitLab might be different from the one for GitHub; the same goes for the email and username in use).

Anatomy of a configuration file

A git config file is a simple file. It can have different names depending on where it is located. Generally when located at the root, the name will be .gitconfig.

zsh> tail -n 3 ~/.gitconfig
[user]
    name = John Doe
    email = john.doe@gmail.com

The data in there is written in an INI-like syntax, made of sections and key-value pairs.

A git config file has different sections. The one created by default in a repository is [core]. Another useful one is [user].

Aliases are put under the [alias] section.

Here is an example of what a simple .gitconfig file might look like:

[alias]
    p=pull
    c=commit
    cm=commit -m
    s=status
    sw=switch
    a=add
    alias=!git config -l | grep alias | cut -c 7-
    cam=commit -am
    lo=log --oneline
    sc=switch -c
    rsm=rm -r --cached
    asm=submodule add
[user]
    name = John Doe
    email = "john.doe@gmail.com"

More here: https://git-scm.com/docs/git-config#_configuration_file

The different configuration scopes (global, local)

By default, git config reads the configuration options from different git config files, in this order:

(1) /etc/gitconfig
(2) ~/.gitconfig
(3) .git/config (at the repository level)
(4) .git/config.worktree (at the repository level)

Each file overrides the values of the previous ones: the last value read wins, so the local configuration takes precedence over the global one.
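This “last value read wins” rule can be modelled in a few lines of Python. The sketch below is only an illustration of the lookup order: it does not read any real git file, and the three dictionaries and the effective_config function are hypothetical.

```python
from collections import ChainMap

# Hypothetical content of three scopes, listed in the order git reads them.
system_config = {"core.editor": "vi", "user.name": "System Default"}
global_config = {"user.name": "John Doe", "user.email": "john.doe@gmail.com"}
local_config = {"user.name": "Jane Doe"}


def effective_config(*scopes_in_read_order: dict) -> ChainMap:
    """Combine scopes so that the ones read last take precedence."""
    # ChainMap searches its maps from left to right, hence the reversal.
    return ChainMap(*reversed(scopes_in_read_order))


if __name__ == "__main__":
    config = effective_config(system_config, global_config, local_config)
    print(config["user.name"])   # Jane Doe (local wins)
    print(config["user.email"])  # john.doe@gmail.com (only set globally)
    print(config["core.editor"])  # vi (only set at system level)
```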

To read all the current configurations, you can either cat those files or simply run:

zsh> git config --list --show-scope --show-origin

Note: the --show-origin option lists the file each value comes from, and --show-scope its scope.

If you want to see the values in effect for the current configuration (depending on your working directory), you can use the --get option:

zsh> git config --get user.name
John Doe
zsh> git config --get user.email
john.doe@gmail.com

Note: depending on where you are located – i.e. the working directory – the value returned by the get command might change.

For instance, let’s say you have two git configuration files with different configuration in place:

zsh> pwd
~/my-workplace-folder

zsh> tree -a .
.
├── .git
│   └── config
├── git-project-A
└── git-project-B

zsh> cat .git/config
[core]
    repositoryformatversion = 0
    filemode = true
    bare = false
    logallrefupdates = true
    ignorecase = true
    precomposeunicode = true
[user]
    name = Jane Doe
    email = jane.doe@gmail.com

zsh> git config user.name
Jane Doe

zsh> git config --global user.name
John Doe

zsh> cat ~/.gitconfig
[user]
    name = John Doe
    email = john.doe@gmail.com

You can also edit the git config files via command lines:

zsh> pwd
~/my-workplace-folder

zsh> git config --local user.name "Baby Doe"
zsh> git config --local user.email "baby.doe@gmail.com"

zsh> tail -n 3 .git/config
[user]
    name = Baby Doe
    email = baby.doe@gmail.com

zsh> git config --get user.name
Baby Doe

zsh> git config --global --get user.name
John Doe

One important note when setting values via the command line though:

If you add an equals sign, e.g. git config --local user.name = "Baby Doe", the “=” character is interpreted by the command as the value passed to user.name, and the rest of the line is silently left out of the name. You will end up with the following file:

zsh> cat .git/config
[user]
    name = =

Thus, make sure to write this instead:

zsh> git config --local user.name "Baby Doe"

Use-case: different git user profiles

I have a gitlab/ folder and a github/ one.

zsh> tree .
├── github/
└── gitlab/

By default, I want my git config (user name and email) to be associated with gitlab.

However, I want this config to be overridden when my working directory is under the github/ folder.

Note that because I want those two users to be able to ssh push and pull from the remote repositories, they both have a specific ssh key configured:

zsh> ls -la ~/.ssh
id_rsa_gitlab
id_rsa_gitlab.pub
id_rsa_github
id_rsa_github.pub

Note: you can create a new ssh key running:

zsh> ssh-keygen -o -t rsa -C "gitlab@gmail.com"

Step 1 – in the global ~/.gitconfig file:

[user]
    name = "Gitlab User"
    email = "gitlab@gmail.com"
[includeIf "gitdir:~/github/"]
    path = ~/github/.gitconfig

Step 2 – in the github/ folder:

zsh> git init # create the .git folder
zsh> cat <<EOT >> .gitconfig
[core]
    sshCommand = "ssh -i ~/.ssh/id_rsa_github"
[user]
    name = "Github User"
    email = "github@gmail.com"
EOT
zsh> pwd
~/github/

zsh> git config user.name
Github User

zsh> cd ~/gitlab
zsh> git config user.name
Gitlab User

Note: within the github/ folder, keep in mind that there are now two git config files, .gitconfig and .git/config, which may define overlapping values.

Pylint Logging Format Interpolation

This article explains why f-string interpolation is discouraged in logging functions and why you should rather use:

logging.info("Your string %s", my_var)

instead of:

logging.info(f"Your string {my_var}")

or:

logging.info("Your string %s" % my_var)

Context

In Python, you can use logging functions for better observability. For instance:

"""
Demonstration module for logging fstring
interpolation pylint error.
"""

import logging

logging.basicConfig(level="INFO")
logger = logging.getLogger(__name__)


def main() -> None:
    """
    Main method. Nothing fancy about it.
    """
    try:
        print(8 / 0)
    except ZeroDivisionError as exc:
        logger.info(f"The division failed: {exc}")


if __name__ == "__main__":
    main()

Running the above code produces the following output:

zsh> poetry run python main.py
INFO:__main__:The division failed: division by zero

So far so good.

Error: Use lazy % formatting in logging functions

The next step is to push our code into production and, before that, to run black, mypy and pylint over the code.

Note: you can bundle all of the above in a Makefile like the following:

black:
    poetry run black main.py

mypy:
    poetry run mypy main.py

pylint:
    poetry run pylint main.py

checks: black mypy pylint

Running the checks gives:

zsh> make checks
main.py:19:8: W1203: Use lazy % formatting in logging functions (logging-fstring-interpolation)

------------------------------------------------------------------
Your code has been rated at 9.00/10

So what’s happening here?

You can see that pylint complains about us using a Python f-string in the logging function.

The error message hints that we should use lazy % formatting instead.

Because we want to comply with its instruction, we edit our code, replacing the line with the following:

logger.info("The division failed: %s" % exc)

Confident, we run the checks again, having switched to % formatting and expecting a 10/10 rating:

zsh> make checks
main.py:19:8: W1201: Use lazy % formatting in logging functions (logging-not-lazy)
main.py:19:20: C0209: Formatting a regular string which could be an f-string (consider-using-f-string)

------------------------------------------------------------------
Your code has been rated at 8.00/10 (previous run: 9.00/10, -1.00)

Surprise! Pylint seems to like neither the f-string nor the % formatting…

It even comes up with a new error!

We could obviously silence the error by adding the following inline comment:

logger.info(  # pylint: disable=logging-fstring-interpolation
    f"The division failed: {exc}"
)

Which would give us a nice and neat 10/10 rating:

zsh> make checks
-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 9.00/10, +1.00)

However, rather than sweeping the dirt under the rug, it is better to understand why pylint raises this warning.

So, why is f-string interpolation bad in logging functions?

Hidden Motivation: Performance!

The problem is not so much that we used f-strings in place of lazy % formatting.

The problem is performance.

Going back to our code example, what fixes it, without having to cheat by muting the warning, is lazy % formatting with a comma:

logger.info("The division failed: %s", exc)

The Devil is in the details!

With this final version of our code, the format string and its arguments are passed to the logger separately. If a log statement is not emitted, for instance because its level is disabled, the interpolation cost is saved.

The solution was to shift from:

logger.info("The division failed: %s" % exc)

to:

logger.info("The division failed: %s", exc)

so that the string will only be interpolated if the message is actually emitted.
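To make the saving visible, here is a minimal sketch. The Expensive class and the "lazy_demo" logger name are hypothetical and only serve the illustration: the object counts how many times it is rendered as a string, and the logger is set to WARNING so that INFO messages are never emitted.

```python
import logging


class Expensive:
    """Hypothetical object counting how often it is rendered as a string."""

    def __init__(self) -> None:
        self.calls = 0

    def __str__(self) -> str:
        self.calls += 1
        return "expensive"


logger = logging.getLogger("lazy_demo")
logger.setLevel(logging.WARNING)  # INFO records are dropped

eager = Expensive()
lazy = Expensive()

# The f-string is built before logger.info() is even called
logger.info(f"value: {eager}")

# The argument is only rendered if the record is actually emitted
logger.info("value: %s", lazy)

print(eager.calls, lazy.calls)  # 1 0
```

The f-string version pays the rendering cost even though nothing is logged, while the comma version never calls __str__ at all.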

Working with datetime and timezones

Working with datetimes and timezones in Python can cause a lot of confusion and errors.

It is therefore a best practice to always state explicitly which timezone you are working with (e.g. UTC, CEST…), as you can be sure the bread will always fall on the marmalade side.

For instance, the system time on your CI/CD pipeline runners might differ from your local system time, causing your unit tests to fail on the runner but pass locally.

As a rule of thumb, it is always best to compare Coordinated Universal Time (UTC) with UTC.
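As a small illustration of that rule of thumb, the sketch below (the dates are arbitrary sample values) shows that two timezone-aware datetimes describing the same instant compare equal whatever their offset, whereas ordering a naive datetime against an aware one raises a TypeError.

```python
from datetime import datetime, timedelta, timezone

# Naive datetime: no timezone information attached
naive = datetime(2024, 2, 14, 1, 32)

# Two aware datetimes describing the same instant
utc = datetime(2024, 2, 14, 1, 32, tzinfo=timezone.utc)
plus_one = datetime(2024, 2, 14, 2, 32, tzinfo=timezone(timedelta(hours=1)))

# Aware datetimes are compared on the underlying instant
same_instant = utc == plus_one

# Ordering a naive and an aware datetime is not allowed
try:
    naive < utc
    comparable = True
except TypeError:
    comparable = False

print(same_instant, comparable)  # True False
```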

Let’s see how to do this in practice with code that we will also unit test. Along the way, we will pick up the necessary lingo and discover the tools of the trade.

Usage

Let’s say you have created a custom type Message:

from dataclasses import dataclass

@dataclass
class Message:
    content: str
    created_at: str

You want to convert the Message data class into a JSON-like dictionary – e.g. for better human readability:

from pprint import pprint
from dateutil.parser import parse
from datetime import datetime, timezone
from typing import Dict


def convert_message_to_dict(
    message: Message
) -> Dict[str, str]:
    """
    Method returning a dictionary populated with the
    Message's attributes.
    """
    return {
        "content": message.content,
        "created_at": parse(message.created_at)
        .astimezone(timezone.utc).isoformat(),
        "converted_at": datetime.now(timezone.utc)
        .isoformat(),
    }


if __name__ == "__main__":

    message = Message(
        content="foo",
        created_at="2024-02-14T01:32:00Z"
    )

    converted_message = convert_message_to_dict(message)

    pprint(converted_message)

Notes:

  1. I want my dictionary to only contain str, so I need to convert datetimes with isoformat(), which follows the ISO 8601 standard.

  2. I explicitly want to define the timezone as UTC, using timezone.utc, to avoid any confusion.

  3. In the ISO format you can explicitly add the timezone information at the end of the “YYYY-MM-DDTHH:mm:ss” string. To declare UTC, either “Z” or the offset “+00:00”, “+0000” or “+00” can be used as a suffix interchangeably.

  4. “Z” stands for “Zulu”, the military designation for UTC. It is the same as UTC time. Read more.

  5. As for the “+” or “-” prefixing the offset: the offset is positive (“+”) east of the London (Greenwich) meridian and negative (“-”) west of it. Read more.

  6. pprint stands for pretty print and displays the results in a more human readable manner.
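The suffix rules from notes 3 to 5 can be checked with the standard library alone. This is only a sketch, assuming Python 3.7 or later, where the %z directive of strptime accepts “Z” as well as offsets with or without a colon; the FMT constant is a name introduced for the example.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%S%z"

# Three spellings of the same UTC timestamp
zulu = datetime.strptime("2024-02-14T01:32:00Z", FMT)
colon = datetime.strptime("2024-02-14T01:32:00+00:00", FMT)
compact = datetime.strptime("2024-02-14T01:32:00+0000", FMT)

# One hour east of Greenwich: same instant, positive offset
east = datetime.strptime("2024-02-14T02:32:00+01:00", FMT)

print(zulu == colon == compact == east)  # True
print(zulu.utcoffset() == timedelta(0))  # True
print(east.astimezone(timezone.utc).isoformat())  # 2024-02-14T01:32:00+00:00
```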

Here we go, running the script at “2024-02-13T13:52:09.265050+00:00” gives me the following output:

{
    "content": "foo",
    "created_at": "2024-02-14T01:32:00+00:00",
    "converted_at": "2024-02-13T13:52:09.265050+00:00"
}

Let’s unittest the above code.

Unittest datetime

In order to unit test the aforementioned code, we are going to use freezegun.

freezegun is a handy Python library whose freeze_time decorator allows you to mock datetime.

import pytest
from freezegun import freeze_time
from datetime import datetime, timezone


@pytest.fixture()
def message() -> Message:
    """
    Fixture returning a minimal Message
    with 'created_at' expressed in UTC.
    """
    return Message(
        content="foo",
        created_at="2023-09-01T07:00:00Z"
    )

@freeze_time("2024-02-24T06:00:00", tz_offset=0)
def test_datetime():
    dt = datetime.now().isoformat()
    assert dt == "2024-02-24T06:00:00"


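@freeze_time("2024-02-24T06:00:00", tz_offset=0)
def test_convert_message_to_dict(message):

    # Act: call the function defined earlier
    outcome = convert_message_to_dict(message)

    # "converted_at" comes from datetime.now(timezone.utc),
    # so its ISO string carries the "+00:00" offset
    expected_outcome = {
        "content": "foo",
        "created_at": "2023-09-01T07:00:00+00:00",
        "converted_at": "2024-02-24T06:00:00+00:00"
    }

    assert outcome == expected_outcome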

Notes:

  • the trick is to compare UTC dates to UTC dates;

  • thanks to the fixture, the above unittest code respects the Arrange-Act-Assert pattern.

Python types

This post is part of the Python Crash Course series. The order in which to read the articles can be found in the agenda.

In Python you can manipulate different kinds of objects. With a pocket calculator, you have mostly played around with integers and floating-point numbers, e.g. when adding numbers together. Python extends those capabilities: it allows you not only to manipulate numbers but also to interact with a variety of other objects/types.

Short example

Let’s create a variable a and check its type:

zsh> python
>>> a = 42
>>> type(a)
<class 'int'>

You can see that a is of type int – which refers to integers.

Note: to learn how to start python on the terminal and to interact with it, see: python-via-command-line.

The different built-in types

By default, Python provides a number of types that are already built in. See: Built-in Types.

Each of them comes with its own methods and operators so you can perform operations on them, e.g. adding integers.

The most important and most commonly used are: int, bool, list, tuple, str, dict and set.

type() vs. isinstance() vs is

There are situations where you want to check whether the object you are dealing with is of a specific type. You can do so using the isinstance() function:

zsh> python
>>> isinstance("my_string", str)
True
>>> isinstance(42.0, int)
False

You could have also used the following:

>>> type(42.0) == float
True

However, there are differences between the two:

(1) From a design perspective, it is better to use == to check the value of a variable, not its type.

(2) To compare types, isinstance() is expected to be slightly faster, even though the difference is negligible on the latest Python versions (exact timings vary by machine):

zsh> python
>>> from timeit import timeit
>>> timeit("isinstance(42.0, float)")
0.06043211100040935
>>> timeit("type(42.0) == float")  # compare with the result above

(3) Sometimes you do not want to check whether the object is strictly of a specific type, but rather whether it behaves like that type or inherits its properties.

For instance, a str object has some methods that allow us to perform specific operations on it, like turning the text into uppercase:

zsh> python
>>> "This is my string".upper()
'THIS IS MY STRING'

Let’s now imagine that you want to create your own str type, ExtendedString (defined further below), which inherits the str properties, such as the ability to change the text case via upper() or lower(), but is supplemented with your own custom methods:

zsh> python
>>> my_custom_str = ExtendedString("foo")

This custom type of yours can still be considered a str, as it inherits the main properties of the str type:

>>> my_custom_str.upper()
'FOO'

However, Python sees it differently:

>>> type(my_custom_str) is str
False
>>> type(my_custom_str) == str
False
>>> isinstance(my_custom_str, str)
True

What is important to remember here is that isinstance() checks whether my_custom_str is an instance of str or of any of its subclasses (it is, because ExtendedString inherits from str).

On the other hand, comparing type(my_custom_str) to str with is or == checks whether my_custom_str is directly an instance of str (it is not, because it is, strictly speaking, an instance of ExtendedString).

Note: As for the difference between is and ==, is returns True if two variables point to the same object in memory, while == returns True if the values held by the variables are equal.

zsh> python
>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> a == b # values are equal
True
>>> a is b # but they are 2 different objects
False
>>> a = b # a now points to the same object as b
>>> a is b
True
>>> b.append(4) # thus, if b changes
>>> b
[1, 2, 3, 4]
>>> a # a changes too
[1, 2, 3, 4]

In practice, you can use the above concept e.g. to extend str and give your strings the ability to turn text into Spongemock case. You will have to add this functionality yourself, as this feature is not natively present:

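class ExtendedString(str):
    """A str subclass with an extra Spongemock-case method."""

    def spongebob(self) -> str:
        # Upper-case characters at even positions, lower-case the others
        return "".join(
            char.upper() if i % 2 == 0 else char.lower()
            for i, char in enumerate(self)
        )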

You can imagine the following usage:

>>> my_str = ExtendedString("The cake is a lie")
>>> my_str.spongebob()
'ThE CaKe iS A LiE'
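For readers who prefer a script over an interactive session, the checks discussed in this section can be gathered in one self-contained sketch. The variable names are introduced for the example only. It also shows a detail worth knowing: methods inherited from str, such as upper(), return a plain str rather than an ExtendedString.

```python
class ExtendedString(str):
    """A str subclass with an extra Spongemock-case method."""

    def spongebob(self) -> str:
        return "".join(
            c.upper() if i % 2 == 0 else c.lower()
            for i, c in enumerate(self)
        )


s = ExtendedString("foo")

exact_match = type(s) is str         # strict type check
instance_match = isinstance(s, str)  # accepts subclasses too
upper_type = type(s.upper())         # inherited methods return a plain str

print(exact_match, instance_match, upper_type.__name__)  # False True str
print(s.spongebob())  # FoO
```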

Going further about inheritance

If you go upstream, you can see that str is itself treated as a Sequence: it is registered as a subclass of the Sequence abstract base class. Thus, all str objects are also considered instances of Sequence:

>>> from collections.abc import Sequence
>>> my_str = "foo"
>>> isinstance(my_str, str)
True
>>> isinstance(my_str, Sequence)
True

Going even further upstream, by following the links all the way up, you will ultimately notice that all Python types inherit from the catch-all Python object:

>>> isinstance(my_str, object)
True
>>> isinstance(42, object)
True
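To see the chain Python actually follows, you can inspect the __mro__ attribute (method resolution order) of a class. The sketch below uses a bare, hypothetical ExtendedString subclass. Note that Sequence does not show up in the chain: str is registered with that abstract base class rather than inheriting from it directly, which is why issubclass() still answers True.

```python
from collections.abc import Sequence


class ExtendedString(str):
    """Hypothetical str subclass, empty on purpose."""


# Classes searched, in order, when looking up an attribute
mro_names = [cls.__name__ for cls in ExtendedString.__mro__]
print(mro_names)  # ['ExtendedString', 'str', 'object']

# str is registered with the Sequence abstract base class...
print(issubclass(str, Sequence))  # True

# ...but Sequence is not one of its real parent classes
print(Sequence in str.__mro__)  # False
```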

Data vs. Common Classes

In python you can also create your own types using the dataclasses module:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

The main difference between common classes, like the ExtendedString one previously created, and data classes lies in the fact that data classes are, by convention, not expected to contain much logic or many methods.

Data classes are primarily geared toward storing data, not performing operations on it.
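What the decorator brings in exchange is boilerplate you no longer have to write. As a short sketch with arbitrary sample coordinates, the Point class gets a constructor, a readable representation and a value-based equality for free:

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: float
    y: float


# __init__ is generated from the annotated fields
p = Point(1.0, 2.0)

# __repr__ is generated as well
print(p)  # Point(x=1.0, y=2.0)

# __eq__ compares field by field, not by identity
print(p == Point(1.0, 2.0))  # True
```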

chmod and access permissions

chmod is an abbreviation of change mode (source: wiki/Chmod).

It is used to manage permissions in unix-like operating systems.

Groups & Permissions

Let’s take the example of a file.

There are three kinds of permissions for a file: read (r), write (w) and execute (x).

Those permissions can be granted to different classes of users.

There are three such classes: user (u), group (g) and others (o).

  • user: file owner
  • group: the group owning the file
  • others: users who are neither the owner nor members of the group owning the file.

Each class can have all three permissions (rwx), a combination of two (rw-, r-x, -wx), one (r--, -w-, --x) or none at all (---).

Permissions are listed in the following order: user > group > others (ugo).

Note: thinking about the name “Hugo” might be a nice trick to help you memorize the order.

zsh> touch my_file
zsh> ls -l my_file
-rw-r--r-- my_file # default permissions

Edit permissions

To add permissions:

zsh> chmod +rwx my_file
zsh> ls -l my_file
-rwxr-xr-x my_file

The aforementioned line will add (+) the read (r), write (w) and execute (x) permissions for all classes, except for the bits masked by your umask.

Note: with the usual umask of 022, the group and others classes won’t receive the write permission. You will have to grant it to them specifically.
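The arithmetic behind this behaviour can be sketched in a few lines of Python. This is only an illustration, assuming a umask of 022 and a file starting at rw-r--r-- (644); effective_bits is a hypothetical helper and not part of any library.

```python
def effective_bits(requested: int, umask: int) -> int:
    """Bits actually added when no class (u, g, o) is specified."""
    return requested & ~umask


current = 0o644                       # -rw-r--r--
added = effective_bits(0o777, 0o022)  # +rwx filtered by the umask
new_mode = current | added

print(oct(added), oct(new_mode))  # 0o755 0o755
```

The write bit for group and others (022) is filtered out, which is why the file ends up as rwxr-xr-x rather than rwxrwxrwx.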

To remove permissions:

zsh> chmod -x my_file
zsh> ls -l my_file
-rw-r--r-- my_file

The aforementioned line will remove (-) the execute (x) permission for all classes.

To target specific classes:

zsh> chmod ugo+wr my_file # grant read/write access to all
zsh> ls -l my_file
-rw-rw-rw- my_file

Numerical permissions

Permissions can be represented via numbers:

  • (r) is associated with (4);
  • (w) with (2);
  • and (x) with (1).

They can therefore be added together: e.g. (7) grants (rwx), i.e. (4+2+1), to the targeted class.

Thus, you can define permissions in the following way:

zsh> chmod 754 my_file
zsh> ls -l my_file
-rwxr-xr-- my_file
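To convince yourself of the mapping between digits and letters, here is a small Python sketch. The to_symbolic function is a hypothetical helper written for this example; the standard library's stat.filemode() is used as a cross-check (it expects the file type bits too, hence stat.S_IFREG for a regular file).

```python
import stat


def to_symbolic(mode: int) -> str:
    """Translate a numeric mode such as 0o754 into 'rwxr-xr--'."""
    parts = []
    for shift in (6, 3, 0):  # user, group, others
        digit = (mode >> shift) & 0b111
        parts.append(
            ("r" if digit & 4 else "-")
            + ("w" if digit & 2 else "-")
            + ("x" if digit & 1 else "-")
        )
    return "".join(parts)


print(to_symbolic(0o754))  # rwxr-xr--
print(to_symbolic(0o000))  # ---------

# Cross-check with the standard library (leading '-' = regular file)
print(stat.filemode(stat.S_IFREG | 0o754))  # -rwxr-xr--
```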

To revoke all permissions:

zsh> chmod 000 my_file
zsh> ls -l my_file
---------- my_file