Olivier Bénard

Git commit message convention

To help you writing commit messages, there is a couple of guidelines, conventions and best practices one might follow. It can be summarized in 9 bullet points:

Start with the ticket’s number the commit solves.
Separate subject from body with a blank line
Limit the subject line to 50 characters
Capitalize the subject line
Add issue reference in the subject
Do not end the subject line with a period
Use the imperative mood in the subject line
Wrap the body at 72 characters
Use the body to explain what and why, not how

Example

VERS-1936: Add user access to datapool table

Access rights for the datapool resource have
been edited. Access is required to work on the
new outfit recommender system. Request has been
approved by the PO.

Please, note the following:

Access right is normally granted at the dataset level.

Due to pii data, access was granted at the table level.

Resolves: VERS-1936
See also: VERS-0107

Additional notes

When you commit, the Vim editor will be called by default. You will be then asked to prompt your commit message. The big strength of Vim is that it is installed by default on all systems.

If you don’t want to call Vim but rather use the command line, you can use the inline -m flag to write your commit message:

    git commit -m "Your message at the imperative voice"

If you still want to use a text editor to fill-in the commit message, you can change the settings so it opens Visual Studio Code IDE when the git commit command is called:

    git config --global core.editor "code --wait"

This requires you to have Visual Studio Code installed though.

You are now fully equipped to write the next bestseller! 📚

Acknowledgement

https://cbea.ms/git-commit/

Why using snake case

The snake case is a style of writing in which each space is replaced by an underscode and letters writen in lowercase:

this_is_what_the_snake_case_style_looks_like

Since the snake_case format is mandatory for some objects, it is then easier to stick to it and generalised its usage throughout.

It is important that you use the snake case because your python code might simply do not work otherwise:

from helpers.math-module import incr

def test_incr() -> None:
    result = incr(42)
    print(result == 43)

if __name__ == "__main__":
    test_incr()

> python main.py
File "path/to/snake_case_project/main.py", line 1
from helpers.math-module import incr
                 ^
SyntaxError: invalid syntax

Instead, change for the following syntax:

snake_case_project/
    ├── helpers
        ├── __init__.py
        └── math_module.py
    └── main.py

from helpers.math_module import incr

def test_incr() -> None:
    result = incr(42)
    print(result == 43)

if __name__ == "__main__":
    test_incr()

> python main.py
True

Admit that for a language like Python, the snake_case is rather well adapted! 🐍

What is python init.py file for?

The Python __init__.py file serves two main functions:

It is used to label a directory as a python package to make it visible so other python files can re-use the nested resources (e.g. the incr method defined inside helpers/file1.py):
```
from helpers.file1 import incr

result = incr(42)
assert result == 43
```
A side effect is that – with some not-recommended workarounds – developers do not have to care about the method’s location in your package hierarchy:
```
    helpers/
    ├── __init__.py
    ├── file1.py
    ├── file2.py
    ├── ...
    └── fileN.py
```
For that, simply fill the __init__.py file with the following content:
```
from file1 import *
from file2 import *
...
from fileN import *
```
Therefore, even though it is always a good practice to explicitely mention the source, they can simply use:
```
from helpers import incr

result = incr(42)
assert result == 43
```
It is used to define variables or to initialise objects like logging at the package level and import time (to make them accesible at a global package level):
```
from helpers.file3 import MY_VAR

print(MY_VAR)
```

Still blur? Thereafter an easy example to understand:

First, let’s plot some context

You have the following project structure:

playground_packages
├── helpers/
    └── utils.py
└── main.py

The utils.py file contains:

def incr(n:list[float]) -> list[float]:
    return [x+1 for x in n]

if __name__ == "__main__":
    pass

Note: you could have also used the map and lambda methods instead. However, here is a nice example to show about list comprehension. The alternave version would have looked like:

list(map(lambda x: x+1, n))

The main.py file is looking like the following:

from helpers.utils import incr

def main() -> None:
    result = incr([1,2,3,4,5])
    print(result)

if __name__ == "__main__":
    main()

Notes:

Why we haven’t used import helpers.utils or import * is explained here (to do).
The if __name__ == "__main__" conditional statement is explained here (to do).

`init.py` to label a folder as Python package

Jumping back to our example, if you try to run the code with the current configuration, you will get the following error:

> python main.py
Traceback (most recent call last):
File "path/to/playground_package/main.py", line 1, in <module>
    from helpers.utils import incr
ModuleNotFoundError: No module named 'helpers'

This is because the helpers directory is not yet visible for Python. Python is actively looking for Python packages but cannot find any. A package is a folder that contains a __init__.py file.

Simply edit our current structure for the following:

playground_packages
├── helpers/
    ├── __init__.py
    └── utils.py
└── main.py

Now, it you try again, it will succeed:

> python main.py
[2, 3, 4, 5, 6]

The main take-away is:

If you want to split-up your code in different folders and files (to make your code more readable and debuggable), you must create a __init__.py file under each folder so they become visible for Python and can therefore be used and refered to in your code using import.

`init.py` to define global variables

In our previous example, the __init__.py file is empty. We can edit it, adding the following line:

MY_LIST = [2,4,6,8,10]

This variable is accessible even by the main function:

from helpers import MY_LIST
from helpers.utils import incr

def main() -> None:
    result = incr(MY_LIST)
    print(result)

if __name__ == "__main__":
    main()

> python main.py
[3, 5, 7, 9, 11]

Note: it is better to define variables in a config.py or constants.py file rather than in a __init__.py file. However, __init__.py becomes handy when it comes to instanciate objects such as logging or dynaconf. More on that will follow in another article.

You are now ready to fit your code together like Russian dolls 🪆

How to manage multiple terraform versions with tfenv

You can use tfenv use to quickly switch between the different terraform versions you have installed on your system:

> tfenv use 1.1.8
> terraform --version
Terraform v1.1.8

> tfenv use 0.13.5
> terraform --version
Terraform v0.13.5

If the version is not installed already, you can use tfenv install to install it e.g. to install terraform v1.1.8:

tfenv install 1.1.8

Finally, you can check all the existing versions you have installed via tfenv list, e.g. in my case:

>  tfenv list
* 1.1.8
0.13.5

Note: the wildcard character is preffixing the version currently used by default.

`tfenv` commands

The most used and useful commands are:

tfenv list
tfenv use <version>
tfenv install <version>

More can be displayed on the manual:

> tfenv
tfenv 3.0.0
Usage: tfenv <command> [<options>]

Commands:
   install       Install a specific version of Terraform
   use           Switch a version to use
   uninstall     Uninstall a specific version of Terraform
   list          List all installed versions
   list-remote   List all installable versions
   version-name  Print current version
   init          Update environment to use tfenv correctly.
   pin           Write the current active version to ./.terraform-version

How to install `tfenv`

Using brew (MacOS)

brew install tfenv

Via the Github repository

Git clone the tfenv repository under a new tfenv folder:

git clone https://github.com/tfutils/tfenv.git ~/.tfenv

Export the path to your profile:

For bash users:

echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bash_profile

For MacOS:

echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.zshrc

Turn tfenv/ into an executable binary. Thus, you can symlink it to your local bin directory:
```
ln -s ~/.tfenv/bin/* /usr/local/bin
```
Check that the tfenv/ binary folder is indeed synchronized with the local/bin directory:
```
> which tfenv
/usr/local/bin/tfenv
```
Check the installation:
```
> tfenv --version
tfenv 3.0.0
```

Which terraform versions are available

You can check the exisitng available terraform versions on the official hashicorp releases page: releases.hashicorp.com/terraform.

Note: similarly you can use the command line interface:

> tfenv list-remote
1.3.4
1.3.3
1.3.2
1.3.1
1.3.0
1.3.0-rc1
1.3.0-beta1
1.3.0-alpha20220817

When to use `tfenv` and why it is useful

tfenv allows you to quickly change the version of terraform running by default on your system.

This is handy when you have multiple terraform repositories across your organization and each one of them uses a different terraform version.

To be more specific, each terraform repository requires you to set the terraform version explicitely, as you can see line 2 of the following example:

terraform {
    required_version = "1.2.2"
    required_providers {
        local = {
            source = "hashicorp/local"
            version = "~> 2.0"
        }
    }
}

When using the usual methods:

terraform init
terraform plan
teraform apply

it will require you to have a local terraform version matching the one specified in the terraform configuration file.

Therefore, the 1:1 mapping between your locally installed versions and the versions specified in your configuration files is required.

Warning: running one of those methods with a higher local terraform version will introduces changes on your repository that cannot be reversed. This will not only forces you to migrate the configuration files so they fit the syntax of the new terraform version, but all developers will have to install the new terraform version on their local environment too.

Swapping has never been easier! 🔥

What is a Data Engineer

What is a Data Engineer in a nutshell

A Data Engineer is like a gas or oil pipeline operator. He must:

oversee the full Cloud Data Warehouse;
make data available 24/7 on the platform;
move data from A to B;
ingest, update, retire or transform data coming from upstream data sources;
turn data into by-products and monitor the overall flow.

On top of ingesting, transforming, serving and storing, those tasks mandates strong proficiency in Security, Data Management, Data|FinOps, Data Architecture, Orchestration & Software Engineering.

The main objective is to serve data to the analysts/BI/data science teams + back to the software teams (reverse ETL).

It enables the company to make data-driven decisions (an example of data driven decisions is given at the end).

One big aspect though: a Data Engineer brings the oil to the different teams but is not responsible for consuming it. He is not responsible for turning data into meaningful charts or actionable decisions. We mostly do not bother much about the business-logic behind. Our role is to save data into a place (being the unique source of truth) where it can be consumed by the teams who need it.

Note: A lot of organizations or applicants tend to have a very poor understanding of what is really meant behind the term Data Engineering. They rather see the position as a patchwork of different roles and responsibilities. The profession is indeed quite new – as you can see on Google Trend the interest only exploded in late 2019. It will still needs some time before people (me included) have a full grasp over it.

Instead, Data Engineer is a technical job that requires you to be proficient in writing code (mainly Python and Java). Therefore, you need to have strong Software Engineering skills. Developers (more than Data Scientists or Data Analysts) are in turn highly valued. That is why I prefer to call it Data Software Engineer to remove any ambiguity.

The different missions

Build, Orchestrate and Monitor ELT pipelines (using Airflow & Google Cloud Composer).
Manage data infrastructures and services in the Cloud (e.g. tables, views, datasets, projects, storage, access rights, network).
Ingest sources from external databases (using Python, Docker, Kubernetes & Change Data Capture tools)
Develop REST API clients (facebook, snapchat, jira, rocketchat, gitlab, external providers)
Publish open-source projects e.g. on github.com/e-breuninger such as Python libraries, Bots or Google Chrome Extensions (ok, this one is rather specific to my job)
Lead workshops and interviews (e.g. BigQuery Introduction, Code Standardization & Best Practices)

Keep in mind: Data Engineers are Data Paddlers, not Data Keepers.

If the source data is corrupted, correcting or improving data quality is out of scope. You can see us as an incorruptible blind-folded carrier, moving the baggage assigned to us from A to B, without looking into it throughout the journey. You don’t want us to start opening the bags and fold your different shirts as it should be.

See Should Data Engineers work closely with the business logic?

Tools used by a Data Engineer

At least, those I am currently working with on a daily base:

Airflow to manage your ETL pipelines.
Google Cloud Platform as Cloud Provider.
Terraform as Infrastructure as Code solution.
Python, Bash, Docker, Kubernetes to build feeds and snapshots readers.
SQL as Data Manipulation Language (DML) to query data.
Git as versioning tool.

The main challenges in the job

Based on my own experience, my biggest challenges at the moment are:

Improve existing pipelines reliability (pushing for more tests, ISO-standardization, data validation & monitoring)
Get rid of the technical debt (moving toward 100% automation, documentation and infrastructure as code coverage)
Keep up with the upcoming technologies (learning new skills, going more in-deep vertically and horizontally)
Enforce the software development best practices and standardization principles (via multitude hours of workshops and conferences)
Strengthen the international part, actively connecting the teams (via department tours and promoting the use of english as primarily source of communication)

I believe them to be representative in this industry.

Get started as Data Engineer

This will need an article on its own. However, you can get started with the immediate following take-aways:

Technical books

Fundamentals of Data Engineering, Reis & Housley, O’REILLY, 2022
Google BigQuery: The Definitive Guide, Lakshmanan & Tigani, O’REILLY, 2019
Terraform: Up & Running, 3rd Edition, Yevgeniy Brikman, O’REILLY, 2022
Learning SQL, 3rd Edition, Alan Beaulieu, O’REILLY, 2020
Docker: Up & Running, 3rd Edition, Kane & Matthias, O’REILLY, 2023
The Kubernetes Book, Nigel Poulton, Edition 2022

Online courses

The Git & Github Bootcamp, Colt Steele on Udemy
Apache Airflow: The Hands-On Guide, Marc Lamberti on Udemy
Terraform Tutorials, HashiCorp Learn

An example of Data Driven Decisions

Imagine you are the CEO of a bicycle rental company. You have multiple stations across Manhattan. You have the following 3 want-to-know questions:

You want to know which stations perform the best and are in high demand so you can anticipate any disruptions ahead, having more technicians standing-by in the area, increase the fleet and anticipate future expansion.
On the other hand, you want to retire poorly performing stations, adjusting your implantation to make it fit the market needs more accurately.
You want to monitoring the overall usage so you know what are the off-peaks and rush hours, average journey length or most appreciated commute options. You can then adapt the offer accordingly, e.g. offering discounts at specific times of the day/week/month to boost customer acquisition or match better your customers’ needs.

It is part of the Data Engineering journey to consume data coming from the different sources (e.g. bikes stations, bicycles, Open-Weather API, Google Map API etc.) so the marketing and business intelligence teams can solely focus on answering your questions without getting their hands dirty, deep-diving into the data ingestion part.

To conclude, as it is often the case in history, recent jobs have many similarities with sectors that have existed long before them. For instance, the data engineering field and the energy sector share closed similarities (you have to move an expandable from A to B and distribute it to consumers). They simply inherit the lingo. Ideas remain the same but are now applied to different “objects”. Data has replaced oil but the paradigm keeps working.

However, good luck getting your car to run with it! 🏎️💨

Use Poetry as Python Package Manager

Installing poetry is super easy. On macOS, simply run:

brew install poetry

Now, let’s have a look how to use it.

Poetry Cheat Sheet

I have gathered for you in this section the poetry commands you will always need. You can refer to this section later on and simply save the link for later use.

poetry init
poetry install
poetry update
poetry add <your-python-package>
poetry run

Getting started with an example

Create your python project and move under its repository:
```
mkdir playground-poetry && cd playground-poetry
```

Init poetry. You will be prompted to fill-in the following configuration fields:

> poetry init
Package name [playground-poetry]:
Version [0.1.0]:
Description []:
Author [None, n to skip]:
License []:
Compatible Python versions [^3.10]:
Would you like to define your main dependencies interactively? (yes/no) [yes]
Would you like to define your development dependencies interactively? (yes/no) [yes]
Do you confirm generation? (yes/no) [yes]

This will generate the pyproject.toml configuration file:

[tool.poetry]
name = "playground-poetry"
version = "0.1.0"
description = "A primer on poetry."
authors = ["John Doe <john.doe@gmail.com>"]

[tool.poetry.dependencies]
python = "^3.10"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

After generation, your project’s architecture should look like the following:

playground-poetry
└── pyproject.toml

Add a package

You can simply use the following command e.g.:

poetry add black

In our case, we wanted to install the python black code formatter.

Note: for more general codebase formatting, I recommend super-linter.

You will see that our pyproject.toml poetry configuration file has been updated as it now contains reference to the black package:

[tool.poetry]
name = "playground-poetry"
version = "0.1.0"
description = "A primer on poetry."
authors = ["John Doe <john.doe@gmail.com>"]

[tool.poetry.dependencies]
python = "^3.10"
black = "^22.10.0"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

Note: you are wondering what the weird ^ sign stands for? You well soon find an article about it.

You can check that black is indeed accessible and installed within poetry virtual environment via:

> poetry run black --version
black, 22.10.0 (compiled: yes)
Python (CPython) 3.10.4

The pyproject.toml is not the only thing that has changed. If you have a look on our project’s architecture, you will see that it now contains an additional poetry.lock file:

playground-poetry
├── poetry.lock
└── pyproject.toml

Note: poetry is storing the state the same way terraform is doing. If you are new to poetry this might be too much details right now. If you want to know more about poetry.lock an article will follow soon!

Run python code on a poetry environment

Imagine that you have now a python code that requires a couple of dependencies (e.g. could be black, pandas, logging, etc.) to run. E.g.:

"""
A simple module containing maths methods.
"""


def add(number_1: float, number_2: float) -> float:
    """
    Add the numbers.
    Args:
        number_1 (float): the first number.
        number_2 (float): the second number.
    Returns:
        float: the sum of both numbers.

    >>> add(-2, 1)
    -1
    >>> add(42, 0)
    42
    """
    return number_1 + number_2


def main() -> None:
    """
    Main function.
    """

    res = add(4, 7)
    print(res)


if __name__ == "__main__":
    main()

with the following project’s architecture:

playground-poetry
├── playground_poetry
    ├── __init__.py
    └── main.py
├── poetry.lock
└── pyproject.toml

Note: fair enough, in our project we do not really need those dependencies at this point, but let’s say that this is just an extract and other parts in the code do actually use logging or pandas.

You have those dependencies installed on your poetry environment (you can see them on the pyproject.toml dependencies section).

You then need to execute your python code within the umbrella of this poetry virtual environment.

This is done using poetry run python <your-python-file>.

In our example:

> poetry run python playground_poetry/main.py
11

Note: we recommend you to have a similar architecture on your projects as it makes the development of python’s package easier, using the snake_case.

your-project-name
└── your_project_name
    ├── __init__.py
    └── main.py

Get started on a cloned poetry project

Now let’s say you already inherit from an existing poetry project with an already existing pyproject.toml and poetry.lock files.

The first time you need to instantiate the virtual environment, reading from the pyproject.toml file:

poetry install

This will create the poetry.lock file if not existing or resolves the dependencies if so.

You can also update the poetry.lock file if needed:

poetry update

Note: more info https://python-poetry.org/docs/cli/.

Run the extra mile using poetry run and a Makefile

Let’s improve our example project.

Have you noticed that to run our main.py file, you need to explicitly state the whole path:

poetry run python playground_poetry/main.py

You can make things better, editing the pyproject.toml file for the following:

[tool.poetry]
name = "playground-poetry"
version = "0.1.0"
description = "A primer on poetry."
authors = ["John Doe <john.doe@gmail.com>"]
packages = [{include="playground_poetry"}]

[tool.poetry.dependencies]
python = "^3.10"
black = "^22.10.0"
pandas = "^1.5.1"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

[tool.poetry.scripts]
main = "playground_poetry.main:main"

and from now on simply do the same thing as before but shorter (and faster) via:

> poetry run main
11

Note: this is thanks to the packages line, the __init__.py file nested under it that makes the main.py file visible and the [tool.poetry.scripts] layer.

But that’s not all: we can do even better. Let’s make this command even shorter, saving it under a Makefile command.

In our example project, let’s add a Makefile with the following lines:

main:
    poetry run main

The final structure should look like the following:

playground-poetry
├── playground_poetry
    ├── __init__.py
    └── main.py
├── Makefile
├── poetry.lock
└── pyproject.toml

Finally, you can run the main function using:

> make main
poetry run main
11

And you thought our main job were to “write” code? The less the more! 😇

What have you learned

You can create a poetry environment from scratch to manage your python dependencies.
You can reuse an existing one.
You can scale and automate using Makefile commands.
You got a short primer on Software Development Standardization and Best Principles.