How to create Git Aliases

There are two ways to create custom git aliases:
1. Using the Command Line Interface.
2. Directly editing the git config files.

We will have a look at both methods.

Via the Command Line Interface (CLI)

For instance, git status can be shortened to git s:

git config --global alias.s status

Note: in this example, we are configuring a git alias so that git s becomes equivalent to git status; both commands will return the same output.
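For instance, on a clean repository (the exact output will of course depend on the state of yours), both commands should print something like:

> git s
On branch main
nothing to commit, working tree clean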

Editing one of the three git config files

  • .git/config (at your git local repository level)
  • ~/.gitconfig (global)
  • ~/.config/git/config (global)

Just add the following lines within the file, e.g. using vim or nano:

[alias]
    s = status

Notes:

  • If you only edit the git config file at your local repository level, the alias will only be available within that git project.
  • If you set up the alias in one of the global git config files, it will be usable across your whole local environment (see the example below).
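Equivalently, when using the CLI from the first method, the scope is controlled by the --global flag:

git config alias.s status            # local: written to .git/config of the current repository
git config --global alias.s status   # global: written to ~/.gitconfig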

List all the aliases you have defined

alias = "!git config -l | grep alias | cut -c 7-"

Note: the exclamation mark tells git that everything within the quotes is a shell command, which gives us access to the usual commands like grep or cut.
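Once defined, calling it lists your aliases. As a hedged illustration (the output depends on the aliases you have actually configured):

> git alias
s=status
alias=!git config -l | grep alias | cut -c 7-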

Example of git aliases

Here are a couple of aliases you might find useful to replicate in your configuration:

[alias]
    c = commit
    cm = commit -m
    s = status
    sw = switch
    a = add
    alias = "!git config -l | grep alias | cut -c 7-"
    cam = commit -am
    lo = log --oneline
    sc = switch -c
    rsm = rm -r --cached
    asm = submodule add
    reinit = "!git submodule update --init --recursive && git submodule update --remote"

Why use git aliases

It makes your life easier since you do not have to type long commands anymore; you can simply call them by a shorter name.

After you have typed the same command again and again 10+ times a day, you will start to love git aliases.

Did I hear someone say that software developers are lazy? 😈

How to rename a git branch

To make it short, here are the main take-aways:

git branch -m <new-name>
git branch -m <old-name> <new-name>
git push origin -u <new-name>
git push origin -d <old-name>

Use case: local changes or project not yet existing on remote

To rename the current git branch:

git branch -m <new-name>

To rename any other git branches you’re not currently pointing at:

git branch -m <old-name> <new-name>

Tips:

  • The -m option is short for --move. That way it’s easier to remember.
  • Other custom-made shortcuts can be defined, e.g. typing git b for git branch (see the sketch below). Check How to create Git Aliases.
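A hedged sketch (the alias name b is only a suggestion, not a built-in):

git config --global alias.b branch
git b -m <old-name> <new-name>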

Use case: git project already deployed

If your git project is already deployed on a remote environment – e.g. GitLab or GitHub – the aforementioned changes won't be reflected there. To rename the branch not only locally but also on the remote, you need to:

First, push the local branch and reset the upstream branch:

git push origin -u <new-name>

Then, delete the remote branch:

git push origin --delete <old-name>

Note: this assumes your local branch already carries the new name (see the git branch -m commands above); the two push commands then take care of the remote side.

Use case: changing the capitalization

If you are only changing the capitalization – e.g. dev/vers-5395 becoming dev/VERS-5395 – you need to use -M instead of -m, as Git will otherwise tell you that the branch already exists. This matters for users on Windows or any other case-insensitive file system.
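Concretely, reusing the branch names from the example above:

git branch -M dev/vers-5395 dev/VERS-5395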

Note: if you are wondering what might be a good branch naming convention, you can check the git branch naming conventions article.

Example

GitHub, then GitLab, recently changed their default branch naming convention, drifting from master to main. If you are used to the traditional way, you can restore the branch name, turning things more to your liking:

> git clone git@gitlab.com:username/your-project-in-kebab-case.git
> git branch
* main
> git branch -m master
> git push origin -u master
> git push origin --delete main

You can then check your changes:

> git branch
* master

Run the extra mile

If, like me, you're still not 100% sure when it comes to git stash, rebase, forking and travelling back through the history, keep practising.

But remember: it’s a marathon, not a sprint 🐢

Interpunct Keyboard Shortcuts

The interpunct · can be typed using the following keystrokes:

Operating System     Keystroke Combination      Keyboard Country ISO Code
Microsoft Windows    Alt + 250 or Alt + 0183    n/a
Apple macOS          Option + Shift + 9         generic
Apple macOS          Option + Shift + .         NOR/SWE
Apple macOS          Option + .                 DNK
Apple macOS          Option + Shift + H         CAN
Apple macOS          Option + Shift + F         FRA

I particularly like to use it to separate items of the same nature, e.g. in emails:

Dear colleagues,
please find attached the recap of the presentation here.

Discussed topics: billing, costs, replication, active directory sync.
Issued action plan: VERS-0909 · VERS-0601 · VERS-1606

Kind regards,

Olivier Bénard
Data Software Engineer for Data Platform Services
O. Benard GmbH & Co.
github.com/olivierbenard

So, convinced? 😅

SFTP basic CLI commands survival guide

SFTP cheat sheet

> sftp <username>@<hostname>
sftp> lcd <change-the-local-working-directory-by-this-one>
sftp> cd <change-the-remote-working-directory-by-this-one>
sftp> get <copy-this-remote-file-into-the-local-working-directory>
sftp> put <copy-this-local-file-into-the-remote-working-directory>
sftp> exit

Note: for the put and get commands, you can add the -r flag should you need to download/upload whole directories (r standing for recursive).
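For instance (the directory names are placeholders):

sftp> get -r <remote-directory>
sftp> put -r <local-directory>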

Real Business Case Example – Missing data needs to be reloaded on the SFTP server

You have csv files uploaded to an SFTP server at a daily rate. Those csv files are sent by an external provider and contain the results of a marketing campaign.

Once a file is uploaded to the SFTP server, it is picked up by a function listening to upload events (e.g. a GCP Cloud Function or an AWS Lambda function).

When the function is triggered (after an upload event), it looks for the uploaded file matching the following pattern: feedback_YYYY_MM_DD.csv.

Then the function parses the matching file and aggregates the content into a table on your data warehouse.

You get a complaint from your stakeholders that data is missing for 2022-11-13:

SELECT DATE(_PARTITIONTIME) AS pt, COUNT(*) AS nb_rows
FROM `datapool.marketing_campaign.feedbacks_report`
WHERE DATE(_PARTITIONTIME) >= "2022-11-11"
GROUP BY pt
ORDER BY pt ASC

which returns the following:

pt          nb_rows
2022-11-11  247
2022-11-12  308

Connecting to the sftp server (or directly looking into the bucket it reflects – or, even better, looking into the logs), you notice that there is a typo in the filename for that missing day. So even though the upload event was detected, the file has not been parsed and ingested by the function (it failed to match the defined pattern).

You need to download this file to your local machine, fix the name and upload it back to the SFTP server so the ingestion is triggered again by the function:

> sftp obenard@olivierbenard.fr
obenard@olivierbenard.fr's password:
sftp> cd inputs_folder/
sftp> ls -l
feedback_2022_11_11.csv  feedback_2022_11_12.csv  fedback_2022_11_13.csv
sftp> lcd ~/Downloads/
sftp> get fedback_2022_11_13.csv

-- from your local machine, go to ~/Downloads/ and rename "fedback" to "feedback".

sftp> put feedback_2022_11_13.csv
sftp> ls -l
feedback_2022_11_11.csv  feedback_2022_11_12.csv  fedback_2022_11_13.csv  feedback_2022_11_13.csv
sftp> exit

Note: put feedback_2022_11_13.csv and not put ~/Downloads/feedback_2022_11_13.csv because you are already in the ~/Downloads/ folder locally, thanks to the lcd command you executed before.

Re-running the earlier query now returns:

pt          nb_rows
2022-11-11  247
2022-11-12  308
2022-11-13  296

The data is now there! ✅

Note: renaming the file directly from the sftp environment, e.g. using rename <old-name> <new-name>, would not work here because the function needs an upload event to be triggered.

SFTP Basic File Transfer Commands in detail

  • sftp command connects you to the sftp server:

    > sftp <username>@<hostname>
    
  • exit command closes the connection:

    sftp> exit
    
  • lcd changes your Local Current working Directory on the local machine (remote file will be downloaded there):

    sftp> lcd <local-directory>
    
  • cd changes your Current working Directory on the remote sftp machine:

    sftp> cd <your-directory>
    
  • put command copies the file from the local machine to the remote sftp machine:

    sftp> put <local-file>
    
  • get command copies the file from the remote machine to the local machine:

    sftp> get <remote-file>
    

What is an SFTP server for

An SFTP server lets users connect with an SFTP client – like the FileZilla software – without accessing the underlying bucket directly. The SFTP server acts as the primary interface and relays the requests to the bucket it is associated with. That way, any user of the client can query the bucket and perform operations such as retrieving, listing or consulting the files stored in it.

The advantages: the SFTP client provides a more user-friendly interface, and users never connect to the bucket directly but only through this additional layer.

When not to use an SFTP server

On Google Cloud Platform, spinning up a new SFTP server is usually not the way to go: if set up correctly, you do not need SFTP at all. You can grant the user's Google service account permission to upload to the bucket (this requires a Google account though). The user can then use the gcloud command line tool to upload/download files directly from the bucket.
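A hedged sketch of what that could look like (the bucket and file names are hypothetical):

> gcloud storage cp ./feedback_2022_11_13.csv gs://your-marketing-bucket/inputs_folder/
> gcloud storage ls gs://your-marketing-bucket/inputs_folder/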

Should Data Engineers work closely with the business logic?

No, Data Engineers should not have to worry about the business logic behind the data they ingest. The actual business logic should only happen after the data is published in the datapool and should rather be applied by the stakeholders (data scientists, data analysts and analytics engineers) themselves. Let me explain.

Data Engineers are technical experts

The two main reasons why Data Engineers should not deep-dive into the business logic are easy to understand:

  • The stakeholders are the only ones to truly know what they need from the data and how to codify their needs in a query;
  • An increasingly complex technical world requires the formation of expert groups. As processes become more and more complex, it is no longer possible for a single handyman to carry all the tools and master every trade.

Instead, Data Engineers are better equipped to handle the technical logic:

  • Identify and ensure uniqueness of the data (key-based de-duplication; see the sketch after this list)
  • Type-casting (making sure that the values of an integer column are not of type string)
  • Schema validation
  • Normalisation (e.g. unnesting repeated fields)
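A minimal, BigQuery-flavoured sketch of such a key-based de-duplication, assuming a hypothetical raw table with a feedback_id business key and an ingested_at timestamp:

SELECT *
FROM `datapool.raw.feedbacks`
WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING alongside QUALIFY
-- keep only the latest ingested row per business key
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY feedback_id
    ORDER BY ingested_at DESC
) = 1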

More: What is a Data Engineer.

Technical logic vs. Business Logic

Should you ask, below are examples of technical and business logic requests one might encounter in the shoes of a Data Engineer.

Technical Logic

  • Ingest new source data (e.g. external databases, sftp servers, feeds and snapshots) into the Data Warehouse
  • Turn the raw ingested data into structured data (e.g. from json to SQL tables)
  • Define identity and access roles or create the Cloud infrastructure (projects, datasets, views and tables)
  • Define backups and security policies.

Business logic

  • Generate graphs and reports for upper management (role of data analysts/business intelligence)
  • Conduct Machine Learning projects (role of data scientists)
  • Build a data catalogue and map the available data with the business object it represents e.g. shop orders or app users tables (role of data modellers)

Note: should a company fail to have a clear delimitation between the two logics, it is a clear marker – at least for me – that its processes are not mature enough.

One more thing

It is always a headache when stakeholders reach out with their special requests, asking you to help them join multiple tables together, or to figure out why the business object they end up with does not match their business needs. E.g.:

I would like to have the website's traffic for June associated with the average customer expenses, at a 30-minute granularity, to see how it evolves the further we go into the month.

datetime                 visitors  average_expenses
2022-06-01 00:00:00 UTC  13        75
2022-06-01 00:30:00 UTC  8         83
2022-06-01 01:00:00 UTC  17        90
2022-06-01 01:30:00 UTC  4         42

We simply don't know. Our role is to bring the data there, not to figure out its meaning, use it and come up with meaningful data-driven decisions.

Remember, entropy always wins 💥

Git commit message convention

To help you write commit messages, there are a couple of guidelines, conventions and best practices one might follow. They can be summarized in 9 bullet points:

  • Start with the ticket's number the commit solves
  • Separate subject from body with a blank line
  • Limit the subject line to 50 characters
  • Capitalize the subject line
  • Add issue reference in the subject
  • Do not end the subject line with a period
  • Use the imperative mood in the subject line
  • Wrap the body at 72 characters
  • Use the body to explain what and why, not how

Example

VERS-1936: Add user access to datapool table

Access rights for the datapool resource have
been edited. Access is required to work on the
new outfit recommender system. Request has been
approved by the PO.

Please note the following:

  • Access right is normally granted at the dataset level.
  • Due to PII data, access was granted at the table level.

Resolves: VERS-1936
See also: VERS-0107

Additional notes

When you commit, the Vim editor opens by default and you are prompted for your commit message. The big strength of Vim is that it is installed by default on most systems.

If you don't want Vim to open but rather stay on the command line, you can use the -m flag to write your commit message inline:

    git commit -m "Your message at the imperative voice"

If you still want to use a text editor to fill in the commit message, you can change the settings so that the Visual Studio Code IDE opens when the git commit command is called:

    git config --global core.editor "code --wait"

This requires you to have Visual Studio Code installed though.

You are now fully equipped to write the next bestseller! 📚

Why use snake case

Snake case is a style of writing in which each space is replaced by an underscore and letters are written in lowercase:

this_is_what_the_snake_case_style_looks_like

Since the snake_case format is mandatory for some objects (e.g. Python module names you want to import), it is easier to stick to it and generalise its usage throughout.

It is important that you use snake case because your Python code might simply not work otherwise. E.g., with a module named helpers/math-module.py:

from helpers.math-module import incr

def test_incr() -> None:
    result = incr(42)
    print(result == 43)

if __name__ == "__main__":
    test_incr()
> python main.py
File "path/to/snake_case_project/main.py", line 1
from helpers.math-module import incr
                 ^
SyntaxError: invalid syntax

Instead, rename the module to math_module.py and adjust the import accordingly:

snake_case_project/
├── helpers/
│   ├── __init__.py
│   └── math_module.py
└── main.py

with main.py now reading:

from helpers.math_module import incr

def test_incr() -> None:
    result = incr(42)
    print(result == 43)

if __name__ == "__main__":
    test_incr()
> python main.py
True

Admit it: for a language like Python, snake_case is rather well suited! 🐍

What is the Python __init__.py file for?

The Python __init__.py file serves two main functions:

  1. It is used to label a directory as a Python package, making it visible so that other Python files can re-use the nested resources (e.g. the incr method defined inside helpers/file1.py):

    from helpers.file1 import incr
    
    result = incr(42)
    assert result == 43
    

    A side effect is that – with some not-recommended workarounds – developers do not have to care about the method’s location in your package hierarchy:

        helpers/
        ├── __init__.py
        ├── file1.py
        ├── file2.py
        ├── ...
        └── fileN.py
    

    For that, simply fill the __init__.py file with the following content:

    from .file1 import *
    from .file2 import *
    ...
    from .fileN import *
    

    Therefore, even though it is always a good practice to explicitly mention the source, they can simply use:

    from helpers import incr
    
    result = incr(42)
    assert result == 43
    
  2. It is used to define variables or to initialise objects (like logging) at the package level, at import time, to make them accessible globally across the package:

    from helpers import MY_VAR  # MY_VAR being defined in helpers/__init__.py

    print(MY_VAR)
    

Still blurry? Here is an easy example to make it clear:

First, let's set some context

You have the following project structure:

playground_packages/
├── helpers/
│   └── utils.py
└── main.py

The utils.py file contains:

def incr(n: list[float]) -> list[float]:
    return [x+1 for x in n]

if __name__ == "__main__":
    pass

Note: you could also have used the map and lambda functions instead. However, this is a nice opportunity to showcase list comprehensions. The alternative version would have looked like:

list(map(lambda x: x+1, n))

The main.py file looks like the following:

from helpers.utils import incr

def main() -> None:
    result = incr([1,2,3,4,5])
    print(result)

if __name__ == "__main__":
    main()

Notes:

  • Why we haven’t used import helpers.utils or import * is explained here (to do).
  • The if __name__ == "__main__" conditional statement is explained here (to do).

__init__.py to label a folder as a Python package

Jumping back to our example, if you try to run the code with the current configuration, you will get the following error:

> python main.py
Traceback (most recent call last):
File "path/to/playground_package/main.py", line 1, in <module>
    from helpers.utils import incr
ModuleNotFoundError: No module named 'helpers'

This is because the helpers directory is not yet visible to Python. Python is actively looking for Python packages but cannot find any: a package is a folder that contains an __init__.py file.

Simply edit the current structure into the following:

playground_packages/
├── helpers/
│   ├── __init__.py
│   └── utils.py
└── main.py

Now, if you try again, it will succeed:

> python main.py
[2, 3, 4, 5, 6]

The main take-away is:

If you want to split up your code into different folders and files (to make it more readable and debuggable), you must create an __init__.py file under each folder so that they become visible to Python and can therefore be used and referred to in your code using import.

__init__.py to define global variables

In our previous example, the __init__.py file is empty. We can edit it, adding the following line:

MY_LIST = [2,4,6,8,10]

This variable is then accessible even from main.py:

from helpers import MY_LIST
from helpers.utils import incr

def main() -> None:
    result = incr(MY_LIST)
    print(result)

if __name__ == "__main__":
    main()
> python main.py
[3, 5, 7, 9, 11]

Note: it is better to define variables in a config.py or constants.py file rather than in an __init__.py file. However, __init__.py comes in handy when it comes to instantiating objects such as logging or dynaconf. More on that will follow in another article.
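As a minimal sketch of that idea, reusing the hypothetical helpers package from above together with Python's standard logging module, helpers/__init__.py could look like:

# helpers/__init__.py
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("helpers")  # package-wide logger, configured once at import time

MY_LIST = [2, 4, 6, 8, 10]

and main.py could then reuse both objects:

# main.py
from helpers import MY_LIST, logger
from helpers.utils import incr

logger.info("incremented list: %s", incr(MY_LIST))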

You are now ready to fit your code together like Russian dolls 🪆

How to manage multiple terraform versions with tfenv

You can use tfenv use to quickly switch between the different terraform versions you have installed on your system:

> tfenv use 1.1.8
> terraform --version
Terraform v1.1.8

> tfenv use 0.13.5
> terraform --version
Terraform v0.13.5

If the version is not installed yet, you can use tfenv install to install it, e.g. for terraform v1.1.8:

tfenv install 1.1.8

Finally, you can check all the existing versions you have installed via tfenv list, e.g. in my case:

>  tfenv list
* 1.1.8
0.13.5

Note: the asterisk (*) prefixes the version currently used by default.

tfenv commands

The most used and useful commands are:

  • tfenv list
  • tfenv use <version>
  • tfenv install <version>

More can be found in the manual:

> tfenv
tfenv 3.0.0
Usage: tfenv <command> [<options>]

Commands:
   install       Install a specific version of Terraform
   use           Switch a version to use
   uninstall     Uninstall a specific version of Terraform
   list          List all installed versions
   list-remote   List all installable versions
   version-name  Print current version
   init          Update environment to use tfenv correctly.
   pin           Write the current active version to ./.terraform-version

How to install tfenv

Using brew (macOS)

brew install tfenv

Via the Github repository

  1. Git clone the tfenv repository under a new tfenv folder:

    git clone https://github.com/tfutils/tfenv.git ~/.tfenv
    
  2. Export the path to your profile:

    For bash users:

    echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.bash_profile
    

    For zsh users (the default shell on macOS):

    echo 'export PATH="$HOME/.tfenv/bin:$PATH"' >> ~/.zshrc
    
  3. Symlink the tfenv executables into your local bin directory so that they are available on your PATH:

    ln -s ~/.tfenv/bin/* /usr/local/bin
    
  4. Check that the tfenv binary is indeed resolved from the local bin directory:

    > which tfenv
    /usr/local/bin/tfenv
    
  5. Check the installation:

    > tfenv --version
    tfenv 3.0.0
    

Which terraform versions are available

You can check the existing available terraform versions on the official HashiCorp releases page: releases.hashicorp.com/terraform.

Note: similarly you can use the command line interface:

> tfenv list-remote
1.3.4
1.3.3
1.3.2
1.3.1
1.3.0
1.3.0-rc1
1.3.0-beta1
1.3.0-alpha20220817

When to use tfenv and why it is useful

tfenv allows you to quickly change the version of terraform running by default on your system.

This is handy when you have multiple terraform repositories across your organization and each one of them uses a different terraform version.

To be more specific, each terraform repository requires you to set the terraform version explicitly, as you can see on line 2 of the following example:

terraform {
    required_version = "1.2.2"
    required_providers {
        local = {
            source = "hashicorp/local"
            version = "~> 2.0"
        }
    }
}

When running the usual commands:

  • terraform init
  • terraform plan
  • terraform apply

terraform will require your locally installed version to match the one specified in the configuration file.

Therefore, a 1:1 mapping between your locally installed versions and the versions specified in your configuration files is required.
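A convenient way to maintain that mapping, assuming tfenv manages your terraform binary, is to pin the version at the root of each repository (this is what the tfenv pin command from the manual above writes for you); tfenv should then resolve the pinned version automatically:

> cd path/to/your-terraform-repository
> echo "1.2.2" > .terraform-version
> tfenv install
> terraform --version
Terraform v1.2.2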

Warning: running one of those commands with a higher local terraform version will introduce changes to your repository that cannot be reversed. This not only forces you to migrate the configuration files to the syntax of the new terraform version, but all developers will also have to install the new terraform version in their local environment.

Swapping has never been easier! 🔥

What is a Data Engineer

What is a Data Engineer in a nutshell

A Data Engineer is like a gas or oil pipeline operator. He must:

  • oversee the full Cloud Data Warehouse;
  • make data available 24/7 on the platform;
  • move data from A to B;
  • ingest, update, retire or transform data coming from upstream data sources;
  • turn data into by-products and monitor the overall flow.

On top of ingesting, transforming, serving and storing data, those tasks mandate strong proficiency in Security, Data Management, Data/FinOps, Data Architecture, Orchestration & Software Engineering.

The main objective is to serve data to the analysts/BI/data science teams, as well as back to the software teams (reverse ETL).

It enables the company to make data-driven decisions (an example of data driven decisions is given at the end).

One big aspect though: a Data Engineer brings the oil to the different teams but is not responsible for consuming it. He is not responsible for turning data into meaningful charts or actionable decisions. We mostly do not bother much about the business logic behind it. Our role is to store data in one place (the single source of truth) where it can be consumed by the teams who need it.

Note: a lot of organizations and applicants tend to have a very poor understanding of what is really meant by the term Data Engineering. They rather see the position as a patchwork of different roles and responsibilities. The profession is indeed quite new – as you can see on Google Trends, interest only exploded in late 2019. It will still need some time before people (me included) have a full grasp of it.

Data Engineering is instead a technical job that requires you to be proficient in writing code (mainly Python and Java). Therefore, you need strong Software Engineering skills. Developer profiles (more than Data Scientist or Data Analyst ones) are in turn highly valued. That is why I prefer the title Data Software Engineer, to remove any ambiguity.

The different missions

  • Build, Orchestrate and Monitor ELT pipelines (using Airflow & Google Cloud Composer).
  • Manage data infrastructures and services in the Cloud (e.g. tables, views, datasets, projects, storage, access rights, network).
  • Ingest sources from external databases (using Python, Docker, Kubernetes & Change Data Capture tools)
  • Develop REST API clients (facebook, snapchat, jira, rocketchat, gitlab, external providers)
  • Publish open-source projects e.g. on github.com/e-breuninger such as Python libraries, Bots or Google Chrome Extensions (ok, this one is rather specific to my job)
  • Lead workshops and interviews (e.g. BigQuery Introduction, Code Standardization & Best Practices)

Keep in mind: Data Engineers are Data Paddlers, not Data Keepers.

If the source data is corrupted, correcting or improving data quality is out of scope. You can see us as incorruptible, blindfolded carriers, moving the baggage assigned to us from A to B without looking into it throughout the journey. You don't want us to start opening the bags and folding your shirts the way we think they should be.

See Should Data Engineers work closely with the business logic?

Tools used by a Data Engineer

At least, the ones I am currently working with on a daily basis:

  • Airflow to manage your ETL pipelines.
  • Google Cloud Platform as Cloud Provider.
  • Terraform as Infrastructure as Code solution.
  • Python, Bash, Docker, Kubernetes to build feeds and snapshots readers.
  • SQL as Data Manipulation Language (DML) to query data.
  • Git as versioning tool.

The main challenges in the job

Based on my own experience, my biggest challenges at the moment are:

  • Improve the reliability of existing pipelines (pushing for more tests, ISO-standardization, data validation & monitoring)
  • Get rid of the technical debt (moving towards 100% automation, documentation and infrastructure-as-code coverage)
  • Keep up with upcoming technologies (learning new skills, going more in-depth vertically and horizontally)
  • Enforce software development best practices and standardization principles (via many hours of workshops and conferences)
  • Strengthen the international side, actively connecting the teams (via department tours and by promoting the use of English as the primary language of communication)

I believe them to be representative of the industry.

Get started as Data Engineer

This would need an article of its own. However, you can get started with the following immediate take-aways:

Technical books

  • Fundamentals of Data Engineering, Reis & Housley, O’REILLY, 2022
  • Google BigQuery: The Definitive Guide, Lakshmanan & Tigani, O’REILLY, 2019
  • Terraform: Up & Running, 3rd Edition, Yevgeniy Brikman, O’REILLY, 2022
  • Learning SQL, 3rd Edition, Alan Beaulieu, O’REILLY, 2020
  • Docker: Up & Running, 3rd Edition, Kane & Matthias, O’REILLY, 2023
  • The Kubernetes Book, Nigel Poulton, Edition 2022

Online courses

  • The Git & Github Bootcamp, Colt Steele on Udemy
  • Apache Airflow: The Hands-On Guide, Marc Lamberti on Udemy
  • Terraform Tutorials, HashiCorp Learn

An example of Data Driven Decisions

Imagine you are the CEO of a bicycle rental company. You have multiple stations across Manhattan. You have the following three questions you want answered:

  • You want to know which stations perform the best and are in high demand, so you can anticipate disruptions ahead, have more technicians standing by in the area, increase the fleet and plan future expansion.
  • On the other hand, you want to retire poorly performing stations, adjusting your footprint to fit the market needs more accurately.
  • You want to monitor the overall usage so you know the off-peak and rush hours, the average journey length or the most appreciated commute options. You can then adapt the offer accordingly, e.g. offering discounts at specific times of the day/week/month to boost customer acquisition or better match your customers' needs.

It is part of the Data Engineering journey to consume data coming from the different sources (e.g. bike stations, bicycles, the OpenWeather API, the Google Maps API, etc.) so the marketing and business intelligence teams can focus solely on answering your questions without getting their hands dirty deep-diving into the data ingestion part.

To conclude, as is often the case in history, recent jobs share many similarities with sectors that have existed long before them. For instance, the data engineering field and the energy sector are closely related (you have to move an expendable resource from A to B and distribute it to consumers). They simply inherit the lingo. The ideas remain the same but are now applied to different “objects”. Data has replaced oil but the paradigm keeps working.

However, good luck getting your car to run with it! 🏎️💨