Should Data Engineers work closely with the business logic?

No. Data Engineers should not have to worry about the business logic behind the data they ingest. Business logic should only come into play after the data has been published in the data pool, and should be handled by the stakeholders themselves (data scientists, data analysts and analytics engineers). Let me explain.

Data Engineers are technical experts

The two main reasons why Data Engineers should not deep-dive into the business logic are easy to understand:

  • The stakeholders are the only ones to truly know what they need from the data and how to codify their needs in a query;
  • An increasingly complex technical world requires the formation of expert groups. As processes become increasingly complex, it is no longer possible for a single handyman to carry all the tools and master every trade.

Instead, Data Engineers are better equipped to handle the technical logic (a minimal sketch follows the list):

  • Key-based de-duplication (identifying and ensuring uniqueness of the data)
  • Type-casting (making sure that the values of an integer column are not of type string)
  • Schema validation
  • Normalisation (e.g. unnesting repeated fields)
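
Here is a minimal sketch of these steps with pandas, on a made-up raw feed; the column names and the nested `items` field are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw feed: duplicated keys, string-typed numbers, nested fields.
raw = pd.DataFrame([
    {"order_id": "1", "amount": "75", "items": [{"sku": "A"}, {"sku": "B"}]},
    {"order_id": "1", "amount": "75", "items": [{"sku": "A"}, {"sku": "B"}]},  # duplicate
    {"order_id": "2", "amount": "83", "items": [{"sku": "C"}]},
])

# Key-based de-duplication: one row per business key.
deduped = raw.drop_duplicates(subset="order_id").reset_index(drop=True)

# Type-casting: make sure integer columns are not left as strings.
deduped = deduped.astype({"order_id": "int64", "amount": "int64"})

# Normalisation: unnest the repeated `items` field into one row per item.
unnested = deduped.explode("items").reset_index(drop=True)
flat = pd.concat(
    [unnested.drop(columns="items"), pd.json_normalize(unnested["items"].tolist())],
    axis=1,
)
print(flat)  # order_id, amount, sku: flat, typed and de-duplicated
```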

More: What is a Data Engineer.

Technical logic vs. Business Logic

In case you wonder, below are examples of technical and business-logic requests one might encounter in a Data Engineer's shoes.

Technical Logic

  • Ingest new source data (e.g. external databases, SFTP servers, feeds and snapshots) into the Data Warehouse
  • Turn the raw ingested data into structured data (e.g. from JSON to SQL tables; see the sketch after this list)
  • Define identity and access roles or create the Cloud infrastructure (projects, datasets, views and tables)
  • Define backup and security policies
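
As an illustration of the first two points, here is a minimal ingestion sketch with the google-cloud-bigquery client; the project, bucket and table names are assumptions:

```python
from google.cloud import bigquery

# Hypothetical ingestion step: load a newline-delimited JSON feed into a raw
# table, letting the warehouse enforce the declared schema on the way in.
client = bigquery.Client(project="my-analytics-project")  # assumed project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[
        bigquery.SchemaField("order_id", "INT64", mode="REQUIRED"),
        bigquery.SchemaField("amount", "INT64"),
    ],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-feeds-bucket/orders/2022-06-01.json",  # assumed feed location
    "my-analytics-project.raw.orders",              # assumed target table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```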

Business logic

  • Generate graphs and reports for upper management (role of data analysts/business intelligence)
  • Conduct Machine Learning projects (role of data scientists)
  • Build a data catalogue and map the available data to the business objects it represents, e.g. shop orders or app users tables (role of data modellers)

Note: should a company fail to draw a clear line between the two logics, it is a clear marker – at least for me – that its processes are not mature enough.

One more thing

It is always a headache when stakeholders come to you with special requests, asking you to help them join multiple tables together, or to figure out why the business object they end up with does not match their business needs. E.g.:

I would like to have the website's traffic for June together with the average customer expenses, at a 30-minute granularity, to see how it evolves as we move further into the month.

date_time                  visitors  average_expenses
2022-06-01 00:00:00 UTC    13        75
2022-06-01 00:30:00 UTC    8         83
2022-06-01 01:00:00 UTC    17        90
2022-06-01 01:30:00 UTC    4         42
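
For illustration, here is roughly what such a request translates to once the data is published – a sketch assuming hypothetical `analytics.website_visits` and `analytics.orders` tables. Writing and owning this query is the stakeholders' job, not ours:

```python
# The kind of query the stakeholders themselves should write; all table and
# column names below are assumptions for illustration.
THIRTY_MINUTE_QUERY = """
WITH visits AS (
  SELECT
    TIMESTAMP_SECONDS(1800 * DIV(UNIX_SECONDS(visited_at), 1800)) AS date_time,
    COUNT(*) AS visitors
  FROM `analytics.website_visits`
  WHERE DATE(visited_at) BETWEEN '2022-06-01' AND '2022-06-30'
  GROUP BY date_time
),
expenses AS (
  SELECT
    TIMESTAMP_SECONDS(1800 * DIV(UNIX_SECONDS(ordered_at), 1800)) AS date_time,
    AVG(amount) AS average_expenses
  FROM `analytics.orders`
  WHERE DATE(ordered_at) BETWEEN '2022-06-01' AND '2022-06-30'
  GROUP BY date_time
)
SELECT date_time, visitors, average_expenses
FROM visits
LEFT JOIN expenses USING (date_time)
ORDER BY date_time
"""
```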

Whether such numbers make business sense, we simply don't know. Our role is to bring the data there, not to figure out what it means, use it, or come up with meaningful data-driven decisions.

Remember, entropy always wins 💥

What is a Data Engineer

What is a Data Engineer in a nutshell

A Data Engineer is like a gas or oil pipeline operator. He must:

  • oversee the full Cloud Data Warehouse;
  • make data available 24/7 on the platform;
  • move data from A to B;
  • ingest, update, retire or transform data coming from upstream data sources;
  • turn data into by-products and monitor the overall flow.

On top of ingesting, transforming, serving and storing, these tasks mandate strong proficiency in Security, Data Management, Data|FinOps, Data Architecture, Orchestration & Software Engineering.

The main objective is to serve data to the analysts/BI/data science teams and back to the software teams (reverse ETL).
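
A minimal reverse-ETL sketch, assuming a hypothetical `analytics.customer_scores` table and a hypothetical CRM endpoint:

```python
import requests
from google.cloud import bigquery

# Reverse ETL: read an aggregated table from the warehouse and push it back
# into an operational tool. Table name and endpoint are assumptions.
client = bigquery.Client()
rows = client.query(
    "SELECT customer_id, lifetime_value FROM `analytics.customer_scores`"
).result()

for row in rows:
    requests.post(
        "https://crm.example.com/api/customers",  # hypothetical CRM endpoint
        json={"id": row["customer_id"], "ltv": row["lifetime_value"]},
        timeout=10,
    )
```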

This enables the company to make data-driven decisions (an example is given at the end).

One big aspect though: a Data Engineer brings the oil to the different teams but is not responsible for consuming it. He is not responsible for turning data into meaningful charts or actionable decisions. We mostly do not bother much about the business logic behind it. Our role is to store data in one place (the single source of truth) where it can be consumed by the teams who need it.

Note: A lot of organizations and applicants tend to have a very poor understanding of what the term Data Engineering really means. They rather see the position as a patchwork of different roles and responsibilities. The profession is indeed quite new – as you can see on Google Trends, the interest only exploded in late 2019. It will still need some time before people (me included) have a full grasp of it.

Instead, Data Engineering is a technical job that requires you to be proficient in writing code (mainly Python and Java). Therefore, you need strong Software Engineering skills. Developer profiles (more than Data Scientist or Data Analyst ones) are in turn highly valued. That is why I prefer to call the role Data Software Engineer, to remove any ambiguity.

The different missions

  • Build, orchestrate and monitor ELT pipelines using Airflow & Google Cloud Composer (a minimal DAG sketch follows the list)
  • Manage data infrastructure and services in the Cloud (e.g. tables, views, datasets, projects, storage, access rights, network)
  • Ingest sources from external databases (using Python, Docker, Kubernetes & Change Data Capture tools)
  • Develop REST API clients (Facebook, Snapchat, Jira, Rocket.Chat, GitLab, external providers)
  • Publish open-source projects, e.g. on github.com/e-breuninger, such as Python libraries, bots or Google Chrome extensions (ok, this one is rather specific to my job)
  • Lead workshops and interviews (e.g. BigQuery Introduction, Code Standardization & Best Practices)
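
Here is what a minimal pipeline skeleton looks like, assuming Airflow 2.x; the DAG id and the placeholder bash commands are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal ELT skeleton: extract a feed, load it raw, then transform it.
with DAG(
    dag_id="orders_elt",               # assumed pipeline name
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'pull the feed'")
    load = BashOperator(task_id="load", bash_command="echo 'load the raw table'")
    transform = BashOperator(task_id="transform", bash_command="echo 'run SQL models'")

    extract >> load >> transform  # linear dependency chain
```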

Keep in mind: Data Engineers are Data Paddlers, not Data Keepers.

If the source data is corrupted, correcting or improving data quality is out of scope. You can see us as an incorruptible, blindfolded carrier, moving the baggage assigned to us from A to B, without looking into it throughout the journey. You don't want us to start opening the bags and folding your shirts the way you think they should be.

See Should Data Engineers work closely with the business logic?

Tools used by a Data Engineer

At least, these are the ones I am currently working with on a daily basis:

  • Airflow to manage your ETL pipelines.
  • Google Cloud Platform as Cloud Provider.
  • Terraform as Infrastructure as Code solution.
  • Python, Bash, Docker, Kubernetes to build feed and snapshot readers.
  • SQL as Data Manipulation Language (DML) to query data.
  • Git as versioning tool.

The main challenges in the job

Based on my own experience, my biggest challenges at the moment are:

  • Improve the reliability of existing pipelines (pushing for more tests, ISO standardization, data validation & monitoring – a small validation sketch follows the list)
  • Get rid of technical debt (moving toward 100% automation, documentation and infrastructure-as-code coverage)
  • Keep up with upcoming technologies (learning new skills, going deeper both vertically and horizontally)
  • Enforce software development best practices and standardization principles (via many hours of workshops and conferences)
  • Strengthen the international side, actively connecting the teams (via department tours and promoting the use of English as the primary language of communication)
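
To give an idea of the data-validation part, here is a small sketch of the checks a pipeline can run right after a load; the column names mirror the earlier examples and are assumptions:

```python
# Fail the pipeline run early when a freshly loaded batch looks wrong,
# instead of letting bad rows propagate downstream.
def validate_orders_batch(rows: list[dict]) -> None:
    if not rows:
        raise ValueError("empty batch: the upstream feed may have failed")

    keys = [row["order_id"] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate order_id values: de-duplication failed")

    if not all(isinstance(row["amount"], int) for row in rows):
        raise TypeError("amount must be an integer, not a string")


# Example usage inside a pipeline task:
validate_orders_batch([{"order_id": 1, "amount": 75}, {"order_id": 2, "amount": 83}])
```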

I believe they are representative of the industry.

Get started as a Data Engineer

This deserves an article of its own. However, you can get started with the following takeaways:

Technical books

  • Fundamentals of Data Engineering, Reis & Housley, O’Reilly, 2022
  • Google BigQuery: The Definitive Guide, Lakshmanan & Tigani, O’Reilly, 2019
  • Terraform: Up & Running, 3rd Edition, Yevgeniy Brikman, O’Reilly, 2022
  • Learning SQL, 3rd Edition, Alan Beaulieu, O’Reilly, 2020
  • Docker: Up & Running, 3rd Edition, Kane & Matthias, O’Reilly, 2023
  • The Kubernetes Book, Nigel Poulton, 2022 Edition

Online courses

  • The Git & Github Bootcamp, Colt Steele on Udemy
  • Apache Airflow: The Hands-On Guide, Marc Lamberti on Udemy
  • Terraform Tutorials, HashiCorp Learn

An example of Data Driven Decisions

Imagine you are the CEO of a bicycle rental company with multiple stations across Manhattan. There are three things you want to know:

  • You want to know which stations perform best and are in high demand, so you can anticipate disruptions ahead of time, have more technicians standing by in the area, increase the fleet and plan future expansion.
  • On the other hand, you want to retire poorly performing stations, adjusting your footprint to fit market needs more accurately.
  • You want to monitor overall usage so you know the off-peak and rush hours, the average journey length and the most popular commute options. You can then adapt the offer accordingly, e.g. offering discounts at specific times of the day/week/month to boost customer acquisition or better match your customers’ needs.

It is part of the Data Engineering journey to consume data coming from the different sources (e.g. bike stations, bicycles, the OpenWeather API, the Google Maps API, etc.) so the marketing and business intelligence teams can focus solely on answering your questions, without getting their hands dirty deep-diving into the data ingestion part.
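
Once the data is published, answering the first question boils down to a query like the following sketch, assuming a hypothetical `analytics.bike_trips` table:

```python
# The kind of question the BI team can answer once the trips are in the
# warehouse; the table and column names are assumptions for illustration.
TOP_STATIONS_QUERY = """
SELECT
  start_station_id,
  COUNT(*) AS trips,
  AVG(TIMESTAMP_DIFF(ended_at, started_at, MINUTE)) AS avg_journey_minutes
FROM `analytics.bike_trips`
WHERE DATE(started_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY start_station_id
ORDER BY trips DESC
LIMIT 10
"""
```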

To conclude, as is often the case in history, recent jobs bear many similarities to sectors that have existed long before them. For instance, the data engineering field and the energy sector are closely similar (you have to move a commodity from A to B and distribute it to consumers). They simply inherit the lingo. The ideas remain the same but are now applied to different “objects”. Data has replaced oil, but the paradigm keeps working.

However, good luck getting your car to run with it! 🏎️💨