No, Data Engineers should not have to worry about the business logic behind the data they ingest. Business logic should only be applied after the data has been published to the datapool, and it should be written by the stakeholders themselves (data scientists, data analysts and analytics engineers). Let me explain.
Data Engineers are technical experts
The two main reasons why Data Engineers should not deep-dive into the business logic are easy to understand:
- The stakeholders are the only ones to truly know what they need from the data and how to codify their needs in a query;
- An increasingly technically complex world requires the formation of expert groups. As processes become more complex, it is no longer possible for one handyman to carry all the tools and to know everything about how each of them is used.
Instead, Data Engineers are better equipped to handle the technical logic:
- Identify and ensure uniqueness of the data (key-based de-duplication)
- Type-casting (making sure that the values of an integer column are not of type string)
- Schema validation
- Normalisation (e.g. unnesting repeated fields)
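To make the four items above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `SCHEMA` mapping, the column names (`order_id`, `customer`, `items`) and the helper functions are hypothetical, and a real pipeline would more likely do this in SQL or a dedicated framework.

```python
# Hypothetical target schema: column name -> expected Python type.
SCHEMA = {"order_id": int, "customer": str, "items": list}


def cast_row(row):
    """Type-casting: coerce order_id to int so the column never holds strings."""
    return {**row, "order_id": int(row["order_id"])}


def validate_row(row):
    """Schema validation: exact column set, each value of the expected type."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for column, expected in SCHEMA.items():
        if not isinstance(row[column], expected):
            raise TypeError(f"{column} should be {expected.__name__}")
    return row


def deduplicate(rows, key):
    """Key-based de-duplication: keep the last row seen for each key."""
    latest = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


def unnest(rows, repeated, into):
    """Normalisation: emit one row per element of the repeated field."""
    flat = []
    for row in rows:
        base = {k: v for k, v in row.items() if k != repeated}
        for element in row[repeated]:
            flat.append({**base, into: element})
    return flat


# Made-up raw records: order 1 arrives twice, once with a string id.
raw = [
    {"order_id": "1", "customer": "ada", "items": ["book", "pen"]},
    {"order_id": 1, "customer": "ada", "items": ["book", "pen", "ink"]},
    {"order_id": "2", "customer": "bob", "items": ["lamp"]},
]

clean = [validate_row(cast_row(row)) for row in raw]
unique = deduplicate(clean, key="order_id")
flat = unnest(unique, repeated="items", into="item")
```

None of these steps requires knowing what an order means to the business. They only require knowing what a well-formed row looks like.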
More: What is a Data Engineer.
Technical logic vs. Business Logic
In case you are wondering, below are examples of the technical and business logic requests a Data Engineer might encounter.
Technical Logic
- Ingest new source data (e.g. external databases, sftp servers, feeds and snapshots) into the Data Warehouse
- Turn the raw ingested data into structured data (e.g. from json to SQL tables)
- Define identity and access roles or create the Cloud infrastructure (projects, datasets, views and tables)
- Define backups and security policies.
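As an illustration of the second item (from JSON to SQL tables), the sketch below loads a small JSON payload into a typed table. It assumes an in-memory SQLite database as a stand-in for the Data Warehouse; the `app_users` table, its columns and the `load_users` helper are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw payload, standing in for a feed or a snapshot.
RAW = '[{"id": 1, "email": "ada@example.com"}, {"id": 2, "email": "bob@example.com"}]'


def load_users(conn, payload):
    """Parse a JSON array and upsert it into a typed table; return the row count."""
    records = json.loads(payload)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS app_users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
    )
    # Named placeholders let each JSON object map directly onto the columns.
    conn.executemany(
        "INSERT OR REPLACE INTO app_users (id, email) VALUES (:id, :email)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM app_users").fetchone()[0]


conn = sqlite3.connect(":memory:")
row_count = load_users(conn, RAW)
```

Because the insert is keyed on `id`, loading the same payload twice leaves the table unchanged, which is usually what you want from an ingestion job.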
Business logic
- Generate graphs and reports for upper management (role of data analysts/business intelligence)
- Conduct Machine Learning projects (role of data scientists)
- Build a data catalogue and map the available data with the business object it represents e.g. shop orders or app users tables (role of data modellers)
Note: should a company fail to draw a clear line between the two kinds of logic, it is a clear sign, at least for me, that its processes are not mature enough.
One more thing
It is always a headache when stakeholders come to you with their special requests, asking you to help them join multiple tables against each other, or to figure out why the business object they end up with does not match their business needs. E.g.:
I would like to have the website’s traffic for June associated with the average customer expenses at a 30-minute granularity, to see how it evolves as the month goes on.
date | time | visitors | average_expenses |
---|---|---|---|
2022-06-01 | 00:00:00 UTC | 13 | 75 |
2022-06-01 | 00:30:00 UTC | 8 | 83 |
2022-06-01 | 01:00:00 UTC | 17 | 90 |
2022-06-01 | 01:30:00 UTC | 4 | 42 |
We simply don’t know. Our role is to bring the data there, not to figure out what it means, use it, or come up with meaningful data-driven decisions.
Remember, entropy always wins 💥