Olivier Bénard

Difference between datalake, datapool and datamart

Once the data is loaded into the data warehouse, it can be stored in different environments:

datalake: protected staging environment with access limited to Data Engineers. It receives raw data from the ingestion pipelines. Typically, tables there contain only a few columns, e.g. content (containing the stringified json) and the ingested_at ISO 8601 timestamp.
datapool-pii: structured/prepared environment where data from datalake is parsed and technical transformations are applied. Typically, the previous content field is parsed (e.g. using JSON_EXTRACT()-like functions on BigQuery) and split into different fields. Accredited stakeholders can directly pick data from there. If the data contains sensitive information, it is also to be found in this environment.
datapool: same as the datapool-pii environment, but with the sensitive information removed or hashed.
datamarts: environments holding integrated business objects and aggregations for metrics, intelligence, science and analytics, intended for stakeholders. You can see them as data business products. Generally, those tables are owned by an expert group, e.g. the Data Modellers and Business Intelligence team.

Note: PII stands for Personally Identifiable Information.

End-to-end Data Journey Example

Let’s examine an end-to-end business case to see how the different environments fit into the overall workflow.

Let’s say that you have the following json data containing some statistics per active user (e.g. on an app or website):

{
    "active_users": [
        {
            "id": "12e57e",
            "email": "john.doe@gmail.com",
            "results": [
                {
                    "stats": {"views": 12587, "spend": 8000}
                }
            ]
        },
        {
            "id": "r87e5z",
            "email": "jane.doe@gmail.com",
            "results": [
                {
                    "stats": {"views": 97553, "spend": 10000}
                }
            ]
        },
        {
            "id": "8tr75z",
            "email": "johnny.doe@gmail.com",
            "results": [
                {
                    "stats": {"views": 41239, "spend": 5000}
                }
            ]
        }
    ]
}

You ingested the above json data into the following raw table on datalake:

content	ingested_at
{"active_users":[{"id":"12e57e","email":"john.doe@gmail.com","results":[{"stats":{"views":12587,"spend":8000}}]},{"id":"r87e5z","email":"jane.doe@gmail.com","results":[{"stats":{"views":97553,"spend":10000}}]},{"id":"8tr75z","email":"johnny.doe@gmail.com","results":[{"stats":{"views":41239,"spend":5000}}]}]}	2022-12-14T04:00:00.000 UTC

Note: content is a STRING and ingested_at is a TIMESTAMP.

The table is then parsed and stored into the datapool-pii environment:

id	email	views	spend
12e57e	john.doe@gmail.com	12587	8000
r87e5z	jane.doe@gmail.com	97553	10000
8tr75z	johnny.doe@gmail.com	41239	5000

This table contains sensitive information (e.g. pii data). Thus, before granting access to stakeholders, one must hide or hash the sensitive fields:

id	email	views	spend
12e57e	375320dd9ae7ed408002f3768e16cb5f28c861062fd50dff9a3bff62e9dce4ef	12587	8000
r87e5z	831f6494ad6be4fcb3a724c3d5fef22d3ceffa3c62ef3a7984e45a0ea177f982	97553	10000
8tr75z	980b542e198802ebe4d57690a1ad3e587b5751cb537209112c0564b52bd0699f	41239	5000

The above table is stored on datapool.
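The journey from datalake to datapool can be sketched in plain Python. This is only an illustration under stated assumptions: the function names parse_to_datapool_pii and mask_pii are hypothetical, the sample payload is shortened to two users, and the hashing is unsalted SHA-256, which may not be what your own pipeline uses.

```python
import hashlib
import json

# Raw row as it would sit on datalake: a stringified json payload plus a timestamp.
raw_row = {
    "content": json.dumps({
        "active_users": [
            {"id": "12e57e", "email": "john.doe@gmail.com",
             "results": [{"stats": {"views": 12587, "spend": 8000}}]},
            {"id": "r87e5z", "email": "jane.doe@gmail.com",
             "results": [{"stats": {"views": 97553, "spend": 10000}}]},
        ]
    }),
    "ingested_at": "2022-12-14T04:00:00.000 UTC",
}


def parse_to_datapool_pii(row: dict) -> list[dict]:
    """Flatten the raw content into one record per user and result (hypothetical helper)."""
    payload = json.loads(row["content"])
    records = []
    for user in payload["active_users"]:
        for result in user["results"]:
            records.append({
                "id": user["id"],
                "email": user["email"],
                "views": result["stats"]["views"],
                "spend": result["stats"]["spend"],
            })
    return records


def mask_pii(records: list[dict]) -> list[dict]:
    """Replace each email with its SHA-256 hex digest (unsalted, for illustration only)."""
    return [
        {**record, "email": hashlib.sha256(record["email"].encode("utf-8")).hexdigest()}
        for record in records
    ]


datapool_pii = parse_to_datapool_pii(raw_row)  # still contains the emails
datapool = mask_pii(datapool_pii)              # emails replaced by 64-character digests
```

The two-step split mirrors the environments: the parsed records keep the emails (datapool-pii), and only the masked copy is exposed more widely (datapool).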

Then, on datamart you will find this table aggregated with other metrics and measurements, ready to be used and consumed by the vertical teams.
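As a rough idea of what such an aggregation could look like, here is a small sketch. The summary fields, including the spend_per_view metric, and the build_datamart_summary helper are hypothetical examples, not metrics prescribed by this article.

```python
# Hypothetical datapool rows (hashed email column omitted for brevity).
datapool_rows = [
    {"id": "12e57e", "views": 12587, "spend": 8000},
    {"id": "r87e5z", "views": 97553, "spend": 10000},
    {"id": "8tr75z", "views": 41239, "spend": 5000},
]


def build_datamart_summary(rows: list) -> dict:
    """Aggregate per-user rows into a single business-level summary."""
    total_views = sum(row["views"] for row in rows)
    total_spend = sum(row["spend"] for row in rows)
    return {
        "active_users": len(rows),
        "total_views": total_views,
        "total_spend": total_spend,
        # Guard against a division by zero when there are no views at all.
        "spend_per_view": total_spend / total_views if total_views else 0.0,
    }


summary = build_datamart_summary(datapool_rows)
```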

To go even further

To parse the raw json content field, you can use an Airflow job that will regularly execute the following SQL query:

{% set datalake_project = var.value.datalake_project %}

WITH
  raw_lake_table AS (
  SELECT
    JSON_EXTRACT_ARRAY(content, '$.active_users') AS active_users,
    ingested_at as ingested_at
  FROM
    `{{datalake_project}}.dataset.table`
  WHERE DATE(ingested_at) = "{{ds}}"
  )
SELECT
  JSON_EXTRACT_SCALAR(unnested_active_users, '$.id') as id,
  JSON_EXTRACT_SCALAR(unnested_active_users, '$.email') as email,
  JSON_EXTRACT(results, '$.stats.views') as views,
  JSON_EXTRACT(results, '$.stats.spend') as spend,
  ingested_at
FROM
  raw_lake_table,
  UNNEST(active_users) as unnested_active_users
  LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(unnested_active_users, '$.results')) as results

Note: You can use Jinja templated variables to parametrize your query. In our case, {{datalake_project}} will be replaced at execution time by datalake-prod or datalake-dev. In the same manner, {{ds}} will be replaced by the logical date of the DAG run in YYYY-MM-DD format. See Airflow Templates Reference for more information.

To hash using sha256 on python:

python3> import hashlib
python3> hashlib.sha256("your-string".encode('utf-8')).hexdigest()
'42c23acbf1b1c471c7b53efe58f34dea361d941f47584265df5dbaec1bfddd49'

On BigQuery, you can directly use the SHA256 function to hash your fields in your data pipelines:

SELECT SHA256("your-string") as sha256;
+----------------------------------------------+
| sha256                                       |
+----------------------------------------------+
| QsI6y/GxxHHHtT7+WPNN6jYdlB9HWEJl31267Bv93Uk= |
+----------------------------------------------+

Caution: The return type is BYTES, displayed in base64 format. If you want to compare it with the Python value returned above, then use TO_HEX.

SELECT TO_HEX(SHA256("your-string")) as sha256;
+------------------------------------------------------------------+
| sha256                                                           |
+------------------------------------------------------------------+
| 42c23acbf1b1c471c7b53efe58f34dea361d941f47584265df5dbaec1bfddd49 |
+------------------------------------------------------------------+

Note: Even better, use SHA512.
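To see why the base64 and hexadecimal outputs describe the same hash, the following sketch encodes one digest both ways using only the Python standard library. It assumes nothing about BigQuery itself; it only illustrates the two encodings and the longer SHA-512 digest.

```python
import base64
import hashlib

# Raw 32-byte SHA-256 digest of the input string.
digest = hashlib.sha256("your-string".encode("utf-8")).digest()

# Hexadecimal view: the same text hexdigest() returns.
as_hex = digest.hex()

# Base64 view of the very same bytes.
as_base64 = base64.b64encode(digest).decode("ascii")

# Both encodings decode back to identical bytes.
same_bytes = bytes.fromhex(as_hex) == base64.b64decode(as_base64)

# SHA-512 produces a 64-byte digest, i.e. 128 hexadecimal characters.
sha512_hex = hashlib.sha512("your-string".encode("utf-8")).hexdigest()
```

Comparing hashes across systems is therefore only a matter of agreeing on one encoding before comparing.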

Why stop using SFTP

You should stop using SFTP as it is an outdated technology that also comes with its share of disadvantages. The main problems with SFTP are the following:

You have to keep the servers secure (network security, access control, …). It adds extra load on top of your list of duties.
SFTP servers only do an SFTP-to-bucket dump, which is unnecessary if we use buckets directly.
On Google Cloud Platform, SFTP servers mount GCP buckets as file systems in a Docker container. This can be done but, depending on the load sent to the SFTP server, it can become a problem (volume limitations).

Note: SFTP can be used as an interface to pass on requests between the user and the bucket it mirrors. It uses user-friendly command lines, see SFTP basic CLI commands survival kit.

Alternative to SFTP

Besides providing a friendly interface to our customers (e.g. if they use FileZilla), why is using SFTP a no-go? Preaching against it sounds like we are fighting against hell. Why are Data Software Engineers so adamant about it? In the end, do we know why it is an issue? It is not the best practice, but is it that harmful?

To give you the short answer: there is nothing wrong with SFTP in general, it is just an outdated technology.

On the contrary, there are multiple benefits to using buckets directly:

Bucket security is done by GCP (of course, we have to do the access control – which is not so hard).
The interface is also easy since you can see the bucket contents in the GCP console (no need for external tools like FileZilla).
Buckets can handle a high amount of parallel reads.

It is not always possible though

As stated in the Fundamentals of Data Engineering book published by O’REILLY:

Data engineers rightfully cringe at the mention of SFTP. Regardless, SFTP is still a practical reality for many businesses. They work with partner businesses that consume or provide data using SFTP and are unwilling to rely on other standards.

To conclude, try to get rid of SFTP legacies but don’t get your knickers in a twist over it. 👟

How to create custom Airflow Operators

Disclaimer: This article is intended for users who already have some hands-on experience with Airflow. If this is not the case, I am working on an Airflow Essentials Survival Kit guide to be released. The link will be posted here as soon as it is available.

To create custom Airflow Operators, all you need is to import the Airflow BaseOperator and subclass it with the parameters and logic you need. You just have to fill in the template below:

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults

class MyCustomOperator(BaseOperator):

    # add the templated param(s) your custom operator takes:
    template_fields = ("your_custom_templated_field", )

    @apply_defaults
    def __init__(
        self,
        your_custom_field,
        your_custom_templated_field,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)

        # assign the normal and templated params:
        self.your_custom_field = your_custom_field
        self.your_custom_templated_field = your_custom_templated_field

    def execute(self, context):
        # the logic to perform
        # ...
        pass

Note: The execute() method and context argument are mandatory.

Then, assuming you store this module in another file, you can call it inside your DAG file:

from path.to.file import MyCustomOperator

my_custom_job = MyCustomOperator(
    task_id="my_custom_operator",
    your_custom_templated_field=f'E.g. {"{{ds}}"}',
    your_custom_field=42,
)

Notes:

Because you have declared it in template_fields, your_custom_templated_field accepts Jinja templated variables like {{ds}}. You do not necessarily need to declare any templated field though.
You can have more than one custom field.
Not all operator parameters accept Jinja templated values. Look up in the documentation which ones do, e.g. for BigQueryInsertJobOperator.
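To make the role of template_fields more tangible, here is a toy model written without Airflow. It is emphatically not Airflow's implementation (Airflow renders templates with Jinja): ToyOperator and render_template_fields are hypothetical names, and the regular expression only handles simple {{ name }} placeholders. It merely shows that only the attributes listed in template_fields get rendered.

```python
import re


class ToyOperator:
    """Toy stand-in for an operator: only fields listed here get rendered."""

    template_fields = ("templated",)

    def __init__(self, plain: str, templated: str) -> None:
        self.plain = plain
        self.templated = templated


def render_template_fields(operator, context: dict) -> None:
    """Substitute {{ name }} placeholders, but only in the declared template fields."""
    pattern = re.compile(r"\{\{\s*(\w+)\s*\}\}")
    for field in operator.template_fields:
        value = getattr(operator, field)
        rendered = pattern.sub(lambda match: str(context[match.group(1)]), value)
        setattr(operator, field, rendered)


task = ToyOperator(plain="{{ds}}", templated="E.g. {{ds}}")
render_template_fields(task, {"ds": "2022-12-14"})
# task.templated is now rendered, while task.plain still holds the raw placeholder.
```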

Example: Ingesting Data From API

Context: you have data sitting behind an API. You want to fetch data from this API and ingest it into Google Cloud BigQuery. In between, you store the fetched raw data as temporary json files inside a Google Cloud Storage bucket. The next step is then to load the files’ content into a BigQuery table.

In order to do so, you need to create a custom Airflow Operator that uses your API client to fetch the data, retrieving the credentials from an Airflow Connection, and stores the retrieved data into a temporary json file on Google Cloud Storage.

Your custom-made Airflow Operator, stored in a file such as api_ingest/core.py, will look like:

import json
from typing import Any
from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.base_hook import BaseHook
from custom_api_client.client import (
    CustomAPIClient,
)

class APIToGCSOperator(
    BaseOperator
): # pylint: disable=too-few-public-methods
    """
    Custom-made Airflow Operator fetching data from the API.
    """

    template_fields = (
        "destination_file",
        "ingested_at"
    )

    @apply_defaults
    def __init__(
        self,
        destination_file: str,
        ingested_at: str,
        entity: str,
        *args: Any,
        **kwargs: Any
    ) -> None:
        super().__init__(*args, **kwargs)

        self.destination_file = destination_file
        self.ingested_at = ingested_at
        self.entity = entity

    def execute(
        self,
        context: Any
    ) -> None:  # pylint: disable=unused-argument
        """
        Fetches data from the API and writes into GCS bucket.

        1. Reads the credentials stored into the Airflow Connection.
        2. Instantiates the API client.
        3. Fetches the data with the given parameters.
        4. Flushes the result into a temporary GCS bucket.
        """

        api_connection = BaseHook.get_connection("your_connection_name")
        credentials = api_connection.extra_dejson
        client_id = credentials["client_id"]
        client_secret = credentials["client_secret"]
        refresh_token = credentials["non_expiring_refresh_token"]

        custom_api_client = CustomAPIClient(
            client_id = client_id,
            client_secret = client_secret,
            refresh_token = refresh_token,
        )

        with open(
            self.destination_file,
            "w",
            encoding="utf-8"
        ) as output:

            fetched_data_json = custom_api_client.fetch_entity(
                entity=self.entity
            )

            entity_json = dict(
                content = json.dumps(
                    fetched_data_json,
                    default=str,
                    ensure_ascii=False
                    ),
                ingested_at = self.ingested_at,
            )

            json.dump(
                entity_json,
                output,
                default=str,
                ensure_ascii=False
            )
            output.write("\n")

Note: You can create your own custom-made API clients. To make sure yours is available inside your Airflow DAG, upload the package into the plugins Airflow folder beforehand.
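The file written by the operator holds one json object per line, with the fetched payload stringified in content. The sketch below reproduces that layout with the standard library only, so it can be tried without Airflow. The helper names, the sample payload and the temporary path are hypothetical.

```python
import json
import os
import tempfile


def write_ingest_record(path: str, payload: dict, ingested_at: str) -> None:
    """Append one line holding the stringified payload and its ingestion timestamp."""
    record = {
        "content": json.dumps(payload, default=str, ensure_ascii=False),
        "ingested_at": ingested_at,
    }
    with open(path, "a", encoding="utf-8") as output:
        json.dump(record, output, default=str, ensure_ascii=False)
        output.write("\n")


def read_ingest_records(path: str) -> list:
    """Read the file back, one json object per non-empty line."""
    with open(path, encoding="utf-8") as source:
        return [json.loads(line) for line in source if line.strip()]


staging_dir = tempfile.mkdtemp()
staging_file = os.path.join(staging_dir, "data_20221214.mjson")

write_ingest_record(
    staging_file,
    {"campaigns": [{"id": 1, "name": "été"}]},  # hypothetical API response
    "2022-12-14T04:00:00+00:00",
)
records = read_ingest_records(staging_file)
```

Keeping content as a string means the raw table stays schema-free: the payload is only parsed later, on its way to datapool-pii.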

In the main DAG’s module, inside the DAG’s with context manager, it will look like:

from airflow.operators.bash import BashOperator
from api_ingest.core import APIToGCSOperator

staging_dir = "{{var.value.local_storage_dir}}" + "/tempfiles/api_ingest"

create_staging_dir = BashOperator(
    task_id=f"create-staging-dir",
    bash_command=f"mkdir -p {staging_dir}"
)

cleanup_staging_dir = BashOperator(
    task_id=f"cleanup-staging-dir",
    bash_command=f"rm -rf {staging_dir}"
)

api_to_gcs = APIToGCSOperator(
    task_id = "api-to-gcs",
    destination_file = f'{staging_dir}/data_{"{{ds_nodash}}"}.mjson',
    ingested_at = "{{ts}}",
    entity = "campaigns",
)

Note: It is more efficient to use {{var.value.your_variable}} instead of Variable.get("your_variable"). The downsides are that the real value is only substituted at execution time, and only in the fields accepting Jinja templated variables.

And here we go, you should now be able to create your own custom Airflow Operators! Have fun crafting AF, tuning it to your liking. 💫

Python Walrus Operator

The walrus operator := is a nice way to avoid repetitions of function calls and statements. It simplifies your code. You can compare the two following code snippets:

grades = [1, 12, 14, 17, 5, 8, 20]

stats = {
    'nb_grades': len(grades),
    'sum': sum(grades),
    'mean': sum(grades)/len(grades)
}

grades = [1, 12, 14, 17, 5, 8, 20]

stats = {
    'nb_grades': (nb_grades := len(grades)),
    'sum': (sum_grades := sum(grades)),
    'mean': sum_grades/nb_grades
}

Note: The parentheses are mandatory for the plain assignment to work.

Same goes for function calls:

foo = "hello world!"
if (n := len(foo)) > 4:
    print(f"foo is of length {n}")

foo is of length 12

In the above snippet, the len() function has only been called once instead of twice. More generally, you can assign values to variables on the fly without having to call the same functions more than once.

Important: The Python walrus operator := (officially known as the assignment expression operator) was introduced in Python 3.8. This means that, once you use it, your code will no longer run on earlier Python versions.

Note: The “walrus operator” nickname comes from its resemblance to the eyes and tusks of a walrus.
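Another place where the operator tends to pay off is regular expression matching, where you want to test a match and reuse it right away. A minimal sketch, in which the extract_ingestion_date helper and the file naming pattern are hypothetical:

```python
import re


def extract_ingestion_date(filename: str) -> str:
    """Return the YYYYMMDD part of names like data_20221214.mjson, or "" if absent."""
    # The match object is assigned and tested in a single expression.
    if (match := re.search(r"data_(\d{8})\.mjson$", filename)) is not None:
        return match.group(1)
    return ""
```

Without the walrus operator, you would either call re.search() twice or add a separate assignment line before the if statement.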

More examples

python> [name for _name in ["OLIVIER", "OLIVIA", "OLIVER"] if "vi" in (name := _name.lower())]
['olivier', 'olivia']

In this example, we are iterating through the list, storing each item of the list into the temporary _name variable. Then, we apply the lower() string method on the _name object, turning the upper case string into lower case. Next, we store the lower case value into the name variable using the walrus operator. Finally, we filter the values using the predicate to only keep the names containing “vi”.

The alternative without the walrus operator would have been:

python> [name.lower() for name in ["OLIVIER", "OLIVIA", "OLIVER"] if "vi" in name.lower()]
['olivier', 'olivia']

As you can see, the above code snippet is less “optimized” as you call the lower() method twice on the same object.

You can also use the walrus operator without performing any filtering. The following code works like a charm:

python> [name for _name in ["OLIVIER", "OLIVIA", "OLIVER"] if (name := _name.lower())]
['olivier', 'olivia', 'oliver']

However, this is highly counter-intuitive and error-prone. The presence of if suggests that a conditional check and filtering are in place, which is not the case here (the condition only filters out empty strings). The codebase does not benefit from such a design since the following snippet, on top of being clearer, gives the same outcome for this input:

python> [name.lower() for name in ["OLIVIER", "OLIVIA", "OLIVER"]]
['olivier', 'olivia', 'oliver']

Note: Developers tend to be very smart people. We sometimes like to show off our smarts by demonstrating our mental juggling abilities. Resist the temptation to write complex code. Programming is a social activity (e.g. all the collaborative open source projects). Be professional and keep the nerdy brilliant workarounds for your personal projects. Follow the KISS principle: Keep It Simple, Stupid. Clarity is all that matters. You want to maximize the codebase discoverability.

Last but not least, you can also use the walrus operator inside while conditional statements:

while (answer := int(input())) != 42:
    print("This is not the Answer to the Ultimate Question of Life, the Universe, and Everything.")

python> python script.py
7
This is not the Answer to the Ultimate Question of Life, the Universe, and Everything.
3
This is not the Answer to the Ultimate Question of Life, the Universe, and Everything.
42

Note: In the above examples we have used list comprehension. This article (in progress) explains this design in more detail.
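The same while pattern is handy when consuming a stream piece by piece. The sketch below uses an in-memory io.BytesIO stream so it runs anywhere; the read_in_chunks helper and the chunk size are hypothetical.

```python
import io


def read_in_chunks(stream, chunk_size: int = 4) -> list:
    """Collect fixed-size chunks until the stream is exhausted."""
    chunks = []
    # read() returns an empty bytes object at the end, which stops the loop.
    while (chunk := stream.read(chunk_size)):
        chunks.append(chunk)
    return chunks


chunks = read_in_chunks(io.BytesIO(b"0123456789"))
```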

One more thing

You cannot do a plain assignment with the walrus operator. At least, it is not that easy:

python> a := 42
  File "<stdin>", line 1
    a := 42
      ^^
SyntaxError: invalid syntax

For the above code snippet to work, you need to enclose the assignment expression in parentheses:

python> a = 42
python> (a := 18)
18
python> print(a)
18

Who told you that Software Developers were not sentimental? ❤️

Do not return null values

It is always a bad idea to write a method returning a null value (None in Python) because it requires the client to remember to check for it:

It foists problems upon the caller methods, postponing conditional checks and creating work for later that one might forget to implement.
It invites errors; all it takes is one missing None check to have your application spinning out of control.
If you are still tempted to return None from a method, consider throwing an exception or returning a special-case object instead.

Note: returning a special-case object works well for methods returning iterable types, e.g. list, set, str… as you can just return an empty one. Returning an empty custom object, such as an instantiated class, is hairier. In such edge cases only, you can return None.

from config.constants import REGISTERED_ITEMS

def retrieve_registered_item_information(item):
    if item is not None:
        item_id = item.get_id()
        if item_id is not None:
            return REGISTERED_ITEMS.get_item(item_id).get_info()

As a demonstration of our second point above, did you notice that there is no null check in the last line? What if the item is not found among the REGISTERED_ITEMS and you still try to call the get_info() method on a None element? You will get an error for sure.
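One way out is to raise an explicit exception instead of letting a None slip through. The sketch below is a simplified variant under stated assumptions: REGISTERED_ITEMS is replaced by a plain dictionary, the function takes the item id directly, and ItemNotRegisteredError is a hypothetical exception name.

```python
class ItemNotRegisteredError(LookupError):
    """Raised when an item id is unknown to the registry."""


# Hypothetical stand-in for the REGISTERED_ITEMS constant.
REGISTERED_ITEMS = {"42": {"name": "some_name"}}


def retrieve_registered_item_information(item_id: str) -> dict:
    """Return the item information or raise; never return None."""
    try:
        return REGISTERED_ITEMS[item_id]
    except KeyError:
        # Fail loudly with a dedicated exception the caller can handle.
        raise ItemNotRegisteredError(f"Item {item_id!r} is not registered") from None
```

The caller can no longer forget the check silently: either it gets usable information, or it has to deal with a clearly named error.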

Example

You have the following structured json object you want to extract the id from:

{
    "id": "42",
    "name": "some_name",
    "data": [...]
}

def get_object_id(object: dict) -> str | None:
    candidate_id = None
    try:
        candidate_id = object["id"]
    except KeyError as message:
        logger.error(f"Error retrieving the object id: {message}")
    return candidate_id

The above method is not ideal:

You have a mixed return type, str or None. You want your method to be type-consistent instead.
Python versions before 3.10 do not accept type annotations using the | operator; it was introduced in Python 3.10.

Instead, always favour the following accessor method as a nice remedy:

def get_object_id(object: dict) -> str:
    candidate_id = ""
    try:
        candidate_id = object["id"]
    except KeyError as message:
        logger.error(f"Error retrieving the object id: {message}")
    return candidate_id

There are multiple reasons and benefits for that:

You remove the return type ambiguity: whatever happens, you always return a string value.
It removes the type annotation error you might get on the old python versions otherwise.
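For custom objects, the special-case object mentioned earlier plays the same role as the empty string does here. A minimal sketch, where the Item class, the UNKNOWN_ITEM placeholder and the CATALOG registry are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    name: str

    def get_info(self) -> str:
        return f"{self.item_id}: {self.name}"


# Special-case object returned instead of None when nothing is found.
UNKNOWN_ITEM = Item(item_id="", name="unknown")

# Hypothetical registry of known items.
CATALOG = {"42": Item(item_id="42", name="some_name")}


def find_item(item_id: str) -> Item:
    """Always return an Item, so callers can chain get_info() safely."""
    return CATALOG.get(item_id, UNKNOWN_ITEM)
```

Callers can write find_item(some_id).get_info() without any None check, and compare against UNKNOWN_ITEM when they need to detect the missing case.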

Caution: Last but not least, note that a Python function implicitly returns None whenever it ends without returning a value, whether you write a bare return statement or none at all. The three following code snippets are equivalent:

def foo():
    pass

def faa():
    return

def fii():
    return None

You can try it yourself:

python> result = foo()
python> print(result)
None

The advantage of explicitly using a return is that it exits the function immediately, much like a break statement exits a loop:

def fuu():
    return 42
    a = 5
    return a

python> fuu()
42
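This early-exit behaviour is what makes guard clauses possible: you handle the special case first and return a type-consistent value, then write the main logic without nesting. A small sketch, in which the mean helper is a hypothetical example:

```python
def mean(grades: list) -> float:
    """Return the average grade, or 0.0 for an empty list."""
    # Guard clause: leave early instead of dividing by zero.
    if not grades:
        return 0.0
    return sum(grades) / len(grades)
```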

Notes:

We have used a logger object to handle logs for us. More on python logging in this article (in progress).
We have prefixed the names of our accessor methods using get. More on how to find meaningful names for your variables in this article (in progress).

As a conclusion, you do not want to rock the boat. Be careful when returning a null value and always favour other options. Your code will be way cleaner and you will minimize the chance of getting an error 🛶

What to expect from your manager

It is essential to pick your manager very wisely, a fortiori at the early stages of your corporate life, as a good manager can make a huge difference in your career and be a strong enabler. Your manager should be your number one ally, mentor and best advocate.

Characteristics of a good manager

Your manager is familiar with the bureaucracy of the company. He knows how to play the game and can get the attention of important people. As a liaison, he knows how to work effectively toward what the organization expects and which areas you need to focus on in order to grow expertise and bring satisfaction. He knows what matters most. Such managers are fantastic models to learn from.
As a mentor, he can point out and evaluate opportunities, and identify and assign stretch projects where you will learn things that matter for your career, thus helping you gather the knowledge, achievements and skills you need to get that promotion. Via his strong network of peers across the board and across companies, he can point you to any experts, mentors, conferences or resources that might be relevant for you.
He helps you understand the value of your work even when it is not glamorous at first glance, and provides guidance on how to resolve conflicts occurring within the team. He is trustworthy and reliable.

Note: one thing to keep in mind is that your work should speak for itself. If your work is appreciated by your team, i.e. you are a valuable contributor and a social driver (bringing people together instead of fueling dissension) and possess an extreme sense of ownership, chances are high that your manager will like you too. Remain yourself. And if you are striving toward expertise (ultimately, truth) and what is commonly seen as good, you cannot go wrong. Are you Senior Engineer provides a checklist in that direction.

Skills matrix

A good manager:

Helps you play the game and navigate the corporate ladder;
Points out opportunities and helps you focus on what really matters;
Knows what is important and maintains a productive environment;
Helps you achieve “mastery”;
Provides feedback early on to help you grow;
Schedules regular, predictable and dedicated 1-1s;
Does not use 1-1s as status meetings to discuss critical projects;
Allows vulnerability in front of each other to develop the necessary trust;
Praises your work in public and keeps the criticism for your 1-1s;
Possesses strong communication skills;
Advocates for your work and uses his network to support you;
Possesses strong technical foundations, giving him credibility;
Acts as a mentor.

In contrast, a bad manager:

Avoids meetings with you, always reschedules or replaces the agenda at the last minute.
Micromanages you, questions every detail and refuses to let you make any decisions.
Assigns you, without consultation, high-visibility projects destined to fail from the beginning, shifting the responsibility onto you and literally throwing you under the bus to avoid accountability. E.g. “Olivier is happy to assist you” sent to your 3rd-level manager. Yeah, thank you, good luck with that!
Gaslights you, presenting false information that makes you doubt your own memory, perception or sanity, e.g. “As discussed in the last meeting…” or “This is not what we have agreed upon”.
Takes part in office gossip or speaks ill of other employees. E.g. “You will see, John Doe is a low performer” is a no-go (on your first day and targeting a colleague with cancer!).

Notes:

Even if we are not used to receiving behavioral feedback from anyone other than our parents, do not be disoriented. Inevitably, everyone will screw up in some fashion, and such feedback is the fastest way to learn and progress.
Your manager remains human and strives toward what is best for the company and the team first. He will sometimes be stressed, make mistakes, be unfair, harmful or say silly things.
Should this be a recurring pattern though, bring it to his attention. If that is not possible, I would recommend that you speak to your skip-level manager, address a note with data points to your HR department, or try to change team or company whenever possible. Depending on the circumstances (it is hard to generalize), you might also wait, as those kinds of managers, if not supported by a deficient hierarchy, won’t last long. Those job switchers generally only focus on building a portfolio and leave after 1.5 years.

Questions to ask in interviews

An interview is a two-way street: it enables the future employer to know more about your likelihood to fit into the team, but it is also your best opportunity to know more about your future management style, team and colleagues. It is a rare chance to develop a feeling for the company before you accept (or reject) the job offer. It might either confirm or quash your initial beliefs. Last but not least, it is also a way for you to give a very nice first impression. Your aim is to show that you are already projecting yourself into the job, striving to be a technical asset and social enabler for those you are going to work with.

Note: if you are looking for red flags, you will always find some. Sometimes, ignoring what you think might be off is a good strategy. For me, receiving the job offer very shortly after the second interview (within a few days) always felt a bit weird, but more often than not led to very pleasant experiences. Of course, there is no way to tell whether or not you have dodged the bullet until you have waited long enough for the trigger to be pulled. The trial period is also a way for you to test the waters. On the rare occasions where the trial period turns out to be a bummer, you can always cut it short. Turn the associated perks to your own benefit and do use them!

Straight to the point, here are the questions. Feel free to pick from them:

What does your typical day look like? What are the key milestones of your days and weeks?
What are the main technical and managerial challenges you are currently facing? What solutions are you working toward?
How does the team stay in tune with current and emerging technologies?
What will the onboarding journey look like? What are the learning paths or processes in place? What is your mentoring process?
Before taking any final decision, would it be possible for me to meet the whole team and the manager?
When could I start?
What can I do to surpass your expectations and be a positive element of your team and organization?
What degree of initiative can one have within the team? How are you enabling the teams to be self-directed and proactive?
What is your technical stack? What working devices are provided? What kind of access rights do people have on their equipment?
What technical debt do you have? How are you coping with it?
How do you bring the team together? What are the biggest concerns shared across the team at the moment?
Who are your main stakeholders?
What is the vision the team is striving for? How is the team steering toward those goals?
How are you making sure you keep on track with the road map?
How do you cope with errors and mistakes when they occur?
Are there any career milestones and evolution pathways already in place? What would the perspectives look like?
How are the scopes, milestones, timelines and deliveries of a project estimated?
How do you disambiguate problem statements to get to the root of problems, incoming requests and situations?
What amount of detail should I provide to the manager for him to stay in the loop without drowning him in unnecessary information? What update frequency is satisfactory?
Where do you draw the line to find the right balance between action and delivery without over-compromising on quality?
How are your code, legacy and processes documented? What is your estimated coverage?
What standards and best practices do you have in place to guarantee good codebase quality? How do you ensure the reliability of your data pipelines?
Regarding Git and GitLab, what do your main CI pipeline jobs consist of?
What is your home-office policy? What actions are in place to stimulate the “working together” sentiment?
Besides my mother tongue, I speak English at a very proficient level. I also make sure to speak German, which I have been learning consistently for two years with the objective of being perfectly fluent by 2025, on a daily basis at minimum. So far, I can hold casual conversations. I intend to adopt the means of communication the team is most comfortable with. Should it be German, what would you expect from me to ease my integration within the team, quickly close any cultural gaps there might be and promote effective cooperation?
Which data stage are you at? E.g. monolithic on-premises systems, or already moved to the Cloud? Is reverse ETL in place? What are your observability and DataOps (DevOps and FinOps) strategies?
Is there an explicit agreement (SLA/SLO) between the upstream data source teams and the data engineering team?
Who are your upstream and downstream stakeholders?
How is the data architect/data engineering tandem working? How involved is each party in the decision-making process?
Are the workloads internal-facing (upstream from source systems to analytics and ML teams) or external-facing (feedback loop from the application to the data pipeline)?

Note: those questions have a purpose. On top of providing useful information for you to make your choice, they match the questions any Senior Data Engineer might have, proving at the same time that you have already earned your place in the Senior team. And if you already have those concerns in mind, congratulations, you are a Senior Data Engineer! 🥳

Are you a Senior Data Engineer?

It is not easy to know when you have reached this specific milestone. Here is a checklist to guide you through the process. Whether you keep it merely informational or strive for the Senior Data Engineer position, the following material might help you stay on track:

Note: to start from the beginning, you can check the scope of a Data Engineer in the What is a Data Engineer article.

You can conduct end-to-end projects with no or at least very limited guidance (e.g. to keep track of the legacy in place before you start). You are proactive and enabled.
You have a complete overview of the business (you know what matters most for the company) and own the technical stack (you have the full picture of the tooling in place and can intervene at any step of the process). You are a source of truth for your peers without being adamant about your viewpoints. You accept other valid solutions. You challenge but also respect code that came before you. There are probably reasons for everything that exists in production (it might even be an unclear business-related thingy people gradually became unaware of).
You are actively involved in the road map, bringing up initiatives and keeping track of progress. You are a mini Tech Lead and can support the Tech Lead on demand.
You can effectively communicate with non-technical employees, interpret and deliver on requests with minimal technical information. You are a relatable touch point for stakeholders, project managers and project owners.
You get involved in hiring for your team, leading (technical) interviews and presenting technical assessments. You can support the Team Lead on demand to maintain a high bar for hiring quality candidates.
You are accountable for the issues and errors that occur and can provide significant support for the team. You are a problem-solver. You do not pass the blame; the buck stops with you.
You have a track record of delivered products, projects and meaningful contributions across the board. You provide scalable solutions for high-risk projects without over-engineering. You can stay pragmatic.
You constantly stay in tune with the current and emerging technologies. You share your findings with your peers and build small prototypes.
You can accurately estimate the scope and timelines of your projects and deliver on the commitments you made. You can make your work measurable.
You disambiguate ambiguous problem statements, constantly asking “why” until you get to the root of the problems and situations.
You maintain a high quality, genuine and trustworthy network even outside your organization or core department. You have strong endorsements to help you navigate and grow in the company. You can pinpoint referrals and recommend precise people for mentorship. You know who works on what, with whom, when and how. You can explain what other people on the team are busy with.
You keep your manager in the loop without drowning them in unnecessary details (e.g. by sticking to data points). You can keep people up-to-date in an efficient manner while writing professional emails.
You are good at mentoring. You are the one others refer to and come to for guidance and advice. You are reachable and trustworthy. You are involved in multiple projects as consultant, reviewer and mentor. You can provide constructive feedback while staying away from office politics and gossip. Praise or say nothing, but never diminish a co-worker.
When working on a project, you focus on action and delivery without over-compromising on quality. You manage to push back if required. You strive for high-quality work (even from others, e.g. during PR reviews) without stretching yourself too thin to be effective. You relentlessly simplify code, systems and architectures without overdoing it. You know where the right balance stands. When the incremental cost to develop is too high, you proactively prioritize fixing the technical debt.
You extensively document (e.g. via READMEs, docstrings or Confluence) the “why” more than the “how” and demand the same from others. You are involved in grooming incoming requests and actively manage the onboarding and off-boarding of your peers.

Note: The list above is highly subjective. It is the result of my personal observations, looking at the Seniors performing at their best in the different workplaces I have worked at and scrutinizing what managers, mentors and C-level people expect. Every now and then, I skim through it to know where I stand. May it help you define your own agenda and lead effective 1-1s with your manager. All the best 💪🏻

Tools and Extensions for Data Engineering

Here are the tools I use daily as a Data Software Engineer.

Visual Studio Code Extensions

Name	Description	Purpose
GitLens	Supercharge Git within VSC.	See the last author, date and commit message for each line. I can retrieve the associated ticket in the Jira history (thanks to the existing git commit message conventions) or exchange directly with the original author.
Git Graph	View a Git Graph of your repository.	Helps troubleshoot git operations, e.g. traveling back in history or merging branches.
Code Spell Checker	Spelling checker for source code.	Avoid spelling mistakes in READMEs and docstrings.
Better TOML	Syntax highlighting for toml files.	Self-explanatory.
HashiCorp Terraform	Syntax highlighting and autocompletion for tf files.	Self-explanatory.
HashiCorp HCL	Syntax highlighting and autocompletion for terragrunt files.	Self-explanatory.

Command Line Interface Extensions

Since I develop on macOS, I use zsh, but equivalents also exist for bash.

Name	Description	Purpose
powerlevel10k	Theme for `zsh`.	Highlight the status and the branch you are pointing at.
zsh-autosuggestions	Suggests commands as you type based on history.	Display the existing make commands.

Git Utilities

Regarding git, you can:

create aliases (see How to create git aliases);
customize your git configuration (in progress).

Chrome Extensions

Name	Description
uBlock Origin	Block ads on your web browser.
JSON Viewer	JSON highlighter.
Jira Static Link	Copy static Jira link into the clipboard.
Confluence Static Link	Copy static Confluence link into the clipboard.

Git branch naming convention

When working on a project tracked with git, you will surely create branches. You have the main branch of course, but a good practice is also to have one branch per feature you are developing. Below is what it might look like in the end:

> git branch
* master
  dev/JIRA-1234
  dev/JIRA-5487_add_users_filtering
  dev/add_google_sign_in_authentication_form
  dev/ISSUE-987

Let’s have a closer look at how to name them (even though the above snippet already gives you a hint).

Main branch

The main or master branch is treated as the single source of truth, the official working code base. This is a place where everything must be working. It is the default branch you get when you initialize a new git project.

Notes:

In 2020 and 2021 respectively, GitHub and GitLab renamed their default branch from master to main to avoid language that might be seen as offensive.
If you do not want to politicize git and still prefer the old naming convention, you can still rename main to master manually. How to rename git branch explains how to do it.

Feature branches

It is time to add features to our main branch. Since main is the place where everything must be working, you first want to test your changes on a development branch. Usually, you end up with one branch per feature, each of them templated as follows:

dev/JIRA-1234
dev/JIRA-1234_add_users_filtering
dev/add_google_sign_in_authentication_form
dev/ISSUE-987

In a nutshell, the following rules apply:

Prefix the non-main branches with dev/. It makes it easier to trigger the dedicated GitLab CI jobs via branch filtering.
If you use an issue tracker like Jira, include the ticket’s ID. GitLab includes a Jira integration that can, for example, create links to the Jira ticket on the fly.
Besides the ticket ID, stick to good old snake_case. See also Why you should use Snake Case.
If you do not use an issue tracker e.g. for personal projects, simply describe the feature you are implementing.
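
The rules above can be sketched as a small helper that builds and checks branch names, for instance from a pre-push hook or a CI job. This is only a minimal illustration, not an official tool: the function names (to_snake_case, build_branch_name, is_valid_branch_name) are hypothetical, and the ticket format (uppercase letters, a dash, then digits, as in JIRA-1234) is an assumption derived from the examples in this article.

```python
import re

# Assumption: a ticket ID looks like JIRA-1234 or ISSUE-987.
TICKET = r"[A-Z]+-\d+"
# snake_case: lowercase alphanumeric words joined by single underscores.
SNAKE = r"[a-z0-9]+(?:_[a-z0-9]+)*"
# dev/<ticket>, dev/<ticket>_<description> or dev/<description>.
BRANCH_PATTERN = re.compile(rf"dev/(?:{TICKET}(?:_{SNAKE})?|{SNAKE})")


def to_snake_case(text: str) -> str:
    """Lowercase the text and join its alphanumeric words with underscores."""
    return "_".join(re.findall(r"[a-z0-9]+", text.lower()))


def build_branch_name(description: str = "", ticket_id: str = "") -> str:
    """Assemble a feature branch name from an optional ticket ID and description."""
    parts = [part for part in (ticket_id.upper(), to_snake_case(description)) if part]
    if not parts:
        raise ValueError("Provide a ticket ID, a description, or both.")
    return "dev/" + "_".join(parts)


def is_valid_branch_name(name: str) -> bool:
    """Return True for main/master or for a name following the dev/ convention."""
    return name in {"main", "master"} or BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    print(build_branch_name("Add users filtering", "JIRA-5487"))
    print(build_branch_name("Add Google sign-in authentication form"))
    print(is_valid_branch_name("feature/add_filter"))
```

Run as a script, this prints dev/JIRA-5487_add_users_filtering, dev/add_google_sign_in_authentication_form and False. Adapt the TICKET pattern if your issue tracker uses another ID format.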