Difference between datalake, datapool and datamart

Once the data is loaded into the data warehouse, it can be stored in different environments:

  • datalake: a protected staging environment with access restricted to Data Engineers only. It receives raw data from the ingestion pipelines. Typically, tables there contain only a few columns, e.g. content (holding the stringified JSON) and an ingested_at ISO 8601 timestamp (a minimal table sketch is given after this list).

  • datapool-pii: a structured/prepared environment where data from the datalake is parsed and technical transformations are applied. Typically, the former content field is parsed (e.g. using JSON functions such as JSON_EXTRACT() on BigQuery) and split into dedicated fields. Accredited stakeholders can query data directly from there. If the data contains sensitive information, it is still present in this environment.

  • datapool: the same as the datapool-pii environment, but with the sensitive information removed.

  • datamarts: environments exposing integrated business objects and aggregations for metrics, business intelligence, data science and analytics, intended for stakeholders. You can see them as data business products. Generally, those tables are owned by an expert group, e.g. the Data Modellers and Business Intelligence team.

Note: PII stands for Personally Identifiable Information.
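
To make the datalake layout concrete, here is a minimal sketch of what such a raw table could look like in BigQuery (the project, dataset and table names are placeholders, and partitioning by ingestion date is one common choice rather than a requirement):

-- Hypothetical raw datalake table: one row per ingested payload.
-- content holds the stringified JSON, ingested_at the ingestion timestamp.
CREATE TABLE `datalake-prod.dataset.table` (
  content     STRING    OPTIONS (description = 'raw payload as stringified JSON'),
  ingested_at TIMESTAMP OPTIONS (description = 'ingestion time, ISO 8601')
)
PARTITION BY DATE(ingested_at);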

End-to-end Data Journey Example

Let’s examine an end-to-end business case to see how the different environments come into play in the overall workflow.

Let’s say that you have the following JSON data containing some statistics per active user (e.g. of an app or website):

{
    "active_users": [
        {
            "id": "12e57e",
            "email": "john.doe@gmail.com",
            "results": [
                {
                    "stats": {"views": 12587, "spend": 8000}
                }
            ]
        },
        {
            "id": "r87e5z",
            "email": "jane.doe@gmail.com",
            "results": [
                {
                    "stats": {"views": 97553, "spend": 10000}
                }
            ]
        },
        {
            "id": "8tr75z",
            "email": "johnny.doe@gmail.com",
            "results": [
                {
                    "stats": {"views": 41239, "spend": 5000}
                }
            ]
        }
    ]
}

You ingest the above JSON data into the following raw table in the datalake:

content ingested_at
{"active_users":[{"id":"12e57e","email":"john.doe@gmail.com","results":[{"stats":{"views":12587,"spend":8000}}]},{"id":"r87e5z","email":"jane.doe@gmail.com","results":[{"stats":{"views":97553,"spend":10000}}]},{"id":"8tr75z","email":"johnny.doe@gmail.com","results":[{"stats":{"views":41239,"spend":5000}}]}]} 2022-12-14T04:00:00.000 UTC

Note: content is a STRING and ingested_at is a TIMESTAMP.

The table is then parsed and stored in the datapool-pii environment:

id email views spend
12e57e john.doe@gmail.com 12587 8000
r87e5z jane.doe@gmail.com 97553 10000
8tr75z johnny.doe@gmail.com 41239 5000

This table contains sensitive information (i.e. PII). Thus, before granting access to stakeholders, one must remove or hash the sensitive fields:

id email views spend
12e57e 375320dd9ae7ed408002f3768e16cb5f28c861062fd50dff9a3bff62e9dce4ef 12587 8000
r87e5z 831f6494ad6be4fcb3a724c3d5fef22d3ceffa3c62ef3a7984e45a0ea177f982 97553 10000
8tr75z 980b542e198802ebe4d57690a1ad3e587b5751cb537209112c0564b52bd0699f 41239 5000

The above table is stored in the datapool.
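
As a sketch (the project, dataset and table names below are hypothetical), the datapool table can be derived from the datapool-pii one by hashing the email column on the fly:

-- Hypothetical masking query from datapool-pii to datapool:
-- the email is replaced by its hex-encoded SHA256 digest.
SELECT
  id,
  TO_HEX(SHA256(email)) AS email,
  views,
  spend
FROM
  `datapool-pii-prod.dataset.active_users_stats`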

Then, in the datamarts, you will find this table aggregated with other metrics and measurements, ready to be consumed by the vertical teams.
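
As an illustration, a datamart table could aggregate those figures per ingestion day; a minimal sketch, assuming the datapool table also carries the ingested_at timestamp and lives under a hypothetical datapool-prod project:

-- Hypothetical datamart aggregation: one row per day,
-- summing views and spend over all active users.
SELECT
  DATE(ingested_at) AS date,
  COUNT(DISTINCT id) AS active_users,
  SUM(views) AS total_views,
  SUM(spend) AS total_spend
FROM
  `datapool-prod.dataset.active_users_stats`
GROUP BY
  date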

To go even further

To parse the raw JSON content field, you can use an Airflow job that regularly executes the following SQL query:

{% set datalake_project = var.value.datalake_project %}

WITH
  raw_lake_table AS (
  SELECT
    -- Turn the raw JSON string into an array of user objects
    JSON_EXTRACT_ARRAY(content, '$.active_users') AS active_users,
    ingested_at
  FROM
    `{{datalake_project}}.dataset.table`
  WHERE DATE(ingested_at) = "{{ds}}"
  )
SELECT
  JSON_EXTRACT_SCALAR(unnested_active_users, '$.id') AS id,
  JSON_EXTRACT_SCALAR(unnested_active_users, '$.email') AS email,
  -- JSON_EXTRACT_SCALAR strips the JSON quoting; CAST yields proper integers
  CAST(JSON_EXTRACT_SCALAR(results, '$.stats.views') AS INT64) AS views,
  CAST(JSON_EXTRACT_SCALAR(results, '$.stats.spend') AS INT64) AS spend,
  ingested_at
FROM
  raw_lake_table,
  -- One row per element of the active_users array
  UNNEST(active_users) AS unnested_active_users
  -- LEFT JOIN keeps users whose results array is empty
  LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(unnested_active_users, '$.results')) AS results

Note: You can use Jinja-templated variables to parametrize your query. In our case, {{datalake_project}} will be replaced at execution time by datalake-prod or datalake-dev. In the same manner, {{ds}} will be replaced by the DAG run's date in YYYY-MM-DD format. See the Airflow Templates Reference for more information.
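
For instance, on the production project and for the 2022-12-14 run, the templated part of the query above would render as:

  FROM
    `datalake-prod.dataset.table`
  WHERE DATE(ingested_at) = "2022-12-14"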

To hash using SHA-256 in Python:

>>> import hashlib
>>> hashlib.sha256("your-string".encode('utf-8')).hexdigest()
'42c23acbf1b1c471c7b53efe58f34dea361d941f47584265df5dbaec1bfddd49'

On BigQuery, you can directly use the SHA256 function to hash fields in your data pipelines:

SELECT SHA256("your-string") as sha256;
+----------------------------------------------+
| sha256                                       |
+----------------------------------------------+
| QsI6y/GxxHHHtT7+WPNN6jYdlB9HWEJl31267Bv93Uk= |
+----------------------------------------------+

Caution: the return type is BYTES, displayed in base64 format. If you want to compare it with the Python value returned above, use TO_HEX:

SELECT TO_HEX(SHA256("your-string")) as sha256;
+------------------------------------------------------------------+
| sha256                                                           |
+------------------------------------------------------------------+
| 42c23acbf1b1c471c7b53efe58f34dea361d941f47584265df5dbaec1bfddd49 |
+------------------------------------------------------------------+

Note: Even better, use SHA512.
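
The call is analogous on BigQuery, with a longer digest (64 bytes instead of 32):

SELECT TO_HEX(SHA512("your-string")) AS sha512;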