Once the data is loaded on the data warehouse, it can be stored into different environments:
-
datalake: protected staging environment with limited access to Data Engineers only. It receives raw data from ingesting pipelines. Typically, tables there only contains few columns e.g. content (containing stringify json) and the ingested_at ISO 8601 timestamp.
-
datapool-pii: structured/prepared environment where data from datalake is parsed and technical transformations are applied on the data. Typically, the previous content field is parsed (e.g. using JSON_EXTRACT() DML alike-functions on BigQuery) and split into different fields. Accredited stakeholders can directly pick data from there. If sensitive information were on the data, they are also to be found on this environment.
-
datapool: same as for the datapool-pii environment but the sensitive information are removed.
-
datamarts: environments aggregating environments with integrated business objects and aggregations for metrics, intelligence, science and analytics. For stakeholders. You can see them as data business products. Generally, those tables are owned by an expert group, e.g. the Data Modellers and Business Intelligence team.
Note: PII stands for Personally Identifiable Information.
End-to-end Data Journey Example
Let’s examine an end-to-end business case to see how the different environments intervene into the general workflow.
Let’s say that you have the following json data containing some statistics per active users (e.g. on an app or website):
{
"active_users": [
{
"id": "12e57e",
"email": "john.doe@gmail.com",
"results": [
{
"stats": {"views": 12587, "spend": 8000}
}
]
},
{
"id": "r87e5z",
"email": "jane.doe@gmail.com",
"results": [
{
"stats": {"views": 97553, "spend": 10000}
}
]
},
{
"id": "8tr75z",
"email": "johnny.doe@gmail.com",
"results": [
{
"stats": {"views": 41239, "spend": 5000}
}
]
}
]
}
You ingested the above json data into the following raw table on datalake:
content |
ingest_at |
{“active_users”:[{“id”:”12e57e”,”email”:”john.doe@gmail.com”,”results”:[{“stats”:{“views”:12587,”spend”:8000}}]},{“id”:”r87e5z”,”email”:”jane.doe@gmail.com”,”results”:[{“stats”:{“views”:97553,”spend”:10000}}]},{“id”:”8tr75z”,”email”:”johnny.doe@gmail.com”,”results”:[{“stats”:{“views”:41239,”spend”:5000}}]}]} |
2022-12-14T04:00:00.000 UTC |
Note: content being a STRING and ingested_at a TIMESTAMP.
The table is then parsed and stored into the datapool-pii environment:
id |
email |
views |
spend |
12e57e |
john.doe@gmail.com |
12587 |
8000 |
r87e5z |
jane.doe@gmail.com |
97553 |
10000 |
8tr75z |
johnny.doe@gmail.com |
41239 |
5000 |
This table contains sensitive information (e.g. pii data). Thus, before granting access to stakeholders, one must hide or hash the sensitive fields:
id |
email |
views |
spend |
12e57e |
375320dd9ae7ed408002f3768e16cb5f28c861062fd50dff9a3bff62e9dce4ef |
12587 |
8000 |
r87e5z |
831f6494ad6be4fcb3a724c3d5fef22d3ceffa3c62ef3a7984e45a0ea177f982 |
97553 |
10000 |
8tr75z |
980b542e198802ebe4d57690a1ad3e587b5751cb537209112c0564b52bd0699f |
41239 |
5000 |
The above table is stored on datapool.
Then, on datamart you will find this table aggregated with other metrics and measurements, ready to be used and consumed by the vertical teams.
To go even further
To parse the raw json content field, you can use an Airflow job that will regularly execute the following SQL
query:
{% set datalake_project = var.value.datalake_project %}
WITH
raw_lake_table AS (
SELECT
JSON_EXTRACT_ARRAY(content, '$.active_users') AS active_users,
ingested_at as ingested_at
FROM
`{{datalake_project}}.dataset.table`
WHERE DATE(ingested_at) = "{{ds}}"
)
SELECT
JSON_EXTRACT_SCALAR(unnested_active_users, '$.id') as id,
JSON_EXTRACT_SCALAR(unnested_active_users, '$.email') as email,
JSON_EXTRACT(results, '$.stats.views') as views,
JSON_EXTRACT(results, '$.stats.spend') as spend,
ingested_at
FROM
raw_lake_table,
UNNEST(active_users) as unnested_active_users
LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(unnested_active_users, '$.results')) as results
Note: You can use Jinja templated variables to parametrize your query. In our case, {{datalake_project}}
will be replaced at execution time by datalake-prod
or datalake-dev
. In the same manner, {{ds}}
will be replaced by the current YYYY-MM-DD date. See Airflow Templates Reference for more information.
To hash using sha256
on python:
python3> import hashlib
python3> hashlib.sha256("your-string".encode('utf-8')).hexdigest()
'42c23acbf1b1c471c7b53efe58f34dea361d941f47584265df5dbaec1bfddd49'
On BigQuery, you can directly use the SHA256 method to encrypt your fields in your data-pipelines:
SELECT SHA256("your-string") as sha256;
+----------------------------------------------+
| sha256 |
+----------------------------------------------+
| QsI6y/GxxHHHtT7+WPNN6jYdlB9HWEJl31267Bv93Uk= |
+----------------------------------------------+
Caution: The return type is BYTES
in base64
format. If you want to compare it with the returned above Python value, then used TO_HEX
.
SELECT TO_HEX(SHA256("your-string")) as sha256;
+------------------------------------------------------------------+
| sha256 |
+------------------------------------------------------------------+
| 42c23acbf1b1c471c7b53efe58f34dea361d941f47584265df5dbaec1bfddd49 |
+------------------------------------------------------------------+
Note: Even better, use SHA512.