Leaning Objectives
-
Build-up a general idea of what a SQL technical job interview is like;
-
Learn a handful of SQL best practices;
-
Get familiar with sqlite3
Python’s library.
Business-case
As a phone provider company, you want to know whom – among your list of customers – spent a total accumulated time of 10 or more minutes on the phone.
You can imagine that this list of identified top customers can be later used for marketing purposes.
This list of top customers should contain only the names, alphabetically sorted.
Hence, you have two tables:
-
The list of customers
;
-
The calls
history.
The table customers
contains the following information:
customer_id |
name |
phone |
1 |
Tim |
1234 |
2 |
John |
5678 |
3 |
Mona |
9101 |
The table calls
contains the following information:
id |
caller |
callee |
duration |
1 |
1234 |
5678 |
4 |
2 |
1234 |
9101 |
2 |
3 |
5678 |
1234 |
5 |
4 |
5678 |
9101 |
7 |
On this simple test-case dataset, the returned table should be:
Solution
The steps are the following:
-
For each caller – compute the total duration of their calls;
-
For each callee – compute the total duration of their calls;
-
A caller can be at other times a callee and the other way around. Thus, both tables must be merged into one using the phone number as identifier;
-
The total duration – as a caller and as a callee – must be summed up;
-
The filter can be applied to only retain the values for which the total duration is superior or equal to 10 minutes;
-
Last but not the least, in place of the phone number, we map with the associated name and sort the result alphabetically.
Local implementation
Setting sqlite3
For you to be able to work with the data itself, you can configure your own local SQL database, store the dataset and query the server with your own custom SQL statements.
Because it is easy to set up, I am using sqlite3 – a Python library.
The first step is to import the library. It is natively built-in along with your python’s distribution:
import sqlite3
Then, create a local file named database.db
where to store your tables. The command returns a connection object:
connection = sqlite3.connect("database.db")
Create a cursor
object. You can see the cursor like a mouse tracker that allow you to target specific actions on the database based on where the focus is actually on:
cursor = connection.cursor()
Using the cursor, you can create our two tables customers
and calls
:
cursor.execute(
"CREATE TABLE IF NOT EXISTS customers(customer_id, name, phone)"
)
cursor.execute(
"CREATE TABLE IF NOT EXISTS calls(id, caller, callee, duration)"
)
Then, prepare the data to be injected within our tables:
sql_query_customers = """
INSERT INTO
customers (customer_id, name, phone)
VALUES
(1, "Tim", "1234"),
(2, "John", "5678"),
(3, "Mona", "9101");
"""
sql_query_calls = """
INSERT INTO
calls (id, caller, callee, duration)
VALUES
(1, "1234", "5678", 4),
(2, "1234", "9101", 2),
(3, "5678", "1234", 5),
(4, "5678", "9101", 7);
"""
Execute the queries:
cursor.execute(sql_query_customers)
cursor.execute(sql_query_calls)
Do not forget the save the operation:
connection.commit()
Test that the data is actually there:
result = cursor.execute("SELECT * FROM customers")
print(result.fetchall())
This should return the following output:
[
(1, 'Tim', '1234'),
(2, 'John', '5678'),
(3, 'Mona', '9101')
]
Do not forget to close the connection:
connection.close()
Now that our environment is setup, let’s work toward the solution.
Building the intermediate tables
The first step is to find all the durations, grouped by caller for each callers:
WITH caller_durations AS (
SELECT
caller AS phone_number,
SUM(duration) AS duration
FROM calls
GROUP BY caller
)
SELECT * FROM caller_durations;
This gives us the following caller_durations
table:
phone_number |
duration |
1234 |
6 |
5678 |
12 |
Interpretation:
-
The customer associated with the phone number 1234 spent 6 minutes as a “caller”;
-
The customer associated with the phone number 5678 spent 12 minutes as a “caller”;
-
The customer associated with the phone number 9101 never called.
We do the same for the callee:
WITH callee_durations AS (
SELECT
callee AS phone_number,
SUM(duration) AS duration
FROM calls
GROUP BY callee
)
SELECT * FROM callee_durations;
Which gives us the following callee_durations
table:
phone_number |
duration |
1234 |
5 |
5678 |
4 |
9101 |
9 |
Interpretation:
-
The phone number 1234 received a total of 5 minutes call;
-
The phone number 5678 received a total of 4 minutes call;
-
The phone number 9101 received a total of 9 minutes call.
Next step is to combined both tables using a JOIN
statement:
WITH joined_duration AS (
SELECT
caller_durations.phone_number AS phone_number_caller,
caller_durations.duration AS duration_as_caller,
callee_durations.phone_number AS phone_number_callee,
callee_durations.duration AS duration_as_callee
FROM caller_durations
FULL JOIN callee_durations
ON caller_durations.phone_number=callee_durations.phone_number
)
SELECT * FROM joined_durations;
phone_number_caller |
duration_as_caller |
phone_number_callee |
duration_as_callee |
1234 |
6 |
1234 |
5 |
5678 |
12 |
5678 |
4 |
None |
None |
9101 |
9 |
Interpretation:
-
The phone number “1234” placed a total number of 6 minutes call as a caller and 5 as a callee;
-
The phone number “9101” never placed a phone call but received a couple of incoming phone calls for a total duration of 9 minutes;
-
The columns phone_number_caller
and phone_number_callee
should always refer to the same phone number.
We can reduce the above table in something more “readable”:
WITH trimmed_durations AS (
SELECT
COALESCE(phone_number_caller, phone_number_callee) AS phone,
COALESCE(duration_as_caller, 0) AS duration_as_caller,
COALESCE(duration_as_callee, 0) AS duration_as_callee
FROM joined_durations
)
SELECT * FROM trimmed_durations;
phone |
duration_as_caller |
duration_as_callee |
1234 |
6 |
5 |
5678 |
12 |
4 |
9101 |
0 |
9 |
Note: the COALESCE
function allows to replace the value by another one should the initial value be NULL
.
Now, we can add the durations to have the total_duration
i.e. the total number of minute the customer was one the phone:
WITH total_durations AS (
SELECT
phone AS phone,
duration_as_caller + duration_as_callee AS total_durations
FROM trimmed_durations
)
SELECT * FROM total_durations;
phone |
durations |
1234 |
11 |
5678 |
16 |
9101 |
9 |
It is now time to filter to only consider the durations >= 10:
WITH top_users AS (
SELECT
phone AS phone
FROM total_durations
WHERE total_durations >= 10
)
SELECT * FROM top_users;
However, the phone numbers must be replaced by the associated names. Thus, a second JOIN
is needed to map the values with the customers
table:
WITH final_table AS (
SELECT
customers.name AS name
FROM top_users
LEFT JOIN customers
ON top_users.phone=customers.phone
ORDER BY name ASC
)
SELECT * FROM final_table;
Which gives us the expected result: