Why stop using SFTP

You should stop using SFTP as it is an outdated technology that also comes with its share of disadvantages. The main problems with SFTP are the following:

  1. You have to keep the servers secure (network security, access control, …). It adds extra loads on top of your list of duties.

  2. SFTP servers only do a SFTP-to-bucket dump which is unnecessary if we would use buckets directly.

  3. As for Google Cloud Platform, SFTP servers mount GCP buckets as files systems to a Docker container. This is something that can be done but depending on the load sent to the SFTP can become a problem (volume limitations).

Note: SFTP can be used as an interface to pass on requests between the user and the bucket it mirrors. It uses user-friendly command lines, see SFTP basic CLI commands survival kit.

Alternative to SFTP

Beside providing a friendly interface to our customers (e.g. if they use FileZilla), why using SFTP is a no go? Preaching against it sounds like we are fighting against hell. Why Data Software Engineers so adamant about it? At the end, do we now why it is an issue? It is not the best practice but is it that hurtful?

To give you the short answer: there is nothing wrong with SFTP in general, it is just an outdated technology.

On the contrary, there are multiple benefits using buckets directly:

  1. Bucket security is done by GCP (of course, we have to do the access control – which is not so hard).

  2. The interface is also easy since you can see the bucket contents in the GCP console (no need for external tools like FileZilla).

  3. Buckets can handle a high amount of parallel reads.

It is not always possible though

As stated in the Fundamentals of Data Engineering book published by O’REILLY:

Data engineers rightfully cringe at the mention of SFTP. Regardless, SFTP is still a practical reality for many businesses. They work with partner businesses that consume or provide data using SFTP and are unwilling to rely on other standards.

To conclude, try to get ride of SFTP legacies but don’t get your knickers in a twist regarding this. 👟

sftp basic CLI commands survival guide

SFTP cheat sheet

> sftp <username>@<hostname>
sftp> lcd <change-the-local-working-directory-by-this-one>
sftp> cd <change-the-remote-working-directory-by-this-one>
sftp> get <copy-this-remote-file-into-the-local-working-directory>
sftp> put <copy-this-local-file-into-the-remote-working-directory>
sftp> exit

Note: for the put and get commands, you can add the -r flag should you download/upload directories (r standing for recursive).

Real Business Case Example – Missing data need to be reloaded on SFTP server.

You have csv files uploaded on a SFTP server at a daily rate. Those csv files are sent by an external provider (contain results from a marketing campaign).

Once a file is uploaded into the SFTP server, it is picked up by a function listening to upload events (e.g. GCP Cloud or AWS Lambda function).

When the function is triggered (after an upload event), it looks for the uploaded file matching the following pattern: feedback_YYYY_MM_DD.csv.

Then the function parses the matching file and aggregates the content into a table on your data warehouse.

You got a complain from your stakeholders that they have missing data for the 2022-11-13:

SELECT PARTITIONTIME as pt, COUNT(*) as nb_rows
FROM `datapool.marketing_campaign.feedbacks_report`
WHERE DATE(PARTITIONTIME) >= "2022-11-11"
GROUP BY PARTITIONTIME
ORDER BY PARTITIONTIME ASC

which returns the following:

pt nb_rows
2022-11-11 247
2022-11-12 308

Connecting on the sftp server (or direclty looking into the bucket it reflects – or even better – looking into the logs), you notice that there is a tipo in the filename for that missing day. So even if the upload event was detected, the file has not been parsed and ingested by the function (failed to match the defined pattern).

You need to download this file on your local machine, change the name and upload it back on the SFTP server so the ingestion process is processed again by the function:

> sftp obenard@olivierbenard.fr
obenard@olivierbenard.fr's password:
sftp> cd inputs_folder/
sftp> ls -l
feedback_2022_11_11.csv  feedback_2022_11_12.csv  fedback_2022_11_13.csv
sftp> lcd ~/Downloads/
sftp> get fedback_2022_11_13.csv

-- from your local machine, go to ~/Downloads/ and edit "fedback" for "feedback".

sftp> put feedback_2022_11_13.csv
sftp> ls -l
feedback_2022_11_11.csv  feedback_2022_11_12.csv  fedback_2022_11_13.csv  feedback_2022_11_13.csv
sftp> exit

Note: put feedback_2022_11_13.csv and not put ~/Downloads/feedback_2022_11_13.csv because your are already looking into the ~Downloads/ folder due to the lcd command you have executed before.

pt nb_rows
2022-11-11 247
2022-11-12 308
2022-11-13 296

The data is now there! ✅

Notes: Editing the name cannot be done directly from the sftp environement, e.g. using mv <old-name> <new-name>, because the function need an upload event to be triggered.

SFTP Basic File Transfer Commands in detail

  • sftp command connects you to the sftp server:

    > sftp <username>@<hostname>
    
  • exit command closes the connection:

    sftp> exit
    
  • lcd changes your Local Current working Directory on the local machine (remote file will be downloaded there):

    lcd <local-directory>
    
  • cd changes your Current working Directory on the remote sftp machine:

    sftp> cd <your-directory>
    
  • put command copies the file from the local machine to the remote sftp machine:

    sftp> put <local-file>
    
  • get command copy the file from the remote machine to the local machine:

    sftp> get <remote-file>
    

What is an SFTP server for

An SFTP server allows users to use an SFTP client – like the FileZilla software – to connect to the bucket but without directly connecting to the bucket itself. The SFTP server being used as primary interface then as relay to pass on requests on the bucket it is associated with. That way, any users using the client can query the bucket and perform operations such as retrieving, listing or consulting the files stored into that bucket.

Advantages are: the SFTP client provides a more “user-friendly” interface and you do not allow users to directly connecting into the bucket, but rather as its additional interface.

When not to use an SFTP server

On Google Cloud Platform, opening a new SFTP server is not the way to go: if set-up correctly, you do not need SFTP. We can give their google service account permissions to upload in the bucket (requires a google account though). The user can then use the gcloud command line tool to upload/download directly from the bucket.