jq linux command for JSON processing

Recently I came across a very nice tool to process JSON data from your terminal: jq.

jq is a nice tool to use in your command lines to filter and process JSON streams that you can link directly in your CLI pipes.

Installation

zsh> brew install jq # MacOS
zsh> jq --version
jq-1.6

 Examples

jq can be used to prettify JSON inputs:

zsh> echo '{"fruits": ["apple", "peer", "banana"]}' | jq "."
{
"fruits": [
    "apple",
    "peer",
    "banana"
]
}

It can also read directly from files:

zsh> cat >frenchmen.json<<EOF
heredoc> {"frenchmen": [{"name": "Napoleon"}, {"name": "Degaulle"}, {"name": "Exupery"}]}
heredoc> EOF

zsh> jq "." frenchmen.json
{
"frenchmen": [
    {
    "name": "Napoleon"
    },
    {
    "name": "Degaulle"
    },
    {
    "name": "Exupery"
    }
]
}

You can access specific properties e.g. “name”:

zsh> jq '.frenchmen | .[].name' frenchmen.json
"Napoleon"
"Degaulle"
"Exupery"

You can slice through arrays:

zsh> echo '[1,2,3,4,5,6]' | jq ".[2:-1]"
[
    3,
    4,
    5
]

You can also use functions:

zsh> echo '["blue", "white", "red"]' | jq '.[] | length'
4
5
3

zsh> echo '["blue", "white", "red"]' | jq '[.[] | length] | max'
5

zsh> echo '[1,2,3,4,5,6]' | jq '.[] | select(.%2==0)'
2
4
6

A more extensive list can be found in this article: https://www.baeldung.com/linux/jq-command-json

Change the URL of a Git repository

What you have to do:

  1. On Github/Gitlab, change the url of (remote) repository via the UI.
  2. On your local repo, use git remote set-url origin <new-url> to replicate the changes and thus restore the bridge between the remote and local repo.
  3. Confirm the changes are now reflected, using git remote -v.

Example

You created a Gitlab repo called client-snapchat-api but you made a mistake. The name should be snapchat-api-client instead. Additional complexity: you have cloned the project on your local environment already.

Note: the name convention related to API clients is quite standard in the industry; the -client part always suffixing (and not prefixing) the client’s name i.e. name-of-the-api-client and not client-name-of-the-api.

First step is to perform the modifications on the remote repository; via Gitlab’s settings on the User Interface, you edit to your likings:

  1. The project’s name
  2. The project’s path via the Settings Advanced section.

Notes:

  • It is common for the repo’s URL to respect the kebab-case formatting. See also Why using snake_case to spotlight the differences.
  • It is common for the project’s name and the repo’s URL to be the same.

Second step (since you have already cloned the project on local) is to reflect those new remote settings in your local repository.

This because your local project still uses the old URLs as “origin” (i.e. the source) to track down, fetch (pull) and push the incoming and out-coming changes.

In other terms, your local repo is still linked to the old remote URLs. But those URLs are no longer attached to an existing remote Gitlab project since you have just changed them (depreciated):

> git remote -v
origin git@gitlab.com:obenard/old-project-in-kebab-case.git (fetch)
origin git@gitlab.com:obenard/old-project-in-kebab-case.git (push)

Notes:

  • The -v option is short for --verbose
  • The URLs for fetching and pushing (i.e. where you get the changes from and where you push the changes at) can be different.

The following image may help you: it’s like uprooting a tree and plotting it somewhere else. The leaves being the local repos instantiated by your developers and the root the remote repository (aka the unique source of truth). You need to rejuvenate the thing by reconnecting the leaves to the torso.

In our example, our local repo is pointing at the now depreciated URLs:

> git remote -v
origin git@gitlab.com:obenard/client-snapchat-api.git (fetch)
origin git@gitlab.com:obenard/client-snapchat-api.git (push)

To edit the URLs:

git remote set-url origin git@gitlab.com:obenard/snapchat-api-client.git

To see the result:

> git remote -v
origin git@gitlab.com:obenard/snapchat-api-client.git (fetch)
origin git@gitlab.com:obenard/snapchat-api-client.git (push)

Congratulations, your local project is now again linked with a valid remote Gitlab repository and can send or retrieve information from it.

Run the extra mile: URI, URL and URN

Before getting any further in polishing our 360° overview and have a full grasp over the concept, we need to understand the difference between URIs, URLs and URNs:

  • An URI – Uniform Resource Identifier – is an identifier, like the id primary key in a table or the social security number for a person. It is used to uniquely discriminate a resource e.g. a Gitlab repository (two Gitlab repositories cannot have the name URI since you won’t know which one you are referring to otherwise).

  • An URL – Uniform Resource Locator – is an URI but with the additional specificity of also being a locator. An URL allows you to locate and access a resource on the Internet e.g. a web page like https://olivierbenard.fr/how-to-create-git-aliases/. This page is unique across the Internet: you can not find the same URL anywhere else, and also, the URL is associated with a page you can navigate through.

  • Every URLs are URIs but there are URIs which are not URLs. For instance, URN – Uniform Resource Name – uniquely identifies a resource by a name in a particular namespace. It is a nice way to talk about a resource but without implying anything about its location or how to access it. URNs are intended to be unique across time and space e.g. the ISBN – International Standard Book Number – is a unique worldwide book identifier.

Note: speaking about ISBNs, you may find book recommendations to get started as Data Engineer in the related article What is a Data Engineer.

Now that the semantic is set, you will notice that I have lied to you 😱

I have indeed implied that we will have to change the URLs of a Git repository. This was an incorrect statement.

What we have done instead with the git remote set-url origin <new-uri> command was to change the URIs, not the URLs:

> git remote -v
origin  git@github.com:olivierbenard/olivierbenard.git (fetch)
origin  git@github.com:olivierbenard/olivierbenard.git (push)

This makes sense: if you have a closer look, you will notice that the git@github.com:olivierbenard/olivierbenard.git thingy does not lead anywhere. For good reason: it is a pure URI!

And for sure, the repo is located by the following https://github.com/olivierbenard URL.

Notes:

  • To be even more specific, the command is changing the URIs and then, Gitlab takes over, changing the URLs in the background.
  • The same result as with git remote set-url origin <new-uri> can be obtained directly by editing the ~/local-git-repo/.git/config file:

    [remote "origin"]
        url = git@github.com:olivierbenard/olivierbenard.git
        fetch = +refs/heads/*:refs/remotes/origin/*
    [branch "master"]
            remote = origin
            merge = refs/heads/master
    

And you, what’s your excuse for not having a green thumb? 🌱

How to create Git Aliases

There are two ways to create custom git aliases:
1. Using the Command Line Interface.
2. Directly editing the git config files.

We will have a look on both methods.

Via the Command Line Interface (CLI)

For instance, git status can be shortened into git s:

git config --global alias.s status

Note: in this example, we are configuring a git alias so git status and git s will be equivalent. Therefore, git status and git s will return the same output.

Editing one of the three git config files

  • .git/config (at your git local repository level)
  • ~/.gitconfig (global)
  • ~/.config/git/config (global)

Just add the following lines within the file, e.g. using vim or nano:

[alias]
    s = status

Notes:

  • If you only edit the git config file at your local repository level, the alias will only be accessible within your current git project.
  • If you set-up the alias at one of the global git config file, the alias will be usable across your overall local environment.

List all the alias you have defined

alias = "!git config -l | grep alias | cut -c 7-"

Note: the exclamation mark tells git that everything within the quotes will be a bash script, therefore it will gives us access to the usual bash commands like grep or cut.

Example of git aliases

Here is a couple of aliases you might find useful to replicate in your configuration:

[alias]
    c=commit
    cm=commit -m
    s=status
    sw=switch
    a=add
    alias=!git config -l | grep alias | cut -c 7-
    cam=commit -am
    lo=log --oneline
    sc=switch -c
    rsm=rm -r --cached
    asm=submodule add
    reinit=!git submodule update --init --recursive && git submodule update --remote

Why using git aliases

It simply makes your life easier since you do not have to type long commands anymore. You can simply call them using a shorter name.

After you have typed the same command again and again 10+ a day, you will start to love git aliases.

Did I hear someone say that software developers are lazy? 😈

sftp basic CLI commands survival guide

SFTP cheat sheet

> sftp <username>@<hostname>
sftp> lcd <change-the-local-working-directory-by-this-one>
sftp> cd <change-the-remote-working-directory-by-this-one>
sftp> get <copy-this-remote-file-into-the-local-working-directory>
sftp> put <copy-this-local-file-into-the-remote-working-directory>
sftp> exit

Note: for the put and get commands, you can add the -r flag should you download/upload directories (r standing for recursive).

Real Business Case Example – Missing data need to be reloaded on SFTP server.

You have csv files uploaded on a SFTP server at a daily rate. Those csv files are sent by an external provider (contain results from a marketing campaign).

Once a file is uploaded into the SFTP server, it is picked up by a function listening to upload events (e.g. GCP Cloud or AWS Lambda function).

When the function is triggered (after an upload event), it looks for the uploaded file matching the following pattern: feedback_YYYY_MM_DD.csv.

Then the function parses the matching file and aggregates the content into a table on your data warehouse.

You got a complain from your stakeholders that they have missing data for the 2022-11-13:

SELECT PARTITIONTIME as pt, COUNT(*) as nb_rows
FROM `datapool.marketing_campaign.feedbacks_report`
WHERE DATE(PARTITIONTIME) >= "2022-11-11"
GROUP BY PARTITIONTIME
ORDER BY PARTITIONTIME ASC

which returns the following:

pt nb_rows
2022-11-11 247
2022-11-12 308

Connecting on the sftp server (or direclty looking into the bucket it reflects – or even better – looking into the logs), you notice that there is a tipo in the filename for that missing day. So even if the upload event was detected, the file has not been parsed and ingested by the function (failed to match the defined pattern).

You need to download this file on your local machine, change the name and upload it back on the SFTP server so the ingestion process is processed again by the function:

> sftp obenard@olivierbenard.fr
obenard@olivierbenard.fr's password:
sftp> cd inputs_folder/
sftp> ls -l
feedback_2022_11_11.csv  feedback_2022_11_12.csv  fedback_2022_11_13.csv
sftp> lcd ~/Downloads/
sftp> get fedback_2022_11_13.csv

-- from your local machine, go to ~/Downloads/ and edit "fedback" for "feedback".

sftp> put feedback_2022_11_13.csv
sftp> ls -l
feedback_2022_11_11.csv  feedback_2022_11_12.csv  fedback_2022_11_13.csv  feedback_2022_11_13.csv
sftp> exit

Note: put feedback_2022_11_13.csv and not put ~/Downloads/feedback_2022_11_13.csv because your are already looking into the ~Downloads/ folder due to the lcd command you have executed before.

pt nb_rows
2022-11-11 247
2022-11-12 308
2022-11-13 296

The data is now there! ✅

Notes: Editing the name cannot be done directly from the sftp environement, e.g. using mv <old-name> <new-name>, because the function need an upload event to be triggered.

SFTP Basic File Transfer Commands in detail

  • sftp command connects you to the sftp server:

    > sftp <username>@<hostname>
    
  • exit command closes the connection:

    sftp> exit
    
  • lcd changes your Local Current working Directory on the local machine (remote file will be downloaded there):

    lcd <local-directory>
    
  • cd changes your Current working Directory on the remote sftp machine:

    sftp> cd <your-directory>
    
  • put command copies the file from the local machine to the remote sftp machine:

    sftp> put <local-file>
    
  • get command copy the file from the remote machine to the local machine:

    sftp> get <remote-file>
    

What is an SFTP server for

An SFTP server allows users to use an SFTP client – like the FileZilla software – to connect to the bucket but without directly connecting to the bucket itself. The SFTP server being used as primary interface then as relay to pass on requests on the bucket it is associated with. That way, any users using the client can query the bucket and perform operations such as retrieving, listing or consulting the files stored into that bucket.

Advantages are: the SFTP client provides a more “user-friendly” interface and you do not allow users to directly connecting into the bucket, but rather as its additional interface.

When not to use an SFTP server

On Google Cloud Platform, opening a new SFTP server is not the way to go: if set-up correctly, you do not need SFTP. We can give their google service account permissions to upload in the bucket (requires a google account though). The user can then use the gcloud command line tool to upload/download directly from the bucket.