Introduction

This week I have read a post on Why Vim is better than VSCode. In it the author discusses a lot the operator - text object - motion pattern in Vim. The post argues that it is not only efficient but fun to learn and use.

It reminded me of the structure of the operation specification language we have in DataFrames.jl that follows the pattern:

input columns => transformation => output column names

I have already written two posts about this topic that you can find here and here. Therefore, today I decided to take the fun part of using the minilanguage.

The post is written under Julia 1.7.2 and DataFrames.jl 1.3.4.

The challenge

The user has some data frame and wants to drop a :col column from it, but the user is not sure if this column is present in the data frame.

Let us first create two test data frames on which we will test our solutions:

julia> using DataFrames

julia> df1 = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> df2 = DataFrame(a=1:2, col=["drop", "me"], b=3:4)
2×3 DataFrame
 Row │ a      col     b
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  drop        3
   2 │     2  me          4

A basic approach

A natural thing to try is using the Not selector for this task. Let us check it:

julia> select(df1, Not(:col))
ERROR: ArgumentError: column name :col not found in the data frame

julia> select(df2, Not(:col))
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

The operation worked on df2, but failed on df1.

You might ask why Not selector is so restrictive? The reason is to avoid bugs. You could accidentally mistype column name and then, if such operation worked, instead of erroring, your incorrect result would propagate.

An intermediate solution

A first solution that comes to mind is to drop the column only if it is present in a data frame so you might write something like this:

julia> select(df1, names(df1) .!= "col")
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> select(df2, names(df2) .!= "col")
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

This works, but you need to write the name of the source data frame twice, so the solution feels a bit heavy.

The fun part

What is the way I find nice to do this operation then? Here is the approach:

julia> select(df1, Cols(!=("col")))
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> select(df2, Cols(!=("col")))
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

We are using a combo of a bit advanced features here.

First !=("col") creates a function that compares its argument to "col" using !=. This is a very nice feature of Base Julia that it allows partial function application for the != operator.

Next the Cols function accepts a predicate, in our case !=("col"). Then it selects all columns of a data frame for which this predicate returns true.

Conclusions

The beauty of Julia is that it not only does the job you want done, but also is quite fun to code with. At the same time, its design often helps you with catching common possible bugs in code (like the Not behavior I have described in this post).

Enjoy!