Introduction

One of the common tasks when working with a data frame is dropping some of its columns. There are two ways to do it. You can either specify which columns you want to keep or which columns you want to drop.

One of the frequent questions I get is how to do these operations with DataFrames.jl in case the list of columns to keep or drop might not be a subset of columns of the data frame. This is the topic I want to cover in my today’s post.

The post was tested using Julia 1.9.2 and DataFrames.jl 1.6.1.

Standard column selection

First, create an example data frame:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2, c=3, d=4)
1×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

Now assume I want to keep columns :a and :c from it. You can do it by writing, for example:

julia> select(df, :a, :c)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

You could also pass the columns as a variable using e.g., a vector:

julia> keep1 = [:a, :c]
2-element Vector{Symbol}:
 :a
 :c

julia> select(df, keep1)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Now, let us discuss dropping columns. Assume we want to keep all columns except columns :b and :d. We can achieve this by using the Not command:

julia> select(df, Not(:b, :d))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Also in this case we can use a helper variable:

julia> drop1 = [:b, :d]
2-element Vector{Symbol}:
 :b
 :d

julia> select(df, Not(drop1))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

The problematic case: the selected column is not present in a data frame

In some scenarios, we might want to provide a list of columns of which not all are present in the data frame. For example, assume we want to keep columns :a and :x. We see that the :x column is not present in our df data frame.

Before we move forward, let me comment when such a situation occurs most often. Assume you have 100 data frames that describe your data. Each data frame is similar, but not identical. For example, a single data frame might represent data from one country and the list of information for the countries does not have to be identical (for some countries we might have more information, which results in more columns in a data frame). When processing such data we might want to write one general condition on which columns we want to keep or drop, and some of these columns might be present in only a subset of all data frames.

Now let us go back to our example. Let us try keeping columns :a and :x:

julia> select(df, :a, :x)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

julia> keep2 = [:a, :x]
2-element Vector{Symbol}:
 :a
 :x

julia> select(df, keep2)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

We get an error. DataFrames.jl is designed to check, by default, that the operation you want to perform on your data frame is valid. This is a conscious design decision. The reason is that in production application settings most often when you say that you want to keep columns :a and :x you assume that they are present in df. Thus you want to get an error if they would not be all present in it.

The same behavior can be observed for dropping columns. Assume we want to drop columns :b and :x:

julia> select(df, Not(:b, :x))
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

julia> drop2 = [:b, :x]
2-element Vector{Symbol}:
 :b
 :x

julia> select(df, drop2)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

So what we saw here is a default behavior that was designed to be safe. In what follows let me discuss how to perform a flexible selection.

Performing column selection when some of them are not present in a data frame

There are several solutions for column selection when some of them are not present in a data frame. Let me present the one that I find the most convenient. For this operation I typically use the Cols selector. The reason is that you can pass a condition function (a predicate) as an argument to Cols that will select columns whose names meet a passed condition.

Therefore the following operation:

julia> select(df, keep1)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Is the same as:

julia> keep1s = string.(keep1)
2-element Vector{String}:
 "a"
 "c"

julia> select(df, Cols(in(keep1s)))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Note what we did here. The in(keep1s) expression produces a function that checks if a value passed to it is in the keep1s vector. It is important to note that although column selection in DataFrames.jl accepts both Symbol (like :a) and strings (like "a") as column names the Cols-based selector will perform the check against strings. Therefore I had to convert the keep1 vector of symbols to a keep1s vector of strings.

So far the select(df, Cols(in(keep1s))) is more verbose than just writing select(df, keep1). However, the benefit of Cols is that when the in(keep1s) check is done we can have in keep1s vector whatever values we like, in particular, they do not have be valid column names of our df.

Therefore to keep columns :a and :x, if we are unsure if these columns are present in df we can write:

julia> select(df, Cols(in(["a", "x"])))
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

Note that this time the operation works without an error. And again, please keep in mind that in this selector we need to pass column names as strings.

Now you can probably already tell how to pass a list of columns to drop, without requiring that they are present in df. The only thing to do is to use the !in function (not-in) instead of in. Let us drop columns :b and :x from our data frame (keeping in mind that :x is not present in it):

julia> select(df, Cols(!in(["b", "x"])))
1×3 DataFrame
 Row │ a      c      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      4

All worked as expected - the :b column was dropped and the :x column was ignored in the column dropping operation.

Conclusions

I hope you find the examples I gave today useful.

In general, the whole design of DataFrames.jl is similar to what was discussed in this post. The default behavior is picked to be safe (as in our example: by default, select checks if the columns you pass are present in a data frame), but it is possible to switch to an unsafe mode relatively easily (in our example: using Cols with a predicate function).