Introduction

In my last post I have discussed row selector rules for data frames in DataFrames.jl.

This time I will cover available column selector options.

This post was written under Julia 1.5.3 and DataFrames 0.22.5.

First we set up the environment:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2, x1=3, x2=4, y1=5, y2=6)
1×6 DataFrame
 Row │ a      b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      2      3      4      5      6

Column selection in indexing syntax

A starting point of column selection syntax is indexing. I will go through all available options. In later sections of this post I discuss other functions that support column indexing as they build upon the indexing syntax.

Selecting a single column

You can use either: an integer, a Symbol or a string. Also begin and end is supported just like for arrays. Here are some examples:

julia> df[:, 1]
1-element Array{Int64,1}:
 1

julia> df[:, :a]
1-element Array{Int64,1}:
 1

julia> df[:, "a"]
1-element Array{Int64,1}:
 1

julia> df[:, begin]
1-element Array{Int64,1}:
 1

julia> df[:, end]
1-element Array{Int64,1}:
 6

Note that if in indexing a single column is selected then it is extracted out from a data frame as a vector. In all other column selector options described below you always get a data frame as a result of the operation.

Selecting multiple columns using vectors

You can use the vector selectors: all integers, all Bool, all Symbol, or all string (mixing styles is not allowed). Note that you can also use begin and end to define a vector (e.g. in ranges). Here are some examples:

julia> df[:, [1, 2]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, [trues(2); falses(4)]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, [:a, :b]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, ["a", "b"]]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, begin:2]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df[:, 2:end]
1×5 DataFrame
 Row │ b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     2      3      4      5      6

Special selectors

There are the following special selectors available:

  • : and All() allowing you to select all columns:
julia> df[:, :]
1×6 DataFrame
 Row │ a      b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      2      3      4      5      6

julia> df[:, All()]
1×6 DataFrame
 Row │ a      b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64  Int64
─────┼──────────────────────────────────────────
   1 │     1      2      3      4      5      6
  • a regular expression picking columns whose name matches it:
julia> df[:, r"1"]
1×2 DataFrame
 Row │ x1     y1
     │ Int64  Int64
─────┼──────────────
   1 │     3      5

julia> df[:, r"x"]
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     3      4
  • Between selector allowing you to specify a range of columns (you can specify the start and stop column using any of the single column selector syntaxes):
julia> df[:, Between(3, :y1)]
1×3 DataFrame
 Row │ x1     x2     y1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      4      5

julia> df[:, Between(begin, "x1")]
1×3 DataFrame
 Row │ a      b      x1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3
  • Not selector allowing you to specify the columns you want to exclude from the resulting data frames. You can put any valid other column selector inside Not:
julia> df[:, Not("a")]
1×5 DataFrame
 Row │ b      x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64  Int64
─────┼───────────────────────────────────
   1 │     2      3      4      5      6

julia> df[:, Not(r"x")]
1×4 DataFrame
 Row │ a      b      y1     y2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      5      6

julia> df[:, Not(Between(:x1, end))]
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2
  • Cols selector picking a union of other selectors passed as its arguments:
julia> df[:, Cols()]
0×0 DataFrame

julia> df[:, Cols(:x1, 1)]
1×2 DataFrame
 Row │ x1     a
     │ Int64  Int64
─────┼──────────────
   1 │     3      1

julia> df[:, Cols(r"1", r"x")]
1×3 DataFrame
 Row │ x1     y1     x2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3      5      4

Note that the order of columns in the produced data frame reflects the order in which ther are selected by consecutive selectors passed inside Cols.

Selecting just column names

Sometimes one only wants to pick names of columns meeting some condition without actually creating a new data frame. This can be achieved using the names function.

If you just write names(df) you get a list of all column names of a data frame as a vector of strings:

julia> names(df)
6-element Array{String,1}:
 "a"
 "b"
 "x1"
 "x2"
 "y1"
 "y2"

You can also pass any column selector expression we described above as a second argument to names to get a subset of the column names, e.g.:

julia> names(df, r"x")
2-element Array{String,1}:
 "x1"
 "x2"

julia> names(df, Not(Between("b", "x2")))
3-element Array{String,1}:
 "a"
 "y1"
 "y2"

Additionally you can pass a type as a second argument to the names function, and in this case you will get a vector of column names whose element type is a subtype of passed argument. Here is an example:

julia> df2 = DataFrame([[1, 2], [1.0, 2.0], ["1", "2"], [1, missing]], :auto)
2×4 DataFrame
 Row │ x1     x2       x3      x4
     │ Int64  Float64  String  Int64?
─────┼─────────────────────────────────
   1 │     1      1.0  1             1
   2 │     2      2.0  2       missing

julia> names(df2, Int)
1-element Array{String,1}:
 "x1"

julia> names(df2, Union{Int, Missing})
2-element Array{String,1}:
 "x1"
 "x4"

julia> names(df2, Real)
2-element Array{String,1}:
 "x1"
 "x2"

There is also an upcoming feature that will be soon available in DataFrames.jl 1.0 release. You are going to be able to pass as a second argument to the names function a predicate taking the column name as a string and returning true for: columns that should be kept. Here are some examples (note that this code snippet will only work if you ceck out main branch of DataFrames.jl from its GitHub repository):

julia> df3 = DataFrame(x1=1, x2=2, y=3)
1×3 DataFrame
 Row │ x1     x2     y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> names(df3, startswith('x'))
2-element Array{String,1}:
 "x1"
 "x2"

julia> names(df3, x -> length(x) == 1)
1-element Array{String,1}:
 "y"

The names function is quite often used if you want to perform multiple transformations of columns using select, transform or combine, e.g.:

julia> select(df, r"x", names(df, r"x") .=> ByRow(sqrt))
1×4 DataFrame
 Row │ x1     x2     x1_sqrt  x2_sqrt
     │ Int64  Int64  Float64  Float64
─────┼────────────────────────────────
   1 │     3      4  1.73205      2.0

Other functions that support passing column selectors

Here is a list of functions that accept the column selectors we discussed in the indexing section above:

  • describe takes cols keyword argument allowing you to choose for which columns descriptive statistics should be computed;
  • combine, select, select!, transform, transform!, allowing you to just keep the passed columns;
  • groupby that takes a second positional argument a column selector;
  • flatten taking a column selector as a second positional argument giving informations which columns should be flattened;
  • stack allowing you to specify measure variables and id variables using the column selectors;
  • unstack where you can pass columns that specify row keys for unstacking;
  • issorted, sortperm, sort and sort! where the second positional argument is a column selector (they also allow some additional syntax to specify how sorting should be performed but this is a separate topic);
  • unique, unique!, and nonunique where second positional argument is a column selector signaling which column should be checked for uniqueness;
  • allowmissing, allowmissing!, disallowmissing, disallowmissing!, completecases, dropmissing, dropmissing! which all take as a second argument a column selector indicating which columns should be taken into account in transformation performed.

Conclusions

As you can see in this post DataFrames.jl provides a very flexible system of allowed column selectors. Also, good understanding column selector semantics will boost your efficiency when working with DataFrames.jl as we consistently support the same patterns across whole package.