Introduction

During JuliaCon 2022 I will run a tutorial on DataFrames.jl. In the tutorial I will focus on ways you can write transformation operations using the select/transform/combine functions and the operation specification syntax.

In this post I want to give you a preview of the topics I am going to cover in my tutorial.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4, and DataFramesMeta.jl 0.12.

What is operation specification syntax?

If you are new to DataFrames.jl then you probably wonder what operation specification syntax is. Fortunately it is quite easy.

If you want to transform data using one of the select/transform/combine functions you can specify the transformation you want to perform using the following general syntax:

[source columns] => [function] => [output columns]

For example if you write :a => mean => :b you will get a mean of column :a and store it in output data frame in column :b (visually: input column :a is passed to the mean function whose output is passed to output column :b).

Additionally, if you prefer assignment style of specifying operations you can use DataFramesMeta.jl package that allows you to write the same as just :b = mean(:a). To use DataFramesMeta.jl you need to prefix the appropriate function name with @ (to turn it into a macro).

Let me show you a minimal working example of such a transformation. We compute mean of column :val by groups defined by the :id column:

julia> using DataFramesMeta

julia> using Statistics

julia> df = DataFrame(id=[1, 2, 1, 2, 1, 2], val=1:6)
6×2 DataFrame
 Row │ id     val
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     1      5
   6 │     2      6

julia> combine(groupby(df, :id), :val => mean => :mean_val)
2×2 DataFrame
 Row │ id     mean_val
     │ Int64  Float64
─────┼─────────────────
   1 │     1       3.0
   2 │     2       4.0

julia> @combine(groupby(df, :id), :mean_val = mean(:val))
2×2 DataFrame
 Row │ id     mean_val
     │ Int64  Float64
─────┼─────────────────
   1 │     1       3.0
   2 │     2       4.0

This is a simple example of what operation specification syntax can do. In this post let me give you a more complex example (I explain all the details of how it works in my upcoming tutorial).

The question from StackOverflow

In this StackOverflow question the user wanted to analyze iris data set and get 25% and 75% quantiles of the Sepal.Length column.

The R code that the StackOverflow user provided was:

> library(dplyr)

> iris %>%
+        group_by(Species) %>%
+        summarise(
+           quantile(Sepal.Length, c(.25, .75)) %>%
+              matrix(nrow = 1) %>%
+              as.data.frame() %>%
+              setNames(paste0("Sepal.Length", c(.25, .75)))
+     )
# A tibble: 3 x 3
  Species    Sepal.Length0.25 Sepal.Length0.75
  <fct>                 <dbl>            <dbl>
1 setosa                 4.8               5.2
2 versicolor             5.6               6.3
3 virginica              6.22              6.9

The question was how to achieve the same with DataFrames.jl and DataFramesMeta.jl?

Here is a solution. First we need to load the iris dataset (in the code I take advantage of the fact that this dataset is bundled into DataFrames.jl installation folders):

julia> using CSV

julia> iris = CSV.read(joinpath(dirname(pathof(DataFrames)),
                                "..", "docs", "src", "assets", "iris.csv"),
                       DataFrame)
150×5 DataFrame
 Row │ SepalLength  SepalWidth  PetalLength  PetalWidth  Species
     │ Float64      Float64     Float64      Float64     String15
─────┼──────────────────────────────────────────────────────────────────
   1 │         5.1         3.5          1.4         0.2  Iris-setosa
   2 │         4.9         3.0          1.4         0.2  Iris-setosa
  ⋮  │      ⋮           ⋮            ⋮           ⋮             ⋮
 149 │         6.2         3.4          5.4         2.3  Iris-virginica
 150 │         5.9         3.0          5.1         1.8  Iris-virginica
                                                        146 rows omitted

Now we are ready to use the combine function and the operation specification syntax:

julia> combine(groupby(iris, :Species),
               :SepalLength =>
               (x -> [quantile(x, [0.25, 0.75])]) =>
               string.("SepalLength", [0.25, 0.75]))
3×3 DataFrame
 Row │ Species          SepalLength0.25  SepalLength0.75
     │ String15         Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │ Iris-setosa                4.8                5.2
   2 │ Iris-versicolor            5.6                6.3
   3 │ Iris-virginica             6.225              6.9

With DataFramesMeta.jl you would write this using the @combine macro as follows (I additionally show here how to use operation chaining with @chain):

julia> @chain iris begin
           groupby(:Species)
           @combine($["SepalLength0.25", "SepalLength0.75"] = [quantile(:SepalLength, [0.25, 0.75])])
       end
3×3 DataFrame
 Row │ Species          SepalLength0.25  SepalLength0.75
     │ String15         Float64          Float64
─────┼───────────────────────────────────────────────────
   1 │ Iris-setosa                4.8                5.2
   2 │ Iris-versicolor            5.6                6.3
   3 │ Iris-virginica             6.225              6.9

Conclusions

The operation specification syntax was designed to allow doing simple transformations in an easy way, but at the same to also support quite complex operations, like the one we did on the iris data frame.

If you would like to hear a detailed explanation of how to write such code please join me during the upcoming workshop.

You can find all the examples that I will use during the workshop in the accompanying GitHub repository.